SF 20K Personas: Large-Scale Synthetic Population Dataset

November 23, 2025

Overview

A dataset of 20,000 procedurally generated personas representing the population of San Francisco. Each persona includes demographics, behavioral patterns, social network ties, and a mobility profile for use in urban simulation.

Dataset Composition

Demographics

Distributions aligned with census data for:

  • Age and gender
  • Household composition
  • Income levels
  • Employment sectors
  • Educational attainment

Behavioral Profiles

Individual characteristics including:

  • Daily activity patterns
  • Transportation preferences
  • Commercial behaviors
  • Social interaction frequency
  • Routine schedules

Social Networks

Generated social graphs with:

  • Family relationships
  • Workplace connections
  • Community ties
  • Friendship networks
  • Influence propagation paths

Mobility Patterns

Synthetic movement data:

  • Home and work locations
  • Daily travel routes
  • Mode choice preferences
  • Time-of-day patterns
  • Weekend behaviors

Generation Methodology

A procedural generation pipeline keeps aggregate statistics aligned with census data while preserving individual variation. The dataset is privacy-preserving by design because every individual is synthetic.
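To make the idea of "aligned in aggregate, varied per individual" concrete, here is a minimal Python sketch. It is not the dataset's actual pipeline: the attribute names, the category labels, and the share values are placeholders invented for illustration (they are not census figures), and the sketch samples each attribute independently, which ignores correlations that a real generator would likely need to model.

```python
import random
from collections import Counter

# Hypothetical marginal distributions. Placeholder shares, NOT census figures.
AGE_BANDS = {"0-17": 0.15, "18-34": 0.30, "35-64": 0.40, "65+": 0.15}
SECTORS = {"tech": 0.25, "health": 0.15, "retail": 0.20, "other": 0.40}


def sample_category(rng, shares):
    """Draw one category with probability proportional to its share."""
    labels = list(shares)
    weights = [shares[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]


def generate_personas(n, seed=0):
    """Generate n synthetic personas, sampling each attribute independently."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    return [
        {
            "id": i,
            "age_band": sample_category(rng, AGE_BANDS),
            "sector": sample_category(rng, SECTORS),
        }
        for i in range(n)
    ]


personas = generate_personas(20_000, seed=42)

# Aggregate shares should land close to the target marginals,
# even though each individual persona is a random draw.
age_counts = Counter(p["age_band"] for p in personas)
age_shares = {band: age_counts[band] / len(personas) for band in AGE_BANDS}
```

With 20,000 draws, the sampled shares in `age_shares` should sit within a percentage point or so of the targets, while no single persona corresponds to a real person.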

Applications

Urban Planning

Evaluate infrastructure changes and policy impacts on a representative population. Assess equity and accessibility across demographic groups.

Transportation Modeling

Simulate traffic patterns, transit demand, and mode shift scenarios with a realistic population distribution.

Epidemiology

Model disease spread over realistic social contact networks and mobility patterns.
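As an illustration of how a contact network could drive such a model, the following Python sketch runs a simple discrete-time SIR (susceptible, infected, recovered) process on an undirected graph. The graph representation (a dict mapping each persona id to a list of neighbor ids), the function name `simulate_sir`, and all parameter values are assumptions made for this example, not part of the dataset.

```python
import random


def simulate_sir(contacts, seeds, p_transmit, recovery_days, days, rng):
    """Discrete-time SIR on a contact graph.

    contacts: dict mapping node -> list of neighbor nodes (assumed symmetric,
              with every node present as a key).
    seeds: nodes that start out infected.
    Returns one {"S": n, "I": n, "R": n} count dict per simulated day.
    """
    state = {node: "S" for node in contacts}
    days_infected = {}
    for node in seeds:
        state[node] = "I"
        days_infected[node] = 0

    history = []
    for _ in range(days):
        # 1. Each infected node may transmit to each susceptible neighbor.
        newly_infected = set()
        for node, status in state.items():
            if status != "I":
                continue
            for neighbor in contacts[node]:
                if state[neighbor] == "S" and rng.random() < p_transmit:
                    newly_infected.add(neighbor)

        # 2. Infected nodes age by one day and recover after recovery_days.
        for node in list(days_infected):
            days_infected[node] += 1
            if days_infected[node] >= recovery_days:
                state[node] = "R"
                del days_infected[node]

        # 3. New infections take effect at the end of the day.
        for node in newly_infected:
            state[node] = "I"
            days_infected[node] = 0

        history.append(
            {k: sum(1 for s in state.values() if s == k) for k in "SIR"}
        )
    return history


# Toy chain of four personas: 0 - 1 - 2 - 3 (ids are hypothetical).
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
history = simulate_sir(
    chain, seeds=[0], p_transmit=1.0, recovery_days=1, days=4,
    rng=random.Random(0),
)
```

In practice the `contacts` dict would be built from the dataset's family, workplace, and community ties, with transmission probabilities that could differ by tie type.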

Public Health

Target interventions and allocate resources based on demographic and behavioral segmentation.

Emergency Response

Plan evacuations and distribute resources while accounting for population heterogeneity.

Format

A JSON dataset with one entry per persona. Each entry includes spatial coordinates, temporal patterns, and network connections. The format is compatible with agent-based modeling frameworks.
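The exact schema is not specified here, so the following Python sketch uses a made-up entry layout purely to show how per-persona JSON with coordinates, schedules, and connections might be loaded and queried. Every field name (`id`, `demographics`, `home`, `schedule`, `connections`, `target`, `type`) and every value is an assumption; check the published files for the real structure.

```python
import json

# Hypothetical example entries. Every field name and value is an assumption.
SAMPLE = """
[
  {
    "id": "p-00001",
    "demographics": {"age": 34, "household_size": 2},
    "home": {"lat": 37.7599, "lon": -122.4148},
    "schedule": [{"start": "08:30", "end": "17:00", "activity": "work"}],
    "connections": [{"target": "p-00002", "type": "family"}]
  },
  {
    "id": "p-00002",
    "demographics": {"age": 36, "household_size": 2},
    "home": {"lat": 37.7599, "lon": -122.4148},
    "schedule": [],
    "connections": [{"target": "p-00001", "type": "family"}]
  }
]
"""

personas = json.loads(SAMPLE)

# Index by id so network connections can be resolved in constant time.
by_id = {p["id"]: p for p in personas}


def neighbors(persona_id, relation=None):
    """Return ids connected to persona_id, optionally filtered by tie type."""
    return [
        c["target"]
        for c in by_id[persona_id]["connections"]
        if relation is None or c["type"] == relation
    ]


family_of_first = neighbors("p-00001", relation="family")
```

An id-keyed index like `by_id` is a convenient starting point for agent-based frameworks, which typically instantiate one agent per entry and then wire up the connections.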

Validation

Generated distributions validated against:

  • US Census data
  • American Community Survey
  • SF transportation surveys
  • Employment statistics

GitHub Repository