1 Introduction

Over the last two decades, computational modeling approaches, especially geographically-explicit simulations, have become crucial to understanding the complex nature of cities and are playing an important role in Urban Science (Batty, 2007; Heppenstall et al., 2011), with applications ranging from the micro-movement of pedestrians to macro patterns of urban growth (e.g., Crooks et al., 2015; Xie et al., 2007). Coinciding with this growth is a proliferation of data from various sources (e.g., cell phones, social media, GPS logs), which are being used in geographical urban simulations to better capture reality (e.g., Ye et al., 2021; Huang et al., 2020; Manley et al., 2015). At the same time, there is growing demand for realistic individual-level population data with which to simulate and capture more population characteristics and behaviors within the urban environment. Hence, researchers have devoted considerable effort to creating artificial or synthetic populations (e.g., Barrett et al., 2009; Müller & Axhausen, 2011; Wise, 2014; Gallagher et al., 2018; Lim, 2020). Besides academic researchers, research organizations such as the US Census Bureau and Research Triangle Institute (RTI) International are also attempting to build and share synthetic population datasets to support various research activities, including computational urban modeling (Renardy et al., 2020).

However, the majority of existing synthetic population datasets share two common limitations: they lack social networks, and they cannot be reused across different modeling applications. The lack of social networks is a concern because such networks play a critical role in various urban situations, from crises (e.g., Ersing & Kost, 2012) to disease spread (e.g., Pescosolido & Levy, 2002) and human mobility (e.g., Hu et al., 2021). As such, modelers are increasingly using social networks in urban simulations, especially those utilizing agent-based modeling (e.g., Heppenstall et al., 2021; Batty, 2013; Kim et al., 2020), and we would argue that synthetic populations should also include social networks. The rationale for creating social networks is that they allow us to study the connections among agents within different contexts (e.g., family, social group) and how such connections lead to the emergence of aggregate patterns.

As discussed, not only are social networks missing from existing publicly available synthetic populations, but most synthetic population datasets are also built to meet specific research requirements, such as traffic simulation in a specific area (e.g., Wise et al., 2017) or disease spread in a specific city (e.g., Eubank et al., 2004), which makes it challenging to reuse them for different research questions or study areas. To overcome these limitations, the purpose of this paper is to introduce a new method that creates a synthetic population dataset incorporating social networks, and to share both the method and the resulting dataset with the scientific community so that researchers can reuse them as they see fit.

To illustrate the novelty of our proposed method and how it can be reused for different urban applications, we demonstrate three agent-based models that tackle three different types of problems (i.e., traffic dynamics, disaster responses, and disease outbreaks), all of which require realistic synthetic populations. These three models are built in different programming languages (i.e., Java and Python), but they all use the synthetic population dataset generated by the method introduced in Section 3. Our motivation for this is twofold. First, we demonstrate the re-usability and generalizability of our synthetic population across different modeling languages and platforms. Second, by sharing the synthetic population and models, we provide a basic test environment that allows readers not only to replicate the results presented in this paper, but also to extend the models to their own application areas. In the remainder of this paper, we provide a brief background on synthetic population generation (Section 2) before introducing our methodology in Section 3. We then show the results in Section 4 before illustrating how our dataset can be used to explore traffic dynamics, disaster responses, and disease outbreaks (Section 5). Finally, in Section 6, we provide a summary of the paper and identify areas of further work.

2 Background

The vast majority of population synthesis methods utilized in agent-based models originate from microsimulation techniques (Orcutt, 1957) and involve a two-step process: fitting a population to a set of relevant attributes and constraints, and then generating individual units from the fitted population (Birkin & Wu, 2012; Müller & Axhausen, 2011; Wise, 2014). Traditionally, population synthesis methods can be broken into two main approaches: 1) synthetic reconstruction (SR); and 2) combinatorial optimization (CO) or re-weighting (Barthelemy & Toint, 2013; Pritchard & Miller, 2012). In the last few years, we have also witnessed an increasing number of studies using statistical (Sun & Erath, 2015; Saadi et al., 2016) and machine learning approaches (Borysov et al., 2019). However, such approaches are still in their infancy and not widely used. The SR method involves obtaining the joint distributions of relevant attributes and using Iterative Proportional Fitting (IPF) with the sample population to create a fitted population, from which individual households are generated (Deming & Stephan, 1940). CO, in contrast, involves creating a population and modifying it with the sample population until it meets a threshold of required constraints, for example a specific age distribution (Huang & Williamson, 2002; Wise, 2014). Both methods have their advantages and disadvantages. For example, CO can minimize errors by using constraints extracted from disaggregated datasets, such as Public Use Microdata Areas (PUMAs) or Samples of Anonymized Records (SAR) (Farooq et al., 2013; Lim, 2020; Wheaton et al., 2009). However, its disadvantage is that it is computationally intensive and requires detailed disaggregated data, which is not always available. Such disaggregated data is not needed for SR.
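As a rough illustration of the IPF step used in synthetic reconstruction, the following Python sketch alternately rescales the rows and columns of a seed table until both sets of marginal totals are matched. The function name `ipf`, the two-dimensional sex-by-age table, and all numbers are hypothetical and are not taken from any dataset used in this paper.

```python
def ipf(seed, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Iterative Proportional Fitting on a 2-D table given as a list of lists.

    Illustrative sketch only: assumes the row and column targets share the
    same grand total and that the seed table has no all-zero row or column.
    """
    table = [[float(v) for v in row] for row in seed]
    for _ in range(max_iter):
        # Scale each row so that it sums to its target marginal.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [v * target / total for v in table[i]]
        # Scale each column so that it sums to its target marginal.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
        # Columns now match; stop once the rows match as well.
        row_error = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if row_error < tol:
            break
    return table


# Hypothetical seed: 2 sexes (rows) by 3 age bands (columns).
seed = [[10, 20, 30], [15, 25, 20]]
fitted = ipf(seed, row_targets=[480, 520], col_targets=[300, 400, 300])
```

The fitted cells keep the association structure of the seed table while reproducing the target marginals, which is the property that SR relies on when drawing individuals or households from the fitted table.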

Turning to applications, synthetic populations have been utilized to study demography, ecology, epidemiology, and transportation, to name a few (e.g., Geard et al., 2013; Müller & Axhausen, 2011; Birkin & Wu, 2012). One methodology that has grown over the last two decades to capture the complexity of urban systems is agent-based modeling (Crooks et al., 2019). At the core of agent-based models are individual agents, which form a synthetic population and interact with each other, and it is through these interactions that aggregate patterns emerge (e.g., individuals driving cars can lead to traffic jams). Within agent-based modeling, synthetic populations are generally considered to be just the artificial society of the agents, and we would argue that little attention has been given to how to incorporate realistic social networks for a given population based on actual real-world demographic information. Even though many agent-based models utilize social networks (e.g., Amblard et al., 2015), most of these networks are grown during the simulation (e.g., Pires & Crooks, 2017; Kim et al., 2020), are stylized (e.g., Alizadeh et al., 2017), or simply assume that adjacent agents are part of the same network (e.g., Agrawal et al., 2013). Others have created synthetic populations with realistic social networks (e.g., Eubank et al., 2004; Barrett et al., 2009); however, the agents operate within a network and are not geographically-explicit. Still others have created geographically-explicit synthetic populations with social networks, but in such works the agents’ social networks are based on social media connections and not on family or workplace relationships (Wise, 2014). In the following section, we address these limitations and introduce a new method that creates a geographically-explicit synthetic population with realistic social networks.

3 Methodology

3.1 Study area

Before presenting our method, we first introduce our study area, shown in Fig. 1: New York City and its surrounding area (the state of Connecticut and parts of New Jersey, New York state, and Pennsylvania). The study area covers 262 x 234 km and was home to 23,004,272 people according to the 2010 US Census. Our rationale for choosing this study area is that it is slightly larger than the official metropolitan statistical area, allowing us to capture high-density urban, suburban, and rural areas and thus providing a more heterogeneous synthetic population. In addition, using this larger area potentially allows us to capture more patterns of life, such as commuting patterns in and out of New York City.

Fig. 1 Study Area

3.2 Method details

Figure 2 shows the workflow of our synthetic population generation method along with the data and code packages used in each step. This method is derived from the works of Barthelemy and Toint (2013) and Wise (2014), but includes extensions such as the creation of social networks based on real locations and the addition of educational sites, as shown in Table 1. Before proceeding with the workflow, data preprocessing was carried out, which included ensuring that the road network was topologically connected and that no duplicate records (i.e., edges) existed. The first step was to create spaces on the road network in which to place homes and businesses based on road type; for example, primary and secondary roads were used to constrain the location of homes and work spaces. We then created individuals using synthetic reconstruction based on the 2010 US Census data and grouped them into households. In Step 2, individuals in households were assigned workplaces consistent with the U.S. Census Bureau’s Longitudinal Employer-Household Dynamics (LEHD) Origin-Destination Employment Statistics (LODES) dataset (US Census Bureau, 2019). Furthermore, we assigned younger household members to the closest daycare or school appropriate to their age (e.g., daycare, elementary, middle, and high school), whose locations were sourced from the US Environmental Protection Agency (EPA) Office of Environmental Information (OEI) (USEPA Office of Environmental Information, 2015). Lastly, in Step 3, three types of social networks were created based on (1) being in the same household, (2) working in the same workplace, or (3) attending the same educational institute. Household networks are fully connected, while household members’ school and work networks are based on interactions with individuals at these locations, following concepts from small-world social networks (e.g., Newman & Watts, 1999; Dunbar, 1998). Our synthetic population method was coded in Python. We made the source code, source data, and resulting synthetic data publicly available, which allows others to repeat the processes presented in this paper or to utilize the generated data (Code: https://bit.ly/SynPopABM; Source and resulting synthetic population data: https://osf.io/3vsaj/).

Fig. 2 Workflow for Generation of Synthetic Population and Networks

Table 1 Various synthetic population methods: adaptions and differences

3.2.1 Step 1: create individuals and spaces

As noted in Section 2, there are two predominant ways to synthesize populations, and thus to create the individuals and spaces of Step 1 (Fig. 2). While we could have chosen combinatorial optimization, we did not because this method requires disaggregate-level data to minimize the error during the population synthesis process, as discussed in Section 2. Such data is not always available, and as we wanted a generic method that could be applied to multiple areas (or countries), we chose synthetic reconstruction, which only requires data at the census tract level. Using synthetic reconstruction, we created individuals to represent every person within every census tract and assigned their sex and age based on information from the 2010 US Census data (US Census Bureau, 2010c). Specifically, we created individuals for each age group of males and females. However, we did not consider ethnicity or race because the focus of this method is on locations and networks more than on detailed demographic characteristics (e.g., race). After creating the individuals, our method grouped them into households based on the household types within a tract using a uniform distribution, an approach inspired by Barthelemy and Toint (2013) and Wise (2014). However, we also added some constraints on the age differences among members of the same household to minimize the error between the synthetic dataset and the US Census data. These constraints only applied to household types with children under 18 and were derived through trial and error with respect to population synthesis. The specifics of the constraints are: 1) husband’s age – wife’s age between (-4, +9); 2) father’s age – child’s age less than or equal to 50; 3) mother’s age – child’s age less than or equal to 40.
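The three age constraints can be expressed as a single check applied to a candidate household. The Python sketch below is illustrative only: the function name `household_ages_ok` and its parameters are hypothetical, and it assumes that the interval (-4, +9) is open, since the text does not state whether the bounds are inclusive.

```python
def household_ages_ok(father_age=None, mother_age=None, child_ages=()):
    """Return True if a candidate household satisfies the age constraints.

    Assumption: (-4, +9) is treated as an open interval. Pass None for a
    parent who is absent (e.g., single-parent households).
    """
    # 1) husband's age minus wife's age must lie between -4 and +9.
    if father_age is not None and mother_age is not None:
        if not (-4 < father_age - mother_age < 9):
            return False
    for child_age in child_ages:
        # 2) father's age minus child's age must be at most 50.
        if father_age is not None and father_age - child_age > 50:
            return False
        # 3) mother's age minus child's age must be at most 40.
        if mother_age is not None and mother_age - child_age > 40:
            return False
    return True


# Hypothetical candidate: parents aged 40 and 38 with a 10-year-old child.
accepted = household_ages_ok(father_age=40, mother_age=38, child_ages=[10])
```

A household generator could draw candidate members uniformly and redraw whenever this check fails; that rejection loop is one possible design, not necessarily the one used in the released code.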
The US Census categorizes households into 11 types (such as husband-and-wife families, single childless adult households, households with a child under 18, and male and female single householders over 65). However, based on our analysis of the data, we added one more type, that of people living in group quarters, which can be institutional (e.g., correctional facilities for adults, juvenile facilities, nursing facilities/skilled-nursing facilities) or non-institutional (e.g., college/university student housing, or military quarters). We further assumed that each tract has only one group quarters and that everyone who belongs to group quarters lives at this location. Hence, there are a total of 12 household types in our synthetic population.

3.2.2 Step 2: assign daytime locations

Daytime locations (i.e., workplaces and daycare/schools) were assigned to every agent generated in Step 1. Previous methods such as that of Barthelemy and Toint (2013) did not include such characteristics, and Wise (2014) only considered workplaces. Our method also placed home and work locations along the road network to make the synthetic population geographically-explicit. Such a process was not carried out in the synthetic population work of Barthelemy and Toint (2013), and hence their population is not geographically-explicit. While our approach potentially requires pre-processing of geographic maps compared to procedurally-generated cities (Kim et al., 2018), it represents geographic areas more realistically. Furthermore, to mimic the “real” world we also took general zoning into consideration: we restricted businesses to secondary roads, with the exception of religious centers and daycare/schools, which may be located on residential roads. No businesses were allowed to be placed on primary roads (i.e., highways or motorways) as these are divided and have limited access (US Census Bureau, 2010a). Lacking detailed land parcel information and wishing to keep the method simple, we placed houses on local roads at point locations at least 50 m apart (see Fig. 3(c)) or on top of each other when population density was high (e.g., representing apartment complexes, see Fig. 3(d)), based on the number of occupied housing units in each census tract. Workplaces were randomly placed either on secondary roads, approximately 20 m apart, or at local road intersections. The number of workplaces in each census tract was disaggregated from county-level business establishment counts from the US Census Bureau’s County Business Patterns (US Census Bureau, 2010b). To determine the size of the population in each workplace, we used a lognormal distribution within census tracts based on findings that job size distributions in U.S. cities are lognormal (Eeckhout, 2004).
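One way to turn lognormal draws into workplace sizes for a tract is sketched below in Python. This is a minimal sketch under stated assumptions: the function name `workplace_sizes`, the parameters `mu` and `sigma`, and the step that rescales the draws so that they sum to the tract's worker count are all hypothetical and are not taken from the released code.

```python
import random


def workplace_sizes(n_workplaces, n_workers, mu=0.0, sigma=1.0, seed=None):
    """Split n_workers across n_workplaces with lognormally distributed sizes.

    Assumptions: mu and sigma are placeholder parameters, and draws are
    rescaled so that the sizes sum exactly to n_workers.
    """
    if n_workplaces <= 0:
        return []
    rng = random.Random(seed)
    draws = [rng.lognormvariate(mu, sigma) for _ in range(n_workplaces)]
    total = sum(draws)
    shares = [d / total * n_workers for d in draws]
    sizes = [int(s) for s in shares]  # round down first
    # Hand the leftover workers to the workplaces with the largest remainders.
    leftover = n_workers - sum(sizes)
    by_remainder = sorted(
        range(n_workplaces), key=lambda i: shares[i] - sizes[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        sizes[i] += 1
    return sizes


# Hypothetical tract with 5 workplaces and 120 workers.
sizes = workplace_sizes(5, 120, seed=42)
```

Note that with this rescaling a small workplace can end up with zero workers; whether that is acceptable depends on the application.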
The commuting patterns associated with workplaces were derived from the LEHD LODES dataset (US Census Bureau, 2019). After aggregating this information to the tract level, we assigned working-age agents to a random workplace location within a tract based on the origin-destination statistics. Data for school locations were extracted from the Educational Institution dataset retrieved from the EPA OEI (USEPA Office of Environmental Information, 2015). The dataset contains the geographic coordinates, enrollments, grade levels, and start and end ages of each educational institution. We assigned school-age agents to the nearest available school location within a tract. School-age agents were sorted into schools based on grade and enrollment levels. Figure 3(a) and (b) display representative results of this step for a suburban and a high-density urban area, showing household, workplace, and education locations for one census tract of each kind within our study area.
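The school assignment rule (nearest available school that serves the agent's age and still has room) can be sketched as follows. The record fields (`x`, `y`, `start_age`, `end_age`, `enrollment`, `assigned`) and the function name `assign_school` are hypothetical, and the sketch assumes projected planar coordinates so that straight-line distance is meaningful.

```python
import math


def assign_school(child, schools):
    """Assign a child to the nearest school that serves their age and has room.

    `schools` maps a school id to a record; the chosen record's `assigned`
    counter is incremented. Returns the school id, or None if none qualifies.
    """
    best_id, best_dist = None, math.inf
    for school_id, s in schools.items():
        serves_age = s["start_age"] <= child["age"] <= s["end_age"]
        has_room = s["assigned"] < s["enrollment"]
        if serves_age and has_room:
            dist = math.hypot(child["x"] - s["x"], child["y"] - s["y"])
            if dist < best_dist:
                best_id, best_dist = school_id, dist
    if best_id is not None:
        schools[best_id]["assigned"] += 1
    return best_id


# Hypothetical tract with two elementary schools; the nearer one has one seat.
schools = {
    "near": {"x": 0, "y": 0, "start_age": 5, "end_age": 10, "enrollment": 1, "assigned": 0},
    "far": {"x": 10, "y": 0, "start_age": 5, "end_age": 10, "enrollment": 30, "assigned": 0},
}
first = assign_school({"age": 7, "x": 1, "y": 0}, schools)   # takes the last seat at "near"
second = assign_school({"age": 8, "x": 1, "y": 0}, schools)  # falls back to "far"
```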

Fig. 3 (a) Suburban Area: Mapped locations for Census Tract 34027043402, NJ; (b) High Density Urban Area: Mapped locations for Census Tract 36061017700, NY; (c) Suburban Area: Population Density; (d) High Density Urban Area: Population Density

3.2.3 Step 3: create social networks

Finally, in Step 3, we created three types of networks: household, work, and educational (the last of which is subdivided into school and daycare networks). In the work of Barthelemy and Toint (2013), social networks were not generated, while Wise (2014) only generated social networks based on an ego network from social media data. Our approach is different: individuals received a link to each agent based on living in the same household and either working in the same workplace or attending the same daycare/education institute. Small-world networks are among the most widely implemented networks in social simulations (Amblard et al., 2015). Hence, if the group size of a household, work, or education site was greater than 5, a Newman and Watts (1999) small-world network was generated, where the number 5 was chosen based on the work of Dunbar (1998). For people who live in the same household, if the group size was less than 5, everyone was connected. Figure 4 shows the process used to create the social networks. Figure 4(a) shows a selected population from our dataset; as discussed above, people in the same household are fully connected (Fig. 4(b)). Work and educational networks were created based on the worksite and educational locations (Fig. 4(c)). Figure 4(d) is an example of the multi-layer network of a household, including an individual’s family ties and proximity ties to people at work and school.
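The group-size rule described above can be sketched in Python using only the standard library. This is an illustrative sketch, not the released implementation: the neighbourhood size `k` and shortcut probability `p` are hypothetical placeholders (the text does not report them), and groups of exactly 5 members are assumed here to be fully connected.

```python
import random
from itertools import combinations


def fully_connected(members):
    """Link every pair of members (a clique)."""
    return {frozenset(pair) for pair in combinations(members, 2)}


def newman_watts(members, k=4, p=0.1, seed=None):
    """Newman-Watts small world: a ring lattice plus random shortcuts.

    Each member is linked to its k // 2 nearest neighbours on either side;
    then, for each lattice edge, a shortcut is added with probability p.
    Unlike Watts-Strogatz rewiring, no lattice edge is ever removed.
    """
    rng = random.Random(seed)
    n = len(members)
    lattice = [(i, (i + step) % n) for i in range(n) for step in range(1, k // 2 + 1)]
    edges = {frozenset((members[u], members[v])) for u, v in lattice}
    for u, _ in lattice:
        if rng.random() < p:
            w = rng.randrange(n)
            if w != u:  # skip self-loops; duplicates are absorbed by the set
                edges.add(frozenset((members[u], members[w])))
    return edges


def group_network(members, threshold=5, **kwargs):
    """Small-world network for groups larger than the threshold, else a clique."""
    members = list(members)
    if len(members) > threshold:
        return newman_watts(members, **kwargs)
    return fully_connected(members)


# Hypothetical groups identified by agent ids.
household_edges = group_network([101, 102, 103, 104])       # 4 members: clique
workplace_edges = group_network(range(200, 230), seed=1)    # 30 members: small world
```

Storing each tie as a `frozenset` of two agent ids keeps the networks undirected and makes it straightforward to merge the household, work, and educational layers for one individual.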

Fig. 4 Creation of Social Networks: (a) Selected Population; (b) Creation of a Household Network; (c) Creation of Work and Educational Networks for each Member of the Household; (d) The Household, its Networks within the Full Census Tract

4 Results from the synthetic population generation

Having detailed our method in Section 3, we now present the resulting datasets. Five datasets were generated: one synthetic population dataset and four social network datasets (the network datasets are linked to the population dataset via the unique ID of each synthetic individual). Table 2 shows the basic information of each dataset. To display the structure of the resulting datasets, Fig. 5(a) shows a sample record extracted from the synthetic population dataset and Fig. 5(b) shows a sample of the synthetic social networks extracted from the household network. It is also important to verify and validate the results, which we turn to in Sections 4.1 and 4.2. To verify our method, basic statistical analyses were carried out, which are presented below. For validation, our results are compared to a benchmark dataset (i.e., RTI data) to test the robustness of our method.

Fig. 5 Sample Records: (a) Synthetic Population; (b) Synthetic Social Network

Table 2 Summary of the generated synthetic datasets (n differs in the size depending on the network, n ∈[0,14])

4.1 Verification

Verification of our synthetic population was done by comparing selected measures from the official US Census data to our results, as shown in Table 3. During this process, we found that the number of individuals in our synthetic population was not identical to the 2010 US Census count of 23,004,272 individuals living in the study area. Looking into this issue, we identified 116 problematic census tracts out of a total of 5,500 tracts. In the official US Census data, these tracts either had inconsistent counts with respect to the total numbers of males and females or gave no data for the number of individuals under 18 years old. In total, 82,970 individuals lived in those problematic tracts. Our method was not able to generate the individuals in these tracts because of the inconsistencies discussed above. However, the individuals living in problematic tracts account for only 0.36% of the total population of the study area, so we decided to leave these tracts out of our synthetic population. In the end, our synthetic population method generated 22,921,302 individuals, including 17,697,433 adults (age ≥ 18) and 5,223,869 children (age < 18), who were grouped into 8,457,710 households. The total number of households was slightly greater than the number from US Census records because we generated households for people living in group quarters (as discussed in Section 3). If we excluded this population, we would have the same number of households as the official US Census data (i.e., 8,453,097).
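The counts reported above are internally consistent, which can be checked with a few lines of Python. All figures are taken from this section; only the variable names are introduced here for illustration.

```python
census_total = 23_004_272       # 2010 US Census population of the study area
excluded = 82_970               # individuals living in the 116 problematic tracts
adults = 17_697_433
children = 5_223_869

synthetic_total = adults + children
assert synthetic_total == 22_921_302
assert synthetic_total == census_total - excluded

excluded_share = 100 * excluded / census_total
tract_share = 100 * 116 / 5_500
print(f"{excluded_share:.2f}% of people, {tract_share:.2f}% of tracts excluded")
# prints "0.36% of people, 2.11% of tracts excluded"

# Households attributable to group quarters (synthetic minus census count).
group_quarters_households = 8_457_710 - 8_453_097
assert group_quarters_households == 4_613
```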

Table 3 Population counts based on multiple sources

In addition to comparing the overall population count with the US Census, we also verified the male and female population in each age group against US Census data. The rationale for this is that we used the 18 age groups for males and females directly from US Census data to create the population in Step 1 of our method. In this verification, we excluded the problematic census tracts identified above. Figure 6 displays the male and female population in each age group from the US Census and from our population dataset. No differences were identified in any age group; thus our method was able to generate an identical number of males and females in each age group.

Fig. 6 Verification Results: Comparison between Our Population and Census on Each Age Group’s Population

In addition to general population statistics, as noted in Section 3.2, our synthetic population method also generated geographically-explicit social networks. Social networks play an important role in the spread of information and social capital (in times of disaster) and of diseases (e.g., Eubank et al., 2004; Ersing & Kost, 2012; Burger, 2020; Pires & Crooks, 2017). With respect to the household networks, individuals belonged to cliques with 0 to 10 ties (connections), with the majority of the population in small groups of 2 to 4 individuals, as shown in Fig. 7. The work networks were more skewed, reflecting the many small places of work, while school and daycare networks had a higher degree of connectivity, as shown in Table 4, due to the greater number of people at each location.

Fig. 7 Various Social Networks from Our Synthetic Population Method: (a) Household; (b) Work; (c) School; (d) Daycare

Table 4 Social networks statistics from our synthetic population method

4.2 Validation

Turning to validation, we compared our synthetic population against two reference datasets, as shown in Fig. 8. First, we compared our synthetic population to the US Census dataset (US Census Bureau, 2010c) based on average household size. Our rationale for choosing this attribute was that our synthetic population generation method (Section 3.2) did not use it when creating the households. On average, the household size difference between the two datasets ranged from -0.01 to 0, indicating that the synthetic population varied only slightly from the US Census. The second dataset is the RTI International synthetic population (RTI, 2019). Our rationale for using this dataset is that it was generated using synthetic reconstruction, which is the same method as ours, but it does not include social networks (as discussed in Section 1) (Wheaton et al., 2009). In addition, our method only used census tract data to generate the synthetic population, while the RTI population was generated using both aggregate and disaggregate data (e.g., the Public Use Microdata Sample (PUMS) data from the US Census). Hence, beyond showing that our population synthesis method is robust, we would also argue that it can generate a robust synthetic population using only census tract data, which is an advantage when disaggregate data is not available. To demonstrate the robustness of our population, as in the US Census comparison above, we compared our average household size with the average household size of RTI’s synthetic population. Figure 8 shows the difference in household size between our synthetic population and the RTI population. Overall, the difference is less than 1 (i.e., ranging from -0.22 to 0.78). As this figure shows, our synthetic population compares favorably with the RTI synthetic population.
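The comparison metric used above can be sketched as a difference in mean household size per geographic unit. In the Python sketch below, the unit of comparison (assumed to be the census tract), the sign convention (synthetic minus reference), the function names, and all numbers are assumptions made for illustration.

```python
def mean_household_size(household_sizes):
    """Average number of members per household."""
    return sum(household_sizes) / len(household_sizes)


def household_size_differences(synthetic, reference):
    """Synthetic mean household size minus the reference value, per unit.

    `synthetic` maps a unit id to a list of household sizes; `reference`
    maps the same ids to a published average household size.
    """
    return {
        unit: mean_household_size(sizes) - reference[unit]
        for unit, sizes in synthetic.items()
        if unit in reference
    }


# Hypothetical units and values, not taken from the Census or RTI data.
synthetic = {"tract_A": [2, 3, 4], "tract_B": [1, 2]}
reference = {"tract_A": 3.0, "tract_B": 2.0}
differences = household_size_differences(synthetic, reference)
```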

Fig. 8 Comparison of Differences Between Our Synthetically Generated Average Household Size with the Census and RTI Synthetic Data

5 Use cases

As discussed in Section 1, synthetic populations have been utilized by various modeling techniques, such as microsimulation and agent-based modeling, to explore a wide range of phenomena including traffic dynamics, disaster response, and disease outbreaks (e.g., Wise et al., 2017; Eubank et al., 2004; Jiang et al., 2020; Xu et al., 2017). Hence, to demonstrate the utility and generalizability of our synthetic population, the generated dataset was applied to three different agent-based modeling use cases exploring a variety of urban phenomena. The first use case simulates traffic dynamics in a high-density urban setting (i.e., Manhattan island and surrounding areas, Section 5.1). The second use case (Section 5.2) builds upon the first in the sense that it shares the same study area, but it simulates how people might react in the immediate aftermath of a disaster. The last use case (Section 5.3) simulates the spread of a hypothetical disease through our synthetic population. We chose these applications not only to demonstrate how the geographically-explicit synthetic population can be used, but also to show how our extensions to previous population synthesis work (i.e., social networks and explicit home and work locations) can be used to gain insight into such phenomena. However, it needs to be noted that these are simplistic models and are given as pedagogic examples. Our aim is to demonstrate the re-usability of our synthetic population dataset and, as discussed in Section 1, all three models have been made publicly available for readers to explore and extend (see: https://bit.ly/SynPopABM).

5.1 Use case 1: traffic dynamics

Agent-based modeling has been used extensively in urban transportation (Wise et al., 2017), which motivated this application and our demonstration of how our synthetic population could be used in this domain. We used Java on the MASON platform (Luke et al., 2005) to build a simple agent-based model that captures traffic congestion in a high-density urban area. Figure 9 shows the structure of this traffic model, which comprises three main components: 1) agent; 2) environment; and 3) model visualization. The synthetic population was used to initialize the agent component, which contained the agents’ basic attributes (e.g., age, sex, home location, work location, etc.) and behaviors (e.g., path finding and commuting). The environment component not only represented the model’s physical environment (e.g., road networks, buildings), but also comprised the model scheduler that runs all agents’ behaviors (i.e., find path and commute). Lastly, the visualization component updated the model user interface at each time step, for example by displaying the physical environment and updating agents’ locations. Within this model, one time step represents one minute of “real” time. The rationale for using one minute is that it enables us to capture small-scale movement and traffic congestion, such as the morning rush hour, by simulating agents commuting from home to work.

Fig. 9 Traffic Dynamic Model Component Structure

Before simulating large-scale traffic dynamics, verification experiments were undertaken for one road segment to ensure the model had the ability to capture congestion. Figure 10 shows the commuting speed simulated by the model with various numbers of vehicles (e.g., 1, 10, 100, 1000) for a 2.1 km road segment in Manhattan. As expected, with more vehicles, the commute along the road segment took longer due to traffic delays. Building upon this verification example, and in order to overcome the computational constraints of running the traffic model at full scale, we used 1 agent to represent 100 or 1000 agents. Figure 10 shows that the model generates the same commute speed when 1 agent represents either 100 or 1000 agents. With these verification steps, we feel confident that our model captured basic traffic dynamics. Therefore, we loaded a 0.01% sample of our synthetic population into the model and scaled other parameters accordingly. During the simulation, we captured each individual’s geo-location at every time step. Figure 11 shows a heat map of traffic density during the morning commute, focusing on Manhattan island, NYC, and surrounding areas. In the figure, the darkest red color indicates the areas suffering from the worst traffic delays. Furthermore, the model demonstrates how home and work locations from the synthetic population can be utilized for modeling traffic dynamics.

Fig. 10 Model Verification Using Selected Road Segment

Fig. 11 Traffic Density During Morning Commute

5.2 Use case 2: groups formation after a disaster event

The traffic model discussed in Section 5.1 demonstrated how normal patterns of life (i.e., traffic congestion) can be generated using the synthetic population. In this use case, we extend the traffic model to explore how people react in the immediate aftermath of a disaster and, in particular, how an individual’s social networks impact their decision making. Our rationale for this is that social networks impact people’s decision making during disaster events and new social networks emerge as people react and respond to a disaster (e.g., Burger, 2020; Ersing & Kost, 2012; Jones & Faas, 2016; Dynes, 2006).

To capture this, another agent-based model was built by extending both the agent and environment components of the traffic model, as shown in Fig. 12. Within the agent component, attributes related to health status were added to represent the effects of the disaster event on an individual’s health. The main difference between this model and the one presented in Section 5.1 is the addition of the three types of social networks (Section 3.2.3) to the agent component, which were initialized from the social networks of our synthetic population (Section 3.2.2). The behaviors of the agents were also altered from the previous model, whereby agents can now respond to a disaster event. The response behaviors were based on observations from previous studies (e.g., Mawson, 2007) and implemented using a fast and frugal decision tree (Kennedy & Bassett, 2011). Each step of the model represents one minute of “real” time, during which each agent decides and acts (i.e., moves towards a goal or stays in place). This time step was chosen because we wanted to represent the fine-scale decision making and movements of the population. For simplification, agents only commuted by vehicle; however, immediately after a disaster, some fled on foot. Agents also had different work-day schedules and workplaces, resulting in individualized commuter behavior. After the disaster, agents altered their normal behavior from daily commuting to finding safety for themselves (Mawson, 2007; Maslow, 1943). Depending on the availability of shelter and the agent’s knowledge after the disaster, they chose to form ad hoc groups (i.e., emergent networks), shelter in place, or flee from the affected area. Once they had achieved relative safety, they searched for, located, and joined members of their household (based on their social networks instantiated by the synthetic population dataset).
Figure 13 shows the impact of a large explosion in Manhattan and the health status of the agents in and around the immediate blast area as well as the commuting region 1 minute after the explosion. After the explosion, surviving agents flee damaged buildings and create new social groups in the process of attempting to go home. Figure 14 shows an example of how an agent’s social network from the synthetic population changed (i.e., new ad hoc group formation) as agents interact with others within the disaster area and flee. Through such experiments, we can explore and characterize the reaction of the population of NYC to a disaster.
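One way to picture the emergence of ad hoc groups is as new ties added to an agent’s existing network when fleeing agents come close to one another. The sketch below assumes a simple distance threshold as the grouping rule; the function names, the `radius` parameter, and the rule itself are hypothetical and are not taken from the paper’s model.

```python
from itertools import combinations
from math import hypot


def find_ad_hoc_ties(positions, fleeing, radius=10.0):
    """Return pairs of fleeing agents that are within `radius` of each other.

    positions: dict of agent id -> (x, y) coordinates
    fleeing:   iterable of agent ids currently fleeing the affected area
    The distance rule is an illustrative assumption.
    """
    ties = set()
    for a, b in combinations(sorted(fleeing), 2):
        (xa, ya), (xb, yb) = positions[a], positions[b]
        if hypot(xa - xb, ya - yb) <= radius:
            ties.add((a, b))
    return ties


def add_ties(network, ties):
    """Add undirected ties to an adjacency dict (agent id -> set of agent ids)."""
    for a, b in ties:
        network.setdefault(a, set()).add(b)
        network.setdefault(b, set()).add(a)
    return network


# Example: agent 1 starts with a household tie to agent 9, then meets agent 2 while fleeing
network = {1: {9}, 9: {1}}
positions = {1: (0.0, 0.0), 2: (3.0, 4.0), 3: (100.0, 100.0)}
add_ties(network, find_ad_hoc_ties(positions, {1, 2, 3}, radius=6.0))
```

Keeping the pre-disaster ties in the same structure makes it possible to compare an agent’s network before and after the event.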

Fig. 12
Model Component Structure of the Population’s Response to a Disaster Event

Fig. 13
Agents’ Health Status After 1 Minute of the Disaster Event

Fig. 14
Simulation of Social Network Changes for a Sample of the Population: (a) Before the Disaster Event; (b) Emergent Social Networks After the Disaster Event

5.3 Use case 3: disease dynamics

As discussed in Section 2, one of the fields utilizing synthetic populations is epidemic research. To demonstrate how our synthetic population dataset can be utilized in such a domain, we developed a simple agent-based susceptible, exposed, infected and recovered (SEIR) compartmental disease model in Python. Figure 15 shows the model components. Within the Agent component, we integrated the SEIR states to represent agents’ health status. The model took our synthetic population dataset, including its social networks, as input for initializing the Agent component and for guiding agent-to-agent interactions (e.g., members of the same household, children at the same school, or colleagues at the same work site interact with each other). The model simulates how a hypothetical disease could spread through the system and captures each agent’s health status at each time step. In addition, the model captures where each change in health status occurs (e.g., at home or work). While the traffic model demonstrated how researchers could use home and work locations to generate traffic congestion, this example shows the utility of including social networks in our synthetic population method. Implementing the model in Python also shows how our synthetic population dataset can be used with other programming languages.

Fig. 15
Disease Model Component Structure

As the synthetic population is geographically-explicit, it is straightforward to select agents for just one location (e.g., home or work). In this example, we extracted only agents who had home locations in Ulster County, NY. Within this county, there are a total of 153,253 people living in 58,094 different households, as shown in Fig. 16. In this model, one day is divided into three time periods (each being 8 hours), which are analogous to a simple daily rhythm of being at home (either sleeping or getting up), being at work (or at school), and being back at home. In each time period, agents interact (i.e., spread or become infected by the disease) through the networks active at that time. For example, while at work, agents only interact with others in their work network; while at home, they interact solely with their household networks. It is through these interactions that a disease can spread. In this example, we start off with only one agent being infected with a hypothetical disease. Figure 17(a) shows the SEIR curves, which indicate that the model utilizing our synthetic population can capture standard epidemiological disease dynamics. In addition, the locations where agents get exposed to the disease can be counted. Figure 17(b) shows the cumulative number of agents who get exposed to the disease at specific locations (i.e., home, work, daycare, and schools).
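The daily rhythm and network-restricted contacts described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the shared model: the function names, the probabilities `p_transmit`, `p_onset`, and `p_recover`, and the end-of-day progression rule are all hypothetical, and school and daycare networks are folded into the "work" period for brevity.

```python
import random

S, E, I, R = "S", "E", "I", "R"
PERIODS = ("home", "work", "home")  # three 8-hour periods per day


def step_period(status, networks, period, p_transmit, rng):
    """Infected agents may expose susceptible contacts in the active network."""
    newly_exposed = set()
    for agent, state in status.items():
        if state != I:
            continue
        for contact in networks.get(period, {}).get(agent, ()):
            if status[contact] == S and rng.random() < p_transmit:
                newly_exposed.add(contact)
    for agent in newly_exposed:
        status[agent] = E
    return newly_exposed


def simulate(networks, seed_agent, days, p_transmit=0.05,
             p_onset=0.2, p_recover=0.1, seed=0):
    """Run the sketch; return daily SEIR counts and exposures per place."""
    rng = random.Random(seed)
    agents = set()
    for net in networks.values():
        for agent, contacts in net.items():
            agents.add(agent)
            agents.update(contacts)
    status = {a: S for a in sorted(agents)}
    status[seed_agent] = I  # start with a single infected agent
    exposures = {place: 0 for place in networks}
    history = []
    for _ in range(days):
        for period in PERIODS:
            new = step_period(status, networks, period, p_transmit, rng)
            exposures[period] = exposures.get(period, 0) + len(new)
        # End-of-day progression (assumed rule): E -> I, then I -> R
        for agent, state in list(status.items()):
            if state == E and rng.random() < p_onset:
                status[agent] = I
            elif state == I and rng.random() < p_recover:
                status[agent] = R
        history.append({c: sum(1 for s in status.values() if s == c)
                        for c in (S, E, I, R)})
    return history, exposures


# Example: agents 1 and 2 share a household; agents 2 and 3 share a workplace
networks = {"home": {1: [2], 2: [1]}, "work": {2: [3], 3: [2]}}
history, exposures = simulate(networks, seed_agent=1, days=30)
```

The `history` list corresponds to the kind of SEIR curve shown in Fig. 17(a), and `exposures` to the per-location counts in Fig. 17(b).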

Fig. 16
Disease Model Study Area

Fig. 17
Disease Model Results: (a) SEIR Results; (b) Exposed Locations

6 Conclusion and discussion

While synthetic population research has been growing and is increasingly utilized within geographically-explicit simulations, little attention has been given to creating geographically-explicit synthetic populations with realistic social networks. To address this shortcoming, we have presented a novel method that uses aggregate census data to generate a synthetic population dataset incorporating not only home and work locations but also social networks. Our verification and validation experiments demonstrated that our synthetic population results were comparable to aggregate census tract level (i.e., Census Data) and individual level (i.e., RTI Data) data, showing that our method is robust.

Our method also addresses one of the challenges in population synthesis, especially when it comes to agent-based modeling: the ability to instantiate social networks at model initialization rather than during the running of the simulation. We consider this important because social networks are increasingly being used in the field of agent-based modeling, but so far such networks have not been incorporated into synthetic populations, as we noted in Section 2.

To show the re-usability and generalizability of our publicly available dataset, we presented three use cases. The first showed how the synthetic population can generate possible traffic dynamics (Section 5.1), while the other two (i.e., disaster response and a disease outbreak, Sections 5.2 and 5.3) explicitly showed how social networks can be used in geographically-explicit settings. By implementing these use cases, we have shown that the synthetic population can be applied and reused in different modeling environments and programming languages to meet various research purposes. Our intention in sharing our synthetic population method and data is to demonstrate how our synthetic population can be utilized in situations where fine-scale individual-level data might not be available but could be constructed based on aggregate data. As noted in Section 1, all models built for the use cases are shared online. Our rationale for this is to allow reproduction of anything presented in this paper and to allow researchers to extend and adapt such methods and models in this nascent field of integrating spatial data and social networks into large-scale agent-based modeling simulations.

While our method can create synthetic population datasets along with their social networks, as with all synthetic population methods there is always room for improvement. In particular, the latest US Census data available to us was from 2010, so the synthetic population will not align with the latest demographics of the area. However, our method was built on standard census data, which makes it possible to modify and replicate the method and generate the population when the 2020 US Census becomes available. This is one reason why we share the code of our method in Section 3. Furthermore, when the 2020 US Census becomes available, we would like to extend our synthetic population dataset to comprise the entire US population. Currently, the synthetic population does not comprise financial, employment or educational backgrounds. With such data, the synthetic population could potentially be used by agent-based modelers wanting to explore issues such as residential location or public health (e.g., Jiang et al., 2021; Li et al., 2020). This could be remedied by adding additional variables to the agents from the census dataset.

As discussed in Section 1, the reuse of existing synthetic population datasets has been a challenge within the field of agent-based modeling. One reason behind this is that there is no standard method for, or agreement on, what attributes should be added to a synthetic population dataset. This is similar to agent-based modeling more generally, where there are many definitions of what an agent is, and many possible attributes that can be assigned to an agent depending on the purpose of the model. Due to these issues, even a well-built synthetic population typically aims to answer a specific research question and cannot be reused to explore other questions. We consider that this may undermine the efforts researchers have made in utilizing synthetic population datasets in agent-based modeling. To address this challenge, another potential direction for future work would be to propose a set of standards that researchers can reference to create, store and describe their own methods and datasets, potentially inspired by initiatives such as the ODD protocol or CityGML and by ontological principles that formalize data exchanges (Grimm et al., 2006; Kutzner et al., 2020; Lee et al., 2016). Furthermore, such standards could be beneficial to the field of information management by allowing researchers either to replicate the method or to reuse the resulting datasets.

With respect to the models, we noted in Section 5 that these were only pedagogic in nature; therefore, like all models, they have limitations and there is always space to improve them. For the traffic dynamics model (Section 5.1), we only considered people commuting from home to work because only limited information on human behavior can be extracted from census data for our synthetic population. Other human activities besides commuting (e.g., shopping, social activities, sporting events) could be added to the model if data were available. Potentially, such data could come from multiple sources, such as social media (Padilla et al. 2014; Kavak et al. 2018), urban infrastructure data (e.g., building types) (Zhang et al., 2021) and human mobility data (e.g., aggregated anonymized location data) (Elarde et al., 2021). By doing this, various types of traffic dynamics could be captured. Similarly, when we simulated the hypothetical disease spread (Section 5.3), we only considered home, work, and educational locations. The data discussed above could also be added to the disease model to provide more types of interaction spaces and locations. In addition, these interactions would generate more types of human activities and could shed light on disease outbreaks at the individual level. For example, infected individuals’ trajectory data could be utilized to understand human mobility during a disease outbreak (e.g., Cheng et al., 2021; Hu et al., 2021) and assist in contact tracing. To expand the disaster model (Section 5.2), more research is needed on how new emergent groups form. However, this would require a greater study of post-disaster events or other types of emergency field trials for observations, something which is currently not available (Burger, 2020).
Even with such limitations and areas of further work, the three models presented in this paper provide a basic experimental setting for agent-based modelers who have their own preferences for programming languages and research areas but want to utilize geographically-explicit synthetic populations at model initialization. This was not the main purpose of the paper, but it illustrates how one could use our novel synthetic population method in a variety of use cases (with or without geographically-explicit social networks) to explore complex geographical systems where agents interact with each other and their environment.