Abstract
Data are essential to any research, yet the required data are often not readily available. Researchers therefore frequently fuse multiple datasets to obtain the data they need. In urban simulation, spatially referencing data is of paramount importance to capture local variations in travel and preferences. Data fusion typically obfuscates the spatial reference by merging records from different locations. Population synthesis is used to match these fused household records to a plausible location based on aggregate sociodemographic statistics. In some cases, researchers must also synthesize the necessary data. This paper outlines a data fusion workflow for generating a statistically valid synthetic population for use in urban simulation models. We develop the framework for the case of household-level expenditure and individuals’ time use patterns. The Greater Toronto Area (GTA) in Canada is used as a testbed. The results of the data fusion and synthesis are validated against statistics from a large-sample travel survey conducted in the GTA, showing a good fit with the validation dataset. Finally, we outline how the framework could be applied in other contexts where a single dataset is unavailable.
Notes
Throughout the paper, we use the term expenditure to refer to monetary spending only. Time is referenced as either time spent or time allocated to an activity.
Ye et al. (2017) provide pseudocode for the algorithm in the original paper.
References
Astroza, S., Pinjari, A.R., Bhat, C.R., Jara-Díaz, S.R.: A microeconomic theory-based latent class multiple discrete-continuous choice model of time use and goods consumption. Transp. Res. Rec. 2664, 31–41 (2017). https://doi.org/10.3141/2664-04
Backor, K., Golde, S., Nie, N.: Estimating survey fatigue in time use study. Paper Presented at the 2007 International Association for Time Use Research Conference, Washington, D.C., pp. 1–59 (2007)
Barthelemy, J., Toint, P.L.: Synthetic population generation without a sample. Transp. Sci. 47(2), 266–279 (2013). https://doi.org/10.1287/trsc.1120.0408
Bhat, C.R.: A new generalized heterogeneous data model (GHDM) to jointly model mixed types of dependent variables. Transp. Res. Part B 79, 50–77 (2015). https://doi.org/10.1016/j.trb.2015.05.017
Browning, M., Gørtz, M.: Spending time and money within the household. Scand. J. Econ. 114(3), 681–704 (2012)
Dane, G., Arentze, T.A., Timmermans, H.J.P., Ettema, D.: Simultaneous modeling of individuals’ duration and expenditure decisions in out-of-home leisure activities. Transp. Res. Part A 70, 93–103 (2014). https://doi.org/10.1016/j.tra.2014.10.003
Fang, L., Zhu, G.: Time allocation and home production technology. J. Econ. Dyn. Control 78, 88–101 (2017). https://doi.org/10.1016/j.jedc.2017.02.009
Gargiulo, F., Ternes, S., Huet, S., Deffuant, G.: An iterative approach for generating statistically realistic populations of households. PLoS ONE (2010). https://doi.org/10.1371/journal.pone.0008828
Hössinger, R., Aschauer, F., Jara-Díaz, S., Jokubauskaite, S., Schmid, B., Peer, S., Axhausen, K.W., Gerike, R.: A joint time-assignment and expenditure-allocation model: value of leisure and value of time assigned to travel for specific population segments. Transportation (2019). https://doi.org/10.1007/s11116-019-10022-w
Huynh, N., Barthélemy, J., Perez, P.: A heuristic combinatorial optimisation approach to synthesising a population for agent-based modelling purposes. JASSS (2016). https://doi.org/10.18564/jasss.3198
Jara-Díaz, S.R.: On the goods-activities technical relations in the time allocation theory. Transportation 30(3), 245–260 (2003). https://doi.org/10.1023/A:1023936911351
Jara-Díaz, S., Rosales-Salas, J.: Understanding time use: Daily or weekly data? Transp. Res. Part A (2015). https://doi.org/10.1016/j.tra.2014.07.009
Jara-Díaz, S.R., Munizaga, M.A., Greeven, P., Guerra, R., Axhausen, K.: Estimating the value of leisure from a time allocation model. Transp. Res. Part B 42(10), 946–957 (2008). https://doi.org/10.1016/j.trb.2008.03.001
Jeong, B., Lee, W., Kim, D.-S., Shin, H.: Copula-based approach to synthetic population generation. PLoS ONE (2016). https://doi.org/10.1371/journal.pone.0159496
Konduri, K.C., Tagle, S.A., Sana, B., Pendyala, R.M., Jara-Díaz, S.R.: A joint analysis of time use and consumer expenditure data: An examination of two alternative approaches to deriving values of time. Transp. Res. Rec. 2231, 53–60 (2011)
Lee, A.: Generating synthetic microdata from published marginal tables and confidentialised files. Comput. Sci. 17, 1–121 (2009)
Lenormand, M., Deffuant, G.: Generating a synthetic population of individuals in households: Sample-free vs sample-based methods. JASSS (2013). https://doi.org/10.18564/jasss.2319
Lohr, S.L.: Sampling: Design and Data Analysis, 2nd edn. Cengage Learning, Brooks/Cole (2010)
Malatest, & DMG: Transportation tomorrow survey 2016. http://dmg.utoronto.ca/transportation-tomorrow-survey/tts-reports (2018)
Munizaga, M., Jara-Díaz, S., Olguín, J., Rivera, J.: Generating twins to build weekly time use data from multiple single day OD surveys. Transportation 38(3), 511–524 (2011). https://doi.org/10.1007/s11116-010-9311-z
Pu, Y., Dai, S., Gan, Z., Wang, W., Wang, G., Zhang, Y., Henao, R., Carin, L.: JointGAN: Multi-Domain Joint Distribution Learning with Generative Adversarial Nets. (2018)
Rubin, D.B.: Discussion: Statistical disclosure limitation. J. Off. Stat. 9(2), 461–468 (1993)
Saadi, I., Mustafa, A., Teller, J., Farooq, B., Cools, M.: Hidden Markov model-based population synthesis. Transp. Res. Part B 90, 1–21 (2016). https://doi.org/10.1016/j.trb.2016.04.007
Saadi, I., Farooq, B., Mustafa, A., Teller, J., Cools, M.: An efficient hierarchical model for multi-source information fusion. Expert Syst. Appl. 110, 352–362 (2018). https://doi.org/10.1016/j.eswa.2018.06.018
Sakshaug, J.W., Raghunathan, T.E.: Generating synthetic microdata to estimate small area statistics in the American Community Survey. Stat. Transit. 15(3), 341–368 (2014)
van Nostrand, C., Sivaraman, V., Pinjari, A.: Analysis of long-distance vacation travel demand in the United States: A multiple discrete-continuous choice framework. Transportation 40(1), 151–171 (2013). https://doi.org/10.1007/s11116-012-9397-6
Williams, L.J., Hartman, N., Cavazotte, F.: Method variance and marker variables: A review and comprehensive CFA marker technique. Org. Res. Methods 13(3), 477–514 (2010). https://doi.org/10.1177/1094428110366036
Ye, P., Hu, X., Yuan, Y., Wang, F.Y.: Population synthesis based on joint distribution inference without disaggregate samples. JASSS (2017). https://doi.org/10.18564/jasss.3533
Zhang, A., Kang, J.E., Axhausen, K., Kwon, C.: Multi-day activity-travel pattern sampling based on single-day data. Transp. Res. Part C 89, 96–112 (2018). https://doi.org/10.1016/j.trc.2018.01.024
Acknowledgements
The study was funded by an NSERC CGS-D Scholarship and a CRDCN Emerging Scholar Grant awarded to the first author, and by an NSERC Discovery Grant awarded to the second author.
Author information
Contributions
The authors confirm contribution to the paper as follows: study conception and design: JH; analysis and interpretation of results: JH; draft manuscript preparation: JH; overall supervision: KMNH. All authors reviewed the results and approved the final version of the manuscript.
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
About this article
Cite this article
Hawkins, J., Habib, K.N. A multi-source data fusion framework for joint population, expenditure, and time use synthesis. Transportation 50, 1323–1346 (2023). https://doi.org/10.1007/s11116-022-10279-8
DOI: https://doi.org/10.1007/s11116-022-10279-8