Machine Learning, Volume 95, Issue 1, pp 129–146

Tracking people over time in 19th century Canada for longitudinal analysis

  • Luiza Antonie
  • Kris Inwood
  • Daniel J. Lizotte
  • J. Andrew Ross

Abstract

Linking multiple databases to create longitudinal data is an important research problem with multiple applications. Longitudinal data allow analysts to perform studies that would otherwise be infeasible. We have linked historical census databases to create longitudinal data that allow tracking people over time. These longitudinal data have already been used by social scientists and historians to investigate historical trends and to address questions about society, history and economy; this comparative, systematic research would not be possible without the linked data. The goal of the linking is to identify the same person in multiple census collections. Data imprecision in historical census data and the lack of unique personal identifiers make this task a challenging one. In this paper we design and employ a record linkage system that incorporates a supervised learning module for classifying pairs of records as matches and non-matches. We show that our system performs large-scale linkage, producing high-quality links and generating sufficient longitudinal data to allow meaningful social science studies. We demonstrate the impact of the longitudinal data through a study of the economic changes in 19th century Canada.
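To make the idea of classifying record pairs as matches and non-matches concrete, the following is a minimal sketch, not the system described in the paper. It assumes hypothetical record fields (`first`, `last`, `age`, `birthplace`), a ten-year gap between censuses, invented toy records, and a small logistic regression written by hand purely for illustration; the abstract does not specify which features or classifier the authors use.

```python
import math
from difflib import SequenceMatcher

def rec(first, last, age, birthplace):
    """Build a toy census record (field names are assumptions)."""
    return {"first": first, "last": last, "age": age, "birthplace": birthplace}

def name_sim(a, b):
    """String similarity in [0, 1], tolerant of spelling variation."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pair_features(r1, r2, years_between=10):
    """Similarity features for one candidate pair (earlier record, later record)."""
    age_error = abs((r2["age"] - r1["age"]) - years_between)
    return [
        name_sim(r1["first"], r2["first"]),
        name_sim(r1["last"], r2["last"]),
        1.0 if age_error <= 2 else 0.0,  # tolerate imprecise reported ages
        1.0 if r1["birthplace"] == r2["birthplace"] else 0.0,
    ]

def predict_proba(w, b, x):
    """Estimated probability that feature vector x describes a match."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, epochs=2000, lr=0.5):
    """Fit logistic regression with plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            g = predict_proba(w, b, xi) - yi  # gradient of the log loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

# Invented, hand-labelled training pairs: 1 = same person, 0 = different people.
labelled_pairs = [
    (rec("William", "Smith", 30, "Ontario"), rec("Wm", "Smith", 40, "Ontario"), 1),
    (rec("Mary", "Fraser", 22, "Ontario"), rec("Mary", "Fraser", 32, "Ontario"), 1),
    (rec("James", "Campbell", 40, "Scotland"), rec("James", "Cambell", 50, "Scotland"), 1),
    (rec("William", "Smith", 30, "Ontario"), rec("Mary", "Smith", 22, "Ontario"), 0),
    (rec("Catherine", "Roy", 18, "Quebec"), rec("Pierre", "Gagnon", 60, "Quebec"), 0),
    (rec("James", "Campbell", 40, "Scotland"), rec("John", "Murphy", 50, "Ireland"), 0),
]
X = [pair_features(a, b) for a, b, _ in labelled_pairs]
y = [label for _, _, label in labelled_pairs]
w, b = train_logistic(X, y)

# Score an unseen candidate pair with a spelling variant and a one-year age error.
candidate = pair_features(rec("Catherine", "Roy", 18, "Quebec"),
                          rec("Katherine", "Roy", 29, "Quebec"))
print("match probability:", round(predict_proba(w, b, candidate), 3))
```

In a real linkage setting the candidate pairs would first be narrowed by some blocking step, and the decision threshold would be tuned to favour precision, since a false link corrupts the longitudinal data more than a missed one; both points are general record-linkage practice rather than claims about this paper.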

Keywords

Record linkage, Classification, Historical census

Copyright information

© The Author(s) 2013

Authors and Affiliations

  • Luiza Antonie (1)
  • Kris Inwood (2)
  • Daniel J. Lizotte (3)
  • J. Andrew Ross (4)

  1. Historical Data Research Unit, University of Guelph, Guelph, Canada
  2. Department of Economics and Finance, University of Guelph, Guelph, Canada
  3. David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada
  4. Department of History, University of Guelph, Guelph, Canada
