Robust Temporal Graph Clustering for Group Record Linkage

  • Charini Nanayakkara (email author)
  • Peter Christen
  • Thilina Ranbaduge
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11440)

Abstract

Research in the social sciences is increasingly based on large and complex data collections, where individual data sets from different domains need to be linked to allow advanced analytics. Historical registries containing birth, death, and marriage certificates are a popular type of data used in this context. Taken individually, however, such data sets limit the types of studies that can be conducted. Specifically, it is impossible to track individuals, families, or households over time. Once such data sets are linked and family trees are available, it becomes possible, for example, to investigate how education, health, mobility, and employment influence the lives of people over two or more generations. The linkage of historical records is challenging because of data quality issues and because ground truth data are often unavailable. Unsupervised techniques therefore need to be employed, and these are generally based on similarity graphs generated by comparing individual records. In this paper we present a novel temporal clustering approach aimed at linking records of the same group (such as all births by the same mother) where temporal constraints (such as intervals between births) need to be enforced. We combine a connected component approach with an iterative merging step that considers temporal constraints to obtain accurate clustering results. Experiments on a real Scottish data set show that our approach outperforms a previous clustering approach for record linkage.
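
The general idea of combining connected components with a temporally constrained merging step can be illustrated with a minimal sketch. This is not the authors' algorithm: the greedy merge order, the similarity threshold `min_sim`, the gap limits `MAX_TWIN_GAP_DAYS` and `MIN_BIRTH_GAP_DAYS`, and all function names are hypothetical choices made only for illustration. The sketch assumes birth records are given as a mapping from record identifier to birth date, and the similarity graph as a list of weighted edges.

```python
from datetime import date

# Hypothetical temporal constraint (an assumption, not taken from the paper):
# two births by the same mother are either a multiple birth (a few days
# apart at most) or separated by roughly nine months or more.
MAX_TWIN_GAP_DAYS = 2
MIN_BIRTH_GAP_DAYS = 270


def temporally_compatible(d1, d2):
    """Return True if two birth dates could belong to the same mother."""
    gap = abs((d1 - d2).days)
    return gap <= MAX_TWIN_GAP_DAYS or gap >= MIN_BIRTH_GAP_DAYS


def connected_components(nodes, edges):
    """Group nodes into connected components using a simple union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b, _sim in edges:
        parent[find(a)] = find(b)

    comps = {}
    for n in nodes:
        comps.setdefault(find(n), set()).add(n)
    return list(comps.values())


def cluster_births(birth_dates, edges, min_sim=0.8):
    """birth_dates: {record_id: date}; edges: [(id_a, id_b, similarity)]."""
    # Keep only edges at or above the (assumed) similarity threshold.
    strong = [e for e in edges if e[2] >= min_sim]
    clusters = []
    for comp in connected_components(birth_dates, strong):
        # Start from singleton clusters inside each connected component.
        owner = {r: frozenset([r]) for r in comp}
        comp_edges = sorted((e for e in strong if e[0] in comp),
                            key=lambda e: -e[2])
        # Greedily merge along the most similar edges first, but only when
        # every cross-cluster pair of births satisfies the constraint.
        for a, b, _sim in comp_edges:
            ca, cb = owner[a], owner[b]
            if ca == cb:
                continue
            if all(temporally_compatible(birth_dates[x], birth_dates[y])
                   for x in ca for y in cb):
                merged = ca | cb
                for r in merged:
                    owner[r] = merged
        clusters.extend(set(owner.values()))
    return clusters


# Toy data: r3 is only three months after r2, so both cannot share a mother.
birth_dates = {
    "r1": date(1861, 1, 10),
    "r2": date(1862, 6, 1),
    "r3": date(1862, 9, 1),
    "r4": date(1870, 1, 1),
}
edges = [("r1", "r2", 0.95), ("r1", "r3", 0.90),
         ("r2", "r3", 0.85), ("r3", "r4", 0.50)]
clusters = cluster_births(birth_dates, edges)
```

On this toy input, plain connected components would place r1, r2, and r3 in one group. The constrained merge instead yields the clusters {r1, r2}, {r3}, and {r4}, because r2 and r3 are only 92 days apart and r4 has no edge above the threshold.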

Keywords

Entity resolution, Star clustering, Vital records, Birth bundling

Notes

Acknowledgements

This work was supported by ESRC grants ES/K00574X/2 Digitising Scotland and ES/L007487/1 ADRC-S. We would like to thank Alice Reid of the University of Cambridge and her colleagues Ros Davies and Eilidh Garrett for their work on the Isle of Skye database and for their helpful advice on historical Scottish demography. This work was partially funded by the Australian Research Council under DP160101934.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Charini Nanayakkara (1, email author)
  • Peter Christen (1)
  • Thilina Ranbaduge (1)
  1. Research School of Computer Science, The Australian National University, Canberra, Australia
