Multi-Source Entity Resolution for Genealogical Data

  • Julia Efremova
  • Bijan Ranjbar-Sahraei
  • Hossein Rahmani
  • Frans A. Oliehoek
  • Toon Calders
  • Karl Tuyls
  • Gerhard Weiss


In this chapter, we study the application of existing entity resolution (ER) techniques to a real-world multi-source genealogical dataset. Our goal is to identify all persons involved in various notary acts and link them to their birth, marriage, and death certificates. We analyze the influence of additional ER features, such as name popularity, geographical distance, and co-reference information, on overall ER performance. We study two prediction models: regression trees and logistic regression. To evaluate the applied algorithms and to obtain a training set for learning the models, we developed an interactive interface for collecting feedback from human experts. We perform an empirical evaluation on the manually annotated dataset in terms of precision, recall, and F-score. We show that using name popularity and geographical distance together with co-reference information significantly improves ER results.
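As an illustration of the evaluation measures named above, the following minimal sketch scores a set of predicted matching record pairs against expert-annotated matches. It assumes the balanced F-score (F1) and set-based matching of pairs; the function name and the record identifiers are hypothetical and are not taken from the chapter.

```python
def precision_recall_f1(predicted, annotated):
    """Score predicted matching pairs against expert-annotated matching pairs.

    Assumption: a pair counts as correct only if it appears in both sets,
    and the F-score is the balanced F1 (harmonic mean of precision and recall).
    """
    predicted, annotated = set(predicted), set(annotated)
    true_pos = len(predicted & annotated)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(annotated) if annotated else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical record identifiers, for illustration only.
annotated = {
    ("notary_17", "birth_3"),
    ("notary_17", "death_9"),
    ("notary_21", "marriage_4"),
}
predicted = {
    ("notary_17", "birth_3"),
    ("notary_21", "marriage_4"),
    ("notary_30", "birth_8"),
}

p, r, f = precision_recall_f1(predicted, annotated)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

Here two of the three predicted pairs are annotated as true matches, and two of the three annotated matches are recovered, so precision, recall, and F1 coincide.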


Death Certificate; Entity Resolution; Candidate Pair; Natural Language Processing Technique; Genealogical Data



The authors are grateful to the BHIC Center for its support in data gathering, data analysis, and direction. In particular, we would like to thank Rien Wols and Anton Schuttelaars, whose efforts were instrumental to this research and whose patience and support appeared infinite. This research was carried out under the Mining Social Structures from Genealogical Data project (project no. 640.005.003), part of the CATCH program funded by the Netherlands Organization for Scientific Research (NWO).



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Julia Efremova (1, corresponding author)
  • Bijan Ranjbar-Sahraei (2)
  • Hossein Rahmani (2)
  • Frans A. Oliehoek (3, 5)
  • Toon Calders (1, 4)
  • Karl Tuyls (5)
  • Gerhard Weiss (2)

  1. Eindhoven University of Technology, Eindhoven, The Netherlands
  2. Maastricht University, Maastricht, The Netherlands
  3. University of Amsterdam, Amsterdam, The Netherlands
  4. Université Libre de Bruxelles, Brussels, Belgium
  5. University of Liverpool, Liverpool, UK
