Using Metric Space Indexing for Complete and Efficient Record Linkage

  • Özgür Akgün
  • Alan Dearle
  • Graham Kirby
  • Peter Christen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10939)

Abstract

Record linkage is the process of identifying records that refer to the same real-world entities in situations where entity identifiers are unavailable. Records are linked on the basis of similarity between common attributes, with every pair being classified as a link or non-link depending on their similarity. Linkage is usually performed in a three-step process: first, groups of similar candidate records are identified using indexing, then pairs within the same group are compared in more detail, and finally classified. Even state-of-the-art indexing techniques, such as locality sensitive hashing, have potential drawbacks. They may fail to group together some true matching records with high similarity, or they may group records with low similarity, leading to high computational overhead. We propose using metric space indexing (MSI) to perform complete linkage, resulting in a parameter-free process combining indexing, comparison and classification into a single step delivering complete and efficient record linkage. An evaluation on real-world data from several domains shows that linkage using MSI can yield better quality than current indexing techniques, with similar execution cost, without the need for domain knowledge or trial and error to configure the process.
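
To illustrate the general idea of linking records through a metric index, the following is a minimal Python sketch, not the authors' implementation. It assumes string records, Levenshtein edit distance as the metric, a user-chosen distance threshold, and a deliberately simple single-pivot filter standing in for a real metric index structure; the names `PivotIndex`, `range_query` and `link` are hypothetical. Because the distance is a true metric, the triangle inequality lets the index skip records that cannot lie within the threshold, without ever discarding a true match.

```python
from typing import Callable, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (a true metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


class PivotIndex:
    """Hypothetical single-pivot index: stores d(pivot, record) per record."""

    def __init__(self, records: List[str],
                 distance: Callable[[str, str], int]) -> None:
        if not records:
            raise ValueError("cannot index an empty record set")
        self.distance = distance
        self.pivot = records[0]  # assumption: first record serves as pivot
        self.entries = [(distance(self.pivot, r), r) for r in records]

    def range_query(self, query: str, threshold: int) -> List[str]:
        """Return every indexed record within `threshold` of `query`."""
        d_qp = self.distance(query, self.pivot)
        matches = []
        for d_pr, record in self.entries:
            # Triangle inequality: |d(q,p) - d(p,r)| <= d(q,r), so a record
            # failing this test cannot match and needs no full comparison.
            if abs(d_qp - d_pr) > threshold:
                continue
            if self.distance(query, record) <= threshold:
                matches.append(record)
        return matches


def link(left: List[str], right: List[str],
         threshold: int) -> List[Tuple[str, str]]:
    """Indexing, comparison and classification collapsed into range queries."""
    index = PivotIndex(right, levenshtein)
    return [(l, r) for l in left for r in index.range_query(l, threshold)]


if __name__ == "__main__":
    left = ["jon smith", "mary jones"]
    right = ["john smith", "mary jones", "alan brown"]
    print(link(left, right, threshold=1))
```

The pivot test only prunes candidates that are provably too far away, so the result is the same set of pairs that an exhaustive all-pairs comparison at the same threshold would return; a production index would use many pivots or a tree structure to prune far more aggressively.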

Keywords

Entity resolution, Data matching, Similarity search, Blocking

Notes

Acknowledgements

This work was supported by ESRC grants ES/K00574X/2 “Digitising Scotland” and ES/L007487/1 “Administrative Data Research Centre—Scotland”.

We thank Alice Reid of the University of Cambridge and her colleagues, especially Ros Davies and Eilidh Garrett, for the work undertaken on the Kilmarnock and Isle of Skye databases.

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Özgür Akgün (1)
  • Alan Dearle (1)
  • Graham Kirby (1)
  • Peter Christen (2)
  1. School of Computer Science, University of St Andrews, St Andrews, Scotland
  2. Research School of Computer Science, The Australian National University, Canberra, Australia
