Semi-supervised and Unsupervised Approaches to Record Pairs Classification in Multi-Source Data Linkage

Linking and Mining Heterogeneous and Multi-view Data

Part of the book series: Unsupervised and Semi-Supervised Learning (UNSESUL)

Abstract

Data integration has become one of the main challenges in the era of Big Data analytics. Often, to enable decision-making, data from different sources have to be integrated and linked together. For example, multi-source data integration is vital to police, counter-terrorism and national security work, where it allows efficient and accurate verification of people. One of the key challenges in the data integration process is matching records that represent the same real-world entity (e.g. a person). This process is referred to as record linkage. In many cases, data sets do not share a unique identifier (e.g. a National Insurance Number), so records need to be matched by comparing their corresponding attributes. Most existing record linkage methods require assistance from a domain expert to handcraft domain-specific linking rules. More automated approaches based on machine learning have also been proposed. However, those approaches rely on a substantial set of manually labelled records, which makes them inapplicable in real-world scenarios. Given the importance of the problem, record linkage has attracted strong interest over the past decade, and significant progress has been made in this area. In particular, many studies have addressed the problem of reducing the manual effort and the amount of labelled data required to construct record linkage models. In this chapter, we review the most recently proposed approaches to semi-supervised and unsupervised record linkage.

Author information

Corresponding author

Correspondence to Anna Jurek-Loughrey.

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Jurek-Loughrey, A., P, D. (2019). Semi-supervised and Unsupervised Approaches to Record Pairs Classification in Multi-Source Data Linkage. In: P, D., Jurek-Loughrey, A. (eds) Linking and Mining Heterogeneous and Multi-view Data. Unsupervised and Semi-Supervised Learning. Springer, Cham. https://doi.org/10.1007/978-3-030-01872-6_3

  • DOI: https://doi.org/10.1007/978-3-030-01872-6_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-01871-9

  • Online ISBN: 978-3-030-01872-6

  • eBook Packages: Engineering, Engineering (R0)
