Entity Matching Across Multiple Heterogeneous Data Sources

  • Chao Kong
  • Ming Gao (email author)
  • Chen Xu
  • Weining Qian
  • Aoying Zhou
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9642)

Abstract

Entity matching is the problem of identifying which entities in one data source refer to the same real-world entity as those in other sources. Identifying entities across heterogeneous data sources is essential to applications such as entity profiling and product recommendation. The matching process is not only prohibitively expensive for large data sources, since it involves all tuples from two or more data sources, but must also handle heterogeneous entity attributes. In this paper, we design an unsupervised approach, called EMAN, to match entities across two or more heterogeneous data sources. The algorithm uses a locality sensitive hashing scheme to reduce the number of candidate tuples and speed up the matching process. To handle the heterogeneous entity attributes, we employ the exponential family to model the similarities between different attributes. EMAN is highly accurate and efficient even without any ground-truth tuples. We illustrate the performance of EMAN on re-identifying entities from the same data source, as well as on matching entities across three real data sources. Our experimental results show that the proposed approach outperforms the comparable baseline.
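
The candidate-reduction idea mentioned in the abstract can be illustrated with a minimal sketch. This is not EMAN's implementation: the abstract does not state the hash family or its parameters, so the sketch assumes MinHash over character 3-grams with banding, and the names `shingles`, `minhash_signature`, and `candidate_pairs` are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of character k-grams of a lower-cased string."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=32):
    """MinHash signature: for each seeded hash, keep the minimum over the set."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def candidate_pairs(records, num_hashes=32, bands=16):
    """Return pairs of record ids that share at least one LSH band bucket.

    Assumption: `records` maps a record id to a single text attribute.
    """
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rec_id, text in records.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            # Records whose signatures agree on a whole band land in one bucket.
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(rec_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    # Made-up records for illustration only.
    records = {
        "a1": "Apple iPhone 6 16GB",
        "b1": "Apple iPhone 6 16GB",
        "c1": "Lenovo ThinkPad X1 Carbon",
    }
    print(candidate_pairs(records))
```

Only pairs that collide in at least one band are passed to the more expensive similarity model, so the number of comparisons can fall well below the full cross product. Records with identical text always collide, while the chance that near-duplicates collide depends on the assumed `num_hashes` and `bands` settings.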

Keywords

Entity matching, Exponential family, Locality sensitive hashing

Notes

Acknowledgements

This work is supported by the National Basic Research Program (973) of China (No. 2012CB316203) and by NSFC under Grant Nos. U1401256, 61402177, 61402180 and 61232002. This work is also supported by the CCF-Tencent Research Program of China (No. AGR20150114), the NSF of Shanghai (No. 14ZR1412600), and an ECNU fund for overseas scholars, international conferences and domestic scholarly visits.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Chao Kong (1)
  • Ming Gao (1), email author
  • Chen Xu (2)
  • Weining Qian (1)
  • Aoying Zhou (1)

  1. Institute for Data Science and Engineering, ECNU-PINGAN Innovative Research Center for Big Data, East China Normal University, Shanghai, China
  2. Technische Universität Berlin, Berlin, Germany