The VLDB Journal, Volume 22, Issue 6, pp 773–795

Joint entity resolution on multiple datasets

Regular Paper

Abstract

Entity resolution (ER) is the problem of identifying which records in a database represent the same entity. Often, records of different types are involved (e.g., authors, publications, institutions, venues), and resolving records of one type can impact the resolution of other types of records. In this paper we propose a flexible, modular resolution framework where existing ER algorithms developed for a given record type can be plugged in and used in concert with other ER algorithms. Our approach also makes it possible to run ER on subsets of similar records at a time, important when the full data are too large to resolve together. We study the scheduling and coordination of the individual ER algorithms, in order to resolve the full dataset, and show the scalability of our approach. We also introduce a “state-based” training technique where each ER algorithm is trained for the particular execution context (relative to other types of records) where it will be used.
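The coordination idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: it is a simplified worklist loop, written under the assumption that each record type has its own pluggable resolver and that a resolver is re-run whenever a type that influences it changes. All names here (`joint_resolve`, `influences`, `PAPERS`, `VENUES`, and the two toy resolvers) are hypothetical and chosen only for illustration.

```python
from collections import deque
from typing import Callable, Dict, FrozenSet, List

Partition = FrozenSet[FrozenSet[str]]      # a set of clusters of record ids
State = Dict[str, Partition]               # record type -> current partition
Resolver = Callable[[State], Partition]    # pluggable per-type ER algorithm

# Toy data (hypothetical): paper id -> (title, venue id)
PAPERS = {"p1": ("Joint ER", "v1"), "p2": ("Joint ER", "v2"), "p3": ("Swoosh", "v3")}
VENUES = ["v1", "v2", "v3"]


def resolve_papers(state: State) -> Partition:
    """Toy paper resolver: papers with identical titles are the same entity."""
    groups: Dict[str, set] = {}
    for pid, (title, _venue) in PAPERS.items():
        groups.setdefault(title, set()).add(pid)
    return frozenset(frozenset(g) for g in groups.values())


def resolve_venues(state: State) -> Partition:
    """Toy venue resolver: venues of papers in one paper cluster are merged."""
    clusters: List[set] = [{v} for v in VENUES]
    for paper_cluster in state["paper"]:
        linked = {PAPERS[p][1] for p in paper_cluster}
        merged = set().union(*[c for c in clusters if c & linked])
        clusters = [c for c in clusters if not (c & linked)] + [merged]
    return frozenset(frozenset(c) for c in clusters)


def joint_resolve(state: State,
                  resolvers: Dict[str, Resolver],
                  influences: Dict[str, List[str]]) -> State:
    """Re-run per-type resolvers until no partition changes.

    influences[t] lists the types to reschedule when type t changes.
    Termination is assumed, e.g. because the resolvers only ever merge.
    """
    queue = deque(resolvers)               # schedule every type once
    queued = set(resolvers)
    while queue:
        rtype = queue.popleft()
        queued.discard(rtype)
        new_partition = resolvers[rtype](state)
        if new_partition != state[rtype]:
            state[rtype] = new_partition
            for other in influences.get(rtype, []):
                if other not in queued:    # avoid duplicate scheduling
                    queue.append(other)
                    queued.add(other)
    return state


initial: State = {
    "paper": frozenset(frozenset({p}) for p in PAPERS),
    "venue": frozenset(frozenset({v}) for v in VENUES),
}
result = joint_resolve(
    initial,
    {"paper": resolve_papers, "venue": resolve_venues},
    {"paper": ["venue"], "venue": ["paper"]},
)
```

In this toy run, merging the two papers with the same title causes their venues to be merged in turn, which shows how resolving one record type can affect another. A real system would replace the toy resolvers with existing ER algorithms and would need a scheduling policy suited to the data size.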

Keywords

Entity resolution; Joint entity resolution; Physical execution; Influence graph; Execution plan; Expander function; State-based training; Data cleaning

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  1. Google Research, Mountain View, USA
  2. Computer Science Department, Stanford University, Stanford, USA
