The VLDB Journal, Volume 18, Issue 1, pp 255–276

Swoosh: a generic approach to entity resolution

  • Omar Benjelloun
  • Hector Garcia-Molina
  • David Menestrina
  • Qi Su
  • Steven Euijong Whang
  • Jennifer Widom
Regular Paper

Abstract

We consider the entity resolution (ER) problem (also known as deduplication, or merge–purge), in which records determined to represent the same real-world entity are successively located and merged. We formalize the generic ER problem, treating the functions for comparing and merging records as black boxes, which permits expressive and extensible ER solutions. We identify four important properties that, if satisfied by the match and merge functions, enable much more efficient ER algorithms. We develop three efficient ER algorithms: G-Swoosh for the case where the four properties do not hold, and R-Swoosh and F-Swoosh that exploit the four properties. F-Swoosh in addition assumes knowledge of the “features” (e.g., attributes) used by the match function. We experimentally evaluate the algorithms using comparison shopping data from Yahoo! Shopping and hotel information data from Yahoo! Travel. We also show that R-Swoosh (and F-Swoosh) can be used even when the four match and merge properties do not hold, if an “approximate” result is acceptable.
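
To make the black-box formulation concrete, the following is a minimal Python sketch of a match-and-merge loop in the spirit of R-Swoosh. It is an illustration under stated assumptions, not the authors' implementation: the record layout (attribute name mapped to a set of values), the `match` rule (shared name or shared email), the union-based `merge`, the function name `resolve`, and the sample data are all hypothetical. The sketch assumes the match and merge functions are well behaved in the sense the abstract alludes to, so that comparing each record only against the already-resolved set is sufficient.

```python
from typing import Callable, Dict, FrozenSet, List

# Hypothetical record layout: attribute name -> set of observed values.
Record = Dict[str, FrozenSet[str]]


def match(r1: Record, r2: Record) -> bool:
    """Hypothetical match rule: records match if they share a name or an email."""
    return any(
        r1.get(key, frozenset()) & r2.get(key, frozenset())
        for key in ("name", "email")
    )


def merge(r1: Record, r2: Record) -> Record:
    """Hypothetical merge rule: attribute-wise union of values."""
    return {
        key: r1.get(key, frozenset()) | r2.get(key, frozenset())
        for key in set(r1) | set(r2)
    }


def resolve(
    records: List[Record],
    match_fn: Callable[[Record, Record], bool],
    merge_fn: Callable[[Record, Record], Record],
) -> List[Record]:
    """Repeatedly match and merge until no two resolved records match."""
    pending = list(records)       # records still to be processed
    resolved: List[Record] = []   # records known not to match one another
    while pending:
        current = pending.pop()
        # Look for the first already-resolved record that matches `current`.
        buddy_index = next(
            (i for i, r in enumerate(resolved) if match_fn(current, r)), None
        )
        if buddy_index is None:
            resolved.append(current)
        else:
            # Merge the pair and re-process the result, since the merged
            # record may now match records that neither original matched.
            buddy = resolved.pop(buddy_index)
            pending.append(merge_fn(current, buddy))
    return resolved


records: List[Record] = [
    {"name": frozenset({"J. Doe"}), "email": frozenset({"jdoe@example.com"})},
    {"name": frozenset({"John Doe"}), "email": frozenset({"jdoe@example.com"})},
    {"name": frozenset({"John Doe"}), "phone": frozenset({"555-0100"})},
    {"name": frozenset({"A. Smith"}), "email": frozenset({"asmith@example.com"})},
]
resolved = resolve(records, match, merge)
```

In this sample, the first and third records share neither a name nor an email, yet they end up in one merged record because each matches the second. This is why a merged record is put back into the pending set rather than being treated as final.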

Keywords

Entity resolution, Generic entity resolution, Data cleaning

Copyright information

© Springer-Verlag 2008

Authors and Affiliations

  • Omar Benjelloun (1)
  • Hector Garcia-Molina (2)
  • David Menestrina (2)
  • Qi Su (2)
  • Steven Euijong Whang (2)
  • Jennifer Widom (2)

  1. Google Inc., Mountain View, USA
  2. Computer Science Department, Stanford University, Stanford, USA
