Unsupervised Duplicate Detection Using Sample Non-duplicates

  • Patrick Lehti
  • Peter Fankhauser
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4244)


The problem of identifying objects in databases that refer to the same real-world entity is known, among other names, as duplicate detection or record linkage. Objects may be duplicates even though they are not identical, because of errors and missing data. Current methods typically require a deep understanding of the application domain or a representative training set, both of which entail significant costs. In this paper we present an unsupervised, domain-independent approach to duplicate detection. It starts with a broad alignment of potential duplicates and then analyses the distribution of observed similarity values among these potential duplicates and among representative sample non-duplicates to improve the initial alignment. The approach not only aligns flat records but also makes use of related objects, which may significantly increase alignment accuracy. Evaluations show that our approach outperforms other unsupervised approaches and reaches almost the same accuracy as fully supervised, domain-dependent approaches.
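
The general idea of refining a broad alignment with the help of sample non-duplicates can be illustrated with a small Python sketch. This is not the paper's algorithm: the token-based Jaccard similarity, the assumption that each source is internally duplicate-free (so that within-source pairs can serve as sample non-duplicates), the mean-plus-k-standard-deviations threshold, and all names such as `refine_alignment` are illustrative assumptions. The keywords suggest that the actual decision module is more elaborate.

```python
from itertools import combinations, product
from statistics import mean, pstdev


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity (an illustrative choice of measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def refine_alignment(source_a, source_b, k=2.0):
    """Toy refinement of a broad alignment using sample non-duplicates.

    Assumption (hypothetical, for illustration): each source contains no
    internal duplicates, so pairs drawn from within one source act as
    sample non-duplicates.
    """
    # Step 1: broad initial alignment, here every cross-source pair
    # that shares at least one token.
    candidates = [
        (a, b, jaccard(a, b)) for a, b in product(source_a, source_b)
    ]
    candidates = [c for c in candidates if c[2] > 0.0]

    # Step 2: similarity values observed among the sample non-duplicates.
    non_dup_sims = [
        jaccard(x, y)
        for source in (source_a, source_b)
        for x, y in combinations(source, 2)
    ]

    # Step 3: keep only candidates whose similarity is unusually high
    # compared with the non-duplicate distribution.
    threshold = mean(non_dup_sims) + k * pstdev(non_dup_sims)
    matches = [(a, b) for a, b, sim in candidates if sim > threshold]
    return matches, threshold


source_a = ["john smith darmstadt", "mary jones berlin", "peter miller darmstadt"]
source_b = ["john smith darmstadt germany", "m jones berlin", "anna schmidt darmstadt"]

matches, threshold = refine_alignment(source_a, source_b)
for a, b in matches:
    print(f"{a!r} <-> {b!r}")
```

In this toy data, pairs that share only the city token score about as high as the within-source non-duplicates do, so they fall below the threshold, while the two pairs that also share name tokens are kept.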


Support Vector Machine, Similarity Measure, Related Object, Independence Assumption, Decision Module
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Patrick Lehti (1)
  • Peter Fankhauser (1)
  1. Fraunhofer IPSI, Darmstadt, Germany
