Efficient Duplicate Record Detection Based on Similarity Estimation

  • Mohan Li
  • Hongzhi Wang
  • Jianzhong Li
  • Hong Gao
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6184)


In information integration systems, duplicate records cause problems in data processing and analysis. To represent the similarity between two records from different data sources with different schemas, optimal bipartite graph matching is applied to their attributes, and the similarity is measured as the weight of that matching. However, this intuitive method has two shortcomings. In terms of efficiency, it must compare all records pairwise. In terms of effectiveness, a strict condition for judging duplicate records results in low recall. To make the method work in practice, this paper presents an efficient method based on similarity estimation. The basic idea is to estimate the range of the similarity between two records in O(1) time and to decide whether they are duplicates according to that estimate. Theoretical analysis and experimental results show that the method is effective and efficient.
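The matching-based similarity described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the attribute-level measure (`difflib.SequenceMatcher`), the brute-force search over assignments, the normalisation by the larger attribute count, and the function names `attribute_similarity` and `record_similarity` are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import permutations
from typing import Sequence


def attribute_similarity(a: str, b: str) -> float:
    """Illustrative string similarity in [0, 1] (an assumption, not the paper's measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: Sequence[str], r2: Sequence[str]) -> float:
    """Weight of the best one-to-one attribute matching, scaled to [0, 1].

    Each attribute of the shorter record is paired with a distinct attribute
    of the longer one, and the pairing with the largest total similarity wins.
    The search is brute force, so it only suits records with a few attributes.
    """
    small, large = (r1, r2) if len(r1) <= len(r2) else (r2, r1)
    if not large:
        return 0.0  # two empty records: nothing to match (a convention chosen here)
    weights = [[attribute_similarity(a, b) for b in large] for a in small]
    best = max(
        sum(weights[i][j] for i, j in enumerate(assignment))
        for assignment in permutations(range(len(large)), len(small))
    )
    # Assumed normalisation: divide by the attribute count of the longer record.
    return best / len(large)


if __name__ == "__main__":
    a = ["John Smith", "Boston", "1980"]
    b = ["1980", "Smith, John", "Boston"]
    print(record_similarity(a, b))
```

Because the matching ignores attribute order, a record and a reordered copy of it score 1.0 under this sketch, which is the property that lets records with different schemas be compared. A polynomial-time assignment algorithm such as the Hungarian method would replace the permutation search for records with many attributes.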


Keywords: Heterogeneous Records, Duplicate Detection, Record Similarity, Similarity Estimation





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Mohan Li (1)
  • Hongzhi Wang (1)
  • Jianzhong Li (1)
  • Hong Gao (1)
  1. Harbin Institute of Technology, Harbin
