Active Duplicate Detection

  • Ke Deng
  • Liwei Wang
  • Xiaofang Zhou
  • Shazia Sadiq
  • Gabriel Pui Cheong Fung
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5981)


The aim of duplicate detection is to group records in a relation that refer to the same real-world entity, such as a person or a business. Most existing approaches require user-specified parameters, such as a similarity threshold, before duplicate detection can be carried out. We call these methods user-first. In many scenarios, however, it is hard for the user to specify such parameters in advance, and the values chosen are often unreliable, which limits the applicability of user-first methods. In this paper, we propose a user-last method, called Active Duplicate Detection (ADD), in which an initial solution is returned without requiring the user to specify such parameters, and the user is then involved in refining that solution. Unlike user-first methods, where the user makes decisions before any processing takes place, ADD lets the user make decisions based on an initial solution. The initial solution identified by ADD is of comparatively high quality and can be refined systematically at almost zero cost.
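To make the dependence on a pre-specified threshold concrete, the following is a minimal sketch of a generic user-first approach. It is not the ADD algorithm, which the abstract does not describe in detail. The sample records, the choice of edit distance as the similarity measure, the single-link grouping, and the function names `edit_distance` and `group_duplicates` are all assumptions made for illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def group_duplicates(records, threshold):
    """Hypothetical user-first grouping: merge records within `threshold` edits.

    Uses union-find, so groups are connected components (single-link).
    """
    parent = list(range(len(records)))

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(records)), 2):
        if edit_distance(records[i], records[j]) <= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, rec in enumerate(records):
        groups.setdefault(find(i), []).append(rec)
    return sorted(groups.values())


# Illustrative records only; they are not taken from the paper.
records = ["Jon Smith", "John Smith", "Jonny Smith", "Jane Doe"]
for t in (0, 1, 2):
    print(t, group_duplicates(records, t))
```

On these sample records the number of groups changes from four to three to two as the threshold goes from 0 to 1 to 2, so the result depends entirely on a value the user must choose before seeing any output. This is the difficulty that motivates a user-last approach.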


Keywords: Initial Solution; Leaf Node; Child Node; Edit Distance; Distance Range
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Ke Deng (1)
  • Liwei Wang (2)
  • Xiaofang Zhou (1)
  • Shazia Sadiq (1)
  • Gabriel Pui Cheong Fung (1)

  1. The University of Queensland, Australia
  2. Wuhan University, China
