An incremental clustering scheme for data de-duplication

  • Gianni Costa
  • Giuseppe Manco
  • Riccardo Ortale
Article

Abstract

We propose an incremental technique for discovering duplicates in large databases of textual sequences, that is, syntactically different tuples that refer to the same real-world entity. The problem is approached from a clustering perspective: given a set of tuples, the objective is to partition them into groups of duplicate tuples. Each newly arrived tuple is assigned to an appropriate cluster via nearest-neighbor classification. This is achieved by means of a suitable hash-based index that maps each tuple to a set of indexing keys and assigns tuples with high syntactic similarity to the same buckets. Hence, the neighbors of a query tuple can be identified efficiently by retrieving only those tuples that appear in the buckets associated with the query tuple, without scanning the entire database. Two alternative schemes for computing indexing keys are discussed and compared. An extensive experimental evaluation on both synthetic and real data shows the effectiveness of our approach.
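
The indexing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes character q-grams as the tuple representation, approximates min-wise independent permutations with seeded hashes, and derives indexing keys by grouping signature values into bands, which may differ from both key-computation schemes compared in the article. All names (`qgrams`, `minhash_signature`, `index_keys`, `IncrementalDeduplicator`) and parameter values are hypothetical.

```python
import hashlib
from collections import defaultdict


def qgrams(text, q=3):
    """Return the set of character q-grams of a case/whitespace-normalized string."""
    s = " ".join(text.lower().split())
    if len(s) < q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=24):
    """Approximate min-wise hashing: keep the minimum hash value per seed."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def index_keys(signature, rows_per_band=2):
    """Assumed key scheme: each band of the signature is one indexing key."""
    return [
        (start, signature[start:start + rows_per_band])
        for start in range(0, len(signature), rows_per_band)
    ]


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


class IncrementalDeduplicator:
    """Assigns each arriving tuple to the cluster of its nearest indexed neighbor."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold        # assumed similarity cut-off
        self.buckets = defaultdict(set)   # indexing key -> ids of stored tuples
        self.grams = []                   # tuple id -> q-gram set
        self.cluster_of = []              # tuple id -> cluster id
        self.next_cluster = 0

    def add(self, text):
        tokens = qgrams(text)
        keys = index_keys(minhash_signature(tokens))
        # Candidate neighbors: only tuples sharing at least one bucket,
        # so the full database is never scanned.
        candidates = set()
        for key in keys:
            candidates |= self.buckets.get(key, set())
        best_id, best_sim = None, 0.0
        for cid in candidates:
            sim = jaccard(tokens, self.grams[cid])
            if sim > best_sim:
                best_id, best_sim = cid, sim
        if best_id is not None and best_sim >= self.threshold:
            cluster = self.cluster_of[best_id]   # join the neighbor's cluster
        else:
            cluster = self.next_cluster          # start a new cluster
            self.next_cluster += 1
        new_id = len(self.grams)
        self.grams.append(tokens)
        self.cluster_of.append(cluster)
        for key in keys:
            self.buckets[key].add(new_id)
        return cluster


dedup = IncrementalDeduplicator(threshold=0.5)
for record in ["John Smith, 12 Main Street",
               "john   smith, 12 main street",
               "Zebra Quux 999"]:
    print(dedup.add(record), record)
```

In this sketch, the first two records normalize to the same q-gram set and therefore share every bucket and the same cluster, while the third shares no q-grams with them and opens a new cluster. Whether records that are similar but not identical meet in a bucket depends on the number of hashes and the band size, which is the trade-off any such index has to tune.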

Keywords

Clustering-mining methods and algorithms; Record classification; Indexing methods and structures; Locality-sensitive hashing; Min-wise independent permutations; Approximated similarity measures; De-duplication

Copyright information

© The Author(s) 2009

Authors and Affiliations

  1. ICAR-CNR, Rende, Italy
