Knowledge and Information Systems

, Volume 27, Issue 1, pp 1–21 | Cite as

Approximate data instance matching: a survey

  • Carina Friedrich Dorneles
  • Rodrigo Gonçalves
  • Ronaldo dos Santos Mello
Survey Paper

Abstract

Approximate data matching is a central problem in several data management processes, such as data integration, data cleaning, approximate queries, similarity search and so on. An approximate matching process aims at defining whether two data represent the same real-world object. For atomic values (strings, dates, etc), similarity functions have been defined for several value domains (person names, addresses, and so on). For matching aggregated values, such as relational tuples and XML trees, approaches alternate from the definition of simple functions that combine values of similarity of record attributes to sophisticated techniques based on machine learning, for example. For complex data comparison, including structured and semistructured documents, existing approaches use both structure and data for the comparison, by either considering or not considering data semantics. This survey presents terminology and concepts that base approximated data matching, as well as discusses related work on the use of similarity functions in such a subject.

Keywords

Instance data matching Similarity function Similarity matching Record linkage Record matching Duplicate detection Object matching Entity resolution 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Agrawal R, Faloutsos C, Swami AN (1993) Efficient similarity search in sequence databases. In: Proceedings of the 4th international conference on foundations of data organization and algorithms (FODO). Springer, London, pp 69–84Google Scholar
  2. 2.
    Al-Khalifa S, Yu C, Jagadish HV (2003) Querying structured text in an xml database. In: Proceedings of the 29th ACM SIGMOD international conference on management of data (SIGMOD). ACM, New York, pp 4–15Google Scholar
  3. 3.
    Arasu A, Chaudhuri S, Kaushik R (2008) Transformation-based framework for record matching. In: Proceedings of the 2008 IEEE 24th international conference on data engineering (ICDE). IEEE Computer Society, Washington, pp 40–49Google Scholar
  4. 4.
    Arasu A, Ganti V, Kaushik R (2006) Efficient exact set-similarity joins. In: Proceedings of the 32nd international conference on very large data bases (VLDB). VLDB Endowment, pp 918–929Google Scholar
  5. 5.
    Aygun RS (2008) S2s: structural-to-syntactic matching similar documents. Knowl Inf Syst 16(3): 303–329CrossRefGoogle Scholar
  6. 6.
    Barioni MC, Razente HL, Traina C Jr, Traina AJM (2005) Querying complex objects by similarity in sql. In: Proceedings of the 20th Brazilian symposium on databases (SBBD), pp 130–144Google Scholar
  7. 7.
    Bhattacharya I, Getoor L (2005) Relational clustering for multi-type entity resolution. In: Proceedings of the 4th international workshop on multi-relational mining (MRDM). ACM, New York, pp 3–12Google Scholar
  8. 8.
    Bilenko M, Mooney R, Cohen W, Ravikumar P, Fienberg S (2003) Adaptive name matching in information integration. IEEE Intell Syst 18(5): 16–23CrossRefGoogle Scholar
  9. 9.
    Bilenko M, Mooney RJ (2003) Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining (KDD). ACM, New York, pp 39–48Google Scholar
  10. 10.
    Bollacker K, Lawrence S, Giles CL (1998) CiteSeer: an autonomous web agent for automatic retrieval and identification of interesting publications. In: Proceedings of the 2nd international conference on autonomous agents. ACM Press, New York, pp 116–123Google Scholar
  11. 11.
    Boser BE, Guyon IM, Vapnik V (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on computational learning theory (COLT). ACM, New York, pp 144–152Google Scholar
  12. 12.
    Broder A (1997) On the resemblance and containment of documents. In: Proceedings of the compression and complexity of sequences (SEQUENCES). IEEE Computer Society, Washington, p 21Google Scholar
  13. 13.
    Bruno N, Chaudhuri S, Gravano L (2002) Top-k selection queries over relational databases: mapping strategies and performance evaluation. ACM Trans Database Syst 27(2): 153–187CrossRefGoogle Scholar
  14. 14.
    Bryan B, Schneider J, Nichol R, Miller C, Genovese C, Wasserman L (2006) Active learning for identifying function threshold boundaries. Advances in neural information processing systems (NIPS). MIT PressGoogle Scholar
  15. 15.
    Bueno R, Traina AJM, Traina C Jr (2005) Accelerating approximate similarity queries using genetic algorithms. In: Proceedings of the 2005 ACM symposium on applied computing (SAC), Santa Fe, New Mexico, USA, pp 617–622Google Scholar
  16. 16.
    Buttler D (2004) A short survey of document structure similarity algorithms. In: Proceedings of the international conference on internet computing, Las Vegas, Nevada, USA, pp 3–9Google Scholar
  17. 17.
    Carman MJ, Knoblock CA (2005) Inducing source descriptions for automated web service composition. In: Proceedings of the AAAI workshop on exploring planning and scheduling for web services, grid, and autonomic computing, Pittsburgh, PennsylvaniaGoogle Scholar
  18. 18.
    Carvalho JCP, da Silva AS (2003) Finding similar identities among objects from multiple web sources. In: Proceedings of the 5th ACM international workshop on web information and data management (WIDM). ACM, New York, pp 90–93Google Scholar
  19. 19.
    Chapman S (2004) SimMetrics: a Java & C # .NET library of similarity metrics. http://sourceforge.net/projects/simmetrics/. Accessed 13 March 2009
  20. 20.
    Chaudhuri S, Chen B-C, Ganti V, Kaushik R (2007) Example-driven design of efficient record matching queries. In: Proceedings of the 33rd international conference on very large data bases (VLDB), VLDB Endowment, pp 327–338Google Scholar
  21. 21.
    Chaudhuri S, Ganjam K, Ganti V, Motwani R (2003) Robust and efficient fuzzy match for online data cleaning. In: Proceedings of the 29th ACM SIGMOD international conference on management of data (SIGMOD). ACM, New York, pp 313–324Google Scholar
  22. 22.
    Chaudhuri S, Ganti V, Kaushik R (2006) A primitive operator for similarity joins in data cleaning. In: Proceedings of the 22nd international conference on data engineering (ICDE). IEEE Computer Society, Washington, p 5Google Scholar
  23. 23.
    Chaudhuri S, Ganti V, Motwani R (2005) Robust identification of fuzzy duplicates. In: Proceedings of the 21st international conference on data engineering (ICDE). IEEE Computer Society, Washington, pp 865–876Google Scholar
  24. 24.
    Chaudhuri S, Gravano L (1999) Evaluating top-k selection queries. In: Proceedings of the 25th international conference on very large data bases (VLDB). Morgan Kaufmann Publishers Inc., San Francisco, pp 397–410Google Scholar
  25. 25.
    Chawathe SS, Garcia-Molina H (1997) Meaningful change detection in structured data. In: Proceedings of the 1997 ACM SIGMOD international conference on management of data (SIGMOD). ACM, New York, pp 26–37Google Scholar
  26. 26.
    Cheng T, Chang KC-C (2007) Entity search engine: towards agile best-effort information integration over the web. In: Third biennial conference on innovative data systems research (CIDR), Asilomar, CA, USA, pp 108–113Google Scholar
  27. 27.
    Chuan Xiao, Wei Wang XL (2008) Ed-join: an efficient algorithm for similarity joins with edit distance constraints. In: Proceedings of the VLDB Endow. VLDB Endowment, pp 933–944Google Scholar
  28. 28.
    Cohen WW (1998a) Integration of heterogeneous databases without common domains using queries based on textual similarity, pp 201–212Google Scholar
  29. 29.
    Cohen WW (1998b) Providing database-like access to the web using queries based on textual similarity. SIGMOD Rec 27(2): 558–560CrossRefGoogle Scholar
  30. 30.
    Cohen WW (1999) Recognizing structure in web pages using similarity queries. In: Proceedings of the sixteenth national conference on artificial intelligence and the eleventh innovative applications of artificial intelligence conference innovative applications of artificial intelligence (AAAI/IAAI). American Association for Artificial Intelligence, Menlo Park, pp 59–66Google Scholar
  31. 31.
    Cohen WW (2000) Whirl: a word-based information representation language. Artif Intell 118(1–2): 163–196CrossRefMATHGoogle Scholar
  32. 32.
    Cohen WW, Ravikumar P, Fienberg S (2003a) Secondstring: open source java-based package of approximate string-matching. http://secondstring.sourceforge.net/. Accessed 20 Dec 2005
  33. 33.
    Cohen WW, Ravikumar P, Fienberg SE (2003b) A comparison of string distance metrics for name-matching tasks. In: Proceedings of workshop on information integration on the Web, IIWeb, pp 73–78Google Scholar
  34. 34.
    Cohen WW, Richman J (2002) Learning to match and cluster large high-dimensional data sets for data integration. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining (KDD). ACM, New York, pp 475–480Google Scholar
  35. 35.
    Deerwester SC, Dumais ST, Landauer TK, Furnas GW, Harshman RA (1990) Indexing by latent semantic analysis. JASIS 41(6): 391–407CrossRefGoogle Scholar
  36. 36.
    Dey D, Sarkar S (1996) A probabilistic relational model and algebra. ACM Trans Database Syst 21(3): 339–369CrossRefGoogle Scholar
  37. 37.
    Diaconis P, Graham R (1997) Spearman’s footrule as a measure of disarray. J R Stat Soc 39(2): 262–268MathSciNetGoogle Scholar
  38. 38.
    Doan A, Lu Y, Lee Y, Han J (2003) Profile-based object matching for information integration. IEEE Intell Syst 18(5): 54–59CrossRefGoogle Scholar
  39. 39.
    Dorneles CF, Heuser CA, da Silva AS, de Moura ES (2009) A generic strategy for combining similarity metrics in approximate matching. Inf Syst (Oxford) 34: 673–689Google Scholar
  40. 40.
    Dorneles CF, Heuser CA, Lima AEN, da Silva AS, de Moura ES (2004) Measuring similarity between collection of values. In: Proceedings of the 6th annual ACM international workshop on web information and data management (WIDM). ACM, New York, pp 56–63Google Scholar
  41. 41.
    Dulucq S, Touzet H (2003) Analysis of tree edit distance algorithms. In: Proceedings of the 14th annual symposium combinatorial pattern matching (CPM), pp 83–95Google Scholar
  42. 42.
    Elmagarmid AK, Ipeirotis PG, Verykios VS (2007) Duplicate record detection: a survey. IEEE Trans Knowl Data Eng 19: 1–16CrossRefGoogle Scholar
  43. 43.
    Flesca S, Manco G, Masciari E, Pontieri L, Pugliese A (2005) Fast detection of XML structural similarity. IEEE Trans Knowl Data Eng 17(2): 160–175CrossRefGoogle Scholar
  44. 44.
    Galhardas H, Florescu D, Shasha D, Simon E (2000) An extensible framework for data cleaning. In: Proceedings of the 16th international conference on data engineering (ICDE). IEEE Computer Society, Washington, p 312Google Scholar
  45. 45.
    Galhardas H, Florescu D, Shasha D, Simon E, Saita C-A (2001) Declarative data cleaning: language, model, and algorithms. In: Proceedings of the 27th international conference on very large data bases (VLDB). Morgan Kaufmann Publishers Inc., San Francisco, pp 371–380Google Scholar
  46. 46.
    Gao L, Wang M, Wang XS, Padmanabhan S (2004) Expressing and optimizing similarity-based queries in sql. In: 23rd international conference on conceptual modeling (ER), Shanghai, China, pp 464–478Google Scholar
  47. 47.
    Getoor, L, Taskar, B (eds) (2007) Introduction to statistical relational learning. MIT Press, CambridgeMATHGoogle Scholar
  48. 48.
    Giles CL, Bollacker K, Lawrence S (1998) CiteSeer: an automatic citation indexing system. In: Proceedings of the third ACM conference on digital libraries (DL). ACM, New York, pp 89–98Google Scholar
  49. 49.
    Gravano L, Ipeirotis PG, Jagadish HV, Koudas N, Muthukrishnan S, Srivastava D (2001) Approximate string joins in a database (almost) for free. In: Proceedings of the 27th international conference on very large data bases (VLDB). Morgan Kaufmann Publishers Inc., San Francisco, pp 491–500Google Scholar
  50. 50.
    Gravano L, Ipeirotis PG, Koudas N, Srivastava D (2003) Text joins in an rdbms for web data integration. In: Proceedings of the 12th international conference on World Wide Web (WWW). ACM, New York, pp 90–101Google Scholar
  51. 51.
    Guha S, Koudas N, Marathe A, Srivastava D (2004) Merging the results of approximate match operations. In: Proceedings of the thirtieth international conference on very large data bases (VLDB). VLDB Endowment, pp 636–647Google Scholar
  52. 52.
    Hernandez MA, Stolfo SJ (1995) The merge/purge problem for large databases. SIGMOD Rec 24(2): 127–138CrossRefGoogle Scholar
  53. 53.
    Hernández MA, Stolfo SJ (1998) Real-world data is dirty: data cleansing and the merge/purge problem. Data Min Knowl Discov 2(1): 9–37CrossRefGoogle Scholar
  54. 54.
    Huffman S, Steier D (1995a) Heuristic joins to integrate structured heterogeneous data. In: Working notes of the AAAI spring symposium on information gathering from heterogeneous, distributed environment, pp 74–77Google Scholar
  55. 55.
    Huffman SB, Steier D (1995b) A navigation assistant for data source selection and integration, Working papers, Price WaterhouseGoogle Scholar
  56. 56.
    Ilyas F, Aref G, Elmagarmid K (2004) Supporting top-k join queries in relational databases. VLDB J 13(3): 754–765CrossRefGoogle Scholar
  57. 57.
    Ilyas IF, Aref WG (2005) Rank-aware query processing and optimization. In: Proceedings of the 21st international conference on data engineering (ICDE). IEEE Computer Society, Washington, p 1144Google Scholar
  58. 58.
    Ilyas IF, Aref WG, Elmagarmid AK (2003) Supporting top-k join queries in relational databases. In: VLDBGoogle Scholar
  59. 59.
    Jin L, Li C, Mehrotra S (2002) Efficient similarity string joins in large data sets, technical report, University of California, Irvine. http://www.ics.uci.edu/chenli/pub/strjoin.ps
  60. 60.
    Kailing K, Kriegel H-P, Schönauer S, Seidl T (2004) Efficient similarity search for hierarchical data in large databases. In: 9th international conference on extending database technology (EDBT), Heraklion, Crete, Greece, pp 676–693Google Scholar
  61. 61.
    Knoblock CA, Ambite JL, Thakkar S (2003) A view integration approach to dynamic composition of web services. In: Proceedings of ICAPS workshop on planning for web servicesGoogle Scholar
  62. 62.
    Koudas N, Marathe A, Srivastava D (2004) Flexible string matching against large databases in practice. In: Proceedings of the thirtieth international conference on very large data bases (VLDB), VLDB Endowment, pp 1078–1086Google Scholar
  63. 63.
    Laender AHF, Gonalves MA, Roberto PA (2004) Bdbcomp: building a digital library for the brazilian computer science community. In: ACM/IEEE joint conference on digital libraries (JCDL), Tuscon, AZ, USA, pp 23–24Google Scholar
  64. 64.
    Lee L (2001) On the effectiveness of the skew divergence for statistical language analysis. Artif Intell Stat, pp 65–72Google Scholar
  65. 65.
    Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions and reversals. Soviet Phys Doklady, vol 10Google Scholar
  66. 66.
    Luis Leit a, Calado P, Weis M (2007) Structure-based inference of xml similarity for fuzzy duplicate detection. In: Proceedings of the sixteenth ACM conference on information and knowledge management (CIKM). ACM, New York, pp 293–302Google Scholar
  67. 67.
    Ma Y, Chbeir R (2005) Content and structure based approach for xml similarity. In: Proceedings of the fifth international conference on computer and information technology (CIT). IEEE Computer Society, Washington, pp 136–140Google Scholar
  68. 68.
    Melnik S, Garcia-Molina H, Rahm E (2002) Similarity flooding: a versatile graph matching algorithm and its application to schema matching. In: Proceedings of the 18th international conference on data engineering (ICDE). IEEE Computer Society, Washington, pp 117–128Google Scholar
  69. 69.
    Michalowski M, Ambite JL, Knoblock CA, Minton S, Thakkar S, Tuchinda R (2004) Retrieving and semantically integrating heterogeneous data from the web. IEEE Intell Syst 19(3)Google Scholar
  70. 70.
    Michalowski M, Thakkar S, Knoblock CA (2005) Automatically utilizing secondary sources to align information across data sources. AI Mag 26(1): 72–79Google Scholar
  71. 71.
    Milano D, Scannapieco M, Catarci T (2006) Structure aware xml object identification. In: Proceedings of the first Int’l VLDB workshop on clean databases (CleanDB). Seoul, KoreaGoogle Scholar
  72. 72.
    Mitchell TM (1997) Machine learning. McGraw-Hill, New YorkMATHGoogle Scholar
  73. 73.
    Motro A (1988) Vague: a user interface to relational databases that permits vague queries. ACM Trans Off Inf Syst 6(3): 187–214CrossRefGoogle Scholar
  74. 74.
    Navarro G (2001) A guided tour of approximate string matching. ACM Comput Surv 33(1): 31–88CrossRefGoogle Scholar
  75. 75.
    Nayak R (2008) Fast and effective clustering of xml data using structural information. Knowl Inf Syst 14(2): 197–215CrossRefMathSciNetGoogle Scholar
  76. 76.
    Nierman A, Jagadish HV (2002) Evaluating structural similarity in XML documents. In: 5th international workshop on the web and databases (WebDB), Madison, Wisconsin, USA, pp 61–66Google Scholar
  77. 77.
    Park U, Seo Y (2005) An implementation of XML documents search system based on similarity in structure and semantics. In: International workshop on challenges in web information retrieval and integrationGoogle Scholar
  78. 78.
    Puhlmann S, Weis M, Naumann F (2006) XML duplicate detection using sorted neighborhoods. In: Proceedings of the 10th international conference on extending database technology (EDBT), pp 773–791Google Scholar
  79. 79.
    Ragnemalm I (1993) The euclidean distance transform, Ph.D thesis, Department of Electrical Engineering, Linkpping UniversityGoogle Scholar
  80. 80.
    Salton G, McGill M (1984) Introduction to modern information retrieval. McGraw-Hill, New YorkGoogle Scholar
  81. 81.
    Schallehn E, Sattler K-U, Saake G (2001) Advanced grouping and aggregation for data integration. In: CIKM ’01 proceedings of the tenth international conference on information and knowledge management. ACM, New York, pp 547–549Google Scholar
  82. 82.
    Schallehn E, Sattler K-U, Saake G (2004) Efficient similarity-based operations for data integration. Data Knowl Eng 48(3): 361–387CrossRefGoogle Scholar
  83. 83.
    Shatkay H, Zdonik SB (1996) Approximate queries and representations for large data sequences. In: Proceedings of the twelfth international conference on data engineering (ICDE). IEEE Computer Society, Washington, pp 536–545Google Scholar
  84. 84.
    Shen W, Li X, Doan A (2005) Constraint-based entity matching. In: Proceedings of the 20th national conference on artificial intelligence (AAAI). AAAI Press, pp 862–867Google Scholar
  85. 85.
    Stasiu RK, Heuser CA, Silva R (2005) Estimating recall and precision for vague queries in databases. In: Proceedings of the 17th international conference advanced information systems engineering (CAISE), pp 187–200Google Scholar
  86. 86.
    Takasu A, Fukagawa D, Akutsu T (2007) Statistical learning algorithm for tree similarity. In: Proceedings of the seventh IEEE international conference on data mining (ICDM). IEEE Computer Society, Washington, pp 667–672Google Scholar
  87. 87.
    Tejada S, Knoblock CA, Minton S (2001) Learning object identification rules for information integration. Inf Syst 26(8)Google Scholar
  88. 88.
    Tejada S, Knoblock CA, Minton S (2002) Learning domain-independent string transformation weights for high accuracy object identification, pp 350–359Google Scholar
  89. 89.
    Thor A, Rahm E (2007) Moma—a mapping-based object matching system. In: ‘Third biennial conference on innovative data systems research (CIDR). Asilomar, CA, USA, pp 247–258Google Scholar
  90. 90.
    Tran T, Nayak R, Bruza P (2008) Combining structure and content similarities for xml document clustering. In: Proceedings of the 7th Australasian data mining conference (AusDM), vol 87. ACS, Glenelg, pp 219–226Google Scholar
  91. 91.
    Vinson AR, Heuser CA, Silva AS, de Moura ES (2007) An approach to xml path matching. In: Proceedings of the 9th annual ACM international workshop on web information and data management (WIDM). ACM, New York, pp 17–24Google Scholar
  92. 92.
    Wan X (2008) Beyond topical similarity: a structural similarity measure for retrieving highly similar documents. Knowl Inf Syst 15(1): 55–73CrossRefGoogle Scholar
  93. 93.
    Wang T-Y, Ré C, Suciu D (2008) Implementing not exists predicates over a probabilistic database. In: Proceedings of the international workshop on quality in databases and management of uncertain data, pp 73–86Google Scholar
  94. 94.
    Wang W, Xiao C, Lin X, Zhang C (2009) Efficient approximate entity extraction with edit distance constraints. In: Proceedings of the 35th SIGMOD international conference on management of data (SIGMOD). ACM, New York, pp 759–770Google Scholar
  95. 95.
    Weis M, Naumann F (2004) Detecting duplicate objects in xml documents. In: Proceedings of the 2004 international workshop on information quality in information systems (IQIS). ACM, New York, pp 10–19Google Scholar
  96. 96.
    Weis M, Naumann F (2005) Dogmatix tracks down duplicates in xml. In: Proceedings of the 2005 ACM SIGMOD international conference on management of data (SIGMOD). ACM, New York, pp 431–442Google Scholar
  97. 97.
    Weis M, Naumann F (2006) Detecting duplicates in complex xml data. In: Proceedings of the 22nd international conference on data engineering (ICDE). IEEE Computer Society, Washington, p 109Google Scholar
  98. 98.
    Widom J (2008) Managing and mining uncertain data. In: Trio: a system for data, uncertainty, and lineage. Springer, BerlinGoogle Scholar
  99. 99.
    Xiao C, Wang W, Lin X, Shang H (2009) Top-k set similarity joins. In: Proceedings of the IEEE international conference on data engineering (ICDE). IEEE Computer Society, Washington, pp 916–927Google Scholar
  100. 100.
    Xiao C, Wang W, Lin X, Yu JX (2008) Efficient similarity joins for near duplicate detection. In: Proceeding of the 17th international conference on World Wide Web (WWW). ACM, New York, pp 131–140Google Scholar
  101. 101.
    Yang J, Cheung WK, Chen X (2005) Integrating element and term semantics for similarity-based xml document clustering. In: Proceedings of the 2005 IEEE/WIC/ACM international conference on web intelligence (WI). IEEE Computer Society, Washington, pp 222–228Google Scholar
  102. 102.
    Yongming G, Dehua C, Jiajin L (2008) Clustering xml documents by combining content and structure. In: Proceedings of the international symposium on information science and engieering. IEEE Computer Society, Los Alamitos, pp 583–587Google Scholar
  103. 103.
    Yu CT, Philip G, Meng W (2003) Distributed top-n query processing with possibly uncooperative local systems. In: Proceedings of the 29th international conference on very large data bases (VLDB). VLDB Endowment, pp 117–128Google Scholar
  104. 104.
    Zhao H (2008) Instance weighting versus threshold adjusting for cost-sensitive classification. Knowl Inf Syst 15(3): 321–334CrossRefGoogle Scholar
  105. 105.
    Zhao H, Ram S (2005) Entity identification for heterogeneous database integration: a multiple classifier system approach and empirical evaluation. Inf Syst 30(2): 119–132CrossRefGoogle Scholar

Copyright information

© Springer-Verlag London Limited 2010

Authors and Affiliations

  • Carina Friedrich Dorneles
    • 1
  • Rodrigo Gonçalves
    • 1
  • Ronaldo dos Santos Mello
    • 1
  1. 1.Centro Tecnologico (CTC), Depto. Informatica e Estatistica (INE)Universidade Federal de Santa Catarina (UFSC)FlorianopolisBrazil

Personalised recommendations