Applied Intelligence, Volume 35, Issue 3, pp 359–374

Meta similarity

Abstract

Various string similarity metrics have been employed to determine whether two given strings match. These metrics fall into three classes: (a) edit-distance-based similarities, (b) token-based similarities, and (c) hybrid similarities. Because each class has its own strengths and weaknesses in measuring the similarity between two strings, the metrics in a given class tend to work well only for particular data sets. To address this problem, we propose a novel Meta Similarity that both (i) outperforms the existing similarity metrics and (ii) is the least affected by the choice of data set. Our claim is empirically validated through extensive experimental tests: our proposal improves average recall by up to 20% over the best of the existing similarity metrics, and it is the most stable method, with average recall ranging from 0.95 to 1.0 across all the data sets.
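
To illustrate why metrics from different classes suit different data, the following minimal sketch contrasts one edit-distance-based similarity (normalized Levenshtein) with one token-based similarity (Jaccard over whitespace tokens). The function names, the normalization, and the example strings are assumptions chosen for illustration; they are not taken from the paper, and the sketch does not implement the proposed Meta Similarity.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit-distance-based similarity in [0, 1] (illustrative normalization)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def jaccard_similarity(a: str, b: str) -> float:
    """Token-based similarity: Jaccard overlap of lowercase word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    pairs = [
        ("Jeffrey Ullman", "Jeffery Ullman"),  # character-level typo
        ("Jeffrey Ullman", "Ullman Jeffrey"),  # reordered tokens
    ]
    for s, t in pairs:
        print(s, "|", t, "|",
              round(edit_similarity(s, t), 3),
              round(jaccard_similarity(s, t), 3))
```

In this sketch the typo pair scores high under the edit-distance-based measure but low under the token-based one, while the reordered pair shows the opposite behavior. This is the kind of data-dependent strength and weakness that the abstract attributes to each class.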

Keywords

String similarity, Machine learning, Entity resolution, Linear systems

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. Singapore Management University, Singapore, Singapore
  2. Troy University, Troy, USA