String Distance Metrics for Reference Matching and Search Query Correction

  • Jakub Piskorski
  • Marcin Sydow
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4439)


String distance metrics are widely used in applications that process textual data. This paper explores their usability for the reference matching task and for the automatic correction of misspelled search engine queries in highly inflective languages, with a particular focus on Polish. The results of numerous experiments in different scenarios are presented; they reveal some preferred metrics. Surprisingly good results were observed for correcting misspelled search engine queries. Nevertheless, a more in-depth analysis is necessary to achieve improvements. The work reported here constitutes a good point of departure for further research on this topic.
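
As a rough illustration of how a string distance metric can drive query correction, the sketch below computes the classic Levenshtein edit distance and picks the closest entry from a list of known terms. It is not the authors' experimental setup: the function names `levenshtein` and `correct_query`, the sample vocabulary, and the choice of plain edit distance are assumptions made for this example only.

```python
from __future__ import annotations


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    # Keep the shorter string in the inner loop to reduce memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca -> cb (or match)
            ))
        previous = current
    return previous[-1]


def correct_query(query: str, vocabulary: list[str]) -> str:
    """Return the vocabulary term closest to the query.

    Hypothetical helper: ties are resolved in favour of the earlier term.
    """
    return min(vocabulary, key=lambda term: levenshtein(query, term))


if __name__ == "__main__":
    # Illustrative vocabulary of Polish city names (an assumption).
    terms = ["warszawa", "kraków", "wrocław"]
    print(correct_query("warszwa", terms))
```

A linear scan over the vocabulary is used here only for brevity; a real search engine would need candidate filtering or indexing before computing distances.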


string distance metrics, reference matching, search engine query correction, information retrieval, inflective languages





Copyright information

© Springer Berlin Heidelberg 2007

Authors and Affiliations

  • Jakub Piskorski (1)
  • Marcin Sydow (2)
  1. Joint Research Center of the European Commission, Web and Language Technology Group of IPSC, T.P. 267, Via Fermi 1, 21020 Ispra (VA), Italy
  2. Polish-Japanese Institute of Information Technology (PJIIT), Department of Intelligent Systems, Koszykowa 86, 02-008 Warsaw, Poland
