Comparison of String Distance Metrics for Lemmatisation of Named Entities in Polish

  • Jakub Piskorski
  • Marcin Sydow
  • Karol Wieloch
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5603)


This paper presents the results of recent experiments on the application of string distance metrics to the problem of named-entity lemmatisation in Polish. It extends our work in [1] by introducing new results for organisation names. Furthermore, the results presented here and in [2,3], which centre on the same topic, were used in a comparative study of the average usefulness of the numerous examined string distance metrics for the lemmatisation of Polish named entities of various types. In particular, we focus on the lemmatisation of country names, organisation names and person names.
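As a rough illustration of the general idea only, the sketch below maps an inflected name to the closest entry in a list of candidate base forms using plain Levenshtein edit distance [11]. It is not the procedure evaluated in the paper, which compares many metrics. The function names `levenshtein` and `lemmatise` and the candidate lists are hypothetical and chosen purely for this example.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions), two-row DP."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def lemmatise(form: str, base_forms: Sequence[str]) -> str:
    """Return the candidate base form with the smallest edit distance.

    Assumption for this sketch: a list of candidate base forms is
    available; ties go to the first candidate in the list.
    """
    return min(base_forms,
               key=lambda lemma: levenshtein(form.lower(), lemma.lower()))


# Illustrative candidates only, not data from the paper.
countries = ["Polska", "Portugalia", "Holandia"]
print(lemmatise("Polsce", countries))  # locative form of "Polska"
```

Plain edit distance treats every character position alike, whereas Polish inflection mostly alters the end of a word, so a metric that weights a shared prefix more heavily may behave differently on the same input.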


named entities, lemmatisation, string distance metrics, highly inflective languages




  1. Piskorski, J., Sydow, M.: Usability of String Distance Metrics for Name Matching Tasks in Polish. In: Proceedings of LTC 2007, Poznań, Poland (2007)
  2. Piskorski, J., Sydow, M., Kupść, A.: Lemmatization of Polish Person Names. In: Proceedings of the ACL 2007 Workshop on Balto-Slavonic Natural Language Processing 2007 (BSNLP 2007), Prague, Czech Republic (2007)
  3. Piskorski, J., Sydow, M.: String distance metrics for reference matching and search query correction. In: Abramowicz, W. (ed.) BIS 2007. LNCS, vol. 4439, pp. 353–365. Springer, Heidelberg (2007)
  4. Morton, T.: Coreference for NLP Applications. In: Proceedings of ACL (1997)
  5. Cohen, W., Ravikumar, P., Fienberg, S.: A Comparison of String Distance Metrics for Name-Matching Tasks. In: Proceedings of IJCAI-2003 Workshop on Information Integration on the Web (IIWeb-2003), Acapulco, Mexico, pp. 73–78 (2003)
  6. Cohen, W., Ravikumar, P., Fienberg, S.: A Comparison of String Metrics for Matching Names and Records. KDD Work. on Data Cleaning, Object Consolid (2003)
  7. Christen, P.: A Comparison of Personal Name Matching: Techniques and Practical Issues. Technical report, TR-CS-06-02, Computer Science Laboratory, The Australian National University, Canberra, Australia (2006)
  8. Piskorski, J.: Named-entity recognition for polish with sProUT. In: Bolc, L., Michalewicz, Z., Nishida, T. (eds.) IMTCI 2004. LNCS (LNAI), vol. 3490, pp. 122–133. Springer, Heidelberg (2005)
  9. Elmagarmid, A., Ipeirotis, P., Verykios, V.: Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering 19(1) (2007)
  10. Piskorski, J., Wieloch, K., Sydow, M.: On Knowledge-poor Methods for Person Name Matching and Lemmatization for Highly Inflectional Languages. Information Retrieval: Special Issue on non-English Web Search (to appear, 2009)
  11. Levenshtein, V.: Binary Codes for Correcting Deletions, Insertions, and Reversals. Doklady Akademii Nauk SSSR 163(4), 845–848 (1965)
  12. Needleman, S., Wunsch, C.: A General Method Applicable to Search for Similarities in the Amino Acid Sequence of Two Proteins. Molec. Biol. J. 48(3), 443–453 (1970)
  13. Smith, T.F., Waterman, M.S.: Identification of Common Molecular Subsequences. J. Mol. Biol. 147, 195–197 (1981)
  14. Bartolini, I., Ciaccia, P., Patella, M.: String matching with metric trees using an approximate distance. In: Laender, A.H.F., Oliveira, A.L. (eds.) SPIRE 2002. LNCS, vol. 2476, pp. 271–283. Springer, Heidelberg (2002)
  15. Winkler, W.: The State of Record Linkage and Current Research Problems. Technical report, U.S. Bureau of the Census, Washington, DC (1999)
  16. Ukkonen, E.: Approximate String Matching with q-grams and Maximal Matches. Theoretical Computer Science 92(1), 191–211 (1992)
  17. Gravano, L., Ipeirotis, P., Jagadish, H., Koudas, N., Muthukrishnan, S., Pietarinen, L., Srivastava, D.: Using q-grams in a DBMS for Approximate String Processing. IEEE Data Engineering Bulletin 24(4), 28–34 (2001)
  18. Keskustalo, H., Pirkola, A., Visala, K., Leppänen, E., Järvelin, K.: Non-adjacent digrams improve matching of cross-lingual spelling variants. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds.) SPIRE 2003. LNCS, vol. 2857, pp. 252–265. Springer, Heidelberg (2003)
  19. Monge, A., Elkan, C.: The Field Matching Problem: Algorithms and Applications. In: Proceedings of Knowledge Discovery and Data Mining 1996, pp. 267–270 (1996)

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Jakub Piskorski (1)
  • Marcin Sydow (2)
  • Karol Wieloch (3)
  1. Joint Research Centre of the European Commission, Web Mining and Intelligence of IPSC, T.P. 267, Ispra, Italy
  2. Web Mining Lab, Intelligent Systems Dept., Polish-Japanese Institute of Information Technology, Warsaw, Poland
  3. Department of Information Systems, Poznań University of Economics, Poznań, Poland
