Information Retrieval, Volume 12, Issue 3, pp 275–299

On knowledge-poor methods for person name matching and lemmatization for highly inflectional languages

Article

Abstract

Web person search is one of the most common activities of Internet users. Recently, a vast amount of work has been reported on applying various NLP techniques to person name disambiguation in large web document collections, with the main focus on English and a few other major languages. This article reports on knowledge-poor methods for tackling person name matching and lemmatization in Polish, a highly inflectional language with a complex person name declension paradigm. These methods mainly apply well-established string distance metrics, some new variants thereof, automatically acquired simple suffix-based lemmatization patterns, and some combinations of these techniques. Furthermore, we carried out some initial experiments on deploying techniques that utilize the context in which person names appear. Results of numerous experiments are presented. The evaluation, carried out on a data set extracted from a corpus of on-line news articles, revealed that achieving lemmatization accuracy greater than 90% seems to be difficult, whereas combining string distance metrics with suffix-based patterns results in 97.6–99% accuracy for the name matching task. Interestingly, no significant additional gain could be achieved by integrating some basic techniques that try to exploit the local context in which the names appear. Although our explorations focused on Polish, we believe that the work presented in this article constitutes practical guidelines for tackling the same problem in other highly inflectional languages with similar phenomena.
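To illustrate the general idea of combining a string distance metric with suffix-based patterns, the following is a minimal Python sketch under stated assumptions. It is not the method evaluated in the article: the suffix patterns in `SUFFIX_PATTERNS` are hand-picked and hypothetical (the article acquires its patterns automatically), plain Levenshtein distance stands in for the metrics and variants studied, and the function names `strip_suffix`, `similarity`, and `match_name` are invented for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance with unit costs."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical, hand-written patterns (inflected suffix -> base suffix).
# They are assumptions for illustration only and will over-strip some names.
SUFFIX_PATTERNS = [("owi", ""), ("iem", ""), ("em", ""), ("a", "")]


def strip_suffix(token: str) -> str:
    """Apply the first matching suffix pattern, keeping at least 3 characters."""
    for suffix, replacement in SUFFIX_PATTERNS:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)] + replacement
    return token


def similarity(a: str, b: str) -> float:
    """Case-insensitive edit distance normalised to the range [0, 1]."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def match_name(name: str, candidates: list[str]) -> str:
    """Return the candidate base form most similar to an inflected name."""
    stripped = " ".join(strip_suffix(token) for token in name.split())

    def score(candidate: str) -> float:
        # Take the better of the raw and the suffix-stripped comparison.
        return max(similarity(name, candidate), similarity(stripped, candidate))

    return max(candidates, key=score)


if __name__ == "__main__":
    print(match_name("Donalda Tuska", ["Donald Tusk", "Jarosław Kaczyński"]))
```

In this sketch the suffix-stripped form is only an additional candidate for comparison, so a wrongly stripped name can still be matched through the raw string distance.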

Keywords

Person name matching; Highly inflectional languages; Lemmatization; String distance metrics

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. Joint Research Centre of the European Commission, Ispra, Italy
  2. Poznań University of Economics, Poznan, Poland
  3. Web Mining Lab, Polish-Japanese Institute of Information Technology, Warszawa, Poland
