Information Retrieval, Volume 12, Issue 3, pp 275–299

On knowledge-poor methods for person name matching and lemmatization for highly inflectional languages

Article

Abstract

Web person search is one of the most common activities of Internet users. Recently, a vast amount of work has been reported on applying various NLP techniques to person name disambiguation in large web document collections, with the main focus on English and a few other major languages. This article reports on knowledge-poor methods for tackling person name matching and lemmatization in Polish, a highly inflectional language with a complex person name declension paradigm. These methods mainly apply well-established string distance metrics, some new variants thereof, automatically acquired simple suffix-based lemmatization patterns, and some combinations of these techniques. We also carried out some initial experiments on deploying techniques that utilize the context in which person names appear. Results of numerous experiments are presented. The evaluation, carried out on a data set extracted from a corpus of on-line news articles, revealed that lemmatization accuracy above 90% seems difficult to achieve, whereas combining string distance metrics with suffix-based patterns results in 97.6–99% accuracy for the name matching task. Interestingly, no significant additional gain could be achieved by integrating some basic techniques that try to exploit the local context in which the names appear. Although our explorations focused on Polish, we believe that the work presented in this article constitutes practical guidelines for tackling the same problem in other highly inflectional languages with similar phenomena.
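To make the idea of combining a string distance metric with suffix-based patterns concrete, the following is a minimal sketch and not the authors' implementation. It assumes plain Levenshtein distance as the metric, a fixed similarity threshold of 0.85, and a tiny hand-written list of Polish suffix rewrites (`SUFFIX_PATTERNS`). All of these are illustrative assumptions: the article's patterns are acquired automatically, and the function names used here are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


# Hypothetical, hand-written (suffix, replacement) pairs for illustration only.
# They over-generate (e.g. the bare "a" rule would damage feminine names).
SUFFIX_PATTERNS = [
    ("skiego", "ski"), ("skiemu", "ski"), ("skim", "ski"),
    ("owi", ""), ("em", ""), ("a", ""),
]


def strip_suffix(token: str) -> str:
    """Apply the first matching suffix rewrite, keeping at least a 2-letter stem."""
    for suffix, replacement in SUFFIX_PATTERNS:
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[: -len(suffix)] + replacement
    return token


def normalize(name: str) -> str:
    """Lowercase the name and rewrite the suffix of every token."""
    return " ".join(strip_suffix(t) for t in name.lower().split())


def names_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Match if either the raw or the suffix-normalized forms are similar enough."""
    raw = similarity(a.lower(), b.lower())
    normalized = similarity(normalize(a), normalize(b))
    return max(raw, normalized) >= threshold


print(names_match("Jana Kowalskiego", "Jan Kowalski"))
print(names_match("Jan Kowalski", "Adam Nowak"))
```

In this sketch the genitive form "Jana Kowalskiego" falls below the threshold on raw edit distance alone and is only matched to "Jan Kowalski" after the suffix rewrites are applied, which is the kind of interaction between the two techniques that the abstract describes.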

Keywords

Person name matching; Highly inflectional languages; Lemmatization; String distance metrics

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. Joint Research Centre of the European Commission, Ispra, Italy
  2. Poznań University of Economics, Poznań, Poland
  3. Web Mining Lab, Polish-Japanese Institute of Information Technology, Warszawa, Poland
