
Is a Morphologically Complex Language Really that Complex in Full-Text Retrieval?

  • Kimmo Kettunen
  • Eija Airio
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4139)

Abstract

In this paper we show that the keyword variation of a morphologically complex language, Finnish, can be handled effectively for IR purposes by generating only the textually most frequent forms of each keyword. In theory a Finnish noun has about 2,000 different forms, but most of these forms occur rarely. Corpus statistics showed that about 84–88 per cent of the occurrences of inflected noun forms belong to only six of the 14 possible cases. The resulting number of variant forms per keyword, at most 2 × 6 = 12, is small enough to make it feasible to try them all in a search. The retrieval effect of covering only the frequent keyword forms was tested with three to twelve variant forms per keyword in two test collections, TUTK and the Finnish material of CLEF 2003. The results show that with nine and twelve variant forms the frequent keyword form generation method competes well with the gold standard, lemmatization.
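The idea of searching with only the frequent forms of a keyword can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It rests on several assumptions: the abstract does not name the six frequent cases, so the six listed in ASSUMED_CASES are a guess; the factor 2 in "2 × 6" is read here as singular and plural; the paradigm of the noun "talo" (house) is typed in by hand instead of being produced by a morphological generator; and the names frequent_forms and or_group, as well as the generic Boolean OR syntax, are hypothetical and do not reflect the retrieval system used in the paper.

```python
from typing import Dict, List, Sequence, Tuple

# (case, number) -> inflected form
Paradigm = Dict[Tuple[str, str], str]

# Hand-written forms of the Finnish noun "talo" (house).
# A real system would obtain these from a morphological generator.
TALO: Paradigm = {
    ("nominative", "sg"): "talo",    ("nominative", "pl"): "talot",
    ("genitive", "sg"): "talon",     ("genitive", "pl"): "talojen",
    ("partitive", "sg"): "taloa",    ("partitive", "pl"): "taloja",
    ("inessive", "sg"): "talossa",   ("inessive", "pl"): "taloissa",
    ("elative", "sg"): "talosta",    ("elative", "pl"): "taloista",
    ("illative", "sg"): "taloon",    ("illative", "pl"): "taloihin",
}

# ASSUMPTION: the abstract only says "six cases out of the 14 possible";
# this particular selection is illustrative.
ASSUMED_CASES = (
    "nominative", "genitive", "partitive",
    "inessive", "elative", "illative",
)


def frequent_forms(
    paradigm: Paradigm,
    cases: Sequence[str] = ASSUMED_CASES,
    numbers: Sequence[str] = ("sg", "pl"),
) -> List[str]:
    """Collect the forms for the chosen cases and numbers, without duplicates."""
    forms: List[str] = []
    for case in cases:
        for number in numbers:
            form = paradigm.get((case, number))
            if form is not None and form not in forms:
                forms.append(form)
    return forms


def or_group(forms: Sequence[str]) -> str:
    """Join variant forms into one disjunctive query group (generic syntax)."""
    return "(" + " OR ".join(forms) + ")"
```

Calling frequent_forms(TALO) returns all twelve variants, and or_group turns them into a single query group, so a document matches whichever of the frequent forms it contains. Passing a shorter cases or numbers argument yields a smaller variant set, in the spirit of the runs with fewer than twelve forms.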

Keywords

Average Precision; Word Form; Case Form; Complex Language; Inflectional Language
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Kimmo Kettunen (1)
  • Eija Airio (1)
  1. Department of Information Studies, University of Tampere, Finland
