Is a Morphologically Complex Language Really that Complex in Full-Text Retrieval?

  • Conference paper
Advances in Natural Language Processing (FinTAL 2006)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4139)

Abstract

In this paper we show that the keyword variation of a morphologically complex language, Finnish, can be handled effectively for IR purposes by generating only the textually most frequent forms of each keyword. In theory, Finnish nouns have about 2,000 different forms, but most of these forms occur rarely. Corpus statistics showed that about 84–88 per cent of the occurrences of inflected noun forms belong to only six of the 14 possible cases. This small number of variant forms per keyword, at most 2*6 = 12, makes it feasible to try them all in a search. The retrieval effect of covering only the frequent keyword forms was tested with three to twelve variant forms per keyword in two test collections, TUTK and the Finnish material of CLEF 2003. The results show that, with nine and twelve variant keyword forms, the frequent keyword form generation method competes well with the gold standard, lemmatization.
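
The idea of searching with only the most frequent inflected forms can be illustrated with a minimal Python sketch. It is not the authors' implementation. The paradigm table, the choice and ranking of the six cases, and the function names `frequent_forms` and `matches` are assumptions made for illustration, and the sketch reads "2*6" as the singular and plural form of each of six cases. A real system would generate the forms with a morphological generator rather than a hand-written lookup table.

```python
from typing import Dict, List

# Hypothetical hand-written paradigm for the Finnish noun "talo" ("house").
# Each case maps to its [singular, plural] form.
FORMS: Dict[str, Dict[str, List[str]]] = {
    "talo": {
        "nominative": ["talo", "talot"],
        "genitive": ["talon", "talojen"],
        "partitive": ["taloa", "taloja"],
        "inessive": ["talossa", "taloissa"],
        "elative": ["talosta", "taloista"],
        "illative": ["taloon", "taloihin"],
    }
}

# Assumed frequency ranking of cases (most frequent first). The abstract
# does not name the six cases, so this ordering is illustrative only.
CASE_RANK: List[str] = [
    "nominative", "genitive", "partitive",
    "inessive", "elative", "illative",
]


def frequent_forms(lemma: str, n_cases: int = 6) -> List[str]:
    """Return singular and plural forms of the n_cases top-ranked cases."""
    paradigm = FORMS[lemma]
    forms: List[str] = []
    for case in CASE_RANK[:n_cases]:
        forms.extend(paradigm[case])
    return forms


def matches(document: str, lemma: str, n_cases: int = 6) -> bool:
    """True if any generated form occurs as a whitespace-separated token."""
    tokens = set(document.lower().split())
    return any(form in tokens for form in frequent_forms(lemma, n_cases))


if __name__ == "__main__":
    print(frequent_forms("talo"))
    print(matches("Hän asuu talossa", "talo"))
    print(matches("Hän asuu talossa", "talo", n_cases=3))
```

Varying `n_cases` mirrors, in a simplified way, the trade-off the abstract describes between the number of generated forms and how much of the keyword's textual variation is covered: with all six cases the inessive form "talossa" is matched, while with only three cases it is missed.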

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kettunen, K., Airio, E. (2006). Is a Morphologically Complex Language Really that Complex in Full-Text Retrieval? In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala, T. (eds) Advances in Natural Language Processing. FinTAL 2006. Lecture Notes in Computer Science, vol 4139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11816508_42

  • DOI: https://doi.org/10.1007/11816508_42

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-37334-6

  • Online ISBN: 978-3-540-37336-0

  • eBook Packages: Computer Science, Computer Science (R0)
