Information Retrieval, Volume 9, Issue 3, pp 249–271

Word normalization and decompounding in mono- and bilingual IR

  • Eija Airio
Article

Abstract

The present research studies the impact of decompounding and two different word normalization methods, stemming and lemmatization, on monolingual and bilingual retrieval. The languages in the monolingual runs are English, Finnish, German and Swedish. The source language of the bilingual runs is English, and the target languages are Finnish, German and Swedish. In the monolingual runs, retrieval in a lemmatized compound index gives almost as good results as retrieval in a decompounded index, but in the bilingual runs differences are found: retrieval in a lemmatized decompounded index performs better than retrieval in a lemmatized compound index. The reason for the poorer performance of indexes without decompounding in bilingual retrieval is the difference between the source language and target languages: phrases are used in English, while compounds are used instead of phrases in Finnish, German and Swedish. No remarkable performance differences could be found between stemming and lemmatization.
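
The contrast between English phrases and target-language compounds can be illustrated with a toy sketch. It is not the system or the tools used in the study: the tiny German lexicon, the function names `split_compound` and `index_terms`, and the hard-coded word-by-word translation of the query "world war" are all assumptions made for illustration only.

```python
# Hypothetical component lexicon; a real decompounder would use a full
# morphological lexicon for the target language.
LEXICON = {"welt", "krieg"}


def split_compound(word, lexicon, min_len=3):
    """Split `word` into lexicon entries, preferring the longest head.

    Returns [word] unchanged when no complete split is found.
    """
    word = word.lower()

    def parts(s):
        if s in lexicon:
            return [s]
        # Try split points from the longest head down to min_len characters.
        for i in range(len(s) - min_len, min_len - 1, -1):
            head, tail = s[:i], s[i:]
            if head in lexicon:
                rest = parts(tail)
                if rest is not None:
                    return [head] + rest
        return None

    return parts(word) or [word]


def index_terms(tokens, lexicon, decompound):
    """Build the set of index terms for one document.

    With decompound=True the compound components are indexed alongside
    the full word form.
    """
    terms = set()
    for tok in tokens:
        tok = tok.lower()
        terms.add(tok)
        if decompound:
            terms.update(split_compound(tok, lexicon))
    return terms


# Assumed word-by-word translation of the English phrase "world war".
translated_query = {"welt", "krieg"}
document = ["Der", "Weltkrieg"]

compound_index = index_terms(document, LEXICON, decompound=False)
decompounded_index = index_terms(document, LEXICON, decompound=True)

print(translated_query & compound_index)      # no overlap with the compound index
print(translated_query & decompounded_index)  # both components match
```

In this sketch the translated phrase words find nothing in the compound index, because the document only contains the compound as a single term, whereas the decompounded index exposes both components. This mirrors the explanation given above for the bilingual runs, under the stated simplifications.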

Keywords

Monolingual information retrieval, bilingual information retrieval, lemmatization, stemming, decompounding



Copyright information

© Springer Science + Business Media, LLC 2006

Authors and Affiliations

  1. Department of Information Studies, University of Tampere, Finland
