Abstract
This study examines the impact of decompounding and of two word normalization methods, stemming and lemmatization, on monolingual and bilingual retrieval. The monolingual runs cover English, Finnish, German and Swedish. In the bilingual runs, the source language is English and the target languages are Finnish, German and Swedish. In the monolingual runs, retrieval in a lemmatized compound index gives almost as good results as retrieval in a decompounded index, but the bilingual runs show a difference: retrieval in a lemmatized decompounded index performs better than retrieval in a lemmatized compound index. Indexes without decompounding perform worse in bilingual retrieval because the source and target languages differ: English expresses multiword concepts as phrases, whereas Finnish, German and Swedish use compounds instead. No notable performance differences were found between stemming and lemmatization.
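The phrase-versus-compound mismatch described above can be illustrated with a minimal sketch. It is not the system used in the study: the lexicon, the English-to-German dictionary, the example document and the greedy splitting routine are all hypothetical and chosen only for illustration. Tokens are assumed to be lemmas already, and linking elements inside compounds are ignored.

```python
# Toy lexicon of known component lemmas (hypothetical, for illustration only).
LEXICON = {"wind", "energie"}

# Toy English-to-German word-by-word dictionary (hypothetical).
EN_DE = {"wind": "wind", "energy": "energie"}


def decompound(word, lexicon):
    """Split word into lexicon entries if a full split exists, else return [word]."""
    word = word.lower()
    if word in lexicon:
        return [word]
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = decompound(tail, lexicon)
            # Accept the split only if every remaining part is a known component.
            if all(part in lexicon for part in rest):
                return [head] + rest
    return [word]


def index_terms(tokens, split_compounds):
    """Return the set of index terms, with or without compound splitting."""
    terms = set()
    for token in tokens:
        if split_compounds:
            terms.update(decompound(token, LEXICON))
        else:
            terms.add(token.lower())
    return terms


def translate(query):
    """Translate an English phrase word by word, dropping unknown words."""
    return [EN_DE[w] for w in query.lower().split() if w in EN_DE]


doc = ["Windenergie"]  # the target-language document contains a compound
compound_index = index_terms(doc, split_compounds=False)
decompounded_index = index_terms(doc, split_compounds=True)

query_terms = translate("wind energy")  # the English source query is a phrase
matches_compound = [t for t in query_terms if t in compound_index]
matches_decompounded = [t for t in query_terms if t in decompounded_index]

print(matches_compound)
print(matches_decompounded)
```

In this toy setting the translated query terms match only the decompounded index, because the compound index stores the document word as a single unsplit term.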
Cite this article
Airio, E. Word normalization and decompounding in mono- and bilingual IR. Inf Retrieval 9, 249–271 (2006). https://doi.org/10.1007/s10791-006-0884-2
DOI: https://doi.org/10.1007/s10791-006-0884-2