Word normalization and decompounding in mono- and bilingual IR

Airio, Eija

doi:10.1007/s10791-006-0884-2

Published: June 2006

Information Retrieval, Volume 9, pages 249–271 (2006)

Abstract

The present research studies the impact of decompounding and two different word normalization methods, stemming and lemmatization, on monolingual and bilingual retrieval. The languages in the monolingual runs are English, Finnish, German and Swedish. The source language of the bilingual runs is English, and the target languages are Finnish, German and Swedish. In the monolingual runs, retrieval in a lemmatized compound index gives almost as good results as retrieval in a decompounded index, but in the bilingual runs differences are found: retrieval in a lemmatized decompounded index performs better than retrieval in a lemmatized compound index. The reason for the poorer performance of indexes without decompounding in bilingual retrieval is the difference between the source language and target languages: phrases are used in English, while compounds are used instead of phrases in Finnish, German and Swedish. No remarkable performance differences could be found between stemming and lemmatization.
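The phrase-versus-compound mismatch described in the abstract can be illustrated with a minimal sketch. This is not the method used in the article: the tiny lexicon, the greedy dictionary-based splitter, the two sample documents and the word-by-word translation of the English phrase "nuclear power" are all assumptions chosen for illustration. The sketch only shows why query words translated one at a time can match a decompounded index but miss a compound index.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicon of German base forms. A real system would rely on
# a morphological analyzer or a far larger dictionary.
LEXICON: Set[str] = {"atom", "kraft", "werk", "energie", "wind"}


def decompound(word: str, lexicon: Set[str]) -> List[str]:
    """Split `word` into lexicon entries; return [word] if no full split exists."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try split points from right to left, so the longest known head wins.
    for i in range(len(word) - 1, 0, -1):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = decompound(tail, lexicon)
            if all(part in lexicon for part in rest):
                return [head] + rest
    return [word]


def build_index(docs: Dict[str, List[str]], split: bool) -> Dict[str, Set[str]]:
    """Build an inverted index, optionally decompounding each token first."""
    index: Dict[str, Set[str]] = {}
    for doc_id, tokens in docs.items():
        for token in tokens:
            keys = decompound(token, LEXICON) if split else [token.lower()]
            for key in keys:
                index.setdefault(key, set()).add(doc_id)
    return index


def search(index: Dict[str, Set[str]], terms: List[str]) -> Set[str]:
    """Return every document that contains at least one query term."""
    hits: Set[str] = set()
    for term in terms:
        hits |= index.get(term, set())
    return hits


# Assumed sample documents, each consisting of a single compound word.
DOCS = {"d1": ["Atomkraftwerk"], "d2": ["Windenergie"]}
compound_index = build_index(DOCS, split=False)
decompounded_index = build_index(DOCS, split=True)

# Assumed word-by-word translation of the English phrase "nuclear power".
query_terms = ["atom", "kraft"]

print(search(compound_index, query_terms))      # no match on whole compounds
print(search(decompounded_index, query_terms))  # matches d1 via its components
```

In the compound index the only keys are the full compounds, so the separately translated words find nothing. Once the documents are split into components, the same query reaches the document about the power plant.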

Author information

Authors and Affiliations

Department of Information Studies, Tampere University, Kanslerinrinne 1, 33014, University of Tampere, Finland
Eija Airio

Corresponding author

Correspondence to Eija Airio.

About this article

Cite this article

Airio, E. Word normalization and decompounding in mono- and bilingual IR. Inf Retrieval 9, 249–271 (2006). https://doi.org/10.1007/s10791-006-0884-2

Received: 11 June 2004
Revised: 03 January 2005
Accepted: 01 February 2005
Issue Date: June 2006
DOI: https://doi.org/10.1007/s10791-006-0884-2
