Abstract
We present a new, fully unsupervised algorithm for stemming that requires no human intervention and applies to an open class of languages. Because it relies on no large existing data collections and no linguistic resources other than raw text, it is especially attractive for low-density languages. The stemming problem is formulated as a decision on whether two given words are variants of the same stem, and it requires that, if so, there is a concatenative relation between the two. The underlying theory makes no assumptions about whether the language uses a lot of morphology, whether it is prefixing or suffixing, or whether affixes are long or short. It does, however, assume that (1) salient affixes have to be frequent, (2) words are essentially variable-length sequences of random characters, and (3) a heuristic on what constitutes a systematic affix alteration is valid. Tested on four typologically distant languages, the stemmer shows very promising results in an evaluation against a human-made gold standard.
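To make the problem formulation concrete, the following is a minimal Python sketch of a same-stem decision for a pair of words. It is not the algorithm of the paper: it only illustrates the idea that two words count as variants of one stem when they are related by concatenation and the leftover affixes are frequent in raw text. The function names, the suffix-only restriction, and the thresholds `max_len`, `min_stem`, and `min_count` are all assumptions made for illustration.

```python
from collections import Counter


def suffix_counts(words, max_len=4):
    """Count how many distinct word types end in each candidate suffix.

    Assumption: only suffixes of up to max_len characters are considered,
    and at least one character is always left for the stem.
    """
    counts = Counter()
    for w in set(words):
        for k in range(1, min(max_len, len(w) - 1) + 1):
            counts[w[-k:]] += 1
    return counts


def split_on_common_prefix(a, b):
    """Return (shared_prefix, rest_of_a, rest_of_b)."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i], a[i:], b[i:]


def same_stem(a, b, counts, min_stem=3, min_count=2):
    """Decide whether a and b look like variants of the same stem.

    Hypothetical criterion: the shared prefix is long enough to be a stem,
    and each leftover is either empty (bare stem) or a frequent suffix.
    """
    stem, rest_a, rest_b = split_on_common_prefix(a, b)
    if len(stem) < min_stem:
        return False
    return all(r == "" or counts[r] >= min_count for r in (rest_a, rest_b))


# Toy corpus, invented for this example.
corpus = ["walk", "walked", "walking", "talk", "talked", "talking",
          "jump", "jumped", "jumping", "table"]
counts = suffix_counts(corpus)
print(same_stem("walked", "walking", counts))
print(same_stem("talked", "table", counts))
```

A sketch like this handles only suffixing and uses raw type counts as its notion of frequency; the abstract states that the actual theory makes no assumption about prefixing versus suffixing or about affix length.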
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hammarström, H. (2006). Poor Man’s Stemming: Unsupervised Recognition of Same-Stem Words. In: Ng, H.T., Leong, MK., Kan, MY., Ji, D. (eds) Information Retrieval Technology. AIRS 2006. Lecture Notes in Computer Science, vol 4182. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880592_25
DOI: https://doi.org/10.1007/11880592_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45780-0
Online ISBN: 978-3-540-46237-8
eBook Packages: Computer Science (R0)