Poor Man’s Stemming: Unsupervised Recognition of Same-Stem Words

  • Harald Hammarström
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4182)


We present a new, fully unsupervised, human-intervention-free stemming algorithm for an open class of languages. Because it relies on no large existing data collections or linguistic resources other than raw text, it is especially attractive for low-density languages. The stemming problem is formulated as a decision on whether two given words are variants of the same stem, and it requires that, if they are, there is a concatenative relation between the two. The underlying theory makes no assumptions about whether the language uses much morphology, whether it is prefixing or suffixing, or whether affixes are long or short. It does, however, assume that (1) salient affixes are frequent, (2) words are essentially variable-length sequences of random characters, and (3) a heuristic on what constitutes a systematic affix alteration is valid. Tested on four typologically distant languages, the stemmer shows very promising results in an evaluation against a human-made gold standard.
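To make the pairwise formulation concrete, the following is a minimal sketch of a same-stem decision. It is not the algorithm of the paper: it handles only suffixes, and it replaces the paper's affix-salience theory with a plain type-frequency threshold. The function names and the parameters `max_len`, `min_count`, and `min_stem` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def affix_counts(words, max_len=4):
    """Count how many distinct word types end with each terminal segment.

    Assumption: a raw type count stands in for affix salience.
    """
    counts = Counter()
    for w in set(words):
        # Leave at least one character for the stem.
        for k in range(1, min(max_len, len(w) - 1) + 1):
            counts[w[-k:]] += 1
    return counts


def split_on_common_prefix(a, b):
    """Return (shared prefix, remainder of a, remainder of b)."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i], a[i:], b[i:]


def same_stem(a, b, counts, min_count=2, min_stem=3):
    """Decide whether a and b look like concatenative variants of one stem.

    Both words must share a candidate stem of at least min_stem characters,
    and each remainder must be empty (the bare stem) or a frequent suffix.
    """
    stem, rest_a, rest_b = split_on_common_prefix(a, b)
    if len(stem) < min_stem:
        return False
    return all(r == "" or counts[r] >= min_count for r in (rest_a, rest_b))


if __name__ == "__main__":
    # Toy vocabulary, invented for this illustration only.
    vocab = ["walk", "walks", "walked", "talk", "talks", "talked",
             "jump", "jumps", "jumped", "table"]
    counts = affix_counts(vocab)
    for pair in [("walks", "walked"), ("walk", "talk"), ("table", "talks")]:
        print(pair, same_stem(*pair, counts))
```

The frequency threshold loosely mirrors assumption (1), that salient affixes are frequent. A prefixing language would need the mirror-image computation on word beginnings, which this sketch omits.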


Unsupervised Learning; Minimum Description Length; Terminal Segment; Curve Drop; Brown Corpus
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Harald Hammarström
    Chalmers University, Gothenburg, Sweden
