Poor Man’s Stemming: Unsupervised Recognition of Same-Stem Words

  • Conference paper
Information Retrieval Technology (AIRS 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4182)

Abstract

We present a new, fully unsupervised stemming algorithm that requires no human intervention and applies to an open class of languages. Because it relies on no large existing data collections and on no linguistic resources other than raw text, it is especially attractive for low-density languages. The stemming problem is formulated as a decision on whether two given words are variants of the same stem, with the requirement that, if they are, there is a concatenative relation between the two. The underlying theory makes no assumptions about whether the language uses much morphology, whether it is prefixing or suffixing, or whether its affixes are long or short. It does, however, assume that (1) salient affixes are frequent, (2) words are essentially variable-length sequences of random characters, and (3) a heuristic for what constitutes a systematic affix alternation is valid. Tested on four typologically distant languages, the stemmer shows very promising results in an evaluation against a human-made gold standard.
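
The pairwise formulation can be made concrete with a small toy sketch. It is not the algorithm of the paper, whose details the abstract does not give. It only illustrates the idea of deciding whether two words share a stem through a concatenative relation, using assumption (1) that salient affixes are frequent. The function names, the thresholds `min_count` and `min_stem`, and the restriction to suffixes are all assumptions made for brevity.

```python
from collections import Counter


def suffix_counts(words, max_len=4):
    """Count, over distinct words, how often each word-final substring occurs.

    The empty suffix is counted once per word so that a bare stem can be
    paired with its suffixed forms.
    """
    counts = Counter()
    for w in set(words):
        counts[""] += 1
        # Leave at least one character for the stem.
        for k in range(1, min(max_len, len(w) - 1) + 1):
            counts[w[-k:]] += 1
    return counts


def same_stem(a, b, counts, min_count=2, min_stem=3):
    """Toy decision: do a and b look like variants of one stem?

    Hypothetical criterion: the two words share a prefix of at least
    min_stem characters, and both leftover endings are frequent suffixes
    in the raw word list.
    """
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    if i < min_stem:
        return False
    rest_a, rest_b = a[i:], b[i:]
    return counts[rest_a] >= min_count and counts[rest_b] >= min_count


words = ["walk", "walks", "walked", "talk", "talks", "talked",
         "jump", "jumps", "jumped", "table"]
counts = suffix_counts(words)
print(same_stem("walks", "walked", counts))
print(same_stem("talk", "table", counts))
```

On this word list the first pair is accepted, because the endings `s` and `ed` each occur on three words, while `talk` and `table` are rejected because their shared prefix is shorter than `min_stem`. A faithful implementation would also have to handle prefixes and the affix-alternation heuristic mentioned in assumption (3).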

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hammarström, H. (2006). Poor Man’s Stemming: Unsupervised Recognition of Same-Stem Words. In: Ng, H.T., Leong, MK., Kan, MY., Ji, D. (eds) Information Retrieval Technology. AIRS 2006. Lecture Notes in Computer Science, vol 4182. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880592_25

  • DOI: https://doi.org/10.1007/11880592_25

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-45780-0

  • Online ISBN: 978-3-540-46237-8

  • eBook Packages: Computer Science (R0)
