Poor Man’s Stemming: Unsupervised Recognition of Same-Stem Words

Hammarström, Harald

doi:10.1007/11880592_25

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4182)

Included in the following conference series:

Asia Information Retrieval Symposium

Abstract

We present a new, fully unsupervised stemming algorithm that requires no human intervention and applies to an open class of languages. Because it relies on no existing large data collections or linguistic resources other than raw text, it is especially attractive for low-density languages. The stemming problem is formulated as a decision on whether two given words are variants of the same stem, which requires that, if so, there is a concatenative relation between the two. The underlying theory makes no assumptions about whether the language uses a lot of morphology, whether it is prefixing or suffixing, or whether affixes are long or short. It does, however, assume that (1) salient affixes are frequent, (2) words are essentially variable-length sequences of random characters, and (3) a heuristic on what constitutes a systematic affix alteration is valid. Tested on four typologically distant languages, the stemmer shows very promising results in an evaluation against a human-made gold standard.
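To make the pairwise formulation concrete, here is a minimal Python sketch of a same-stem decision based on a concatenative relation. It is an illustration under stated assumptions, not the algorithm of the paper: it handles suffixes only, picks "salient" suffixes by raw frequency, and does not model the random-character assumption or the affix-alteration heuristic. The names `frequent_suffixes`, `same_stem`, `max_len`, `top_k`, and `min_stem` are hypothetical.

```python
from collections import Counter


def frequent_suffixes(words, max_len=3, top_k=5):
    """Return the top_k most frequent word-final substrings, plus the empty suffix.

    Assumption for illustration: raw type frequency stands in for salience.
    """
    counts = Counter()
    for w in set(words):
        # Leave at least one character for the stem.
        for k in range(1, min(max_len, len(w) - 1) + 1):
            counts[w[-k:]] += 1
    return {""} | {s for s, _ in counts.most_common(top_k)}


def same_stem(w1, w2, suffixes, min_stem=3):
    """Decide whether w1 = stem + a and w2 = stem + b with a, b in `suffixes`."""
    # Length of the longest common prefix of the two words.
    n = 0
    while n < min(len(w1), len(w2)) and w1[n] == w2[n]:
        n += 1
    # Try each candidate stem, from the longest shared prefix down to min_stem.
    for cut in range(n, min_stem - 1, -1):
        if w1[cut:] in suffixes and w2[cut:] in suffixes:
            return True
    return False
```

With a hand-picked suffix set such as `{"", "s", "ed", "ing"}`, `same_stem("walked", "walking", ...)` accepts the pair because both remainders after the shared prefix `walk` are in the set, while `same_stem("sing", "sting", ...)` rejects it because the shared prefix is shorter than `min_stem`. A prefixing language would need the mirror-image check on word beginnings.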

Author information

Authors and Affiliations

Chalmers University, 412 96, Gothenburg, Sweden
Harald Hammarström

Editor information

Editors and Affiliations

Department of Computer Science, National University of Singapore, 3 Science Drive 2, 117543, Singapore
Hwee Tou Ng
Institute for Infocomm Research, 21 Heng Mui Keng Terrace, 119613, Singapore
Mun-Kew Leong
Department of Computer Science, School of Computing, National University of Singapore, 117543, Singapore
Min-Yen Kan
Institute for Infocomm Research, 21 Heng Mui Keng Terrace, P.O. Box, 119613, Singapore
Donghong Ji

About this paper

Cite this paper

Hammarström, H. (2006). Poor Man’s Stemming: Unsupervised Recognition of Same-Stem Words. In: Ng, H.T., Leong, MK., Kan, MY., Ji, D. (eds) Information Retrieval Technology. AIRS 2006. Lecture Notes in Computer Science, vol 4182. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880592_25

DOI: https://doi.org/10.1007/11880592_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45780-0
Online ISBN: 978-3-540-46237-8
eBook Packages: Computer Science, Computer Science (R0)
