Methods and Algorithms for Unsupervised Learning of Morphology

Can, Burcu; Manandhar, Suresh

doi:10.1007/978-3-642-54906-9_15

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8403)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

This paper surveys methods and algorithms for unsupervised learning of morphology. We describe the methods and algorithms used for morphological segmentation from a computational linguistics point of view. The survey covers methods based on MDL (minimum description length), MLE (maximum likelihood estimation), and MAP (maximum a posteriori) estimation, as well as parametric and non-parametric Bayesian approaches. We also review the evaluation schemes for unsupervised morphological segmentation and summarize the results of the Morpho Challenge evaluations.

Author information

Authors and Affiliations

Department of Computer Science, University of York, Heslington, York, YO10 5GH, UK
Burcu Can & Suresh Manandhar

Editor information

Editors and Affiliations

Center for Computing Research, National Polytechnic Institute, Av. Juan de Dios Bátiz, Col. Nueva Industrial Vallejo, 07738, Mexico D.F., Mexico
Alexander Gelbukh

About this paper

Cite this paper

Can, B., Manandhar, S. (2014). Methods and Algorithms for Unsupervised Learning of Morphology. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8403. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54906-9_15

DOI: https://doi.org/10.1007/978-3-642-54906-9_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54905-2
Online ISBN: 978-3-642-54906-9
eBook Packages: Computer Science (R0)
