Statistical Morphological Disambiguation for Agglutinative Languages

Abstract

We present statistical models for morphological disambiguation in agglutinative languages, with a specific application to Turkish. Turkish is an interesting problem for statistical models because its productive derivational morphology makes the potential tag set very large. We propose to handle this by breaking the morphosyntactic tags into inflectional groups, each of which contains the inflectional features of one (intermediate) derived form. Our statistical models score the probability of each morphosyntactic tag using statistics over the individual inflectional groups and surface roots in trigram models. Of the four models that we developed and tested, the simplest, which ignores the local morphotactics within words, performs best. Our best trigram model achieves 93.95% accuracy on our test data when all morphosyntactic and semantic features must be correct. If we consider only the syntactically relevant features and ignore a very small set of semantic features, accuracy increases to 95.07%.
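The decomposition of a morphosyntactic tag into inflectional groups can be illustrated with a small sketch. The abstract does not specify a tag notation, so the format below is an assumption: features are joined with `+`, and derivation boundaries are marked with `^DB`. The function name `split_into_igs` and the example parse are hypothetical and serve only as illustration.

```python
from typing import List, Tuple

# Assumed marker for a derivation boundary; not specified in the abstract.
DERIVATION_BOUNDARY = "^DB"


def split_into_igs(parse: str) -> Tuple[str, List[str]]:
    """Split a parse of the assumed form 'root+Feat+...^DB+Feat+...'
    into the root and a list of inflectional groups (IGs)."""
    segments = parse.split(DERIVATION_BOUNDARY)
    # The first segment carries the root followed by the first IG.
    root, _, first_ig = segments[0].partition("+")
    # Later segments start with '+', which is stripped here.
    later_igs = [segment.lstrip("+") for segment in segments[1:]]
    return root, [first_ig] + later_igs


# Hypothetical parse with three derivation boundaries, hence four IGs.
example = "saglam+Adj^DB+Verb+Become^DB+Verb+Caus+Pos^DB+Noun+Inf+A3sg+Pnon+Nom"
root, igs = split_into_igs(example)
print(root)
for ig in igs:
    print("  ", ig)
```

Treating each inflectional group as a separate unit keeps the set of distinct symbols small even when the number of full tags is very large, which is the motivation stated in the abstract.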

Author information

Corresponding author

Correspondence to Dilek Z. Hakkani-Tür.

About this article

Cite this article

Hakkani-Tür, D.Z., Oflazer, K. & Tür, G. Statistical Morphological Disambiguation for Agglutinative Languages. Computers and the Humanities 36, 381–410 (2002). https://doi.org/10.1023/A:1020271707826

  • DOI: https://doi.org/10.1023/A:1020271707826

  • agglutinative languages
  • morphological disambiguation
  • n-gram language models
  • statistical natural language processing
  • Turkish