Abstract
We present statistical models for morphological disambiguation in agglutinative languages, with a specific application to Turkish. Turkish poses an interesting problem for statistical models because its productive derivational morphology makes the potential tag set very large. We propose to handle this by breaking up the morphosyntactic tags into inflectional groups, each of which contains the inflectional features for one (intermediate) derived form. Our statistical models score the probability of each morphosyntactic tag by considering statistics over the individual inflectional groups and surface roots in trigram models. Of the four models that we have developed and tested, the simplest one, which ignores the local morphotactics within words, performs best. Our best trigram model achieves 93.95% accuracy on our test data when all the morphosyntactic and semantic features must be correct. If we are interested only in syntactically relevant features and ignore a very small set of semantic features, the accuracy increases to 95.07%.
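As a rough illustration of the decomposition described above, the following Python sketch splits one morphological parse into its root and its inflectional groups. The parse string, the `^DB` derivation-boundary marker, the `+` feature separator, and the function name `split_inflectional_groups` are assumptions made for this example; the abstract does not specify the tag format.

```python
def split_inflectional_groups(parse, boundary="^DB"):
    """Split a parse string into (root, [inflectional group, ...]).

    Assumed format (hypothetical): root+Feat+Feat^DB+Feat+Feat...
    where `boundary` marks a derivation boundary.
    """
    # The first chunk carries the root plus the first group's features.
    first, *rest = parse.split(boundary)
    root, _, first_group = first.partition("+")
    # Later chunks start with "+", which is not part of the group itself.
    groups = [first_group] + [chunk.lstrip("+") for chunk in rest]
    return root, groups


# Hypothetical parse with two derivations, hence three inflectional groups.
example = "root+Noun+A3sg^DB+Verb+Pos^DB+Adj+PresPart"
root, groups = split_inflectional_groups(example)
print(root)
print(groups)
```

Under these assumptions, statistics can be collected over the short feature groups rather than over whole tags, so the number of distinct units to estimate stays small even though the number of possible full tags is very large.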
Cite this article
Hakkani-Tür, D.Z., Oflazer, K. & Tür, G. Statistical Morphological Disambiguation for Agglutinative Languages. Computers and the Humanities 36, 381–410 (2002). https://doi.org/10.1023/A:1020271707826
DOI: https://doi.org/10.1023/A:1020271707826
- agglutinative languages
- morphological disambiguation
- n-gram language models
- statistical natural language processing
- Turkish