Turkish Language Resources: Morphological Parser, Morphological Disambiguator and Web Corpus

  • Haşim Sak
  • Tunga Güngör
  • Murat Saraçlar
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5221)

Abstract

In this paper, we propose a set of language resources for building Turkish language processing applications. Specifically, we present a finite-state implementation of a morphological parser, an averaged perceptron-based morphological disambiguator, and the compilation of a web corpus. Turkish is an agglutinative language with highly productive inflectional and derivational morphology. Our morphological parser is based on two-level morphology. It is one of the most complete parsers for Turkish and, in contrast to existing parsers, runs independently of any external system such as PC-KIMMO. Because of the complex phonology and morphology of Turkish, parsing produces multiple candidate parses for some words. We developed a morphological disambiguator with an accuracy of about 98% using the averaged perceptron algorithm. We also present our efforts to build a Turkish web corpus of about 423 million words.
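
To make the disambiguation step concrete, the following is a minimal Python sketch of an averaged perceptron that picks one parse per word from a list of candidates. It is not the authors' implementation: the abstract does not describe their features or decoding, so the feature templates, the per-word (rather than sequence-level) ranking, the simplified tag strings, and the two toy training sentences are all assumptions made for illustration. The example uses the word "evin", which can be read as "of the house" (genitive) or "your house" (second person possessive).

```python
from collections import defaultdict


def features(words, i, parse):
    # Hypothetical feature templates: the parse alone, and the parse
    # paired with the neighbouring surface words.
    prev_word = words[i - 1] if i > 0 else "<s>"
    next_word = words[i + 1] if i + 1 < len(words) else "</s>"
    return [
        f"parse={parse}",
        f"prev={prev_word}|parse={parse}",
        f"next={next_word}|parse={parse}",
    ]


def best_parse(weights, words, i, candidates):
    # Highest-scoring candidate; ties go to the first candidate listed.
    return max(
        candidates,
        key=lambda p: sum(weights.get(f, 0.0) for f in features(words, i, p)),
    )


def train(data, epochs=5):
    # Averaged perceptron: return the mean of the weight vectors
    # observed after each training decision, not the final weights.
    weights = defaultdict(float)
    totals = defaultdict(float)
    steps = 0
    for _ in range(epochs):
        for words, candidates, gold in data:
            for i in range(len(words)):
                predicted = best_parse(weights, words, i, candidates[i])
                if predicted != gold[i]:
                    for f in features(words, i, gold[i]):
                        weights[f] += 1.0
                    for f in features(words, i, predicted):
                        weights[f] -= 1.0
                steps += 1
                for f, w in weights.items():
                    totals[f] += w
    return {f: t / steps for f, t in totals.items()}


# Toy data (assumed, with simplified tags): words, candidate parses, gold parses.
EVIN = ["ev+Noun+Gen", "ev+Noun+P2sg"]
data = [
    (["evin", "kapısı"], [EVIN, ["kapı+Noun+P3sg"]],
     ["ev+Noun+Gen", "kapı+Noun+P3sg"]),
    (["evin", "güzel"], [EVIN, ["güzel+Adj"]],
     ["ev+Noun+P2sg", "güzel+Adj"]),
]

weights = train(data)
for words, candidates, _ in data:
    print(words, [best_parse(weights, words, i, c) for i, c in enumerate(candidates)])
```

Averaging the weights over all training steps, rather than keeping only the last vector, is what distinguishes the averaged perceptron from the plain perceptron; it tends to reduce sensitivity to the order of the training examples.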

Keywords

Morphological parsing, Morphological disambiguation, Turkish, Web corpus

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Haşim Sak (1)
  • Tunga Güngör (1)
  • Murat Saraçlar (2)
  1. Computer Engineering Department, Boğaziçi University, Bebek, Turkey
  2. Electrical and Electronic Engineering Department, Boğaziçi University, Bebek, Turkey
