Language Modeling for Turkish Text and Speech Processing

Turkish Natural Language Processing

Abstract

This chapter presents an overview of language modeling, followed by a discussion of the challenges in Turkish language modeling. Sub-lexical units are commonly used to reduce the high out-of-vocabulary (OOV) rates of morphologically rich languages. These units are obtained either by morphological analysis or by unsupervised statistical techniques. For Turkish, morphological analysis yields word segmentations in both lexical and surface forms, which can be used as sub-lexical language modeling units. Discriminative language models, which outperform generative models for various tasks, allow for easy integration of morphological and syntactic features into language modeling. The chapter provides a review of both generative and discriminative approaches for Turkish language modeling.

Notes

  1. But as noted in Chap. 2, most high-frequency words have a single morpheme, so inflected words most likely have more than 1.7 morphemes.

  2. Aalto University, Finland. Department of Computer Science. “Morpho Challenge”: https://morpho.aalto.fi/events/morphochallenge/ (Accessed Sept. 14, 2017).

  3. Aalto University, Finland. Department of Computer Science. “Morpho Challenge: Results”: https://morpho.aalto.fi/events/morphochallenge/results-tur.html (Accessed Sept. 14, 2017).

Author information

Corresponding author

Correspondence to Murat Saraçlar.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter

Cite this chapter

Arısoy, E., Saraçlar, M. (2018). Language Modeling for Turkish Text and Speech Processing. In: Oflazer, K., Saraçlar, M. (eds) Turkish Natural Language Processing. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-90165-7_4

  • DOI: https://doi.org/10.1007/978-3-319-90165-7_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-90163-3

  • Online ISBN: 978-3-319-90165-7

  • eBook Packages: Computer Science (R0)
