Experimenting with factored language model and generalized back-off for Hindi

  • Original Research
International Journal of Information Technology

Abstract

Language modeling is a statistical technique for representing text in a machine-readable form: it estimates the probability distribution over word sequences and thereby the likelihood of upcoming words in spoken or written language. Under the Markov assumption, a language model predicts the next word from the previous n − 1 words of the sentence, the so-called n-gram. The limitation of the n-gram technique is that it uses only the preceding words themselves to predict the upcoming word. Factored language modeling extends the n-gram technique by integrating grammatical and linguistic knowledge about each word, such as its number, gender, and part-of-speech tag, into the prediction of the next word. Back-off is a method of resorting to a shorter history of preceding words when the longer contextual history is unavailable. This work studies the effect of various combinations of linguistic features and generalized back-off strategies on the next-word prediction capability of language models for Hindi. The paper empirically compares factored language models that use linguistic features of Hindi words against a baseline n-gram model, evaluating all models with the perplexity metric. In summary, the factored language model with the product combination strategy produces the lowest perplexity, 1.881235, which is about 50% lower than that of the traditional baseline trigram model.
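
To make the back-off idea concrete, the following minimal sketch (not taken from the paper, and far simpler than the authors' factored models with linguistic features of Hindi words) trains a plain bigram model on a toy Hindi corpus, backs off to a smoothed unigram estimate whenever a bigram context is unseen, and reports perplexity. The toy sentences, the add-one smoothing, and the back-off weight of 0.4 are illustrative assumptions only.

import math
from collections import Counter

# Toy training and test sentences; <s> and </s> mark sentence boundaries.
train = [
    ["<s>", "राम", "घर", "जाता", "है", "</s>"],
    ["<s>", "सीता", "घर", "जाती", "है", "</s>"],
    ["<s>", "राम", "खाना", "खाता", "है", "</s>"],
]
test = [["<s>", "सीता", "खाना", "खाती", "है", "</s>"]]

unigrams = Counter(w for sent in train for w in sent)
bigrams = Counter(pair for sent in train for pair in zip(sent, sent[1:]))
total_tokens = sum(unigrams.values())
vocab_size = len(unigrams)

def prob(prev, word, alpha=0.4):
    # Use the bigram estimate when the pair was seen in training; otherwise
    # back off to an add-one smoothed unigram estimate, discounted by alpha
    # (the "resort to a shorter history" idea in its simplest form).
    if (prev, word) in bigrams:
        return bigrams[(prev, word)] / unigrams[prev]
    return alpha * (unigrams[word] + 1) / (total_tokens + vocab_size)

def perplexity(sentences):
    # Perplexity = 2 ** (negative mean log2 probability per predicted word);
    # lower values mean the model is less surprised by the test text.
    log_sum, n = 0.0, 0
    for sent in sentences:
        for prev, word in zip(sent, sent[1:]):
            log_sum += math.log2(prob(prev, word))
            n += 1
    return 2 ** (-log_sum / n)

print(f"Toy test-set perplexity: {perplexity(test):.3f}")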

Author information

Corresponding author

Correspondence to Arun R. Babhulgaonkar.

About this article

Cite this article

Babhulgaonkar, A.R., Sonavane, S.P. Experimenting with factored language model and generalized back-off for Hindi. Int. j. inf. tecnol. 14, 2105–2118 (2022). https://doi.org/10.1007/s41870-020-00503-y
