Abstract
Language modeling is a statistical technique for representing text data in a machine-readable form. It estimates the probability distribution over sequences of words in a text, and thereby the likelihood of upcoming words in spoken or written conversation. Under the Markov assumption, a language model predicts the next word from the previous n − 1 words of the sentence; the resulting model is called an n-gram model. A limitation of the n-gram technique is that it uses only the preceding words to predict the upcoming word. Factored language modeling extends the n-gram technique by integrating grammatical and linguistic knowledge about each word, such as number, gender, and part-of-speech tag, into the model for predicting the next word. Back-off is a method of resorting to fewer preceding words when a longer contextual history is unavailable. This work examines the effect of various combinations of linguistic features and generalized back-off strategies on the next-word prediction capability of a language model for Hindi. The paper empirically compares the results obtained by using linguistic features of Hindi words in a factored language model against a baseline n-gram technique. The language models are compared using the perplexity metric. In summary, the factored language model with the product combine strategy produces the lowest perplexity, 1.881235, which is about 50% lower than that of the traditional baseline trigram model.
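The ideas of conditioning on a word factor, backing off, and scoring by perplexity can be illustrated with a minimal Python sketch. It is not the authors' system and does not implement generalized parallel back-off or the product combine strategy. It assumes a single fixed back-off path (previous word, then previous part-of-speech tag, then unigram), add-k smoothing at each level, and a tiny English corpus invented purely for illustration. The class name `BackoffBigramLM`, the constant `k`, and the `<unk>` token are all hypothetical choices.

```python
import math
from collections import Counter, defaultdict

# Toy corpus invented for illustration: each token is a (word, POS tag) pair.
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("runs", "VERB")],
]
BOS = ("<s>", "<s>")    # sentence-start marker
EOS = ("</s>", "</s>")  # sentence-end marker


class BackoffBigramLM:
    """Predict a word from the previous word, else the previous tag, else unigram."""

    def __init__(self, sentences, k=0.1):
        self.k = k                           # add-k smoothing constant (assumed)
        self.by_word = defaultdict(Counter)  # next-word counts per previous word
        self.by_tag = defaultdict(Counter)   # next-word counts per previous tag
        self.unigram = Counter()             # context-free word counts
        for sent in sentences:
            tokens = [BOS] + sent + [EOS]
            for (prev_word, prev_tag), (word, _) in zip(tokens, tokens[1:]):
                self.by_word[prev_word][word] += 1
                self.by_tag[prev_tag][word] += 1
                self.unigram[word] += 1
        self.vocab = set(self.unigram) | {"<unk>"}

    def _smoothed(self, counts, word):
        # Add-k estimate; sums to 1 over the vocabulary for any count table.
        total = sum(counts.values())
        return (counts[word] + self.k) / (total + self.k * len(self.vocab))

    def prob(self, word, prev_word, prev_tag):
        if word not in self.vocab:
            word = "<unk>"
        # The level is chosen from the history alone, so each history
        # still yields a properly normalized distribution.
        if prev_word in self.by_word:
            return self._smoothed(self.by_word[prev_word], word)
        if prev_tag in self.by_tag:          # back off to the tag factor
            return self._smoothed(self.by_tag[prev_tag], word)
        return self._smoothed(self.unigram, word)

    def perplexity(self, sentences):
        # 2 ** (negative average log2 probability per predicted token).
        log_sum, n = 0.0, 0
        for sent in sentences:
            tokens = [BOS] + sent + [EOS]
            for (prev_word, prev_tag), (word, _) in zip(tokens, tokens[1:]):
                log_sum += math.log2(self.prob(word, prev_word, prev_tag))
                n += 1
        return 2 ** (-log_sum / n)


if __name__ == "__main__":
    lm = BackoffBigramLM(TRAIN)
    held_out = [[("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")]]
    print("perplexity:", lm.perplexity(held_out))
```

In this sketch, a previous word never seen in training (for example "this" tagged `DET`) still receives an informed prediction through its tag, which is the kind of benefit a factored model aims for. Lower perplexity on held-out text indicates a better predictive model.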
Cite this article
Babhulgaonkar, A.R., Sonavane, S.P. Experimenting with factored language model and generalized back-off for Hindi. Int. j. inf. tecnol. 14, 2105–2118 (2022). https://doi.org/10.1007/s41870-020-00503-y
DOI: https://doi.org/10.1007/s41870-020-00503-y