Abstract
Despite the success of Neural Machine Translation (NMT), the translation of resource-poor Indic languages, and techniques for improving translation quality for these languages, remain underexplored. In this work, we first build two baseline NMT systems, one for each translation direction between Assamese and English, using OpenNMT-py, a neural machine translation framework based on PyTorch. On our custom test set, the baseline models achieve BLEU scores of 9.01 for English to Assamese and 14.71 for Assamese to English. Next, public-domain monolingual data in English and Assamese are translated batchwise, and the newly translated data are used to train new models. Significant BLEU improvements are seen in both directions. Two test sets are used in our experiments: the publicly available FLORES-101 test set and our in-house test set of 500 domain-specific sentences. Length penalty and model averaging further improve the BLEU scores in both directions. We show that phased translation of monolingual data, combined with length penalty and model averaging, contributes to higher BLEU scores for data-poor languages.
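The abstract names length penalty and model averaging without giving their form. As a rough illustration only, the sketch below assumes the length penalty of Wu et al. (2016), lp(Y) = ((5 + |Y|) / 6)^alpha, and plain element-wise averaging of checkpoint parameters. The function names, the dictionary-of-lists checkpoint layout, and the alpha values are hypothetical and are not taken from the paper or from OpenNMT-py.

```python
from typing import Dict, List, Sequence


def gnmt_length_penalty(length: int, alpha: float) -> float:
    """Length penalty of Wu et al. (2016): ((5 + |Y|) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha


def rescore(log_prob: float, length: int, alpha: float) -> float:
    """Length-normalised hypothesis score: log P(Y|X) / lp(Y)."""
    return log_prob / gnmt_length_penalty(length, alpha)


def average_checkpoints(
    checkpoints: Sequence[Dict[str, List[float]]],
) -> Dict[str, List[float]]:
    """Element-wise mean of parameters over checkpoints.

    Assumes every checkpoint has the same parameter names and sizes
    (a hypothetical flat layout chosen for illustration).
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged: Dict[str, List[float]] = {}
    for name, values in checkpoints[0].items():
        averaged[name] = [
            sum(ckpt[name][i] for ckpt in checkpoints) / n
            for i in range(len(values))
        ]
    return averaged


# Toy values: with alpha = 1.0 the longer hypothesis overtakes the shorter one.
short_score = rescore(log_prob=-3.0, length=1, alpha=1.0)
long_score = rescore(log_prob=-4.0, length=7, alpha=1.0)
merged = average_checkpoints([{"w": [1.0, 3.0]}, {"w": [3.0, 5.0]}])
```

Without the penalty, the one-token hypothesis would win on raw log-probability (-3.0 against -4.0). Dividing by lp(Y) reduces the bias toward short outputs, which is the usual motivation for applying a length penalty during beam search.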
Data availability
Not applicable.
Acknowledgements
We thank the Ministry of Electronics and Information Technology (MeitY), Government of India for their assistance through the Project ISHAAN.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Kashyap, K., Sarma, S.K. & Ahmed, M.A. Improving translation between English, Assamese bilingual pair with monolingual data, length penalty and model averaging. Int. j. inf. tecnol. 16, 1539–1549 (2024). https://doi.org/10.1007/s41870-023-01714-9
DOI: https://doi.org/10.1007/s41870-023-01714-9