Improving translation between English, Assamese bilingual pair with monolingual data, length penalty and model averaging

  • Original Research
International Journal of Information Technology

Abstract

Despite the success of Neural Machine Translation (NMT), the translation of resource-poor Indic languages, and techniques for improving translation quality for these languages, remain underexplored. In this work, we first build two baseline NMT systems, one for each translation direction between Assamese and English, using OpenNMT-py, a neural machine translation framework based on PyTorch. On our custom test set, the baseline models achieve BLEU scores of 9.01 for English to Assamese and 14.71 for Assamese to English. Next, public-domain monolingual data in English and Assamese are translated in batches, and the newly translated data are used to train new models, which yields significant BLEU improvements in both directions. Two test sets are used in our experiments: the publicly available FLORES-101 test set and our in-house test set of 500 domain-specific sentences. Length penalty and model averaging further improve the BLEU scores in both directions. We show that phased translation of monolingual data, combined with length penalty and model averaging, contributes to higher BLEU scores for data-poor languages.
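The abstract does not state which form of length penalty or which averaging scheme was used, so the following is only an illustrative sketch under two assumptions: a GNMT-style length penalty, lp(Y) = ((5 + |Y|) / 6)^alpha, applied to beam hypotheses, and a uniform element-wise average of parameters across saved checkpoints. The function names, the value alpha = 0.6, and the toy numbers are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence


def length_penalty(length: int, alpha: float = 0.6) -> float:
    """Assumed GNMT-style length penalty: ((5 + |Y|) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha


def rescore(log_prob: float, length: int, alpha: float = 0.6) -> float:
    """Divide a hypothesis log-probability by its length penalty."""
    return log_prob / length_penalty(length, alpha)


def average_checkpoints(
    checkpoints: Sequence[Dict[str, List[float]]],
) -> Dict[str, List[float]]:
    """Uniform element-wise mean of same-shaped parameters across checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged: Dict[str, List[float]] = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter from every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / n for values in columns]
    return averaged


if __name__ == "__main__":
    # Toy beam: (raw log-probability, length in tokens) for two hypotheses.
    short_hyp, long_hyp = (-4.0, 3), (-5.0, 8)
    print("short:", rescore(*short_hyp), "long:", rescore(*long_hyp))

    # Toy checkpoints, each holding one flattened parameter vector "w".
    ckpts = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
    print(average_checkpoints(ckpts))
```

In this toy case the shorter hypothesis has the higher raw log-probability, but the longer one ranks first once the penalty is applied, which is the kind of bias against longer outputs that a length penalty is meant to counteract. Averaging the parameters of several checkpoints is a common way to smooth out fluctuations between individual training steps.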

Data availability

Not applicable.

Notes

  1. https://ai4bharat.iitm.ac.in/samanantar

  2. https://nplt.in/

  3. https://www.pmindia.gov.in

  4. https://gauhati.ac.in/academic/technology/information-technology

  5. https://github.com/ymoslem/MT-Preparation

  6. https://github.com/rsennrich/subword-nmt

  7. https://github.com/OpenNMT/OpenNMT-py

Acknowledgements

We thank the Ministry of Electronics and Information Technology (MeitY), Government of India for their assistance through the Project ISHAAN.

Author information

Corresponding author

Correspondence to Kishore Kashyap.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kashyap, K., Sarma, S.K. & Ahmed, M.A. Improving translation between English, Assamese bilingual pair with monolingual data, length penalty and model averaging. Int. j. inf. tecnol. 16, 1539–1549 (2024). https://doi.org/10.1007/s41870-023-01714-9

  • DOI: https://doi.org/10.1007/s41870-023-01714-9
