Improving translation between English, Assamese bilingual pair with monolingual data, length penalty and model averaging

Kashyap, Kishore; Sarma, Shikhar Kumar; Ahmed, Mazida Akhtara

doi:10.1007/s41870-023-01714-9

Original Research
Published: 30 January 2024

Volume 16, pages 1539–1549, (2024)

International Journal of Information Technology

Kishore Kashyap¹ (ORCID: orcid.org/0000-0002-4144-0235),
Shikhar Kumar Sarma¹ &
Mazida Akhtara Ahmed¹

Abstract

Notwithstanding the success of Neural Machine Translation (NMT), the translation of resource-poor Indic languages, and the techniques for improving translation quality for these languages, remain underexplored. In this work, we first build two baseline NMT systems, one for each translation direction between Assamese and English. The systems are built with OpenNMT-py, a neural machine translation framework based on PyTorch. On our custom test set, the baseline models achieve BLEU scores of 9.01 for English to Assamese and 14.71 for Assamese to English. Next, public-domain monolingual data in English and Assamese are translated batchwise, and the newly translated data are used to train new models. Significant BLEU improvements are seen in both directions. Two test sets are used in our experiments: the publicly available FLORES-101 test set and our in-house test set of 500 domain-specific sentences. Length penalty and model averaging further improve the BLEU scores in both directions. We show that phased translation of monolingual data, combined with length penalty and model averaging, contributes to higher BLEU scores for data-poor languages.
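The abstract names length penalty and model averaging but does not say which formula or averaging scheme was used. The following is a minimal sketch under stated assumptions: the length penalty is the one from Wu et al. (2016), listed in the references, and model averaging is taken to mean a uniform average of checkpoint parameters. All function names and toy values are hypothetical and are not taken from the paper or from OpenNMT-py.

```python
from typing import Dict, List, Tuple


def gnmt_length_penalty(length: int, alpha: float) -> float:
    """Length penalty of Wu et al. (2016): ((5 + length) / 6) ** alpha.

    Assumption: the paper may use a different formula.
    """
    return ((5.0 + length) / 6.0) ** alpha


def normalized_score(log_prob: float, length: int, alpha: float) -> float:
    """Divide a hypothesis log-probability by its length penalty."""
    return log_prob / gnmt_length_penalty(length, alpha)


def best_hypothesis(hyps: List[Tuple[float, int]], alpha: float) -> Tuple[float, int]:
    """Pick the (log_prob, length) pair with the highest normalized score."""
    return max(hyps, key=lambda h: normalized_score(h[0], h[1], alpha))


def average_checkpoints(
    checkpoints: List[Dict[str, List[float]]]
) -> Dict[str, List[float]]:
    """Uniformly average same-named parameters across checkpoints.

    Each checkpoint is a plain dict of parameter name -> flat list of floats,
    a stand-in for a real framework's saved weights.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged: Dict[str, List[float]] = {}
    for name, first in checkpoints[0].items():
        averaged[name] = [
            sum(ckpt[name][i] for ckpt in checkpoints) / n
            for i in range(len(first))
        ]
    return averaged


# Toy beam candidates (log-probability, length in tokens); values are made up.
short_hyp = (-3.0, 1)
long_hyp = (-4.0, 7)

# Without a penalty (alpha = 0) the shorter hypothesis wins on raw score;
# with alpha = 1 the longer hypothesis is preferred after normalization.
no_penalty_choice = best_hypothesis([short_hyp, long_hyp], alpha=0.0)
with_penalty_choice = best_hypothesis([short_hyp, long_hyp], alpha=1.0)

# Averaging two toy checkpoints that share one parameter vector "w".
merged = average_checkpoints([{"w": [1.0, 3.0]}, {"w": [3.0, 5.0]}])
```

In this toy setting a larger alpha shifts the preference toward longer outputs, and averaging replaces each weight with its mean over the saved checkpoints; whether either step raises BLEU is an empirical question that the paper addresses with its own experiments.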

Data availability

Not applicable.

References

Ahmed MA, Kashyap K, Sarma SK (2023) Pre-processing and resource modelling for english-assamese nmt system. In: 2023 4th international conference on computing and communication systems (I3CS), pp 1–6. https://doi.org/10.1109/I3CS58314.2023.10127567
Akella K, Allu SH, Ragupathi SS et al (2020) Exploring pair-wise nmt for indian languages. arXiv:2012.05786
Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. CoRR. arXiv:1409.0473
Baruah KK, Das P, Hannan A et al (2014) Assamese-english bilingual machine translation. CoRR. arXiv:1407.2019
Baruah R, Mundotiya RK, Singh AK (2021) Low resource neural machine translation: Assamese to/from other indo-aryan (indic) languages. ACM Trans Asian Low-Resour Lang Inf Process. https://doi.org/10.1145/3469721
Bird S, Klein E, Loper E (2009) Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc.
Cho K, van Merrienboer B, Gülçehre Ç et al (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR. arXiv:1406.1078
Choudhary H, Rao S, Rohilla R (2020) Neural machine translation for low-resourced Indian languages. In: Proceedings of the twelfth language resources and evaluation conference. European Language Resources Association, Marseille, France, pp 3610–3615. https://aclanthology.org/2020.lrec-1.444
Devi CS, Purkayastha BS (2023) An empirical analysis on statistical and neural machine translation system for english to mizo language. Int J Inf Technol 1–8
Dubey P (2019) The hindi to dogri machine translation system: grammatical perspective. Int J Inf Technol 11(1):171–182
Eberhard DM, Gary F (2022) Ethnologue: languages of the World. SIL International, Dallas, Texas. https://www.ethnologue.com/
Gage P (1994) A new algorithm for data compression. C Users J 12(2):23–38
Gogoi A, Baruah N, Sarma SK et al (2021) Improving stemming for assamese information retrieval. Int J Inf Technol 13(5):1763–1768
Goyal N, Gao C, Chaudhary V et al (2021) The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation. CoRR. arXiv:2106.03193
Haddow B, Kirefu F (2020) Pmindia—a collection of parallel corpora of languages of india. CoRR. arXiv:2001.09907
Jain A, Mhaskar S, Bhattacharyya P (2021) Evaluating the performance of back-translation for low resource english-marathi language pair: Cfilt-iitbombay@loresmt 2021. In: Proceedings of the 4th workshop on technologies for MT of low resource languages (LoResMT2021), pp 158–162
Kalchbrenner N, Blunsom P (2013) Recurrent continuous translation models. In: Proceedings of the 2013 conference on empirical methods in natural language processing. Association for Computational Linguistics, Seattle, Washington, USA, pp 1700–1709. https://aclanthology.org/D13-1176
Kandimalla A, Lohar P, Maji SK et al (2022) Improving english-to-indian language neural machine translation systems. Information 13(5):245
Klein G, Kim Y, Deng Y et al (2017) Opennmt: open-source toolkit for neural machine translation. In: Proc. ACL. https://doi.org/10.18653/v1/P17-4012
Koehn P, Hoang H, Birch A et al (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th annual meeting of the association for computational linguistics companion volume Proceedings of the demo and poster sessions. Association for Computational Linguistics, Prague, Czech Republic, pp 177–180. https://aclanthology.org/P07-2045
Koul N, Manvi SS (2021) A proposed model for neural machine translation of sanskrit into english. Int J Inf Technol 13(1):375–381
Kunchukuttan A (2020) The IndicNLP Library. https://github.com/anoopkunchukuttan/indic_nlp_library/blob/master/docs/indicnlp.pdf
Kunchukuttan A, Kakwani D, Golla S et al (2020) Ai4bharat-indicnlp corpus: monolingual corpora and word embeddings for indic languages. CoRR. arXiv:2005.00085
Lalrempuii C, Soni B (2023) Extremely low-resource multilingual neural machine translation for indic mizo language. Int J Inf Technol 1–8
Laskar SR, Khilji AFUR, Pakray P et al (2020) Multimodal neural machine translation for English to Hindi. In: Proceedings of the 7th workshop on Asian translation. Association for Computational Linguistics, Suzhou, China, pp 109–113. https://aclanthology.org/2020.wat-1.11
Laskar SR, Pakray P, Bandyopadhyay S (2021) Neural machine translation for low resource assamese–english. In: Maji AK, Saha G, Das S, et al (eds) Proceedings of the international conference on computing and communication systems. Springer Singapore, Singapore, pp 35–44. https://doi.org/10.1007/978-981-33-4084-8_4
Laskar SR, Paul B, Dadure P et al (2023) English-assamese neural machine translation using prior alignment and pre-trained language model. Comput Speech Lang 82:101524
Nath B, Sarkar S, Das S et al (2022) A trie based lemmatizer for assamese language. Int J Inf Technol 14(5):2355–2360
Papineni K, Roukos S, Ward T et al (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the association for computational linguistics. Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp 311–318. https://doi.org/10.3115/1073083.1073135. https://aclanthology.org/P02-1040
Pathak A, Pakray P (2019) Neural machine translation for indian languages. J Intell Syst 28(3):465–477. https://doi.org/10.1515/jisys-2018-0065
Ramesh G, Doddapaneni S, Bheemaraj A et al (2021) Samanantar: the largest publicly available parallel corpora collection for 11 indic languages. CoRR. arXiv:2104.05596
Ramesh G, Doddapaneni S, Bheemaraj A et al (2022) Samanantar: the largest publicly available parallel corpora collection for 11 indic languages. Trans Assoc Comput Linguist 10:145–162. https://doi.org/10.1162/tacl_a_00452
Řehůřek R, Sojka P (2010) Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks. ELRA, Valletta, Malta, pp 45–50. http://is.muni.cz/publication/884893/en
Revanuru K, Turlapaty K, Rao S (2017) Neural machine translation of indian languages. In: Proceedings of the 10th annual ACM India compute conference. Association for Computing Machinery, New York, NY, USA, Compute ’17, pp 11–20. https://doi.org/10.1145/3140107.3140111
Sennrich R, Haddow B, Birch A (2016) Neural machine translation of rare words with subword units. In: Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: long papers). Association for Computational Linguistics, Berlin, Germany, pp 1715–1725. https://doi.org/10.18653/v1/P16-1162. https://aclanthology.org/P16-1162
Singh MT, Borgohain R, Gohain S (2014) An english-assamese machine translation system. Int J Comput Appl 93:1–6
Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. arXiv:1409.3215
Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. In: Proceedings of the 31st international conference on neural information processing systems. Curran Associates Inc., Red Hook, NIPS’17, pp 6000–6010
Wu Y, Schuster M, Chen Z et al (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv:1609.08144
Xu N, Li Y, Xu C et al (2019) Analysis of back-translation methods for low-resource neural machine translation. In: Natural language processing and Chinese computing: 8th CCF international conference, NLPCC 2019, Dunhuang, China, October 9–14, 2019, Proceedings, Part II 8. Springer, pp 466–475

Acknowledgements

We thank the Ministry of Electronics and Information Technology (MeitY), Government of India for their assistance through the Project ISHAAN.

Author information

Shikhar Kumar Sarma and Mazida Akhtara Ahmed contributed equally to this work.

Authors and Affiliations

Department of Information Technology, Gauhati University, Jalukbari, Guwahati, 781014, Assam, India
Kishore Kashyap, Shikhar Kumar Sarma & Mazida Akhtara Ahmed

Corresponding author

Correspondence to Kishore Kashyap.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kashyap, K., Sarma, S.K. & Ahmed, M.A. Improving translation between English, Assamese bilingual pair with monolingual data, length penalty and model averaging. Int. j. inf. tecnol. 16, 1539–1549 (2024). https://doi.org/10.1007/s41870-023-01714-9

Received: 24 July 2023
Accepted: 19 December 2023
Published: 30 January 2024
Issue Date: March 2024
DOI: https://doi.org/10.1007/s41870-023-01714-9
