Statistical machine translation of Indian languages: a survey

Abstract

This study presents a performance analysis of a state-of-the-art phrase-based statistical machine translation (SMT) system on eight Indian languages. It also briefly reviews the state of the art in SMT from different Indian languages into English. The motivation is to promote the development of SMT and linguistic resources for these language pairs, since current systems are still in their infancy owing to sparse data resources. The EMILLE and crowdsourced parallel corpora are used for the experiments. The study concludes by reporting the performance of a baseline SMT system translating Bengali, Gujarati, Hindi, Malayalam, Punjabi, Tamil, Telugu and Urdu into English, with an average accuracy of 10–20 % across all language pairs. As a result of this study, both the annotated parallel corpora and the SMT system will serve as benchmarks for future approaches to SMT for Hindi → English, Urdu → English, Punjabi → English, Telugu → English, Tamil → English, Gujarati → English, Bengali → English and Malayalam → English.

Acknowledgement

We would like to thank Dr. Nadir Durrani of the University of Edinburgh for his helpful comments and suggestions during the experiments and for proofreading the manuscript, which greatly improved the paper. He also provided examples that are included in the text.

Author information

Corresponding author

Correspondence to Waqas Anwar.

About this article

Cite this article

Khan Jadoon, N., Anwar, W., Bajwa, U.I. et al. Statistical machine translation of Indian languages: a survey. Neural Comput & Applic 31, 2455–2467 (2019). https://doi.org/10.1007/s00521-017-3206-2

Keywords

  • Statistical machine translation (SMT)
  • Parallel corpus
  • Natural language processing (NLP)
  • Phrase-based translation