Neural Computing and Applications, Volume 31, Issue 7, pp 2455–2467

Statistical machine translation of Indian languages: a survey

  • Nadeem Khan Jadoon
  • Waqas Anwar (Email author)
  • Usama Ijaz Bajwa
  • Farooq Ahmad
Original Article

Abstract

This study presents a performance analysis of a state-of-the-art phrase-based statistical machine translation (SMT) system on eight Indian languages. The state of the art in SMT from different Indian languages into English is also discussed briefly. The study was motivated by the need to promote the development of SMT and linguistic resources for these language pairs, since current systems are still in their infancy owing to sparse data resources. The EMILLE and crowdsourced parallel corpora were used for the experiments. The study concludes by reporting the performance of a baseline SMT system translating Indian languages (Bengali, Gujarati, Hindi, Malayalam, Punjabi, Tamil, Telugu and Urdu) into English, with average results of 10–20 % accuracy across all the language pairs. As a result of this study, both the annotated parallel corpora and the SMT system will serve as benchmarks for future approaches to SMT for Hindi → English, Urdu → English, Punjabi → English, Telugu → English, Tamil → English, Gujarati → English, Bengali → English and Malayalam → English.

Keywords

  • Statistical machine translation (SMT)
  • Parallel corpus
  • Natural language processing (NLP)
  • Phrase-based translation

Notes

Acknowledgement

We would like to thank Dr. Nadir Durrani of the University of Edinburgh for his helpful comments and suggestions during the experimentation and for proofreading the write-up, which helped us greatly in improving the paper. He also provided examples that are included in the text.

Copyright information

© The Natural Computing Applications Forum 2017

Authors and Affiliations

  • Nadeem Khan Jadoon (1)
  • Waqas Anwar (2, Email author)
  • Usama Ijaz Bajwa (2)
  • Farooq Ahmad (2)

  1. Department of Computer Science, COMSATS Institute of Information Technology, Abbottabad, Pakistan
  2. Department of Computer Science, COMSATS Institute of Information Technology, Lahore, Pakistan
