Statistical machine translation of Indian languages: a survey
Abstract
This study presents a performance analysis of a state-of-the-art phrase-based statistical machine translation (SMT) system on eight Indian languages. The state of the art in SMT from different Indian languages into English is also discussed briefly. The motivation for this study is to promote the development of SMT and linguistic resources for these Indian language pairs, as current systems are still in their infancy owing to sparse data resources. The EMILLE and crowdsourced parallel corpora were used for the experiments. The study concludes by presenting the performance of a baseline SMT system translating Indian languages (Bengali, Gujarati, Hindi, Malayalam, Punjabi, Tamil, Telugu and Urdu) into English, with results averaging 10–20% accuracy across the language pairs. As a result of this study, both these annotated parallel corpora and the SMT system will serve as benchmarks for future approaches to SMT for Hindi → English, Urdu → English, Punjabi → English, Telugu → English, Tamil → English, Gujarati → English, Bengali → English and Malayalam → English.
Keywords
Statistical machine translation (SMT), Parallel corpus, Natural language processing (NLP), Phrase-based translation
Acknowledgement
We would like to thank Dr. Nadir Durrani of the University of Edinburgh for his helpful comments and suggestions during the experimentation and for proofreading the manuscript, which helped us greatly in improving the paper. He also provided examples that are included in the text.