Abstract
When translating into morphologically rich languages, statistical MT approaches face the problem of data sparsity. The sparseness problem becomes more severe when the corpus available for the morphologically richer language is small. Even though factored models can be used to correctly generate morphological forms of words, data sparseness limits their performance. In this paper, we describe a simple and effective solution based on enriching the input corpora with various morphological forms of words. We apply this method in phrase-based and factor-based experiments on two morphologically rich languages, Hindi and Marathi, when translating from English. We evaluate the experiments using both automatic evaluation and subjective evaluation of adequacy and fluency. We observe that the morphology injection method improves translation quality. Our further analysis indicates that the morphology injection method substantially mitigates the data sparseness problem.
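The idea of enriching a corpus with morphological forms can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `inject_forms`, the suffix table, the pipe-separated factor layout (surface|lemma|number|case) and the transliterated Hindi stem `ladk` ("boy") are all assumptions chosen for illustration.

```python
# Illustrative suffix table (an assumption, not taken from the paper) for
# Hindi masculine nouns ending in -aa, in a rough Latin transliteration.
# Keys are (number, case) pairs.
SUFFIXES = {
    ("sg", "direct"): "aa",
    ("pl", "direct"): "e",
    ("sg", "oblique"): "e",
    ("pl", "oblique"): "on",
}


def inject_forms(en_noun, en_plural, hi_stem):
    """Build one synthetic (source, target) pair per number/case combination.

    The source side carries hypothetical factors: surface|lemma|number|case.
    The target side is the Hindi stem plus the matching suffix.
    """
    pairs = []
    for (number, case), suffix in SUFFIXES.items():
        surface = en_noun if number == "sg" else en_plural
        source = f"{surface}|{en_noun}|{number}|{case}"
        pairs.append((source, hi_stem + suffix))
    return pairs


# A tiny "training corpus" that has seen only one inflected form.
corpus = [("boy|boy|sg|direct", "ladkaa")]

# Enrich it with the forms it has not seen, skipping duplicates.
corpus += [p for p in inject_forms("boy", "boys", "ladk") if p not in corpus]

for source, target in corpus:
    print(source, "->", target)
```

After enrichment, the corpus contains every number/case form of the noun, including forms that never occurred in the original data, which is the sense in which such injection is meant to reduce sparsity.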
Availability of data and materials
The created lexical resources are freely available under a Creative Commons license.
Acknowledgements
The author would like to thank Prof. Pushpak Bhattacharyya for his guidance during this work, and the Department of Science & Technology, Govt. of India, for providing funding under the Woman Scientist Scheme (WOS-A) with the Project Code SR/WOS-A/ET/1075/2014. The author would also like to acknowledge her own associated works published in ACM and ACL web.
Funding
This work is funded by the Department of Science & Technology, Govt. of India, under the Woman Scientist Scheme (WOS-A) with the Project Code SR/WOS-A/ET/1075/2014.
Author information
Contributions
The first author is the sole author of this work.
Ethics declarations
Conflict of interest
The author declares that there is no conflict of interest associated with this work.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Communicated by V. Loia.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Sreelekha, S. Morphology generation for English-Indian language statistical machine translation. Soft Comput 25, 3657–3664 (2021). https://doi.org/10.1007/s00500-020-05393-7
DOI: https://doi.org/10.1007/s00500-020-05393-7