Morphology generation for English-Indian language statistical machine translation

  • Methodologies and Application
  • Soft Computing

Abstract

When translating into morphologically rich languages, statistical machine translation (SMT) approaches face the problem of data sparsity. The problem is more severe when the corpus available for the morphologically richer language is small. Although factored models can be used to generate the correct morphological forms of words, data sparseness limits their performance. In this paper, we describe a simple and effective solution based on enriching the input corpora with various morphological forms of words. We apply this method in phrase-based and factor-based experiments on translation from English into two morphologically rich languages, Hindi and Marathi. We evaluate the experiments using both automatic evaluation and subjective evaluation in terms of adequacy and fluency. We observe that the morphology injection method improves translation quality. Our analysis further indicates that morphology injection substantially mitigates the data sparseness problem.
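The abstract does not spell out how the corpus is enriched, so the following is only a minimal sketch of the general idea, assuming that inflected source/target word-form pairs are appended to the parallel training data as extra entries. The function names `inject_morphology` and `target_vocabulary`, the toy sentence pair, and the hand-written Hindi forms of "boy" are illustrative assumptions, not the paper's actual procedure or data.

```python
from typing import List, Set, Tuple

# A parallel corpus is modelled here as a list of (source, target) pairs.
ParallelCorpus = List[Tuple[str, str]]


def inject_morphology(corpus: ParallelCorpus,
                      form_pairs: ParallelCorpus) -> ParallelCorpus:
    """Return a copy of `corpus` with unseen word-form pairs appended.

    Hypothetical helper: the paper's real enrichment step is not shown
    in the abstract, so this only illustrates the general idea.
    """
    seen = set(corpus)
    enriched = list(corpus)
    for pair in form_pairs:
        if pair not in seen:  # skip pairs already present in the corpus
            enriched.append(pair)
            seen.add(pair)
    return enriched


def target_vocabulary(corpus: ParallelCorpus) -> Set[str]:
    """Collect the whitespace-separated target-side tokens."""
    return {token for _, target in corpus for token in target.split()}


# Toy data (assumed for illustration): one sentence pair plus
# hand-written inflected forms of the Hindi noun for "boy".
corpus = [("the boy came", "लड़का आया")]
form_pairs = [
    ("boy", "लड़का"),    # direct singular
    ("boys", "लड़के"),   # direct plural
    ("boys", "लड़कों"),  # oblique plural
]

enriched = inject_morphology(corpus, form_pairs)
print(len(target_vocabulary(corpus)), len(target_vocabulary(enriched)))
```

In this toy setting the enriched corpus exposes the translation model to target word forms that never occur in the original sentences, which is the effect the abstract attributes to morphology injection.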

Availability of data and materials

The lexical resources created in this work are freely available under a Creative Commons license.

Notes

  1. http://www.statmt.org/moses/.

  2. https://hlt.fbk.eu/technologies/irstlm-irst-languagemodelling-toolkit.

  3. http://nlp.stanford.edu/software/tagger.shtml.

Acknowledgements

The author would like to thank Prof. Pushpak Bhattacharyya for his guidance during this work, and the Department of Science & Technology, Govt. of India, for providing funding under the Woman Scientist Scheme (WOS-A), Project Code SR/WOS-A/ET/1075/2014. The author also acknowledges her own associated works published in ACM and ACL web.

Funding

This work was funded by the Department of Science & Technology, Govt. of India, under the Woman Scientist Scheme (WOS-A), Project Code SR/WOS-A/ET/1075/2014.

Author information

Contributions

The first author is the sole author of this work.

Corresponding author

Correspondence to Sreelekha S.

Ethics declarations

Conflict of interest

The author declares that there is no conflict of interest associated with this work.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sreelekha, S. Morphology generation for English-Indian language statistical machine translation. Soft Comput 25, 3657–3664 (2021). https://doi.org/10.1007/s00500-020-05393-7

  • DOI: https://doi.org/10.1007/s00500-020-05393-7
