Domain-specific machine translation with recurrent neural network for software localization

Empirical Software Engineering

Abstract

Software localization is the process of adapting a software product to the linguistic, cultural and technical requirements of a target market. It allows software companies to access foreign markets that would otherwise be difficult to penetrate. Many studies have addressed locating need-to-translate strings in software and adapting the UI layout after the text has been translated into the new language. However, no work has been done on the most important and time-consuming step of the software localization process, i.e., the translation of software text. Software text has unique characteristics, such as application-specific meanings, context-sensitive translation, and domain-specific rare words, so general machine translation tools such as Google Translate cannot properly handle the linguistic and technical nuances of translating software text for localization. In this paper, we propose a neural-network-based translation model specifically designed and trained for mobile application text translation. We collect a large-scale set of human-translated bilingual sentence pairs from Android applications crawled from the Google Play store. We customize the original RNN encoder-decoder neural machine translation model by adding categorical information to address the domain-specific rare word problem, which is a common phenomenon in software text. We evaluate our approach on the text of test Android applications using both BLEU score and exact match rate. The results show that our method outperforms the general machine translation tool, Google Translate, and generates more acceptable translations for software localization that need less human revision. Our approach is language independent, and we show its generality between English and the other five official languages of the United Nations (UN).
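The two evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names, the whitespace tokenization, the single reference per hypothesis, and the unsmoothed corpus-level BLEU up to 4-grams are all assumptions made for illustration, and the paper's exact BLEU configuration is not given here.

```python
import math
from collections import Counter


def exact_match_rate(hypotheses, references):
    """Fraction of hypotheses whose tokens equal the reference tokens.

    Assumption: whitespace tokenization, one reference per hypothesis.
    """
    pairs = list(zip(hypotheses, references))
    matches = sum(1 for hyp, ref in pairs if hyp.split() == ref.split())
    return matches / len(pairs) if pairs else 0.0


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (hypothetical helper): clipped n-gram precision,
    geometric mean over orders 1..max_n, brevity penalty, no smoothing."""
    clipped = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            clipped[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            total[n - 1] += sum(h_counts.values())
    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(clipped) == 0:
        return 0.0
    log_precision = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    hyps = ["open the settings menu", "delete file"]
    refs = ["open the settings menu", "remove file"]
    print(exact_match_rate(hyps, refs))
    print(corpus_bleu(hyps, refs))
```

Software strings are often only a few words long, so an unsmoothed 4-gram BLEU can be zero for a corpus of very short strings even when most words are right. This may be one reason to report exact match rate alongside BLEU.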

Notes

  1. https://translate.google.com

  2. https://www.bing.com/translator

  3. Note that “domain-specific” in this work refers to the domain of software engineering, not to the app category.

  4. Although Google Play distinguishes detailed game categories such as cards, racing, and puzzle, we treat them as one game category.

  5. https://developer.android.com/guide/topics/resources/localization

  6. We do not use cross validation for evaluation as the training process takes a long time on our PC.

  7. https://www.transifex.com/

  8. https://crowdin.com/

  9. https://www.smartling.com/

  10. This limits the scale of our experiment, because using the Google Translate API for large-scale translation is a paid service (https://cloud.google.com/translate/v2/pricing).

  11. https://translate.google.com/

  12. http://fanyi.youdao.com/

  13. The detailed checking results can be found in https://sites.google.com/view/domainspecifictranslation/

  14. https://en.wikipedia.org/wiki/World_population

  15. https://en.wikipedia.org/wiki/English-speaking_world

Author information

Corresponding author

Correspondence to Chunyang Chen.

Additional information

Communicated by: David Lo, Meiyappan Nagappan, Fabio Palomba and Sebastian Panichella

Cite this article

Wang, X., Chen, C. & Xing, Z. Domain-specific machine translation with recurrent neural network for software localization. Empir Software Eng 24, 3514–3545 (2019). https://doi.org/10.1007/s10664-019-09702-z
