Augmenting SMT with Semantically-Generated Virtual-Parallel Corpora from Monolingual Texts

  • Conference paper
Trends and Advances in Information Systems and Technologies (WorldCIST'18 2018)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 745)

Abstract

Several natural languages have been processed extensively, yet the problem of limited textual linguistic resources remains. Manual creation of parallel corpora is expensive and time-consuming, and the language data required for statistical machine translation (SMT) do not exist in quantities large enough for their statistics to support initial research. Known approaches to building parallel resources from multiple sources, such as comparable or quasi-comparable corpora, are complicated and produce rather noisy output that needs further processing and in-domain adaptation. To optimize the performance of comparable-corpora mining algorithms, a high-quality parallel corpus is essential for training a good data classifier. In this research, we developed a methodology for generating an accurate parallel corpus (Czech-English) from monolingual resources by calculating the compatibility between the results of three machine translation systems. We translated large single-language resources with multiple translation systems and strictly measured translation compatibility using rules based on the Levenshtein distance. The results produced by this approach were very favorable: the generated corpora improved the quality of SMT systems and appear useful for many other natural language processing tasks.
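The agreement filter described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the length-based normalization, the 0.8 threshold, the all-pairs acceptance rule, and the choice to keep the first system's output are all assumptions made for illustration, and the translation systems are stand-in callables.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute), computed row by row."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def translations_agree(translations, threshold=0.8):
    """Assumed rule: every pair of system outputs must reach the threshold."""
    return all(similarity(x, y) >= threshold
               for x, y in combinations(translations, 2))


def build_virtual_corpus(source_sentences, systems, threshold=0.8):
    """Keep a (source, translation) pair only when all systems agree.

    `systems` is a list of hypothetical callables mapping a source sentence
    to its translation; the first system's output is kept (an assumption).
    """
    corpus = []
    for sentence in source_sentences:
        outputs = [translate(sentence) for translate in systems]
        if translations_agree(outputs, threshold):
            corpus.append((sentence, outputs[0]))
    return corpus
```

A stricter threshold trades corpus size for precision: fewer sentence pairs survive, but those that do are backed by closer agreement between the systems.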

Corresponding author

Correspondence to Krzysztof Wołk.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Wołk, K., Wołk, A. (2018). Augmenting SMT with Semantically-Generated Virtual-Parallel Corpora from Monolingual Texts. In: Rocha, Á., Adeli, H., Reis, L.P., Costanzo, S. (eds) Trends and Advances in Information Systems and Technologies. WorldCIST'18 2018. Advances in Intelligent Systems and Computing, vol 745. Springer, Cham. https://doi.org/10.1007/978-3-319-77703-0_37

  • DOI: https://doi.org/10.1007/978-3-319-77703-0_37

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-77702-3

  • Online ISBN: 978-3-319-77703-0

  • eBook Packages: Engineering, Engineering (R0)
