Slavic languages in phrase-based statistical machine translation: a survey

Abstract

The demand for translations is increasing at a rate far beyond the capacity of professional translators. Translating everything from scratch into each language is too difficult, time-consuming and expensive. Machine translation offers a solution, as it produces translations automatically. Until recently, statistical machine translation was one of the most successful approaches; however, a new approach based on neural networks has emerged with promising results. The present paper concerns phrase-based statistical machine translation, an area that has been extensively studied in the literature. The translation system consists of many components built on probabilistic models, and each component is described separately. Although high-quality translation systems have been developed for certain language pairs, a large number of languages still cause many translation errors. Languages with rich morphology pose an especially difficult challenge for research. We address one group of morphologically rich languages: Slavic languages, which constitute a relatively homogeneous family characterized by rich, inflectional morphology. The present paper offers a comprehensive survey of approaches to coping with Slavic languages in different aspects of statistical machine translation. We observe that the community's interest in research on more difficult languages is increasing, and we believe that translation quality for those languages will reach the level of practical use in the near future.



Notes

  1. http://www.meta-net.eu/whitepapers/key-results-and-cross-language-comparison


    MathSciNet  Article  Google Scholar 

  144. Popović M, Arčan M, Avramidis E, Burchardt A, Lommel AR (2015) Poor man’s lemmatisation for automatic error classification. In: The eighteenth annual conference of the European association for machine translation (EAMT 15), pp 105–112

  145. Prochazka V, Pollak P, Zdansky J, Nouza J (2011) Performance of Czech speech recognition with language models created from public resources. Radioengineering 20(4):1002–1008

    Google Scholar 

  146. Rishøj C, Søgaard A (2011) Factored translation with unsupervised word clusters. In: Proceedings of the 6th workshop on statistical machine translation. Association for Computational Linguistics, Edinburgh, Scotland, pp 447–451

  147. Rosa R, Mareček D, Dušek O (2012) DEPFIX: a system for automatic correction of Czech MT outputs. In: Proceedings of the seventh workshop on statistical machine translation. Association for Computational Linguistics, Montreal, Canada, WMT ’12, pp 362–368

  148. Rosa R, Sudarikov R, Novák M, Popel M, Bojar O (2016) Dictionary-based domain adaptation of MT systems without retraining. In: Proceedings of the first conference on machine translation. Association for Computational Linguistics, Berlin, Germany, pp 449–455

  149. Rotovnik T, Maučec MS, Kačič Z (2007) Large vocabulary continuous speech recognition of an inflected language using stems and endings. Speech Commun 49(6):437–452

    Article  Google Scholar 

  150. Ruth J, O’Regan J (2011) Shallow-transfer rule-based machine translation from Czech to Polish. In: Proceedings of the second international workshop on free/open-source rule-based machine translation, pp 69–76

  151. Salehi B, Cook P, Baldwin T (2014) Using distributional similarity of multi-way translations to predict multiword expression compositionality. In: Proceedings of the 14th conference of the european chapter of the association for computational linguistics. Association for Computational Linguistics, Gothenburg, Sweden, pp 472–481

  152. Schwenk H, Rousseau A, Attik M (2012) Large, pruned or continuous space language models on a GPU for statistical machine translation. In: Proceedings of the conference of the North American chapter of the association for computational linguistics: human language technologies (NAACL HLT). Atlanta, Georgia, pp 11–19

  153. Seeker W, Kuhn J (2013) Morphological and syntactic case in statistical dependency parsing. Comput Linguist 39:23–55

    Article  Google Scholar 

  154. Sennrich R (2015) Modelling and optimizing on syntactic N-grams for statistical machine translation. Trans Assoc Computat Linguist 3:169–182

    Article  Google Scholar 

  155. Sennrich R, Haddow B, Birch A (2016a) Edinburgh neural machine translation systems for WMT 16. In: Proceedings of the first conference on machine translation. Association for Computational Linguistics, pp 371–376

  156. Sennrich R, Haddow B, Birch A (2016b) Neural machine translation of rare words with subword units. In: Proceedings of the 54th annual meeting of the association for computational linguistics, pp 1715–1725

  157. Shaik MAB, Mousa AED, Schüter R, Ney H (2011) Using morpheme and syllable based sub-words for Polish LVCSR. In: Proceedings of ICASSP, pp 4680–4683

  158. Shalonova K, Golénia B, Flach P (2009) Towards learning morphology for under-resourced fusional and agglutinating languages. IEEE/ACM Trans Audio Speech Lang Process 17(5):956–965

    Article  Google Scholar 

  159. Shin E, Stüker S, Kilgour K, Fügen C, Waibel A (2013) Maximum entropy language modeling for Russian ASR. In: Proceedings of the 10th international workshop on spoken language translation, Heidelberg, Germany, pp 288–294

  160. Simova I, Kordoni V (2013) Improving English-Bulgarian statistical machine translation by phrasal verb treatment. In: Workshop on multi-word units in machine translation and translation technologies, pp 62–71

  161. Slawik I, Niehues J, Waibel A (2015) Stripping adjectives: integration techniques for selective stemming in SMT systems. In: The eighteenth annual conference of the European association for machine translation (EAMT 15), pp 105–112

  162. Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation error rate with targeted human annotation. In: 5th conference of the association for machine translation in the Americas (AMTA), Boston, Massachusetts

  163. Son LH, Allauzen A, Yvon F (2012) Continuous space translation models with neural networks. In: Proceedings of the conference of the North American chapter of the association for computational linguistics: human language technologies, pp 39–48

  164. Stanojević M, Sima’an K (2014) BEER: BEtter evaluation as ranking. In: Proceedings of the ninth workshop on statistical machine translation. Association for Computational Linguistics, Baltimore, Maryland, USA, pp 414–419

  165. Tamchyna A, Bojar O (2015) What a transfer-based system brings to the combination with PBMT. In: Proceedings of the ACL 2015 fourth workshop on hybrid approaches to translation (HyTra). Association for Computational Linguistics, Beijing, China, pp 11–20

  166. Tiedemann J (2012) Character-based pivot translation for under-resourced languages and domains. In: Proceedings of the 13th conference of the European chapter of the association for computational linguistics (EACL 2012), The Association for Computational Linguistics, pp 141–151

  167. Tiedemann J, Agić Ž, Nivre J (2014) Treebank translation for cross-lingual parser induction. In: Proceedings of the eighteenth conference on computational natural language learning (CoNLL). Avignon, France, pp 130–140

  168. Tillmann C (2004) A unigram orientation model for statistical machine translation. In: Proceedings of HLT-NAACL 2004: short papers. Association for Computational Linguistics, Boston, Massachusetts, pp 101–104

  169. Tillmann C, Hewavitharana S (2013) A unified alignment algorithm for bilingual data. Nat Lang Eng 19(1):33–60

    Article  Google Scholar 

  170. Toral A, Pecina P, Wang L, van Genabith J (2015) Linguistically-augmented perplexity-based data selection for language models. Comput Speech Lang 32:11–26

    Article  Google Scholar 

  171. Toutanova K, Suzuki H, Ruopp A (2008) Applying morphology generation models to machine translation. Proc ACL. Association for Computational Linguistics, Columbus, pp 514–522

    Google Scholar 

  172. Tran K, Bisazza A, Monz C (2014) Word translation prediction for morphologically rich languages with bilingual neural networks. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 1676–1688

  173. Tsvetkov Y, Dyer C, Levin L, Bhatia A (2013) Generating English determiners in phrase-based translation with synthetic translation options. In: Proceedings of the eighth workshop on statistical machine translation. Sofia, Bulgaria, pp 271–280

  174. Vaswani A, Huang L, Chiang D (2012) Smaller alignment models for better translations: unsupervised word alignment with the l0-norm. In: Proceedings of the 50th annual meeting of the association for computational linguistics, pp 311–319

  175. Vazhenina D, Markov K (2013) Factored language modeling for Russian LVCSR. In: Proceedings of the international joint conference on awareness science and technology & ubi-media computing, pp 205–210

  176. Vidhu Bhala RV, Abirami S (2014) Trends in word sense disambiguation. Artif Intell Rev 42(2):159–171

    Article  Google Scholar 

  177. Virpioja S, Väyrynen J, Mansikkaniemi A, Kurimo M (2010) Applying morphological decomposition to statistical machine translation. In: Proceedings of the joint fifth workshop on statistical machine translation and metrics MATR. Uppsala University, Uppsala, Sweden, pp 195–200

  178. Wang L, Wong DF, Chao LS, Lu Y, Xing J (2014) A systematic comparison of data selection criteria for SMT domain adaptation. Sci World J 2014

  179. Wang R, Osenova P, Simov K (2012) Linguistically-augmented Bulgarian-to-English statistical machine translation model. IN: Proceedings of the joint workshop on exploiting synergies between information retrieval and machine translation (ESIRMT) and hybrid approaches to machine translation (HyTra). Association for Computational Linguistics, Avignon, France, pp 119–128

  180. Wang R, Zhao H, Lu BL (2015) Bilingual continuous-space language model growing for statistical machine translation. IEEE/ACM Trans Audio Speech Lang Process 23(7):1209–1220

    Article  Google Scholar 

  181. Wang R, Utiyama M, Goto I, Sumita E, Zhao H, Lu BL (2016) Converting continuous-space language models into N-gram language models with efficient bilingual pruning for statistical machine translation. ACM Trans Asian Low-Resour Lang Inf Process 15(3):11:1–11:26

    Article  Google Scholar 

  182. Weller M, Kisselew M, Smekalova S, Fraser A, Schmid H, Durrani N, Sajjad H, Farkas R (2013) Munich-Edinburgh-Stuttgart submissions at WMT13: morphological and syntactic processing for SMT. In: Proceedings of the eighth workshop on statistical machine translation. Association for Computational Linguistics, Sofia, Bulgaria, pp 232–239

  183. Williams P, Sennrich R, Post M, Koehn P (2016) Syntax-based statistical machine translation. Morgan & Claypool, San Rafael

    Google Scholar 

  184. Wołk K, Marasek K (2013) Polish - English speech statistical machine translation systems for the IWSLT 2013. In: Proceedings of the international workshop on spoken language translation (IWSLT), Heidelberg, Germany

  185. Wołk K, Marasek K (2014a) Enhanced bilingual evaluation understudy. In: Proceedings of the 11th international workshop on spoken language translation (IWSLT), Lake Tahoe, pp 191–197

  186. Wołk K, Marasek K (2014b) Polish - English speech statistical machine translation systems for the IWSLT 2014. In: Proceedings of the international workshop on spoken language translation (IWSLT), Lake Tahoe, pp 143–149

  187. Wołk K, Marasek K (2015a) Neural-based machine translation for medical text domain. Based on European Medicines Agency leaflet texts. Procedia Comput Sci 64:2–9

    Article  Google Scholar 

  188. Wołk K, Marasek K (2015b) PJAIT systems for the IWSLT 2015 evaluation campaign enhanced by comparable corpora. In: Proceedings of the international workshop on spoken language translation (IWSLT), Da Nang, Vietnam, pp 101–104

  189. Wołk K, Marasek K, Glinkowski W (2015a) Telemedicine as a special case of the machine translation. Comput Med Imaging Graph 46:249–256

    Article  Google Scholar 

  190. Wołk K, Rejmund E, Marasek K (2015b) Harvesting comparable corpora and mining them for equivalent bilingual sentences using statistical classification and analogy-based heuristics. In: Proceedings of the international symposium on methodologies for intelligent systems (ISMIS), pp 433–441

  191. Wróblewska A (2011) Polish-English word alignment: preliminary study. Emerg Intell Technol Ind 369:123–132

    Google Scholar 

  192. Wu X, Yu H, Liu Q (2014) RED: DCU-CASICT participation in WMT2014 metrics task. In: Proceedings of the ninth workshop on statistical machine translation. Association for Computational Linguistics, Baltimore, Maryland, USA, pp 420–425

  193. Xiong D, Zhang M (2015) Backward and trigger-based language models for statistical machine translation. Nat Lang Eng 21(2):201–226

    MathSciNet  Article  Google Scholar 

  194. Žabokrtský Z, Ptáček J, Pajas P (2008) TectoMT: Highly modular MT system with tectogrammatics used as transfer layer. In: Proceedings of the third workshop on statistical machine translation. Association for Computational Linguistics, Columbus, Ohio, USA, pp 167–170

  195. Zeman D, Fishel M, Berka J, Bojar O (2011) Addicter: What is wrong with my translations? Prague Bull Math Linguist 96:79–88

    Article  Google Scholar 

  196. Zens R, Ney H (2006) Discriminative reordering models for statistical machine translation. In: Proceedings of the workshop on statistical machine translation, New York City, pp 55–63

Download references

Acknowledgements

The authors would like to thank the editor and anonymous reviewers for their helpful and constructive comments that greatly contributed to improving the paper. Funding was provided by Javna Agencija za Raziskovalno Dejavnost RS (Grant Nos. P2-0069, P2-0041).

Corresponding author

Correspondence to Mirjam Sepesy Maučec.

Cite this article

Maučec, M.S., Brest, J. Slavic languages in phrase-based statistical machine translation: a survey. Artif Intell Rev 51, 77–117 (2019). https://doi.org/10.1007/s10462-017-9558-2

Keywords

  • Statistical machine translation
  • Morphology
  • Slavic language
  • Inflection
  • Free word order