Slavic languages in phrase-based statistical machine translation: a survey

Maučec, Mirjam Sepesy; Brest, Janez

doi:10.1007/s10462-017-9558-2

Published: 06 May 2017

Volume 51, pages 77–117 (2019)

Artificial Intelligence Review

Abstract

The demand for translations is increasing at a rate far beyond the capacity of professional translators. It is too difficult, time-consuming and expensive to translate everything from scratch in each language. Machine translation offers a solution, as it produces translations automatically. Until recently, statistical machine translation was one of the most successful approaches. However, a new approach to machine translation based on neural networks has emerged with promising results. The present paper concerns phrase-based statistical machine translation, an area that has been extensively studied in the literature. A phrase-based translation system consists of many components, each built on probabilistic models, and each component is described separately. Although high-quality translation systems have been developed for certain language pairs, there are still many languages for which translation produces many errors. Languages with a rich morphology pose an especially difficult challenge for research. We address one group of morphologically rich languages: Slavic languages, which constitute a relatively homogeneous family of languages characterized by rich, inflectional morphology. The present paper offers a comprehensive survey of approaches to coping with Slavic languages in different aspects of statistical machine translation. We observe that the community's interest in research on more difficult languages is increasing, and we believe that the translation quality for those languages will reach the level of practical use in the near future.

Notes

http://www.meta-net.eu/whitepapers/key-results-and-cross-language-comparison.

References

Agić Ž, Merkler D, Berović D (2013) Parsing Croatian and Serbian by using Croatian dependency treebanks. In: Proceedings of the fourth workshop on statistical parsing of morphologically-rich languages. Seattle, Washington, USA, pp 22–33
Alumäe T, Kurimo M (2010) Efficient estimation of maximum entropy language models with N-gram features: an SRILM extension. In: Proceedings of Interspeech 2010. Chiba, Japan, pp 1820–1823
Arčan M, Popović M, Buitelaar P (2016) Asistent A machine translation system for Slovene, Serbian and Croatian. In: Proceedings of the conference on language technologies & digital humanities. Ljubljana, Slovenia, pp 13–20
Avramidis E, Koehn P (2008) Enriching morphologically poor languages for statistical machine translation. In: Proceedings of ACL-08: HLT. Association for Computational Linguistics, Columbus, Ohio, pp 763–770
Baerman M (2015) The Oxford handbook of inflection. Oxford University Press, Oxford
Book Google Scholar
Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473
Banerjee S, Lavie A (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL 2005 workshop on intrinsic and extrinsic evaluation measures for MT and/or summarization, pp 65–72
Bertoldi N, Haddow B, Fouet JB (2010) Improved minimum error rate training in Moses. Prague Bull Math Linguist 91:7–16
Google Scholar
Bilmes JA, Kirchhoff K (2003) Factored language models and generalized parallel backoff. In: Proceedings of the 2003 conference of the North American chapter of the association for computational linguistics on human language technology: companion volume of the proceedings of HLT-NAACL 2003-short papers, vol 2. Association for Computational Linguistics, Edmonton, Canada, pp 4–6
Bisazza A, Monz C (2014) Class-based language modeling for translating into morphologically rich languages. In: Proceedings of COLING 2014, the 25th international conference on computational linguistics: technical papers. Dublin City University and Association for Computational Linguistics, Dublin, Ireland, pp 1918–1927
Bohnet B, Nivre J, Boguslavsky IM, Farkas R, Ginter F, Hajič J (2013) Joint morphological and syntactic analysis for richly inflected languages. Trans Assoc Comput Linguist 1:429–440
Article Google Scholar
Bojar O (2007) English-to-Czech factored machine translation. In: Proceedings of the second workshop on statistical machine translation. Prague, Czech Republic, Association for Computational Linguistics, pp 232–239
Bojar O (2011) Analyzing error types in English-Czech machine translation. Prague Bull Math Linguist 95:63–76
Article Google Scholar
Bojar O, Čmejrek M (2007) Mathematical model of tree transformations. Public deliverable D3.2, EuroMatrix, IST-034291
Bojar O, Hajič J (2008) Phrase-based and deep syntactic English-to-Czech statistical machine translation. In: Proceedings of the third workshop on statistical machine translation. Association for Computational Linguistics, Columbus, Ohio, USA, pp 143–146
Bojar O, Kos K (2010) 2010 Failures in English-Czech phrase-based MT. In: Proceedings of the joint fifth workshop on statistical machine translation and metrics (MATR). Association for Computational Linguistics, Uppsala, Sweden, pp 60–66
Bojar O, Prokopová M (2006) Czech-English word alignment. In: Proceedings of the international conference on language resources and evaluation, pp 1236–1239
Bojar O, Tamchyna A (2011) Forms wanted: training SMT on monolingual data. Abstract at machine translation and morphologically-rich languages. In: Research workshop of the Israel Science Foundation University of Haifa, Israel
Bojar O, Wu D (2012) Towards a predicate-argument evaluation for MT. In: Proceedings of the sixth workshop on syntax, semantics and structure in statistical translation (SSST). Jeju, Republic of Korea, Association for Computational Linguistics, pp 30–38
Bojar O, Zeman D (2014) Czech machine translation in the project CzechMATE. Prague Bull Math Linguist 101:71–96
Article Google Scholar
Bojar O, Matusov E, Ney H (2006) Czech-English phrase-based machine translation. In: Proceedings of the 5th international conference on NLP (FinTAL). Turku, Finland, pp 214–224
Bojar O, Kos K, Mareček D (2010) Tackling sparse data issue in machine translation evaluation. In: Proceedings of the ACL 2010 conference short papers. Association for Computational Linguistics, Uppsala, Sweden, pp 86–91
Bojar O, Jawaid B, Kamran A (2012) Probes in a taxonomy of factored phrase-based models. In: Proceedings of the 7th workshop on statistical machine translation. Association for Computational Linguistics, Montréal, Canada, pp 253–260
Bojar O, Macháček M, Tamchyna A, Zeman D (2013a) Scratching the surface of possible translations. In; Proceedings of the 16th international conference text. Plzeň, Czech Republic, Speech and Dialogue, pp 465–474
Bojar O, Rosa R, Tamchyna A (2013b) Chimera—three heads for English-to-Czech translation. In: Proceedings of the eighth workshop on statistical machine translation. Association for Computational Linguistics, Sofia, Bulgaria, pp 92–98
Bojar O, Chatterjee R, Federmann C, Graham Y, Haddow B, Huck M, Jimeno Yepes A, Koehn P, Logacheva V, Monz C, Negri M, Neveol A, Neves M, Popel M, Post M, Rubino R, Scarton C, Specia L, Turchi M, Verspoor K, Zampieri M (2016) Findings of the 2016 conference on machine translation. In: Proceedings of the first conference on machine translation. Association for Computational Linguistics, Berlin, Germany, pp 131–198
Botha JA, Blunsom P (2014) Compositional morphology for word representations and language modelling. In: Proceedings of the 31st international conference on machine learning. Beijing, China, pp 1899–1907
Brown PF, Pietra SAD, Pietra VJD, Mercer RL (1993) The mathematics of statistical machine translation: parameter estimation. Comput Linguist 19(2):263–311
Google Scholar
Brychcín T, Konopík M (2011) Morphological based language models for inflectional languages. IN: The 6th IEEE international conference on intelligent data acquisition and advanced computing systems: technology and applications. Czech Republic, Prague, pp 560–564
Brychcín T, Konopík M (2015) HPS: high precision stemmer. Inf Process Manag 51(1):68–91
Article Google Scholar
Burlot F, Yvon F (2015) Morphology-aware alignments for translation to and from a synthetic language. In: Proceedings of the 12th international workshop on spoken language translation, Da Nang, Vietnam, pp 188–195
Cettolo M, Niehues J, Stker S, Bentivogli L, Cattoni R, Federico M (2015) The IWSLT 2015 evaluation campaign. In: Proceedings of the international workshop on spoken language translation (IWSLT), Da Nang, Vietnam, pp 2–14
Chahuneau V, Schlinger E, Smith NA, Dyer C (2013) Translating into morphologically rich languages with synthetic phrases. In: Proceedings of the 2013 conference on empirical methods in natural language processing. Seattle, Washington, USA, pp 1677–1687
Chahuneau V, Smith NA, Dyer C (2013b) Knowledge-rich morphological priors for Bayesian language models. In: Proceedings of NAACL-HLT. Atlanta, Georgia, pp 1206–1215
Chen SF, Goodman J (1998) An empirical study of smoothing techniques for language modelling. Technical Report TR-10-98, Computer Science Group, Harvard University
Cho K, Van Merriënboer B, Gülçehre Ç, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN Encoder-Decoder for statistical machine translation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 1724–1734
Cholakov K, Kordoni V (2014) Better statistical machine translation through linguistic treatment of phrasal verbs. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 196–201
Chung J, Cho K, Bengio Y (2016) NYU-MILA neural machine translation systems for WMT16. In: Proceedings of the first conference on machine translation. Association for Computational Linguistics, Berlin, Germany, pp 268–271
Costa-jussà MR (2015a) How much hybridization does machine translation need? J Assoc Inf Sci Technol 6(10):2160–2165
Article Google Scholar
Costa-jussà MR (2015b) Latest trends in hybrid machine translation and its applications. Comput Speech Lang 32(1):3–10
Article Google Scholar
Denkowski M, Lavie A (2014) Meteor universal: language specific translation evaluation for any target language. In: Proceedings of the EACL 2014 workshop on statistical machine translation. Baltimore, Maryland, USA, pp 376–380
Ding S, Duh K, Khayrallah H, Koehn P, Post M (2016) The JHU machine translation systems for WMT 2016. In: Proceedings of the first conference on machine translation. Association for Computational Linguistics, Berlin, Germany, pp 272–280
Donaj G, Kačič Z (2016) Language modeling for automatic speech recognition of inflective languages: an applications-oriented approach using lexical data. Springer, London
Google Scholar
Dove C, Loskutova O, de la Fuente R (2012) What’s your pick: RbMT, SMT or hybrid? In: Proceedings of 11th conference of the associationfor machine translation in the Americas (AMTA), San Diego, CA
Dugonik J, Bošković B, Maučec MS, Brest J (2014) The usage of differential evolution in a statistical machine translation. In: Proceedings of the IEEE symposium series on computational intelligence (SSCI). Orlando, Florida, USA, pp 89–96
Durrani N, Sajjad H (2014) Integrating an unsupervised transliteration model into statistical machine translation. In: Proceedings of the 14th conference of the European chapter of the association for computational linguistics. Gothenburg, Sweden, pp 148–153
Durrani N, Schmid H, Fraser A (2011) A joint sequence translation model with integrated reordering. In: Proceedings of the 49th annual meeting of the association for computational linguistics (ACL-HLT). Portland, Oregon, USA, pp 1045–1054
Durrani N, Fraser A, Schmid H, Hoang H, Koehn P (2013) Can Markov models over minimal translation units help phrase-based SMT? In: Proceedings of the 51st annual conference of the association for computational linguistics (ACL). Sofia, Bulgaria, pp 399–405
Durrani N, Koehn P, Schmid H, Fraser A (2014) Investigating the usefulness of generalized word representations in SMT. In: Proceedings of the 25th annual conference on computational linguistics (COLING). Dublin, Ireland, pp 421–432
Durrani N, Schmid H, Fraser A, Koehn P, Schütze H (2015) The operation sequence model—combining N-gram-based and phrase-based statistical machine translation. Comput Linguist 41(2):185–214
Article MathSciNet Google Scholar
Dušek O, Žabokrtský Z, Popel M, Dušek M, Novák M, Mareček D (2012) Formemes in English-Czech deep syntactic MT. In: Proceedings of the 7th workshop on statistical machine translation. Association for Computational Linguistics, Montreal, Canada, pp 267–274
Dyer C, Chahuneau V, Smith NA (2013) A simple, fast, and effective reparameterization of IBM Model 2. In: Proceedings of NAACL. Atlanta, Georgia, USA, pp 644–648
Dzikiene JK, Nivre J, Krupavičius A (2013) Lithuanian dependency parsing with rich morphological features. In: Proceedings of the fourth workshop on statistical parsing of morphologically-rich languages, pp 12–21
Eisele A, Federmann C, Saint-Amand H, Jellinghaus M, Herrmann T, Chen Y (2008) Using Moses to integrate multiple rule-based machine translation engines into a hybrid system. In: Proceedings of the third workshop on statistical machine translation. Association for Computational Linguistics, Columbus, Ohio, USA, pp 179–182
Farrús M, Costa-jussà MR, Morse MP (2012) Study and correlation analysis of linguistic, perceptual, and automatic machine translation evaluations. J Am Soc Inf Sci Technol 63(1):174–184
Article Google Scholar
Federmann C, Hunsicker S (2011) Stochastic parse tree selection for an existing RBMT system. In: Proceedings of the 6th workshop on statistical machine translation. Association for Computational Linguistics, Edinburgh, Scotland, pp 351–357
Felice M, Specia L (2013) Investigating the contribution of linguistic information to quality estimation. Mach Transl 27:193–212
Article Google Scholar
Fishel M (2009) Deeper than words: morph-based alignment for statistical machine translation. In: Proceedings of the conference of the pacific association for computational linguistics (PacLing 2009), University of Hokkaido, Sapporo, Japan
Galuščáková P, Bojar O (2012) Improving SMT by using parallel data of a closely related language. In: Human Language Technologies—the Baltic Perspective—proceedings of the fifth international conference Baltic HLT 2012, IOS Press, Amsterdam, Netherlands, Frontiers in AI and Applications, vol 247, pp 58–65
Gao J, He X, tau Yih W, Deng L (2014) Learning continuous phrase representations for translation modeling. In: Proceedings of the 52nd annual meeting of the association for computational linguistics. Association for Computational Linguistics, pp 699–709
Gao Q, Vogel S (2008) Parallel implementations of word alignment tool. In: Proceedings of the workshop software engineering, testing, and quality assurance for natural language processing. Association for Computational Linguistics, pp 49–57
Gaudio R, Labaka G, Agirre E, Osenova P, Simov K, Popel M, Oele D, van Noord G, Gomes L, Ja António Rodrigues, Neale S, Ja Silva, Querido A, Rendeiro N, Branco A (2016) SMT and hybrid systems of the QTLeap project in the WMT16 IT-task. In: Proceedings of the first conference on machine translation. Association for Computational Linguistics, Berlin, Germany, pp 435–441
Genzel D (2010) Automatically learning source-side reordering rules for large scale machine translation. In: Proceedings of the 23rd international conference on computational linguistics. Association for Computational Linguistics, pp 376–384
Giménez J, Màrquez L (2010) Linguistic measures for automatic machine translation evaluation. Mach Transl 24:209–240
Article Google Scholar
Gimpel K, Smith NA (2014) Phrase dependency machine translation with quasi-synchronous tree-to-tree feature. Comput Linguist 40(2):349–401
Article Google Scholar
Goldwater S, McClosky D (2005) Improving statistical MT through morphological analysis. In: Proceedings of human language technology conference and conference on empirical methods in natural language processing (HLT/EMNLP). Vancouver, Canada, pp 676–683
Graham Y, van Genabith J (2010) Factor templates for factored machine translation models. In; Proceedings of the seventh international workshop on spoken language translation (IWSLT). France, Paris, pp 275–282
Green N (2011) Effects of noun phrase bracketing in dependency parsing and machine translation. In: Proceedings of the ACL 2011 student session. Association for Computational Linguistics, Portland, OR, USA, pp 69–74
Green S, DeNero J (2012) A class-based agreement model for generating accurately inflected translations. In: Proceedings of the 50th annual meeting of the association for computational linguistics. Jeju, Republic of Korea, Association for Computational Linguistics, pp 146–155
Hammarströ H, Borin L (2011) Unsupervised learning of morphology. Comput Linguist 37(2):309–350
Article MathSciNet Google Scholar
Hirsimäki T, Pylkkönen J, Kurimo M (2009) Importance of high-order N-gram models in morph-based speech recognition. IEEE/ACM Trans Audio Speech Lang Process 17(4):724–732
Article Google Scholar
Ho C, Azmi Murad MA, Doraisamy S, Abdul Kadir R (2014) Extracting lexical and phrasal paraphrases: a review of the literature. Artif Intell Rev 42(4):851–894
Article Google Scholar
Hoang C, Sima’an K (2014) Latent domain translation models in mix-of-domains haystack. In: COLING 2014, 25th international conference on computational linguistics, proceedings of the conference: technical papers, August 23–29, 2014. Dublin, Ireland, pp 1928–1939
Hoang T, Bojar O (2015) TmTriangulate: a tool for phrase table triangulation. Prague Bull Math Linguist 104:75–86
Article Google Scholar
Homola P, Kuboň V (2008) A hybrid machine translation system for typologically related languages. In: Proceedings of the 21st international florida-artificial-intelligence-research-society conference (FLAIRS), pp 227–228
Huet S, Manishina E, Lefevre F (2013) Factored machine translation systems for Russian-English. In: Proceedings of the eighth workshop on statistical machine translation. Sofia, Bulgaria, pp 154–157
Hunsicker S, Yu C, Federmann C (2012) Machine learning for hybrid machine translation. In: Proceedings of the seventh workshop on statistical machine translation, pp 312–316
Ircig P, Psutka JV, Psutka J (2009) Using morphological information for robust language modeling in Czech ASR system. IEEE/ACM Trans Audio Speech Lang Process 17(4):840–847
Article Google Scholar
Ircing P, Krbec P, Hajič J, Khudanpur S, Jelinek F, Psutka J, Byrne W (2001) On large vocabulary continuous speech recognition of highly inflectional language—Czech. In: Proceedings of the European conference on speech communication and technology (EUROSPEECH), pp 487–490
ISO 9:1995 (1995) Information and documentation transliteration of Cyrillic characters into Latin characters Slavic and non-Slavic languages. International Organization for Standardization
Jawaid B, Bojar O (2014) Two-step machine translation with lattices. In: Proceedings of the 9th international conference on language resources and evaluation (LREC 2014). Reykjavík, Iceland, pp 682–686
Jean S, Firat O, Cho K, Memisevic R, Bengio Y (2015) Montreal neural machine translation systems for WMT’15. In: Proceedings of the tenth workshop on statistical machine translation. Lisboa, Portugal, pp 134–140
Jeong M, Toutanova K, Suzuki H, Quirk C (2010) A discriminative lexicon model for complex morphology. In: The ninth conference of the association for machine translation in the Americas (AMTA). Association for Computational Linguistics
Joty S, Guzmán F, Màrquez L, Nakov P (2014) DiscoTK: using discourse structure for machine translation evaluation. In: Proceedings of the ninth workshop on statistical machine translation. Association for Computational Linguistics, Baltimore, Maryland, USA, pp 402–408
Juhár J, Staš J, Hládek D (2012) Recent progress in development of language model for Slovak large vocabulary continuous speech recognition. In: New technologies-trends, innovations and research, pp 261–276
Junczys-Dowmunt M, Szał A (2011) SyMGiza++: Symmetrized word alignment models for statistical machine translation. In: International joint conferences security and intelligent information systems (SIIS), pp 379–390
Junczys-Dowmunt M, Dwojak T, Sennrich R (2016) The AMU-UEDIN submission to the WMT16 news translation task: attention-based NMT models as feature functions in phrase-based SMT. In: Proceedings of the first conference on machine translation. Association for Computational Linguistics, Berlin, Germany, pp 319–325
Kalchbrenner N, Blunsom P (2013) Recurrent continuous translation models. In: Proceedings of the 2013 conference on empirical methods in natural language processing (EMNLP), pp 1700–1709
Katz SM (1987) Estimation of probabilities from sparse data for the language model component of a speech recogniser. IEEE Trans Acoust Speech Signal Process 35(3):400–401
Article MathSciNet Google Scholar
Kazi M, Salesky E, Thompson B, Ray J, Coury M, Shen W, Anderson T, Erdmann G, Gwinnup J, Young K, Ore B, Hutt M (2014) The MITLL-AFRL IWSLT 2014 MT System. In: Proceedings of the international workshop on spoken language translation (IWSLT), Lake Tahoe, pp 65–73
Kipyatkova I, Karpov A (2014) Study of Morphological factors of factored language models for Russian ASR. In: Proceedings of the 16th international conference speech and computer (SPECOM). Novi Sad, Serbia, pp 451–458
Kirchhoff K, Yang M, Duh K (2006) Machine translation of parliamentary proceedings using morpho-syntactic knowledge. In: Proceedings of the TC-STAR workshop on speech-to-speech translation
Kneser R, Ney H (1993) Improved clustering techniques for class-based statistical language modelling. In: Proceedings of third European conference on speech communication and technology. EUROSPEECH 1993, Berlin, Germany, pp 22–25
Koehn P (2011) Statistical machine translation. Cambridge University Press, Cambridge
MATH Google Scholar
Koehn P, Haddow B (2012) Interpolated backoff for factored translation models. In: Proceedings of the tenth conference of the association for machine translation in the Americas (AMTA)
Koehn P, Hoang H (2007) Factored translation models. In: Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL). Czech Republic, Scotland, Prague, pp 868–876
Koehn P, Och FJ, Marcu D (2003) Statistical phrase-based translation. In: Proceedings of the human language technology and North American Association for computational linguistics conference (HLT/NAACL). Czech Republic, Scotland, Prague, pp 48–54
Kolovratník D, Klyueva N, Bojar O (2009) Statistical machine translationrelated and unrelated languages. In: ITAT 2009 information technologies—applications and theory, Slovakia, pp 31–36
Kos K, Bojar O (2009) Evaluation of machine translation metrics for Czech as the target language. Prague Bull Math Linguist 92:135–147
Article Google Scholar
Kuboň V, Vičič J (2014) A comparison of MT Methods for closely related languages: a case study on Czech Slovak language pair. In: Proceedings of the conference language technology for closely related languages and language variants (LT4CloseLang), pp 92–98
Labaka G, España-Bonet C, Màrquez L, Sarasola K (2014) A hybrid machine translation architecture guided by syntax. Mach Transl 28(2):91–125
Article Google Scholar
Lembersky G, Ordan N, Wintner S (2012) Language models for machine translation: original vs. translated texts. Comput Linguist 38(4):799–825
Article MathSciNet Google Scholar
Lerner U, Petrov S (2013) Source-side classifier preordering for machine translation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP ’13). Seattle, Washington, USA, pp 513–523
Libovický J, Pecina P (2015) Tolerant BLEU: a submission to the WMT14 metrics task. In: Proceedings of the ninth workshop on statistical machine translation (SMT), pp 409–413
Lo C, Cherry C, Foster G, Stewart D, Islam R, Kazantseva A, Kuhn R (2016) NRC Russian-English machine translation system for WMT 2016. In: Proceedings of the first conference on machine translation. Association for Computational Linguistics, Berlin, Germany, pp 326–332
Luong MT, Socher R, Manning CD (2013) Better word representations with recursive neural networks for morphology. In: Proceedings of the seventeenth conference on computational natural language learning. Association for Computational Linguistics, Sofia, Bulgaria, pp 104–113
Macherey K, Dai AM, Talbot D, Popat AC, Och F (2011) Language-independent compound splitting with morphological operations. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, vol 1. Association for Computational Linguistics, Portland, Oregon, HLT ’11, pp 1395–1404
Majewski P (2008) Syllable based language model for large vocabulary continuous speech recognition of Polish. Proceedings of the 11th international conference text, speech and dialogue (TSD). Brno, Czech Republic, pp 397–401
Marasek K (2012) TED Polish-to-English translation system for the IWSLT 2012. In: Proceedings of the international workshop on spoken language translation (IWSLT), Hong Kong, pp 126–129
Mareček D, Rosa R, Galuščáková P, Bojar O (2011) Two-step translation with grammatical post-processing. In: Proceedings of the sixth workshop on statistical machine translation. Association for Computational Linguistics, Edinburgh, Scotland, WMT ’11, pp 426–432
Mariño JB, Banchs RE, Crego JM, de Gispert A, Lambert P, Fonollosa JAR, Costa-jussà MR (2006) N-gram-based machine translation. Comput Linguist 32(4):527–549
Article MathSciNet MATH Google Scholar
Maučec MS, Brest J (2010) Reduction of morpho-syntactic features in statistical machine translation of highly inflective language. Informatica 21(1):95–116
MATH Google Scholar
Maučec MS, Donaj G (2016) Morphosyntactic tags in statistical machine translation of highly inflectional language. In: Proceedings of the artificial intelligence and natural language conference (AINL FRUCT). Saint-Petersburg, Russia, pp 99–102
Maučec MS, Kačič Z, Verdonik D (2014) Statistical machine translation of subtitles for highly inflected language pair. Pattern Recogn Lett 46:96–103
Article Google Scholar
McDonald R, Nivre J (2011) Analyzing and integrating dependency parsers. Comput Linguist 37(1):197–230
Article Google Scholar
Mikolov T, Kopecký J, Burget L, Glembek O, Černocký JH (2009) Neural network based language models for highly inflected languages. In: Proceedings of the ICASSP, pp 4725–4728
Mikolov T, Yih W, Zweig G (2013) Linguistic regularities in continuous space word representations. In: Proceedings of the conference of the North American chapter of the association for computational linguistics: human language technologies (NAACL HLT). Atlanta, Georgia, pp 746–751
Miłkowski M (2012) The Polish language in the digital age, White Paper Series. Springer, Berlin
Google Scholar
Minkov E, Toutanova K, Suzuki H (2007) Generating complex morphology for machine translation. In: roceedings of the 45th annual meeting of the association of computational linguistics. Association for Computational Linguistics, Prague, Czech Republic, pp 128--135
Molchanov A, Bykov F (2016) PROMT translation systems for WMT 2016 translation tasks. In: Proceedings of the first conference on machine translation. Association for Computational Linguistics, Berlin, Germany, pp 339–343
Morchid M, Huet S, Dufour R (2014) Topic-based approach for post-processing correction of automatic translations. In: Proceedings of the 11th international workshop on spoken language translation, Lake Tahoe, pp 80–85
Müller T, Schuetze H, Schmid H (2012) A comparative investigation of morphological language modeling for the languages of the European Union. In: Human language technologies: conference of the North American chapter of the association of computational linguistics, proceedings, June 3–8, 2012. Montréal, Canada, pp 386–395
Munková D, Munk M (2014) An automatic evaluation of machine translation and Slavic languages. In: Proceedings of the 8th international conference on application of information and communication technologies (AICT-2014), Astana, pp 447–451
Munková D, Munk M (2015) Automatic evaluation of machine translation through the residual analysis. In: Proceedings of the 11th international conference advanced intelligent computing theories and applications. Fuzhou, China, pp 481–490
Niehues J, Herrmann T, Vogel S, Waibel A (2011) Wider context by using bilingual language models in machine translation. In: Proceedings of the sixth workshop on statistical machine translation. Association for Computational Linguistics, Edinburgh, Scotland, pp 198–206
Nivre J (2015) Towards a universal grammar for natural language processing. In: Gelbukh A (ed) Computational linguisticsand intelligent text processing. Springer, Berlin, pp 3–16
Chapter Google Scholar
Nivre J, Hall J, Nilsson J, Chanev A, Eryiğit G, Kübler S, Marinov S, Marsi E (2007) MaltParser: a language-independent system for data-driven dependency parsing. Nat Lang Eng 13(2):95–135
Article Google Scholar
Novák V, Žabokrtský Z (2007) Feature engineering in maximum spanning tree dependency parser. In: Proceedings of the 10th international conference on text. Pilsen, Czech Republic, Speech and Dialogue, pp 92–98
Novák V, Nedoluzhko A, Žabokrtský Z (2013) Translation of “it” in a deep syntax framework. In: Proceedings of the workshop on discourse in machine translation (DiscoMT). Association for Computational Linguistics, Sofia, Bulgaria, pp 51–59
Och FJ (2003) Minimum error rate training in statistical machine translation. In: Proceedings of the 41st annual meeting on association for computational linguistics, vol 1. Association for Computational Linguistics, Sapporo, Japan, pp 160–167
Och FJ, Ney H (2003a) A systematic comparison of various statistical alignment models. Comput Linguist 29(1):19–51
Article MATH Google Scholar
Och FJ, Ney H (2003b) The alignment template approach to statistical machine translation. Comput Linguist 30(4):417–449
Article MATH Google Scholar
Oparin I (2008) Language models for automatic speech recognition of inflectional languages. Ph.D. Dissertation, University of West Bohemia
Oparin I, Glembek O, Burget L, Černocký J (2008) Morphological random forests for language modeling of inflectional languages. In: Proceedings of the spoken language technology workshop, (IEEE). Goa, India, pp 189–192
Papineni K, Roukos S, Ward T, Zhu WJ (2004) BLEU: a method for automatic evaluation of machine translation. Tech. Rep. RC22176(W0109-022), IBM Research Report, IBM
Popel M, Žabokrtský Z (2010) TectoMT: Modular NLP framework. In: Proceedings of the 7th international conference on advances in natural language processing, Reykjavik, Iceland, IceTAL’10, pp 293–304
Popel M, Mareček D, Green N, Žabokrtský Z (2011) Influence of parser choice on dependency-based MT. In: Proceedings of the 6th workshop on statistical machine translation. Association for Computational Linguistics, Edinburgh, Scotland, UK, pp 433–439
Popović M (2011) Hjerson: an open source tool for automatic error classification of machine translation output. Prague Bull Math Linguist 96:59–68
Popović M (2015) chrF: character n-gram F-score for automatic MT evaluation. In: Proceedings of the tenth workshop on statistical machine translation. Association for Computational Linguistics, Lisbon, Portugal, pp 392–395
Popović M, Arčan M (2015) Identifying main obstacles for statistical machine translation of morphologically rich South Slavic languages. In: Proceedings of the eighteenth annual conference of the European association for machine translation (EAMT 15). Antalya, Turkey, pp 97–104
Popović M, Ljubešić N (2014) Exploring cross-language statistical machine translation for closely related South Slavic languages. In: Proceedings of the conference: language technology for closely related languages and language variants (LT4CloseLang). Association for Computational Linguistics, Doha, Qatar, pp 76–84
Popović M, Ney H (2004) Towards the use of word stems and suffixes for statistical machine translation. In: Proceedings of the 4th international conference on language resources and evaluation (LREC), Lisbon, Portugal, pp 1585–1588
Popović M, Ney H (2011) Towards automatic error analysis of machine translation output. Comput Linguist 37(4):657–688
Popović M, Arčan M, Avramidis E, Burchardt A, Lommel AR (2015) Poor man’s lemmatisation for automatic error classification. In: The eighteenth annual conference of the European association for machine translation (EAMT 15), pp 105–112
Prochazka V, Pollak P, Zdansky J, Nouza J (2011) Performance of Czech speech recognition with language models created from public resources. Radioengineering 20(4):1002–1008
Rishøj C, Søgaard A (2011) Factored translation with unsupervised word clusters. In: Proceedings of the 6th workshop on statistical machine translation. Association for Computational Linguistics, Edinburgh, Scotland, pp 447–451
Rosa R, Mareček D, Dušek O (2012) DEPFIX: a system for automatic correction of Czech MT outputs. In: Proceedings of the seventh workshop on statistical machine translation. Association for Computational Linguistics, Montreal, Canada, WMT ’12, pp 362–368
Rosa R, Sudarikov R, Novák M, Popel M, Bojar O (2016) Dictionary-based domain adaptation of MT systems without retraining. In: Proceedings of the first conference on machine translation. Association for Computational Linguistics, Berlin, Germany, pp 449–455
Rotovnik T, Maučec MS, Kačič Z (2007) Large vocabulary continuous speech recognition of an inflected language using stems and endings. Speech Commun 49(6):437–452
Ruth J, O’Regan J (2011) Shallow-transfer rule-based machine translation from Czech to Polish. In: Proceedings of the second international workshop on free/open-source rule-based machine translation, pp 69–76
Salehi B, Cook P, Baldwin T (2014) Using distributional similarity of multi-way translations to predict multiword expression compositionality. In: Proceedings of the 14th conference of the European chapter of the association for computational linguistics. Association for Computational Linguistics, Gothenburg, Sweden, pp 472–481
Schwenk H, Rousseau A, Attik M (2012) Large, pruned or continuous space language models on a GPU for statistical machine translation. In: Proceedings of the conference of the North American chapter of the association for computational linguistics: human language technologies (NAACL HLT). Atlanta, Georgia, pp 11–19
Seeker W, Kuhn J (2013) Morphological and syntactic case in statistical dependency parsing. Comput Linguist 39:23–55
Sennrich R (2015) Modelling and optimizing on syntactic N-grams for statistical machine translation. Trans Assoc Computat Linguist 3:169–182
Sennrich R, Haddow B, Birch A (2016a) Edinburgh neural machine translation systems for WMT 16. In: Proceedings of the first conference on machine translation. Association for Computational Linguistics, pp 371–376
Sennrich R, Haddow B, Birch A (2016b) Neural machine translation of rare words with subword units. In: Proceedings of the 54th annual meeting of the association for computational linguistics, pp 1715–1725
Shaik MAB, Mousa AED, Schlüter R, Ney H (2011) Using morpheme and syllable based sub-words for Polish LVCSR. In: Proceedings of ICASSP, pp 4680–4683
Shalonova K, Golénia B, Flach P (2009) Towards learning morphology for under-resourced fusional and agglutinating languages. IEEE/ACM Trans Audio Speech Lang Process 17(5):956–965
Shin E, Stüker S, Kilgour K, Fügen C, Waibel A (2013) Maximum entropy language modeling for Russian ASR. In: Proceedings of the 10th international workshop on spoken language translation, Heidelberg, Germany, pp 288–294
Simova I, Kordoni V (2013) Improving English-Bulgarian statistical machine translation by phrasal verb treatment. In: Workshop on multi-word units in machine translation and translation technologies, pp 62–71
Slawik I, Niehues J, Waibel A (2015) Stripping adjectives: integration techniques for selective stemming in SMT systems. In: The eighteenth annual conference of the European association for machine translation (EAMT 15), pp 105–112
Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation error rate with targeted human annotation. In: 5th conference of the association for machine translation in the Americas (AMTA), Boston, Massachusetts
Son LH, Allauzen A, Yvon F (2012) Continuous space translation models with neural networks. In: Proceedings of the conference of the North American chapter of the association for computational linguistics: human language technologies, pp 39–48
Stanojević M, Sima’an K (2014) BEER: BEtter evaluation as ranking. In: Proceedings of the ninth workshop on statistical machine translation. Association for Computational Linguistics, Baltimore, Maryland, USA, pp 414–419
Tamchyna A, Bojar O (2015) What a transfer-based system brings to the combination with PBMT. In: Proceedings of the ACL 2015 fourth workshop on hybrid approaches to translation (HyTra). Association for Computational Linguistics, Beijing, China, pp 11–20
Tiedemann J (2012) Character-based pivot translation for under-resourced languages and domains. In: Proceedings of the 13th conference of the European chapter of the association for computational linguistics (EACL 2012), The Association for Computational Linguistics, pp 141–151
Tiedemann J, Agić Ž, Nivre J (2014) Treebank translation for cross-lingual parser induction. In: Proceedings of the eighteenth conference on computational natural language learning (CoNLL). Avignon, France, pp 130–140
Tillmann C (2004) A unigram orientation model for statistical machine translation. In: Proceedings of HLT-NAACL 2004: short papers. Association for Computational Linguistics, Boston, Massachusetts, pp 101–104
Tillmann C, Hewavitharana S (2013) A unified alignment algorithm for bilingual data. Nat Lang Eng 19(1):33–60
Toral A, Pecina P, Wang L, van Genabith J (2015) Linguistically-augmented perplexity-based data selection for language models. Comput Speech Lang 32:11–26
Toutanova K, Suzuki H, Ruopp A (2008) Applying morphology generation models to machine translation. Proc ACL. Association for Computational Linguistics, Columbus, pp 514–522
Tran K, Bisazza A, Monz C (2014) Word translation prediction for morphologically rich languages with bilingual neural networks. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 1676–1688
Tsvetkov Y, Dyer C, Levin L, Bhatia A (2013) Generating English determiners in phrase-based translation with synthetic translation options. In: Proceedings of the eighth workshop on statistical machine translation. Sofia, Bulgaria, pp 271–280
Vaswani A, Huang L, Chiang D (2012) Smaller alignment models for better translations: unsupervised word alignment with the l0-norm. In: Proceedings of the 50th annual meeting of the association for computational linguistics, pp 311–319
Vazhenina D, Markov K (2013) Factored language modeling for Russian LVCSR. In: Proceedings of the international joint conference on awareness science and technology & ubi-media computing, pp 205–210
Vidhu Bhala RV, Abirami S (2014) Trends in word sense disambiguation. Artif Intell Rev 42(2):159–171
Virpioja S, Väyrynen J, Mansikkaniemi A, Kurimo M (2010) Applying morphological decomposition to statistical machine translation. In: Proceedings of the joint fifth workshop on statistical machine translation and metrics MATR. Uppsala University, Uppsala, Sweden, pp 195–200
Wang L, Wong DF, Chao LS, Lu Y, Xing J (2014) A systematic comparison of data selection criteria for SMT domain adaptation. Sci World J 2014
Wang R, Osenova P, Simov K (2012) Linguistically-augmented Bulgarian-to-English statistical machine translation model. In: Proceedings of the joint workshop on exploiting synergies between information retrieval and machine translation (ESIRMT) and hybrid approaches to machine translation (HyTra). Association for Computational Linguistics, Avignon, France, pp 119–128
Wang R, Zhao H, Lu BL (2015) Bilingual continuous-space language model growing for statistical machine translation. IEEE/ACM Trans Audio Speech Lang Process 23(7):1209–1220
Wang R, Utiyama M, Goto I, Sumita E, Zhao H, Lu BL (2016) Converting continuous-space language models into N-gram language models with efficient bilingual pruning for statistical machine translation. ACM Trans Asian Low-Resour Lang Inf Process 15(3):11:1–11:26
Weller M, Kisselew M, Smekalova S, Fraser A, Schmid H, Durrani N, Sajjad H, Farkas R (2013) Munich-Edinburgh-Stuttgart submissions at WMT13: morphological and syntactic processing for SMT. In: Proceedings of the eighth workshop on statistical machine translation. Association for Computational Linguistics, Sofia, Bulgaria, pp 232–239
Williams P, Sennrich R, Post M, Koehn P (2016) Syntax-based statistical machine translation. Morgan & Claypool, San Rafael
Wołk K, Marasek K (2013) Polish - English speech statistical machine translation systems for the IWSLT 2013. In: Proceedings of the international workshop on spoken language translation (IWSLT), Heidelberg, Germany
Wołk K, Marasek K (2014a) Enhanced bilingual evaluation understudy. In: Proceedings of the 11th international workshop on spoken language translation (IWSLT), Lake Tahoe, pp 191–197
Wołk K, Marasek K (2014b) Polish - English speech statistical machine translation systems for the IWSLT 2014. In: Proceedings of the international workshop on spoken language translation (IWSLT), Lake Tahoe, pp 143–149
Wołk K, Marasek K (2015a) Neural-based machine translation for medical text domain. Based on European Medicines Agency leaflet texts. Procedia Comput Sci 64:2–9
Wołk K, Marasek K (2015b) PJAIT systems for the IWSLT 2015 evaluation campaign enhanced by comparable corpora. In: Proceedings of the international workshop on spoken language translation (IWSLT), Da Nang, Vietnam, pp 101–104
Wołk K, Marasek K, Glinkowski W (2015a) Telemedicine as a special case of the machine translation. Comput Med Imaging Graph 46:249–256
Wołk K, Rejmund E, Marasek K (2015b) Harvesting comparable corpora and mining them for equivalent bilingual sentences using statistical classification and analogy-based heuristics. In: Proceedings of the international symposium on methodologies for intelligent systems (ISMIS), pp 433–441
Wróblewska A (2011) Polish-English word alignment: preliminary study. Emerg Intell Technol Ind 369:123–132
Wu X, Yu H, Liu Q (2014) RED: DCU-CASICT participation in WMT2014 metrics task. In: Proceedings of the ninth workshop on statistical machine translation. Association for Computational Linguistics, Baltimore, Maryland, USA, pp 420–425
Xiong D, Zhang M (2015) Backward and trigger-based language models for statistical machine translation. Nat Lang Eng 21(2):201–226
Žabokrtský Z, Ptáček J, Pajas P (2008) TectoMT: Highly modular MT system with tectogrammatics used as transfer layer. In: Proceedings of the third workshop on statistical machine translation. Association for Computational Linguistics, Columbus, Ohio, USA, pp 167–170
Zeman D, Fishel M, Berka J, Bojar O (2011) Addicter: What is wrong with my translations? Prague Bull Math Linguist 96:79–88
Zens R, Ney H (2006) Discriminative reordering models for statistical machine translation. In: Proceedings of the workshop on statistical machine translation, New York City, pp 55–63

Acknowledgements

The authors would like to thank the editor and anonymous reviewers for their helpful and constructive comments that greatly contributed to improving the paper. Funding was provided by Javna Agencija za Raziskovalno Dejavnost RS (Grant Nos. P2-0069, P2-0041).

Author information

Authors and Affiliations

Faculty of Electrical Engineering and Computer Science, University of Maribor, Smetanova 17, 2000, Maribor, Slovenia
Mirjam Sepesy Maučec & Janez Brest

Corresponding author

Correspondence to Mirjam Sepesy Maučec.

About this article

Cite this article

Maučec, M.S., Brest, J. Slavic languages in phrase-based statistical machine translation: a survey. Artif Intell Rev 51, 77–117 (2019). https://doi.org/10.1007/s10462-017-9558-2

Published: 06 May 2017
Issue Date: 31 January 2019
DOI: https://doi.org/10.1007/s10462-017-9558-2
