Abstract
The multilingual nature of the world makes translation a crucial requirement today. In this research we apply state-of-the-art statistical machine translation techniques to the West-Slavic language group. We classify the West-Slavic languages and choose Polish as a representative candidate for our research. The experiments are conducted on written and spoken texts, whose characteristics are also defined. The machine translation systems are trained within the West-Slavic group as well as into English. Translation systems and data sets are analyzed, prepared and adapted for the needs of West-Slavic-* translation. To evaluate the effects of different preparations on translation results, we conducted experiments and used the BLEU, NIST and TER metrics. By defining translation parameters suited to morphologically rich languages, we improve translation quality and draw conclusions.
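As an illustration of the first of the metrics named above, the following is a minimal sketch of corpus-level BLEU. It is not the evaluation tooling used in this research. It assumes a single reference per hypothesis, pre-tokenized input, no smoothing, and n-grams up to order 4. The function names `ngrams` and `corpus_bleu` are hypothetical helpers introduced only for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU, assuming one reference per hypothesis and no smoothing."""
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, a zero match count at any order gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for hypotheses shorter than the references.
    penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return penalty * math.exp(log_precision)


hypothesis = "the cat sat on the mat".split()
reference = "the cat sat on a mat".split()
score = corpus_bleu([hypothesis], [reference])
print(round(score, 4))
```

Because the score is built from exact surface n-gram matches, a morphologically rich language such as Polish can be penalized for an inflected form that differs from the reference by only a suffix, which is one reason preprocessing choices can shift the reported numbers.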
Acknowledgments
This research was supported by the Polish-Japanese Academy of Information Technology statutory resources (ST/MUL/2016), resources for young researchers at PJATK, and the CLARIN ERIC research program.
© 2017 Springer International Publishing Switzerland
About this paper
Cite this paper
Wołk, A., Wołk, K., Marasek, K. (2017). Analysis of Complexity Between Spoken and Written Language for Statistical Machine Translation in West-Slavic Group. In: Zgrzywa, A., Choroś, K., Siemiński, A. (eds) Multimedia and Network Information Systems. Advances in Intelligent Systems and Computing, vol 506. Springer, Cham. https://doi.org/10.1007/978-3-319-43982-2_22
DOI: https://doi.org/10.1007/978-3-319-43982-2_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43981-5
Online ISBN: 978-3-319-43982-2
eBook Packages: Engineering, Engineering (R0)