Abstract
Machine translation is one of the most important applications of natural language processing. The last 25 years have seen tremendous progress in machine translation, enabled by the development of statistical techniques and the availability of large-scale parallel sentence corpora from which statistical models of translation can be learned. As alluded to in Chap. 1, Turkish poses many challenges for statistical machine translation, owing mainly to its complex morphology. This chapter discusses these challenges in more detail and describes two widely different approaches that have been employed in the last several years for English-to-Turkish machine translation.
Notes
- 1.
www.statmt.org/wmt16/ (Accessed Sept. 14, 2017).
- 2.
International Workshop on Spoken Language Translation: workshop2013.iwslt.org/ (Accessed Sept. 14, 2017).
- 3.
Note that on the English side, the filler for [something] would come in the middle of this phrase.
- 4.
See Chap. 1 for details.
- 5.
This disambiguator has about 94% accuracy.
- 6.
Ideally, one would also perform derivational morphological analysis on the English side, so that one could, for example, analyze accession into access plus a marker indicating nominalization.
- 7.
The training set in the first row of Table 10.2 was limited to sentences whose Turkish side had at most 90 tokens (roots and bound morphemes) in total, in order to comply with the limitations of the GIZA++ alignment tool. However, when only the content words are included, more sentences can be used, since far fewer sentences violate the length restriction once morphemes and function words are removed.
- 8.
It should be noted that what to selectively attach to the root should be considered on a per-language basis; if Turkish were to be aligned with a language with similar morphological markers, this perhaps would not have been needed.
- 9.
Using the content word data improved performance for all representations except the baseline.
- 10.
We ran MERT on the baseline model and the morphologically segmented models, forcing -weight-d to range over a very small interval around 0.1 but letting the other parameters vary within their suggested ranges. Even though the procedure reported a better BLEU score on the tune set, running the new model on the test set did not show any improvement at all. This may be because the initial choice of -weight-d, along with -dl set to -1, provides such a drastic improvement that perturbations in the other parameters do not have much impact.
- 11.
We arrived at this combination by experimenting with the decoder to avoid the almost monotonic translation we were getting with the default parameters. These parameters boosted the BLEU scores substantially compared to default parameters used by the decoder.
- 12.
We should also note that all sentences were lowercased so that we would not have to deal with exact capitalization issues at that stage.
- 13.
The meanings of the various tags are as follows. Dependency labels: PMOD (Preposition Modifier); POS (Possessive). Part-of-speech tags for the English words: +IN (Preposition); +PRP$ (Possessive Pronoun); +JJ (Adjective); +NN (Noun); +NNS (Plural Noun). Morphological feature tags in the Turkish sentence: +A3pl (3rd person plural); +P3sg (3rd person singular possessive); +Loc (Locative case). Note that we mark an English plural noun as +NN_NNS to indicate that the root is a noun and there is a plural morpheme on it. Note also that economic is related to relations, but we are not interested in such content words and their relations.
- 14.
We use _ to prefix such syntactic tags on the English side.
- 15.
The order is important in that we would like to attach the same sequence of function words in the same order so that the resulting tags on the English side are the same.
- 16.
We outline two additional rules later when we see a more complex example in Fig. 10.4.
- 17.
For example, the morphological analyzer outputs +A3sg to mark a singular noun, if there is no explicit plural morpheme. Such markers are removed.
- 18.
The tune set was not used in this work but reserved for future work so that meaningful comparisons could be made.
- 19.
It is possible that the ten test sets are not mutually exclusive.
- 20.
These settings allow unlimited distortion without penalizing it, but increase decoding time.
- 21.
In Moses, factors are separated by a ‘|’ symbol.
- 22.
Concatenating Root and Tags gives the Surface form, in that the surface is unique given this concatenation.
- 23.
Note that for Turkish, this representation is equivalent to surface words in that the surface is unique given this representation.
- 24.
Note that in this case, the translations would be generated in the same format, but we then split such postpositions from the words they are attached to, during decoding, and then evaluate the BLEU score.
- 25.
In order to provide a simple and clear representation, the example sentences contain the surface form of the words as opposed to the morphemic representation used earlier.
- 26.
For instance, consider the example in Fig. 10.4 involving if with some additional modifiers added to the intervening noun phrase.
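Several of the notes above describe small preprocessing conventions: the 90-token length limit on the Turkish side (Note 7), the removal of default markers such as +A3sg (Note 17), and the use of ‘|’ to separate factors in Moses (Notes 21 and 22). The following is a minimal Python sketch of these conventions, not the authors' actual pipeline. The function names, the factor order (surface, root, tags), and the example root ev are illustrative assumptions; only the 90-token limit, the +A3sg marker, and the ‘|’ separator come from the notes.

```python
from typing import FrozenSet, Iterable, Iterator, Tuple

MAX_TOKENS = 90  # Note 7: limit on Turkish-side tokens (roots and bound morphemes)
# Note 17 names only +A3sg; any further default markers would be assumptions.
DEFAULT_MARKERS: FrozenSet[str] = frozenset({"+A3sg"})


def strip_default_markers(tags: str,
                          markers: FrozenSet[str] = DEFAULT_MARKERS) -> str:
    """Drop default markers from a tag string such as '+A3sg+Loc'."""
    kept = [t for t in tags.split("+") if t and "+" + t not in markers]
    return "".join("+" + t for t in kept)


def to_factored(root: str, tags: str) -> str:
    """Build a factored token with '|' between factors (Note 21).

    The factor order surface|root|tags is a hypothetical choice.
    """
    surface = root + tags  # Note 22: root and tags concatenated give the surface factor
    return "|".join([surface, root, tags])


def filter_by_length(pairs: Iterable[Tuple[str, str]],
                     max_tokens: int = MAX_TOKENS) -> Iterator[Tuple[str, str]]:
    """Keep (English, Turkish) pairs whose Turkish side has at most max_tokens
    whitespace-separated tokens."""
    for english, turkish in pairs:
        if len(turkish.split()) <= max_tokens:
            yield english, turkish


if __name__ == "__main__":
    tags = strip_default_markers("+A3sg+Loc")
    print(to_factored("ev", tags))
```

Keeping the three steps as separate functions means the length filter can be applied either before or after morphemes and function words are removed, which is the distinction Note 7 draws.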
References
Bisazza A, Federico M (2009) Morphological pre-processing for Turkish to English statistical machine translation. In: Proceedings of IWSLT, Tokyo, pp 129–135
Carpuat M (2009) Toward using morphology in French-English phrase-based SMT. In: Proceedings of WMT, Athens, pp 150–154
Chung T, Gildea D (2009) Unsupervised tokenization for machine translation. In: Proceedings of EMNLP, Singapore, pp 718–726
Corston-Oliver S, Gamon M (2004) Normalizing German and English inflectional morphology to improve statistical word alignment. In: Proceedings of AMTA, Washington, DC
Durgar-El Kahlout İ (2009) A prototype English-Turkish statistical machine translation system. PhD thesis, Sabancı University, Istanbul
Durgar-El Kahlout İ, Oflazer K (2006) Initial explorations in English to Turkish statistical machine translation. In: Proceedings of WMT, New York, NY, pp 7–14
Durgar-El Kahlout İ, Oflazer K (2010) Exploiting morphology and local word reordering in English-to-Turkish phrase-based statistical machine translation. IEEE Trans Audio Speech Lang Process 18(6):1313–1322
Durgar-El Kahlout İ, Mermer C, Doğan MU (2012) Recent improvements in statistical machine translation between Turkish and English. In: Vertan C, von Hahn W (eds) Multilingual processing in Eastern and Southern EU languages: low-resourced technologies and translation. Cambridge Scholars Publishing, Cambridge
Eyigöz E, Gildea D, Oflazer K (2013a) Multi-rate HMMs for word alignment. In: Proceedings of WMT, Sofia, pp 494–502
Eyigöz E, Gildea D, Oflazer K (2013b) Simultaneous word-morpheme alignment for statistical machine translation. In: Proceedings of NAACL-HLT, Atlanta, GA, pp 32–40
Goldwater S, McClosky D (2005) Improving statistical MT through morphological analysis. In: Proceedings of EMNLP, Vancouver, BC, pp 676–683
Koehn P, Och FJ, Marcu D (2003) Statistical phrase-based translation. In: Proceedings of NAACL-HLT, Edmonton, AB, pp 127–133
Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of ACL, Prague, pp 177–180
Lee YS (2004) Morphological analysis for statistical machine translation. In: Proceedings of NAACL-HLT, Boston, MA, pp 57–60
Luong MT, Nakov P, Kan MY (2010) A hybrid morpheme-word representation for machine translation of morphologically rich languages. In: Proceedings of EMNLP, Cambridge, MA, pp 148–157
Mermer C, Akın AA (2010) Unsupervised search for the optimal segmentation for statistical machine translation. In: Proceedings of the ACL student research workshop, Uppsala, pp 31–36
Minkov E, Toutanova K, Suzuki H (2007) Generating complex morphology for machine translation. In: Proceedings of ACL, Prague, pp 128–135
Naradowsky J, Toutanova K (2011) Unsupervised bilingual morpheme segmentation and alignment with context-rich hidden semi-Markov models. In: Proceedings of ACL-HLT, Portland, OR, pp 895–904
Nguyen T, Vogel S, Smith NA (2010) Nonparametric word segmentation for machine translation. In: Proceedings of COLING, Beijing, pp 815–823
Niessen S, Ney H (2004) Statistical machine translation with scarce resources using morpho-syntactic information. Comput Linguist 30(2):181–204
Nivre J, Hall J, Nilsson J, Chanev A, Eryiğit G, Kübler S, Marinov S, Marsi E (2007) MaltParser: a language-independent system for data-driven dependency parsing. Nat Lang Eng 13(2):95–135
Oflazer K (1994) Two-level description of Turkish morphology. Lit Linguist Comput 9(2):137–148
Oflazer K (1996) Error-tolerant finite-state recognition with applications to morphological analysis and spelling correction. Comput Linguist 22(1):73–99
Oflazer K, Durgar-El Kahlout İ (2007) Exploring different representational units in English-to-Turkish statistical machine translation. In: Proceedings of WMT, Prague, pp 25–32
Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of ACL, Philadelphia, PA, pp 311–318
Popovic M, Ney H (2004) Towards the use of word stems and suffixes for statistical machine translation. In: Proceedings of LREC, Lisbon, pp 1585–1588
Sadat F, Habash N (2006) Combination of Arabic preprocessing schemes for statistical machine translation. In: Proceedings of COLING-ACL, Sydney, pp 1–8
Schmid H (1994) Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the international conference on new methods in language processing, Manchester
Stolcke A (2002) SRILM – an extensible language modeling toolkit. In: Proceedings of ICSLP, Denver, CO, vol 2, pp 901–904
Talbot D, Osborne M (2006) Modelling lexical redundancy for machine translation. In: Proceedings of COLING-ACL, Sydney, pp 969–976
Tantuğ AC, Oflazer K, Durgar-El Kahlout İ (2008) BLEU+: a tool for fine-grained BLEU computation. In: Proceedings of LREC, Marrakesh, pp 1493–1499
Toutanova K, Klein D, Manning CD, Singer Y (2003) Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of NAACL-HLT, Edmonton, AB, pp 252–259
Yang M, Kirchhoff K (2006) Phrase-based backoff models for machine translation of highly inflected languages. In: Proceedings of EACL, Trento, pp 41–48
Yeniterzi R (2009) Syntax-to-morphology alignment and constituent reordering in factored phrase-based statistical machine translation from English to Turkish. Master’s thesis, Sabancı University, Istanbul
Yeniterzi R, Oflazer K (2010) Syntax-to-morphology mapping in factored phrase-based statistical machine translation from English to Turkish. In: Proceedings of ACL, Uppsala, pp 454–464
Yılmaz E, Durgar-El Kahlout İ (2014) The use of recurrent neural networks language model in Turkish-English machine translation. In: Proceedings of IEEE signal processing and communications applications conference, Trabzon, pp 1247–1250
Yılmaz E, Durgar-El Kahlout İ, Aydın B, Özil ZS (2013) TÜBİTAK Turkish-English submissions for IWSLT 2013. In: Proceedings of IWSLT, Heidelberg, pp 152–159
Yuret D, Türe F (2006) Learning morphological disambiguation rules for Turkish. In: Proceedings of NAACL-HLT, New York, NY, pp 328–334
Zollmann A, Venugopal A, Vogel S (2006) Bridging the inflection morphology gap for Arabic statistical machine translation. In: Proceedings of NAACL-HLT, New York, NY, pp 201–204
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this chapter
Cite this chapter
Oflazer, K., Yeniterzi, R., Kahlout, İ.DE. (2018). Statistical Machine Translation and Turkish. In: Oflazer, K., Saraçlar, M. (eds) Turkish Natural Language Processing. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-90165-7_10
DOI: https://doi.org/10.1007/978-3-319-90165-7_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-90163-3
Online ISBN: 978-3-319-90165-7
eBook Packages: Computer Science (R0)