Statistical Machine Translation and Turkish

Oflazer, Kemal; Yeniterzi, Reyyan; Kahlout, İlknur Durgar-El

doi:10.1007/978-3-319-90165-7_10

Part of the book series: Theory and Applications of Natural Language Processing (NLP)

Abstract

Machine translation is one of the most important applications of natural language processing. The last 25 years have seen tremendous progress in machine translation, enabled by the development of statistical techniques and the availability of large-scale parallel sentence corpora from which statistical models of translation can be learned. Turkish poses many challenges for statistical machine translation, as alluded to in Chap. 1, owing mainly to its complex morphology. This chapter discusses in more detail the challenges of Turkish in the context of statistical machine translation and describes two widely different approaches that have been employed in the last several years for English-to-Turkish machine translation.

Notes

1.
www.statmt.org/wmt16/ (Accessed Sept. 14, 2017).
2.
International Workshop on Spoken Language Translation: workshop2013.iwslt.org/ (Accessed Sept. 14, 2017).
3.
Note that on the English side, the filler for [something] would come in the middle of this phrase.
4.
See Chap. 1 for details.
5.
This disambiguator has about 94% accuracy.
6.
Ideally, one would also perform derivational morphological analysis on the English side, so that one could, for example, analyze accession as access plus a marker indicating nominalization.
7.
The training set in the first row of Table 10.2 was limited to sentences on the Turkish side which had at most 90 tokens (roots and bound morphemes) in total, in order to comply with the limitations of the GIZA++ alignment tool. However, when only the content words are included, more sentences can be used, since far fewer sentences violate the length restriction once morphemes and function words are removed.
8.
It should be noted that what to selectively attach to the root should be considered on a per-language basis; if Turkish were to be aligned with a language with similar morphological markers, this perhaps would not have been needed.
9.
Using the content word data improved performance for all representations except the baseline.
10.
We ran MERT on the baseline model and the morphologically segmented models, forcing -weight-d to vary within a very small range around 0.1 but letting the other parameters vary within their suggested ranges. Even though the procedure reported a better BLEU score on the tune set, running the new model on the test set did not show any improvement at all. This may have been due to the fact that the initial choice of -weight-d along with -dl set to -1 provides such a drastic improvement that perturbations in the other parameters do not have much impact.
11.
We arrived at this combination by experimenting with the decoder to avoid the almost monotonic translation we were getting with the default parameters. These parameters boosted the BLEU scores substantially compared to the decoder's default parameters.
12.
We should also note that all sentences were lowercased so that we would not have to deal with capitalization issues at that stage.
13.
The meanings of the various tags are as follows. Dependency labels: PMOD = Preposition Modifier; POS = Possessive. Part-of-speech tags for the English words: +IN = Preposition; +PRP$ = Possessive Pronoun; +JJ = Adjective; +NN = Noun; +NNS = Plural Noun. Morphological feature tags in the Turkish sentence: +A3pl = 3rd person plural; +P3sg = 3rd person singular possessive; +Loc = Locative case. Note that we mark an English plural noun as +NN_NNS to indicate that the root is a noun and there is a plural morpheme on it. Note also that economic is related to relations, but we are not interested in such content words and their relations.
14.
We use _ to prefix such syntactic tags on the English side.
15.
The order is important in that we would like to attach the same sequence of function words in the same order so that the resulting tags on the English side are the same.
16.
We outline two additional rules later when we see a more complex example in Fig. 10.4.
17.
For example, the morphological analyzer outputs +A3sg to mark a singular noun, if there is no explicit plural morpheme. Such markers are removed.
18.
The tune set was not used in this work but reserved for future work so that meaningful comparisons could be made.
19.
It is possible that the ten test sets are not mutually exclusive.
20.
These allow unlimited distortions without penalizing them, but increase decoding time.
21.
In Moses, factors are separated by a ‘|’ symbol.
22.
Concatenating Root and Tags gives the Surface form, in that the surface is unique given this concatenation.
23.
Note that for Turkish, this representation is equivalent to surface words in that the surface is unique given this representation.
24.
Note that in this case, the translations would be generated in the same format, but we then split such postpositions from the words they are attached to, during decoding, and then evaluate the BLEU score.
25.
In order to provide a simple and clear representation, the example sentences contain the surface form of the words as opposed to the morphemic representation used earlier.
26.
For instance, consider the example in Fig. 10.4 involving if with some additional modifiers added to the intervening noun phrase.
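
As a small illustration of notes 17, 21, and 22, the following Python sketch builds a Moses-style factored token from a surface word and a simplified morphological analysis. The function names, the factor order (surface|root|tags), the example analysis strings, and the NULL placeholder for words without tags are all assumptions made for this sketch; they are not taken from the chapter, and real analyzer output carries more tags than shown here.

```python
from typing import List, Tuple

# Markers the analyzer emits even when there is no overt morpheme (note 17).
# Only +A3sg is named in the notes; treat this set as illustrative.
DEFAULT_MARKERS = {"A3sg"}


def split_analysis(analysis: str) -> Tuple[str, List[str]]:
    """Split 'root+Tag1+Tag2' into the root and its non-default tags."""
    root, *tags = analysis.split("+")
    return root, [t for t in tags if t not in DEFAULT_MARKERS]


def to_factored(surface: str, analysis: str) -> str:
    """Join the factors with '|' (note 21). Factor order is an assumption."""
    root, tags = split_analysis(analysis)
    # Hypothetical placeholder so that the tag factor is never empty.
    tag_factor = "".join("+" + t for t in tags) or "NULL"
    return "|".join([surface, root, tag_factor])


if __name__ == "__main__":
    print(to_factored("evlerde", "ev+A3pl+Loc"))
    print(to_factored("kitapta", "kitap+A3sg+Loc"))
```

In this sketch the root and tag factors together identify the surface form, which mirrors the observation in note 22.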

References

Bisazza A, Federico M (2009) Morphological pre-processing for Turkish to English statistical machine translation. In: Proceedings of IWSLT, Tokyo, pp 129–135
Carpuat M (2009) Toward using morphology in French-English phrase-based SMT. In: Proceedings of WMT, Athens, pp 150–154
Chung T, Gildea D (2009) Unsupervised tokenization for machine translation. In: Proceedings of EMNLP, Singapore, pp 718–726
Corston-Oliver S, Gamon M (2004) Normalizing German and English inflectional morphology to improve statistical word alignment. In: Proceedings of AMTA, Washington, DC
Durgar-El Kahlout İ (2009) A prototype English-Turkish statistical machine translation system. PhD thesis, Sabancı University, Istanbul
Durgar-El Kahlout İ, Oflazer K (2006) Initial explorations in English to Turkish statistical machine translation. In: Proceedings of WMT, New York, NY, pp 7–14
Durgar-El Kahlout İ, Oflazer K (2010) Exploiting morphology and local word reordering in English-to-Turkish phrase-based statistical machine translation. IEEE Trans Audio Speech Lang Process 18(6):1313–1322
Durgar-El Kahlout İ, Mermer C, Doğan MU (2012) Recent improvements in statistical machine translation between Turkish and English. In: Vertan C, von Hahn W (eds) Multilingual processing in Eastern and Southern EU languages: low-resourced technologies and translation. Cambridge Scholars Publishing, Cambridge
Eyigöz E, Gildea D, Oflazer K (2013a) Multi-rate HMMs for word alignment. In: Proceedings of WMT, Sofia, pp 494–502
Eyigöz E, Gildea D, Oflazer K (2013b) Simultaneous word-morpheme alignment for statistical machine translation. In: Proceedings of NAACL-HLT, Atlanta, GA, pp 32–40
Goldwater S, McClosky D (2005) Improving statistical MT through morphological analysis. In: Proceedings of EMNLP, Vancouver, BC, pp 676–683
Koehn P, Och FJ, Marcu D (2003) Statistical phrase-based translation. In: Proceedings of NAACL-HLT, Edmonton, AB, pp 127–133
Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of ACL, Prague, pp 177–180
Lee YS (2004) Morphological analysis for statistical machine translation. In: Proceedings of NAACL-HLT, Boston, MA, pp 57–60
Luong MT, Nakov P, Kan MY (2010) A hybrid morpheme-word representation for machine translation of morphologically rich languages. In: Proceedings of EMNLP, Cambridge, MA, pp 148–157
Mermer C, Akın AA (2010) Unsupervised search for the optimal segmentation for statistical machine translation. In: Proceedings of the ACL student research workshop, Uppsala, pp 31–36
Minkov E, Toutanova K, Suzuki H (2007) Generating complex morphology for machine translation. In: Proceedings of ACL, Prague, pp 128–135
Naradowsky J, Toutanova K (2011) Unsupervised bilingual morpheme segmentation and alignment with context-rich hidden semi-Markov models. In: Proceedings of ACL-HLT, Portland, OR, pp 895–904
Nguyen T, Vogel S, Smith NA (2010) Nonparametric word segmentation for machine translation. In: Proceedings of COLING, Beijing, pp 815–823
Niessen S, Ney H (2004) Statistical machine translation with scarce resources using morpho-syntactic information. Comput Linguist 30(2):181–204
Nivre J, Hall J, Nilsson J, Chanev A, Eryiğit G, Kübler S, Marinov S, Marsi E (2007) MaltParser: a language-independent system for data-driven dependency parsing. Nat Lang Eng 13(2):95–135
Oflazer K (1994) Two-level description of Turkish morphology. Lit Linguist Comput 9(2):137–148
Oflazer K (1996) Error-tolerant finite-state recognition with applications to morphological analysis and spelling correction. Comput Linguist 22(1):73–99
Oflazer K, Durgar-El Kahlout İ (2007) Exploring different representational units in English-to-Turkish statistical machine translation. In: Proceedings of WMT, Prague, pp 25–32
Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of ACL, Philadelphia, PA, pp 311–318
Popovic M, Ney H (2004) Towards the use of word stems and suffixes for statistical machine translation. In: Proceedings of LREC, Lisbon, pp 1585–1588
Sadat F, Habash N (2006) Combination of Arabic preprocessing schemes for statistical machine translation. In: Proceedings of COLING-ACL, Sydney, pp 1–8
Schmid H (1994) Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the international conference on new methods in language processing, Manchester
Stolcke A (2002) SRILM – an extensible language modeling toolkit. In: Proceedings of ICSLP, Denver, CO, vol 2, pp 901–904
Talbot D, Osborne M (2006) Modelling lexical redundancy for machine translation. In: Proceedings of COLING-ACL, Sydney, pp 969–976
Tantuğ AC, Oflazer K, Durgar-El Kahlout İ (2008) BLEU+: a tool for fine-grained BLEU computation. In: Proceedings of LREC, Marrakesh, pp 1493–1499
Toutanova K, Klein D, Manning CD, Singer Y (2003) Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of NAACL-HLT, Edmonton, AB, pp 252–259
Yang M, Kirchhoff K (2006) Phrase-based backoff models for machine translation of highly inflected languages. In: Proceedings of EACL, Trento, pp 41–48
Yeniterzi R (2009) Syntax-to-morphology alignment and constituent reordering in factored phrase-based statistical machine translation from English to Turkish. Master’s thesis, Sabancı University, Istanbul
Yeniterzi R, Oflazer K (2010) Syntax-to-morphology mapping in factored phrase-based statistical machine translation from English to Turkish. In: Proceedings of ACL, Uppsala, pp 454–464
Yılmaz E, Durgar-El Kahlout İ (2014) The use of recurrent neural networks language model in Turkish-English machine translation. In: Proceedings of IEEE signal processing and communications applications conference, Trabzon, pp 1247–1250
Yılmaz E, Durgar-El Kahlout İ, Aydın B, Özil ZS (2013) TÜBİTAK Turkish-English submissions for IWSLT 2013. In: Proceedings of IWSLT, Heidelberg, pp 152–159
Yuret D, Türe F (2006) Learning morphological disambiguation rules for Turkish. In: Proceedings of NAACL-HLT, New York, NY, pp 328–334
Zollmann A, Venugopal A, Vogel S (2006) Bridging the inflection morphology gap for Arabic statistical machine translation. In: Proceedings of NAACL-HLT, New York, NY, pp 201–204

Author information

Authors and Affiliations

Carnegie Mellon University Qatar, Doha-Education City, Qatar
Kemal Oflazer
Özyeğin University, Istanbul, Turkey
Reyyan Yeniterzi
TÜBİTAK-BİLGEM, Gebze, Kocaeli, Turkey
İlknur Durgar-El Kahlout

Corresponding author

Correspondence to Kemal Oflazer.

Editor information

Editors and Affiliations

Carnegie Mellon University Qatar, Doha-Education City, Qatar
Kemal Oflazer
Electrical and Electronic Engineering, Boğaziçi University, Istanbul-Bebek, Turkey
Murat Saraçlar

About this chapter

Cite this chapter

Oflazer, K., Yeniterzi, R., Kahlout, İ.DE. (2018). Statistical Machine Translation and Turkish. In: Oflazer, K., Saraçlar, M. (eds) Turkish Natural Language Processing. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-90165-7_10

DOI: https://doi.org/10.1007/978-3-319-90165-7_10
Published: 21 July 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-90163-3
Online ISBN: 978-3-319-90165-7
eBook Packages: Computer Science, Computer Science (R0)
