Abstract
Statistical machine translation (SMT) for dialectal varieties usually suffers from data sparsity, but combining word-level and character-level models can yield good results even with little training data by exploiting the relative proximity of the two varieties. In this paper, we describe a specific problem that arises in translation between standard Austrian German and Viennese dialect, together with its solution. In general, a phrase-based approach to SMT cannot deal satisfactorily with complex lexical transformations and syntactic reordering. With sparse resources this becomes nearly impossible. These are typical cases where rule-based preprocessing of the source data is the preferable option, hence the hybrid character of the resulting system. One such case is the transformation of synthetic imperfect verb forms into perfect tense with a finite auxiliary and a past participle, which involves detecting clause boundaries and identifying the clause type. We present an approach that uses a full parse of the source sentences and discuss the problems that arise with such an approach. Within the developed SMT system, the models trained on preprocessed data unsurprisingly fare better than those trained on the original data, but sentences left unchanged by the preprocessing also gain slightly better scores. This shows that introducing a rule-based layer dealing with systematic non-local transformations increases the overall performance of the system, most probably due to a higher accuracy in the alignment.
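To make the imperfect-to-perfect transformation concrete, the following is a minimal sketch and not the authors' implementation. It assumes that clause boundaries and clause type have already been determined (the step the abstract says requires a full parse), and it relies on a tiny hypothetical lexicon, `PERFECT_FORMS`, that maps an imperfect form to its auxiliary and participle. The function name and the clause-type labels are likewise illustrative assumptions.

```python
# Hypothetical toy lexicon: imperfect form -> (finite auxiliary, past participle).
# A real system would need morphological analysis instead of a lookup table.
PERFECT_FORMS = {
    "kaufte": ("hat", "gekauft"),
    "ging": ("ist", "gegangen"),
}


def imperfect_to_perfect(tokens, clause_type):
    """Rewrite a single clause, given as a list of tokens without punctuation.

    clause_type is "main" (finite verb in second position) or
    "subordinate" (finite verb in final position); both labels are
    assumptions made for this sketch.
    """
    for i, token in enumerate(tokens):
        if token not in PERFECT_FORMS:
            continue
        auxiliary, participle = PERFECT_FORMS[token]
        if clause_type == "main":
            # The auxiliary takes over the finite verb slot,
            # the participle moves to the end of the clause.
            return tokens[:i] + [auxiliary] + tokens[i + 1:] + [participle]
        # Verb-final clause: the participle precedes the finite auxiliary.
        return tokens[:i] + [participle, auxiliary] + tokens[i + 1:]
    # No known imperfect form found: return the clause unchanged.
    return list(tokens)


main_clause = imperfect_to_perfect("er kaufte ein Auto".split(), "main")
sub_clause = imperfect_to_perfect("dass er ein Auto kaufte".split(), "subordinate")
print(" ".join(main_clause))
print(" ".join(sub_clause))
```

The two calls illustrate why the clause type matters: the same verb form is rewritten non-locally in a main clause but locally in a subordinate clause, which is the kind of reordering a phrase-based model struggles to learn from sparse data.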
Notes
1. See [1] for details on the orthography developed for this project.
2. There are two exceptions which indeed have imperfect forms: the auxiliary sein ‘to be’ and the two modals sollen ‘ought to’ and wollen ‘want’.
3. A phenomenon with similar consequences for SMT is the lack of genitive case in VD. It is either replaced by dative, or – in possessive constructions – by a prepositional phrase (s auto fon da schwesda – das Auto von der Schwester ‘the car of the sister’). Alternatively, with animate possessors, there is also a construction not existing in Standard German: the possessor in dative case, and a resumptive possessive pronoun (da schwesda ia auto – \(^{?}\) der Schwester ihr Auto ‘the sister-Dat her car’). These constructions will not be discussed in this paper.
References
Hildenbrandt, T., Moosmüller, S., Neubarth, F.: Orthographic encoding of the Viennese dialect for machine translation. In: Vetulani, Z., Uszkoreit, H. (eds.) Human Language Technologies as a Challenge for Computer Science and Linguistics, Proceedings of the 6th Language & Technology Conference (LTC 2013), 7–9 December 2013, Poznan, Poland, pp. 399–403 (2013)
Schikola, H.: Schriftdeutsch und Wienerisch. Österreichischer Bundesverlag für Unterricht, Wissenschaft und Kunst, Wien (1954)
Hornung, M.: Wörterbuch der Wiener Mundart. ÖBV - Pädagogischer Verlag, Wien (1998)
Collins, M., Koehn, P., Kučerová, I.: Clause restructuring for statistical machine translation. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), Ann Arbor, June 2005, pp. 531–540 (2005)
Labov, W.: Principles of Linguistic Change (II): Social Factors. Blackwell, Massachusetts (2001)
Nakov, P., Tiedemann, J.: Combining word-level and character-level models for machine translation between closely-related languages. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju, Rep. of Korea, 8–14 July 2012, pp. 301–305 (2012)
Scherrer, Y., Erjavec, T.: Modernizing historical Slovene words with character-based SMT. In: Proceedings of the 4th Biennial Workshop on Balto-Slavic Natural Language Processing of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, pp. 58–62 (2013)
Nakov, P., Ng, H.T.: Improving statistical machine translation for a resource-poor language using related resource-rich languages. J. Artif. Intell. Res. 44, 179–222 (2012)
Zbib, R., Malchiodi, E., Devlin, J., Stallard, D., Matsoukas, S., Schwartz, R., Makhoul, J., Zaidan, O., Callison-Burch, C.: Machine translation of Arabic dialects. In: Proceedings of NAACL: HLT 2012, Montreal, Canada, pp. 49–59 (2012)
Sawaf, H.: Arabic dialect handling in hybrid machine translation. In: Proceedings of the 9th Conference of the Association for Machine Translation in the Americas (AMTA), Denver, Colorado (2010)
Haddow, B., Hernández Huerta, A., Neubarth, F., Trost, H.: Corpus development for machine translation between standard and dialectal varieties. In: Proceedings of the Workshop Adaptation of Language Resources and Tools for Closely Related Languages and Language Variants of the 9th International Conference on Recent Advances in Natural Language Processing (RANLP 2013), 13 September 2013, Hissar, Bulgaria, pp. 7–14 (2013)
Korpusbasierte Wortgrundformenliste DEREWO, v-ww-bll-320000g-2012-12-31-1.0, mit Benutzerdokumentation, Institut für Deutsche Sprache, Programmbereich Korpuslinguistik, Mannheim, Deutschland (2013)
den Besten, H.: On the interaction of root transformations and lexical deletive rules. In: Abraham, W. (ed.) On the Formal Syntax of the Westgermania. Papers from the 3rd Groningen Grammar Talks, pp. 47–131. John Benjamins, Amsterdam (1983)
Haider, H.: The case of German. In: Toman, J. (ed.) Studies in German Grammar, pp. 65–101. Foris, Dordrecht (1985)
Diedrichsen, E.: Zu einer semantischen Klassifikation der intransitiven haben- und sein-Verben im Deutschen. In: Katz, G., et al. (eds.) Sinn & Bedeutung VI, Proceedings of the 6th Annual Meeting of the Gesellschaft für Semantik, University of Osnabrück (2002)
Schmid, H.: Efficient parsing of highly ambiguous context-free grammars with bit vectors. In: Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004), vol. 1, Geneva, Switzerland, pp. 162–168 (2004)
Brants, S., Dipper, S., Eisenberg, P., Hansen, S., König, E., Lezius, W., Rohrer, C., Smith, G., Uszkoreit, H.: TIGER: linguistic interpretation of a German corpus. J. Lang. Comput. 2004(2), 597–620 (2004)
Björkelund, A., Bohnet, B., Hafdell, L., Nugues, P.: A high-performance syntactic and semantic dependency parser. In: Coling 2010: Demonstration Volume, Beijing, 23–27 August 2010, pp. 33–36 (2010)
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E.: Moses: open source toolkit for statistical machine translation. In: Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, 2007, pp. 177–180 (2007)
Vilar, D., Peter, J.-T., Ney, H.: Can we translate letters? In: Proceedings of the 2nd Workshop on Statistical Machine Translation, Prague, Czech Republic, ACL, pp. 33–39 (2007)
Tiedemann, J.: Character-based PSMT for closely related languages. In: Màrquez, L., Somers, H. (eds.) Proceedings of 13th Annual Conference of the European Association for Machine Translation (EAMT 2009), Barcelona, Spain, pp. 12–19 (2009)
Och, F.J., Ney, H.: Improved statistical alignment models. In: Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL00), Hong Kong, China, pp. 440–447 (2000)
Postel, H.J.: Die Kölner Phonetik. Ein Verfahren zur Identifizierung von Personennamen auf der Grundlage der Gestaltanalyse. In: IBM Nachrichten, 19, pp. 925–931 (1969)
Acknowledgements
The work presented in this paper was carried out within the project ‘Machine Learning Techniques for Modeling of Language Varieties’ (MLT4MLV - ICT10-049, 2011–2013) which was funded by the Vienna Science and Technology Fund (WWTF).
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Neubarth, F., Haddow, B., Huerta, A.H., Trost, H. (2016). A Hybrid Approach to Statistical Machine Translation Between Standard and Dialectal Varieties. In: Vetulani, Z., Uszkoreit, H., Kubis, M. (eds) Human Language Technology. Challenges for Computer Science and Linguistics. LTC 2013. Lecture Notes in Computer Science, vol 9561. Springer, Cham. https://doi.org/10.1007/978-3-319-43808-5_26
DOI: https://doi.org/10.1007/978-3-319-43808-5_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43807-8
Online ISBN: 978-3-319-43808-5
eBook Packages: Computer Science, Computer Science (R0)