Abstract
Much of the work on statistical machine translation (SMT) from morphologically rich languages has shown that morphological tokenization and orthographic normalization help improve SMT quality because of the sparsity reduction they contribute. In this article, we study the effect of these processes on SMT when translating into a morphologically rich language, namely Arabic. We explore a space of tokenization schemes and normalization options. We also examine a set of six detokenization techniques and evaluate on detokenized and orthographically correct (enriched) output. Our results show that the best performing tokenization scheme is that of the Penn Arabic Treebank. Additionally, training on orthographically normalized (reduced) text then jointly enriching and detokenizing the output outperforms training on enriched text.
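The contrast between reduced and enriched text can be made concrete with a minimal sketch. It assumes that reduced normalization collapses the Hamzated Alef variants to bare Alef and Alef Maqsura to Ya. That mapping is a common convention in Arabic processing and is an assumption here, not a specification quoted from the article; the names `REDUCE_MAP` and `reduce_orthography` are hypothetical.

```python
# Hypothetical sketch of "reduced" orthographic normalization.
# The character mapping is an assumed common convention, not the
# article's exact definition.
REDUCE_MAP = str.maketrans({
    "\u0622": "\u0627",  # Alef with madda above -> bare Alef
    "\u0623": "\u0627",  # Alef with hamza above -> bare Alef
    "\u0625": "\u0627",  # Alef with hamza below -> bare Alef
    "\u0649": "\u064A",  # Alef maqsura -> Ya
})


def reduce_orthography(text: str) -> str:
    """Map enriched spellings onto their reduced (normalized) form."""
    return text.translate(REDUCE_MAP)


# Two distinct enriched words collapse to one reduced form, which is the
# kind of sparsity reduction the abstract refers to.
word_a = "\u0623\u0646"  # Alef-with-hamza-above + Noon
word_b = "\u0625\u0646"  # Alef-with-hamza-below + Noon
same_reduced_form = reduce_orthography(word_a) == reduce_orthography(word_b)
```

The mapping is many-to-one, so it cannot be inverted by a character table alone. This is why output produced by a system trained on reduced text needs a separate enrichment step to recover orthographically correct Arabic.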
Cite this article
El Kholy, A., Habash, N. Orthographic and morphological processing for English–Arabic statistical machine translation. Machine Translation 26, 25–45 (2012). https://doi.org/10.1007/s10590-011-9110-0
DOI: https://doi.org/10.1007/s10590-011-9110-0