The undergraduate learner translator corpus: a new resource for translation studies and computational linguistics

  • Reem F. Alfuraih
Project Notes


Interest in learner translator corpora, which are invaluable resources for teaching and research, is growing worldwide. This paper introduces a new resource to support researchers in areas such as computational linguistics, descriptive translation studies, computer-aided translation technology, Arabic machine translation applications, cognitive science, and translation pedagogy. The undergraduate learner translator corpus (ULTC), motivated by the lack of learner translator resources that provide data about learners translating from and into Arabic, is an ongoing, error-tagged, sentence-aligned parallel corpus of English, Arabic, and French, with Arabic as its main language. The present corpus, consisting of parallel texts produced by female learners translating from English or French into Arabic, is the first of its kind in terms of the languages represented, tasks covered, and number of students involved. It is also unique in combining many complementary corpora of cross-lingual data, each of which has its own web-based query interface and corpus analysis tools. This paper describes the ULTC compilation process, preliminary findings, and planned future expansion and research.


Translation pedagogy; Arabic parallel corpus; Multilingual corpus; Multimodal corpus; Interpreting corpus; Triangulation



The author would like to thank the anonymous reviewers for the detailed and constructive review that helped to clarify many points and improve the structure of the manuscript. The author is greatly indebted to PNU instructors, course coordinators, and learners for their contributions.



Copyright information

© Springer Nature B.V. 2019

Authors and Affiliations

  1. College of Languages, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia
