TEP: Tehran English-Persian Parallel Corpus

  • Mohammad Taher Pilevar
  • Heshaam Faili
  • Abdol Hamid Pilevar
Conference paper

DOI: 10.1007/978-3-642-19437-5_6

Volume 6609 of the book series Lecture Notes in Computer Science (LNCS)
Cite this paper as:
Pilevar M.T., Faili H., Pilevar A.H. (2011) TEP: Tehran English-Persian Parallel Corpus. In: Gelbukh A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2011. Lecture Notes in Computer Science, vol 6609. Springer, Berlin, Heidelberg

Abstract

Parallel corpora are one of the key resources in natural language processing. Despite their importance in many multilingual applications, no large-scale English-Persian corpus has been made available so far, owing to the difficulty of creating one and the intensive labor required. This paper describes the construction of the Tehran English-Persian parallel corpus (TEP) from movie subtitles, together with some of the difficulties we encountered during data extraction and sentence alignment. To the best of our knowledge, TEP is the first freely released large-scale (on the order of a million words) English-Persian parallel corpus.

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Mohammad Taher Pilevar (1)
  • Heshaam Faili (1)
  • Abdol Hamid Pilevar (2)

  1. Natural Language Processing Laboratory, University of Tehran, Iran
  2. Faculty of Computer Engineering, Bu Ali Sina University, Hamedan, Iran