A Bilingual Corpus of Novels Aligned at Paragraph Level

  • Alexander Gelbukh
  • Grigori Sidorov
  • José Ángel Vera-Félix
Conference paper

DOI: 10.1007/11816508_4

Part of the Lecture Notes in Computer Science book series (LNCS, volume 4139)
Cite this paper as:
Gelbukh A., Sidorov G., Vera-Félix J.Á. (2006) A Bilingual Corpus of Novels Aligned at Paragraph Level. In: Salakoski T., Ginter F., Pyysalo S., Pahikkala T. (eds) Advances in Natural Language Processing. Lecture Notes in Computer Science, vol 4139. Springer, Berlin, Heidelberg

Abstract

The paper presents a bilingual English-Spanish parallel corpus aligned at the paragraph level. The corpus consists of twelve large novels found on the Internet and converted into text format, with manual correction of formatting problems and errors. We used a dictionary-based algorithm for automatic alignment of the corpus, and we report an evaluation of the alignment results. Very few resources are available for parallel fiction texts, even though such texts are a non-trivial alignment case of considerable size. Approaches to automatic alignment that rely on linguistic data are usually applied to texts in restricted domains, such as laws and manuals. It is not obvious that these methods are applicable to fiction, because fiction contains many more cases of non-literal translation than texts in restricted domains. We show that a dictionary-based method gives good alignment results for fiction texts, namely, precision at the state-of-the-art level.
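
The abstract does not spell out the alignment algorithm, so the following is only a minimal sketch of how a dictionary-based paragraph aligner might look, not the authors' method. It assumes a toy English-Spanish dictionary (`TOY_DICT`, hypothetical), a simple lexical-overlap score (`similarity`, hypothetical), and a monotone one-to-one dynamic-programming alignment that may skip paragraphs (`align`, hypothetical).

```python
import re

# Toy English -> Spanish dictionary (hypothetical, for illustration only).
TOY_DICT = {
    "house": {"casa"},
    "dog": {"perro"},
    "night": {"noche"},
    "sea": {"mar"},
    "old": {"viejo", "vieja"},
    "man": {"hombre"},
}


def tokens(text):
    """Lowercased set of word tokens in a paragraph."""
    return set(re.findall(r"\w+", text.lower()))


def similarity(en_par, es_par, dictionary=TOY_DICT):
    """Share of dictionary-covered English words whose translation
    occurs in the Spanish paragraph (0.0 if no word is covered)."""
    en_words = [w for w in tokens(en_par) if w in dictionary]
    if not en_words:
        return 0.0
    es_words = tokens(es_par)
    hits = sum(1 for w in en_words if dictionary[w] & es_words)
    return hits / len(en_words)


def align(en_pars, es_pars, dictionary=TOY_DICT):
    """Monotone 1-to-1 alignment that maximizes total similarity.
    Paragraphs on either side may be left unaligned."""
    n, m = len(en_pars), len(es_pars)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sim = similarity(en_pars[i - 1], es_pars[j - 1], dictionary)
            score[i][j] = max(score[i - 1][j - 1] + sim,
                              score[i - 1][j],
                              score[i][j - 1])
    # Backtrack to recover the aligned index pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        sim = similarity(en_pars[i - 1], es_pars[j - 1], dictionary)
        if sim > 0 and score[i][j] == score[i - 1][j - 1] + sim:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


en = ["The old man looked at the sea.",
      "A dog barked in the night.",
      "The house was empty."]
es = ["El viejo hombre miró el mar.",
      "Nota del traductor.",
      "Un perro ladró en la noche.",
      "La casa estaba vacía."]
print(align(en, es))
```

In this toy input the second Spanish paragraph (a translator's note) has no English counterpart and is left unaligned. A practical aligner for fiction would also need a real bilingual dictionary, lemmatization, and support for one-to-many paragraph matches, none of which this sketch attempts.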

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Alexander Gelbukh (1)
  • Grigori Sidorov (1)
  • José Ángel Vera-Félix (1)

  1. Natural Language and Text Processing Laboratory, Center for Research in Computer Science, National Polytechnic Institute, Mexico City, Mexico
