Extracting Parallel Paragraphs and Sentences from English-Persian Translated Documents

  • Mohammad Sadegh Rasooli
  • Omid Kashefi
  • Behrouz Minaei-Bidgoli
Conference paper

DOI: 10.1007/978-3-642-25631-8_52

Part of the Lecture Notes in Computer Science book series (LNCS, volume 7097)
Cite this paper as:
Rasooli M.S., Kashefi O., Minaei-Bidgoli B. (2011) Extracting Parallel Paragraphs and Sentences from English-Persian Translated Documents. In: Salem M.V.M., Shaalan K., Oroumchian F., Shakery A., Khelalfa H. (eds) Information Retrieval Technology. AIRS 2011. Lecture Notes in Computer Science, vol 7097. Springer, Berlin, Heidelberg

Abstract

Sentence and paragraph alignment is essential for preparing the parallel texts needed in applications such as machine translation. The lack of sufficient linguistic data for under-resourced languages like Persian makes this task challenging. In this paper, we propose a hybrid sentence and paragraph alignment model for Persian-English parallel documents, based on simple linguistic features as well as length similarity between the sentences and paragraphs of the source and target languages. We use a small bilingual dictionary of Persian-English nouns, punctuation marks, and length similarity as alignment metrics. We combine these features in a linear model and use a genetic algorithm to learn the weights of the linear equation. Evaluation results show that the extracted features improve on the baseline model, which is purely length-based.
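The linear combination of features described in the abstract might be sketched as follows. This is a minimal illustration rather than the authors' implementation: the feature definitions, the function names, the toy dictionary, and the weights are all assumptions made for the example, and the weights are fixed by hand here, whereas the paper learns them with a genetic algorithm.

```python
# Hypothetical sketch of a linear alignment score over three features:
# length similarity, punctuation similarity, and noun-dictionary overlap.

# Map Persian punctuation to ASCII equivalents (an assumption of this sketch).
PERSIAN_TO_ASCII = str.maketrans({
    "\u060c": ",",  # Arabic/Persian comma
    "\u061b": ";",  # Arabic/Persian semicolon
    "\u061f": "?",  # Arabic/Persian question mark
})
MARKS = ".,;:!?"


def length_similarity(src: str, tgt: str) -> float:
    """Ratio of the shorter to the longer character length, in [0, 1]."""
    a, b = len(src), len(tgt)
    return 1.0 if max(a, b) == 0 else min(a, b) / max(a, b)


def punct_count(text: str) -> int:
    """Number of punctuation marks after normalising Persian marks."""
    text = text.translate(PERSIAN_TO_ASCII)
    return sum(text.count(m) for m in MARKS)


def punctuation_similarity(src: str, tgt: str) -> float:
    """Ratio of the smaller to the larger punctuation count, in [0, 1]."""
    a, b = punct_count(src), punct_count(tgt)
    return 1.0 if max(a, b) == 0 else min(a, b) / max(a, b)


def tokens(text: str) -> list:
    """Whitespace tokens, lowercased, with edge punctuation stripped."""
    text = text.translate(PERSIAN_TO_ASCII)
    return [t.strip(MARKS).lower() for t in text.split() if t.strip(MARKS)]


def dictionary_overlap(src: str, tgt: str, noun_dict: dict) -> float:
    """Fraction of dictionary nouns in src whose translation occurs in tgt."""
    tgt_tokens = set(tokens(tgt))
    known = [w for w in tokens(src) if w in noun_dict]
    if not known:
        return 0.0
    return sum(noun_dict[w] in tgt_tokens for w in known) / len(known)


def alignment_score(src: str, tgt: str, noun_dict: dict, weights: tuple) -> float:
    """Weighted sum of the three features (the linear model)."""
    w_len, w_punct, w_dict = weights
    return (w_len * length_similarity(src, tgt)
            + w_punct * punctuation_similarity(src, tgt)
            + w_dict * dictionary_overlap(src, tgt, noun_dict))


# Toy data: a two-entry English-to-Persian noun dictionary and hand-picked weights.
noun_dict = {"book": "کتاب", "table": "میز"}
weights = (0.3, 0.2, 0.5)
src = "The book is on the table."
good = "کتاب روی میز است."
bad = "او دیروز به مدرسه رفت، اما دیر رسید."

# The true translation should score higher than the unrelated sentence.
print(alignment_score(src, good, noun_dict, weights)
      > alignment_score(src, bad, noun_dict, weights))
```

In a setup like this, a genetic algorithm would search over the `weights` tuple, using alignment accuracy on a small hand-aligned sample as the fitness function. Setting the punctuation and dictionary weights to zero recovers a purely length-based baseline.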

Keywords

Sentence Alignment; Paragraph Alignment; Parallel Corpus; Bilingual Corpus; Persian; English; Machine Translation

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Mohammad Sadegh Rasooli (1)
  • Omid Kashefi (1)
  • Behrouz Minaei-Bidgoli (1)
  1. Department of Computer Engineering, Iran University of Science and Technology, Iran
