Extracting Parallel Paragraphs and Sentences from English-Persian Translated Documents

  • Mohammad Sadegh Rasooli
  • Omid Kashefi
  • Behrouz Minaei-Bidgoli
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7097)


The task of sentence and paragraph alignment is essential for preparing the parallel texts needed in applications such as machine translation. The lack of sufficient linguistic data for under-resourced languages like Persian is a challenging issue. In this paper, we propose a hybrid sentence and paragraph alignment model for Persian-English parallel documents, based on simple linguistic features as well as length similarity between the sentences and paragraphs of the source and target languages. We use a small bilingual dictionary of Persian-English nouns, punctuation marks, and length similarity as alignment metrics. We combine these features in a linear model and use a genetic algorithm to learn the weights of the linear equation. Evaluation results show that the extracted features improve on the baseline model, which is purely length-based.
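To make the idea of a weighted linear combination of features concrete, here is a minimal Python sketch. It is not the authors' implementation: the feature definitions, the toy dictionary `NOUN_DICT`, the sample pairs in `GOLD`, and every genetic algorithm setting (population size, uniform crossover, Gaussian mutation, elitist selection) are assumptions chosen only for illustration.

```python
import random

# Hypothetical toy resources for illustration; none of this data is from the paper.
NOUN_DICT = {"book": "کتاب", "water": "آب", "city": "شهر"}
EN_PUNCT = ".,;:!?"
FA_PUNCT = ".،؛:!؟"
GOLD = [  # (English sentence, its Persian translation)
    ("The book is on the table.", "کتاب روی میز است."),
    ("Is there water in the city, or not?", "آیا در شهر آب هست، یا نه؟"),
    ("He left.", "او رفت."),
]

def ratio(a, b):
    """Symmetric similarity of two non-negative counts, in [0, 1]."""
    return 1.0 if a == b else min(a, b) / max(a, b)

def features(en, fa):
    """Return (length, dictionary, punctuation) similarities for one candidate pair."""
    length_sim = ratio(len(en), len(fa))
    words = [w.strip(EN_PUNCT) for w in en.lower().split()]
    hits = [w for w in words if w in NOUN_DICT]
    # Fraction of known English nouns whose Persian translation occurs in the target.
    dict_sim = sum(NOUN_DICT[w] in fa for w in hits) / len(hits) if hits else 0.0
    punct_sim = ratio(sum(en.count(p) for p in EN_PUNCT),
                      sum(fa.count(p) for p in FA_PUNCT))
    return length_sim, dict_sim, punct_sim

def score(weights, en, fa):
    """Linear model: weighted sum of the three features."""
    return sum(w * f for w, f in zip(weights, features(en, fa)))

def accuracy(weights, gold):
    """Share of source sentences whose best-scoring target is the gold one."""
    targets = [fa for _, fa in gold]
    correct = 0
    for en, fa in gold:
        best = max(targets, key=lambda t: score(weights, en, t))
        correct += best == fa
    return correct / len(gold)

def learn_weights(gold, generations=30, pop_size=20, seed=0):
    """Toy genetic algorithm over weight vectors in [0, 1]^3."""
    rng = random.Random(seed)
    # Start from the length-only baseline plus random individuals. Because the
    # better half always survives, the result is never worse than the baseline
    # on these training pairs.
    population = [[1.0, 0.0, 0.0]]
    population += [[rng.random() for _ in range(3)] for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=lambda w: accuracy(w, gold), reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(genes) for genes in zip(a, b)]  # uniform crossover
            i = rng.randrange(3)  # mutate one weight, clamped to [0, 1]
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.1)))
            children.append(child)
        population = parents + children
    return max(population, key=lambda w: accuracy(w, gold))
```

Calling `learn_weights(GOLD)` returns a three-element weight vector, and `accuracy(weights, GOLD)` can then be compared against `accuracy([1.0, 0.0, 0.0], GOLD)`, the length-only baseline. A real setup would evaluate on held-out aligned data rather than on the training pairs.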


Keywords: Sentence Alignment, Paragraph Alignment, Parallel Corpus, Bilingual Corpus, Persian, English, Machine Translation





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Mohammad Sadegh Rasooli (1)
  • Omid Kashefi (1)
  • Behrouz Minaei-Bidgoli (1)
  1. Department of Computer Engineering, Iran University of Science and Technology, Iran
