Extracting Parallel Paragraphs and Sentences from English-Persian Translated Documents

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7097)

Abstract

Sentence and paragraph alignment is essential for preparing the parallel texts needed in applications such as machine translation. The lack of sufficient linguistic data for under-resourced languages like Persian makes this task challenging. In this paper, we propose a hybrid sentence and paragraph alignment model for Persian-English parallel documents, based on simple linguistic features as well as length similarity between sentences and paragraphs of the source and target languages. We use a small bilingual dictionary of Persian-English nouns, punctuation marks, and length similarity as alignment metrics. We combine these features in a linear model and use a genetic algorithm to learn the weights of the linear equation. Evaluation results show that the extracted features improve on the baseline model, which is purely length-based.
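To make the idea of a linear feature combination with genetically learned weights concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the feature definitions, the fitness function, or the genetic operators, so everything below is an assumption chosen for illustration. Each candidate alignment is assumed to be described by three precomputed feature values in [0, 1] (length similarity, punctuation similarity, dictionary overlap). The names `score`, `fitness`, `learn_weights`, `EXAMPLES`, and `BASELINE` are hypothetical, and the toy data is invented.

```python
import random

# Hypothetical toy data. Each example is (candidates, gold_index), where a
# candidate is a feature tuple: (length_sim, punctuation_sim, dictionary_overlap).
EXAMPLES = [
    ([(0.90, 0.5, 0.1), (0.80, 1.0, 0.9)], 1),  # length alone prefers the wrong one
    ([(0.95, 1.0, 0.8), (0.60, 0.5, 0.0)], 0),  # length alone is already right
]

# Length-only baseline: all weight on the first feature.
BASELINE = [1.0, 0.0, 0.0]


def score(weights, features):
    """Linear model: weighted sum of the alignment features."""
    return sum(w * f for w, f in zip(weights, features))


def fitness(weights, examples):
    """Fraction of examples whose gold candidate receives the top score."""
    correct = 0
    for candidates, gold_index in examples:
        scores = [score(weights, c) for c in candidates]
        if scores.index(max(scores)) == gold_index:
            correct += 1
    return correct / len(examples)


def learn_weights(examples, n_features=3, pop_size=30, generations=40, seed=0):
    """Toy genetic algorithm: truncation selection, uniform crossover,
    Gaussian mutation of one gene. Parents survive unchanged (elitism)."""
    rng = random.Random(seed)
    # Start from the baseline plus random weight vectors in [0, 1].
    population = [list(BASELINE[:n_features])]
    while len(population) < pop_size:
        population.append([rng.random() for _ in range(n_features)])

    for _ in range(generations):
        ranked = sorted(population, key=lambda w: fitness(w, examples), reverse=True)
        parents = ranked[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            i = rng.randrange(n_features)
            # Mutate one weight and clamp it back into [0, 1].
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0.0, 0.1)))
            children.append(child)
        population = parents + children

    return max(population, key=lambda w: fitness(w, examples))


if __name__ == "__main__":
    best = learn_weights(EXAMPLES)
    print("baseline accuracy:", fitness(BASELINE, EXAMPLES))
    print("learned accuracy: ", fitness(best, EXAMPLES))
```

Because the baseline weight vector is placed in the initial population and the best half of each generation is carried over unchanged, the learned weights can never score below the length-only baseline on the training examples. A real system would evaluate on held-out aligned documents rather than on the data used for learning.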

This paper is funded by the Computer Research Center of Islamic Sciences (CRCIS).

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Rasooli, M.S., Kashefi, O., Minaei-Bidgoli, B. (2011). Extracting Parallel Paragraphs and Sentences from English-Persian Translated Documents. In: Salem, M.V.M., Shaalan, K., Oroumchian, F., Shakery, A., Khelalfa, H. (eds) Information Retrieval Technology. AIRS 2011. Lecture Notes in Computer Science, vol 7097. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25631-8_52

  • DOI: https://doi.org/10.1007/978-3-642-25631-8_52

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-25630-1

  • Online ISBN: 978-3-642-25631-8

  • eBook Packages: Computer Science (R0)
