Abstract
Sentence and paragraph alignment is essential for preparing the parallel texts needed in applications such as machine translation. For under-resourced languages like Persian, the lack of sufficient linguistic data makes this task challenging. In this paper, we propose a hybrid sentence and paragraph alignment model for Persian-English parallel documents, based on simple linguistic features and on length similarity between the sentences and paragraphs of the source and target languages. As alignment metrics, we use a small bilingual dictionary of Persian-English nouns, punctuation marks, and length similarity. We combine these features in a linear model and use a genetic algorithm to learn the weights of the linear equation. Evaluation results show that the extracted features improve on the baseline model, which is purely length-based.
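The combination described in the abstract can be illustrated with a minimal sketch. It is not the implementation from the paper: the feature definitions, the transliterated toy lexicon, the toy sentences, the fitness function (number of correctly chosen candidates), and all genetic algorithm settings are assumptions made only for illustration.

```python
import random

MARKS = ".,;:!?"  # assumed punctuation set; the paper does not list its marks here

def ratio(a, b):
    """Ratio of the smaller to the larger value, in [0, 1]."""
    return min(a, b) / max(a, b) if max(a, b) else 1.0

def tokens(text):
    """Lowercased whitespace tokens with surrounding punctuation stripped."""
    return [w.strip(MARKS) for w in text.lower().split()]

def features(src, tgt, lexicon):
    """Return (length similarity, dictionary overlap, punctuation similarity)."""
    length_sim = ratio(len(src), len(tgt))
    src_words, tgt_words = tokens(src), set(tokens(tgt))
    hits = sum(1 for w in src_words if lexicon.get(w) in tgt_words)
    dict_overlap = hits / len(src_words) if src_words else 0.0
    punct_sim = ratio(sum(c in MARKS for c in src), sum(c in MARKS for c in tgt))
    return (length_sim, dict_overlap, punct_sim)

def score(weights, feats):
    """Linear model: weighted sum of the feature values."""
    return sum(w * f for w, f in zip(weights, feats))

def fitness(weights, examples):
    """Number of examples whose gold candidate receives the highest score."""
    correct = 0
    for candidates, gold_index in examples:
        scores = [score(weights, f) for f in candidates]
        if scores.index(max(scores)) == gold_index:
            correct += 1
    return correct

def learn_weights(examples, n_features=3, pop_size=20, generations=30, seed=0):
    """Tiny genetic algorithm: truncation selection, uniform crossover, mutation."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda w: fitness(w, examples), reverse=True)
        parents = population[: pop_size // 2]  # the fitter half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            i = rng.randrange(n_features)
            child[i] = max(0.0, child[i] + rng.gauss(0, 0.1))  # mutate one weight
            children.append(child)
        population = parents + children
    return max(population, key=lambda w: fitness(w, examples))

# Hypothetical toy data: transliterated Persian source sentences, each with
# two English candidates and the index of the correct one.
lexicon = {"ketab": "book", "madreseh": "school"}
raw = [
    ("in ketab ast.", ["this is a book.", "yes it is so."], 0),
    ("madreseh, ketab.", ["we left.", "school, book."], 1),
]
examples = [([features(s, t, lexicon) for t in cands], gold) for s, cands, gold in raw]

baseline = [1.0, 0.0, 0.0]  # length-only model
best = learn_weights(examples)
print("length-only correct:", fitness(baseline, examples))
print("learned weights correct:", fitness(best, examples))
```

In the first toy example the wrong candidate matches the source length more closely than the correct one, so a length-only scorer picks it, while a weight vector that gives enough credit to the dictionary feature recovers the correct candidate.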
This paper is funded by the Computer Research Center of Islamic Sciences (CRCIS).
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Rasooli, M.S., Kashefi, O., Minaei-Bidgoli, B. (2011). Extracting Parallel Paragraphs and Sentences from English-Persian Translated Documents. In: Salem, M.V.M., Shaalan, K., Oroumchian, F., Shakery, A., Khelalfa, H. (eds) Information Retrieval Technology. AIRS 2011. Lecture Notes in Computer Science, vol 7097. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25631-8_52
DOI: https://doi.org/10.1007/978-3-642-25631-8_52
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25630-1
Online ISBN: 978-3-642-25631-8
eBook Packages: Computer Science (R0)