Abstract
Text alignment is crucial to the accuracy of Machine Translation (MT) systems, some NLP tools, and any other text processing task that requires bilingual data. This research proposes a language-independent sentence alignment approach based on Polish-to-English experiments (Polish is not a position-sensitive language). The approach was developed on the TED Talks corpus but can be used for any text domain or language pair. It implements various heuristics for sentence recognition, some of which use synonyms and semantic text structure analysis as additional information. Data loss was kept to a minimum. The solution is compared to other sentence alignment implementations, and an improvement in MT system score on text processed with the described tool is shown.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Wołk, K., Marasek, K. (2014). A Sentence Meaning Based Alignment Method for Parallel Text Corpora Preparation. In: Rocha, Á., Correia, A., Tan, F., Stroetmann, K. (eds) New Perspectives in Information Systems and Technologies, Volume 1. Advances in Intelligent Systems and Computing, vol 275. Springer, Cham. https://doi.org/10.1007/978-3-319-05951-8_22
DOI: https://doi.org/10.1007/978-3-319-05951-8_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-05950-1
Online ISBN: 978-3-319-05951-8
eBook Packages: Engineering, Engineering (R0)