A Sentence Meaning Based Alignment Method for Parallel Text Corpora Preparation

  • Krzysztof Wołk
  • Krzysztof Marasek
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 275)


Text alignment is crucial to the accuracy of Machine Translation (MT) systems, of some NLP tools, and of any other text processing task that requires bilingual data. This research proposes a language-independent sentence alignment approach, evaluated in Polish-to-English experiments (Polish is not a position-sensitive language, i.e. its word order is relatively free). The approach was developed on the TED Talks corpus, but it can be used for any text domain or language pair. It implements various heuristics for sentence recognition, some of which use synonyms and semantic text structure analysis as additional information. The approach is designed to minimize data loss. The solution is compared with other sentence alignment implementations, and an improvement in MT system scores on text processed with the described tool is also shown.


Machine Translation; Word Order; Sentence Length; Statistical Machine Translation; Parallel Corpus
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. Department of Multimedia, Polish Japanese Institute of Information Technology, Warsaw, Poland
