A Sentence Meaning Based Alignment Method for Parallel Text Corpora Preparation

  • Conference paper
New Perspectives in Information Systems and Technologies, Volume 1

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 275)

Abstract

Text alignment is crucial to the accuracy of Machine Translation (MT) systems, of some NLP tools, and of any other text processing task that requires bilingual data. This research proposes a language-independent sentence alignment approach based on Polish-to-English experiments (Polish is not a position-sensitive language). The approach was developed on the TED Talks corpus but can be used for any text domain or language pair. It implements various heuristics for sentence recognition, some of which use synonyms and semantic text structure analysis as additional information. The approach is designed to minimize data loss. The solution is compared with other sentence alignment implementations, and an improvement in MT system score on text processed with the described tool is shown.

Author information

Corresponding author

Correspondence to Krzysztof Wołk.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Wołk, K., Marasek, K. (2014). A Sentence Meaning Based Alignment Method for Parallel Text Corpora Preparation. In: Rocha, Á., Correia, A., Tan, F., Stroetmann, K. (eds) New Perspectives in Information Systems and Technologies, Volume 1. Advances in Intelligent Systems and Computing, vol 275. Springer, Cham. https://doi.org/10.1007/978-3-319-05951-8_22

  • DOI: https://doi.org/10.1007/978-3-319-05951-8_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-05950-1

  • Online ISBN: 978-3-319-05951-8

  • eBook Packages: Engineering, Engineering (R0)
