A Sentence Meaning Based Alignment Method for Parallel Text Corpora Preparation

Conference paper

DOI: 10.1007/978-3-319-05951-8_22

Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 275)
Cite this paper as:
Wołk K., Marasek K. (2014) A Sentence Meaning Based Alignment Method for Parallel Text Corpora Preparation. In: Rocha Á., Correia A., Tan F., Stroetmann K. (eds) New Perspectives in Information Systems and Technologies, Volume 1. Advances in Intelligent Systems and Computing, vol 275. Springer, Cham

Abstract

Text alignment is crucial to the accuracy of Machine Translation (MT) systems, some NLP tools, and any other text processing task that requires bilingual data. This research proposes a language-independent sentence alignment approach, based on experiments translating from Polish (a language that is not position-sensitive) to English. The approach was developed on the TED Talks corpus, but it can be used for any text domain or language pair. It implements various heuristics for sentence recognition, some of which use synonyms and semantic analysis of text structure as additional information. Minimization of data loss was ensured. The solution is compared to other sentence alignment implementations. An improvement in MT system score on text processed with the described tool is also shown.

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. Department of Multimedia, Polish Japanese Institute of Information Technology, Warsaw, Poland
