Abstract
We outline a robust, precision-oriented alignment method for a corpus of comparable texts that lack standardized spelling and sentence boundary marking. The method uses a bilingual dictionary to identify comparable sequences in a source and a target text, assigns each sequence a confidence score using several methods, and keeps only the highest-scoring sequences. For comparison, a conventional alignment is run after heuristic sentence splitting. Both methods are evaluated on transcriptions of two historical documents in different Early New High German dialects, and the proposed method outperforms the conventional one by a wide margin.
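As a rough illustration of the pipeline described in the abstract (dictionary lookup, confidence scoring, keeping the best sequences), the following minimal sketch may help. It is not the paper's implementation: the function names, the run-length confidence score, the `min_len` cutoff, and the toy tokens are all assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple

# (source start, target start, length, confidence score)
Span = Tuple[int, int, int, float]


def candidate_spans(src: List[str], tgt: List[str],
                    lexicon: Dict[str, Set[str]],
                    min_len: int = 2) -> List[Span]:
    """Collect maximal runs where src[i+k] maps to tgt[j+k] via the lexicon.

    Hypothetical helper: the confidence score is simply the run length
    divided by the length of the shorter text, a stand-in for the
    scoring methods the abstract mentions but does not specify.
    """
    def match(i: int, j: int) -> bool:
        return tgt[j] in lexicon.get(src[i], set())

    spans: List[Span] = []
    shorter = min(len(src), len(tgt))
    for i in range(len(src)):
        for j in range(len(tgt)):
            if not match(i, j):
                continue
            # Skip pairs that merely continue a run started earlier.
            if i > 0 and j > 0 and match(i - 1, j - 1):
                continue
            k = 0
            while i + k < len(src) and j + k < len(tgt) and match(i + k, j + k):
                k += 1
            if k >= min_len:
                spans.append((i, j, k, k / shorter))
    return spans


def keep_best(spans: List[Span], top_n: int = 1) -> List[Span]:
    """Precision-oriented filter: keep only the top-scoring spans."""
    return sorted(spans, key=lambda s: s[3], reverse=True)[:top_n]


# Invented toy data with non-standardized spellings on the target side.
lexicon = {"der": {"der"}, "herr": {"herre"}, "sprach": {"sprach"},
           "zu": {"czu"}, "im": {"ym"}}
src = ["der", "herr", "sprach", "zu", "im"]
tgt = ["vnd", "der", "herre", "sprach", "czu", "ym", "do", "der"]

best = keep_best(candidate_spans(src, tgt, lexicon))
```

In this sketch no sentence splitting is needed: matching operates directly on token sequences, and the isolated second "der" in the target is discarded because its run is shorter than `min_len`.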
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Petran, F. (2012). Aligning the Un-Alignable — A Pilot Study Using a Noisy Corpus of Nonstandardized, Semi-parallel Texts. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2012. Lecture Notes in Computer Science, vol 7182. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28601-8_9
DOI: https://doi.org/10.1007/978-3-642-28601-8_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28600-1
Online ISBN: 978-3-642-28601-8
eBook Packages: Computer Science, Computer Science (R0)