Aligning the Un-Alignable — A Pilot Study Using a Noisy Corpus of Nonstandardized, Semi-parallel Texts

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2012)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7182)

Abstract

We present the outline of a robust, precision-oriented alignment method for a corpus of comparable texts that lack standardized spelling and sentence boundary marking. The method uses a bilingual dictionary to identify comparable sequences across a source and a target text, assigns each sequence a confidence score using various methods, and keeps only the highest-scoring sequences. For comparison, a conventional alignment is performed after heuristic sentence splitting. Both methods are evaluated on transcriptions of two historical documents in different Early New High German dialects, and the proposed method is found to outperform the conventional one by a wide margin.
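The abstract gives only the outline of the method, so the following is a minimal sketch of the general idea rather than the paper's algorithm. It assumes a fixed-size token window, a confidence score defined as the fraction of source tokens with a dictionary translation in the target window, and a greedy selection of non-overlapping spans. The function names, the window size, the threshold, and the toy lexicon and sentences are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# (src_start, src_end, tgt_start, tgt_end, score); end indices are exclusive.
Span = Tuple[int, int, int, int, float]


def candidate_spans(src: List[str], tgt: List[str],
                    lexicon: Dict[str, Set[str]],
                    window: int = 3, min_score: float = 0.5) -> List[Span]:
    """Score every pair of fixed-size windows by dictionary coverage.

    Assumption: the score is the share of source tokens in the window that
    have at least one dictionary translation inside the target window.
    """
    spans: List[Span] = []
    for i in range(len(src) - window + 1):
        src_win = src[i:i + window]
        for j in range(len(tgt) - window + 1):
            tgt_win = set(tgt[j:j + window])
            hits = sum(1 for w in src_win if lexicon.get(w, set()) & tgt_win)
            score = hits / window
            if score >= min_score:
                spans.append((i, i + window, j, j + window, score))
    return spans


def keep_best(spans: List[Span]) -> List[Span]:
    """Greedily keep the highest-scoring spans that do not overlap an
    already kept span on either the source or the target side."""
    kept: List[Span] = []
    for s in sorted(spans, key=lambda s: -s[4]):
        src_free = all(s[1] <= k[0] or s[0] >= k[1] for k in kept)
        tgt_free = all(s[3] <= k[2] or s[2] >= k[3] for k in kept)
        if src_free and tgt_free:
            kept.append(s)
    return sorted(kept)


# Invented toy data: two spelling variants of the same phrase and a
# hand-made lexicon that links them.
src = "vnd got sprach es werde liecht".split()
tgt = "und gott sprach es werd licht".split()
lexicon = {"vnd": {"und"}, "got": {"gott"}, "sprach": {"sprach"},
           "es": {"es"}, "werde": {"werd"}, "liecht": {"licht"}}

aligned = keep_best(candidate_spans(src, tgt, lexicon, window=3, min_score=1.0))
print(aligned)  # [(0, 3, 0, 3, 1.0), (3, 6, 3, 6, 1.0)]
```

Setting `min_score` high trades recall for precision, which matches the precision-oriented goal stated above. No sentence boundaries are needed, because the windows are defined over raw token positions.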

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Petran, F. (2012). Aligning the Un-Alignable — A Pilot Study Using a Noisy Corpus of Nonstandardized, Semi-parallel Texts. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2012. Lecture Notes in Computer Science, vol 7182. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28601-8_9

  • DOI: https://doi.org/10.1007/978-3-642-28601-8_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28600-1

  • Online ISBN: 978-3-642-28601-8

  • eBook Packages: Computer Science (R0)