Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5459)

Abstract

This paper describes a methodology for constructing aligned German-Chinese corpora from movie subtitles. The corpora will be used to train a specialized machine translation system intended to translate subtitles automatically between German and Chinese. Because the common length-based alignment algorithm performs poorly on short spoken sentences, especially when the two languages belong to different language families, this paper investigates dynamic programming based on the time-shift information in subtitles and extends it with statistical lexical cues to align the subtitles. In our experiment with around 4,000 Chinese and German sentences, the proposed alignment approach yields 83.8% precision. Furthermore, the approach is language-independent and leads to a general method for building parallel corpora between different language families.
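To make the idea of time-based alignment concrete, the following is a minimal sketch of dynamic programming over subtitle timestamps. It is an illustration under stated assumptions, not the authors' implementation: the cue format `(start, end, text)`, the `SKIP_COST` penalty, the set of allowed groupings (1-1, 1-0, 0-1, 2-1, 1-2), and the cost function are all hypothetical choices, and the statistical lexical cues mentioned in the abstract are left out.

```python
from typing import List, Tuple

# A subtitle cue: (start time in seconds, end time in seconds, text).
# This tuple layout is an assumption made for the sketch.
Cue = Tuple[float, float, str]

SKIP_COST = 2.0  # hypothetical penalty for leaving a cue unaligned

def span_cost(a: List[Cue], b: List[Cue]) -> float:
    """Timing mismatch between two groups of consecutive cues:
    difference of the group start times plus difference of the end times."""
    return abs(a[0][0] - b[0][0]) + abs(a[-1][1] - b[-1][1])

def align(src: List[Cue], tgt: List[Cue]) -> List[Tuple[List[str], List[str]]]:
    """Return the minimum-cost sequence of aligned text groups."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # Allowed groupings: 1-1, 1-0, 0-1, 2-1, 1-2 (assumed, not from the paper).
    moves = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                if di == 0 or dj == 0:
                    step = SKIP_COST
                else:
                    step = span_cost(src[i:ni], tgt[j:nj])
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the alignment.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append(([c[2] for c in src[pi:i]], [c[2] for c in tgt[pj:j]]))
        i, j = pi, pj
    beads.reverse()
    return beads

# Toy data invented for illustration only.
german = [(1.0, 2.5, "Hallo."), (3.0, 5.0, "Wie geht es dir?"),
          (6.0, 7.0, "Gut."), (7.2, 8.0, "Danke.")]
chinese = [(1.1, 2.4, "你好。"), (3.1, 5.1, "你好吗?"),
           (6.0, 8.1, "很好,谢谢。")]

for de_group, zh_group in align(german, chinese):
    print(de_group, "<->", zh_group)
```

Because the cost depends only on timestamps, nothing in the sketch is specific to German or Chinese, which mirrors the abstract's point about language independence. A lexical term could be added to `span_cost` to approximate the extension the abstract mentions.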

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Xiao, H., Wang, X. (2009). Constructing Parallel Corpus from Movie Subtitles. In: Li, W., Mollá-Aliod, D. (eds) Computer Processing of Oriental Languages. Language Technology for the Knowledge-based Economy. ICCPOL 2009. Lecture Notes in Computer Science, vol 5459. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00831-3_32

  • DOI: https://doi.org/10.1007/978-3-642-00831-3_32

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-00830-6

  • Online ISBN: 978-3-642-00831-3

  • eBook Packages: Computer Science, Computer Science (R0)
