Automatic Construction of Parallel Dialogue Corpora with Rich Information

Chinese Language Resources

Part of the book series: Text, Speech and Language Technology ((TLTB,volume 49))

Abstract

Because suitable resources are scarce, few researchers have investigated how to improve the machine translation (MT) of conversational material by exploiting its internal structure. In this chapter, we propose a novel strategy for automatically constructing a parallel dialogue corpus by bridging two kinds of resources: movie subtitles and movie scripts. First, we collected parallel subtitles and their corresponding monolingual scripts from the Internet. After sentence alignment, we projected all useful information from the script side onto the corresponding subtitle side. Finally, we automatically built a Chinese-English dialogue corpus that contains bilingual subtitle utterances, speaker names and actions, scene descriptions and boundaries, and script sentences. To demonstrate the usefulness of our data, we used speaker name tags to improve translation performance. Our experiments showed that our approach achieved 81.79% accuracy in speaker name annotation, and that speaker-based model adaptation yielded an improvement of around 0.5 BLEU (bilingual evaluation understudy) points in translation quality. We believe that our resources will benefit various tasks, such as dialogue systems, image/movie description, and MT.
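The idea of aligning subtitle sentences with script sentences and then projecting script-side information onto the subtitles can be illustrated with a minimal sketch. It is not the chapter's actual pipeline: the character-level similarity from Python's `difflib`, the `threshold` value, the function name `project_speakers`, and the toy data are all assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a real alignment score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def project_speakers(subtitles, script, threshold=0.6):
    """Attach a speaker name from the script to each bilingual subtitle pair.

    subtitles: list of (english, chinese) subtitle pairs
    script:    list of (speaker, utterance) pairs from the monolingual script
    threshold: hypothetical cut-off below which no speaker is projected
    """
    annotated = []
    for en, zh in subtitles:
        best_speaker, best_score = None, 0.0
        # Find the script utterance most similar to the English subtitle.
        for speaker, utterance in script:
            score = similarity(en, utterance)
            if score > best_score:
                best_speaker, best_score = speaker, score
        # Leave the subtitle untagged when no script line is close enough.
        if best_score < threshold:
            best_speaker = None
        annotated.append({"speaker": best_speaker, "en": en, "zh": zh})
    return annotated


# Toy data, invented for this example.
script = [
    ("ANNA", "Where did you put the keys?"),
    ("BEN", "They are on the kitchen table."),
]
subtitles = [
    ("Where did you put the keys", "你把钥匙放哪儿了？"),
    ("They're on the kitchen table.", "在厨房的桌子上。"),
    ("[music playing]", "[音乐]"),
]

for row in project_speakers(subtitles, script):
    print(row["speaker"], "|", row["en"], "|", row["zh"])
```

In this sketch the bilingual pair stays intact and only the speaker label is carried over; other script-side information named in the abstract (actions, scene descriptions and boundaries) could be projected the same way once a subtitle is matched to a script line.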

Notes

  1. We released our DCU-Huawei Chinese-English Dialogue Corpus 1.0 at http://computing.dcu.ie/~lwang/resource.html.

  2. Available at https://www.seas.upenn.edu/~pdtb. Accessed 23 May 2018.

  3. Available at http://www.imsdb.com. Accessed 23 May 2018.

  4. Available at http://www.imdb.com. Accessed 23 May 2018.

  5. Available at https://lucene.apache.org. Accessed 23 May 2018.

  6. Available at http://opus.lingfil.uu.se/OpenSubtitles2016.php. Accessed 23 May 2018.

  7. Internet Movie Script Database, available at http://www.imsdb.com. Accessed 23 May 2018.

Acknowledgments

This work was mostly done while the authors were working at the ADAPT Centre, Dublin City University. This work is supported by the Science Foundation Ireland (SFI) ADAPT project (grant no. 13/RC/2106) and partly supported by the XJTLU KSF Project (grant no. KSF-E-24) and the Open Projects Program of GDFS Translation Studies Centre (grant no. TSC201501).

Author information

Corresponding author

Correspondence to Xiaojun Zhang.

Copyright information

© 2023 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Zhang, X., Wang, L., Way, A., Liu, Q. (2023). Automatic Construction of Parallel Dialogue Corpora with Rich Information. In: Huang, CR., Hsieh, SK., Jin, P. (eds) Chinese Language Resources. Text, Speech and Language Technology, vol 49. Springer, Cham. https://doi.org/10.1007/978-3-031-38913-9_15

  • DOI: https://doi.org/10.1007/978-3-031-38913-9_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-38912-2

  • Online ISBN: 978-3-031-38913-9

  • eBook Packages: Education (R0)
