Abstract
Due to the lack of suitable resources, few researchers have investigated how to improve the machine translation (MT) of conversational materials by exploiting their internal structure. In this chapter, we propose a novel strategy for automatically constructing a parallel dialogue corpus by bridging two kinds of resources: movie subtitles and movie scripts. First, we collected parallel subtitles and their corresponding monolingual scripts from the Internet. After sentence alignment, we projected all useful information from the script side onto the corresponding subtitle side. Finally, we automatically built a Chinese-English dialogue corpus, which contains bilingual subtitle utterances, speaker names and actions, scene descriptions and boundaries, and script sentences. To demonstrate the usefulness of our data, we used speaker name tags to improve translation performance. Our experiments showed that our approach achieved 81.79% accuracy in speaker name annotation, and that speaker-based model adaptation yielded an improvement of around 0.5 BLEU (bilingual evaluation understudy) points in translation quality. We believe that our resources will benefit various tasks, such as dialogue systems, image/movie description, and MT.
Notes
1. We released our DCU-Huawei Chinese-English Dialogue Corpus 1.0 at http://computing.dcu.ie/~lwang/resource.html.
2. Available at https://www.seas.upenn.edu/~pdtb. Accessed 23 May 2018.
3. Available at http://www.imsdb.com. Accessed 23 May 2018.
4. Available at http://www.imdb.com. Accessed 23 May 2018.
5. Available at https://lucene.apache.org. Accessed 23 May 2018.
6. Available at http://opus.lingfil.uu.se/OpenSubtitles2016.php. Accessed 23 May 2018.
7. Internet Movie Script Database, available at http://www.imsdb.com. Accessed 23 May 2018.
Acknowledgments
This work was mostly done while the authors were working at the ADAPT Centre, Dublin City University. It was supported by the Science Foundation Ireland (SFI) ADAPT project (grant no. 13/RC/2106) and partly supported by the XJTLU KSF Project (grant no. KSF-E-24) and the Open Projects Program of GDFS Translation Studies Centre (grant no. TSC201501).
Copyright information
© 2023 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Zhang, X., Wang, L., Way, A., Liu, Q. (2023). Automatic Construction of Parallel Dialogue Corpora with Rich Information. In: Huang, CR., Hsieh, SK., Jin, P. (eds) Chinese Language Resources. Text, Speech and Language Technology, vol 49. Springer, Cham. https://doi.org/10.1007/978-3-031-38913-9_15
DOI: https://doi.org/10.1007/978-3-031-38913-9_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-38912-2
Online ISBN: 978-3-031-38913-9
eBook Packages: Education (R0)