Automatic Construction of Parallel Dialogue Corpora with Rich Information

Zhang, Xiaojun; Wang, Longyue; Way, Andy; Liu, Qun

doi:10.1007/978-3-031-38913-9_15

Part of the book series: Text, Speech and Language Technology (TLTB, volume 49)

Abstract

Due to the lack of ideal resources, few researchers have investigated how to improve the machine translation (MT) of conversational materials by exploiting their internal structure. In this chapter, we propose a novel strategy to automatically construct a parallel dialogue corpus by bridging two kinds of resources: movie subtitles and movie scripts. First, we collected parallel subtitles and their corresponding monolingual scripts from the Internet. After sentence alignment, we projected all useful information from the script side to its corresponding subtitle side. Finally, we automatically built a Chinese-English dialogue corpus, which contains bilingual subtitle utterances, speaker names and actions, scene descriptions and boundaries, and script sentences. To demonstrate the usefulness of our data, we used speaker name tags to improve translation performance. Our experiments showed that our approach achieved 81.79% accuracy in speaker name annotation, and that speaker-based model adaptation improved translation quality by around 0.5 BLEU (bilingual evaluation understudy) points. We believe that our resources will benefit various tasks, such as dialogue systems, image/movie descriptions, and MT.
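
The projection step described in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that each English subtitle line is matched to the most similar script utterance by bag-of-words cosine similarity and that the speaker name is copied across when the score passes a threshold. The abstract does not specify the matching method, so the function names, the threshold value, and the sample data below are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def project_speakers(subtitle_pairs, script_lines, threshold=0.5):
    """Copy speaker names from script utterances onto bilingual subtitle pairs.

    subtitle_pairs: list of (chinese, english) subtitle utterances.
    script_lines:   list of (speaker, english_utterance) from the script.
    threshold:      hypothetical minimum similarity for accepting a match.
    """
    script_vecs = [(speaker, Counter(tokenize(utt))) for speaker, utt in script_lines]
    annotated = []
    for zh, en in subtitle_pairs:
        sub_vec = Counter(tokenize(en))
        best_speaker, best_score = None, 0.0
        for speaker, vec in script_vecs:
            score = cosine(sub_vec, vec)
            if score > best_score:
                best_speaker, best_score = speaker, score
        if best_score < threshold:
            best_speaker = None  # no script line is similar enough
        annotated.append({"zh": zh, "en": en, "speaker": best_speaker})
    return annotated


# Invented sample data (not taken from the corpus).
script = [
    ("ALICE", "Where are you going?"),
    ("BOB", "I am going home."),
]
subtitles = [
    ("你要去哪里？", "Where are you going?"),
    ("我要回家。", "I'm going home."),
    ("谢谢。", "Thanks."),
]

for row in project_speakers(subtitles, script):
    print(row["speaker"], "|", row["en"], "|", row["zh"])
```

Because the script is monolingual, matching is done on the English side only; the speaker tag then attaches to the whole bilingual pair. Subtitle lines with no sufficiently similar script utterance are left untagged rather than forced onto the nearest match.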

Notes

1. We released our DCU-Huawei Chinese-English Dialogue Corpus 1.0 at http://computing.dcu.ie/~lwang/resource.html.
2. Available at https://www.seas.upenn.edu/~pdtb. Accessed 23 May 2018.
3. Available at http://www.imsdb.com. Accessed 23 May 2018.
4. Available at http://www.imdb.com. Accessed 23 May 2018.
5. Available at https://lucene.apache.org. Accessed 23 May 2018.
6. Available at http://opus.lingfil.uu.se/OpenSubtitles2016.php. Accessed 23 May 2018.
7. Internet Movie Script Database, available at http://www.imsdb.com. Accessed 23 May 2018.

Acknowledgments

This work was mostly done while the authors were working at the ADAPT Centre, Dublin City University. This work is supported by the Science Foundation Ireland (SFI) ADAPT project (grant no. 13/RC/2106) and partly supported by the XJTLU KSF Project (grant no. KSF-E-24) and the Open Projects Program of GDFS Translation Studies Centre (grant no. TSC201501).

Author information

Authors and Affiliations

Xi’an Jiaotong-Liverpool University, Suzhou, China
Xiaojun Zhang
Tencent AI Lab, Shenzhen, China
Longyue Wang
ADAPT Centre, Dublin City University, Dublin, Ireland
Andy Way
Huawei Noah’s Ark Lab, Hong Kong, China
Qun Liu

Corresponding author

Correspondence to Xiaojun Zhang.

Editor information

Editors and Affiliations

Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Kowloon, Hong Kong
Chu-Ren Huang
Graduate Institute of Linguistics, National Taiwan University, Taipei, Taiwan
Shu-Kai Hsieh
School of Electronic Information and Artificial Intelligence, Leshan Normal University, Leshan City, Sichuan, China
Peng Jin

About this chapter

Cite this chapter

Zhang, X., Wang, L., Way, A., Liu, Q. (2023). Automatic Construction of Parallel Dialogue Corpora with Rich Information. In: Huang, CR., Hsieh, SK., Jin, P. (eds) Chinese Language Resources. Text, Speech and Language Technology, vol 49. Springer, Cham. https://doi.org/10.1007/978-3-031-38913-9_15

DOI: https://doi.org/10.1007/978-3-031-38913-9_15
Published: 19 December 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-38912-2
Online ISBN: 978-3-031-38913-9
eBook Packages: Education, Education (R0)
