Abstract
Face-to-face communication leads to better interactions between speakers than text-to-text conversation, since speakers can capture both textual and visual signals. The image-grounded emotional response generation (IgERG) task requires a chatbot to generate a response by understanding both the textual context and the speaker's emotions conveyed in visual signals. Pre-trained models enhance many NLP and CV tasks, and image-text pre-training likewise benefits multimodal tasks. However, existing image-text pre-training methods typically pre-train on images by recognizing or modeling objects, while ignoring the emotions expressed in the images. In this paper, we propose several pre-training tasks in a unified framework that not only captures emotions from images but also learns to incorporate these emotions into text generation. The pre-training involves single-modal learning to strengthen the ability to understand images and generate texts, as well as cross-modal learning to enhance interactions between images and texts. Experiments verify the effectiveness of our method in terms of appropriateness, informativeness, and emotion consistency.
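To make the framework concrete, below is a minimal, hypothetical PyTorch sketch of how a unified multi-task objective of this kind might combine a single-modal text-generation loss, a single-modal image emotion-recognition loss, and a cross-modal image-text matching loss. The module names, feature dimensions, and loss weights are illustrative assumptions and are not the authors' implementation.

```python
# Hypothetical sketch of a unified emotion-aware multimodal pre-training
# objective; module sizes and loss weights are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnifiedPretrainer(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, img_dim=2048, num_emotions=7):
        super().__init__()
        # Single-modal text branch: a small Transformer for (masked) language modeling.
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)
        # Single-modal image branch: project pre-extracted visual features and
        # classify the emotion expressed in the image.
        self.img_proj = nn.Linear(img_dim, d_model)
        self.emotion_head = nn.Linear(d_model, num_emotions)
        # Cross-modal branch: predict whether the image and text are paired.
        self.match_head = nn.Linear(2 * d_model, 2)

    def forward(self, tokens, lm_labels, img_feats, emotion_labels, match_labels):
        # 1) Single-modal text loss: token prediction on the conversation text.
        h_text = self.text_encoder(self.tok_emb(tokens))          # (B, T, d)
        lm_loss = F.cross_entropy(self.lm_head(h_text).flatten(0, 1),
                                  lm_labels.flatten())
        # 2) Single-modal image loss: emotion recognition from visual features.
        h_img = self.img_proj(img_feats)                          # (B, d)
        emo_loss = F.cross_entropy(self.emotion_head(h_img), emotion_labels)
        # 3) Cross-modal loss: image-text matching from pooled representations.
        pooled = torch.cat([h_text.mean(dim=1), h_img], dim=-1)
        match_loss = F.cross_entropy(self.match_head(pooled), match_labels)
        # Weighted sum of the objectives (weights are assumed, not reported).
        return lm_loss + 0.5 * emo_loss + 0.5 * match_loss


if __name__ == "__main__":
    model = UnifiedPretrainer()
    B, T = 2, 16
    loss = model(torch.randint(0, 30522, (B, T)),   # input tokens
                 torch.randint(0, 30522, (B, T)),   # generation targets
                 torch.randn(B, 2048),              # image features (e.g. from a CNN)
                 torch.randint(0, 7, (B,)),         # emotion labels
                 torch.randint(0, 2, (B,)))         # matched / mismatched labels
    loss.backward()
```

The sketch only illustrates how single-modal and cross-modal objectives can be optimized jointly; the actual framework uses its own encoders and task definitions.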
Z. Tian and Z. Wen contributed equally to this work.
Notes
1. 81.1% of samples in IgERG contain multiple emotions.
3. It requires roughly 64 GPUs for more than two weeks.
6. Our code is available at: github.com/stupidHIGH/MM-Pre-train.
Acknowledgement
Research on this paper was supported by the Hong Kong Research Grants Council under Grant No. 16204920 and the National Natural Science Foundation of China under Grant Nos. 62025208 and 62106275.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Tian, Z. et al. (2022). Emotion-Aware Multimodal Pre-training for Image-Grounded Emotional Response Generation. In: Bhattacharya, A., et al. Database Systems for Advanced Applications. DASFAA 2022. Lecture Notes in Computer Science, vol 13247. Springer, Cham. https://doi.org/10.1007/978-3-031-00129-1_1
DOI: https://doi.org/10.1007/978-3-031-00129-1_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-00128-4
Online ISBN: 978-3-031-00129-1
eBook Packages: Computer Science (R0)