Emotion-Aware Multimodal Pre-training for Image-Grounded Emotional Response Generation

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13247)

Abstract

Face-to-face communication leads to better interactions between speakers than text-to-text conversation, since the speakers can capture both textual and visual signals. The image-grounded emotional response generation (IgERG) task requires chatbots to generate a response with an understanding of both the textual context and the speaker's emotions conveyed in visual signals. Pre-trained models enhance many NLP and CV tasks, and image-text pre-training also helps multimodal tasks. However, existing image-text pre-training methods typically pre-train on images by recognizing or modeling objects, and ignore the emotions expressed in the images. In this paper, we propose several pre-training tasks in a unified framework that not only captures emotions from images but also learns to incorporate those emotions into text generation. The pre-training involves single-modal learning to strengthen the ability to understand images and generate texts, as well as cross-modal learning to enhance the interactions between images and texts. Experiments verify our method in terms of appropriateness, informativeness, and emotion consistency.
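
To make the pre-training setup described above concrete, the sketch below shows, in PyTorch-style Python, how single-modal objectives (image emotion recognition and text generation) and a cross-modal alignment objective might be combined into one joint loss. This is a minimal illustration under our own assumptions, not the paper's implementation: the class name, feature dimensions, loss weights, and the contrastive alignment term are hypothetical placeholders.

    # Illustrative sketch only; names, dimensions, and weights are hypothetical.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class UnifiedPretrainSketch(nn.Module):
        def __init__(self, img_dim=2048, txt_dim=768, hidden=512,
                     vocab_size=30522, num_emotions=7):
            super().__init__()
            # Single-modal branches: image emotion recognition and text language modeling.
            self.img_proj = nn.Linear(img_dim, hidden)
            self.emotion_head = nn.Linear(hidden, num_emotions)  # emotion-aware image task
            self.txt_proj = nn.Linear(txt_dim, hidden)
            self.lm_head = nn.Linear(hidden, vocab_size)          # text generation task
            self.temperature = 0.07                               # for cross-modal alignment

        def forward(self, img_feats, txt_feats, emotion_labels, lm_labels):
            img_h = self.img_proj(img_feats)      # (B, hidden)
            txt_h = self.txt_proj(txt_feats)      # (B, T, hidden)

            # 1) Single-modal: predict the emotion expressed in the image.
            loss_emo = F.cross_entropy(self.emotion_head(img_h), emotion_labels)

            # 2) Single-modal: token-level language modeling on the text side.
            logits = self.lm_head(txt_h)          # (B, T, vocab)
            loss_lm = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                      lm_labels.reshape(-1), ignore_index=-100)

            # 3) Cross-modal: in-batch contrastive image-text alignment.
            txt_pooled = txt_h.mean(dim=1)
            sim = F.normalize(img_h, dim=-1) @ F.normalize(txt_pooled, dim=-1).T
            targets = torch.arange(sim.size(0), device=sim.device)
            loss_align = F.cross_entropy(sim / self.temperature, targets)

            # Joint objective; the 0.5 weight is arbitrary for illustration.
            return loss_emo + loss_lm + 0.5 * loss_align

    # Hypothetical usage with random tensors standing in for real features.
    model = UnifiedPretrainSketch()
    loss = model(
        img_feats=torch.randn(4, 2048),
        txt_feats=torch.randn(4, 16, 768),
        emotion_labels=torch.randint(0, 7, (4,)),
        lm_labels=torch.randint(0, 30522, (4, 16)),
    )
    loss.backward()

In an actual system, the linear projections would be replaced by the real image encoder and text encoder/decoder, and each loss would be computed on the batches of its corresponding pre-training task.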

Z. Tian and Z. Wen contributed equally to this work.


Notes

  1. 81.1% of the samples in IgERG contain multiple emotions.

  2. console.faceplusplus.com.cn.

  3. It roughly requires 64 GPUs for more than 2 weeks.

  4. dl.fbaipublicfiles.com/fairseq/models/.

  5. huggingface.co/bert-base-uncased.

  6. Our code is available at: github.com/stupidHIGH/MM-Pre-train.

  7. We choose Oscar as our baseline, since Cho et al. [6] and Li et al. [26] reported that Oscar outperforms most existing multimodal pre-training models, including UNITER [5], XGPT [58], VL-BART [6], and VL-T5 [6], on VQA, NLVR, and image captioning.

  8. saifmohammad.com/WebPages/lexicons.html.

References

  1. Agarwal, S., Bui, T., Lee, J.Y., Konstas, I., Rieser, V.: History for visual dialog: do we really need it? In: ACL, pp. 8182–8197 (2020)
  2. Antol, S., et al.: VQA: visual question answering. In: ICCV (2015)
  3. Chen, J., Zhang, L., Bai, C., Kpalma, K.: Review of recent deep learning based methods for image-text retrieval. In: MIPR, pp. 167–172. IEEE (2020)
  4. Chen, S.Y., Hsu, C.C., Kuo, C.C., Ku, L.W., et al.: EmotionLines: an emotion corpus of multi-party conversations. In: ACL (2018)
  5. Chen, Y.C., et al.: UNITER: universal image-text representation learning. In: ECCV (2020)
  6. Cho, J., Lei, J., Tan, H., Bansal, M.: Unifying vision-and-language tasks via text generation. In: ICML, pp. 1931–1942 (2021)
  7. Colombo, P., Witon, W., Modi, A., Kennedy, J., Kapadia, M.: Affect-driven dialog generation. In: ACL, pp. 3734–3743 (2019)
  8. Das, A., et al.: Visual dialog. In: CVPR, pp. 326–335 (2017)
  9. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL (2019)
  10. Doddington, G.: Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In: HLT, pp. 138–145 (2002)
  11. Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychol. Bull. 76(5), 378 (1971)
  12. Gao, J., et al.: Improving empathetic response generation by recognizing emotion cause in conversations. In: EMNLP (Findings), pp. 807–819 (2021)
  13. Gokaslan, A., Cohen, V.: OpenWebText corpus. skylion007.github.io/OpenWebTextCorpus (2019)
  14. Hazarika, D., Poria, S., Zadeh, A., Cambria, E., Morency, L.P., Zimmermann, R.: Conversational memory network for emotion recognition in dyadic dialogue videos. In: NAACL, vol. 2018, p. 2122. NIH Public Access (2018)
  15. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
  16. Huang, H., et al.: M3P: learning universal representations via multitask multilingual multimodal pre-training. In: CVPR (2020)
  17. Huber, B., McDuff, D., Brockett, C., Galley, M., Dolan, B.: Emotional dialogue generation using image-grounded language models. In: CHI, pp. 1–12 (2018)
  18. Lee, D., Tian, Z., Xue, L., Zhang, N.L.: Enhancing content preservation in text style transfer using reverse attention and conditional layer normalization. In: ACL (2021)
  19. Lewis, M., et al.: BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: ACL (2019)
  20. Li, G., Duan, N., Fang, Y., Gong, M., Jiang, D.: Unicoder-VL: a universal encoder for vision and language by cross-modal pre-training. In: AAAI, vol. 34 (2020)
  21. Li, J., Galley, M., Brockett, C., Gao, J., Dolan, B.: A diversity-promoting objective function for neural conversation models. In: NAACL, pp. 110–119 (2016)
  22. Li, L., Chen, Y.C., Cheng, Y., Gan, Z., Yu, L., Liu, J.: HERO: hierarchical encoder for video+language omni-representation pre-training. In: ACL (2020)
  23. Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., Chang, K.W.: VisualBERT: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)
  24. Li, S., Deng, W., Du, J.: Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In: CVPR, pp. 2852–2861 (2017)
  25. Li, W., et al.: UNIMO: towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409 (2020)
  26. Li, X., et al.: Oscar: object-semantics aligned pre-training for vision-language tasks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 121–137. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_8
  27. Li, Y., Su, H., Shen, X., Li, W., Cao, Z., Niu, S.: DailyDialog: a manually labelled multi-turn dialogue dataset. In: ACL (2017)
  28. Lin, Z., Madotto, A., Shin, J., Xu, P., Fung, P.: MoEL: mixture of empathetic listeners. In: EMNLP, pp. 121–132 (2019)
  29. Lin, Z., et al.: CAiRE: an end-to-end empathetic chatbot. In: AAAI, vol. 34, pp. 13622–13623 (2020)
  30. Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: ICCV (2015)
  31. Logeswaran, L., Lee, H., Bengio, S.: Content preserving text generation with attribute controls. In: NeurIPS, pp. 5108–5118 (2018)
  32. Meng, Y., et al.: OpenViDial: a large-scale, open-domain dialogue dataset with visual contexts. arXiv preprint arXiv:2012.15015 (2020)
  33. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NeurIPS (2013)
  34. Mou, L., Song, Y., Yan, R., Li, G., Zhang, L., Jin, Z.: Sequence to backward and forward sequences: a content-introducing approach to generative short-text conversation. In: COLING, pp. 3349–3358 (2016)
  35. Nagel, S.: CC-News. commoncrawl.org/2016/10/newsdataset-available (2016)
  36. Papineni, K., Roukos, S., Ward, T., Zhu, W.: BLEU: a method for automatic evaluation of machine translation. In: ACL, pp. 311–318 (2002)
  37. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: EMNLP, pp. 1532–1543 (2014)
  38. Polzin, T.S., Waibel, A.: Emotion-sensitive human-computer interfaces. In: ITRW (2000)
  39. Poria, S., Hazarika, D., Majumder, N., Naik, G., Cambria, E., Mihalcea, R.: MELD: a multimodal multi-party dataset for emotion recognition in conversations. In: ACL (2018)
  40. Qin, L., et al.: Conversing by reading: contentful neural conversation with on-demand machine reading. In: ACL, pp. 5427–5436 (2019)
  41. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training (2018)
  42. Rashkin, H., Smith, E.M., Li, M., Boureau, Y.L.: Towards empathetic open-domain conversation models: a new benchmark and dataset. In: ACL (2019)
  43. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NeurIPS (2015)
  44. Serban, I.V., et al.: A hierarchical latent variable encoder-decoder model for generating dialogues. In: AAAI, pp. 3295–3301 (2017)
  45. Shang, L., Lu, Z., Li, H.: Neural responding machine for short-text conversation. In: ACL, pp. 1577–1586 (2015)
  46. Shuster, K., Humeau, S., Bordes, A., Weston, J.: Image-Chat: engaging grounded conversations. In: ACL, pp. 2414–2429 (2020)
  47. Skowron, M.: Affect listeners: acquisition of affective states by means of conversational systems. In: Esposito, A., Campbell, N., Vogel, C., Hussain, A., Nijholt, A. (eds.) Development of Multimodal Interfaces: Active Listening and Synchrony. LNCS, vol. 5967, pp. 169–181. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-12397-9_14
  48. Song, Y., Liu, Z., Bi, W., Yan, R., Zhang, M.: Learning to customize model structures for few-shot dialogue generation tasks. In: ACL, pp. 5832–5841 (2020)
  49. Song, Y., Tian, Z., Zhao, D., Zhang, M., Yan, R.: Diversifying neural conversation model with maximal marginal relevance. In: IJCNLP, pp. 169–174 (2017)
  50. Song, Z., Zheng, X., Liu, L., Xu, M., Huang, X.J.: Generating responses with a specific emotion in dialog. In: ACL, pp. 3685–3695 (2019)
  51. Su, W., et al.: VL-BERT: pre-training of generic visual-linguistic representations. In: ICLR (2019)
  52. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: NeurIPS, pp. 3104–3112 (2014)
  53. Tian, Z., et al.: Response-anticipated memory for on-demand knowledge integration in response generation. In: ACL, pp. 650–659 (2020)
  54. Tian, Z., Bi, W., Zhang, Z., Lee, D., Song, Y., Zhang, N.L.: Learning from my friends: few-shot personalized conversation systems via social networks. In: AAAI, vol. 35, pp. 13907–13915 (2021)
  55. Trinh, T.H., Le, Q.V.: A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847 (2018)
  56. Vaswani, A., et al.: Attention is all you need. In: NeurIPS, pp. 5998–6008 (2017)
  57. Vedantam, R., Lawrence Zitnick, C., Parikh, D.: CIDEr: consensus-based image description evaluation. In: CVPR, pp. 4566–4575 (2015)
  58. Xia, Q., et al.: XGPT: cross-modal generative pre-training for image captioning. In: Wang, L., Feng, Y., Hong, Yu., He, R. (eds.) NLPCC 2021. LNCS (LNAI), vol. 13028, pp. 786–797. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-88480-2_63
  59. Yan, R., Song, Y., Wu, H.: Learning to respond with deep neural networks for retrieval-based human-computer conversation system. In: SIGIR, pp. 55–64 (2016)
  60. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., Le, Q.V.: XLNet: generalized autoregressive pretraining for language understanding (2019)
  61. Yu, C., Zhang, H., Song, Y., Ng, W.: CoCoLM: complex commonsense enhanced language model. arXiv preprint arXiv:2012.15643 (2020)
  62. Zadeh, A., Zellers, R., Pincus, E., Morency, L.P.: Multimodal sentiment intensity analysis in videos: facial gestures and verbal messages. IEEE Intell. Syst. 31, 82–88 (2016)
  63. Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D., Weston, J.: Personalizing dialogue agents: I have a dog, do you have pets too? In: ACL, pp. 2204–2213 (2018)
  64. Zhang, Y., et al.: CelebA-Spoof: large-scale face anti-spoofing dataset with rich annotations. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12357, pp. 70–85. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58610-2_5
  65. Zhang, Y., et al.: Multimodal style transfer via graph cuts. In: ICCV, pp. 5943–5951 (2019)
  66. Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Learning social relation traits from face images. In: ICCV (2015)
  67. Zhong, P., Zhang, C., Wang, H., Liu, Y., Miao, C.: Towards persona-based empathetic conversational models. In: EMNLP, pp. 6556–6566 (2020)
  68. Zhou, H., Huang, M., Zhang, T., Zhu, X., Liu, B.: Emotional chatting machine: emotional conversation generation with internal and external memory. In: AAAI, pp. 730–738 (2018)
  69. Zhu, Y., et al.: Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In: ICCV, pp. 19–27 (2015)

Acknowledgement

Research on this paper was supported by the Hong Kong Research Grants Council under grant No. 16204920 and by the National Natural Science Foundation of China under Grants No. 62025208 and No. 62106275.

Author information

Corresponding author

Correspondence to Dongsheng Li.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Tian, Z. et al. (2022). Emotion-Aware Multimodal Pre-training for Image-Grounded Emotional Response Generation. In: Bhattacharya, A., et al. Database Systems for Advanced Applications. DASFAA 2022. Lecture Notes in Computer Science, vol 13247. Springer, Cham. https://doi.org/10.1007/978-3-031-00129-1_1

  • DOI: https://doi.org/10.1007/978-3-031-00129-1_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-00128-4

  • Online ISBN: 978-3-031-00129-1

  • eBook Packages: Computer Science, Computer Science (R0)
