Abstract
This paper introduces the schemes of Team LingJing's experiments in NLPCC-2022-Shared-Task-4, Multi-modal Dialogue Understanding and Generation (MDUG). The MDUG task can be divided into two phases: multi-modal context understanding and response generation. To fully leverage visual information for both scene understanding and dialogue generation, we propose a scene-aware prompt for the MDUG task. Specifically, we adopt a multi-task strategy to jointly model scene- and session-level multi-modal understanding. Visual captions are used to capture the scene information, while fixed-type templated prompts built from the scene- and session-aware labels are used to further improve dialogue generation performance. Extensive experimental results show that the proposed method achieves state-of-the-art (SOTA) performance compared with other competitive methods, ranking 1st in all three subtasks of the MDUG competition.
Supported by the National Key R&D Program of China (2018YFB1305200) and the National Natural Science Fund of China (62171183).
B. Li, Y. Weng, and Z. Ma contributed equally to this work.
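To make the described pipeline concrete, below is a minimal Python sketch of how a scene-aware templated prompt could be assembled from the understanding-phase outputs and fed to a sequence-to-sequence generator. The template wording, the label values (`scene_change`, `same_session`), the helper `build_scene_aware_prompt`, and the BART backbone are all illustrative assumptions for this sketch, not the paper's exact templates or model configuration.

```python
# Minimal sketch: build a scene-aware prompt and generate a response.
# Template wording, label values, and the BART backbone are assumptions
# for illustration, not the paper's exact configuration.
from transformers import BartForConditionalGeneration, BartTokenizer

def build_scene_aware_prompt(caption, scene_label, session_label, history):
    """Prepend the scene/session labels (predicted in the understanding
    phase) and a visual caption to the dialogue history, following a
    fixed-type template."""
    context = " </s> ".join(history)
    return (f"scene: {scene_label} | session: {session_label} | "
            f"caption: {caption} | dialogue: {context}")

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

prompt = build_scene_aware_prompt(
    caption="two people talking in a kitchen",  # from an image captioner
    scene_label="scene_change",                 # hypothetical subtask-1 label
    session_label="same_session",               # hypothetical subtask-2 label
    history=["Hi, what are you cooking?", "Just some pasta."],
)
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
output_ids = model.generate(**inputs, num_beams=4, max_length=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Flattening the labels and caption into the textual prompt lets a standard text-only generator condition on visual and discourse cues without architectural changes.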
Cite this paper
Li, B., Weng, Y., Ma, Z., Sun, B., Li, S.: Scene-aware prompt for multi-modal dialogue understanding and generation. In: Lu, W., Huang, S., Hong, Y., Zhou, X. (eds.) Natural Language Processing and Chinese Computing. NLPCC 2022. Lecture Notes in Computer Science, vol. 13552. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-17189-5_15