
Scene-Aware Prompt for Multi-modal Dialogue Understanding and Generation

  • Conference paper
  • First Online:
  • In: Natural Language Processing and Chinese Computing (NLPCC 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13552)


Abstract

This paper describes Team LingJing's solutions for NLPCC-2022-Shared-Task-4, Multi-modal Dialogue Understanding and Generation (MDUG). The MDUG task comprises two phases: multi-modal context understanding and response generation. To fully leverage visual information for both scene understanding and dialogue generation, we propose a scene-aware prompt for the MDUG task. Specifically, we adopt a multi-task strategy to jointly model scene- and session-level multi-modal understanding. Visual captions are used to capture scene information, while a fixed-template prompt built from the scene- and session-aware labels further improves dialogue generation. Extensive experimental results show that the proposed method achieves state-of-the-art (SOTA) performance compared with other competitive methods, ranking first in all three subtasks of the MDUG competition.
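To make the fixed-template idea concrete, the sketch below shows one way a scene-aware prompt could be assembled from a visual caption and the predicted scene- and session-level labels. The template wording, label values, and function name are illustrative assumptions, not the authors' exact implementation.

# Minimal sketch of a scene-aware prompt (illustrative assumptions;
# the real template wording and label vocabulary are not given here).
def build_scene_aware_prompt(caption, scene_label, session_label, history):
    """Fill a fixed template with the visual caption and the
    scene-/session-aware labels, then append the dialogue history."""
    context = " ".join(history)
    return (f"Scene: {scene_label}. Session: {session_label}. "
            f"The image shows {caption}. {context} Response:")

# Hypothetical usage with made-up inputs:
prompt = build_scene_aware_prompt(
    caption="two people talking in a kitchen",
    scene_label="indoor",
    session_label="daily chat",
    history=["A: What are you cooking?", "B: Pasta, want some?"],
)
print(prompt)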

Supported by the National Key R&D Program of China (2018YFB1305200) and the National Natural Science Foundation of China (62171183).

B. Li, Y. Weng and Z. Ma contributed equally to this work.


Notes

  1. https://github.com/tylin/coco-caption.
  2. https://github.com/patrick-tssn/NLPCC-2022-Shared-Task-4.
  3. https://huggingface.co/bert-base-uncased.
  4. https://huggingface.co/roberta-large.
  5. https://huggingface.co/google/electra-large-discriminator.
  6. https://pytorch.org.
  7. https://github.com/huggingface/transformers.
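The checkpoints in notes 3-5 are standard HuggingFace Hub models; a generic way to load them with the transformers library from note 7 is shown below. This is an illustrative loading pattern, not the authors' training code.

# Generic loading pattern for the checkpoints listed above
# (illustration only; requires network access to the Hub).
from transformers import AutoModel, AutoTokenizer

for name in ("bert-base-uncased",
             "roberta-large",
             "google/electra-large-discriminator"):
    tokenizer = AutoTokenizer.from_pretrained(name)  # fetched from the Hub
    model = AutoModel.from_pretrained(name)          # PyTorch encoder backbone
    print(name, model.config.hidden_size)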


Author information


Corresponding author

Correspondence to Shutao Li.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Li, B., Weng, Y., Ma, Z., Sun, B., Li, S. (2022). Scene-Aware Prompt for Multi-modal Dialogue Understanding and Generation. In: Lu, W., Huang, S., Hong, Y., Zhou, X. (eds) Natural Language Processing and Chinese Computing. NLPCC 2022. Lecture Notes in Computer Science, vol. 13552. Springer, Cham. https://doi.org/10.1007/978-3-031-17189-5_15


  • DOI: https://doi.org/10.1007/978-3-031-17189-5_15

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-17188-8

  • Online ISBN: 978-3-031-17189-5

  • eBook Packages: Computer Science, Computer Science (R0)
