
Aspect-level sentiment-controlled knowledge grounded multimodal dialog generation using generative models for reviews

Published in: Multimedia Tools and Applications

Abstract

During a conversation, it is critical for participants to establish what they both agree on, also known as the common ground. Grounding implies recognizing that the listener has understood what the speaker has said, considering several factors. This can be accomplished by basing dialog models on various features such as aspects, sentiments, images, and unstructured knowledge documents. The key innovation lies in our novel multi-modal knowledge-grounded context-aware transformer model, which enables a seamless fusion of textual and visual information. We introduce an effective technique for generating reviews based on the user’s aspect and sentiment (i.e., aspect-level sentiment-controllable reviews), which serve as relevant external knowledge for the dialog system. Our work highlights the importance of incorporating review expertise in knowledge-based multi-modal dialog generation. We utilize the Knowledge Grounded Multi-Modal Dialog (KGMMD) dataset, which includes dialog utterances accompanied by images, aspects, sentiments, and unstructured knowledge in the form of several long hotel reviews for the different hotels mentioned in the dataset. The overall framework consists of a dialog encoder, a review generator, and a response decoder, all of which complement one another: the generated reviews in turn assist in producing an adequate response. The proposed model outperforms the baseline models for aspect-level sentiment-controlled knowledge-based multimodal response generation, with significant increases in F1-score (13.3%) and BLEU-4 (5.3%) on the KGMMD dataset.
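The three-stage framework described above (dialog encoder → review generator → response decoder, with a null vector standing in for missing knowledge or images, as noted in footnote 1) can be sketched as follows. This is a toy illustration only: the mean-pooled embeddings, concatenation-based fusion, and template-based generation below are hypothetical stand-ins for the transformer modules used in the paper, and all function names are our own.

```python
# Minimal sketch of the dialog-encoder / review-generator / response-decoder
# pipeline. Every component here is a deliberately simplified stand-in for
# the paper's learned modules.
import random

DIM = 8  # toy embedding dimensionality


def embed(tokens):
    """Toy embedding: one deterministic pseudo-random vector per token."""
    vecs = []
    for tok in tokens:
        rng = random.Random(sum(ord(c) for c in tok))  # deterministic seed
        vecs.append([rng.uniform(-1.0, 1.0) for _ in range(DIM)])
    return vecs


def mean_pool(vecs):
    """Average a list of vectors; an empty list yields a null vector,
    mirroring the paper's handling of missing knowledge/images."""
    if not vecs:
        return [0.0] * DIM
    return [sum(col) / len(col) for col in zip(*vecs)]


def dialog_encoder(utterances, image_features=None):
    """Encode the dialog context; concatenation stands in for the
    attention-based fusion of textual and visual information."""
    tokens = [tok for u in utterances for tok in u.split()]
    text_vec = mean_pool(embed(tokens))
    img_vec = image_features if image_features is not None else [0.0] * DIM
    return text_vec + img_vec


def review_generator(context_vec, aspect, sentiment):
    """Stand-in for aspect-level sentiment-controlled review generation."""
    return f"[{sentiment} review about {aspect}]"


def response_decoder(context_vec, review):
    """Stand-in for the decoder that grounds the response in the review."""
    return f"response grounded in the generated review: {review}"
```

Under these assumptions, a single turn would flow through the sketch as: `ctx = dialog_encoder(["I need a hotel near the beach"])`, then `review = review_generator(ctx, "location", "positive")`, then `response_decoder(ctx, review)`; the real model replaces each stage with trained transformer layers.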


Availability of Data and Materials

The KGMMD dataset analysed in this work was introduced in the published article by [8] and can be downloaded from: https://github.com/deekshaVarshney/KGMMD.

Notes

  1. This is strictly for notation. External knowledge/images may not be present in the dataset for every utterance. In such cases, we use a null vector as an input.

  2. https://github.com/facebookresearch/ParlAI/blob/master/parlai/core/metrics.py.

  3. https://github.com/Maluuba/nlg-eval.

References

  1. Le, H, Hoi, S, Sahoo, D, Chen, N (2019) End-to-end multimodal dialog systems with hierarchical multimodal attention on video features. In: DSTC7 at AAAI2019 Workshop

  2. Saha, A, Khapra, M, Sankaranarayanan, K (2018) Towards building large scale multimodal domain-aware conversation systems. In: Proceedings of the AAAI conference on artificial intelligence, vol 32

  3. Chauhan, H, Firdaus, M, Ekbal, A, Bhattacharyya, P (2019) Ordinal and attribute aware response generation in a multimodal dialogue system. In: Proceedings of the 57th annual meeting of the association for computational linguistics, pp 5437–5447

  4. Das, A, Kottur, S, Gupta, K, Singh, A, Yadav, D, Moura, JM, Parikh, D, Batra, D (2017) Visual dialog. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 326–335

  5. Le, H, Sahoo, D, Chen, NF, Hoi, SC (2019) Multimodal transformer networks for end-to-end video-grounded dialogue systems. arXiv:1907.01166

  6. Firdaus M, Thakur N, Ekbal A (2021) Aspect-aware response generation for multimodal dialogue system. ACM Transactions on Intelligent Systems and Technology (TIST) 12(2):1–33

  7. Firdaus, M, Chauhan, H, Ekbal, A, Bhattacharyya, P (2020) Emosen: Generating sentiment and emotion controlled responses in a multimodal dialogue system. IEEE Transactions on Affective Computing

  8. Varshney, D, Singh, A, Ekbal, A (2021) Knowledge grounded multimodal dialog generation in task-oriented settings. In: PACLIC

  9. Shang, L, Lu, Z, Li, H (2015) Neural responding machine for short-text conversation. arXiv:1503.02364

  10. Vinyals, O, Le, Q (2015) A neural conversational model. arXiv:1506.05869

  11. Sordoni, A, Bengio, Y, Vahabi, H, Lioma, C, Grue Simonsen, J, Nie, J-Y (2015) A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In: Proceedings of the 24th ACM international on conference on information and knowledge management, pp 553–562

  12. Serban, I, Sordoni, A, Bengio, Y, Courville, A, Pineau, J (2016) Building end-to-end dialogue systems using generative hierarchical neural network models. In: Proceedings of the AAAI conference on artificial intelligence, vol 30

  13. Serban, I, Sordoni, A, Lowe, R, Charlin, L, Pineau, J, Courville, A, Bengio, Y (2017) A hierarchical latent variable encoder-decoder model for generating dialogues. In: Proceedings of the AAAI conference on artificial intelligence, vol 31

  14. Xu, H, Peng, H, Xie, H, Cambria, E, Zhou, L, Zheng, W (2019) End-to-end latent-variable task-oriented dialogue system with exact log-likelihood optimization. World Wide Web, pp 1–14

  15. Golchha, H, Firdaus, M, Ekbal, A, Bhattacharyya, P (2019) Courteously yours: Inducing courteous behavior in customer care responses using reinforced pointer generator network. In: Proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: human language technologies, vol 1 (Long and Short Papers), pp 851–860

  16. Yang M, Tao J, Chao L, Li H, Zhang D, Che H, Gao T, Liu B (2015) User behavior fusion in dialog management with multi-modal history cues. Multimedia Tools and Applications 74(22):10025–10051

  17. Wang Y, Huang J, He T, Tu X (2020) Dialogue intent classification with character-cnn-bgru networks. Multimedia Tools and Applications 79(7):4553–4572

  18. Saha T, Gupta D, Saha S, Bhattacharyya P (2021) A hierarchical approach for efficient multi-intent dialogue policy learning. Multimedia Tools and Applications 80(28):35025–35050

  19. De Vries, H, Strub, F, Chandar, S, Pietquin, O, Larochelle, H, Courville, A (2017) Guesswhat?! visual object discovery through multi-modal dialogue. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5503–5512

  20. Gan, Z, Cheng, Y, Kholy, AE, Li, L, Liu, J, Gao, J (2019) Multi-step reasoning via recurrent dual attention for visual dialog. arXiv:1902.00579

  21. Mostafazadeh, N, Brockett, C, Dolan, B, Galley, M, Gao, J, Spithourakis, G.P, Vanderwende, L (2017) Image-grounded conversations: Multimodal context for natural question and response generation. arXiv:1701.08251

  22. Yoshino, K, Hori, C, Perez, J, D’Haro, L.F, Polymenakos, L, Gunasekara, C, Lasecki, W.S, Kummerfeld, JK, Galley, M, Brockett, C, et al (2019) Dialog system technology challenge 7. arXiv:1901.03461

  23. Lin, K-Y, Hsu, C-C, Chen, Y-N, Ku, L-W (2019) Entropy-enhanced multimodal attention model for scene-aware dialogue generation. arXiv:1908.08191

  24. Alamri, H, Hori, C, Marks, TK, Batra, D, Parikh, D (2018) Audio visual scene-aware dialog (avsd) track for natural language generation in dstc7. In: DSTC7 at AAAI2019 Workshop, vol 2

  25. Agarwal, S, Dusek, O, Konstas, I, Rieser, V (2018) Improving context modelling in multimodal dialogue generation. In: 11th International conference of natural language generation 2018, pp 129–134. Association for Computational Linguistics

  26. Agarwal, S, Dušek, O, Konstas, I, Rieser, V (2018) A knowledge-grounded multimodal search-based conversational agent. In: Proceedings of the 2018 EMNLP workshop SCAI: The 2nd international workshop on search-oriented conversational AI, pp 59–66

  27. Liao, L, Ma, Y, He, X, Hong, R, Chua, T-s (2018) Knowledge-aware multimodal dialogue systems. In: Proceedings of the 26th ACM international conference on multimedia, pp 801–809

  28. Cui, C, Wang, W, Song, X, Huang, M, Xu, X-S, Nie, L (2019) User attention-guided multimodal dialog systems. In: Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval, pp 445–454

  29. Tian Z, Xie Z, Lin F, Song Y (2023) A multi-view meta-learning approach for multi-modal response generation. Proceedings of the ACM Web Conference 2023:1938–1947

  30. Lee, Y-J, Ko, B, Kim, H-G, Choi, H-J (2022) Dialogcc: Large-scale multi-modal dialogue dataset. arXiv:2212.04119

  31. Zang, X, Liu, L, Wang, M, Song, Y, Zhang, H, Chen, J (2021) Photochat: A human-human dialogue dataset with photo sharing behavior for joint image-text modeling. arXiv:2108.01453

  32. Zhou, J, Tian, J, Wang, R, Wu, Y, Yan, M, He, L, Huang, X (2023) Multi-modal multi-hop interaction network for dialogue response generation. Expert Systems with Applications, 120267

  33. Firdaus, M, Thakur, N, Ekbal, A (2020) Multidm-gcn: Aspect-guided response generation in multi-domain multi-modal dialogue system using graph convolution network. In: Proceedings of the 2020 conference on empirical methods in natural language processing: findings, pp 2318–2328

  34. Chen H, Lin Y, Qi F, Hu J, Li P, Zhou J, Sun M (2021) Aspect-level sentiment-controllable review generation with mutual learning framework. Proceedings of the AAAI conference on artificial intelligence 35:12639–12647

  35. Kong, X, Li, B, Neubig, G, Hovy, E, Yang, Y (2019) An adversarial approach to high-quality, sentiment-controlled neural dialogue generation. arXiv:1901.07129

  36. Zhang, B, Wang, J, Ma, H, Xu, B, Lin, H (2023) Zrigf: An innovative multimodal framework for zero-resource image-grounded dialogue generation. arXiv:2308.00400

  37. Firdaus, M, Madasu, A, Ekbal, A (2023) A unified framework for slot based response generation in a multimodal dialogue system. arXiv:2305.17433

  38. Raghu, D, Gupta, N, Mausam (2019) Disentangling language and knowledge in task-oriented dialogs. In: Proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: human language technologies, vol 1 (Long and Short Papers), pp 1239–1255. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1126. https://www.aclweb.org/anthology/N19-1126

  39. Reddy, RG, Contractor, D, Raghu, D, Joshi, S (2019) Multi-level memory for task oriented dialogs. In: Proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: human language technologies, vol 1 (Long and Short Papers), pp 3744–3754

  40. Chen, X, Xu, J, Xu, B (2019) A working memory model for task-oriented dialog response generation. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2687–2693. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1258, https://www.aclweb.org/anthology/P19-1258

  41. Wang, J, Liu, J, Bi, W, Liu, X, He, K, Xu, R, Yang, M (2020) Dual dynamic memory network for end-to-end multi-turn task-oriented dialog systems. In: Proceedings of the 28th international conference on computational linguistics, pp 4100–4110. International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.362. https://www.aclweb.org/anthology/2020.coling-main.362

  42. Madotto, A, Wu, C-S, Fung, P (2018) Mem2seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. arXiv:1804.08217

  43. Wu, C-S, Socher, R, Xiong, C (2019) Global-to-local memory pointer networks for task-oriented dialogue. arXiv:1901.04713

  44. Meng, F, Tu, Z, Cheng, Y, Wu, H, Zhai, J, Yang, Y, Wang, D (2018) Neural machine translation with key-value memory-augmented attention. In: Proceedings of the 27th international joint conference on artificial intelligence, pp 2574–2580

  45. Gao, S, Takanobu, R, Peng, W, Liu, Q, Huang, M (2021) Hyknow: End-to-end task-oriented dialog modeling with hybrid knowledge management. arXiv:2105.06041

  46. Xu, Y, Ishii, E, Cahyawijaya, S, Liu, Z, Winata, G.I, Madotto, A, Su, D, Fung, P (2021) Retrieval-free knowledge-grounded dialogue response generation with adapters. arXiv:2105.06232

  47. Zhang, W, Chen, J, Wu, H, Wan, S, Li, G (2021) A knowledge-grounded dialog system based on pre-trained language models. arXiv:2106.14444

  48. Vaswani, A, Shazeer, N, Parmar, N, Uszkoreit, J, Jones, L, Gomez, A.N, Kaiser, Ł, Polosukhin, I (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008

  49. Simonyan, K, Zisserman, A (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556

  50. Maas, A, Daly, RE, Pham, PT, Huang, D, Ng, AY, Potts, C (2011) Learning word vectors for sentiment analysis. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pp 142–150

  51. Devlin, J, Chang, M-W, Lee, K, Toutanova, K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: human language technologies, vol 1 (Long and Short Papers), pp 4171–4186. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423. https://www.aclweb.org/anthology/N19-1423

  52. Firdaus, M, Thangavelu, N, Ekbal, A, Bhattacharyya, P (2020) Persona aware response generation with emotions. In: 2020 International joint conference on neural networks (IJCNN), pp 1–8. IEEE

  53. Fleiss JL (1971) Measuring nominal scale agreement among many raters. Psychological bulletin 76(5):378

    Article  Google Scholar 

  54. Zang, H, Wan, X (2017) Towards automatic generation of product reviews from aspect-sentiment scores. In: Proceedings of the 10th international conference on natural language generation, pp 168–177

  55. Serban, IV, Sordoni, A, Bengio, Y, Courville, A, Pineau, J (2015) Hierarchical neural network generative models for movie dialogues. arXiv:1507.04808

  56. Li, Z, Niu, C, Meng, F, Feng, Y, Li, Q, Zhou, J (2019) Incremental transformer with deliberation decoder for document grounded conversations. arXiv:1907.0885

  57. Kingma, D.P, Ba, J (2014) Adam: A method for stochastic optimization. arXiv:1412.6980

  58. Papineni, K, Roukos, S, Ward, T, Zhu, W-J (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting on association for computational linguistics, pp 311–318. Association for Computational Linguistics

  59. Liu, C-W, Lowe, R, Serban, I, Noseworthy, M, Charlin, L, Pineau, J (2016) How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In: Proceedings of the 2016 conference on empirical methods in natural language processing, pp 2122–2132. Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1230. https://www.aclweb.org/anthology/D16-1230

Acknowledgements

The authors gratefully acknowledge the support from the project “Sevak-An Intelligent Indian Language Chatbot”, sponsored by SERB-Imprint 2, Government of India.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Deeksha Varshney.

Ethics declarations

Ethical standard

Our research is based entirely on publicly available data. We followed the data utilisation policies and did not violate any copyright.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Varshney, D., Singh, A. & Ekbal, A. Aspect-level sentiment-controlled knowledge grounded multimodal dialog generation using generative models for reviews. Multimed Tools Appl 83, 29197–29219 (2024). https://doi.org/10.1007/s11042-023-16720-z
