Abstract
During a conversation, it is critical for the participants to establish what they both agree on, also known as the common ground. Grounding entails confirming that the listener has understood what the speaker has said, and it depends on several factors. It can be supported by conditioning dialog models on various features such as aspects, sentiments, images, and unstructured knowledge documents. The key innovation of this work lies in our novel multimodal knowledge-grounded context-aware transformer model, which enables a seamless fusion of textual and visual information. We introduce an effective technique for generating reviews conditioned on the user's aspect and sentiment (i.e., aspect-level sentiment-controllable reviews), which serve as relevant external knowledge for the dialog system. Our work highlights the importance of incorporating review expertise in knowledge-based multimodal dialog generation. We utilize the Knowledge Grounded Multi-Modal Dialog (KGMMD) dataset, which includes dialog utterances accompanied by images, aspects, sentiments, and unstructured knowledge in the form of several long reviews for the different hotels mentioned in the dataset. The overall framework consists of a dialog encoder, a review generator, and a response decoder, which complement one another: generating appropriate reviews ultimately assists in generating an adequate response. The proposed model outperforms the baseline models for aspect-level sentiment-controlled knowledge-based multimodal response generation, with significant increases in F1-score (13.3%) and BLEU-4 (5.3%) on the KGMMD dataset.
Availability of Data and Materials
The KGMMD dataset analysed in this work is included in the published article by [8] and can be downloaded from the link: https://github.com/deekshaVarshney/KGMMD.
Notes
This is strictly for notation. External knowledge/images may not be present in the dataset for every utterance. In such cases, we use a null vector as an input.
References
Le, H, Hoi, S, Sahoo, D, Chen, N (2019) End-to-end multimodal dialog systems with hierarchical multimodal attention on video features. In: DSTC7 at AAAI2019 Workshop
Saha, A, Khapra, M, Sankaranarayanan, K (2018) Towards building large scale multimodal domain-aware conversation systems. In: Proceedings of the AAAI conference on artificial intelligence, vol 32
Chauhan, H, Firdaus, M, Ekbal, A, Bhattacharyya, P (2019) Ordinal and attribute aware response generation in a multimodal dialogue system. In: Proceedings of the 57th annual meeting of the association for computational linguistics, pp 5437–5447
Das, A, Kottur, S, Gupta, K, Singh, A, Yadav, D, Moura, JM, Parikh, D, Batra, D (2017) Visual dialog. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 326–335
Le, H, Sahoo, D, Chen, NF, Hoi, SC (2019) Multimodal transformer networks for end-to-end video-grounded dialogue systems. arXiv:1907.01166
Firdaus M, Thakur N, Ekbal A (2021) Aspect-aware response generation for multimodal dialogue system. ACM Transactions on Intelligent Systems and Technology (TIST) 12(2):1–33
Firdaus, M, Chauhan, H, Ekbal, A, Bhattacharyya, P (2020) Emosen: Generating sentiment and emotion controlled responses in a multimodal dialogue system. IEEE Transactions on Affective Computing
Varshney, D, Singh, A, Ekbal, A (2021) Knowledge grounded multimodal dialog generation in task-oriented settings. In: PACLIC
Shang, L, Lu, Z, Li, H (2015) Neural responding machine for short-text conversation. arXiv:1503.02364
Vinyals, O, Le, Q (2015) A neural conversational model. arXiv:1506.05869
Sordoni, A, Bengio, Y, Vahabi, H, Lioma, C, Grue Simonsen, J, Nie, J-Y (2015) A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In: Proceedings of the 24th ACM international on conference on information and knowledge management, pp 553–562
Serban, I, Sordoni, A, Bengio, Y, Courville, A, Pineau, J (2016) Building end-to-end dialogue systems using generative hierarchical neural network models. In: Proceedings of the AAAI conference on artificial intelligence, vol 30
Serban, I, Sordoni, A, Lowe, R, Charlin, L, Pineau, J, Courville, A, Bengio, Y (2017) A hierarchical latent variable encoder-decoder model for generating dialogues. In: Proceedings of the AAAI conference on artificial intelligence, vol 31
Xu, H, Peng, H, Xie, H, Cambria, E, Zhou, L, Zheng, W (2019) End-to-end latent-variable task-oriented dialogue system with exact log-likelihood optimization. World Wide Web, pp 1–14
Golchha, H, Firdaus, M, Ekbal, A, Bhattacharyya, P (2019) Courteously yours: Inducing courteous behavior in customer care responses using reinforced pointer generator network. In: Proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: human language technologies, vol 1 (Long and Short Papers):pp 851–860
Yang M, Tao J, Chao L, Li H, Zhang D, Che H, Gao T, Liu B (2015) User behavior fusion in dialog management with multi-modal history cues. Multimedia Tools and Applications 74(22):10025–10051
Wang Y, Huang J, He T, Tu X (2020) Dialogue intent classification with character-cnn-bgru networks. Multimedia Tools and Applications 79(7):4553–4572
Saha T, Gupta D, Saha S, Bhattacharyya P (2021) A hierarchical approach for efficient multi-intent dialogue policy learning. Multimedia Tools and Applications 80(28):35025–35050
De Vries, H, Strub, F, Chandar, S, Pietquin, O, Larochelle, H, Courville, A (2017) Guesswhat?! visual object discovery through multi-modal dialogue. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5503–5512
Gan, Z, Cheng, Y, Kholy, AE, Li, L, Liu, J, Gao, J (2019) Multi-step reasoning via recurrent dual attention for visual dialog. arXiv:1902.00579
Mostafazadeh, N, Brockett, C, Dolan, B, Galley, M, Gao, J, Spithourakis, G.P, Vanderwende, L (2017) Image-grounded conversations: Multimodal context for natural question and response generation. arXiv:1701.08251
Yoshino, K, Hori, C, Perez, J, D’Haro, L.F, Polymenakos, L, Gunasekara, C, Lasecki, W.S, Kummerfeld, JK, Galley, M, Brockett, C, et al (2019) Dialog system technology challenge 7. arXiv:1901.03461
Lin, K-Y, Hsu, C-C, Chen, Y-N, Ku, L-W (2019) Entropy-enhanced multimodal attention model for scene-aware dialogue generation. arXiv:1908.08191
Alamri, H, Hori, C, Marks, TK, Batra, D, Parikh, D (2018) Audio visual scene-aware dialog (avsd) track for natural language generation in dstc7. In: DSTC7 at AAAI2019 Workshop, vol 2
Agarwal, S, Dusek, O, Konstas, I, Rieser, V (2018) Improving context modelling in multimodal dialogue generation. In: 11th International conference of natural language generation 2018, pp 129–134. Association for Computational Linguistics
Agarwal, S, Dušek, O, Konstas, I, Rieser, V (2018) A knowledge-grounded multimodal search-based conversational agent. In: Proceedings of the 2018 EMNLP workshop SCAI: The 2nd international workshop on search-oriented conversational AI, pp 59–66
Liao, L, Ma, Y, He, X, Hong, R, Chua, T-s (2018) Knowledge-aware multimodal dialogue systems. In: Proceedings of the 26th ACM international conference on multimedia, pp 801–809
Cui, C, Wang, W, Song, X, Huang, M, Xu, X-S, Nie, L (2019) User attention-guided multimodal dialog systems. In: Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval, pp 445–454
Tian Z, Xie Z, Lin F, Song Y (2023) A multi-view meta-learning approach for multi-modal response generation. Proceedings of the ACM Web Conference 2023:1938–1947
Lee, Y-J, Ko, B, Kim, H-G, Choi, H-J (2022) Dialogcc: Large-scale multi-modal dialogue dataset. arXiv:2212.04119
Zang, X, Liu, L, Wang, M, Song, Y, Zhang, H, Chen, J (2021) Photochat: A human-human dialogue dataset with photo sharing behavior for joint image-text modeling. arXiv:2108.01453
Zhou, J, Tian, J, Wang, R, Wu, Y, Yan, M, He, L, Huang, X (2023) Multi-modal multi-hop interaction network for dialogue response generation. Expert Systems with Applications, 120267
Firdaus, M, Thakur, N, Ekbal, A (2020) Multidm-gcn: Aspect-guided response generation in multi-domain multi-modal dialogue system using graph convolution network. In: Proceedings of the 2020 conference on empirical methods in natural language processing: findings, pp 2318–2328
Chen H, Lin Y, Qi F, Hu J, Li P, Zhou J, Sun M (2021) Aspect-level sentiment-controllable review generation with mutual learning framework. Proceedings of the AAAI conference on artificial intelligence 35:12639–12647
Kong, X, Li, B, Neubig, G, Hovy, E, Yang, Y (2019) An adversarial approach to high-quality, sentiment-controlled neural dialogue generation. arXiv:1901.07129
Zhang, B, Wang, J, Ma, H, Xu, B, Lin, H (2023) Zrigf: An innovative multimodal framework for zero-resource image-grounded dialogue generation. arXiv:2308.00400
Firdaus, M, Madasu, A, Ekbal, A (2023) A unified framework for slot based response generation in a multimodal dialogue system. arXiv:2305.17433
Raghu, D, Gupta, N, Mausam (2019) Disentangling Language and Knowledge in Task-Oriented Dialogs. In: Proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: human language technologies, vol 1 (Long and Short Papers):pp 1239–1255. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1126. https://www.aclweb.org/anthology/N19-1126
Reddy, RG, Contractor, D, Raghu, D, Joshi, S (2019) Multi-level memory for task oriented dialogs. In: Proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: Human Language Technologies, vol 1 (Long and Short Papers):pp 3744–3754
Chen, X, Xu, J, Xu, B (2019) A working memory model for task-oriented dialog response generation. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2687–2693. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1258, https://www.aclweb.org/anthology/P19-1258
Wang, J, Liu, J, Bi, W, Liu, X, He, K, Xu, R, Yang, M (2020) Dual dynamic memory network for end-to-end multi-turn task-oriented dialog systems. In: Proceedings of the 28th international conference on computational linguistics, pp 4100–4110. International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.362. https://www.aclweb.org/anthology/2020.coling-main.362
Madotto, A, Wu, C-S, Fung, P (2018) Mem2seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. arXiv:1804.08217
Wu, C-S, Socher, R, Xiong, C (2019) Global-to-local memory pointer networks for task-oriented dialogue. arXiv:1901.04713
Meng, F, Tu, Z, Cheng, Y, Wu, H, Zhai, J, Yang, Y, Wang, D (2018) Neural machine translation with key-value memory-augmented attention. In: Proceedings of the 27th international joint conference on artificial intelligence, pp 2574–2580
Gao, S, Takanobu, R, Peng, W, Liu, Q, Huang, M (2021) Hyknow: End-to-end task-oriented dialog modeling with hybrid knowledge management. arXiv:2105.06041
Xu, Y, Ishii, E, Cahyawijaya, S, Liu, Z, Winata, G.I, Madotto, A, Su, D, Fung, P (2021) Retrieval-free knowledge-grounded dialogue response generation with adapters. arXiv:2105.06232
Zhang, W, Chen, J, Wu, H, Wan, S, Li, G (2021) A knowledge-grounded dialog system based on pre-trained language models. arXiv:2106.14444
Vaswani, A, Shazeer, N, Parmar, N, Uszkoreit, J, Jones, L, Gomez, A.N, Kaiser, Ł, Polosukhin, I (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008
Simonyan, K, Zisserman, A (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556
Maas, A, Daly, RE, Pham, PT, Huang, D, Ng, AY, Potts, C (2011) Learning word vectors for sentiment analysis. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pp 142–150
Devlin, J, Chang, M-W, Lee, K, Toutanova, K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: human language technologies, vol 1 (Long and Short Papers):pp 4171–4186. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423. https://www.aclweb.org/anthology/N19-1423
Firdaus, M, Thangavelu, N, Ekbal, A, Bhattacharyya, P (2020) Persona aware response generation with emotions. In: 2020 International joint conference on neural networks (IJCNN):pp 1–8. IEEE
Fleiss JL (1971) Measuring nominal scale agreement among many raters. Psychological bulletin 76(5):378
Zang, H, Wan, X (2017) Towards automatic generation of product reviews from aspect-sentiment scores. In: Proceedings of the 10th international conference on natural language generation, pp 168–177
Serban, IV, Sordoni, A, Bengio, Y, Courville, A, Pineau, J (2015) Hierarchical neural network generative models for movie dialogues. arXiv:1507.04808
Li, Z, Niu, C, Meng, F, Feng, Y, Li, Q, Zhou, J (2019) Incremental transformer with deliberation decoder for document grounded conversations. arXiv:1907.08854
Kingma, D.P, Ba, J (2014) Adam: A method for stochastic optimization. arXiv:1412.6980
Papineni, K, Roukos, S, Ward, T, Zhu, W-J (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting on association for computational linguistics, pp 311–318. Association for Computational Linguistics
Liu, C-W, Lowe, R, Serban, I, Noseworthy, M, Charlin, L, Pineau, J (2016) How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp 2122–2132. Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1230. https://www.aclweb.org/anthology/D16-1230
Acknowledgements
The authors gratefully acknowledge the support from the project "Sevak - An Intelligent Indian Language Chatbot", sponsored by SERB-Imprint 2, Government of India.
Ethics declarations
Ethical standard
Our research is entirely based on publicly available data. We followed the data utilisation policies and did not violate any copyright.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Varshney, D., Singh, A. & Ekbal, A. Aspect-level sentiment-controlled knowledge grounded multimodal dialog generation using generative models for reviews. Multimed Tools Appl 83, 29197–29219 (2024). https://doi.org/10.1007/s11042-023-16720-z