Abstract
Multi-modal sentiment and emotion analysis has emerged as a prominent field at the intersection of natural language processing, deep learning, machine learning, computer vision, and speech processing. A sentiment and emotion prediction model identifies the attitude of a speaker or writer towards a discussion, debate, event, document, or topic. This attitude can be expressed in different ways, such as the words spoken, the energy and tone with which the words are delivered, and the accompanying facial expressions and gestures. Moreover, related and similar tasks generally depend on each other and are predicted better when solved through a joint framework. In this paper, we present a multi-task gated contextual cross-modal attention framework that considers all three modalities (viz. text, acoustic, and visual) and multiple utterances for joint sentiment and emotion prediction. We evaluate our proposed approach on the CMU-MOSEI dataset for sentiment and emotion prediction. The evaluation results show that our proposed approach captures the correlation among the three modalities and improves over the previous state-of-the-art models.
The first two authors contributed equally.
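To make the flavour of the mechanism concrete, the following is a minimal PyTorch sketch of a gated cross-modal attention block for one pair of modalities. It is an illustration under our own assumptions (utterance-level features of a shared dimension, scaled dot-product attention, a sigmoid gate), not the authors' implementation; all names and shapes here are hypothetical.

```python
# Illustrative sketch (not the paper's code): gated cross-modal attention
# for one modality pair, with features of shape (batch, utterances, dim).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedCrossModalAttention(nn.Module):
    """Attend from modality A over modality B, then gate the attended
    context against the original A features."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Scaled dot-product scores: each utterance in A attends over
        # the utterances of B -> (batch, len_a, len_b).
        scores = torch.matmul(a, b.transpose(1, 2)) / (a.size(-1) ** 0.5)
        attended = torch.matmul(F.softmax(scores, dim=-1), b)
        # The gate decides, per feature, how much cross-modal context
        # to mix into the original representation.
        g = torch.sigmoid(self.gate(torch.cat([a, attended], dim=-1)))
        return g * attended + (1.0 - g) * a

# Usage with hypothetical text/acoustic features:
text = torch.randn(8, 20, 128)      # (batch, utterances, dim)
acoustic = torch.randn(8, 20, 128)
fused = GatedCrossModalAttention(128)(text, acoustic)
print(fused.shape)  # torch.Size([8, 20, 128])
```

In a multi-task setting, such pairwise fused representations would feed shared layers with separate sentiment and emotion heads; the sketch shows only the attention-and-gate step.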
Acknowledgement
Asif Ekbal acknowledges the Young Faculty Research Fellowship (YFRF), supported by the Visvesvaraya Ph.D. Scheme of MeitY, Government of India. The research reported here is also partially supported by "Skymap Global India Private Limited".
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Sangwan, S., Chauhan, D.S., Akhtar, M.S., Ekbal, A., Bhattacharyya, P. (2019). Multi-task Gated Contextual Cross-Modal Attention Framework for Sentiment and Emotion Analysis. In: Gedeon, T., Wong, K., Lee, M. (eds) Neural Information Processing. ICONIP 2019. Communications in Computer and Information Science, vol 1142. Springer, Cham. https://doi.org/10.1007/978-3-030-36808-1_72
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-36807-4
Online ISBN: 978-3-030-36808-1