Speaker-Independent Multimodal Sentiment Analysis for Big Data



In this chapter, we propose a contextual multimodal sentiment analysis framework that outperforms the state of the art. We evaluate the framework in both speaker-dependent and speaker-independent settings and examine how well the proposed method generalizes. The chapter also discusses an important design choice for any multimodal information processing system: the type of information fusion technique used to combine the multimodal data.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. School of Computer Science and Engineering, NTU, Singapore, Singapore
  2. School of Natural Sciences, University of Stirling, Stirling, UK
