
Multimodal shared features learning for emotion recognition by enhanced sparse local discriminative canonical correlation analysis

Special Issue Paper, Multimedia Systems

Abstract

Multimodal emotion recognition is a challenging research topic that has recently begun to attract the attention of the research community. To better recognize the emotions of video users, multimodal emotion recognition based on audio and video is essential, and its performance depends heavily on finding a good shared feature representation. A good shared representation must satisfy two requirements: (1) it preserves the characteristics of each modality, and (2) it balances the contributions of the different modalities so that the final decision is optimal. In light of this, we propose a novel Enhanced Sparse Local Discriminative Canonical Correlation Analysis (En-SLDCCA) approach to learn the multimodal shared feature representation. The shared representation is learned in two stages. In the first stage, we pretrain a Sparse Auto-Encoder on unimodal video (or audio), so that we obtain the hidden feature representations of video and audio separately. In the second stage, we obtain the correlation coefficients of video and audio using our En-SLDCCA approach and then form the shared feature representation, which fuses the video and audio features using these correlation coefficients. We evaluate our method on the challenging multimodal Enterface'05 database. Experimental results show that our method is superior to unimodal video (or audio) and significantly improves multimodal emotion recognition performance compared with the current state of the art.
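The two-stage pipeline outlined in the abstract can be illustrated with a short, self-contained sketch. The code below is not the authors' implementation: it substitutes a plain NumPy sparse auto-encoder (KL-divergence sparsity penalty) for the first stage and scikit-learn's standard CCA for the paper's En-SLDCCA in the second stage, and all hyperparameters, array shapes, and the correlation-weighted fusion rule are illustrative assumptions rather than values taken from the paper.

```python
# Minimal sketch of the two-stage shared-feature pipeline described above.
# Stage 1: a sparse auto-encoder learns a hidden representation per modality.
# Stage 2: the hidden representations are projected with CCA and fused using
# the per-component correlations (plain CCA stands in for En-SLDCCA).

import numpy as np
from sklearn.cross_decomposition import CCA


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_sparse_autoencoder(X, n_hidden=64, rho=0.05, beta=3.0,
                             lr=0.1, epochs=200, seed=0):
    """Train a one-hidden-layer sparse auto-encoder and return the hidden
    representation of X. X is (n_samples, n_features), scaled to [0, 1]."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    W1 = rng.normal(scale=0.1, size=(n_hidden, d)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(d, n_hidden)); b2 = np.zeros(d)

    for _ in range(epochs):
        # forward pass
        H = sigmoid(X @ W1.T + b1)          # hidden activations, (m, n_hidden)
        X_hat = sigmoid(H @ W2.T + b2)      # reconstruction, (m, d)

        # backward pass: reconstruction error plus KL sparsity penalty
        rho_hat = H.mean(axis=0)
        kl_grad = -rho / rho_hat + (1 - rho) / (1 - rho_hat)
        d_out = (X_hat - X) * X_hat * (1 - X_hat)               # (m, d)
        d_hid = (d_out @ W2 + beta * kl_grad) * H * (1 - H)     # (m, n_hidden)

        # gradient-descent updates
        W2 -= lr * (d_out.T @ H) / m;  b2 -= lr * d_out.mean(axis=0)
        W1 -= lr * (d_hid.T @ X) / m;  b1 -= lr * d_hid.mean(axis=0)

    return sigmoid(X @ W1.T + b1)


def fuse_with_cca(H_video, H_audio, n_components=10):
    """Project both hidden representations with CCA and weight each
    canonical component by its correlation before concatenating."""
    cca = CCA(n_components=n_components)
    Zv, Za = cca.fit_transform(H_video, H_audio)

    # per-component canonical correlations used as (hypothetical) fusion weights
    corr = np.array([np.corrcoef(Zv[:, k], Za[:, k])[0, 1]
                     for k in range(n_components)])
    return np.hstack([Zv * corr, Za * corr])


if __name__ == "__main__":
    # toy stand-ins for frame-level video and utterance-level audio features
    rng = np.random.default_rng(1)
    X_video = rng.random((200, 100))
    X_audio = rng.random((200, 40))

    H_video = train_sparse_autoencoder(X_video)
    H_audio = train_sparse_autoencoder(X_audio)
    shared = fuse_with_cca(H_video, H_audio)
    print(shared.shape)   # (200, 20): shared representation for a classifier
```

In a full system, the fused representation `shared` would be fed to an emotion classifier trained on Enterface'05 features, and the vanilla CCA projection would be replaced by En-SLDCCA's sparse, locally discriminative variant described in the paper.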



Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 61272211 and 61672267) and the General Financial Grant from the China Postdoctoral Science Foundation (No. 2015M570413).

Author information

Correspondence to Qirong Mao.


About this article


Cite this article

Fu, J., Mao, Q., Tu, J. et al. Multimodal shared features learning for emotion recognition by enhanced sparse local discriminative canonical correlation analysis. Multimedia Systems 25, 451–461 (2019). https://doi.org/10.1007/s00530-017-0547-8
