
Continuous affect recognition with weakly supervised learning

  • Ercheng Pei
  • Dongmei Jiang
  • Mitchel Alioscha-Perez
  • Hichem Sahli
Article

Abstract

Recognizing a person’s affective state from audio-visual signals is an essential capability for intelligent interaction. Insufficient training data and unreliable labels of the affective dimensions (e.g., valence and arousal) are two major challenges in continuous affect recognition. In this paper, we propose a weakly supervised learning approach based on a hybrid deep neural network and bidirectional long short-term memory recurrent neural network (DNN-BLSTM). It first maps the audio/visual features into a more discriminative space via the powerful modelling capacity of the DNN, then models the temporal dynamics of affect via the BLSTM. To reduce the negative impact of unreliable labels, we utilize a temporal label (TL) along with a robust loss function (RL) to incorporate weak supervision into the learning process of the DNN-BLSTM model. The proposed method therefore not only has a simpler structure than the deep BLSTM model of He et al. (24), which requires more training data, but is also robust to noisy and unreliable labels. Single-modal and multimodal affect recognition experiments have been carried out on the RECOLA dataset. The single-modal results show that the proposed method with TL and RL obtains remarkable improvements on both arousal and valence in terms of the concordance correlation coefficient (CCC), while the multimodal results show that, with fewer feature streams, our approach obtains results better than or comparable to those of state-of-the-art methods.
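The hybrid architecture described above can be sketched as follows. This is a minimal illustrative PyTorch version, not the authors' exact model: layer sizes, the feature dimension, and the use of a CCC-based loss (a common objective in continuous affect recognition, standing in for the paper's unspecified robust loss) are all assumptions.

```python
import torch
import torch.nn as nn

class DNNBLSTM(nn.Module):
    """Hypothetical DNN-BLSTM sketch: a feed-forward DNN maps each frame's
    audio/visual features into a more discriminative space, then a
    bidirectional LSTM models temporal dynamics and regresses one
    continuous affect dimension (valence or arousal) per frame."""

    def __init__(self, feat_dim=88, dnn_dim=64, lstm_dim=32):
        super().__init__()
        self.dnn = nn.Sequential(
            nn.Linear(feat_dim, dnn_dim), nn.ReLU(),
            nn.Linear(dnn_dim, dnn_dim), nn.ReLU(),
        )
        self.blstm = nn.LSTM(dnn_dim, lstm_dim,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * lstm_dim, 1)  # 2x: forward + backward states

    def forward(self, x):                # x: (batch, time, feat_dim)
        h = self.dnn(x)                  # per-frame discriminative mapping
        h, _ = self.blstm(h)             # temporal modelling, both directions
        return self.out(h).squeeze(-1)   # (batch, time) affect trajectory


def ccc_loss(pred, gold, eps=1e-8):
    """1 - CCC: penalizes disagreement in mean, variance, and correlation
    between the predicted and gold affect trajectories."""
    pm, gm = pred.mean(), gold.mean()
    pv = pred.var(unbiased=False)
    gv = gold.var(unbiased=False)
    cov = ((pred - pm) * (gold - gm)).mean()
    ccc = 2 * cov / (pv + gv + (pm - gm) ** 2 + eps)
    return 1 - ccc
```

Since CCC lies in [-1, 1], the loss lies in [0, 2]; minimizing it directly optimizes the evaluation metric reported in the experiments.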

Keywords

Continuous affect recognition · DNN-BLSTM · Weak supervision

Notes

Acknowledgements

This work is supported by the Shaanxi Provincial International Science and Technology Collaboration Project (grant 2017KW-ZD-14), the China Scholarship Council (CSC) (grant 201706290115), and the VUB Interdisciplinary Research Program through the EMO-App project.

References

  1. Baltrušaitis T, Banda N, Robinson P (2013) Dimensional affect recognition using continuous conditional random fields. In: Proceedings of the 10th IEEE international conference and workshops on automatic face and gesture recognition (FG 2013). IEEE, pp 1–8
  2. Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press, London
  3. Brady K, Gwon Y, Khorrami P, Godoy E, Campbell W, Dagli C, Huang TS (2016) Multi-modal audio, video and physiological sensor learning for continuous emotion prediction. In: Proceedings of the 6th international workshop on audio/visual emotion challenge. ACM, pp 97–104
  4. Chao L, Tao J, Yang M, Li Y, Wen Z (2014) Multi-scale temporal modeling for dimensional emotion recognition in video. In: Proceedings of the 4th international workshop on audio/visual emotion challenge. ACM, pp 11–18
  5. Chao L, Tao J, Yang M, Li Y, Wen Z (2015) Long short-term memory recurrent neural network based multimodal dimensional emotion recognition. In: Proceedings of the 5th international workshop on audio/visual emotion challenge. ACM, pp 65–72
  6. Chen S, Jin Q (2015) Multi-modal dimensional emotion recognition using recurrent neural networks. In: Proceedings of the 5th international workshop on audio/visual emotion challenge. ACM, pp 49–56
  7. Chen S, Jin Q, Zhao J, Wang S (2017) Multimodal multi-task learning for dimensional and continuous emotion recognition. In: Proceedings of the 7th annual workshop on audio/visual emotion challenge. ACM, pp 19–26
  8. Dhall A, Goecke R, Joshi J, Wagner M, Gedeon T (2013) Emotion recognition in the wild challenge 2013. In: Proceedings of the 15th ACM international conference on multimodal interaction. ACM, pp 509–516
  9. Dhall A, Goecke R, Joshi J, Sikka K, Gedeon T (2014) Emotion recognition in the wild challenge 2014: baseline, data and protocol. In: Proceedings of the 16th international conference on multimodal interaction. ACM, pp 461–466
  10. Dhall A, Ramana Murthy O, Goecke R, Joshi J, Gedeon T (2015) Video and image based emotion recognition challenges in the wild: EmotiW 2015. In: Proceedings of the 2015 international conference on multimodal interaction. ACM, pp 423–426
  11. Dhall A, Goecke R, Joshi J, Hoey J, Gedeon T (2016) EmotiW 2016: video and group-level emotion recognition challenges. In: Proceedings of the 18th ACM international conference on multimodal interaction. ACM, pp 427–432
  12. Dhall A, Goecke R, Ghosh S, Joshi J, Hoey J, Gedeon T (2017) From individual to group-level emotion recognition: EmotiW 5.0. In: Proceedings of the 19th ACM international conference on multimodal interaction. ACM, pp 524–528
  13. Duda RO, Hart PE, Stork DG (1973) Pattern classification. Wiley, New York
  14. Ekman P, Friesen WV (2003) Unmasking the face: a guide to recognizing emotions from facial clues. Ishk, Los Altos
  15. Erdem CE, Turan C, Aydin Z (2015) BAUM-2: a multilingual audio-visual affective face database. Multimed Tools Appl 74(18):7429–7459
  16. Gers FA, Schmidhuber J, Cummins F (1999) Learning to forget: continual prediction with LSTM. In: Proceedings of ICANN 1999, 9th international conference on artificial neural networks. IET, pp 850–855
  17. Ghimire D, Jeong S, Lee J, Park SH (2017) Facial expression recognition based on local region specific features and support vector machines. Multimed Tools Appl 76(6):7803–7821
  18. Ghimire D, Lee J, Li ZN, Jeong S (2017) Recognition of facial expressions based on salient geometric features and support vector machines. Multimed Tools Appl 76(6):7921–7946
  19. Graves A (2012) Supervised sequence labelling with recurrent neural networks. Springer, Berlin
  20. Graves A, Schmidhuber J (2005) Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw 18(5-6):602–610
  21. Graves A, Jaitly N, Mohamed A (2013) Hybrid speech recognition with deep bidirectional LSTM. In: 2013 IEEE workshop on automatic speech recognition and understanding (ASRU). IEEE, pp 273–278
  22. Graves A, Mohamed A, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: Proceedings of the 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP 2013). IEEE, pp 6645–6649
  23. Han J, Zhang Z, Ringeval F, Schuller B (2017) Reconstruction-error-based learning for continuous emotion recognition in speech. In: Proceedings of the 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP 2017). IEEE, pp 2367–2371
  24. He L, Jiang D, Yang L, Pei E, Wu P, Sahli H (2015) Multimodal affective dimension prediction using deep bidirectional long short-term memory recurrent neural networks. In: Proceedings of the 5th international workshop on audio/visual emotion challenge. ACM, pp 73–80
  25. Hernández-González J, Inza I, Lozano JA (2016) Weak supervision and other non-standard classification problems: a taxonomy. Pattern Recogn Lett 69:49–55
  26. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
  27. Huang G, Liu Z, van der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: Proceedings of the 2017 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 2261–2269
  28. Kaya H, Çilli F, Salah AA (2014) Ensemble CCA for continuous emotion prediction. In: Proceedings of the 4th international workshop on audio/visual emotion challenge. ACM, pp 19–26
  29. Le D, Aldeneh Z, Provost EM (2017) Discretized continuous speech emotion recognition with multi-task deep recurrent neural network. In: Proceedings of the annual conference of the international speech communication association (INTERSPEECH 2017)
  30. Lisetti C (1998) Affective computing. Pattern Anal Applic 1(1):71–73
  31. Mansoorizadeh M, Charkari NM (2010) Multimodal information fusion application to human emotion recognition from face and speech. Multimed Tools Appl 49(2):277–297
  32. Mathieu B, Essid S, Fillon T, Prado J, Richard G (2010) YAAFE, an easy to use and efficient audio feature extraction software. In: Proceedings of the 11th international society for music information retrieval conference (ISMIR 2010), pp 441–446
  33. Nguyen MH, Torresani L, De La Torre F, Rother C (2009) Weakly supervised discriminative localization and classification: a joint learning process. In: Proceedings of the 12th international conference on computer vision (ICCV 2009). IEEE, pp 1925–1932
  34. Nicolaou MA, Gunes H, Pantic M (2010) Automatic segmentation of spontaneous data using dimensional labels from multiple coders. In: Proceedings of the LREC international workshop on multimodal corpora: advances in capturing, coding and analyzing multimodality, pp 43–48
  35. Nicolaou MA, Gunes H, Pantic M (2011) Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space. IEEE Trans Affect Comput 2(2):92–105
  36. Nicolle J, Rapp V, Bailly K, Prevost L, Chetouani M (2012) Robust continuous prediction of human emotions using multiscale dynamic cues. In: Proceedings of the 14th ACM international conference on multimodal interaction. ACM, pp 501–508
  37. Ozkan D, Scherer S, Morency LP (2012) Step-wise emotion recognition using concatenated-HMM. In: Proceedings of the 14th ACM international conference on multimodal interaction. ACM, pp 477–484
  38. Pei E, Yang L, Jiang D, Sahli H (2015) Multimodal dimensional affect recognition using deep bidirectional long short-term memory recurrent neural networks. In: Proceedings of the 2015 international conference on affective computing and intelligent interaction (ACII 2015). IEEE, pp 208–214
  39. Povolny F, Matejka P, Hradis M, Popková A, Otrusina L, Smrz P, Wood I, Robin C, Lamel L (2016) Multimodal emotion recognition for AVEC 2016 challenge. In: Proceedings of the 6th international workshop on audio/visual emotion challenge. ACM, pp 75–82
  40. Prenter PM, et al. (2008) Splines and variational methods. Courier Corporation, Chelmsford
  41. Pudil P, Novovičová J, Kittler J (1994) Floating search methods in feature selection. Pattern Recogn Lett 15(11):1119–1125
  42. Ringeval F, Sonderegger A, Sauer J, Lalanne D (2013) Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In: Proceedings of the 10th IEEE international conference and workshops on automatic face and gesture recognition (FG 2013). IEEE, pp 1–8
  43. Ringeval F, Schuller B, Valstar M, Cowie R, Pantic M (2015) AVEC 2015: the 5th international audio/visual emotion challenge and workshop. In: Proceedings of the 23rd ACM international conference on multimedia. ACM, pp 1335–1336
  44. Ringeval F, Schuller B, Valstar M, Jaiswal S, Marchi E, Lalanne D, Cowie R, Pantic M (2015) AV+EC 2015: the first affect recognition challenge bridging across audio, video, and physiological data. In: Proceedings of the 5th international workshop on audio/visual emotion challenge. ACM, pp 3–8
  45. Ringeval F, Schuller B, Valstar M, Gratch J, Cowie R, Scherer S, Mozgai S, Cummins N, Schmitt M, Pantic M (2017) AVEC 2017: real-life depression, and affect recognition workshop and challenge. In: Proceedings of the 7th annual workshop on audio/visual emotion challenge. ACM, pp 3–9
  46. Russell JA (1980) A circumplex model of affect. J Pers Soc Psychol 39(6):1161
  47. Schuller B, Valster M, Eyben F, Cowie R, Pantic M (2012) AVEC 2012: the continuous audio/visual emotion challenge. In: Proceedings of the 14th ACM international conference on multimodal interaction. ACM, pp 449–456
  48. Schuller B, Steidl S, Batliner A, Vinciarelli A, Scherer K, Ringeval F, Chetouani M, Weninger F, Eyben F, Marchi E, et al. (2013) The INTERSPEECH 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism. In: Proceedings of the 14th annual conference of the international speech communication association (INTERSPEECH 2013)
  49. Siddiqi MH, Ali R, Idris M, Khan AM, Kim ES, Whang MC, Lee S (2016) Human facial expression recognition using curvelet feature extraction and normalized mutual information feature selection. Multimed Tools Appl 75(2):935–959
  50. Sidorov M, Minker W (2014) Emotion recognition and depression diagnosis by acoustic and visual features: a multimodal approach. In: Proceedings of the 4th international workshop on audio/visual emotion challenge. ACM, pp 81–86
  51. Somandepalli K, Gupta R, Nasir M, Booth BM, Lee S, Narayanan SS (2016) Online affect tracking with multimodal Kalman filters. In: Proceedings of the 6th international workshop on audio/visual emotion challenge. ACM, pp 59–66
  52. Sun B, Cao S, Li L, He J, Yu L (2016) Exploring multimodal visual features for continuous affect recognition. In: Proceedings of the 6th international workshop on audio/visual emotion challenge. ACM, pp 83–88
  53. Trigeorgis G, Ringeval F, Brueckner R, Marchi E, Nicolaou MA, Schuller B, Zafeiriou S (2016) Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In: Proceedings of the 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP 2016). IEEE, pp 5200–5204
  54. Valstar MF, Jiang B, Mehu M, Pantic M, Scherer K (2011) The first facial expression recognition and analysis challenge. In: Proceedings of the 2011 IEEE international conference on automatic face & gesture recognition and workshops (FG 2011). IEEE, pp 921–926
  55. Valstar M, Schuller B, Smith K, Eyben F, Jiang B, Bilakhia S, Schnieder S, Cowie R, Pantic M (2013) AVEC 2013: the continuous audio/visual emotion and depression recognition challenge. In: Proceedings of the 3rd ACM international workshop on audio/visual emotion challenge. ACM, pp 3–10
  56. Valstar M, Schuller B, Smith K, Almaev T, Eyben F, Krajewski J, Cowie R, Pantic M (2014) AVEC 2014: 3D dimensional affect and depression recognition challenge. In: Proceedings of the 4th international workshop on audio/visual emotion challenge. ACM, pp 3–10
  57. Valstar MF, Almaev T, Girard JM, McKeown G, Mehu M, Yin L, Pantic M, Cohn JF (2015) FERA 2015 - second facial expression recognition and analysis challenge. In: 11th IEEE international conference and workshops on automatic face and gesture recognition (FG 2015), vol 6. IEEE, pp 1–8
  58. Valstar M, Gratch J, Schuller B, Ringeval F, Lalanne D, Torres Torres M, Scherer S, Stratou G, Cowie R, Pantic M (2016) AVEC 2016: depression, mood, and emotion recognition workshop and challenge. In: Proceedings of the 6th international workshop on audio/visual emotion challenge. ACM, pp 3–10
  59. Valstar MF, Sánchez-Lozano E, Cohn JF, Jeni LA, Girard JM, Zhang Z, Yin L, Pantic M (2017) FERA 2017 - addressing head pose in the third facial expression recognition and analysis challenge. In: 12th IEEE international conference on automatic face & gesture recognition (FG 2017). IEEE, pp 839–847
  60. Van der Maaten L (2012) Audio-visual emotion challenge 2012: a simple approach. In: Proceedings of the 14th ACM international conference on multimodal interaction. ACM, pp 473–476
  61. Verma GK, Tiwary US (2017) Affect representation and recognition in 3D continuous valence–arousal–dominance space. Multimed Tools Appl 76(2):2159–2183
  62. Ververidis D, Kotropoulos C (2006) Fast sequential floating forward selection applied to emotional speech features estimated on DES and SUSAS data collections. In: Proceedings of the 14th European signal processing conference. IEEE, pp 1–5
  63. Wang F, Sahli H, Gao J, Jiang D, Verhelst W (2015) Relevance units machine based dimensional and continuous speech emotion prediction. Multimed Tools Appl 74(22):9983–10000
  64. Weninger F, Geiger J, Wöllmer M, Schuller B, Rigoll G (2014) Feature enhancement by deep LSTM networks for ASR in reverberant multisource environments. Comput Speech Lang 28(4):888–902
  65. Weninger F, Bergmann J, Schuller BW (2015) Introducing CURRENNT: the Munich open-source CUDA recurrent neural network toolkit. J Mach Learn Res 16(3):547–551
  66. Weninger F, Ringeval F, Marchi E, Schuller B (2016) Discriminatively trained recurrent neural networks for continuous dimensional emotion recognition from audio. In: Proceedings of the twenty-fifth international joint conference on artificial intelligence. AAAI Press, pp 2196–2202
  67. Werbos PJ (1990) Backpropagation through time: what it does and how to do it. Proc IEEE 78(10):1550–1560
  68. Williams RJ, Zipser D (1995) Gradient-based learning algorithms for recurrent networks and their computational complexity. Backpropagation: theory, architectures, and applications 1:433–486
  69. Wöllmer M, Eyben F, Reiter S, Schuller B, Cox C, Douglas-Cowie E, Cowie R (2008) Abandoning emotion classes - towards continuous emotion recognition with modelling of long-range dependencies. In: Proceedings of the ninth annual conference of the international speech communication association (INTERSPEECH 2008), pp 597–600
  70. Wöllmer M, Schuller B, Eyben F, Rigoll G (2010) Combining long short-term memory and dynamic Bayesian networks for incremental emotion-sensitive artificial listening. IEEE J Sel Top Sign Proces 4(5):867–881
  71. Wöllmer M, Kaiser M, Eyben F, Schuller B, Rigoll G (2013) LSTM-modeling of continuous emotions in an audiovisual affect recognition framework. Image Vis Comput 31(2):153–163
  72. Zhang Z, Ringeval F, Han J, Deng J, Marchi E, Schuller B (2016) Facing realism in spontaneous emotion recognition from speech: feature enhancement by autoencoder with LSTM neural networks. In: Proceedings of the 17th annual conference of the international speech communication association (INTERSPEECH 2016), pp 3593–3597

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. VUB-NPU Joint AVSP Laboratory, School of Computer Science, Northwestern Polytechnical University (NPU), Xi’an, China
  2. Department ETRO, Vrije Universiteit Brussel (VUB), Brussels, Belgium
