Journal on Multimodal User Interfaces, Volume 10, Issue 2, pp 139–149

Combining modality-specific extreme learning machines for emotion recognition in the wild

Original Paper

Abstract

This paper proposes extreme learning machines (ELM) for modeling audio and video features for emotion recognition under uncontrolled conditions. The ELM paradigm is a fast and accurate learning alternative for single-hidden-layer feedforward networks. We experiment on the Acted Facial Expressions in the Wild (AFEW) corpus, which features seven discrete emotions (six basic emotions plus neutral), and adhere to the EmotiW 2014 Challenge protocols. For both modalities, kernel ELM yields better results than basic ELM. We contrast several fusion approaches and, by combining one audio and three video sub-systems, reach a test set accuracy of 50.12% on the seven-class EmotiW 2014 Challenge, against a video-only baseline of 33.70%. We also compare ELM with the partial least squares regression based classification used in the top-performing system of EmotiW 2014, and discuss the advantages of both approaches.
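As a rough illustration of the two ELM variants contrasted in the abstract, the sketch below follows the formulation commonly given in the ELM literature: basic ELM draws random hidden-layer weights and solves a regularized least-squares problem for the output weights, while kernel ELM replaces the random hidden layer with a kernel matrix K and solves (I/C + K) beta = T. It is not the authors' implementation. The tanh activation, the RBF kernel, the hyperparameters (`n_hidden`, `C`, `gamma`), the function names, and the two-class toy data are all assumptions made for illustration.

```python
import numpy as np


def one_hot(labels, n_classes):
    """Encode integer labels as a +1/-1 target matrix, one column per class."""
    T = -np.ones((labels.size, n_classes))
    T[np.arange(labels.size), labels] = 1.0
    return T


def train_basic_elm(X, T, n_hidden=50, C=1.0, seed=0):
    """Basic ELM: random, untrained hidden layer plus ridge-regressed output weights."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))  # random input weights
    b = rng.standard_normal(n_hidden)                # random hidden biases
    H = np.tanh(X @ W + b)                           # hidden-layer outputs
    # Regularized least squares: (I/C + H^T H) beta = H^T T
    beta = np.linalg.solve(np.eye(n_hidden) / C + H.T @ H, H.T @ T)
    return W, b, beta


def predict_basic_elm(model, X):
    W, b, beta = model
    return np.tanh(X @ W + b) @ beta


def rbf_kernel(A, B, gamma):
    """RBF kernel matrix between the rows of A and the rows of B (assumed kernel)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def train_kernel_elm(X, T, C=1.0, gamma=0.5):
    """Kernel ELM: solve (I/C + K) beta = T on the training kernel matrix."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(np.eye(len(X)) / C + K, T)


def predict_kernel_elm(beta, X_train, X_test, gamma=0.5):
    return rbf_kernel(X_test, X_train, gamma) @ beta


# Toy two-class data (hypothetical, stands in for real audio or video features).
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(-2.0, 0.5, (40, 2)), rng.normal(2.0, 0.5, (40, 2))])
y_train = np.repeat([0, 1], 40)
X_test = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
y_test = np.repeat([0, 1], 20)
T_train = one_hot(y_train, 2)

basic_model = train_basic_elm(X_train, T_train)
acc_basic = np.mean(predict_basic_elm(basic_model, X_test).argmax(1) == y_test)

kernel_beta = train_kernel_elm(X_train, T_train)
acc_kernel = np.mean(predict_kernel_elm(kernel_beta, X_train, X_test).argmax(1) == y_test)

print(f"basic ELM accuracy:  {acc_basic:.2f}")
print(f"kernel ELM accuracy: {acc_kernel:.2f}")
```

In both variants training reduces to a single linear solve, which is where the speed advantage over iteratively trained networks comes from. The predicted class is the column with the largest output score, and such per-class scores are the kind of output that could then be combined across audio and video sub-systems.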

Keywords

Audio-visual emotion corpus; Audio-visual fusion; Feature extraction; Emotion recognition in the wild; Extreme learning machines

Copyright information

© OpenInterface Association 2015

Authors and Affiliations

  1. Department of Computer Engineering, Boğaziçi University, Istanbul, Turkey
