Multimedia Tools and Applications, Volume 76, Issue 6, pp 8305–8328

Ensemble softmax regression model for speech emotion recognition



Automatic emotion recognition from speech signals is an important research area. Many speech emotion recognition methods have been proposed, among which ensemble learning is an effective approach. However, existing methods still face problems such as the curse of dimensionality and the difficulty of ensuring diversity among the base classifiers. To overcome these problems, this paper proposes an ensemble Softmax regression model for speech emotion recognition (ESSER). It applies feature extraction methods based on substantially different principles to generate the subspaces for the base classifiers, so that the diversity of the base classifiers can be ensured. Furthermore, a feature selection method that selects features according to the global structure of the data is used to reduce the dimensionality of the subspaces, which further increases the diversity of the base classifiers and mitigates the curse of dimensionality. Since, once the diversity of the base classifiers is ensured, the performance of an ensemble classifier depends largely on the ability of its base classifier, it is reasonable for ESSER to choose Softmax as the base classifier, as Softmax has shown its superiority in speech emotion recognition. The conducted experiments validate the proposed approach in terms of speech emotion recognition performance.


Keywords: Speech emotion recognition · Softmax regression · Ensemble learning · Ensemble Softmax regression



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Jiaxing University, Jiaxing, China
  2. South China University of Technology, Guangzhou, China
