Improving Emotion Detection with Sub-clip Boosting

  • Ermal Toto
  • Brendan J. Foley
  • Elke A. Rundensteiner
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11053)


With the emergence of systems such as Amazon Echo, Google Home, and Siri, voice has become a prevalent mode for humans to interact with machines. Emotion detection from voice promises to transform a wide range of applications, from adding emotional awareness to voice assistants to creating more sensitive robotic helpers for the elderly. Unfortunately, emotion expression varies dramatically between individuals, which makes detection a challenging problem. To tackle this challenge, we introduce the Sub-Clip Classification Boosting (SCB) Framework, a multi-step methodology for emotion detection from non-textual features of audio clips. SCB features a highly effective sub-clip boosting methodology for classification that, unlike traditional boosting using feature subsets, works at the sub-instance level. Multiple sub-instance classifications increase the likelihood that an emotion cue will be found within a voice clip, even if its location varies between speakers. First, each parent voice clip is decomposed into overlapping sub-clips. Each sub-clip is then independently classified. Next, the Emotion Strength of each sub-classification is scored to form a sub-classification and Emotion Strength pair. Finally, we design a FilterBoost-inspired “Oracle” that uses these pairs to determine the parent clip classification. To tune the classification performance, we explore the relationships between sub-clip properties, such as length and overlap. Evaluation on three prominent benchmark datasets demonstrates that our SCB method consistently outperforms all state-of-the-art methods across diverse languages and speakers. Code related to this paper is available at:
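The four steps described above (overlapping decomposition, per-sub-clip classification, Emotion Strength scoring, and aggregation) can be illustrated with a minimal Python sketch. This is not the authors' released code. The function names, the toy classifier, and its strength score are all hypothetical, and the aggregation step is a plain strength-weighted vote standing in for the FilterBoost-inspired Oracle, whose details the abstract does not give.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# A sub-clip classifier returns a (label, strength) pair (hypothetical interface).
SubClassifier = Callable[[Sequence[float]], Tuple[str, float]]


def split_into_subclips(samples: Sequence[float], length: int,
                        overlap: int) -> List[Sequence[float]]:
    """Cut a clip into windows of `length` samples.

    Consecutive windows share `overlap` samples. Trailing samples that
    do not fill a whole window are dropped in this simplified version.
    """
    if not 0 <= overlap < length:
        raise ValueError("overlap must satisfy 0 <= overlap < length")
    if len(samples) <= length:
        return [samples]
    step = length - overlap
    last_start = len(samples) - length
    return [samples[s:s + length] for s in range(0, last_start + 1, step)]


def classify_clip(samples: Sequence[float], classify_subclip: SubClassifier,
                  length: int, overlap: int) -> str:
    """Label the parent clip by a strength-weighted vote over its sub-clips.

    Assumption: a simple weighted vote replaces the paper's Oracle.
    """
    totals: Dict[str, float] = defaultdict(float)
    for sub in split_into_subclips(samples, length, overlap):
        label, strength = classify_subclip(sub)
        totals[label] += strength
    return max(totals, key=totals.get)


def toy_classifier(sub: Sequence[float]) -> Tuple[str, float]:
    """Hypothetical stand-in: loud windows are 'angry', quiet ones 'neutral'.

    The strength is the distance of the mean amplitude from the 0.5 threshold.
    """
    mean = sum(sub) / len(sub)
    label = "angry" if mean > 0.5 else "neutral"
    return label, abs(mean - 0.5)


if __name__ == "__main__":
    clip = [0.0] * 4 + [1.0] * 6
    print(classify_clip(clip, toy_classifier, length=4, overlap=2))
```

In a real pipeline the toy classifier would be replaced by a model trained on acoustic features, and the window length and overlap are the tunable sub-clip properties the abstract refers to.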


Keywords: Classification, Emotion, Boosting, Sub-clip, Sub-classification



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Ermal Toto (1)
  • Brendan J. Foley (1)
  • Elke A. Rundensteiner (1)
  1. Computer Science Department, Worcester Polytechnic Institute, Worcester, USA
