Abstract
With the emergence of systems such as Amazon Echo, Google Home, and Siri, voice has become a prevalent mode for humans to interact with machines. Emotion detection from voice promises to transform a wide range of applications, from adding emotional awareness to voice assistants to creating more sensitive robotic helpers for the elderly. Unfortunately, emotion expression varies dramatically between individuals, making emotion detection a challenging problem. To tackle this challenge, we introduce the Sub-Clip Classification Boosting (SCB) framework, a multi-step methodology for emotion detection from non-textual features of audio clips. SCB features a highly effective sub-clip boosting methodology for classification that, unlike traditional boosting over feature subsets, works at the sub-instance level. Multiple sub-instance classifications increase the likelihood that an emotion cue will be found within a voice clip, even if its location varies between speakers. First, each parent voice clip is decomposed into overlapping sub-clips. Each sub-clip is then independently classified. Next, the Emotion Strength of each sub-classification is scored, forming a sub-classification and strength pair. Finally, we design a FilterBoost-inspired “Oracle” that uses these sub-classification and Emotion Strength pairs to determine the parent clip classification. To tune classification performance, we explore the relationships between sub-clip properties such as length and overlap. Evaluation on three prominent benchmark datasets demonstrates that our SCB method consistently outperforms state-of-the-art methods across diverse languages and speakers. Code related to this paper is available at: https://arcgit.wpi.edu/toto/EMOTIVOClean.
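For concreteness, the pipeline the abstract describes (decompose into overlapping sub-clips, sub-classify, score Emotion Strength, aggregate via the Oracle) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the `featurize` helper, the default sub-clip length and overlap, the use of classifier confidence as a stand-in for the paper's Emotion Strength score, and the strength-weighted vote standing in for the FilterBoost-inspired Oracle are all hypothetical.

```python
# Sketch of sub-clip classification boosting. Assumes a fitted scikit-learn-
# style classifier with predict_proba and a user-supplied featurize() that
# maps a raw sub-clip to a fixed-length feature vector (e.g., openSMILE-like
# statistics); both are assumptions of this sketch, not the paper's method.
import numpy as np


def make_subclips(clip: np.ndarray, subclip_len: int, overlap: float) -> list:
    """Decompose a parent clip (1-D sample array) into overlapping sub-clips."""
    hop = max(int(subclip_len * (1.0 - overlap)), 1)  # stride between starts
    last_start = max(len(clip) - subclip_len, 0)
    return [clip[s:s + subclip_len] for s in range(0, last_start + 1, hop)]


def classify_parent_clip(clip, classifier, featurize,
                         subclip_len=16000, overlap=0.5) -> int:
    """Aggregate sub-clip (label, strength) pairs into a parent-clip label."""
    subclips = make_subclips(clip, subclip_len, overlap)
    probs = np.array([classifier.predict_proba(featurize(s).reshape(1, -1))[0]
                      for s in subclips])
    labels = probs.argmax(axis=1)    # per-sub-clip classification
    strengths = probs.max(axis=1)    # confidence as an Emotion Strength proxy
    votes = np.zeros(probs.shape[1])
    for label, strength in zip(labels, strengths):
        votes[label] += strength     # "Oracle" stand-in: strength-weighted vote
    return int(votes.argmax())
```

One property this sketch makes visible: with 0.5 overlap, every region of the parent clip falls inside at least two sub-clips, so a localized emotion cue that straddles one sub-clip boundary still lies wholly within a neighboring sub-clip, which is why multiple sub-instance classifications raise the chance of catching the cue.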
Keywords
- Classification
- Emotion
- Boosting
- Sub-clip
- Sub-classification
References
Abbasi, A., Chen, H., Salem, A.: Sentiment analysis in multiple languages: feature selection for opinion classification in web forums. ACM Trans. Inf. Syst. (TOIS) 26(3), 12 (2008)
Anagnostopoulos, C.N., Iliou, T., Giannoukos, I.: Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011. Artif. Intell. Rev. 43(2), 155–177 (2015)
Badshah, A.M., Ahmad, J., Rahim, N., Baik, S.W.: Speech emotion recognition from spectrograms with deep convolutional neural network. In: 2017 International Conference on Platform Technology and Service, PlatCon, pp. 1–5. IEEE (2017)
Bradley, J.K., Schapire, R.E.: FilterBoost: regression and classification on large datasets. In: NIPS, pp. 185–192 (2007)
Chenchah, F., Lachiri, Z.: Speech emotion recognition in acted and spontaneous context. Proc. Comput. Sci. 39, 139–145 (2014)
Ekman, P.: Strong evidence for universals in facial expressions: a reply to Russell’s mistaken critique. Psychol. Bull. 115(2), 268–287 (1994)
El Ayadi, M., Kamel, M.S., Karray, F.: Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recogn. 44(3), 572–587 (2011)
Eyben, F., Unfried, M., Hagerer, G., Schuller, B.: Automatic multi-lingual arousal detection from voice applied to real product testing applications. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 5155–5159. IEEE (2017)
Eyben, F., Weninger, F., Gross, F., Schuller, B.: Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In: Proceedings of the 21st ACM International Conference on Multimedia, pp. 835–838. ACM (2013)
Haq, S., Jackson, P., Edge, J.: Audio-visual feature selection and reduction for emotion classification. In: Proceedings of International Conference on Auditory-Visual Speech Processing, AVSP 2008, Tangalooma, Australia, September 2008
Hossain, M.S., Muhammad, G., Alhamid, M.F., Song, B., Al-Mutib, K.: Audio-visual emotion recognition using big data towards 5G. Mobile Netw. Appl. 21(5), 753–763 (2016)
Huang, Z., Dong, M., Mao, Q., Zhan, Y.: Speech emotion recognition using CNN. In: Proceedings of the 22nd ACM International Conference on Multimedia, pp. 801–804. ACM (2014)
Kerkeni, L., Serrestou, Y., Mbarki, M., Raoof, K., Mahjoub, M.A.: A review on speech emotion recognition: case of pedagogical interaction in classroom. In: 2017 International Conference on Advanced Technologies for Signal and Image Processing, ATSIP, pp. 1–7. IEEE (2017)
Kishore, K.K., Satish, P.K.: Emotion recognition in speech using MFCC and wavelet features. In: 2013 IEEE 3rd International Advance Computing Conference, IACC, pp. 842–847. IEEE (2013)
Knapp, M.L., Hall, J.A., Horgan, T.G.: Nonverbal Communication in Human Interaction. Cengage Learning, Boston (2013)
Kobayashi, V., Calag, V.: Detection of affective states from speech signals using ensembles of classifiers. In: IET Intelligent Signal Processing Conference (2013)
Kobayashi, V.: A hybrid distance-based method and support vector machines for emotional speech detection. In: Appice, A., Ceci, M., Loglisci, C., Manco, G., Masciari, E., Ras, Z.W. (eds.) NFMCP 2013. LNCS, vol. 8399, pp. 85–99. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-08407-7_6
Kraus, M.W.: Voice-only communication enhances empathic accuracy. Am. Psychol. 72(7), 644 (2017)
Litman, D.J., Silliman, S.: ITSPOKE: an intelligent tutoring spoken dialogue system. In: Demonstration Papers at HLT-NAACL 2004, pp. 5–8. Association for Computational Linguistics (2004)
Nass, C., Moon, Y.: Machines and mindlessness: social responses to computers. J. Soc. Issues 56(1), 81–103 (2000)
Pal, M.: Random forest classifier for remote sensing classification. Int. J. Remote Sens. 26(1), 217–222 (2005)
Picard, R.W.: Affective computing. Technical report 321, MIT Media Laboratory, Perceptual Computing Section (1995)
Poels, K., Dewitte, S.: How to capture the heart? Reviewing 20 years of emotion measurement in advertising. J. Advert. Res. 46(1), 18–37 (2006)
Riva, G.: Ambient intelligence in health care. CyberPsychol. Behav. 6(3), 295–300 (2003)
Sun, Y., Wen, G.: Ensemble softmax regression model for speech emotion recognition. Multimed. Tools Appl. 76(6), 8305–8328 (2017)
Todorovski, L., Džeroski, S.: Combining classifiers with meta decision trees. Mach. Learn. 50(3), 223–249 (2003)
Valstar, M., et al.: AVEC 2013: the continuous audio/visual emotion and depression recognition challenge. In: Proceedings of the 3rd ACM International Workshop on Audio/Visual Emotion Challenge, pp. 3–10. ACM (2013)
Vasuki, P.: Speech emotion recognition using adaptive ensemble of class specific classifiers. Res. J. Appl. Sci. Eng. Technol. 9(12), 1105–1114 (2015)
Vasuki, P., Vaideesh, A., Abubacker, M.S.: Emotion recognition using ensemble of cepstral, perceptual and temporal features. In: International Conference on Inventive Computation Technologies, ICICT, vol. 2, pp. 1–6. IEEE (2016)
Verhoef, P.C., Lemon, K.N., Parasuraman, A., Roggeveen, A., Tsiros, M., Schlesinger, L.A.: Customer experience creation: determinants, dynamics and management strategies. J. Retail. 85(1), 31–41 (2009)
Vlasenko, B., Wendemuth, A.: Tuning hidden Markov model for speech emotion recognition. Fortschritte der Akustik 33(1), 317 (2007)
Vogt, T., André, E., Bee, N.: EmoVoice—a framework for online recognition of emotions from voice. In: André, E., Dybkjær, L., Minker, W., Neumann, H., Pieraccini, R., Weber, M. (eds.) PIT 2008. LNCS, vol. 5078, pp. 188–199. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-69369-7_21
Wang, Y., Guan, L.: Recognizing human emotional state from audiovisual signals. IEEE Trans. Multimed. 10(5), 936–946 (2008)
Weißkirchen, N., Böck, R., Wendemuth, A.: Recognition of emotional speech with convolutional neural networks by means of spectral estimates. In: 2017 Seventh International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), pp. 50–55. IEEE (2017)
Wen, G., Li, H., Huang, J., Li, D., Xun, E.: Random deep belief networks for recognizing emotions from speech signals. Comput. Intell. Neurosci. 2017 (2017)
Yu, D., Deng, L.: Automatic Speech Recognition. SCT. Springer, London (2015). https://doi.org/10.1007/978-1-4471-5779-3
Zao, L., Cavalcante, D., Coelho, R.: Time-frequency feature and AMS-GMM mask for acoustic emotion classification. IEEE Signal Process. Lett. 21(5), 620–624 (2014)
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Toto, E., Foley, B.J., Rundensteiner, E.A. (2019). Improving Emotion Detection with Sub-clip Boosting. In: Berlingerio, M., et al. (eds.) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2018. Lecture Notes in Computer Science, vol. 11053. Springer, Cham. https://doi.org/10.1007/978-3-030-10997-4_3
DOI: https://doi.org/10.1007/978-3-030-10997-4_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-10996-7
Online ISBN: 978-3-030-10997-4
Published in cooperation with http://www.ecmlpkdd.org/