Enhanced multiclass SVM with thresholding fusion for speech-based emotion classification
As an essential approach to understanding human interactions, emotion classification is a vital component of behavioral studies and an important element in the design of context-aware systems. Recent studies have shown that speech contains rich information about emotion, and numerous speech-based emotion classification methods have been proposed. However, classification performance still falls short of what is needed for these algorithms to be used in real systems. We present an emotion classification system that uses several one-against-all support vector machines (SVMs) with a thresholding fusion mechanism to combine the individual outputs. This mechanism can increase the emotion classification accuracy at the expense of rejecting some samples as unclassified. Results show that the proposed system outperforms three state-of-the-art methods and that the thresholding fusion mechanism effectively improves the classification accuracy, which is important for applications that require very high accuracy but do not require that every sample be classified. To demonstrate the performance of the system and the effectiveness of the thresholding fusion mechanism in real scenarios, we evaluate the system in several challenging settings, including speaker-independent tests, tests on noisy speech signals, and tests using non-professional acted recordings.
Keywords: Emotion classification · Support vector machine · Thresholding fusion · Noisy speech
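The one-against-all training and thresholding fusion described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes scikit-learn's `SVC`, a hypothetical feature matrix `X`, label array `y`, and an arbitrary confidence threshold of 0.7; the paper's actual features, kernel, and threshold selection are not reproduced here.

```python
# Minimal sketch: one-against-all SVMs with thresholding fusion.
# Assumes scikit-learn; X is an (n_samples, n_features) array of
# speech features, y an array of emotion labels (both hypothetical).
import numpy as np
from sklearn.svm import SVC

def train_ova_svms(X, y, emotions):
    """Train one binary SVM per emotion class (one-against-all)."""
    models = {}
    for emo in emotions:
        # probability=True enables Platt scaling, so each SVM
        # outputs a confidence score rather than a hard decision.
        clf = SVC(kernel="rbf", probability=True)
        clf.fit(X, (np.asarray(y) == emo).astype(int))
        models[emo] = clf
    return models

def classify_with_threshold(models, x, threshold=0.7):
    """Fuse the per-emotion confidences; reject the sample as
    unclassified when no SVM is confident enough."""
    scores = {emo: clf.predict_proba(x.reshape(1, -1))[0, 1]
              for emo, clf in models.items()}
    best_emo, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_emo if best_score >= threshold else None  # None = rejected
```

Raising the threshold trades coverage for accuracy: more samples are rejected as unclassified, but the decisions that are made become more reliable, mirroring the trade-off the abstract describes.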
This research was supported by funding from the National Institutes of Health NICHD (Grant R01 HD060789). We thank Dr. Jennifer Samp for obtaining the voice recordings from students at the University of Georgia. We also thank Sefik Emre Eskimez and Kenneth Imade for conducting the human user study using Amazon Mechanical Turk.