Enhanced multiclass SVM with thresholding fusion for speech-based emotion classification

Published in the International Journal of Speech Technology.

Abstract

As an essential approach to understanding human interactions, emotion classification is a vital component of behavioral studies and plays an important role in the design of context-aware systems. Recent studies have shown that speech carries rich information about emotion, and numerous speech-based emotion classification methods have been proposed. However, classification performance still falls short of what is needed for these algorithms to be used in real systems. We present an emotion classification system that uses several one-against-all support vector machines with a thresholding fusion mechanism to combine their individual outputs, which increases classification accuracy at the expense of rejecting some samples as unclassified. Results show that the proposed system outperforms three state-of-the-art methods and that the thresholding fusion mechanism effectively improves classification accuracy, which is important for applications that demand very high accuracy but do not require that every sample be classified. To demonstrate the performance of the system and the effectiveness of the thresholding fusion mechanism in realistic conditions, we evaluate the system in several challenging scenarios, including speaker-independent tests, tests on noisy speech signals, and tests using non-professional acted recordings.
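To make the fusion idea concrete, the sketch below illustrates the general pattern of one-against-all SVMs combined with a confidence threshold that rejects uncertain samples. It is a minimal sketch, not the authors' implementation: the scikit-learn classifier, RBF kernel, probability-based confidence scores, emotion label set, and threshold value are all assumptions, and the acoustic feature extraction that would produce the feature matrix `X` is omitted.

```python
# Minimal sketch of one-against-all SVMs with thresholding fusion.
# Assumptions (not from the paper): scikit-learn SVC with an RBF kernel,
# probability outputs used as confidence scores, a hypothetical emotion
# label set, and an illustrative threshold of 0.7. Feature extraction
# (e.g., MFCC or pitch statistics) is assumed to have produced X already.
import numpy as np
from sklearn.svm import SVC

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # hypothetical label set


def train_ova_svms(X, y):
    """Train one binary SVM per emotion (target emotion vs. all others)."""
    models = {}
    for emo in EMOTIONS:
        y_bin = (np.asarray(y) == emo).astype(int)  # 1 = target emotion
        clf = SVC(kernel="rbf", probability=True)
        clf.fit(X, y_bin)
        models[emo] = clf
    return models


def classify_with_fusion(models, x, threshold=0.7):
    """Fuse per-class confidences; reject the sample as unclassified
    (return None) if the best confidence falls below the threshold."""
    x = np.asarray(x).reshape(1, -1)
    scores = {emo: clf.predict_proba(x)[0, 1] for emo, clf in models.items()}
    best_emo, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_emo if best_score >= threshold else None
```

Raising the threshold trades coverage for accuracy: more samples are rejected as unclassified, but the samples that do receive a label are labeled more reliably, which matches the behavior described in the abstract.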



Acknowledgments

This research was supported by funding from the National Institutes of Health NICHD (Grant R01 HD060789). We thank Dr. Jennifer Samp for obtaining the voice recordings from students at the University of Georgia. We also thank Sefik Emre Eskimez and Kenneth Imade for conducting the human user study using Amazon Mechanical Turk.

Author information

Corresponding author

Correspondence to Na Yang.

About this article

Cite this article

Yang, N., Yuan, J., Zhou, Y. et al. Enhanced multiclass SVM with thresholding fusion for speech-based emotion classification. Int J Speech Technol 20, 27–41 (2017). https://doi.org/10.1007/s10772-016-9364-2

