International Journal of Speech Technology, Volume 20, Issue 1, pp 27–41

Enhanced multiclass SVM with thresholding fusion for speech-based emotion classification

  • Na Yang (Email author)
  • Jianbo Yuan
  • Yun Zhou
  • Ilker Demirkol
  • Zhiyao Duan
  • Wendi Heinzelman
  • Melissa Sturge-Apple


Emotion classification is essential for understanding human interactions: it is a vital component of behavioral studies and is important in the design of context-aware systems. Recent studies have shown that speech carries rich information about emotion, and numerous speech-based emotion classification methods have been proposed. However, classification performance still falls short of what real systems require. We present an emotion classification system that uses several one-against-all support vector machines with a thresholding fusion mechanism to combine their individual outputs. The mechanism increases classification accuracy at the expense of rejecting some samples as unclassified. Results show that the proposed system outperforms three state-of-the-art methods and that thresholding fusion effectively improves classification accuracy, which matters for applications that require very high accuracy but do not require every sample to be classified. To demonstrate the system's performance and the effectiveness of thresholding fusion in realistic scenarios, we evaluate it under several challenging conditions: speaker-independent tests, tests on noisy speech signals, and tests using non-professional acted recordings.
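The accuracy-versus-rejection trade-off described in the abstract can be illustrated with a minimal sketch. It assumes a simple fusion rule that the abstract does not spell out: each one-against-all classifier yields a confidence score for its emotion, and the top-scoring emotion is accepted only if its score reaches a threshold. The function names, the score values, and the rule itself are hypothetical illustrations, not the authors' implementation.

```python
from typing import Dict, List, Optional, Tuple

Sample = Tuple[str, Dict[str, float]]  # (true label, per-class confidence scores)


def fuse_with_threshold(scores: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the top-scoring label, or None (unclassified) if its score is too low.

    Assumed rule for illustration only; the paper's exact mechanism may differ.
    """
    label = max(scores, key=scores.get)
    return label if scores[label] >= threshold else None


def accuracy_and_rejection(samples: List[Sample], threshold: float) -> Tuple[float, float]:
    """Compute accuracy over the classified samples and the fraction rejected."""
    if not samples:
        raise ValueError("samples must not be empty")
    classified = 0
    correct = 0
    for true_label, scores in samples:
        predicted = fuse_with_threshold(scores, threshold)
        if predicted is None:
            continue  # rejected as unclassified
        classified += 1
        correct += predicted == true_label
    accuracy = correct / classified if classified else 0.0
    rejection_rate = 1 - classified / len(samples)
    return accuracy, rejection_rate


# Made-up confidence scores standing in for one-against-all SVM outputs.
SAMPLES: List[Sample] = [
    ("happy", {"happy": 0.9, "sad": 0.1, "angry": 0.2}),
    ("sad", {"happy": 0.4, "sad": 0.5, "angry": 0.3}),
    ("angry", {"happy": 0.55, "sad": 0.2, "angry": 0.5}),
    ("sad", {"happy": 0.1, "sad": 0.8, "angry": 0.3}),
]

if __name__ == "__main__":
    for threshold in (0.0, 0.6):
        accuracy, rejected = accuracy_and_rejection(SAMPLES, threshold)
        print(f"threshold={threshold}: accuracy={accuracy:.2f}, rejected={rejected:.2f}")
```

On this toy data, raising the threshold discards the low-confidence samples, including the one misclassified sample, so accuracy on the remaining samples rises while the rejected fraction grows.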


Keywords: Emotion classification; Support vector machine; Thresholding fusion; Noisy speech



This research was supported by funding from the National Institutes of Health, NICHD (Grant R01 HD060789). We thank Dr. Jennifer Samp for obtaining the voice recordings from students at the University of Georgia. We also thank Sefik Emre Eskimez and Kenneth Imade for conducting the human user study using Amazon Mechanical Turk.



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Na Yang (1), Email author
  • Jianbo Yuan (1)
  • Yun Zhou (1)
  • Ilker Demirkol (2)
  • Zhiyao Duan (1)
  • Wendi Heinzelman (1)
  • Melissa Sturge-Apple (3)

  1. Department of Electrical and Computer Engineering, University of Rochester, Rochester, USA
  2. Department of Telematics Engineering, Universitat Politècnica de Catalunya and with i2Cat Foundation, Barcelona, Spain
  3. Department of Clinical and Social Sciences in Psychology, University of Rochester, Rochester, USA
