Speaker-independent emotion recognition exploiting a psychologically-inspired binary cascade classification schema
In this paper, a psychologically-inspired binary cascade classification schema is proposed for speech emotion recognition. Performance is enhanced because commonly confused pairs of emotions are explicitly distinguished from one another. The extracted features are related to statistics of pitch, formant, and energy contours, as well as spectral, cepstral, perceptual, and temporal features, autocorrelation, MPEG-7 descriptors, Fujisaki's model parameters, voice quality, jitter, and shimmer. The selected features are fed as input to a K-nearest-neighbor classifier and to support vector machines; two kernels are tested for the latter: linear and Gaussian radial basis function. A recently proposed speaker-independent experimental protocol is applied to the Berlin emotional speech database for each gender separately. The best emotion recognition accuracy, 87.7%, is achieved by support vector machines with the linear kernel, outperforming state-of-the-art approaches. Statistical analysis is carried out first with respect to the classifiers' error rates and then to evaluate the information conveyed by the classifiers' confusion matrices.
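The binary cascade idea described above can be sketched as a small tree of two-class classifiers: an initial stage separates broad emotion groups, and per-branch stages resolve the final label, so each classifier only has to tell apart a commonly confused pair or group. The sketch below is a minimal illustration under assumed groupings (arousal first, then finer splits) with synthetic feature vectors; the actual feature set, emotion inventory, and cascade structure of the paper are not reproduced here.

```python
# Minimal sketch of a binary cascade classification schema for emotion
# recognition. The arousal-based grouping, the four emotion labels, and the
# synthetic 12-dimensional "acoustic features" are illustrative assumptions,
# not the paper's exact schema or feature set.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic feature vectors: 100 samples per emotion, class means offset so
# the classes are roughly separable.
emotions = ["anger", "happiness", "sadness", "boredom"]
X = rng.normal(size=(400, 12)) + np.repeat(np.arange(4)[:, None], 100, axis=0)
y = np.repeat(emotions, 100)

# Stage 1: high-arousal (anger/happiness) vs. low-arousal (sadness/boredom).
high = {"anger", "happiness"}
is_high = np.array([lbl in high for lbl in y])
stage1 = SVC(kernel="linear").fit(X, is_high)

# Stage 2: one binary SVM per arousal branch resolves the final emotion.
stage2_hi = SVC(kernel="linear").fit(X[is_high], y[is_high])
stage2_lo = SVC(kernel="linear").fit(X[~is_high], y[~is_high])

def cascade_predict(x):
    """Route one feature vector through the two-stage binary cascade."""
    x = x.reshape(1, -1)
    branch = stage2_hi if stage1.predict(x)[0] else stage2_lo
    return branch.predict(x)[0]
```

Each stage sees only a binary decision, which is the mechanism the schema relies on to keep easily confused emotion pairs apart; swapping `kernel="linear"` for `kernel="rbf"` reproduces the paper's second tested kernel.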
Keywords: Emotion recognition · Large-scale feature extraction · Binary classification schema · Speaker-independent protocol · Classifier comparison
M. Kotti would like to thank Associate Professor Constantine Kotropoulos for his valuable contributions to the extraction of part of the features described in Sect. 4.