
Improving the performance of the speaker emotion recognition based on low dimension prosody features vector

International Journal of Speech Technology

Abstract

Speaker emotion recognition is an important research problem, with applications in human–robot interaction, human–computer interaction, and related areas. This work addresses recognition of a speaker's emotion from a speech utterance. Prosody features, namely pitch, log energy, zero crossing rate, and the first three formant frequencies, are extracted, and a feature vector is built from 11 statistical parameters of each feature. An Artificial Neural Network (ANN) is chosen as the classifier owing to its universal function approximation capability. Because the time an ANN-based classifier needs for training and classification grows with the dimension of the feature vector, this work focuses both on developing a prosody-based speaker emotion recognition system and on reducing the dimensionality of its feature vectors; principal component analysis (PCA) is used for the dimensionality reduction. The emotional prosody speech and transcripts corpus from the Linguistic Data Consortium (LDC) and the Berlin emotional speech database are used to evaluate the proposed approach on seven emotion classes. Compared with existing approaches, the proposed method achieves better performance: recognition rates of 75.32% and 84.5% are obtained on the Berlin emotional database and the LDC emotional speech database, respectively.
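As a rough illustration of the pipeline described above, the sketch below builds a per-utterance statistical feature vector, reduces its dimension with PCA, and trains an ANN (multi-layer perceptron) classifier using scikit-learn. It assumes the prosody tracks (pitch, log energy, zero crossing rate, F1–F3) have already been extracted; the helper names (summarize, utterance_vector), the particular set of 11 statistics, the reduced dimension of 10, and the MLP size are illustrative assumptions, not the authors' exact configuration.

```python
# Illustrative sketch only (not the authors' code): statistical prosody
# feature vector -> PCA dimensionality reduction -> ANN (MLP) classifier.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

def summarize(track):
    """Reduce one feature track (e.g. the pitch contour of an utterance)
    to 11 summary statistics; the exact statistics used in the paper are
    not listed in the abstract, so this set is an assumption."""
    d = np.diff(track)
    return np.array([
        track.mean(), track.std(), track.min(), track.max(),
        np.median(track), np.percentile(track, 25), np.percentile(track, 75),
        track.max() - track.min(), d.mean(), d.std(), np.abs(d).max(),
    ])

def utterance_vector(tracks):
    """Concatenate the statistics of the six prosody tracks
    (pitch, log energy, ZCR, F1, F2, F3): 6 x 11 = 66 dimensions."""
    return np.concatenate([summarize(t) for t in tracks])

# X: (n_utterances, 66) matrix of feature vectors, y: 7 emotion labels.
# PCA shrinks the vector before the ANN, which is what reduces the
# training and classification time discussed in the abstract.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),                    # reduced dimension: assumed
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
)
# model.fit(X_train, y_train); model.score(X_test, y_test)
```

In practice the reduced dimension would be chosen from the explained-variance ratio reported by PCA, trading a little accuracy for the shorter training and classification times the abstract highlights.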



Notes

  1. Linguistic Data Consortium (LDC), http://www.ldc.upenn.edu/.


Author information

Corresponding author

Correspondence to Ashishkumar Prabhakar Gudmalwar.


About this article


Cite this article

Gudmalwar, A.P., Rama Rao, C.V. & Dutta, A. Improving the performance of the speaker emotion recognition based on low dimension prosody features vector. Int J Speech Technol 22, 521–531 (2019). https://doi.org/10.1007/s10772-018-09576-4

