Abstract
Speaker emotion recognition is an important research problem with many applications in human–robot and human–computer interaction. This work addresses recognizing a speaker's emotion from a speech utterance using prosody features: pitch, log energy, zero-crossing rate, and the first three formant frequencies. Feature vectors are constructed from 11 statistical parameters of each feature. An artificial neural network (ANN) is chosen as the classifier owing to its universal function-approximation capability. In an ANN-based classifier, the time required for training the network as well as for classification depends on the dimension of the feature vector. This work therefore focuses on developing a speaker emotion recognition system using prosody features together with dimensionality reduction of the feature vectors, for which principal component analysis (PCA) is used. The Emotional Prosody Speech and Transcripts corpus from the Linguistic Data Consortium (LDC) and the Berlin emotional speech database are used to evaluate the performance of the proposed approach on seven emotion classes. The proposed method is compared with existing approaches and achieves better performance: from the experimental results, recognition rates of 75.32% for the Berlin emotional database and 84.5% for the LDC emotional speech database are obtained.
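The pipeline in the abstract — summarizing each prosody contour with a fixed set of statistics and then reducing the stacked feature vectors with PCA — can be sketched as below. This is a minimal illustration, not the paper's implementation: the particular 11 statistics and the contour data are assumptions, and the paper's own statistical parameters may differ.

```python
import numpy as np

def prosody_stats(contour):
    """Summarize one prosody contour (e.g. a pitch track) with 11
    simple statistics. The exact statistics here are illustrative."""
    x = np.asarray(contour, dtype=float)
    dx = np.diff(x)
    return np.array([
        x.mean(), x.std(), x.min(), x.max(), np.median(x),
        x.max() - x.min(),                 # range
        np.percentile(x, 25),              # lower quartile
        np.percentile(x, 75),              # upper quartile
        dx.mean(), dx.std(),               # first-difference statistics
        ((x[:-1] * x[1:]) < 0).mean(),     # sign-change (crossing) rate
    ])

def pca_reduce(X, k):
    """Project feature vectors X (n_samples x d) onto the top-k
    principal components, via SVD of the mean-centred data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

# Toy usage: 50 random "utterance contours" -> 11-D vectors -> 4-D PCA space.
rng = np.random.default_rng(0)
X = np.stack([prosody_stats(rng.standard_normal(200)) for _ in range(50)])
Z = pca_reduce(X, 4)
print(X.shape, Z.shape)  # (50, 11) (50, 4)
```

In practice the reduced vectors `Z` would be fed to the ANN classifier; the benefit claimed in the paper is that the lower dimension shortens both training and classification time.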
Notes
Linguistic Data Consortium (LDC), http://www.ldc.upenn.edu/.
References
Abelin, A. & Allwood, J. (2000). Cross-linguistic interpretation of emotional prosody. In Proceedings of the ISCA workshop on speech and emotion.
Anagnostopoulos, C. N., & Vovoli, E. (2010). Sound processing features for speaker-dependent and phrase-independent emotion recognition in Berlin database. In G. Papadopoulos, W. Wojtkowski, G. Wojtkowski, S. Wrycza, & J. Zupancic (Eds.), Information systems development (pp. 413–421). Boston: Springer.
Anagnostopoulos, C. N., Iliou, T., & Giannoukos, I. (2015). Features and classifiers for emotion recognition from speech: A survey from 2000 to 2011. Artificial Intelligence Review, 43(2), 155–177.
Atassi, H. & Esposito, A. (2008). A speaker independent approach to the classification of emotional vocal expressions. In Proceedings of 20th IEEE international conference on tools with artificial intelligence (pp. 147–152).
Banse, R., & Scherer, K. R. (1996). Acoustic profiles in vocal emotion expression. Journal of Personality and Social Psychology, 70(3), 614–636.
Bisio, I., Delfino, A., Lavagetto, F., Marchese, M., & Sciarrone, A. (2013). Gender-driven emotion recognition through speech signals for ambient intelligence applications. IEEE Transactions on Emerging Topics in Computing, 1(2), 244–257.
Breazeal, C. (2001). Designing social robots. Cambridge, MA: MIT Press.
Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. & Weiss, B. (2005). A database of German emotional speech, In Proceedings of interspeech.
Burkhardt, F., & Sendlmeier, W. (2000). Verification of acoustical correlates of emotional speech using formant-synthesis. In Proceedings of the ISCA workshop on speech and emotion.
Cao, H., Verma, R., & Nenkova, A. (2015). Speaker-sensitive emotion recognition via ranking: Studies on acted and spontaneous speech. Computer Speech & Language, 29(1), 186–202.
Cen, L., Ser, W., & Yu, Z. L. (2008). Speech emotion recognition using canonical correlation analysis and probabilistic neural network, In Proceedings of 7th international conference on machine learning and applications (pp. 859–862).
Chiaverini, S., Siciliano, B., & Villani, L. (1999). A survey of robot interaction control scheme with experimental comparison. IEEE/ASME Transactions on Mechatronics, 4(3), 273–285.
de Cheveigné, A., & Kawahara, H. (2002). YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4), 1917–1930.
FirozShah, A., Vimal Krishnan, V. R., Raji Sukumar, A., Jayakumar, A., & Babu Anto, P. (2009). Speaker independent automatic emotion recognition from speech: A comparison of MFCCs and discrete wavelet transforms. In Proceedings of international conference on advances in recent technologies in communication and computing (pp. 528–531).
Fu, L., Mao, X., & Chen, L. (2008a). Relative speech emotion recognition based artificial neural network. In Proceedings of IEEE Pacific-Asia workshop on computational intelligence and industrial application (pp. 140–144).
Fu, L., Mao, X., & Chen, L. (2008b). Speaker independent emotion recognition using HMMs fusion system with relative features. In Proceedings of 1st international conference on intelligent networks and intelligent systems (pp. 608–611).
Giannakopoulos, T., Pikrakis, A., & Theodoridis, S. (2009). A dimensional approach to emotion recognition of speech from movies. In Proceedings of IEEE international conference on acoustics, speech and signal processing (pp. 65–68).
Iliou, T., & Anagnostopoulos, C. N. (2009). Comparison of different classifiers for emotion recognition. In Proceedings of panhellenic conference in informatics (pp. 102–106).
Kostoulas, T. P., & Fakotakis, N. (2006). A speaker dependent emotion recognition framework, CSNDSP. In Proceedings of 5th international symposium computers, systems, networks and digital signal processing (pp. 305–309).
Kostoulas, T., Ganchev, T., Lazaridis, A., & Fakotakis, N. (2010). Enhancing emotion recognition from speech through feature selection. In P. Sojka, A. Horák, I. Kopeček, & K. Pala (Eds.), International conference on text, speech and dialogue, Lecture Notes in Artificial Intelligence (Vol. 6231, pp. 338–344).
Loni, D. Y., & Subbaraman, S. (2014). Formant estimation of speech and singing voice by combining wavelet with LPC and Cepstrum techniques. In 9th international conference on industrial and information systems (ICIIS) IEEE (pp. 1–7).
Luengo, I., Navas, E., & Hernaez, I. (2010). Feature analysis and evaluation for automatic emotion identification in speech. IEEE Transactions on Multimedia, 12, 490–501.
Lugger, M., & Yang, B. (2007a). An incremental analysis of different feature groups in speaker independent emotion recognition. In Proceedings of international congress phonetic sciences (pp. 2149–2152).
Lugger, M., & Yang, B. (2007b). The relevance of voice quality features in speaker independent emotion recognition. In Proceedings of IEEE international conference on acoustics, speech and signal processing (pp. 17–20).
McGilloway, S., Cowie, R., Douglas-Cowie, E., Gielen, S., Westerdijk, M., & Stroeve, S. (2000). Approaching automatic recognition of emotion from voice: A rough benchmark. In Proceedings of ISCA workshop speech emotion in Belfast, UK (pp. 207–212).
Mishra, H. K., & Sekhar, C. C. (2009). Variational gaussian mixture models for speech emotion recognition. In Proceedings of 7th international conference on advances in pattern recognition (pp. 183–186).
Neiberg, D., Elenius, K., & Laskowski, K. (2006). Emotion recognition in spontaneous speech using GMMs. In Proceedings of INTERSPEECH conference (pp. 809–812).
Oudeyer, P. Y. (2003). The production and recognition of emotions in speech: Features and algorithms. International Journal of Human-Computer Studies, 59(1), 157–183.
Pantic, M., & Rothkrantz, L. J. (2003). Toward an affect-sensitive multimodal human-computer interaction. Proceedings of the IEEE, 91(9), 1370–1390.
Picard, R. (1997). Affective computing. Cambridge, MA: MIT Press.
Ramakrishnan, S. (2012). Recognition of emotion from speech: A review. In S. Ramakrishnan (Ed.), Speech enhancement, modeling and recognition: Algorithms and applications (pp. 121–138). London: IntechOpen.
Ross, M., Shaffer, H., Cohen, A., Freudberg, R., & Manley, H. (1974). Average magnitude difference function pitch extractor. IEEE Transactions on Acoustics, Speech, and Signal Processing, 22(5), 353–362.
Schuller, B., Muller, R., Lang, M., & Rigoll, G. (2005a). Speaker independent emotion recognition by early fusion of acoustic and linguistic features within ensembles. In Proceedings of 9th Eurospeech Interspeech (pp. 805–809).
Schuller, B., Batliner, A., Seppi, D., Steidl, S., Vogt, T., Wagner, J., et al. (2007). The relevance of feature type for the automatic classification of emotional user states: Low level descriptors and functionals. In Proceedings of interspeech (pp. 2253–2256).
Schuller, B., Müller, R., Eyben, F., Gast, J., Hörnler, B., Wöllmer, M., et al. (2009). Being bored? Recognising natural interest by extensive audiovisual integration for real-life application. Image and Vision Computing, 27, 1760–1774.
Schuller, B., Vlasenko, B., Eyben, F., Wöllmer, M., Stuhlsatz, A., Wendemuth, A., et al. (2010). Cross-corpus acoustic emotion recognition: Variances and strategies. IEEE Transactions on Affective Computing, 1, 119–131.
Shimamura, T., & Kobayashi, H. (2001). Weighted autocorrelation for pitch extraction of noisy speech. IEEE Transactions on Speech and Audio Processing, 9(7), 727–730.
Sidorova, J. (2007). Speech emotion recognition. Ph.D. Thesis, Universitat Pompeu Fabra, Barcelona.
Talkin, D. (1995). A robust algorithm for pitch tracking (RAPT). In W. B. Kleijn & K. K. Paliwal (Eds.), Speech coding and synthesis (pp. 495–518). Amsterdam: Elsevier.
Tan, L. N., & Alwan, A. (2013). Multi-band summary correlogram-based pitch detection for noisy speech. Speech Communication, 55(7), 841–856.
Tickle, A. (2000). English and Japanese speakers' emotion vocalizations and recognition: A comparison highlighting vowel quality. In ISCA workshop on speech and emotion, Belfast.
Tolkmitt, F. J., & Scherer, K. R. (1986). Effect of experimentally induced stress on vocal parameter. Journal of Experimental Psychology: Human Perception and Performance, 12(3), 302–313.
Williams, C. E., & Stevens, K. N. (1972). Emotions and speech: Some acoustical correlates. Journal of the Acoustical Society of America, 52(4), 1238–1250.
Wu, S., Falk, T.H., & Chan, W. Y. (2009). Automatic recognition of speech emotion using long-term spectro-temporal features. In Proceedings of 16th international conference on digital signal processing.
Wu, C. H., & Liang, W. B. (2011). Emotion recognition of affective speech based on multiple classifiers using acoustic-prosodic information and semantic labels. IEEE Transactions on Affective Computing, 2, 10–21.
Yang, C., Ji, L., & Liu, G. (2009a). Study to speech emotion recognition based on TWINsSVM. In Proceedings of 5th international conference on natural computation (pp. 312–316).
Yang, T., Yang, J., & Bi, F. (2009b). Emotion statuses recognition of speech signal using intuitionistic fuzzy set. In Proceedings of world congress on software engineering (pp. 204–207).
Yun, S., & Yoo, C. D. (2009). Speech emotion recognition via a max-margin framework incorporating a loss function based on the Watson and Tellegen's emotion model. In Proceedings of IEEE international conference on acoustics, speech and signal processing (pp. 4169–4172).
Gudmalwar, A.P., Rama Rao, C.V. & Dutta, A. Improving the performance of the speaker emotion recognition based on low dimension prosody features vector. Int J Speech Technol 22, 521–531 (2019). https://doi.org/10.1007/s10772-018-09576-4