Significance of Emotionally Significant Regions of Speech for Emotive to Neutral Conversion

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9468)


Most speech processing applications suffer a degradation in performance when operated in emotional environments, largely because of the mismatch between the development and operating environments. Model adaptation and feature adaptation schemes have been employed to adapt speech systems developed in neutral environments to emotional environments. In this study, only the anger emotion is considered among emotional environments, and signal-level conversion from anger to neutral speech is investigated. Emotion in human speech is concentrated over a small region of the entire utterance. The regions of speech that are most influenced by the emotive state of the speaker are considered the emotionally significant regions of an utterance. Physiological constraints of the human speech production mechanism are exploited to detect these emotionally significant regions. The variation of prosody parameters (pitch, duration, and energy) with their position in the sentence is analyzed to obtain modification factors. The speech signal in the emotionally significant regions is modified using the corresponding modification factors to generate a neutral version of the anger speech. Speech samples from the Indian Institute of Technology Kharagpur Simulated Emotion Speech Corpus (IITKGP-SESC) are used in this study, and a subjective listening test is performed to evaluate the effectiveness of the proposed conversion.
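The core idea of the conversion — scale prosody parameters only inside the detected emotionally significant regions, leaving the rest of the utterance untouched — can be sketched on frame-level contours. This is a minimal illustration, not the paper's method: the function name, the frame-level representation, and the factor values (0.8 for pitch, 0.7 for energy) are assumptions for illustration, and a real implementation would also modify duration (e.g., via PSOLA) and derive the factors from position-dependent prosody analysis.

```python
# Hypothetical sketch: attenuate pitch and energy toward neutral, but only
# in frames flagged as emotionally significant. Factor values and the
# frame-level contours below are illustrative assumptions.
def to_neutral(pitch_hz, energy, significant, pitch_factor=0.8, energy_factor=0.7):
    """Scale pitch and energy inside emotionally significant frames.

    pitch_hz, energy : per-frame contours (lists of floats)
    significant      : per-frame booleans marking emotionally significant frames
    """
    new_pitch, new_energy = [], []
    for f0, e, sig in zip(pitch_hz, energy, significant):
        if sig:
            # anger typically raises F0 and intensity; scale them back down
            new_pitch.append(f0 * pitch_factor)
            new_energy.append(e * energy_factor)
        else:
            # outside the significant region the signal is left unchanged
            new_pitch.append(f0)
            new_energy.append(e)
    return new_pitch, new_energy

# frames 2-3 flagged as emotionally significant (illustrative values)
pitch = [220.0, 240.0, 300.0, 310.0, 230.0]
energy = [1.0, 1.1, 2.0, 2.2, 1.0]
mask = [False, False, True, True, False]
neutral_pitch, neutral_energy = to_neutral(pitch, energy, mask)
```

After the contours are modified, the signal itself would be resynthesized from them; the selective masking is what distinguishes this scheme from uniform prosody modification over the whole utterance.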


Keywords: Emotionally significant regions · Emotion recognition · Automatic speech recognition · Physiological constraints · Emotional environments · Emotive to neutral conversion · Adaptation scheme


References

  1. Alku, P.: Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering. Speech Commun. 11(2), 109–118 (1992)
  2. Batliner, A., Steidl, S., Seppi, D., Schuller, B.: Segmenting into adequate units for automatic recognition of emotion-related episodes: a speech-based approach. Adv. Hum. Comput. Interact. 2010, 3 (2010)
  3. Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., Taylor, J.G.: Emotion recognition in human-computer interaction. IEEE Sig. Process. Mag. 18(1), 32–80 (2001)
  4. Gangamohan, P., Kadiri, S.R., Yegnanarayana, B.: Analysis of emotional speech at subsegmental level. In: INTERSPEECH, pp. 1916–1920 (2013)
  5. Hansen, J.H., Bou-Ghazale, S.E., Sarikaya, R., Pellom, B.: Getting started with SUSAS: a speech under simulated and actual stress database. In: Eurospeech, vol. 97, pp. 1743–1746 (1997)
  6. Hansen, J.H., Womack, B.D.: Feature analysis and neural network-based classification of speech under stress. IEEE Trans. Speech Audio Process. 4(4), 307–313 (1996)
  7. Kadiri, S.R., Gangamohan, P., Yegnanarayana, B.: Discriminating neutral and emotional speech using neural networks. In: ICON (2014)
  8. Koolagudi, S.G., Maity, S., Kumar, V.A., Chakrabarti, S., Rao, K.S.: IITKGP-SESC: speech database for emotion analysis. In: Ranka, S., Aluru, S., Buyya, R., Chung, Y.-C., Dua, S., Grama, A., Gupta, S.K.S., Kumar, R., Phoha, V.V. (eds.) IC3 2009. CCIS, vol. 40, pp. 485–492. Springer, Heidelberg (2009)
  9. Krothapalli, S.R., Yadav, J., Sarkar, S., Koolagudi, S.G., Vuppala, A.K.: Neural network based feature transformation for emotion independent speaker identification. Int. J. Speech Technol. 15(3), 335–349 (2012)
  10. Murty, K.S.R., Yegnanarayana, B.: Epoch extraction from speech signals. IEEE Trans. Audio Speech Lang. Process. 16(8), 1602–1613 (2008)
  11. Murty, K.: Significance of excitation source information for speech analysis. Ph.D. thesis, Department of Computer Science and Engineering, Indian Institute of Technology Madras (2009)
  12. Ortony, A., Clore, G.L., Collins, A.: The Cognitive Structure of Emotions. Cambridge University Press, Cambridge (1990)
  13. Raja, G.S., Dandapat, S.: Speaker recognition under stressed condition. Int. J. Speech Technol. 13(3), 141–161 (2010)
  14. Reynolds, D.A., Quatieri, T.F., Dunn, R.B.: Speaker verification using adapted Gaussian mixture models. Digit. Signal Process. 10(1), 19–41 (2000)
  15. Schuller, B., Stadermann, J., Rigoll, G.: Affect-robust speech recognition by dynamic emotional adaptation. In: Proceedings of Speech Prosody (2006)
  16. Stevens, K.N.: Acoustic Phonetics, vol. 30. MIT Press, Cambridge (2000)
  17. Tao, J., Kang, Y., Li, A.: Prosody conversion from neutral speech to emotional speech. IEEE Trans. Audio Speech Lang. Process. 14(4), 1145–1154 (2006)
  18. Valbret, H., Moulines, E., Tubach, J.P.: Voice transformation using PSOLA technique. In: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-92), vol. 1, pp. 145–148. IEEE (1992)
  19. Vlasenko, B., Philippou-Hübner, D., Prylipko, D., Böck, R., Siegert, I., Wendemuth, A.: Vowels formants analysis allows straightforward detection of high arousal emotions. In: 2011 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (2011)
  20. Vlasenko, B., Prylipko, D., Wendemuth, A.: Towards robust spontaneous speech recognition with emotional speech adapted acoustic models. In: Poster and Demo Track of the 35th German Conference on Artificial Intelligence (KI-2012), pp. 103–107. Saarbrücken, Germany (2012)
  21. Vlasenko, B., Wendemuth, A.: Location of an emotionally neutral region in valence-arousal space: two-class vs. three-class cross corpora emotion recognition evaluations. In: 2014 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (2014)
  22. Vuppala, A.K., Kadiri, S.R.: Neutral to anger speech conversion using non-uniform duration modification. In: 2014 9th International Conference on Industrial and Information Systems (ICIIS), pp. 1–4. IEEE (2014)
  23. Vuppala, A.K., Limmayya, J., Raghavendra, G.: Neutral speech to anger speech conversion using prosody modification. In: Prasath, R., Kathirvalavakumar, T. (eds.) MIKE 2013. LNCS, vol. 8284, pp. 383–390. Springer, Heidelberg (2013)
  24. Vydana, H.K., Kadiri, S.R., Vuppala, A.K.: Vowel-based non-uniform prosody modification for emotion conversion. Circuits Syst. Signal Process. 34, 1–21 (2015)
  25. Vydana, H.K., Kumar, P.P., Krishna, K., Vuppala, A.K.: Improved emotion recognition using GMM-UBMs. In: 2015 International Conference on Signal Processing and Communication Engineering Systems (SPACES), pp. 53–57. IEEE (2015)
  26. Yang, B., Lugger, M.: Emotion recognition from speech signals using new harmony features. Signal Process. 90(5), 1415–1423 (2010)

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Speech and Vision Lab, International Institute of Information Technology, Hyderabad, India
