Abstract
This work attempts to convert a given neutral speech utterance to a target emotional style using signal processing techniques; sadness and anger are the emotions considered. For emotion conversion, we propose signal processing methods that process neutral speech in three ways: (i) modifying the energy spectra, (ii) modifying the source features, and (iii) modifying the prosodic features. The energy spectra of different emotions are analyzed, and a method is proposed to modify the energy spectra of neutral speech after dividing it into frequency bands. For the source features, epoch strength and epoch sharpness are studied extensively, and a new method is proposed to modify and incorporate these parameters using appropriate modification factors. Prosodic features such as pitch contour and intensity are also modified. New pitch contours corresponding to the target emotions are derived from the pitch contours of the neutral test utterances and incorporated into those utterances. Intensity is modified by dividing each neutral utterance into three equal segments and scaling the intensity of each segment separately, according to modification factors suited to the target emotion. Subjective evaluation using mean opinion scores was carried out to assess the quality of the converted emotional speech. Although the modified speech does not fully resemble the target emotion, these subjective tests demonstrate the potential of the methods to change the speaking style.
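The segment-wise intensity modification described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the utterance is split into three equal segments and each is scaled by a per-segment gain. The gain values shown are illustrative assumptions, not the modification factors derived in the study.

```python
import numpy as np

def modify_intensity(signal, gains):
    """Scale three equal segments of a neutral utterance by per-segment
    modification factors (initial, middle, final). Sketch only."""
    assert len(gains) == 3, "one gain per segment"
    n = len(signal)
    bounds = [0, n // 3, 2 * n // 3, n]  # segment boundaries
    out = np.empty(n, dtype=float)
    for g, (lo, hi) in zip(gains, zip(bounds, bounds[1:])):
        out[lo:hi] = g * np.asarray(signal[lo:hi], dtype=float)
    return out

# Example: boost the final third, as might suit an angry target style
# (hypothetical gains chosen for illustration).
neutral = np.ones(9)
converted = modify_intensity(neutral, gains=(1.0, 1.1, 1.3))
```

In practice the gains would be estimated from an analysis of emotional versus neutral utterances, and the scaling would typically be applied to frame-level energy rather than raw samples.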
Cite this article
Haque, A., Rao, K.S. Modification of energy spectra, epoch parameters and prosody for emotion conversion in speech. Int J Speech Technol 20, 15–25 (2017). https://doi.org/10.1007/s10772-016-9386-9