
Modification of energy spectra, epoch parameters and prosody for emotion conversion in speech

Published in: International Journal of Speech Technology

Abstract

This work attempts to convert a given neutral speech utterance to a target emotional style using signal processing techniques. Sadness and anger are the target emotions considered in this study. For emotion conversion, we propose signal processing methods to process neutral speech in three ways: (i) modifying the energy spectra, (ii) modifying the source features, and (iii) modifying the prosodic features. Energy spectra of different emotions are analyzed, and a method is proposed to modify the energy spectra of neutral speech after dividing the speech into different frequency bands. For the source part, epoch strength and epoch sharpness are studied extensively, and a new method is proposed for modifying and incorporating these epoch parameters using appropriate modification factors. Prosodic features such as the pitch contour and intensity are also modified. New pitch contours corresponding to the target emotions are derived from the pitch contours of the neutral test utterances and incorporated into them. Intensity modification is done by dividing each neutral utterance into three equal segments and scaling the intensity of each segment separately, according to modification factors suitable for the target emotion. Subjective evaluation using mean opinion scores has been carried out to assess the quality of the converted emotional speech. Although the modified speech does not completely resemble the target emotion, these subjective tests demonstrate the potential of the proposed methods to change the style of the speech.
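Of the three modifications described above, the segment-wise intensity modification is the most directly expressible in code: the utterance is split into three equal-length segments and each segment is scaled by its own gain. The sketch below illustrates this idea; the gain factors shown are illustrative placeholders, not the modification factors derived in the paper.

```python
import numpy as np

def modify_intensity(speech, factors=(1.2, 1.0, 0.8)):
    """Segment-wise intensity modification of a neutral utterance.

    Splits the signal into three equal-length segments and multiplies
    each segment by its own gain factor, as in the intensity-modification
    step described in the abstract. The default factors are hypothetical
    examples, not the paper's emotion-specific values.
    """
    speech = np.asarray(speech, dtype=float)
    n = len(speech)
    # Boundaries of the three (approximately) equal segments.
    bounds = [0, n // 3, 2 * n // 3, n]
    out = np.empty_like(speech)
    for i, gain in enumerate(factors):
        seg = slice(bounds[i], bounds[i + 1])
        out[seg] = speech[seg] * gain
    return out
```

In practice the per-segment gains would be chosen per target emotion (e.g. raising intensity for anger, lowering it for sadness); a smooth cross-fade at the segment boundaries would avoid audible amplitude discontinuities, a detail this sketch omits.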



Author information

Correspondence to Arijul Haque.


Cite this article

Haque, A., Rao, K.S. Modification of energy spectra, epoch parameters and prosody for emotion conversion in speech. Int J Speech Technol 20, 15–25 (2017). https://doi.org/10.1007/s10772-016-9386-9
