
Transformation of Emotion by Modifying Prosody and Spectral Energy Using Discrete Wavelet Transform

Published in Wireless Personal Communications (2023)

Abstract

Speech is one of the most natural modes of human communication, and emotion is its hallmark, playing a dominant role in conveying feelings. Transformation of emotion in human speech is an active area of research owing to its versatile applications in Human–Computer Interaction (HCI). The objective of this work is to transform a neutral speech utterance into a target emotional utterance by modifying the prosody (duration, intensity and pitch) and the spectral energy of the neutral utterance. The target emotions considered in this work are sadness, happiness, anger and fear. The transformation is carried out in two steps: (i) modifying the prosody and (ii) modifying the spectral energy. This paper proposes a novel technique, the Gaussian Regression Model (GRM), to modify the prosody of the utterances, and applies the Discrete Wavelet Transform (DWT) to modify the spectral energy and thereby enhance their expressiveness. The algorithm was developed after a thorough analysis of the energy spectra of neutral and target emotional speech. To benchmark the proposed algorithm, this work also applies two existing techniques, the Linear Modification Model (LMM) and the Gaussian Normalization Model (GNM), to Kannada speech utterances. The Kannada Emotional Speech (KES) database, comprising 1800 sentences, was used for analysis and transformation. The expressiveness of the converted emotional speech is evaluated using objective and subjective tests. The GRM and the spectral energy modification method are promising, and the results show a significant increase in Emotion Recognition Rate (ERR) and Mean Opinion Score (MOS).
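To make the second step concrete, the sketch below shows one way to scale DWT sub-band energies of a neutral utterance and reconstruct the waveform. It is a minimal Python sketch using PyWavelets, not the authors' implementation: the function name, the db4 wavelet, the 4-level decomposition and the gain values are all illustrative assumptions; the paper derives its actual scaling factors from analysing neutral versus target emotional energy spectra, and the GRM prosody step is not reproduced here.

```python
import numpy as np
import pywt  # PyWavelets

def modify_spectral_energy(signal, wavelet="db4", level=4, gains=None):
    """Scale DWT sub-band energies with per-band gain factors.

    `gains` holds one multiplier per coefficient array returned by
    pywt.wavedec: the approximation band first, then detail bands from
    coarsest to finest. Names and values here are illustrative
    assumptions, not the factors reported in the paper.
    """
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    if gains is None:
        gains = [1.0] * len(coeffs)  # identity: leave energy unchanged
    assert len(gains) == len(coeffs), "need one gain per sub-band"
    scaled = [g * c for g, c in zip(gains, coeffs)]
    # Inverse DWT; waverec may pad by one sample, so trim to input length.
    return pywt.waverec(scaled, wavelet)[: len(signal)]

# Hypothetical usage: boost the finer detail bands, as one might for a
# high-arousal target emotion; in practice the gains would be learned
# from neutral vs. emotional spectra as the paper describes.
rng = np.random.default_rng(0)
neutral = rng.standard_normal(16000)  # stand-in for a 1 s utterance at 16 kHz
converted = modify_spectral_energy(neutral, gains=[0.9, 1.0, 1.1, 1.3, 1.4])
```

Scaling the detail bands raises or lowers energy in the corresponding frequency ranges, which is the mechanism by which a DWT-based method can move a neutral spectrum toward a target emotional one.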



Data Availability

The KES database created and analysed during this study is available from the corresponding author on reasonable request.


Author information


Contributions

All authors contributed to the conception and design of the study. Data collection, algorithm design and analysis were performed by Geethashree A. The algorithm was implemented by Geethashree A and Alfred Vivek D’Souza under the supervision of D. J. Ravi. The first draft of the manuscript was written by Geethashree A, and all authors commented on previous versions of the manuscript. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to A. Geethashree.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Human and Animal Rights

This study did not involve any animal participation.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Geethashree, A., D’Souza, A.V. & Ravi, D.J. Transformation of Emotion by Modifying Prosody and Spectral Energy Using Discrete Wavelet Transform. Wireless Pers Commun 133, 771–794 (2023). https://doi.org/10.1007/s11277-023-10790-w

