Circuits, Systems, and Signal Processing, Volume 37, Issue 5, pp 2179–2193

STRAIGHT-Based Emotion Conversion Using Quadratic Multivariate Polynomial

  • Jang Bahadur Singh
  • Parveen Lehana
Short Paper


Speech is the natural mode of communication and the easiest way of expressing human emotions. Emotional speech is characterized by features such as the f0 contour, intensity, speaking rate, and voice quality; collectively, these features are called prosody. Prosody is generally modified by pitch and time scaling. Unlike voice conversion, where spectral conversion is the main concern, emotional speech conversion is more sensitive to prosody. Several techniques, both linear and nonlinear, have been used for transforming speech. Our hypothesis is that the quality of emotional speech conversion can be improved by estimating a nonlinear relationship between the neutral and emotional speech feature vectors. In this work, a quadratic multivariate polynomial (QMP) is explored for transforming neutral speech to emotional target speech. Both subjective and objective analyses were carried out to evaluate the transformed emotional speech, using comparison mean opinion scores (CMOS), mean opinion scores (MOS), identification rate, root-mean-square error, and Mahalanobis distance. For the Toronto emotional database, the CMOS analysis indicates that, except for neutral-to-sad conversion, the transformed speech can partly be perceived as the target emotion; moreover, the MOS values and spectrograms indicate good quality of the transformed speech. For the German database, except for neutral-to-boredom conversion, the CMOS score of the proposed technique is better than those of the gross and initial–middle–final methods but lower than that of the syllable method. Nevertheless, the QMP technique is simple, easy to implement, yields good-quality transformed speech, and estimates the transformation function from a limited number of training utterances.
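The abstract describes mapping neutral feature vectors to emotional target feature vectors through a quadratic multivariate polynomial. The paper's exact feature set and training procedure are not given here, so the following is only a minimal sketch of QMP regression, assuming the neutral and emotional feature sequences have already been paired and time-aligned (e.g., via DTW); all function names are illustrative, not from the paper:

```python
import numpy as np

def quadratic_features(X):
    """Expand each row x into [1, x_i, x_i * x_j (i <= j)] quadratic terms."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    cols = [np.ones((n, 1)), X]
    for i in range(d):
        for j in range(i, d):
            cols.append((X[:, i] * X[:, j])[:, None])
    return np.hstack(cols)

def fit_qmp(X_neutral, Y_emotional):
    """Least-squares polynomial weights W such that Phi(X) @ W ~= Y."""
    Phi = quadratic_features(X_neutral)
    W, *_ = np.linalg.lstsq(Phi, np.asarray(Y_emotional, dtype=float), rcond=None)
    return W

def apply_qmp(X_neutral, W):
    """Transform neutral feature vectors with the trained polynomial."""
    return quadratic_features(X_neutral) @ W
```

Because the mapping is linear in the polynomial coefficients, training reduces to one least-squares solve, which is consistent with the abstract's claim that the transformation function can be estimated from a limited number of training utterances.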


Keywords: Emotion conversion, STRAIGHT, DTW, Mahalanobis distance, Quadratic multivariate polynomial
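Among the objective measures listed (root-mean-square error and Mahalanobis distance), both reduce to short computations. A minimal sketch, assuming aligned feature sequences and an invertible covariance estimate (function names are illustrative, not from the paper):

```python
import numpy as np

def rmse(a, b):
    """Root-mean-square error between two aligned feature sequences."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

def mahalanobis(x, mu, cov):
    """Mahalanobis distance of vector x from a distribution (mu, cov)."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
```

With an identity covariance the Mahalanobis distance reduces to the Euclidean distance, which is a quick sanity check for the implementation.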



Acknowledgments

The authors would like to thank Prof. Hideki Kawahara, Wakayama University, for his assistance with STRAIGHT.



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. D.S.P. Lab, Department of Electronics, University of Jammu, Jammu, India
