Circuits, Systems, and Signal Processing, Volume 35, Issue 1, pp. 139–162

Prosodic Mapping Using Neural Networks for Emotion Conversion in Hindi Language

  • Jainath Yadav
  • K. Sreenivasa Rao


An emotion comprises several components, such as physiological changes in the body, subjective feelings, and expressive behaviors. In speech, emotional changes are observed mainly in prosodic parameters such as pitch, duration, and energy. Hindi is largely syllabic in nature, and syllables are the most suitable basic units for the analysis and synthesis of its speech. Therefore, a vowel onset point detection method is used to segment each speech utterance into syllable-like units. In this work, prosodic parameters are modified using instants of significant excitation (epochs), which are detected with a zero-frequency filtering-based method. In voiced speech, epoch locations correspond to instants of glottal closure; in unvoiced regions, they correspond to random instants of significant excitation. Anger, happiness, and sadness are considered as target emotions in the proposed emotion conversion framework. Feedforward neural network models are explored for mapping the prosodic parameters between neutral and target emotions. The predicted prosodic parameters of the target emotion are incorporated into neutral speech at the syllable level to produce the desired emotional speech. After the emotion-specific prosody is incorporated, the perceptual quality of the transformed speech is evaluated through subjective listening tests.
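The zero-frequency filtering idea used for epoch detection can be illustrated with a minimal sketch. This is not the paper's implementation; it follows the commonly described recipe (cascaded zero-frequency resonators realized as double cumulative summation, followed by repeated local-mean removal), and the 10 ms trend-removal window is an assumed value standing in for the average pitch period:

```python
import numpy as np

def zero_frequency_filter(x, fs, win_ms=10):
    """Sketch of zero-frequency filtering; win_ms approximates one pitch period."""
    # Difference the signal to suppress any DC / low-frequency bias
    d = np.diff(x, prepend=x[0])
    # Two cascaded ideal zero-frequency resonators == double cumulative sum
    y = np.cumsum(np.cumsum(d))
    # Remove the slowly varying polynomial trend twice with a moving average
    n = int(fs * win_ms / 1000)
    kernel = np.ones(2 * n + 1) / (2 * n + 1)
    for _ in range(2):
        y = y - np.convolve(y, kernel, mode="same")
    return y

def detect_epochs(x, fs):
    """Epochs: negative-to-positive zero crossings of the ZFF signal."""
    z = zero_frequency_filter(x, fs)
    return np.nonzero((z[:-1] < 0) & (z[1:] >= 0))[0] + 1
```

Running `detect_epochs` on a 100 Hz impulse train sampled at 8 kHz recovers crossings spaced about 80 samples apart in the interior of the signal, mirroring how epochs track the periodic glottal excitation in voiced speech.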


Keywords: Feedforward neural network (FFNN) · Emotion conversion · Prosody parameters · Instants of significant excitation (epochs) · Vowel onset point · Objective measures



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. School of Information Technology, Indian Institute of Technology Kharagpur, India
