Nonlinear Emotional Prosody Generation and Annotation

  • Jianhua Tao
  • Jian Yu
  • Yongguo Kang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4274)


Emotion is an important element in expressive speech synthesis. This paper briefly analyzes prosody parameters, stress, rhythm, and paralinguistic information in different kinds of emotional speech, and labels the speech with rich, multi-layer annotation. A CART model is then used to generate emotional prosody. Unlike the traditional linear modification method, which directly modifies F0 contours and syllable durations based on acoustic distributions of emotional speech (such as F0 topline, F0 baseline, duration, and intensity), the CART model maps the subtle prosody differences between neutral and emotional speech using various kinds of context information. Experiments show that, with the CART model, traditional context information can already generate good emotional prosody; the results improve further when richer information, such as stress, breaks, and jitter, is integrated into the context features.
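The neutral-to-emotional mapping described above can be sketched with an off-the-shelf CART regressor. This is an illustrative toy, not the authors' implementation: the feature set (neutral F0, neutral duration, stress flag, break index) and the synthetic training data are invented for demonstration.

```python
# Hedged sketch of a CART-based neutral-to-emotional prosody mapping.
# All feature names and data here are hypothetical, for illustration only.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical per-syllable context features:
#   [neutral mean F0 (Hz), neutral duration (s), stress flag, break index]
X = rng.uniform([100.0, 0.05, 0.0, 0.0],
                [300.0, 0.40, 1.0, 3.0], size=(200, 4))

# Synthetic "emotional" targets: raise F0 and lengthen duration,
# more so for stressed syllables (a stand-in for real annotated data).
y = np.column_stack([
    X[:, 0] * (1.10 + 0.05 * X[:, 2]),  # emotional F0
    X[:, 1] * (1.00 + 0.20 * X[:, 2]),  # emotional duration
])

# Fit the CART model on the (context -> emotional prosody) pairs.
cart = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X, y)

# Predict emotional F0 and duration for one stressed syllable.
f0, dur = cart.predict([[200.0, 0.20, 1.0, 2.0]])[0]
print(f"predicted emotional F0: {f0:.1f} Hz, duration: {dur:.3f} s")
```

In the paper's setting the targets would come from parallel neutral/emotional recordings, and the context would include the richer stress, break, and jitter annotations discussed in the abstract.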


Keywords: Speech Synthesis · Pitch Contour · Emotional Speech · Emotional Prosody · Pitch Target





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Jianhua Tao (1)
  • Jian Yu (1)
  • Yongguo Kang (1)
  1. National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing
