Interactive Intonation Optimisation Using CMA-ES and DCT Parameterisation of the F0 Contour for Speech Synthesis

  • Adriana Stan
  • Florin-Claudiu Pop
  • Marcel Cremene
  • Mircea Giurgiu
  • Denis Pallez
Part of the Studies in Computational Intelligence book series (SCI, volume 387)


Expressive speech is one of the latest concerns of text-to-speech systems. Because the realisation of expression and emotion in speech is subjective, humans cannot objectively determine whether one system is more expressive than another. Most text-to-speech systems have a rather flat intonation and do not provide the option of changing the output speech. We therefore present an interactive intonation optimisation method based on pitch contour parameterisation and evolution strategies. The Discrete Cosine Transform (DCT) is applied to the phrase-level pitch contour. The genome is then encoded as a vector containing the 7 most significant DCT coefficients. Starting from this initial individual, new speech samples are obtained using an interactive Covariance Matrix Adaptation Evolution Strategy (CMA-ES) algorithm. We evaluate a series of parameters involved in the process, such as the initial standard deviation, the population size, the dynamic expansion of the pitch over the generations, and the naturalness and expressivity of the resulting individuals. The results were evaluated on a Romanian parametric speech synthesiser and provide guidelines for setting up an interactive optimisation system in which users can subjectively select the individual that best suits their expectations with a minimum amount of fatigue.
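The encoding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the "7 most significant" coefficients are the 7 lowest-order DCT-II coefficients, it uses a synthetic F0 contour in place of synthesiser output, and it draws candidates from an isotropic Gaussian around the initial genome, which corresponds only to the sampling step of a first CMA-ES generation (no covariance or step-size adaptation, no user selection). The function names and the values of `sigma` and `pop_size` are hypothetical.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n (its transpose is its inverse)."""
    k = np.arange(n)[:, None]  # coefficient index
    t = np.arange(n)[None, :]  # sample (frame) index
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)  # DC row uses a different scale factor
    return m


def encode_genome(f0, n_coeffs=7):
    """Return the first n_coeffs DCT coefficients of a phrase-level F0 contour.

    Assumption: 'most significant' is taken to mean 'lowest order'.
    """
    return (dct_matrix(len(f0)) @ f0)[:n_coeffs]


def decode_genome(genome, n_frames):
    """Rebuild an n_frames-long F0 contour from a truncated coefficient vector."""
    coeffs = np.zeros(n_frames)
    coeffs[: len(genome)] = genome
    return dct_matrix(n_frames).T @ coeffs  # inverse DCT


def sample_population(genome, sigma, pop_size, rng):
    """Draw pop_size candidate genomes from N(genome, sigma^2 * I)."""
    return genome + sigma * rng.standard_normal((pop_size, len(genome)))


# Synthetic contour in Hz (illustrative only, not data from the chapter).
n_frames = 200
t = np.linspace(0.0, 1.0, n_frames)
f0 = 120.0 + 20.0 * np.sin(2 * np.pi * t) - 15.0 * t

genome = encode_genome(f0)               # 7-dimensional initial individual
smoothed = decode_genome(genome, n_frames)

rng = np.random.default_rng(0)
population = sample_population(genome, sigma=10.0, pop_size=4, rng=rng)
contours = np.array([decode_genome(g, n_frames) for g in population])
```

Each row of `contours` is a candidate pitch contour that would be resynthesised and played to the listener. In an interactive setting, the listener's choice takes the place of a numerical fitness function.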


Discrete Cosine Transform; Mean Opinion Score; Speech Synthesis; Speech Sample; Inverse Discrete Cosine Transform
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Adriana Stan (1)
  • Florin-Claudiu Pop (1)
  • Marcel Cremene (1)
  • Mircea Giurgiu (1)
  • Denis Pallez (2)
  1. Communications Department, Technical University of Cluj-Napoca, Cluj, Romania
  2. Laboratoire d’Informatique, Signaux, et Systèmes de Sophia-Antipolis (I3S), Université de Nice Sophia-Antipolis, France
