Modelling F0 Dynamics in Unit Selection Based Speech Synthesis

  • Daniel Tihelka
  • Jindřich Matoušek
  • Zdeněk Hanzlíček
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8655)

Abstract

In common unit selection implementations, F0 continuity is measured as one of the concatenation cost features, with the expectation that a smooth transition between units (with regard to speech melody) is ensured when the difference in F0 is low enough. This measure generally uses a static F0 value computed at the unit boundary. In the present paper we show, however, that static F0 values are not sufficient for smooth concatenation of speech units, and that the dynamic nature of the F0 contour must be taken into account. Two schemes for handling F0 dynamics are presented, and speech generated by both schemes is compared by means of listening tests on specially selected phrases that are known to carry unnatural artefacts. The advantages and disadvantages of the individual schemes are also discussed.
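The contrast between a static and a dynamic F0 measure can be illustrated with a minimal sketch. It is not either of the two schemes presented in the paper, which the abstract does not specify. The function names, the 5 ms frame shift, the four-frame window, and the slope weight are all assumptions chosen for illustration. The example joins a rising contour to a falling one that meet at the same F0 value, so the static cost sees no discontinuity while the slope-aware cost does.

```python
def static_f0_cost(left_f0, right_f0):
    """Absolute F0 difference (Hz) at the join: last frame of the left unit
    versus first frame of the right unit."""
    return abs(float(left_f0[-1]) - float(right_f0[0]))


def f0_slope(f0, frame_shift_s):
    """Least-squares slope (Hz/s) of an F0 track sampled at a fixed frame shift.
    Requires at least two frames."""
    n = len(f0)
    times = [i * frame_shift_s for i in range(n)]
    t_mean = sum(times) / n
    f_mean = sum(f0) / n
    num = sum((t - t_mean) * (f - f_mean) for t, f in zip(times, f0))
    den = sum((t - t_mean) ** 2 for t in times)
    return num / den


def dynamic_f0_cost(left_f0, right_f0, frame_shift_s=0.005,
                    n_frames=4, slope_weight=0.01):
    """Static cost plus a weighted penalty on the mismatch of F0 slopes
    near the boundary. All parameter values are illustrative assumptions."""
    static = static_f0_cost(left_f0, right_f0)
    slope_left = f0_slope(left_f0[-n_frames:], frame_shift_s)
    slope_right = f0_slope(right_f0[:n_frames], frame_shift_s)
    return static + slope_weight * abs(slope_left - slope_right)


# Hypothetical F0 tracks (Hz): a rising unit joined to a falling unit.
left = [100.0, 105.0, 110.0, 115.0, 120.0]
right = [120.0, 115.0, 110.0, 105.0, 100.0]

static = static_f0_cost(left, right)
dynamic = dynamic_f0_cost(left, right)
print(f"static cost:  {static:.2f}")
print(f"dynamic cost: {dynamic:.2f}")
```

Here the boundary values match exactly, so the static cost is zero, yet the contour changes direction at the join. A cost that also compares local slopes penalises such a join, which is the kind of case the abstract argues a static measure misses.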

Keywords

text-to-speech synthesis, unit selection, concatenation cost, fundamental frequency, F0

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Daniel Tihelka (1)
  • Jindřich Matoušek (1)
  • Zdeněk Hanzlíček (1)
  1. Faculty of Applied Sciences, Dept. of Cybernetics, University of West Bohemia, Plzeň, Czech Republic
