Quality Improvements of Zero-Concatenation-Cost Chain Based Unit Selection

  • Jiří Kala
  • Jindřich Matoušek
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8773)


In our previous work, we introduced a zero-concatenation-cost (ZCC) chain based framework of unit-selection speech synthesis. This framework proved to be very fast, as it reduced the computational load of a unit-selection system by up to hundreds of times. Since the ZCC chain based algorithm principally prefers to select longer segments of speech, an increased number of audible artifacts was expected to occur at the concatenation points of longer ZCC chains. Indeed, listening tests revealed a number of artifacts in synthetic speech; however, the artifacts occurred to a similar extent in synthetic speech produced by both the ZCC chain based and the standard Viterbi search algorithms. In this paper, we focus on the sources of the artifacts and propose improvements to synthetic speech quality within the ZCC algorithm. The quality and computational demands of the improved ZCC algorithm are compared to those of the unit-selection algorithm based on the standard Viterbi search.
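The idea of a zero-concatenation-cost chain can be illustrated with a minimal sketch. It is not the authors' ZCC algorithm, which the abstract does not describe; it is a plain Viterbi search over candidate units in which the join cost is assumed to be zero whenever two units were adjacent in the source corpus. The `Unit` fields, the toy F0-based join cost, and the function names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    corpus_index: int   # position of the unit in the source recording (assumed field)
    target_cost: float  # mismatch against the requested specification
    f0: float           # pitch at the unit boundary, used by the toy join cost


def concat_cost(prev: Unit, cur: Unit) -> float:
    """Toy join cost: zero for corpus-adjacent units, otherwise the F0 mismatch."""
    if cur.corpus_index == prev.corpus_index + 1:
        return 0.0
    return abs(prev.f0 - cur.f0)


def viterbi_select(candidates: list[list[Unit]]) -> tuple[float, list[Unit]]:
    """Return (total cost, units) of the cheapest path, one unit per position."""
    # Each entry of best is (cost of the cheapest path ending in a candidate, that path).
    best = [(u.target_cost, [u]) for u in candidates[0]]
    for column in candidates[1:]:
        new_best = []
        for cur in column:
            cost, path = min(
                ((c + concat_cost(p[-1], cur), p) for c, p in best),
                key=lambda item: item[0],
            )
            new_best.append((cost + cur.target_cost, path + [cur]))
        best = new_best
    return min(best, key=lambda item: item[0])


def zcc_chains(path: list[Unit]) -> list[list[Unit]]:
    """Split a selected path into maximal runs of corpus-adjacent units."""
    chains = [[path[0]]]
    for prev, cur in zip(path, path[1:]):
        if cur.corpus_index == prev.corpus_index + 1:
            chains[-1].append(cur)   # zero join cost: extend the current chain
        else:
            chains.append([cur])     # audible join possible: start a new chain
    return chains


if __name__ == "__main__":
    candidates = [
        [Unit(10, 1.0, 100.0), Unit(50, 0.5, 120.0)],
        [Unit(11, 1.0, 100.0), Unit(80, 0.5, 150.0)],
        [Unit(12, 1.0, 100.0), Unit(81, 0.9, 150.0)],
    ]
    cost, path = viterbi_select(candidates)
    print(cost, [u.corpus_index for u in path], len(zcc_chains(path)))
```

In this toy setting the search picks the run of three corpus-adjacent units even though each has a higher target cost than its competitor, which mirrors the preference for longer segments mentioned above. Joins can then occur only at chain boundaries.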


speech synthesis, unit selection, Viterbi algorithm, zero-concatenation-cost chain, duration, F0





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Jiří Kala (1)
  • Jindřich Matoušek (1)
  1. Dept. of Cybernetics, Faculty of Applied Sciences, University of West Bohemia, Czech Rep.
