Improving Speech Synthesis Quality for Voices Created from an Audiobook Database

  • Pavel Chistikov
  • Dmitriy Zakharov
  • Andrey Talanov
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8773)

Abstract

This paper describes an approach to improving synthesized speech quality for voices created from an audiobook database. The data consist of a large amount of read speech by a single speaker, which we matched with the corresponding book texts. Such a database poses two main problems. First, the recordings were made at different times under different acoustic conditions, and the speaker read the text with a variety of intonations and accents, which leads to very high variability of voice parameters. Second, automatic sound file labeling techniques make more errors because of this variability, especially since the text and the corresponding sound files may not match. These problems dramatically affect speech synthesis quality, so a robust method for solving them is vital for voices created from audiobooks. The approach described in the paper is based on statistical models of voice parameters and special algorithms for speech element concatenation and modification. Listening tests show that it substantially improves synthesized speech quality.
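
The abstract does not detail the statistical models used for database quality control. One common way to apply per-parameter statistics for this purpose is to flag units whose acoustic parameters (e.g. F0, duration, energy) fall far outside the per-phoneme distribution, treating them as likely labeling errors or atypical realizations to be excluded from the unit-selection inventory. The sketch below is only an illustration of that general idea, not the authors' algorithm; the function name, feature layout, and thresholds are assumptions made for the example.

    import numpy as np

    def flag_outlier_units(features, threshold=3.0):
        # features: dict mapping a phoneme label to an (N, D) array of per-unit
        # acoustic parameters (e.g. mean F0 in Hz, duration in s, energy in dB).
        # Returns a dict mapping the label to a boolean mask; True marks a unit
        # whose parameters lie far from the per-phoneme statistics and which is
        # therefore a candidate labeling error or atypical realization.
        flags = {}
        for phone, mat in features.items():
            mat = np.asarray(mat, dtype=float)
            mu = mat.mean(axis=0)
            sigma = mat.std(axis=0) + 1e-8      # avoid division by zero
            z = np.abs((mat - mu) / sigma)      # per-dimension z-scores
            flags[phone] = (z > threshold).any(axis=1)
        return flags

    # Toy example: three /a/ units; the third has an implausible F0 and duration,
    # so it is flagged. The threshold is deliberately low because population
    # z-scores are bounded for such a tiny sample; with a real database of many
    # units, a larger threshold (e.g. 3.0) would be used.
    units = {"a": [[120.0, 0.08, -22.0],
                   [118.0, 0.09, -21.5],
                   [310.0, 0.40, -10.0]]}
    print(flag_outlier_units(units, threshold=1.1))  # {'a': array([False, False,  True])}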

Keywords

speech synthesis, database quality control, hidden Markov models, Unit Selection, speech modification

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Pavel Chistikov (1, 2)
  • Dmitriy Zakharov (2)
  • Andrey Talanov (2)
  1. National Research University of Information Technologies, Mechanics and Optics, Saint Petersburg, Russia
  2. Speech Technology Center Ltd., Saint Petersburg, Russia
