Improving Speech Synthesis Quality for Voices Created from an Audiobook Database
This paper describes an approach to improving synthesized speech quality for voices created by using an audiobook database. The data consist of a large amount of read speech by one speaker, which we matched with the corresponding book texts. The main problems with such a database are the following. First, the recordings were made at different times under different acoustic conditions, and the speaker reads the text with a variety of intonations and accents, which leads to very high voice parameter variability. Second, automatic techniques for sound file labeling make more errors due to the large variability of the database, especially as there can be mismatches between the text and the corresponding sound files. These problems dramatically affect speech synthesis quality, so a robust method for solving them is vital for voices created using audiobooks. The approach described in the paper is based on statistical models of voice parameters and special algorithms of speech element concatenation and modification. Listening tests show that it strongly improves synthesized speech quality.
Keywords: speech synthesis, database quality control, hidden Markov models, Unit Selection, speech modification