A Systematic Study of Open Source and Commercial Text-to-Speech (TTS) Engines

Hosier, Jordan; Kalfen, Jordan; Sharma, Nikhita; Gurbani, Vijay K.

doi:10.1007/978-3-030-58323-1_34

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12284)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

Abstract

The widespread availability of open source and commercial text-to-speech (TTS) engines allows for the rapid creation of telephony services that require a TTS component. However, there exists neither a standard corpus nor a common set of metrics for objectively evaluating TTS engines. Listening tests are a prominent method of evaluation in a domain whose primary goal is to produce speech for human listeners. Nonetheless, subjective evaluation can be problematic and expensive. Objective evaluation metrics, such as word accuracy and contextual disambiguation (is “Dr.” rendered as Doctor or Drive?), have the benefit of being both inexpensive and unbiased. In this paper, we study seven TTS engines: four open source and three commercial. We systematically evaluate each TTS engine on two axes: (1) contextual word accuracy (including support for numbers, homographs, foreign words, acronyms, and directional abbreviations); and (2) naturalness (how natural the TTS sounds to human listeners). Our results indicate that commercial engines may have an edge over open source TTS engines.

Notes

1. https://mycroft.ai/documentation/mimic (last visit: April 23, 2020).
2. http://www.festvox.org/flite/ (last visit: April 23, 2020).
3. http://mary.dfki.de (last visit: April 23, 2020).
4. https://github.com/r9y9/deepvoice3_pytorch (last visit: April 23, 2020).
5. https://www.voicery.com (last visit: March 2020).
6. https://www.acapela-group.com/ (last visit: April 23, 2020).
7. http://speech.diotek.com/en/text-to-speech-demonstration.php (last visit: April 23, 2020).
8. https://aws.amazon.com/polly/ (last visit: February 2020).
9. https://www.ibm.com/Watson/services/text-to-speech/ (last visit: May 2019).
10. “When the sunlight strikes raindrops in the air, they act like a prism and form a rainbow. The rainbow is a division of white light into many beautiful colors. These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon”.

Author information

Authors and Affiliations

Vail Systems, Inc., Chicago, USA
Jordan Hosier, Nikhita Sharma & Vijay K. Gurbani
Illinois Institute of Technology, Chicago, USA
Jordan Kalfen & Vijay K. Gurbani

Corresponding author

Correspondence to Vijay K. Gurbani.

Editor information

Editors and Affiliations

Faculty of Informatics, Masaryk University, Brno, Czech Republic
Petr Sojka, Ivan Kopeček, Karel Pala & Aleš Horák

Appendices

Appendix A: Evaluation of TTS Engines on Our Corpus

URL: http://www.cs.iit.edu/~vgurbani/tsd2020/appendix-a.pdf

SHA-1 Hash: b14f7632306c2c9aa4154882d97c1c829ee48224

Appendix B: Survey Answers by Participants

URL: http://www.cs.iit.edu/~vgurbani/tsd2020/appendix-b.pdf

SHA-1 Hash: f92c24fd84c35ee0be210801122deccf17ab0818

Appendix C: Rendering of “The Rainbow Passage”

URL: http://www.cs.iit.edu/~vgurbani/tsd2020/tsd-paper1023.zip

SHA-1 Hash: 8ef25f33b2f95300abb1e3200d0d7cc9ead856e8

About this paper

Cite this paper

Hosier, J., Kalfen, J., Sharma, N., Gurbani, V.K. (2020). A Systematic Study of Open Source and Commercial Text-to-Speech (TTS) Engines. In: Sojka, P., Kopeček, I., Pala, K., Horák, A. (eds) Text, Speech, and Dialogue. TSD 2020. Lecture Notes in Computer Science, vol 12284. Springer, Cham. https://doi.org/10.1007/978-3-030-58323-1_34

DOI: https://doi.org/10.1007/978-3-030-58323-1_34
Published: 01 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58322-4
Online ISBN: 978-3-030-58323-1
eBook Packages: Computer Science, Computer Science (R0)