Abstract
Evaluating a text-to-speech (TTS) system is typically labor-intensive and prone to bias, because synthesized speech has no ground-truth reference and no widely accepted objective metric. A reliable perceptual quality assessment method for TTS-synthesized speech is therefore highly desirable for improving TTS systems. In this paper, we introduce a deep-learning approach that predicts human-labeled perceptual quality scores of synthesized speech. The model combines ResNet and self-attention: the residual convolutional layers extract and integrate deep local features, while self-attention exploits the temporal relationships within the input sequence. Experimental results show that the proposed method outperforms state-of-the-art methods on the test tasks under several accuracy criteria, and that the predicted scores correlate strongly with the true (human) mean opinion scores (MOS).
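To make the described architecture concrete, the following is a minimal PyTorch sketch of a ResNet-plus-self-attention MOS predictor of the kind the abstract outlines. It is an illustrative assumption, not the authors' implementation: layer counts, channel widths, the 80-bin mel-spectrogram input, and the frame-averaging output head are all placeholder choices.

# Minimal sketch (not the paper's actual model): residual CNN features over
# mel-spectrogram frames, self-attention across time, one MOS score per clip.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        # Identity shortcut, as in He et al.'s ResNet.
        return self.act(x + self.body(x))

class MOSPredictor(nn.Module):
    def __init__(self, n_mels: int = 80, channels: int = 32, d_model: int = 128):
        super().__init__()
        self.stem = nn.Conv2d(1, channels, 3, padding=1)
        self.res = nn.Sequential(ResidualBlock(channels), ResidualBlock(channels))
        self.proj = nn.Linear(channels * n_mels, d_model)
        self.attn = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.head = nn.Linear(d_model, 1)

    def forward(self, mel):                # mel: (batch, frames, n_mels)
        x = mel.unsqueeze(1)               # (batch, 1, frames, n_mels)
        x = self.res(self.stem(x))         # deep local feature extraction
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x = self.attn(self.proj(x))        # self-attention across frames
        frame_scores = self.head(x)        # per-frame quality estimates
        return frame_scores.mean(dim=1).squeeze(-1)  # utterance-level MOS

# Usage: score a batch of 4 spectrograms, 200 frames each.
model = MOSPredictor()
mos = model(torch.randn(4, 200, 80))       # -> tensor of shape (4,)

Averaging per-frame scores into an utterance-level score follows the common practice of prior MOS predictors such as MOSNet; a trained model of this shape would be fit to human MOS labels with an L1 or L2 regression loss.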
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Chen, Z., Min, X. (2023). Perceptual Quality Assessment of TTS-Synthesized Speech. In: Zhai, G., Zhou, J., Yang, H., Yang, X., An, P., Wang, J. (eds) Digital Multimedia Communications. IFTC 2022. Communications in Computer and Information Science, vol 1766. Springer, Singapore. https://doi.org/10.1007/978-981-99-0856-1_31
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-0855-4
Online ISBN: 978-981-99-0856-1