Abstract
Evaluating a text-to-speech (TTS) system is typically labor-intensive and prone to bias, because synthesized speech has no ground-truth reference and no widely accepted objective metric. A reliable perceptual quality assessment method for TTS-synthesized speech is therefore highly desirable for improving TTS systems. In this paper, we introduce a deep-learning approach that predicts human-labeled perceptual quality scores of synthesized speech. The model combines ResNet and self-attention: the residual convolutional layers extract and integrate deep local features, while self-attention exploits the temporal relationships within the input sequence. Experimental results show that the proposed method outperforms state-of-the-art methods on the test tasks under several accuracy criteria, and that the predicted scores correlate strongly with the true (human) mean opinion scores (MOS).
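To make the described architecture concrete, the following is a minimal PyTorch sketch of a ResNet-plus-self-attention MOS predictor of the kind the abstract outlines. It is an illustrative assumption, not the authors' implementation: layer counts, channel widths, the 80-bin mel-spectrogram input, and the frame-averaging output head are all placeholder choices.

# Minimal sketch (not the paper's actual model): residual CNN features over
# mel-spectrogram frames, self-attention across time, one MOS score per clip.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        # Identity shortcut, as in He et al.'s ResNet.
        return self.act(x + self.body(x))

class MOSPredictor(nn.Module):
    def __init__(self, n_mels: int = 80, channels: int = 32, d_model: int = 128):
        super().__init__()
        self.stem = nn.Conv2d(1, channels, 3, padding=1)
        self.res = nn.Sequential(ResidualBlock(channels), ResidualBlock(channels))
        self.proj = nn.Linear(channels * n_mels, d_model)
        self.attn = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.head = nn.Linear(d_model, 1)

    def forward(self, mel):                # mel: (batch, frames, n_mels)
        x = mel.unsqueeze(1)               # (batch, 1, frames, n_mels)
        x = self.res(self.stem(x))         # deep local feature extraction
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x = self.attn(self.proj(x))        # self-attention across frames
        frame_scores = self.head(x)        # per-frame quality estimates
        return frame_scores.mean(dim=1).squeeze(-1)  # utterance-level MOS

# Usage: score a batch of 4 spectrograms, 200 frames each.
model = MOSPredictor()
mos = model(torch.randn(4, 200, 80))       # -> tensor of shape (4,)

Averaging per-frame scores into an utterance-level score follows the common practice of prior MOS predictors such as MOSNet; a trained model of this shape would be fit to human MOS labels with an L1 or L2 regression loss.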
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Chen, Z., Min, X. (2023). Perceptual Quality Assessment of TTS-Synthesized Speech. In: Zhai, G., Zhou, J., Yang, H., Yang, X., An, P., Wang, J. (eds) Digital Multimedia Communications. IFTC 2022. Communications in Computer and Information Science, vol 1766. Springer, Singapore. https://doi.org/10.1007/978-981-99-0856-1_31
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-0855-4
Online ISBN: 978-981-99-0856-1