Perceptual Quality Assessment of TTS-Synthesized Speech

  • Conference paper
Digital Multimedia Communications (IFTC 2022)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1766)

Abstract

Evaluating a Text-to-Speech (TTS) system is typically labor-intensive and prone to bias, because there is neither a gold standard for the generated speech nor an established objective evaluation metric. To improve TTS systems, it is therefore desirable to study the perceptual quality of TTS-synthesized speech and to propose a reasonably valid evaluation method. In this paper, we introduce a deep-learning-based approach that predicts human-labeled perceptual quality scores for generated speech. The approach combines ResNet and self-attention: the former handles deep feature extraction and integration, and the latter exploits the natural relationships within the input sequence. Experimental results indicate that the proposed method outperforms state-of-the-art methods on the test tasks across several accuracy criteria. The experiments also show a strong correlation between the predicted scores and the true (human) mean opinion scores (MOS).
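
The abstract does not name the specific evaluation criteria, so the following is only a minimal sketch of how agreement between predicted and human MOS is commonly measured. It assumes three criteria that are widespread in MOS-prediction work (linear correlation, rank correlation, and mean squared error); these may differ from the ones used in the paper. The function names and the score values are hypothetical and chosen purely for illustration.

```python
import math


def pearson(xs, ys):
    """Linear (Pearson) correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Rank (Spearman) correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def mse(xs, ys):
    """Mean squared error between predictions and targets."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up scores for illustration only; not data from the paper.
human_mos = [4.2, 3.1, 2.5, 4.8, 3.9]
predicted_mos = [4.0, 3.3, 2.7, 4.5, 3.6]

print("Pearson: ", pearson(predicted_mos, human_mos))
print("Spearman:", spearman(predicted_mos, human_mos))
print("MSE:     ", mse(predicted_mos, human_mos))
```

The correlation measures describe how well the predictor tracks or orders the human ratings, while the error measure captures how far individual predictions deviate, so a predictor can score well on one and poorly on the other.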

Author information

Corresponding author

Correspondence to Zidong Chen.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Chen, Z., Min, X. (2023). Perceptual Quality Assessment of TTS-Synthesized Speech. In: Zhai, G., Zhou, J., Yang, H., Yang, X., An, P., Wang, J. (eds) Digital Multimedia Communications. IFTC 2022. Communications in Computer and Information Science, vol 1766. Springer, Singapore. https://doi.org/10.1007/978-981-99-0856-1_31

  • DOI: https://doi.org/10.1007/978-981-99-0856-1_31

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-0855-4

  • Online ISBN: 978-981-99-0856-1

  • eBook Packages: Computer Science, Computer Science (R0)
