Abstract
The paper describes a system for automatic evaluation of synthetic speech quality based on continuous detection of emotional states throughout a spoken sentence using Gaussian mixture model (GMM) classification. The final evaluation decision is made by statistical analysis of the emotional-class differences between the sentences of the original male or female voices and the speech synthesized by various methods with different parameters, approaches to prosody manipulation, etc. The basic experiments confirm the functionality of the developed system, which produces results comparable with those obtained by the standard listening-test method. Additional investigations have shown that the number of mixtures, the type of speech features, and the speech database used for creation and training of the GMMs have a relatively great influence on continuous emotional style detection and on the final quality evaluation of the tested synthetic speech.
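The evaluation pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: one GMM is trained per emotional class, each frame of a sentence is assigned to the class with the highest log-likelihood, and the per-sentence class distributions of the original and the synthetic utterance are compared. The emotion labels, number of mixtures, feature dimension, and the randomly generated stand-in "frames" (real systems would use spectral/prosodic features such as MFCCs) are all assumptions for the sketch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
EMOTIONS = ["neutral", "joy", "sadness", "anger"]  # assumed label set
N_MIX, N_DIM = 4, 13  # assumed number of mixtures and feature dimension

# Train one GMM per emotional class on that class's training frames.
# Random data stands in for extracted speech features here.
gmms = {}
for i, emo in enumerate(EMOTIONS):
    train_frames = rng.normal(loc=i, scale=1.0, size=(500, N_DIM))
    gmms[emo] = GaussianMixture(n_components=N_MIX, random_state=0).fit(train_frames)

def class_histogram(frames):
    """Assign each frame to the class with the maximum GMM log-likelihood
    and return the relative frequency of each class over the sentence."""
    scores = np.column_stack([gmms[e].score_samples(frames) for e in EMOTIONS])
    winners = scores.argmax(axis=1)
    return np.bincount(winners, minlength=len(EMOTIONS)) / len(frames)

# Compare an "original" and a "synthetic" sentence by the difference of
# their emotional-class distributions (smaller difference = closer match).
orig = class_histogram(rng.normal(0.0, 1.0, size=(200, N_DIM)))
synth = class_histogram(rng.normal(0.3, 1.0, size=(200, N_DIM)))
diff = np.abs(orig - synth).sum()
print("original:", orig, "synthetic:", synth, "difference:", diff)
```

In a full system the per-sentence differences would then be aggregated over many sentences and analyzed statistically to rank the synthesis methods.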
This work was supported by the Ministry of Education, Youth and Sports of the Czech Republic, project No. LO1506, and by the Ministry of Education, Science, Research and Sport of the Slovak Republic, grant VEGA 1/0854/16 (A. Přibilová).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Přibil, J., Přibilová, A., Matoušek, J. (2019). Evaluation of Synthetic Speech by GMM-Based Continuous Detection of Emotional States. In: Ekštein, K. (ed.) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science, vol. 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27946-2
Online ISBN: 978-3-030-27947-9