GMM Classification of Text-to-Speech Synthesis: Identification of Original Speaker’s Voice

  • Jiří Přibil
  • Anna Přibilová
  • Jindřich Matoušek
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8655)


This paper describes two experiments. The first evaluates synthetic speech quality by reverse identification of the original speakers whose voices were used in several Czech text-to-speech (TTS) systems. The second evaluates the influence of voice transformation on recognition of the original speaker. The paper further analyses how the initial settings used to create and train the Gaussian mixture models (GMM), and the types of speech features used (spectral and/or supra-segmental), affect the correctness of GMM identification. The stability of the identification process with respect to the duration of the tested sentence (the number of processed frames) was also analysed.
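The identification step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common maximum-likelihood decision rule for GMM speaker identification: each candidate speaker has one GMM, the tested sentence is a sequence of feature frames, and the speaker whose model gives the highest average per-frame log-likelihood is selected. This is not the authors' implementation. The model parameters, the two-dimensional features, the speaker names, and the helper functions are all hypothetical, and the training step (typically expectation-maximisation) is omitted.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def frame_log_likelihood(x, gmm):
    """Log p(x | gmm) for one frame, using log-sum-exp over the components."""
    terms = [
        math.log(w) + log_gaussian_diag(x, mean, var)
        for w, mean, var in gmm
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def score(frames, gmm):
    """Average per-frame log-likelihood of a sentence under one speaker model."""
    return sum(frame_log_likelihood(x, gmm) for x in frames) / len(frames)

def identify(frames, models):
    """Return the name of the speaker model with the highest score."""
    return max(models, key=lambda name: score(frames, models[name]))

# Hypothetical hand-set models: one (weight, mean, variance) tuple per component.
# In practice these parameters would be estimated from training speech.
models = {
    "speaker_A": [(0.6, [0.0, 0.0], [1.0, 1.0]), (0.4, [2.0, 2.0], [1.0, 1.0])],
    "speaker_B": [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [7.0, 4.0], [1.0, 1.0])],
}

# Made-up feature frames that lie close to the components of speaker_B.
test_frames = [[5.1, 4.9], [6.8, 4.2], [4.7, 5.3]]
result = identify(test_frames, models)
print(result)  # prints "speaker_B"
```

Because the score is averaged over frames, a longer tested sentence gives a score that depends less on any single frame, which is one way to read the abstract's remark about stability with respect to the number of processed frames.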


Keywords: quality of synthetic speech, text-to-speech system, GMM classification, statistical analysis





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Jiří Přibil (1, 2)
  • Anna Přibilová (3)
  • Jindřich Matoušek (1)
  1. Faculty of Applied Sciences, Dept. of Cybernetics, University of West Bohemia, Plzeň, Czech Republic
  2. Institute of Measurement Science, SAS, Bratislava, Slovakia
  3. Faculty of Electrical Engineering & Information Technology, Institute of Electronics and Photonics, Slovak University of Technology, Bratislava, Slovakia
