Phoneme Aware Speech Synthesis via Fine Tune Transfer Learning with a Tacotron Spectrogram Prediction Network

Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1043)


The implications of realistic human speech imitation are both promising and potentially dangerous. In this work, a pre-trained Tacotron Spectrogram Feature Prediction Network is fine-tuned with two 1.6 h speech datasets for 100,000 learning iterations, producing two individual models. The two speech datasets are identical in content apart from their textual representation: the first is written in standard English, whereas the second is an English phonetic representation, so that the effect of the representation on the learning process can be studied. To test imitative ability after training, thirty lines of speech are recorded from a human speaker. The models then attempt to produce these voice lines themselves, and the acoustic fingerprints of the outputs are compared with those of the real human speech. On average, English notation achieves 27.36% similarity to the human speaker, whereas phonetic English notation achieves 35.31%. This suggests that representing English through the International Phonetic Alphabet provides more useful training data than written English. These experiments therefore suggest that a phonetic-aware paradigm would improve speech synthesis, similarly to its effect in the field of speech recognition.
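To put the two reported averages side by side, the short sketch below computes the gap between them. It assumes the per-model figure is an arithmetic mean over the thirty per-line fingerprint similarities, which the abstract implies ("on average") but does not spell out. The helper name `average_similarity` and the variable names are hypothetical and are not taken from the paper; only the two percentages come from the text above.

```python
from statistics import mean

def average_similarity(per_line_scores):
    """Arithmetic mean of per-line similarity percentages.

    Assumption: one similarity score per recorded voice line,
    aggregated by a plain mean (not confirmed by the abstract).
    """
    return mean(per_line_scores)

# Averages reported in the abstract (percent similarity to the human speaker).
english_similarity = 27.36
phonetic_similarity = 35.31

# Absolute gap in percentage points and gain relative to the English model.
absolute_gain = phonetic_similarity - english_similarity
relative_gain = absolute_gain / english_similarity * 100

print(f"Absolute gain: {absolute_gain:.2f} percentage points")
print(f"Relative gain: {relative_gain:.2f}%")
```

Under these assumptions the phonetic model scores 7.95 percentage points higher, roughly a 29% relative gain over the English-notation model. Neither figure says anything about statistical significance, which would require the per-line scores themselves.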


Speech synthesis · Fine tune learning · Phonetic awareness · Fingerprint analysis · Tacotron



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. School of Engineering and Applied Science, Aston University, Birmingham, UK
