Journal of Signal Processing Systems, Volume 90, Issue 7, pp 1025–1037

Improving Deep Neural Network Based Speech Synthesis through Contextual Feature Parametrization and Multi-Task Learning

  • Zhengqi Wen
  • Kehuang Li
  • Zhen Huang
  • Chin-Hui Lee
  • Jianhua Tao


We propose three techniques to improve speech synthesis based on deep neural networks (DNNs). First, at the DNN input, we use a real-valued contextual feature vector, instead of the conventional binary vector, to represent phoneme identity, part of speech, and pause information. Second, at the DNN output layer, parameters for the pitch-scaled spectrum and aperiodicity measures are estimated to construct the excitation signal used in our baseline synthesis vocoder. Third, a bidirectional recurrent neural network architecture with long short-term memory (BLSTM) units is adopted and trained with multi-task learning for DNN-based speech synthesis. Experimental results demonstrate that the new input vector and output parameters improve the quality of synthesized speech. The proposed BLSTM architecture also helps the network learn the mapping from the input contextual features to the speech parameters, which further improves speech quality.
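The contrast between the conventional binary input and a real-valued contextual feature can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy phoneme inventory, the embedding size, and the randomly initialized table are hypothetical placeholders, not the features or dimensions used in the paper, where the real-valued vectors would be learned from data.

```python
import random

# Hypothetical toy phoneme inventory (assumption; not the paper's inventory).
PHONEMES = ["sil", "a", "i", "u", "k", "s", "t", "n"]


def one_hot(phoneme):
    """Conventional binary input: one dimension per phoneme, exactly one set to 1."""
    return [1.0 if p == phoneme else 0.0 for p in PHONEMES]


def make_embedding_table(dim, seed=0):
    """Real-valued alternative: one dense vector per phoneme.

    The values are random placeholders; in practice they would be learned.
    """
    rng = random.Random(seed)
    return {p: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for p in PHONEMES}


def embed(phoneme, table):
    """Look up the dense real-valued vector for a phoneme."""
    return table[phoneme]


EMBED_DIM = 3  # assumed size, chosen only for illustration
table = make_embedding_table(EMBED_DIM)

binary_vec = one_hot("k")      # length grows with the inventory size
dense_vec = embed("k", table)  # fixed length, independent of inventory size
print(len(binary_vec), len(dense_vec))
```

In this sketch the binary vector has as many dimensions as there are phonemes and treats all phonemes as equally distant from one another, while the dense vector has a fixed, smaller size and can place similar phonemes close together once its values are trained.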


DNN-based speech synthesis; Vocoder; Speech parametrization; BLSTM; Phoneme embedded vector; Multi-task learning; Pitch-scaled spectrum



This work is supported by the National High-Tech Research and Development Program of China (863 Program) (No. 2015AA016305), the National Natural Science Foundation of China (NSFC) (No. 61403386), and the Strategic Priority Research Program of the CAS (Grant XDB02080006), and is partly supported by the Major Program for the National Social Science Fund of China (13&ZD189).



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  • Zhengqi Wen (1)
  • Kehuang Li (2)
  • Zhen Huang (2)
  • Chin-Hui Lee (2)
  • Jianhua Tao (1, 3, 4)

  1. National Laboratory of Pattern Recognition, Beijing, China
  2. School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, USA
  3. CAS Center for Excellence in Brain Science and Intelligence Technology, Institute of Automation, Chinese Academy of Sciences, Beijing, China
  4. School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, China
