Speech Evaluation Based on Deep Learning Audio Caption

  • Liu Zhang
  • Hanyi Zhang
  • Jin Guo
  • Detao Ji
  • Qing Liu
  • Cheng Xie (email author)
Conference paper
Part of the Lecture Notes on Data Engineering and Communications Technologies book series (LNDECT, volume 41)


Speech evaluation is an essential process of language learning. Traditionally, speech evaluation is done by experts who assess testers' voice and pronunciation, a process that lacks both efficiency and standardization. In this paper, we propose a novel approach, based on deep learning and audio captioning, to evaluate speech in place of linguistic experts. First, the proposed approach extracts audio features from the speech. Then, the relationships between the audio features and expert evaluations are learned by deep learning. Finally, an LSTM model is applied to predict expert evaluations. The experiment is conducted on a real-world dataset collected by our collaborating company. The results show that the proposed approach achieves excellent performance and has high potential for practical application.
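The pipeline described above (extract audio features, run them through an LSTM, predict an evaluation score) can be sketched as follows. This is a minimal numpy-only illustration, not the authors' implementation: the feature extractor is a crude log-spectral stand-in for MFCCs, the single LSTM cell and the final scoring layer use random, untrained weights, and all sizes (16 kHz audio, 13 features, hidden size 8) are assumptions.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def log_spectral_features(x, n_feats=13):
    """Crude stand-in for MFCCs: log power of the first FFT bins per frame."""
    frames = frame_signal(x) * np.hamming(400)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power[:, 1:n_feats + 1] + 1e-8)

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; gates i, f, o and candidate g are stacked in W, U, b."""
    z = W @ x + U @ h + b
    H = h.size
    i, f, o = (1.0 / (1.0 + np.exp(-z[k * H:(k + 1) * H])) for k in range(3))
    g = np.tanh(z[3 * H:])
    c = f * c + i * g          # update cell state
    h = o * np.tanh(c)         # update hidden state
    return h, c

rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)       # 1 s of synthetic 16 kHz audio
feats = log_spectral_features(audio)     # shape: (frames, 13)

H, D = 8, feats.shape[1]
W = rng.standard_normal((4 * H, D)) * 0.1
U = rng.standard_normal((4 * H, H)) * 0.1
b = np.zeros(4 * H)
h = c = np.zeros(H)
for f_t in feats:                        # run the LSTM over the frame sequence
    h, c = lstm_step(f_t, h, c, W, U, b)

# Final hidden state -> scalar evaluation score in (0, 1) via a random readout.
score = float(1.0 / (1.0 + np.exp(-(rng.standard_normal(H) * 0.1) @ h)))
print(feats.shape, round(score, 3))
```

In the paper's setting, the random weights would be replaced by parameters trained against expert evaluations, and the scalar readout by whatever scoring or caption target the experts provided.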


Keywords: Speech evaluation · Deep learning · Audio caption



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Liu Zhang (1)
  • Hanyi Zhang (1)
  • Jin Guo (1)
  • Detao Ji (1)
  • Qing Liu (1)
  • Cheng Xie (1, email author)

  1. Yunnan University, Kunming, China
