Optical Memory and Neural Networks

Volume 27, Issue 4, pp. 272–282

Performance Optimization of Speech Recognition System with Deep Neural Network Model



With the development of the Internet, human-machine interaction has become increasingly important, and accurate speech recognition is a key means of achieving it. In this study, deep neural network models were used to improve speech recognition performance. Four architectures were examined: the feedforward fully connected deep neural network (DNN), the time-delay neural network (TDNN), the convolutional neural network (CNN), and the feedforward sequential memory network (FSMN); their recognition performance was compared through their acoustic models. In addition, the recognition performance of each model was tested after vocal features of different dimensions were added. The results showed that deep neural network models effectively improved the performance of the speech recognition system: the FSMN performed best, followed by the DNN, the TDNN, and the CNN. Different extracted features improved model performance to different degrees; models trained on Fbank features outperformed those trained on Mel-frequency cepstral coefficient (MFCC) features. Model performance also improved after vocal characteristics were added, and the optimal dimension of these characteristics differed between models.
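The Fbank and MFCC features compared above are closely related: Fbank features are log mel-filterbank energies, and MFCCs are obtained by additionally applying a discrete cosine transform and keeping the leading coefficients. A minimal NumPy sketch of both pipelines (frame sizes, filter counts, and the FFT length are illustrative defaults, not values from the paper):

```python
import numpy as np

def mel_filterbank(sr, n_fft, n_mels=40, fmin=0.0, fmax=None):
    # Triangular filters spaced evenly on the mel scale.
    fmax = fmax or sr / 2.0
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    hz_pts = inv_mel(np.linspace(mel(fmin), mel(fmax), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def fbank(signal, sr, frame_len=400, hop=160, n_fft=512, n_mels=40):
    # Frame, window, take the power spectrum, apply mel filters, then log.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    return np.log(power @ mel_filterbank(sr, n_fft, n_mels).T + 1e-10)

def mfcc(signal, sr, n_ceps=13, n_mels=40):
    # DCT-II of the log filterbank energies; keep the first n_ceps coefficients.
    lf = fbank(signal, sr, n_mels=n_mels)
    n = lf.shape[1]
    basis = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :] * np.arange(n_ceps)[:, None])
    return lf @ basis.T
```

Because the DCT compresses each frame to a handful of coefficients, MFCCs discard spectral detail that the full Fbank representation retains, which is consistent with the finding that Fbank features served the deep models better.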


Keywords: deep neural network, acoustic model, speech recognition, discriminative training, performance optimization
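The FSMN, which performed best in the comparison, augments a feedforward network with a learnable "memory" over past hidden frames, giving it temporal context without recurrence. A minimal sketch of a scalar FSMN memory block (the look-back order and weights here are illustrative assumptions, not values from this study):

```python
import numpy as np

def fsmn_memory(h, weights):
    # Scalar FSMN memory block: each output frame is the current hidden
    # activation plus a learned weighted sum of the previous N frames.
    # h: (T, D) hidden activations; weights: length-N look-back coefficients.
    T, _ = h.shape
    mem = h.copy()
    for t in range(T):
        for i, a in enumerate(weights, start=1):
            if t - i >= 0:
                mem[t] += a * h[t - i]
    return mem
```

Because the memory is a fixed-order weighted sum rather than a recurrent state, the block can be trained with plain backpropagation, which is one reason FSMN-style acoustic models are attractive for speech recognition.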



Copyright information

© Allerton Press, Inc. 2018

Authors and Affiliations

  1. College of Modern Science and Technology, China Jiliang University, Hangzhou, Zhejiang, China
