Word-Level Permutation and Improved Lower Frame Rate for RNN-Based Acoustic Modeling

  • Yuanyuan Zhao
  • Shiyu Zhou
  • Shuang Xu
  • Bo Xu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10639)


Recently, RNN-based acoustic models have shown promising performance. However, they generalize poorly across multiple scenarios, for two reasons. First, an RNN encodes inter-word dependencies, which conflicts with the principle that an acoustic model should model only the pronunciation of words. Second, an RNN-based acoustic model traces the intra-word acoustic trajectory frame by frame, which is too precise to tolerate small distortions. In this work, we propose two variants to address these two problems. The first is word-level permutation: the order of the input features and their corresponding labels is shuffled, with a suitable probability, according to word boundaries, with the aim of eliminating inter-word dependencies. The second is the improved lower frame rate (iLFR) model, which splits the original sentence equidistantly into N utterances so that no data is discarded, as happens in the LFR model. Experiments with an LSTM RNN show a 7% relative performance improvement when word-level permutation and iLFR are combined.
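
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `permute_words` and `ilfr_split`, the `(start, end)` word-boundary format, the default probability, and the reading of "equidistant splitting" as taking every N-th frame at offsets 0 to N-1 are all assumptions made for illustration.

```python
import random


def permute_words(frames, labels, word_bounds, p=0.5, rng=None):
    """Shuffle the word segments of one utterance with probability p.

    frames, labels: per-frame features and labels of equal length.
    word_bounds: (start, end) frame index pairs, end exclusive, one per word
                 (hypothetical format; the paper does not specify one).
    """
    rng = rng or random.Random()
    if rng.random() >= p:
        # Leave this utterance in its original order.
        return list(frames), list(labels)
    order = list(range(len(word_bounds)))
    rng.shuffle(order)
    new_frames, new_labels = [], []
    for i in order:
        start, end = word_bounds[i]
        # Frames inside a word stay contiguous and in order;
        # only the order of whole words changes.
        new_frames.extend(frames[start:end])
        new_labels.extend(labels[start:end])
    return new_frames, new_labels


def ilfr_split(frames, labels, n):
    """Split one utterance into n lower-frame-rate utterances.

    Assumed reading of iLFR: sub-utterance k keeps frames k, k+n, k+2n, ...
    so that every frame lands in exactly one sub-utterance.
    """
    return [(frames[k::n], labels[k::n]) for k in range(n)]


if __name__ == "__main__":
    frames = list(range(6))
    labels = ["a", "a", "b", "b", "b", "c"]
    bounds = [(0, 2), (2, 5), (5, 6)]
    print(permute_words(frames, labels, bounds, p=1.0, rng=random.Random(0)))
    print(ilfr_split(frames, labels, 3))
```

Under this reading, a plain LFR model would keep only one of the N sub-utterances, whereas the split keeps all of them as separate training utterances.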


RNN-based acoustic model, Acoustic trajectory, Lower frame rate, Word-level permutation



This work was supported by the 973 Program in China, grant No. 2013CB329302.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Institute of Automation, Chinese Academy of Sciences, Beijing, China
  2. University of Chinese Academy of Sciences, Beijing, China
