
CTC Regularized Model Adaptation for Improving LSTM RNN Based Multi-Accent Mandarin Speech Recognition


Abstract

This paper proposes a novel regularized adaptation method to improve the performance of multi-accent Mandarin speech recognition. The acoustic model is a long short-term memory recurrent neural network trained with the connectionist temporal classification loss function (LSTM-RNN-CTC). In general, directly adjusting the network parameters on a small adaptation set may lead to over-fitting. To avoid this problem, a regularization term is added to the original training criterion; it forces the conditional probability distribution estimated by the adapted model to stay close to that of the accent-independent model. Moreover, only the accent-specific output layer needs to be fine-tuned with this adaptation method. Experiments are conducted on the RASC863 and CASIA regional accented speech corpora. The results show that the proposed method obtains a clear improvement over the LSTM-RNN-CTC baseline model, and it also outperforms other adaptation methods.
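The regularization idea summarized above can be illustrated with a minimal sketch. In KL-divergence-regularized adaptation of the kind the paper builds on, the hard adaptation label for each output unit is interpolated with the accent-independent model's posterior before the frame loss is computed, which keeps the adapted distribution close to the accent-independent one. All names, the interpolation weight `rho`, and the toy logits below are illustrative assumptions, not the authors' code:

```python
import math

def softmax(logits):
    # Numerically stable softmax over one frame's output logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kld_regularized_targets(hard_target, si_posterior, rho):
    # Interpolate the one-hot adaptation label with the accent-independent
    # (SI) model's posterior; rho in [0, 1] controls regularization strength.
    # rho = 0 recovers plain fine-tuning on the hard labels.
    return [(1 - rho) * h + rho * p for h, p in zip(hard_target, si_posterior)]

def cross_entropy(target, posterior):
    # Frame-level cross-entropy between target and model posterior.
    return -sum(t * math.log(q) for t, q in zip(target, posterior) if t > 0)

# Toy frame with 3 output units (values are made up for illustration).
si_post = softmax([2.0, 0.5, -1.0])       # accent-independent model posterior
adapted_post = softmax([1.5, 1.0, -0.5])  # adapted model posterior
hard = [1.0, 0.0, 0.0]                    # hard label for this frame

loss = cross_entropy(kld_regularized_targets(hard, si_post, 0.3), adapted_post)
```

Minimizing this interpolated cross-entropy is equivalent, up to a constant, to the CTC criterion plus a KL term pulling the adapted model toward the accent-independent one; in the paper this regularizer is applied while updating only the accent-specific output layer.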


Figures 1–8 (thumbnails omitted; available in the full article)


Acknowledgements

This work is supported by the National High-Tech Research and Development Program of China (863 Program) (No. 2015AA016305), the National Natural Science Foundation of China (NSFC) (No. 61425017, No. 61403386, No. 61305003), the Strategic Priority Research Program of the CAS (Grant XDB02080006), and the Major Program for the National Social Science Fund of China (13&ZD189).

Author information

Correspondence to Zhengqi Wen.


About this article


Cite this article

Yi, J., Wen, Z., Tao, J. et al. CTC Regularized Model Adaptation for Improving LSTM RNN Based Multi-Accent Mandarin Speech Recognition. J Sign Process Syst 90, 985–997 (2018). https://doi.org/10.1007/s11265-017-1291-1


Keywords

  • multi-accent
  • Mandarin speech recognition
  • LSTM-RNN-CTC
  • model adaptation
  • CTC regularization