Domain adaptation of lattice-free MMI based TDNN models for speech recognition

International Journal of Speech Technology

Abstract

The recently proposed time-delay deep neural network (TDNN) acoustic models trained with the lattice-free maximum mutual information (LF-MMI) criterion have been shown to give significant performance improvements over other deep neural network (DNN) models in a variety of speech recognition tasks. Meanwhile, Kullback–Leibler divergence (KLD) regularization has been validated as an effective adaptation method for DNN acoustic models. However, to the best of our knowledge, no work has been reported on whether the KLD-based method is also effective for LF-MMI based TDNN models, especially for domain adaptation. In this study, we generalized KLD-regularized model adaptation to train domain-specific TDNN acoustic models. Several distinct and important observations were obtained. Experiments were performed on Cantonese-accent, in-car, and far-field noise Mandarin speech recognition tasks. Results demonstrated that the proposed domain-adapted models can achieve a relative word error rate reduction of around 7–29% on these tasks, even with only around 1 K adaptation utterances.
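As a rough illustration of the two quantities the abstract refers to, the sketch below shows the commonly described frame-level, cross-entropy form of KLD regularization (the adaptation targets are interpolated with the posteriors of the unadapted model) and how a relative word error rate reduction is computed. This is not the LF-MMI generalization studied in the article, whose exact form is not given on this page. All function names, the weight `rho`, and the numbers are hypothetical and chosen only for illustration.

```python
import numpy as np


def kld_regularized_targets(hard_targets, unadapted_posteriors, rho):
    """Interpolate one-hot adaptation targets with the unadapted model's posteriors.

    rho = 0 ignores the unadapted model (plain fine-tuning);
    rho = 1 ignores the adaptation labels (the model is pulled back to the original).
    Assumption: frame-level cross-entropy setting, not the LF-MMI objective.
    """
    return (1.0 - rho) * hard_targets + rho * unadapted_posteriors


def cross_entropy(targets, posteriors, eps=1e-12):
    """Mean per-frame cross-entropy between soft targets and model posteriors."""
    return float(-np.mean(np.sum(targets * np.log(posteriors + eps), axis=1)))


def relative_wer_reduction(baseline_wer, adapted_wer):
    """Relative word error rate reduction, in percent."""
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


# Toy example: 2 frames, 3 output classes (illustrative values only).
hard = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0]])
unadapted = np.array([[0.6, 0.3, 0.1],
                      [0.2, 0.5, 0.3]])
soft = kld_regularized_targets(hard, unadapted, rho=0.5)
loss = cross_entropy(soft, unadapted)
```

A larger `rho` keeps the adapted model closer to the original one, which is the usual motivation for this kind of regularization when adaptation data are scarce.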


Acknowledgements

This work was funded by the Shanghai Science and Technology Development Funds (Grant No. 14YF1409300) and the Research Foundation of Young Teachers Program in Universities of Shanghai (Grant No. ZZshsf14026). Thanks to Beijing Unisound Information Technology Co., Ltd (http://www.unisound.com/) for providing the data sets for system training and testing.

Author information

Corresponding author

Correspondence to Yanhua Long.

About this article

Cite this article

Long, Y., Li, Y., Ye, H. et al. Domain adaptation of lattice-free MMI based TDNN models for speech recognition. Int J Speech Technol 20, 171–178 (2017). https://doi.org/10.1007/s10772-017-9399-z

  • DOI: https://doi.org/10.1007/s10772-017-9399-z
