
Hindi speech recognition using time delay neural network acoustic modeling with i-vector adaptation

Published in: International Journal of Speech Technology

Abstract

Building Automatic Speech Recognition (ASR) systems for low- and limited-resource languages is a pressing need. For the last two decades, Indian-language ASR systems have typically relied on statistical techniques such as Hidden Markov Models (HMMs). In this work, we adopt Time-Delay Neural Network (TDNN) acoustic modeling with i-vector adaptation for limited-resource Hindi ASR. A TDNN can capture the extended temporal context of acoustic events; to reduce training time, we use a sub-sampled TDNN architecture. Further, data augmentation techniques are applied to enlarge the training set developed by TIFR, Mumbai. The results show that data augmentation significantly improves the performance of Hindi ASR, and applying i-vector adaptation yields a further average improvement of approximately 4%. The best system, TDNN acoustic modeling with i-vector adaptation, achieves an accuracy of 89.9%.
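The key efficiency idea behind the sub-sampled TDNN is that higher layers splice only a few offsets of the layer below, yet the stack still covers a wide temporal context. A minimal sketch in Python illustrates this; the five-layer offset configuration below is illustrative, in the style of the sub-sampled splicing of Peddinti et al. (2015), not necessarily the exact architecture used in this paper.

```python
# Sketch: how layer-wise splicing offsets in a sub-sampled TDNN
# compose into a wide input receptive field. Each layer's output at
# frame t depends on the layer below at frames t + o for each offset o.

def receptive_field(layer_offsets):
    """Return (min, max) input-frame offsets, relative to the current
    output frame, that the top layer ultimately depends on."""
    needed = {0}  # the top layer emits one output frame at offset 0
    for offsets in reversed(layer_offsets):
        # Every needed frame at this level pulls in its spliced offsets
        # from the level below.
        needed = {t + o for t in needed for o in offsets}
    return min(needed), max(needed)

# Illustrative 5-layer configuration: dense local splicing at the
# input, sparse (sub-sampled) splicing higher up.
layers = [
    [-2, -1, 0, 1, 2],  # layer 1: dense local context
    [-1, 2],            # layer 2: sub-sampled
    [-3, 3],            # layer 3
    [-7, 2],            # layer 4
    [0],                # output layer
]

lo, hi = receptive_field(layers)
print(f"total input context: [{lo}, {hi}]")  # [-13, 9]
```

Although the network here sees 23 input frames per output frame, each hidden layer evaluates activations at only two spliced offsets, which is what cuts the training cost relative to a densely spliced TDNN.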




Author information


Corresponding author

Correspondence to Ankit Kumar.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Kumar, A., Aggarwal, R.K. Hindi speech recognition using time delay neural network acoustic modeling with i-vector adaptation. Int J Speech Technol 25, 67–78 (2022). https://doi.org/10.1007/s10772-020-09757-0

