
An exploration of semi-supervised and language-adversarial transfer learning using hybrid acoustic model for Hindi speech recognition

  • Original Article
  • Published in: Journal of Reliable Intelligent Environments

Abstract

Semi-supervised training and language-adversarial transfer learning are two different techniques for improving Automatic Speech Recognition (ASR) performance under limited-resource conditions. In this paper, we combine these two techniques and present a common framework for a Hindi ASR system. For acoustic modeling, we propose a hybrid SincNet-Convolutional Neural Network (CNN)-Light Gated Recurrent Unit (LiGRU) architecture, which offers greater interpretability, high accuracy, and a smaller parameter count. We investigate the impact of the proposed hybrid model on monolingual Hindi ASR with semi-supervised training, and on multilingual Hindi ASR with language-adversarial transfer learning. In this work, we chose three Indian languages (Hindi, Marathi, Bengali) from the same Indo-Aryan family for multilingual training. All experiments were conducted using the Kaldi and PyTorch-Kaldi toolkits. The proposed model, combined with these two learning strategies, achieves the lowest Word Error Rate (WER) of 5.5% for Hindi ASR.
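The paper itself ships no code; for intuition, here is a minimal PyTorch sketch of a SincNet-CNN-LiGRU acoustic-model stack of the kind the abstract describes. All layer sizes, kernel widths, and the output senone count are illustrative assumptions, not the paper's configuration, and recipe details of PyTorch-Kaldi (context windows, dropout, normalisation placement) are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincConv(nn.Module):
    """SincNet layer (Ravanelli & Bengio): each filter is a parametrised
    band-pass, so only the low and high cutoff frequencies are learned."""
    def __init__(self, out_channels=40, kernel_size=101, sample_rate=16000):
        super().__init__()
        self.sample_rate = sample_rate
        # initialise band edges roughly evenly between 30 Hz and Nyquist
        edges = torch.linspace(30.0, sample_rate / 2 - 100.0, out_channels + 1)
        self.low_hz = nn.Parameter(edges[:-1].unsqueeze(1))
        self.band_hz = nn.Parameter((edges[1:] - edges[:-1]).unsqueeze(1))
        n = (kernel_size - 1) // 2
        self.register_buffer("t", torch.arange(-n, n + 1).float() / sample_rate)
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, wav):  # wav: (batch, 1, samples)
        f1 = self.low_hz.abs()
        f2 = (f1 + self.band_hz.abs()).clamp(max=self.sample_rate / 2)
        # an ideal band-pass is the difference of two sinc low-pass responses
        band = 2 * f2 * torch.sinc(2 * f2 * self.t) \
             - 2 * f1 * torch.sinc(2 * f1 * self.t)
        filters = (band * self.window).unsqueeze(1)  # (out, 1, kernel)
        return F.conv1d(wav, filters)

class LiGRU(nn.Module):
    """Light GRU (Ravanelli et al.): a single update gate, ReLU candidate
    activations, and batch norm on the feed-forward connections."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.wz = nn.Linear(input_size, hidden_size, bias=False)
        self.wh = nn.Linear(input_size, hidden_size, bias=False)
        self.uz = nn.Linear(hidden_size, hidden_size, bias=False)
        self.uh = nn.Linear(hidden_size, hidden_size, bias=False)
        self.bn_z = nn.BatchNorm1d(hidden_size)
        self.bn_h = nn.BatchNorm1d(hidden_size)

    def forward(self, x):  # x: (batch, time, feats)
        h = x.new_zeros(x.size(0), self.hidden_size)
        out = []
        for t in range(x.size(1)):
            z = torch.sigmoid(self.bn_z(self.wz(x[:, t])) + self.uz(h))
            hc = torch.relu(self.bn_h(self.wh(x[:, t])) + self.uh(h))
            h = z * h + (1 - z) * hc  # interpolate old state and candidate
            out.append(h)
        return torch.stack(out, dim=1)

class HybridAM(nn.Module):
    """SincNet front end -> CNN -> LiGRU -> log-probs over CD-HMM states."""
    def __init__(self, n_states=2328):  # n_states: hypothetical senone count
        super().__init__()
        self.sinc = SincConv()
        self.cnn = nn.Sequential(
            nn.Conv1d(40, 64, 5), nn.BatchNorm1d(64), nn.ReLU(), nn.MaxPool1d(3),
            nn.Conv1d(64, 64, 3), nn.BatchNorm1d(64), nn.ReLU(), nn.MaxPool1d(3))
        self.ligru = LiGRU(64, 256)
        self.out = nn.Linear(256, n_states)

    def forward(self, wav):  # wav: (batch, 1, samples)
        f = self.cnn(self.sinc(wav).abs())  # simple rectification
        return self.out(self.ligru(f.transpose(1, 2))).log_softmax(-1)

model = HybridAM()
logp = model(torch.randn(4, 1, 16000))  # four 1-second utterances at 16 kHz
```

The interpretability claim follows from the front end: each SincConv filter is fully described by two learned cutoff frequencies, so the learned filterbank can be read off directly, and the parametrisation keeps the parameter count well below that of a free convolutional layer of the same width.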


Notes

  1. Winter school on ASR, 16–28 May 2017, IIT Guwahati.

  2. http://openslr.org/53

Abbreviations

ASR: Automatic Speech Recognition
BLSTM: Bidirectional long short-term memory
BPTT: Back-propagation through time
CD: Context-dependent
CI: Context-independent
CNN: Convolutional neural network
DNN: Deep neural network
Fbank: Log-Mel filterbank
FC: Fully connected
GMM: Gaussian mixture model
GRL: Gradient reversal layer (see the sketch following this list)
GRU: Gated recurrent unit
HMM: Hidden Markov model
LF-MMI: Lattice-free maximum mutual information
LiGRU: Light gated recurrent unit
LSTM: Long short-term memory
LVCSR: Large vocabulary continuous speech recognition
MFCC: Mel-frequency cepstral coefficient
ML: Maximum likelihood
MLP: Multi-layer perceptron
RNN: Recurrent neural network
RNNLM: Recurrent neural network language model
SGD: Stochastic gradient descent
SHL: Shared hidden layer
SOTA: State-of-the-art
SRILM: SRI language modeling
TDNN: Time-delay neural network
WER: Word error rate
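The GRL entry above is the core of the language-adversarial setup: a gradient reversal layer acts as the identity in the forward pass but negates (and scales) gradients in the backward pass, so a language classifier attached behind it drives the shared encoder toward language-invariant features. A minimal PyTorch sketch follows; the discriminator sizes and the lambda value are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda
    on the way back (Ganin & Lempitsky, 2015)."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lamb * grad_out, None  # no gradient w.r.t. lambda

class LanguageDiscriminator(nn.Module):
    """Small classifier over the three training languages (Hindi, Marathi,
    Bengali); sizes here are illustrative."""
    def __init__(self, feat_dim=256, n_langs=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_langs))

    def forward(self, h, lamb=1.0):
        # reversing gradients here trains the upstream shared encoder to
        # produce features the language classifier cannot separate
        return self.net(GradReverse.apply(h, lamb))

disc = LanguageDiscriminator()
h = torch.randn(8, 256, requires_grad=True)  # stand-in for encoder features
loss = nn.functional.cross_entropy(disc(h, lamb=0.5),
                                   torch.randint(0, 3, (8,)))
loss.backward()  # h.grad now carries the *reversed* language gradient
```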


Author information


Corresponding author

Correspondence to Ankit Kumar.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Kumar, A., Aggarwal, R.K. An exploration of semi-supervised and language-adversarial transfer learning using hybrid acoustic model for Hindi speech recognition. J Reliable Intell Environ 8, 117–132 (2022). https://doi.org/10.1007/s40860-021-00140-7

