
Noise robust automatic speech recognition: review and analysis

Published in: International Journal of Speech Technology

Abstract

Automatic speech recognition (ASR) is an emerging technology used in fields such as robotics, traffic control, and healthcare. The leading cause of ASR performance degradation is the mismatch between training and testing environments, and the main source of this mismatch is noise present during the testing phase. Researchers have applied a variety of techniques in both the front-end and back-end phases of ASR to detect and handle such noise, yet very few review papers have used noise as a criterion for comparing existing work. The objective of this survey is therefore to analyze and review the effective methods proposed to improve the noise robustness of ASR systems. The paper first discusses the basic architecture of an ASR system, the factors affecting its performance, and the noise problem formulation. It then analyzes existing state-of-the-art noise-robust ASR methods in terms of front-end feature extraction techniques and back-end classification models, followed by a detailed review of the speech databases these methods use. Finally, it presents an analysis of all these noise-resistant ASR techniques in terms of performance metrics, discussing the feature extraction techniques, back-end classification methods, speech databases, and performance metrics in detail along the way. The paper closes with existing challenges and future research directions for building noise-resistant ASR systems.
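To make the abstract's noise problem formulation concrete: the test-time signal is commonly modeled as clean speech corrupted by additive noise at some signal-to-noise ratio (SNR), and recognition quality is typically scored with word error rate (WER). The following minimal sketch is purely illustrative and not taken from the paper; the helper names `mix_at_snr` and `wer` are assumptions, and only NumPy is used.

```python
# Illustrative sketch only (not the authors' code): additive-noise corruption
# at a target SNR, plus word error rate (WER) via edit distance.
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Return clean speech corrupted by `noise` scaled to the target SNR in dB."""
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose a gain so that 10 * log10(p_clean / (gain**2 * p_noise)) == snr_db.
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)  # deleting every reference word
    d[0, :] = np.arange(len(hyp) + 1)  # inserting every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[-1, -1] / len(ref)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = np.sin(2 * np.pi * 440.0 * np.arange(16000) / 16000.0)  # 1 s toy tone
    noisy = mix_at_snr(clean, rng.standard_normal(16000), snr_db=5.0)
    # Two deletions against a six-word reference -> WER = 2/6 ~ 0.33.
    print(wer("the cat sat on the mat", "the cat sat mat"))
```

In these terms, the front-end methods surveyed in the paper operate on signals like `noisy` before feature extraction, while back-end methods adapt the classification model that produces the hypothesis eventually scored by a metric such as `wer`.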

Data availability

The manuscript has no associated data.

Author information

Corresponding author

Correspondence to Shelza Dua.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest related to the submitted work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dua, M., Akanksha & Dua, S. Noise robust automatic speech recognition: review and analysis. Int J Speech Technol 26, 475–519 (2023). https://doi.org/10.1007/s10772-023-10033-0
