Group Attack Dingo Optimizer for enhancing speech recognition in noisy environments

  • Regular Article
  • Published:
The European Physical Journal Plus

Abstract

Speech recognition has become a vital technology enabling seamless human–computer interaction, even in noisy public places. Speech enhancement (SE) techniques play a crucial role in improving the performance of downstream applications such as machine translation, natural language processing, spoken language understanding, and text generation. In this study, we introduce a novel approach, the Group Attack Dingo Optimizer (GA-DOA), for optimizing speech enhancement tasks. Our method combines an improved short-time Fourier transform (STFT) with an optimized deep U-Net, using GA-DOA to fine-tune the network parameters. Feature extraction employs Mel-frequency cepstral coefficients (MFCCs), spectral features, and a one-dimensional convolutional neural network (1D-CNN), and GA-DOA-assisted feature selection retains the most effective features. The selected features are then fed into our proposed hybrid model for speech recognition (HMSR), which integrates a bidirectional long short-term memory (BiLSTM) network with a gated recurrent unit (GRU). Experimental results show that the proposed model achieves superior recognition rates and a significantly lower word error rate (WER), demonstrating improved performance even in noisy environments.
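
As a rough illustration of the recognition back-end summarised above, the sketch below assembles a BiLSTM + GRU sequence model in Keras. The layer widths, input feature dimension, and output label inventory are illustrative assumptions only; they are not the HMSR configuration, GA-DOA tuning, or training setup reported in the paper.

# Minimal sketch of a BiLSTM + GRU hybrid recogniser in the spirit of HMSR.
# Hidden sizes, feature dimension, and vocabulary size are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_hmsr_sketch(num_classes: int, feat_dim: int = 39) -> tf.keras.Model:
    # Input: a variable-length sequence of per-frame features
    # (e.g. the selected MFCC / spectral / 1D-CNN features).
    inputs = layers.Input(shape=(None, feat_dim))
    # BiLSTM captures context from both past and future frames.
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(inputs)
    # GRU refines the BiLSTM outputs with fewer parameters.
    x = layers.GRU(64, return_sequences=True)(x)
    # Frame-wise posteriors over the output symbol set.
    outputs = layers.TimeDistributed(
        layers.Dense(num_classes, activation="softmax"))(x)
    return models.Model(inputs, outputs)

model = build_hmsr_sketch(num_classes=32)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

In practice, the input frames would be the GA-DOA-selected features and the output layer would match the chosen label inventory (phonemes, characters, or words).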


Data Availability Statement

This manuscript has associated data in a data repository. [Authors’ comment: The developed ASR model uses speech audio from four datasets that are publicly available from three sources: the Multilingual and Code-Switching ASR Challenge dataset, the LibriSpeech ASR corpus, and the Crowdsourced High-Quality Kannada Multi-Speaker Speech Dataset. Datasets 1 and 4 (Multilingual and Code-Switching ASR Challenge): obtained from [23], these datasets cover three Indian languages, namely Hindi, Marathi, and Odia. Dataset 2 (LibriSpeech ASR corpus): this dataset [24] is derived from audiobooks in the LibriVox project. Dataset 3 (Crowdsourced High-Quality Kannada Multi-Speaker Speech Dataset): this dataset [25] comprises recordings from native Kannada speakers in Karnataka. For the additive noise, noise samples were taken from the NOISEX-92 database [39] and mixed with the speech at different SNR levels.]
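
For the noise mixing, the standard additive recipe is to scale the noise so that the clean-to-noise power ratio matches the target SNR before summing the two signals. The snippet below is a generic sketch of that procedure, not the authors' released preprocessing code; the function name is hypothetical and both signals are assumed to share one sampling rate (NOISEX-92 recordings would need resampling to match the speech corpora).

# Generic additive-noise mixing at a target SNR (illustrative sketch only).
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    # Repeat or truncate the noise so it covers the clean utterance.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]
    # Choose a gain so that 10*log10(P_clean / P_noise_scaled) equals snr_db.
    p_clean = np.mean(clean.astype(np.float64) ** 2)
    p_noise = np.mean(noise.astype(np.float64) ** 2) + 1e-12
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise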

References

  1. P. Bawa, V. Kadyan, Noise robust in-domain children speech enhancement for automatic Punjabi recognition system under mismatched conditions. Appl. Acoust. 175, 107810 (2021)

  2. G. Thimmaraja Yadava, H.S. Jayanna, Enhancements in automatic Kannada speech recognition system by background noise elimination and alternate acoustic modelling. Int. J. Speech Technol. 23(1), 149–167 (2020)

  3. N. Upadhyay, H.G. Rosales, Bark scaled oversampled WPT based speech recognition enhancement in noisy environments. Int. J. Speech Technol. 23(1), 1–12 (2020)

  4. P. Wang, K. Tan et al., Bridging the gap between monaural speech enhancement and recognition with distortion-independent acoustic modeling. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 39–48 (2019)

  5. C.H. You, M. Bin, Spectral-domain speech enhancement for speech recognition. Speech Commun. 94, 30–41 (2017)

  6. Y. Shao, C.-H. Chang, Bayesian separation with sparsity promotion in perceptual wavelet domain for speech enhancement and hybrid speech recognition. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 41(2), 284–293 (2010)

  7. C. Donahue, B. Li, R. Prabhavalkar, Exploring speech enhancement with generative adversarial networks for robust speech recognition, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp. 5024–5028

  8. G. Kovács, L. Tóth, D. Van Compernolle, Selection and enhancement of Gabor filters for automatic speech recognition. Int. J. Speech Technol. 18(1), 1–16 (2015)

  9. X. Xiao, S. Zhao, D.H. Ha Nguyen, X. Zhong, D.L. Jones, E.S. Chng, H. Li, Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation. EURASIP J. Adv. Signal Process. 2016(1), 1–18 (2016)

  10. J. Novoa, J. Fredes, V. Poblete, N.B. Yoma, Uncertainty weighting and propagation in DNN-HMM-based speech recognition. Comput. Speech Lang. 47, 30–46 (2018)

  11. C. Fan, J. Yi, J. Tao, Z. Tian, B. Liu, Z. Wen, Gated recurrent fusion with joint training framework for robust end-to-end speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 198–209 (2020)

  12. J. Cadore, F.J. Valverde-Albacete, A. Gallardo-Antolín, C. Peláez-Moreno, Auditory-inspired morphological processing of speech spectrograms: applications in automatic speech recognition and speech enhancement. Cogn. Comput. 5(4), 426–441 (2013)

  13. J. Ming, D. Crookes, Speech enhancement based on full-sentence correlation and clean speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 25(3), 531–543 (2017)

  14. B.K. Khonglah, A. Dey, S. Prasanna, Speech enhancement using source information for phoneme recognition of speech with background music. Circuits Syst. Signal Process. 38(2), 643–663 (2019)

  15. N. Moritz, K. Adiloğlu, J. Anemüller, S. Goetze, B. Kollmeier, Multi-channel speech enhancement and amplitude modulation analysis for noise robust automatic speech recognition. Comput. Speech Lang. 46, 558–573 (2017)

  16. J. Xue, T. Zheng, J. Han, Exploring attention mechanisms based on summary information for end-to-end automatic speech recognition. Neurocomputing 465, 514–524 (2021)

  17. L. Chai, J. Du, Q.-F. Liu, C.-H. Lee, A cross-entropy-guided measure (CEGM) for assessing speech recognition performance and optimizing DNN-based speech enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 106–117 (2020)

  18. Y.-H. Tu, J. Du, C.-H. Lee, Speech enhancement based on teacher-student deep learning using improved speech presence probability for noise-robust speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 27(12), 2080–2091 (2019)

  19. R.A. Ramadan, K. Yadav, Nonlinear acoustic noise cancellation based automatic speech recognition system (NANC-ASR) with convolutional neural networks. Int. J. Speech Technol. 25(3), 605–613 (2022)

  20. S. Lokesh, P. Malarvizhi Kumar, M. RamyaDevi, P. Parthasarathy, C. Gokulnath, An automatic Tamil speech recognition system by using bidirectional recurrent neural network with self-organizing map. Neural Comput. Appl. 31(5), 1521–1531 (2019)

  21. N. Saleem, J. Gao, M.I. Khattak, H.T. Rauf, S. Kadry, M. Shafi, DeepResGRU: residual gated recurrent neural network-augmented Kalman filtering for speech enhancement and recognition. Knowl.-Based Syst. 238, 107914 (2022)

  22. P. Agrawal, S. Ganapathy, Modulation filter learning using deep variational networks for robust speech recognition. IEEE J. Sel. Top. Signal Process. 13(2), 244–253 (2019)

  23. A. Diwan, R. Vaideeswaran, S. Shah, A. Singh, S. Raghavan, S. Khare, V. Unni, S. Vyas, A. Rajpuria, C. Yarra, et al., Multilingual and code-switching ASR challenges for low resource Indian languages, arXiv preprint arXiv:2104.00235 (2021)

  24. V. Panayotov, G. Chen, D. Povey, S. Khudanpur, Librispeech: an ASR corpus based on public domain audio books, in: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2015, pp. 5206–5210

  25. F. He, S.-H. C. Chu, O. Kjartansson, C. Rivera, A. Katanova, A. Gutkin, I. Demirsahin, C. Johny, M. Jansche, S. Sarin, K. Pipatsrisawat, Open-source Multi-speaker Speech Corpora for Building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu Speech Synthesis Systems, in: Proceedings of The 12th Language Resources and Evaluation Conference (LREC), European Language Resources Association (ELRA), Marseille, France, 2020, pp. 6494–6503. https://www.aclweb.org/anthology/2020.lrec-1.800

  26. J.-W. Hwang, R.-H. Park, H.-M. Park, Efficient audio-visual speech enhancement using deep U-Net with early fusion of audio and video information and RNN attention blocks. IEEE Access 9, 137584–137598 (2021)

  27. H. Zhang, H. Huang, H. Han, Attention-based convolution skip bidirectional long short-term memory network for speech emotion recognition. IEEE Access 9, 5332–5342 (2020)

  28. G. Cybenko, Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2(4), 303–314 (1989)

  29. V. Nair, G.E. Hinton, Rectified linear units improve restricted Boltzmann machines, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814 (2010)

  30. J.-R. Cano, Analysis of data complexity measures for classification. Expert Syst. Appl. 40(12), 4820–4831 (2013)

  31. S. Mirjalili, A. Lewis, The whale optimization algorithm. Adv. Eng. Softw. 95, 51–67 (2016)

  32. A. Siabi-Garjan, R. Hassanzadeh, A computational approach for engineering optical properties of multilayer thin films: particle swarm optimization applied to Bruggeman homogenization formalism. Eur. Phys. J. Plus 133, 1–11 (2018)

  33. W.-T. Pan, A new fruit fly optimization algorithm: taking the financial distress model as an example. Knowl.-Based Syst. 26, 69–74 (2012)

  34. W. Feng, Convergence analysis of whale optimization algorithm. J. Phys.: Conf. Ser. 1757(1), 012008 (2021). https://doi.org/10.1088/1742-6596/1757/1/012008

  35. Q. Zhao, C. Li, Two-stage multi-swarm particle swarm optimizer for unconstrained and constrained global optimization. IEEE Access 8, 124905–124927 (2020)

  36. B. Xing, W.-J. Gao, Fruit fly optimization algorithm, in: Innovative Computational Intelligence: A Rough Guide to 134 Clever Algorithms (Springer, Berlin, 2014)

  37. A.K. Bairwa, S. Joshi, D. Singh, Dingo optimizer: a nature-inspired metaheuristic approach for engineering problems. Math. Probl. Eng. 2021, 1–12 (2021)

  38. H. Peraza-Vázquez, A.F. Peña-Delgado, G. Echavarría-Castillo, A.B. Morales-Cepeda, J. Velasco-Álvarez, F. Ruiz-Perez, A bio-inspired method for engineering design optimization inspired by dingoes hunting strategies. Math. Probl. Eng. 2021, 1–19 (2021)

  39. A. Varga, H.J. Steeneken, Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun. 12(3), 247–251 (1993). https://doi.org/10.1016/0167-6393(93)90095-3


Author information

Corresponding author

Correspondence to K. Ganesh Kumar.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kumar, T.N.M., Kumar, K.G., Deepak, K.T. et al. Group Attack Dingo Optimizer for enhancing speech recognition in noisy environments. Eur. Phys. J. Plus 138, 1145 (2023). https://doi.org/10.1140/epjp/s13360-023-04775-8

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1140/epjp/s13360-023-04775-8
