Mathematical analysis of AMRes: unlocking enhanced recognition across audio-visual domains

  • Original Research
  • Published in: International Journal of Information Technology

Abstract

This research presents AMRes (Adaptive Windowed Convolutional Neural Network and Multiple Residual Network), a novel method that demonstrates remarkable resilience against overfitting, even in scenarios with limited training data. Its meticulous design, comprising up to 19 convolutional layers, enables deep learning models to extract and analyze vital information from input data while preserving that information through deep layers. Through rigorous mathematical and empirical validation, the proposed method exhibits adept handling of intra- and inter-speaker variability, efficient training with limited data, and applicability beyond speech recognition to image-related tasks, showcasing its generality. The method excels at modeling speech signals and achieves highly accurate speech recognition. Thorough evaluations on well-established databases, Switchboard, TIMIT, and FarsDAT for speech and MNIST for images, underscore the method's versatility and efficacy across diverse data contexts. Comparative analysis against state-of-the-art speech recognition techniques demonstrates the proposed method's competitive and often superior performance; notably, it reduces phoneme recognition errors by approximately 8% on notable datasets. This work establishes a solid foundation for advancing multi-faceted recognition systems and encourages further exploration.
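
To make the residual principle in the abstract concrete, below is a minimal sketch (in PyTorch) of a 1-D residual convolution block of the general kind described: an identity skip connection lets the input bypass the stacked convolutions, which is what allows information to survive many deep layers. This is an illustrative assumption, not the authors' AMRes implementation; the class name ResidualBlock1d, the 40-channel feature dimension, and the kernel size are hypothetical choices, and the paper's adaptive-window mechanism is not shown.

    # Hypothetical sketch, not the paper's code: one 1-D residual block.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualBlock1d(nn.Module):
        """Two stacked convolutions with an identity skip connection."""

        def __init__(self, channels: int, kernel_size: int = 3):
            super().__init__()
            padding = kernel_size // 2  # keep the time dimension unchanged
            self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=padding)
            self.bn1 = nn.BatchNorm1d(channels)
            self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=padding)
            self.bn2 = nn.BatchNorm1d(channels)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            out = F.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return F.relu(out + x)  # the skip path carries the input forward intact

    # Usage: a batch of 8 utterances, 40 filterbank features, 100 frames (all illustrative).
    x = torch.randn(8, 40, 100)
    y = ResidualBlock1d(channels=40)(x)
    print(y.shape)  # torch.Size([8, 40, 100])

Stacking many such blocks (the paper reports up to 19 convolutional layers) remains trainable because gradients can flow unimpeded through the identity paths.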

Data availability

These data were derived from the following resources available in the public domain: [https://datasets.activeloop.ai/docs/ml/datasets/timit-dataset/].

Acknowledgments

The authors did not receive support from any organization for the submitted work.

Author information

Corresponding author

Correspondence to Toktam Zoughi.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zoughi, T., Deypir, M. Mathematical analysis of AMRes: unlocking enhanced recognition across audio-visual domains. Int. j. inf. tecnol. (2024). https://doi.org/10.1007/s41870-024-01739-8
