A hybrid feature-extracted deep CNN with reduced parameters substitutes an End-to-End CNN for the recognition of spoken Bengali digits

Multimedia Tools and Applications

Abstract

Speech Recognition (SR) in native languages is an emerging field. Recognizing isolated words in a local language helps people use smartphones and electronic gadgets without technical or educational knowledge. This paper proposes a novel deep Convolutional Neural Network (CNN) architecture to classify ten spoken Bengali numerals. The proposed model achieves nearly the same prediction accuracy as an end-to-end CNN while training nine times fewer parameters. The raw audio samples are first pre-processed; then a hybrid feature combining Mel Frequency Cepstral Coefficients (MFCC), Spectral Sub-band Energy (SSE), and Log Spectral Sub-band Energy (LSSE) is extracted frame-wise and assembled into a vector. Finally, these vectors are fed to the proposed one-dimensional CNN, which achieves a highest test accuracy of 98.52%. The model was trained on our own speech corpus of 14000 spoken Bengali digits and on 30000 spoken English digits from the audio-MNIST dataset. Because it has several times fewer trainable parameters, the model attains high prediction accuracy at low computational cost. The proposed model is compared with several pre-trained deep learning models, and the results show its superiority. Source Code: https://github.com/BachchuPaul/Bengali-Isolated-Spoken-Digit.
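
To make the frame-wise hybrid feature more concrete, the following is a minimal NumPy sketch of how MFCC, SSE, and LSSE values could be computed per frame and concatenated. It is not the authors' implementation (see the linked source code for that). The frame length, hop size, FFT size, number of mel filters, number of MFCCs, the equal-width sub-band layout, and all function names are assumptions chosen only for illustration.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def power_spectrum(frames, n_fft):
    """Hamming-windowed power spectrum of each frame."""
    windowed = frames * np.hamming(frames.shape[1])
    return np.abs(np.fft.rfft(windowed, n=n_fft, axis=1)) ** 2 / n_fft

def subband_energy(pspec, n_bands):
    """Sum the power spectrum over n_bands equal-width bands (an assumed layout)."""
    bands = np.array_split(pspec, n_bands, axis=1)
    return np.stack([b.sum(axis=1) for b in bands], axis=1)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):       # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):      # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def dct_ii(x, n_out):
    """Unnormalised type-II DCT along axis 1, keeping the first n_out coefficients."""
    n = x.shape[1]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    return x @ basis.T

def hybrid_features(x, sr, frame_ms=25, hop_ms=10, n_fft=512,
                    n_mfcc=13, n_mels=26, n_bands=8):
    """Return one row per frame: [MFCC | SSE | LSSE]. All defaults are assumptions."""
    eps = 1e-10  # avoids log(0) on silent frames or empty filters
    frames = frame_signal(x, int(sr * frame_ms / 1000), int(sr * hop_ms / 1000))
    pspec = power_spectrum(frames, n_fft)
    sse = subband_energy(pspec, n_bands)
    lsse = np.log(sse + eps)
    mel_energy = pspec @ mel_filterbank(n_mels, n_fft, sr).T
    mfcc = dct_ii(np.log(mel_energy + eps), n_mfcc)
    return np.hstack([mfcc, sse, lsse])

# Usage on a synthetic 1-second, 440 Hz tone (a stand-in for a spoken digit).
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
feats = hybrid_features(tone, sr)
print(feats.shape)  # (98, 29): 98 frames, 13 MFCC + 8 SSE + 8 LSSE values each
```

The resulting matrix has one row per frame. Whether the rows are flattened, averaged, or passed as a sequence to the one-dimensional CNN is not stated in the abstract, so that step is left out of the sketch.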

Data availability

The audio-MNIST dataset analysed during the current study is available in the Kaggle repository at https://www.kaggle.com/datasets/sripaadsrinivasan/audio-mnist. The dataset originates from a Git repository, and its details are available at https://github.com/soerenab/AudioMNIST/blob/master/data/audioMNIST_meta.txt. All data generated or analysed during this study are included in the published article [3] (and its supplementary information files).

The spoken Bengali isolated-digit dataset that we created and analysed during the current study is available from the corresponding author on reasonable request.

References

  1. Abdel-Hamid O, Mohamed AR, Jiang H, Deng L, Penn G, Yu D (2014) Convolutional neural networks for speech recognition. IEEE/ACM Trans Audio, Speech, Language Process 22(10):1533–1545

  2. Ahammad K, Rahman MM (2016) Connected bangla speech recognition using artificial neural network. Int J Comput Appl 149(9):38–41

  3. Becker S, Ackermann M, Lapuschkin S, Müller KR, Samek W (2018) Interpreting and explaining deep neural networks for classification of audio signals. arXiv preprint arXiv:1807.03418

  4. Dikmese S, Sofotasios PC, Renfors M, Valkama M (2015) Subband energy based reduced complexity spectrum sensing under noise uncertainty and frequency-selective spectral characteristics. IEEE Trans Signal Process 64(1):131–145

  5. Ferrer L, Lei Y, McLaren M, Scheffer N (2015) Study of senone-based deep neural network approaches for spoken language recognition. IEEE/ACM Trans Audio, Speech, Language Process 24(1):105–116

  6. Gamit MR, Dhameliya K (2015) Isolated words recognition using MFCC, LPC and neural network. Int J Res Eng Technol 4(6):146–149

  7. Girshick R (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision (pp 1440–1448)

  8. Grozdić ĐT, Jovičić ST, Subotić M (2017) Whispered speech recognition using deep denoising autoencoder. Eng Appl Artif Intell 59:15–22

  9. Guiming D, Xia W, Guangyan W, Yan Z, Dan L (2016) Speech recognition based on convolutional neural networks. In 2016 IEEE International Conference on Signal and Image Processing (ICSIP) (pp 708–711). IEEE

  10. Gupta A, Sarkar K (2018) Recognition of spoken bengali numerals using MLP, SVM, RF based models with PCA based feature summarization. Int Arab J Inf Technol 15(2):263–269

  11. Kadyan V, Mantri A, Aggarwal RK, Singh A (2019) A comparative study of deep neural network based Punjabi-ASR system. Int J Speech Technol 22(1):111–119

  12. Kaur G, Srivastava M, Kumar A (2017) Speaker and speech recognition using deep neural network. Int J Emerg Res Manag Technol 6:8

  13. Kondhalkar H, Mukherji P (2019) A novel algorithm for speech recognition using tonal frequency cepstral coefficients based on human cochlea frequency map. J Eng Sci Technol 14(2):726–746

  14. Krishnamoorthy P, Prasanna SM (2011) Enhancement of noisy speech by temporal and spectral processing. Speech Commun 53(2):154–174

  15. Lisa NJ, Eity QN, Muhammad G, Huda MN, Rahman CM (2010) Performance evaluation of Bangla word recognition using different acoustic features. Int J Comput Sci Netw Secur 10:96–100

  16. Mahalingam H, Rajakumar M (2019) Speech recognition using multiscale scattering of audio signals and long short-term memory of neural networks. Int J Adv Comput Sci Cloud Comput 7:12–16

  17. Masmoudi S, Frikha M, Chtourou M, Hamida AB (2011) Efficient MLP constructive training algorithm using a neuron recruiting approach for isolated word recognition system. Int J Speech Technol 14(1):1–10

  18. Nagajyothi D, Siddaiah P (2018) Speech recognition using convolutional neural networks. Int J Eng Technol 7(4.6):133–137

  19. Nicolson A, Hanson J, Lyons J, Paliwal K (2018) Spectral subband centroids for robust speaker identification using marginalization-based missing feature theory. Int J Signal Process Syst 6(1):12–16

  20. Palaz D, Doss MM, Collobert R (2015) Convolutional neural networks-based continuous speech recognition using raw speech signal. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp 4295–4299). IEEE

  21. Paul B, Adhikary DD, Dey T, Guchhait S, Bera S (2022) Bangla Spoken Numerals Recognition by Using HMM. In Computational Intelligence in Pattern Recognition (pp 85–97). Springer, Singapore

  22. Paul B, Bera S, Paul R, Phadikar S (2021) Bengali spoken numerals recognition by MFCC and GMM technique. In Advances in Electronics, Communication and Computing (pp 85–96). Springer, Singapore

  23. Paul B, Dey T, Adhikary DD, Guchhait S, Bera S (2022) A novel approach of audio-visual color recognition using KNN. In Computational Intelligence in Pattern Recognition (pp 231–244). Springer, Singapore

  24. Paul B, Mukherjee H, Phadikar S, Roy K (2019) MFCC-Based Bangla Vowel Phoneme Recognition from Micro Clips. In International Conference on Intelligent Computing and Communication (pp 511–519). Springer, Singapore

  25. Paul B, Phadikar S, Bera S (2021) Indian regional spoken language identification using deep learning approach. In Proceedings of the Sixth International Conference on Mathematics and Computing (pp 263–274). Springer, Singapore

  26. Pawar GS, Morade SS (2014) Isolated English language digit recognition using hidden markov model toolkit. Int J Adv Res Comput Sci Softw Eng 4(6)

  27. Qadir JA, Al-Talabani AK, Aziz HA (2020) Isolated spoken word recognition using one-dimensional convolutional neural network. Int J Fuzzy Logic Intell Syst 20(4):272–277

  28. Sarma M (2017) Speech recognition using deep neural network-recent trends. Int J Intell Syst Des Comput 1(1-2):71–86

  29. Sharmin R, Rahut SK, Huq MR (2020) Bengali spoken digit classification: A deep learning approach using convolutional neural network. Proc Comput Sci 171:1381–1388

  30. Shukla S, Jain M (2021) A novel stochastic deep resilient network for effective speech recognition. Int J Speech Technol 1–10

  31. Si S, Wang J, Sun H, Wu J, Zhang C, Qu X, Cheng N, Chen L, Xiao J (2021) Variational information bottleneck for effective low-resource audio classification. arXiv preprint arXiv:2107.04803

  32. Siniscalchi SM, Yu D, Deng L, Lee CH (2013) Exploiting deep neural networks for detection-based speech recognition. Neurocomputing 106:148–157

  33. Song Z (2020) English speech recognition based on deep learning with multiple features. Computing 102(3):663–682

  34. Sumon SA, Chowdhury J, Debnath S, Mohammed N, Momen S (2018) Bangla short speech commands recognition using convolutional neural networks. In 2018 international conference on bangla speech and language processing (ICBSLP) (pp 1–6). IEEE

  35. Tripathi AM, Paul K (2022) When sub-band features meet attention mechanism while knowledge distillation for sound classification. Appl Acoust 195:108813

  36. Vani HY, Anusuya MA (2020) Fuzzy speech recognition: a review. Int J Comput Appl 177(47):39–54

  37. Veisi H, Mani AH (2020) Persian speech recognition using deep learning. Int J Speech Technol 23(4):893–905

Funding

The authors did not receive funds, grants, or other support from any organization for the submitted work, including the conduct of the study and the preparation of the manuscript.

Author information

Corresponding author

Correspondence to Bachchu Paul.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest regarding the preparation and submission of the manuscript.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Paul, B., Phadikar, S. A hybrid feature-extracted deep CNN with reduced parameters substitutes an End-to-End CNN for the recognition of spoken Bengali digits. Multimed Tools Appl 83, 1669–1692 (2024). https://doi.org/10.1007/s11042-023-15598-1

  • DOI: https://doi.org/10.1007/s11042-023-15598-1
