Abstract
Speech Recognition (SR) is an emerging field in the native language nowadays. Recognizing isolated words in the local language helps people use smartphones and electronic gadgets without technical or educational knowledge. This paper proposes a novel deep Convolutional Neural Network (CNN) architecture to classify ten spoken Bengali numerals. The proposed model generates almost similar prediction accuracy as compared to an end-to-end CNN with nine times fewer parameters has been trained. Here, the raw audio samples are pre-processed, and then a unique hybrid feature of Mel Frequency Cepstral Coefficients (MFCC), Spectral Sub-band Energy (SSE), and Log Spectral Sub-band Energy (LSSE) have been extracted frame-wise and engendered into a vector. Finally, these vectors are fed to the proposed architecture of a one-dimensional CNN and achieve the highest test accuracy of 98.52%. The model has been trained for our created speech corpus of 14000 spoken Bengali digits and 30000 spoken English digits from the audio-MNIST dataset. The proposed neural model generates high prediction accuracy with a few times fewer parameters to be trained, generating low computational costs. The outcome of the proposed model is compared with several pre-trained deep learning models; the result shows the model's superiority. Source Code: https://github.com/BachchuPaul/Bengali-Isolated-Spoken-Digit.
Similar content being viewed by others
Data availability
The audio-MNIST dataset generated during and/or analysed during the current study are available in the kaggle repository from the web link: https://www.kaggle.com/datasets/sripaadsrinivasan/audio-mnist. That dataset is originally git repository and details of the data set are available in https://github.com/soerenab/AudioMNIST/blob/master/data/audioMNIST_meta.txt. All data generated or analysed during this study are included in this published article [3] (and its supplementary information files).
Our own created dataset of spoken Bengali isolated digit dataset is generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
References
Abdel-Hamid O, Mohamed AR, Jiang H, Deng L, Penn G, Yu D (2014) Convolutional neural networks for speech recognition. IEEE/ACM Trans Audio, Speech, Language Process 22(10):1533–1545
Ahammad K, Rahman MM (2016) Connected bangla speech recognition using artificial neural network. Int J Comput Appl 149(9):38–41
Becker S, Ackermann M, Lapuschkin S, Müller KR, Samek W (2018) Interpreting and explaining deep neural networks for classification of audio signals. arXiv preprint arXiv:1807.03418
Dikmese S, Sofotasios PC, Renfors M, Valkama M (2015) Subband energy based reduced complexity spectrum sensing under noise uncertainty and frequency-selective spectral characteristics. IEEE Trans Signal Process 64(1):131–145
Ferrer L, Lei Y, McLaren M, Scheffer N (2015) Study of senone-based deep neural network approaches for spoken language recognition. IEEE/ACM Trans Audio, Speech, Language Process 24(1):105–116
Gamit MR, Dhameliya K (2015) Isolated words recognition using MFCC, LPC and neural network. Int J Res Eng Technol 4(6):146–149
Girshick R (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision (pp 1440–1448)
Grozdić ĐT, Jovičić ST, Subotić M (2017) Whispered speech recognition using deep denoising autoencoder. Eng Appl Artif Intell 59:15–22
Guiming D, Xia W, Guangyan W, Yan Z, Dan L (2016) Speech recognition based on convolutional neural networks. In 2016 IEEE International Conference on Signal and Image Processing (ICSIP) (pp 708-711). IEEE
Gupta A, Sarkar K (2018) Recognition of spoken bengali numerals using MLP, SVM, RF based models with PCA based feature summarization. Int Arab J Inf Technol 15(2):263–269
Kadyan V, Mantri A, Aggarwal RK, Singh A (2019) A comparative study of deep neural network based Punjabi-ASR system. Int J Speech Technol 22(1):111–119
Kaur G, Srivastava M, Kumar A (2017) Speaker and speech recognition using deep neural network. Int J Emerg Res Manag Technol 6:8
Kondhalkar H, Mukherji P (2019) A novel algorithm for speech recognition using tonal frequency cepstral coefficients based on human cochlea frequency map. J Eng Sci Technol 14(2):726–746
Krishnamoorthy P, Prasanna SM (2011) Enhancement of noisy speech by temporal and spectral processing. Speech Commun 53(2):154–174
Lisa NJ, Eity QN, Muhammad G, Huda MN, Rahman CM (2010) Performance evaluation of Bangla word recognition using different acoustic features. Int J Comput Sci Netw Secur 10:96–100
Mahalingam H, Rajakumar M (2019) Speech recognition using multiscale scattering of audio signals and long short-term memory 0f neural networks. Int J Adv Comput Sci Cloud Comput 7:12–16
Masmoudi S, Frikha M, Chtourou M, Hamida AB (2011) Efficient MLP constructive training algorithm using a neuron recruiting approach for isolated word recognition system. Int J Speech Technol 14(1):1–10
Nagajyothi D, Siddaiah P (2018) Speech recognition using convolutional neural networks. Int J Eng Technol 7(4.6):133–137
Nicolson A, Hanson J, Lyons J, Paliwal K (2018) Spectral subband centroids for robust speaker identification using marginalization-based missing feature theory. Int J Signal Process Syst 6(1):12–16
Palaz D, Doss MM, Collobert R (2015) Convolutional neural networks-based continuous speech recognition using raw speech signal. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp 4295–4299). IEEE
Paul B, Adhikary DD, Dey T, Guchhait S, Bera S (2022) Bangla Spoken Numerals Recognition by Using HMM. In Computational Intelligence in Pattern Recognition (pp 85–97). Springer, Singapore
Paul B, Bera S, Paul R, Phadikar S (2021) Bengali spoken numerals recognition by MFCC and GMM technique. In Advances in Electronics, Communication and Computing (pp 85–96). Springer, Singapore
Paul B, Dey T, Adhikary DD, Guchhai S, Bera S (2022) A novel approach of audio-visual color recognition using KNN. In Computational Intelligence in Pattern Recognition (pp 231–244). Springer, Singapore
Paul B, Mukherjee H, Phadikar S, Roy K (2019) MFCC-Based Bangla Vowel Phoneme Recognition from Micro Clips. In International Conference on Intelligent Computing and Communication (pp 511–519). Springer, Singapore
Paul B, Phadikar S, Bera S (2021) Indian regional spoken language identification using deep learning approach. In Proceedings of the Sixth International Conference on Mathematics and Computing (pp 263–274). Springer, Singapore
Pawar GS, Morade SS (2014) Isolated English language digit recognition using hidden markov model toolkit. Int J Adv Res Comput Sci Softw Eng Jaunpur-222001, Uttar Pradesh, India, 4(6)
Qadir JA, Al-Talabani AK, Aziz HA (2020) Isolated spoken word recognition using one-dimensional convolutional neural network. Int J Fuzzy Logic Intell Syst 20(4):272–277
Sarma M (2017) Speech recognition using deep neural network-recent trends. Int J Intell Syst Des Comput 1(1-2):71–86
Sharmin R, Rahut SK, Huq MR (2020) Bengali spoken digit classification: A deep learning approach using convolutional neural network. Proc Comput Sci 171:1381–1388
Shukla S, Jain M (2021) A novel stochastic deep resilient network for effective speech recognition. Int J Speech Technol 1–10
Si S, Wang J, Sun H, Wu J, Zhang C, Qu X, Cheng N, Chen L, Xiao J (2021) Variational information bottleneck for effective low-resource audio classification. arXiv preprint arXiv:2107.04803
Siniscalchi SM, Yu D, Deng L, Lee CH (2013) Exploiting deep neural networks for detection-based speech recognition. Neurocomputing 106:148–157
Song Z (2020) English speech recognition based on deep learning with multiple features. Computing 102(3):663–682
Sumon SA, Chowdhury J, Debnath S, Mohammed N, Momen S (2018) Bangla short speech commands recognition using convolutional neural networks. In 2018 international conference on bangla speech and language processing (ICBSLP) (pp 1–6). IEEE
Tripathi AM, Paul K (2022) When sub-band features meet attention mechanism while knowledge distillation for sound classification. Appl Acoust 195:108813
Vani HY, Anusuya MA (2020) Fuzzy speech recognition: a review. Int J Comput Appl 177(47):39–54
Veisi H, Mani AH (2020) Persian speech recognition using deep learning. Int J Speech Technol 23(4):893–905
Funding
The authors did not receive support from any organization for the submitted work. No funding was received to assist with the preparation of this manuscript. No funding was received for conducting this study. No funds, grants, or other support was received.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
There is no conflict of Interest between the authors regarding the manuscript preparation and submission.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Paul, B., Phadikar, S. A hybrid feature-extracted deep CNN with reduced parameters substitutes an End-to-End CNN for the recognition of spoken Bengali digits. Multimed Tools Appl 83, 1669–1692 (2024). https://doi.org/10.1007/s11042-023-15598-1
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11042-023-15598-1