A hybrid feature-extracted deep CNN with reduced parameters substitutes an End-to-End CNN for the recognition of spoken Bengali digits

Paul, Bachchu; Phadikar, Santanu

doi:10.1007/s11042-023-15598-1

Published: 09 May 2023

Volume 83, pages 1669–1692, (2024)

Multimedia Tools and Applications

Abstract

Speech recognition (SR) for native languages is an emerging field. Recognizing isolated words in a local language helps people use smartphones and electronic gadgets without technical or educational knowledge. This paper proposes a novel deep Convolutional Neural Network (CNN) architecture to classify the ten spoken Bengali numerals. The proposed model achieves nearly the same prediction accuracy as an end-to-end CNN while training about nine times fewer parameters. The raw audio samples are pre-processed, and a hybrid feature set of Mel Frequency Cepstral Coefficients (MFCC), Spectral Sub-band Energy (SSE), and Log Spectral Sub-band Energy (LSSE) is extracted frame-wise and assembled into a vector. These vectors are fed to the proposed one-dimensional CNN, which achieves a highest test accuracy of 98.52%. The model was trained on our own speech corpus of 14,000 spoken Bengali digits and on 30,000 spoken English digits from the audio-MNIST dataset. Because it has several times fewer trainable parameters, the proposed model attains high prediction accuracy at low computational cost. Its results are compared with those of several pre-trained deep learning models, and the comparison shows the proposed model's superiority. Source Code: https://github.com/BachchuPaul/Bengali-Isolated-Spoken-Digit.
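The frame-wise hybrid feature described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (that is in the linked repository). The function names, the 16 kHz sampling rate, the 25 ms frame and 10 ms hop, the 13 MFCCs, the 26 mel filters, and the 8 equal-width sub-bands are all assumptions chosen for illustration, since the abstract does not state these settings.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping Hamming-windowed frames."""
    if len(x) < frame_len:
        x = np.pad(x, (0, frame_len - len(x)))
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):       # rising slope
            fb[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):      # falling slope
            fb[i - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def dct_ii(x, n_out):
    """Unnormalised DCT-II along the last axis, keeping n_out coefficients."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    return x @ basis.T

def hybrid_features(x, sr=16000, n_fft=512, n_mfcc=13, n_mels=26, n_bands=8):
    """Return one row per frame: [MFCC | SSE | LSSE] (all sizes are assumed)."""
    eps = 1e-10  # avoids log(0) on silent frames
    frames = frame_signal(np.asarray(x, dtype=float))
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    # MFCC: log mel-band energies followed by a DCT
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sr).T
    mfcc = dct_ii(np.log(mel_energy + eps), n_mfcc)
    # SSE: total power inside each equal-width frequency sub-band
    bands = np.array_split(power, n_bands, axis=1)
    sse = np.stack([b.sum(axis=1) for b in bands], axis=1)
    # LSSE: logarithm of the sub-band energies
    lsse = np.log(sse + eps)
    return np.concatenate([mfcc, sse, lsse], axis=1)

# One second of noise stands in for a recorded digit.
signal = np.random.default_rng(0).standard_normal(16000)
features = hybrid_features(signal)
print(features.shape)  # (98, 29)
```

Each row holds 13 + 8 + 8 = 29 values for one frame. How the frames are combined into the single vector given to the one-dimensional CNN is not specified in the abstract; flattening the matrix or pooling over frames are two plausible options.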

Data availability

The audio-MNIST dataset analysed during the current study is available in the Kaggle repository at https://www.kaggle.com/datasets/sripaadsrinivasan/audio-mnist. The dataset originates from a Git repository, and its details are available at https://github.com/soerenab/AudioMNIST/blob/master/data/audioMNIST_meta.txt. All data generated or analysed during this study are included in this published article [3] (and its supplementary information files).

Our own dataset of isolated spoken Bengali digits, generated and analysed during the current study, is available from the corresponding author on reasonable request.

References

1. Abdel-Hamid O, Mohamed AR, Jiang H, Deng L, Penn G, Yu D (2014) Convolutional neural networks for speech recognition. IEEE/ACM Trans Audio, Speech, Language Process 22(10):1533–1545
2. Ahammad K, Rahman MM (2016) Connected bangla speech recognition using artificial neural network. Int J Comput Appl 149(9):38–41
3. Becker S, Ackermann M, Lapuschkin S, Müller KR, Samek W (2018) Interpreting and explaining deep neural networks for classification of audio signals. arXiv preprint arXiv:1807.03418
4. Dikmese S, Sofotasios PC, Renfors M, Valkama M (2015) Subband energy based reduced complexity spectrum sensing under noise uncertainty and frequency-selective spectral characteristics. IEEE Trans Signal Process 64(1):131–145
5. Ferrer L, Lei Y, McLaren M, Scheffer N (2015) Study of senone-based deep neural network approaches for spoken language recognition. IEEE/ACM Trans Audio, Speech, Language Process 24(1):105–116
6. Gamit MR, Dhameliya K (2015) Isolated words recognition using MFCC, LPC and neural network. Int J Res Eng Technol 4(6):146–149
7. Girshick R (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision (pp 1440–1448)
8. Grozdić ĐT, Jovičić ST, Subotić M (2017) Whispered speech recognition using deep denoising autoencoder. Eng Appl Artif Intell 59:15–22
9. Guiming D, Xia W, Guangyan W, Yan Z, Dan L (2016) Speech recognition based on convolutional neural networks. In 2016 IEEE International Conference on Signal and Image Processing (ICSIP) (pp 708–711). IEEE
10. Gupta A, Sarkar K (2018) Recognition of spoken bengali numerals using MLP, SVM, RF based models with PCA based feature summarization. Int Arab J Inf Technol 15(2):263–269
11. Kadyan V, Mantri A, Aggarwal RK, Singh A (2019) A comparative study of deep neural network based Punjabi-ASR system. Int J Speech Technol 22(1):111–119
12. Kaur G, Srivastava M, Kumar A (2017) Speaker and speech recognition using deep neural network. Int J Emerg Res Manag Technol 6:8
13. Kondhalkar H, Mukherji P (2019) A novel algorithm for speech recognition using tonal frequency cepstral coefficients based on human cochlea frequency map. J Eng Sci Technol 14(2):726–746
14. Krishnamoorthy P, Prasanna SM (2011) Enhancement of noisy speech by temporal and spectral processing. Speech Commun 53(2):154–174
15. Lisa NJ, Eity QN, Muhammad G, Huda MN, Rahman CM (2010) Performance evaluation of Bangla word recognition using different acoustic features. Int J Comput Sci Netw Secur 10:96–100
16. Mahalingam H, Rajakumar M (2019) Speech recognition using multiscale scattering of audio signals and long short-term memory of neural networks. Int J Adv Comput Sci Cloud Comput 7:12–16
17. Masmoudi S, Frikha M, Chtourou M, Hamida AB (2011) Efficient MLP constructive training algorithm using a neuron recruiting approach for isolated word recognition system. Int J Speech Technol 14(1):1–10
18. Nagajyothi D, Siddaiah P (2018) Speech recognition using convolutional neural networks. Int J Eng Technol 7(4.6):133–137
19. Nicolson A, Hanson J, Lyons J, Paliwal K (2018) Spectral subband centroids for robust speaker identification using marginalization-based missing feature theory. Int J Signal Process Syst 6(1):12–16
20. Palaz D, Doss MM, Collobert R (2015) Convolutional neural networks-based continuous speech recognition using raw speech signal. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp 4295–4299). IEEE
21. Paul B, Adhikary DD, Dey T, Guchhait S, Bera S (2022) Bangla Spoken Numerals Recognition by Using HMM. In Computational Intelligence in Pattern Recognition (pp 85–97). Springer, Singapore
22. Paul B, Bera S, Paul R, Phadikar S (2021) Bengali spoken numerals recognition by MFCC and GMM technique. In Advances in Electronics, Communication and Computing (pp 85–96). Springer, Singapore
23. Paul B, Dey T, Adhikary DD, Guchhai S, Bera S (2022) A novel approach of audio-visual color recognition using KNN. In Computational Intelligence in Pattern Recognition (pp 231–244). Springer, Singapore
24. Paul B, Mukherjee H, Phadikar S, Roy K (2019) MFCC-Based Bangla Vowel Phoneme Recognition from Micro Clips. In International Conference on Intelligent Computing and Communication (pp 511–519). Springer, Singapore
25. Paul B, Phadikar S, Bera S (2021) Indian regional spoken language identification using deep learning approach. In Proceedings of the Sixth International Conference on Mathematics and Computing (pp 263–274). Springer, Singapore
26. Pawar GS, Morade SS (2014) Isolated English language digit recognition using hidden markov model toolkit. Int J Adv Res Comput Sci Softw Eng Jaunpur-222001, Uttar Pradesh, India, 4(6)
27. Qadir JA, Al-Talabani AK, Aziz HA (2020) Isolated spoken word recognition using one-dimensional convolutional neural network. Int J Fuzzy Logic Intell Syst 20(4):272–277
28. Sarma M (2017) Speech recognition using deep neural network-recent trends. Int J Intell Syst Des Comput 1(1–2):71–86
29. Sharmin R, Rahut SK, Huq MR (2020) Bengali spoken digit classification: A deep learning approach using convolutional neural network. Proc Comput Sci 171:1381–1388
30. Shukla S, Jain M (2021) A novel stochastic deep resilient network for effective speech recognition. Int J Speech Technol 1–10
31. Si S, Wang J, Sun H, Wu J, Zhang C, Qu X, Cheng N, Chen L, Xiao J (2021) Variational information bottleneck for effective low-resource audio classification. arXiv preprint arXiv:2107.04803
32. Siniscalchi SM, Yu D, Deng L, Lee CH (2013) Exploiting deep neural networks for detection-based speech recognition. Neurocomputing 106:148–157
33. Song Z (2020) English speech recognition based on deep learning with multiple features. Computing 102(3):663–682
34. Sumon SA, Chowdhury J, Debnath S, Mohammed N, Momen S (2018) Bangla short speech commands recognition using convolutional neural networks. In 2018 international conference on bangla speech and language processing (ICBSLP) (pp 1–6). IEEE
35. Tripathi AM, Paul K (2022) When sub-band features meet attention mechanism while knowledge distillation for sound classification. Appl Acoust 195:108813
36. Vani HY, Anusuya MA (2020) Fuzzy speech recognition: a review. Int J Comput Appl 177(47):39–54
37. Veisi H, Mani AH (2020) Persian speech recognition using deep learning. Int J Speech Technol 23(4):893–905

Funding

The authors did not receive funds, grants, or other support from any organization for conducting this study or for preparing this manuscript.

Author information

Authors and Affiliations

Department of Computer Science, Vidyasagar University, Midnapore, West Bengal, 721102, India
Bachchu Paul
Department of Computer Science & Engineering, Maulana Abul Kalam Azad University of Technology, West Bengal, BF-142, Sector-I, Salt Lake, Kolkata, 700064, India
Santanu Phadikar

Corresponding author

Correspondence to Bachchu Paul.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest regarding the preparation and submission of this manuscript.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Paul, B., Phadikar, S. A hybrid feature-extracted deep CNN with reduced parameters substitutes an End-to-End CNN for the recognition of spoken Bengali digits. Multimed Tools Appl 83, 1669–1692 (2024). https://doi.org/10.1007/s11042-023-15598-1

Received: 03 August 2022
Revised: 16 February 2023
Accepted: 21 April 2023
Published: 09 May 2023
Issue Date: January 2024
DOI: https://doi.org/10.1007/s11042-023-15598-1
