Spoken word recognition using a novel speech boundary segment of voiceless articulatory consonants

Paul, Bachchu; Guchhait, Sumita; Maity, Sandipan; Laya, Biswajit; Ghorai, Anudyuti; Sarkar, Anish; Nandi, Utpal

doi:10.1007/s41870-024-01776-3

Original Research
Published: 17 March 2024

Volume 16, pages 2661–2673, (2024)
International Journal of Information Technology

Bachchu Paul ORCID: orcid.org/0000-0002-4485-3393¹,
Sumita Guchhait²,
Sandipan Maity³,
Biswajit Laya⁴,
Anudyuti Ghorai⁴,
Anish Sarkar¹ &
Utpal Nandi¹

Abstract

Speech offers the most straightforward channel for man-machine interaction, yet it remains a barrier for languages with low data resources. Extracting features from the silent portions of a speech signal, and processing them, is unnecessary extra effort, and noise in the signal reduces classification accuracy. Silence and noise are therefore removed from the signal to improve recognition. Current approaches, however, rely on static Zero-Crossing-Rate (ZCR) and energy values for this detection. Analysis of the speech signal shows that fixed ZCR and energy values do not accurately delineate the speech boundary during the articulation of unvoiced consonants. In this study, dynamic values of ZCR and energy are derived to overcome this problem. First, a rough spoken region is identified in each speech signal using non-overlapping frames. Second, the dynamic values are derived by two novel algorithms. Two standard datasets are used: the Free Spoken Digit Dataset (FSDD), containing spoken words in English, and the Bangla 0 to 99 Dataset (Bangla Dataset), containing spoken words in Bengali. Mel Frequency Cepstral Coefficients (MFCC) are extracted from each raw signal and from the corresponding signal pre-processed by the proposed method. These features are then input into a Bidirectional Long Short-Term Memory (BiLSTM) network. The results show the superiority of the proposed pre-processing methods.
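
To make the framing concrete, the sketch below computes short-time energy and ZCR over non-overlapping frames and marks a rough spoken region with fixed thresholds. It illustrates only the static-threshold baseline that the abstract describes; the two algorithms that derive dynamic values are not specified here and are not reproduced. The sampling rate, frame length, threshold values, function names, and the synthetic test signal are all assumptions chosen for illustration.

```python
import math

RATE = 8000        # assumed sampling rate in Hz
FRAME_LEN = 200    # assumed frame length: 25 ms at 8 kHz, no overlap


def tone(freq, amp, n_samples):
    """Synthetic sine segment, used only to build a toy test signal."""
    return [amp * math.sin(2 * math.pi * freq * n / RATE) for n in range(n_samples)]


def frame_features(signal, frame_len=FRAME_LEN):
    """Return an (energy, zcr) pair for each non-overlapping frame."""
    features = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        # Short-time energy: mean of squared samples in the frame.
        energy = sum(x * x for x in frame) / frame_len
        # ZCR: fraction of adjacent sample pairs whose signs differ.
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        features.append((energy, crossings / (frame_len - 1)))
    return features


def rough_speech_region(features, energy_thr, zcr_thr):
    """First and last frame index flagged as speech by fixed thresholds.

    A frame is flagged if its energy exceeds energy_thr or its ZCR exceeds
    zcr_thr. Returns None when no frame is flagged.
    """
    active = [i for i, (e, z) in enumerate(features) if e > energy_thr or z > zcr_thr]
    return (active[0], active[-1]) if active else None


# Toy "word": silence, a weak high-frequency onset (stand-in for a voiceless
# consonant), a strong low-frequency nucleus (stand-in for a vowel), silence.
signal = [0.0] * 400 + tone(3000, 0.05, 400) + tone(100, 0.8, 800) + [0.0] * 400
features = frame_features(signal)

# Hypothetical fixed thresholds; an infinite ZCR threshold disables the ZCR test.
energy_only = rough_speech_region(features, energy_thr=0.01, zcr_thr=float("inf"))
energy_and_zcr = rough_speech_region(features, energy_thr=0.01, zcr_thr=0.3)
print(energy_only, energy_and_zcr)  # prints (4, 7) (2, 7)
```

In this toy signal the energy-only boundary starts at frame 4 and clips the weak onset in frames 2 and 3, while adding the ZCR test recovers it. The outcome depends entirely on the hand-picked thresholds, and real silence with background noise can also show a high ZCR, which is the kind of sensitivity that motivates deriving the values from the signal itself.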

Data availability

The FSDD is available in the Kaggle repository at https://www.kaggle.com/datasets/joserzapata/free-spoken-digit-dataset-fsdd. The “Bangla spoken 0-99 number” dataset analyzed during the current study is available in the Kaggle repository at https://www.kaggle.com/datasets/piasroy/bangla-spoken-099-numbers.

Acknowledgments

The authors would like to thank the Department of Computer Science, Vidyasagar University, for providing the laboratory facilities used to conduct the experiments. We also thank the volunteers who helped with the audio data recording.

Funding

The authors did not receive funds, grants, or other support from any organization for conducting this study or for preparing this manuscript.

Author information

Authors and Affiliations

Department of Computer Science, Vidyasagar University, Midnapore, 721102, West Bengal, India
Bachchu Paul, Anish Sarkar & Utpal Nandi
Department of BCA, Belda College, Paschim Medinipur, Belda, 721424, West Bengal, India
Sumita Guchhait
Department of Computer Science, Debra Thana Sahid Kshudiram Smriti Mahavidyalaya, Debra, 721126, West Bengal, India
Sandipan Maity
Department of Computer Science (BCA), Kharagpur College, Inda, Kharagpur, 721305, West Bengal, India
Biswajit Laya & Anudyuti Ghorai

Contributions

Conceptualization, problem statement analysis, methodology, and experimental implementation: Bachchu Paul. Manuscript preparation, language editing, figures, and charts: Bachchu Paul, Sumita Guchhait, and Anish Sarkar. Proofreading, typesetting, and drafting: Bachchu Paul, Sandipan Maity, Biswajit Laya, and Anudyuti Ghorai. Responses to reviewers' comments: Bachchu Paul and Utpal Nandi.

Corresponding author

Correspondence to Bachchu Paul.

Ethics declarations

Conflict of interest

The authors have no conflict of interest regarding this manuscript's preparation and submission. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical approval

Not applicable.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Paul, B., Guchhait, S., Maity, S. et al. Spoken word recognition using a novel speech boundary segment of voiceless articulatory consonants. Int. j. inf. tecnol. 16, 2661–2673 (2024). https://doi.org/10.1007/s41870-024-01776-3

Received: 17 August 2023
Accepted: 17 November 2023
Published: 17 March 2024
Issue Date: April 2024
DOI: https://doi.org/10.1007/s41870-024-01776-3
