Spoken word recognition using a novel speech boundary segment of voiceless articulatory consonants

  • Original Research
  • Published in: International Journal of Information Technology

Abstract

Communication through speech offers the most straightforward channel for human-machine interaction. Nevertheless, it remains a challenge for languages with scarce data resources. Extracting features from, and processing, silent regions of a speech signal is wasted effort, and noise in the signal reduces classification accuracy; silence and noise are therefore removed from the signal to improve recognition. Current approaches, however, rely on static Zero-Crossing Rate (ZCR) and energy thresholds for this detection. Analysis of the speech signal shows that fixed ZCR and energy values do not correctly delineate the boundaries of unvoiced consonants: static thresholds fail to identify the speech boundary during the articulation of these sounds. In this study, dynamic ZCR and energy values are therefore derived to overcome this problem. First, a rough spoken region is identified in each speech signal from non-overlapping frames; second, the dynamic values are derived by two novel algorithms. Two standard datasets of spoken words, the Free Spoken Digit Dataset (FSDD) in English and the Bangla 0 to 99 Dataset (Bangla Dataset) in Bengali, are used in this study. Mel Frequency Cepstral Coefficients (MFCC) are extracted from each raw signal and from the proposed pre-processed signal, and these features are fed to a Bidirectional Long Short-Term Memory (BiLSTM) network. The results show the superiority of the proposed pre-processing method.
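To make the pre-processing idea concrete, the following is a minimal, illustrative sketch of frame-wise short-time energy and ZCR with per-utterance (dynamic) thresholds. The mean-scaled threshold rule, frame size, and keep/drop logic here are assumptions for illustration only; the paper derives its dynamic values with two dedicated algorithms not reproduced here.

```python
# Illustrative endpoint detection with dynamic ZCR/energy thresholds.
# NOTE: the mean-scaled thresholds below are assumptions for illustration;
# the paper's actual dynamic values come from two novel algorithms.
import numpy as np
import librosa

def detect_speech_region(y, sr, frame_len=400, hop=400):
    # Non-overlapping frames (hop == frame length), matching the rough
    # first pass described in the abstract.
    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop)
    energy = np.sum(frames ** 2, axis=0)  # short-time energy per frame
    # Fraction of sample-to-sample sign changes per frame (ZCR).
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=0)) > 0, axis=0)

    # Dynamic thresholds derived from this utterance (assumed rule:
    # fractions of the per-signal means rather than fixed constants).
    e_thr = 0.1 * energy.mean()
    z_thr = 1.5 * zcr.mean()

    # Keep a frame if it is energetic (voiced) OR has high ZCR:
    # voiceless consonants are low-energy but high-ZCR, so an
    # energy-only rule would clip them at the word boundary.
    keep = (energy > e_thr) | (zcr > z_thr)
    idx = np.flatnonzero(keep)
    if idx.size == 0:
        return y  # fall back to the raw signal
    start, end = idx[0] * hop, (idx[-1] + 1) * hop
    return y[start:end]
```

The OR-combination is the key point the abstract motivates: a fixed energy gate alone discards low-energy voiceless consonant segments, whereas the high ZCR of frication retains them.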

Data availability

The FSDD is available in the Kaggle repository at https://www.kaggle.com/datasets/joserzapata/free-spoken-digit-dataset-fsdd. The “Bangla spoken 0-99 number” dataset generated and/or analyzed during the current study is available in the Kaggle repository at https://www.kaggle.com/datasets/piasroy/bangla-spoken-099-numbers.
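For readers who want to try the downstream pipeline on these datasets, a minimal sketch is shown below: load a clip, extract MFCCs, and feed padded sequences to a small BiLSTM classifier. The 8 kHz rate matches FSDD recordings, but the layer widths, number of coefficients, and training settings are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal MFCC -> BiLSTM pipeline sketch (hyperparameters are assumptions).
import numpy as np
import librosa
import tensorflow as tf

def mfcc_sequence(path, sr=8000, n_mfcc=13):
    """Load one clip (FSDD is 8 kHz audio) and return a (frames, n_mfcc) array."""
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def build_bilstm(max_frames, n_mfcc=13, n_classes=10):
    """Tiny BiLSTM classifier; layer sizes are illustrative only."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(max_frames, n_mfcc)),
        tf.keras.layers.Masking(mask_value=0.0),  # ignore zero padding
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

# Example usage: pad variable-length MFCC sequences, then train.
# seqs = [mfcc_sequence(p) for p in wav_paths]          # wav_paths: your files
# X = tf.keras.preprocessing.sequence.pad_sequences(
#     seqs, padding="post", dtype="float32")
# model = build_bilstm(X.shape[1])
# model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
#               metrics=["accuracy"])
# model.fit(X, labels, epochs=20, validation_split=0.2)
```

Running this once on the raw clips and once on clips trimmed by the dynamic-threshold pre-processing allows the comparison the paper reports.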

Acknowledgments

The authors thank the Department of Computer Science, Vidyasagar University, for providing the laboratory facilities used to conduct the experiments. We also thank the volunteers who helped with the audio data recording.

Funding

The authors declare that no funds, grants, or other support were received from any organization for the conduct of this study or the preparation of this manuscript.

Author information

Contributions

Conceptualization, problem statement analysis, methodology, and experimental implementation: Bachchu Paul. Manuscript preparation, language editing, figures, and charts: Bachchu Paul, Sumita Guchhait, and Anish Sarkar. Proofreading, typesetting, and drafting: Bachchu Paul, Sandipan Maity, Biswajit Laya, and Anudyuti Ghorai. Responses to reviewers' comments: Bachchu Paul and Utpal Nandi.

Corresponding author

Correspondence to Bachchu Paul.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest regarding the preparation and submission of this manuscript, and no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical approval

Not applicable.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Paul, B., Guchhait, S., Maity, S. et al. Spoken word recognition using a novel speech boundary segment of voiceless articulatory consonants. Int. j. inf. tecnol. 16, 2661–2673 (2024). https://doi.org/10.1007/s41870-024-01776-3
