Abstract
The aim of this paper is to explore a robust method for vowel region detection from multimode speech. In realistic scenarios, speech can be classified into three modes, namely conversation, extempore, and read. Existing methods detect vowels from speech recorded in a clean environment, which may not be appropriate for multimode speech tasks. To address this issue, we propose an approach based on continuous wavelet transform (CWT) coefficients and phone boundaries for detecting the vowel regions in different modes of the speech signal. The proposed vowel region (VR) detection technique is evaluated on the TIMIT (read speech) and Bengali (read, extempore, and conversation speech) corpora and compared with state-of-the-art methods. The experiments show a significant performance gain for the proposed technique over the state-of-the-art methods. The usefulness of the proposed technique is demonstrated by extracting vocal tract and excitation source features from the automatically detected VRs to develop a multilingual speech mode classification (MSMC) model. The evaluation results show that the MSMC model performs significantly better when features are extracted from the vowel regions rather than from the entire speech utterance.
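To make the notion of "CWT coefficients" concrete, the following is a minimal Python sketch, assuming NumPy and a Ricker (Mexican hat) mother wavelet. The wavelet choice, the scales, the 20 ms frame size, and the relative threshold are illustrative assumptions, and the helper names (`ricker`, `cwt`, `candidate_regions`) are hypothetical. This is not the authors' VR detection method: it omits the phone-boundary stage entirely and only marks frames whose summed CWT magnitude is high as candidate vowel-like regions.

```python
import numpy as np


def ricker(points: int, a: float) -> np.ndarray:
    """Ricker (Mexican hat) wavelet of width `a`, sampled at `points` samples."""
    t = np.arange(points) - (points - 1) / 2.0
    norm = 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)
    return norm * (1.0 - (t / a) ** 2) * np.exp(-(t ** 2) / (2.0 * a ** 2))


def cwt(signal: np.ndarray, scales) -> np.ndarray:
    """CWT coefficients as a (len(scales), len(signal)) array, via direct convolution."""
    out = np.empty((len(scales), len(signal)))
    for i, a in enumerate(scales):
        # Truncate the wavelet to about 10 widths, but never longer than the signal,
        # so that mode="same" returns exactly len(signal) samples.
        n = min(max(1, int(np.ceil(10 * a))), len(signal))
        out[i] = np.convolve(signal, ricker(n, a), mode="same")
    return out


def candidate_regions(signal, fs, scales, frame_ms=20.0, rel_threshold=0.3):
    """Hypothetical heuristic: return (start_s, end_s) spans of high CWT energy.

    This is NOT the published VR detector; it ignores phone boundaries and
    only thresholds the frame-averaged CWT magnitude.
    """
    contour = np.abs(cwt(signal, scales)).sum(axis=0)  # magnitude summed over scales
    hop = max(1, int(fs * frame_ms / 1000))
    n_frames = len(signal) // hop
    if n_frames == 0:
        return []
    frame_energy = contour[: n_frames * hop].reshape(n_frames, hop).mean(axis=1)
    peak = frame_energy.max()
    if peak <= 0:
        return []  # silent input: nothing to detect
    mask = frame_energy >= rel_threshold * peak

    # Merge runs of consecutive above-threshold frames into time intervals.
    regions, start = [], None
    for i, above in enumerate(mask):
        if above and start is None:
            start = i
        elif not above and start is not None:
            regions.append((start * hop / fs, i * hop / fs))
            start = None
    if start is not None:
        regions.append((start * hop / fs, n_frames * hop / fs))
    return regions


# Usage: 0.3 s silence, 0.3 s of a 150 Hz tone (a crude voiced stand-in), 0.3 s silence.
fs = 8000
tone = np.sin(2 * np.pi * 150 * np.arange(2400) / fs)
sig = np.concatenate([np.zeros(2400), tone, np.zeros(2400)])
print(candidate_regions(sig, fs, scales=[8, 12, 16]))
```

SciPy's `scipy.signal.cwt` could stand in for the hand-written `cwt`, but it is deprecated in recent SciPy releases, so the sketch implements the convolution directly. A fuller implementation would still need the phone-boundary stage that the abstract describes; it is not modeled here.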
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Tripathi, K., Rao, K.S. Robust vowel region detection method for multimode speech. Multimed Tools Appl 80, 13615–13637 (2021). https://doi.org/10.1007/s11042-020-10394-7