Abstract
Human–computer interaction via speech is more common than ever before. Whisper, as a form of speech, is not sufficiently addressed by mainstream speech applications, such as automatic speech recognition, speaker identification, and language identification, even though there are more than a hundred thousand laryngectomees in the world who can only whisper. This is due to the fact that systems built for normal speech do not work as expected for whispered speech. A first step to building a speech application that is inclusive of whispered speech, is the successful classification of whispered speech and normal speech. Such a front-end classification system is expected to have high accuracy and low computational overhead, which is the scope of this paper. One of the characteristics of whispered speech is the absence of the fundamental frequency (or pitch), and hence the pitch harmonics as well. The presence of the pitch and pitch harmonics in normal speech, and its absence in whispered speech, is evident in the spectral envelope of the Fourier transform. We observe that this characteristic is predominant in the first quarter of the spectrum, and exploit the same as a feature. We propose the use of one-dimensional convolutional neural networks (1D-CNN) to capture these features from the quartered spectral envelope. The system yields an accuracy of 99.31% when trained and tested on the wTIMIT dataset, and 100% on the CHAINS dataset. The proposed feature is compared with Mel frequency cepstral coefficients, a staple in the speech domain. The proposed classification system is also compared with the state-of-the-art system based on log-filterbank energy features trained on long short-term memory network. The proposed system based on 1D-CNN performs better than, or as good as, the state-of-the-art across multiple experiments. It also converges sooner, with lesser computational overhead. Finally, the proposed system is evaluated under the presence of white noise at various signal-to-noise ratios and found to be robust.
Similar content being viewed by others
Code Availability
Code will be available from the corresponding author, on reasonable request.
References
T. Ashihara, Y. Shinohara, H. Sato, T. Moriya, K. Matsui, T. Fukutomi, Y. Yamaguchi, Y. Aono, Neural whispered speech detection with imbalanced learning, in INTERSPEECH (2019), pp. 3352–3356
S. Baghel, M. Bhattacharjee, S. Prasanna, P. Guha, Shouted and normal speech classification using 1D CNN, in International Conference on Pattern Recognition and Machine Intelligence (Springer, 2019), pp. 472–480
I. Brook, The Laryngectomee Guide (CreateSpace Publication, Charleston, 2013)
M. Cotescu, T. Drugman, G. Huybrechts, J. Lorenzo-Trueba, A. Moinet, Voice conversion for whispered speech synthesis. IEEE Signal Process. Lett. 27, 186–190 (2019)
F. Cummins, M. Grimaldi, T. Leonard, J. Simko, The CHAINS corpus: characterizing individual speakers. Proc. SPECOM 6, 431–435 (2006)
T. Grozdić, S.T. Jovičić, Whispered speech recognition using deep de-noising autoencoder and inverse filtering. IEEE/ACM Trans. Audio Speech Lang. Proc. 25(12), 2313–2322 (2017)
T. Ito, K. Takeda, F. Itakura, Analysis and recognition of whispered speech. Speech Commun. 45(2), 139–152 (2005)
Q. Jin, S.C.S. Jou, T. Schultz, Whispering speaker identification, in IEEE International Conference on Multimedia and Expo (IEEE, 2007), pp. 1027–1030
S.T. Jovičić, Formant feature differences between whispered and voiced sustained vowels. Acta Acust. Acust. 84(4), 739–743 (1998)
K. Khoria, M.R. Kamble, H.A. Patil, Teager energy cepstral coefficients for classification of normal vs. whisper speech, in 28th European Signal Processing Conference (EUSIPCO) (IEEE, 2021), pp. 1–5
B.P. Lim, Computational differences between whispered and non-whispered speech (University of Illinois at Urbana-Champaign, 2011)
J. Makhoul, Linear prediction: a tutorial review. Proc. IEEE 63(4), 561–580 (1975)
T. Nagarajan, H.A. Murthy, Subband-based group delay segmentation of spontaneous speech into syllable-like units. EURASIP J. Adv. Signal Proc. 2004(17), 1–12 (2004)
Z. Qian, K. Xiao, Tagging tone for mandarin pinyin based on sequence labelling. DEStech Transactions on Environment, Energy and Earth Sciences (PEEES) (2020)
T.F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice (Pearson Education India, Noida, 2002)
L.R. Rabiner, R.W. Schafer, Theory and Applications of Digital Speech Processing (Prentice Hall Inc., Hoboken, 2011)
Z. Raeesy, K. Gillespie, C. Ma, T. Drugman, J. Gu, R. Maas, A. Rastrow, B. Hoffmeister, Lstm-based whisper detection, in IEEE Spoken Language Technology Workshop (SLT) (IEEE, 2018), pp. 139–144
N.J. Shah, M.A.B. Shaik, P. Periyasamy, H.A. Patil, V. Vij, Exploiting phase-based features for whisper vs. speech classification, in 29th European Signal Processing Conference (EUSIPCO) (IEEE, 2021), pp. 21–25
P. Vijayalakshmi, M.R. Reddy, The analysis on band-limited hypernasal speech using group delay based formant extraction technique, in Ninth European Conference on Speech Communication and Technology (2005)
P. Vijayalakshmi, M.R. Reddy, D. O’Shaughnessy, Acoustic analysis and detection of hypernasality using a group delay function. IEEE Trans. Biomed. Eng. 54(4), 621–629 (2007)
S.J. Wenndt, E.J. Cupples, R.M. Floyd, A study on the classification of whispered and normally phonated speech, in Seventh International Conference on Spoken Language Processing (2002)
J.B. Wilson, J.D. Mosko, A comparative analysis of whispered and normally phonated speech using an LPC-10 vocoder. Technical report. Rome Air Development Center Griffiss AFB NY (1985)
C. Zhang, J.H. Hansen, Analysis and classification of speech mode: whispered through shouted, in Eighth Annual Conference of the International Speech Communication Association (2007)
C. Zhang, J.H. Hansen, An entropy based feature for whisper-island detection within audio streams, in Ninth Annual Conference of the International Speech Communication Association (2008)
C. Zhang, J.H. Hansen, Advancements in whisper-island detection within normally phonated audio streams, in Tenth Annual Conference of the International Speech Communication Association (2009)
C. Zhang, J.H. Hansen, Whisper-island detection based on unsupervised segmentation with entropy-based speech feature processing. IEEE Trans. Audio Speech Lang. Process. 19(4), 883–894 (2010)
Funding
This research was not funded.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that there are no conflicts of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Joysingh, S.J., Vijayalakshmi, P. & Nagarajan, T. Quartered Spectral Envelope and 1D-CNN-Based Classification of Normally Phonated and Whispered Speech. Circuits Syst Signal Process 42, 3038–3053 (2023). https://doi.org/10.1007/s00034-022-02263-5
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00034-022-02263-5