Quartered Spectral Envelope and 1D-CNN-Based Classification of Normally Phonated and Whispered Speech

Joysingh, S. Johanan; Vijayalakshmi, P.; Nagarajan, T.

doi:10.1007/s00034-022-02263-5

Quartered Spectral Envelope and 1D-CNN-Based Classification of Normally Phonated and Whispered Speech

Published: 23 December 2022

Volume 42, pages 3038–3053, (2023)
Cite this article

Circuits, Systems, and Signal Processing Aims and scope Submit manuscript

219 Accesses
1 Altmetric
Explore all metrics

Abstract

Human–computer interaction via speech is more common than ever before. Whisper, as a form of speech, is not sufficiently addressed by mainstream speech applications, such as automatic speech recognition, speaker identification, and language identification, even though there are more than a hundred thousand laryngectomees in the world who can only whisper. This is due to the fact that systems built for normal speech do not work as expected for whispered speech. A first step to building a speech application that is inclusive of whispered speech, is the successful classification of whispered speech and normal speech. Such a front-end classification system is expected to have high accuracy and low computational overhead, which is the scope of this paper. One of the characteristics of whispered speech is the absence of the fundamental frequency (or pitch), and hence the pitch harmonics as well. The presence of the pitch and pitch harmonics in normal speech, and its absence in whispered speech, is evident in the spectral envelope of the Fourier transform. We observe that this characteristic is predominant in the first quarter of the spectrum, and exploit the same as a feature. We propose the use of one-dimensional convolutional neural networks (1D-CNN) to capture these features from the quartered spectral envelope. The system yields an accuracy of 99.31% when trained and tested on the wTIMIT dataset, and 100% on the CHAINS dataset. The proposed feature is compared with Mel frequency cepstral coefficients, a staple in the speech domain. The proposed classification system is also compared with the state-of-the-art system based on log-filterbank energy features trained on long short-term memory network. The proposed system based on 1D-CNN performs better than, or as good as, the state-of-the-art across multiple experiments. It also converges sooner, with lesser computational overhead. Finally, the proposed system is evaluated under the presence of white noise at various signal-to-noise ratios and found to be robust.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A comprehensive survey on automatic speech recognition using neural networks

Article 15 August 2023

Automatic speech recognition: a survey

Article 10 November 2020

Comparative analysis of audio classification with MFCC and STFT features using machine learning techniques

Article Open access 03 January 2024

Availability of data and materials

The data used in the current work, the wTIMIT and the CHAINS dataset, are available, on reasonable request, from the authors of [11] and [5], respectively.

Code Availability

Code will be available from the corresponding author, on reasonable request.

References

T. Ashihara, Y. Shinohara, H. Sato, T. Moriya, K. Matsui, T. Fukutomi, Y. Yamaguchi, Y. Aono, Neural whispered speech detection with imbalanced learning, in INTERSPEECH (2019), pp. 3352–3356
S. Baghel, M. Bhattacharjee, S. Prasanna, P. Guha, Shouted and normal speech classification using 1D CNN, in International Conference on Pattern Recognition and Machine Intelligence (Springer, 2019), pp. 472–480
I. Brook, The Laryngectomee Guide (CreateSpace Publication, Charleston, 2013)
Google Scholar
M. Cotescu, T. Drugman, G. Huybrechts, J. Lorenzo-Trueba, A. Moinet, Voice conversion for whispered speech synthesis. IEEE Signal Process. Lett. 27, 186–190 (2019)
Article Google Scholar
F. Cummins, M. Grimaldi, T. Leonard, J. Simko, The CHAINS corpus: characterizing individual speakers. Proc. SPECOM 6, 431–435 (2006)
Google Scholar
T. Grozdić, S.T. Jovičić, Whispered speech recognition using deep de-noising autoencoder and inverse filtering. IEEE/ACM Trans. Audio Speech Lang. Proc. 25(12), 2313–2322 (2017)
Article Google Scholar
T. Ito, K. Takeda, F. Itakura, Analysis and recognition of whispered speech. Speech Commun. 45(2), 139–152 (2005)
Article Google Scholar
Q. Jin, S.C.S. Jou, T. Schultz, Whispering speaker identification, in IEEE International Conference on Multimedia and Expo (IEEE, 2007), pp. 1027–1030
S.T. Jovičić, Formant feature differences between whispered and voiced sustained vowels. Acta Acust. Acust. 84(4), 739–743 (1998)
Google Scholar
K. Khoria, M.R. Kamble, H.A. Patil, Teager energy cepstral coefficients for classification of normal vs. whisper speech, in 28th European Signal Processing Conference (EUSIPCO) (IEEE, 2021), pp. 1–5
B.P. Lim, Computational differences between whispered and non-whispered speech (University of Illinois at Urbana-Champaign, 2011)
J. Makhoul, Linear prediction: a tutorial review. Proc. IEEE 63(4), 561–580 (1975)
Article Google Scholar
T. Nagarajan, H.A. Murthy, Subband-based group delay segmentation of spontaneous speech into syllable-like units. EURASIP J. Adv. Signal Proc. 2004(17), 1–12 (2004)
MATH Google Scholar
Z. Qian, K. Xiao, Tagging tone for mandarin pinyin based on sequence labelling. DEStech Transactions on Environment, Energy and Earth Sciences (PEEES) (2020)
T.F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice (Pearson Education India, Noida, 2002)
Google Scholar
L.R. Rabiner, R.W. Schafer, Theory and Applications of Digital Speech Processing (Prentice Hall Inc., Hoboken, 2011)
Google Scholar
Z. Raeesy, K. Gillespie, C. Ma, T. Drugman, J. Gu, R. Maas, A. Rastrow, B. Hoffmeister, Lstm-based whisper detection, in IEEE Spoken Language Technology Workshop (SLT) (IEEE, 2018), pp. 139–144
N.J. Shah, M.A.B. Shaik, P. Periyasamy, H.A. Patil, V. Vij, Exploiting phase-based features for whisper vs. speech classification, in 29th European Signal Processing Conference (EUSIPCO) (IEEE, 2021), pp. 21–25
P. Vijayalakshmi, M.R. Reddy, The analysis on band-limited hypernasal speech using group delay based formant extraction technique, in Ninth European Conference on Speech Communication and Technology (2005)
P. Vijayalakshmi, M.R. Reddy, D. O’Shaughnessy, Acoustic analysis and detection of hypernasality using a group delay function. IEEE Trans. Biomed. Eng. 54(4), 621–629 (2007)
Article Google Scholar
S.J. Wenndt, E.J. Cupples, R.M. Floyd, A study on the classification of whispered and normally phonated speech, in Seventh International Conference on Spoken Language Processing (2002)
J.B. Wilson, J.D. Mosko, A comparative analysis of whispered and normally phonated speech using an LPC-10 vocoder. Technical report. Rome Air Development Center Griffiss AFB NY (1985)
C. Zhang, J.H. Hansen, Analysis and classification of speech mode: whispered through shouted, in Eighth Annual Conference of the International Speech Communication Association (2007)
C. Zhang, J.H. Hansen, An entropy based feature for whisper-island detection within audio streams, in Ninth Annual Conference of the International Speech Communication Association (2008)
C. Zhang, J.H. Hansen, Advancements in whisper-island detection within normally phonated audio streams, in Tenth Annual Conference of the International Speech Communication Association (2009)
C. Zhang, J.H. Hansen, Whisper-island detection based on unsupervised segmentation with entropy-based speech feature processing. IEEE Trans. Audio Speech Lang. Process. 19(4), 883–894 (2010)
Article Google Scholar

Download references

Funding

This research was not funded.

Author information

Authors and Affiliations

Sri Sivasubramaniya Nadar College of Engineering, Kalavakkam, Chennai, India
S. Johanan Joysingh & P. Vijayalakshmi
Shiv Nadar University Chennai, Kalavakkam, Chennai, India
T. Nagarajan

Authors

S. Johanan Joysingh
View author publications
You can also search for this author in PubMed Google Scholar
P. Vijayalakshmi
View author publications
You can also search for this author in PubMed Google Scholar
T. Nagarajan
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to S. Johanan Joysingh.

Ethics declarations

Conflict of interest

The authors declare that there are no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Joysingh, S.J., Vijayalakshmi, P. & Nagarajan, T. Quartered Spectral Envelope and 1D-CNN-Based Classification of Normally Phonated and Whispered Speech. Circuits Syst Signal Process 42, 3038–3053 (2023). https://doi.org/10.1007/s00034-022-02263-5

Download citation

Received: 26 March 2022
Revised: 06 December 2022
Accepted: 06 December 2022
Published: 23 December 2022
Issue Date: May 2023
DOI: https://doi.org/10.1007/s00034-022-02263-5

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Quartered Spectral Envelope and 1D-CNN-Based Classification of Normally Phonated and Whispered Speech

Abstract

Access this article

Similar content being viewed by others

A comprehensive survey on automatic speech recognition using neural networks

Automatic speech recognition: a survey

Comparative analysis of audio classification with MFCC and STFT features using machine learning techniques

Availability of data and materials

Code Availability

References

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Quartered Spectral Envelope and 1D-CNN-Based Classification of Normally Phonated and Whispered Speech

Abstract

Access this article

Similar content being viewed by others

A comprehensive survey on automatic speech recognition using neural networks

Automatic speech recognition: a survey

Comparative analysis of audio classification with MFCC and STFT features using machine learning techniques

Availability of data and materials

Code Availability

References

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation