
Quartered Spectral Envelope and 1D-CNN-Based Classification of Normally Phonated and Whispered Speech

Circuits, Systems, and Signal Processing

Abstract

Human–computer interaction via speech is more common than ever before. Whisper, as a form of speech, is not sufficiently addressed by mainstream speech applications such as automatic speech recognition, speaker identification, and language identification, even though there are more than a hundred thousand laryngectomees in the world who can only whisper. This is because systems built for normal speech do not work as expected for whispered speech. A first step toward building a speech application that is inclusive of whispered speech is the successful classification of whispered speech and normal speech. Such a front-end classification system is expected to have high accuracy and low computational overhead, and building one is the scope of this paper. One of the characteristics of whispered speech is the absence of the fundamental frequency (or pitch), and hence of the pitch harmonics as well. The presence of the pitch and pitch harmonics in normal speech, and their absence in whispered speech, is evident in the spectral envelope of the Fourier transform. We observe that this characteristic is predominant in the first quarter of the spectrum, and exploit it as a feature. We propose the use of a one-dimensional convolutional neural network (1D-CNN) to capture these features from the quartered spectral envelope. The system yields an accuracy of 99.31% when trained and tested on the wTIMIT dataset, and 100% on the CHAINS dataset. The proposed feature is compared with Mel frequency cepstral coefficients, a staple in the speech domain. The proposed classification system is also compared with the state-of-the-art system, which is based on log-filterbank energy features fed to a long short-term memory network. The proposed 1D-CNN-based system performs better than, or as well as, the state of the art across multiple experiments. It also converges sooner, with lower computational overhead. Finally, the proposed system is evaluated in the presence of white noise at various signal-to-noise ratios and is found to be robust.
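As a rough illustration of the two signal-level ideas in the abstract, the sketch below extracts a quartered spectrum from one speech frame and adds white noise at a chosen signal-to-noise ratio. It is a minimal sketch under stated assumptions, not the authors' implementation: the "spectral envelope" is approximated here by the log-magnitude DFT of a Hamming-windowed frame, the "first quarter" is taken over the one-sided spectrum (0 to half the sampling rate), and the function names, the 16 kHz sampling rate, the 25 ms frame, and the 512-point FFT are all hypothetical choices.

```python
import numpy as np


def quartered_spectrum(frame, n_fft=512):
    """Return the lowest quarter of a frame's one-sided log-magnitude spectrum.

    Assumption: the log-magnitude DFT stands in for the paper's spectral
    envelope, whose exact estimator is not given in the abstract.
    """
    windowed = frame * np.hamming(len(frame))
    # rfft zero-pads to n_fft and returns n_fft // 2 + 1 bins (0 .. fs / 2).
    log_mag = np.log(np.abs(np.fft.rfft(windowed, n_fft)) + 1e-10)
    # Keep the lowest quarter of the band, where pitch harmonics are prominent.
    return log_mag[: (n_fft // 2) // 4]


def add_white_noise(signal, snr_db, rng):
    """Add white Gaussian noise so the result has roughly the requested SNR."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


# Hypothetical usage on a synthetic "voiced" frame: harmonics of a 120 Hz pitch.
fs = 16000                       # assumed sampling rate in Hz
t = np.arange(int(0.025 * fs)) / fs
voiced = sum(np.sin(2 * np.pi * 120 * k * t) / k for k in range(1, 11))
rng = np.random.default_rng(0)
noisy_voiced = add_white_noise(voiced, snr_db=10, rng=rng)
feature = quartered_spectrum(noisy_voiced)   # one vector for a 1D-CNN input
```

With these assumed settings each frame yields a 64-point vector covering 0 to 2 kHz, which is the kind of one-dimensional input a 1D-CNN could consume; the network architecture itself is not described in the abstract and is left out here.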


Availability of data and materials

The data used in the current work, the wTIMIT and CHAINS datasets, are available on reasonable request from the authors of [11] and [5], respectively.

Code Availability

Code will be available from the corresponding author, on reasonable request.

References

  1. T. Ashihara, Y. Shinohara, H. Sato, T. Moriya, K. Matsui, T. Fukutomi, Y. Yamaguchi, Y. Aono, Neural whispered speech detection with imbalanced learning, in INTERSPEECH (2019), pp. 3352–3356

  2. S. Baghel, M. Bhattacharjee, S. Prasanna, P. Guha, Shouted and normal speech classification using 1D CNN, in International Conference on Pattern Recognition and Machine Intelligence (Springer, 2019), pp. 472–480

  3. I. Brook, The Laryngectomee Guide (CreateSpace Publication, Charleston, 2013)

  4. M. Cotescu, T. Drugman, G. Huybrechts, J. Lorenzo-Trueba, A. Moinet, Voice conversion for whispered speech synthesis. IEEE Signal Process. Lett. 27, 186–190 (2019)

  5. F. Cummins, M. Grimaldi, T. Leonard, J. Simko, The CHAINS corpus: characterizing individual speakers. Proc. SPECOM 6, 431–435 (2006)

  6. T. Grozdić, S.T. Jovičić, Whispered speech recognition using deep de-noising autoencoder and inverse filtering. IEEE/ACM Trans. Audio Speech Lang. Proc. 25(12), 2313–2322 (2017)

  7. T. Ito, K. Takeda, F. Itakura, Analysis and recognition of whispered speech. Speech Commun. 45(2), 139–152 (2005)

  8. Q. Jin, S.C.S. Jou, T. Schultz, Whispering speaker identification, in IEEE International Conference on Multimedia and Expo (IEEE, 2007), pp. 1027–1030

  9. S.T. Jovičić, Formant feature differences between whispered and voiced sustained vowels. Acta Acust. Acust. 84(4), 739–743 (1998)

  10. K. Khoria, M.R. Kamble, H.A. Patil, Teager energy cepstral coefficients for classification of normal vs. whisper speech, in 28th European Signal Processing Conference (EUSIPCO) (IEEE, 2021), pp. 1–5

  11. B.P. Lim, Computational differences between whispered and non-whispered speech (University of Illinois at Urbana-Champaign, 2011)

  12. J. Makhoul, Linear prediction: a tutorial review. Proc. IEEE 63(4), 561–580 (1975)

  13. T. Nagarajan, H.A. Murthy, Subband-based group delay segmentation of spontaneous speech into syllable-like units. EURASIP J. Adv. Signal Proc. 2004(17), 1–12 (2004)

  14. Z. Qian, K. Xiao, Tagging tone for Mandarin pinyin based on sequence labelling. DEStech Transactions on Environment, Energy and Earth Sciences (PEEES) (2020)

  15. T.F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice (Pearson Education India, Noida, 2002)

  16. L.R. Rabiner, R.W. Schafer, Theory and Applications of Digital Speech Processing (Prentice Hall Inc., Hoboken, 2011)

  17. Z. Raeesy, K. Gillespie, C. Ma, T. Drugman, J. Gu, R. Maas, A. Rastrow, B. Hoffmeister, LSTM-based whisper detection, in IEEE Spoken Language Technology Workshop (SLT) (IEEE, 2018), pp. 139–144

  18. N.J. Shah, M.A.B. Shaik, P. Periyasamy, H.A. Patil, V. Vij, Exploiting phase-based features for whisper vs. speech classification, in 29th European Signal Processing Conference (EUSIPCO) (IEEE, 2021), pp. 21–25

  19. P. Vijayalakshmi, M.R. Reddy, The analysis on band-limited hypernasal speech using group delay based formant extraction technique, in Ninth European Conference on Speech Communication and Technology (2005)

  20. P. Vijayalakshmi, M.R. Reddy, D. O’Shaughnessy, Acoustic analysis and detection of hypernasality using a group delay function. IEEE Trans. Biomed. Eng. 54(4), 621–629 (2007)

  21. S.J. Wenndt, E.J. Cupples, R.M. Floyd, A study on the classification of whispered and normally phonated speech, in Seventh International Conference on Spoken Language Processing (2002)

  22. J.B. Wilson, J.D. Mosko, A comparative analysis of whispered and normally phonated speech using an LPC-10 vocoder. Technical report. Rome Air Development Center Griffiss AFB NY (1985)

  23. C. Zhang, J.H. Hansen, Analysis and classification of speech mode: whispered through shouted, in Eighth Annual Conference of the International Speech Communication Association (2007)

  24. C. Zhang, J.H. Hansen, An entropy based feature for whisper-island detection within audio streams, in Ninth Annual Conference of the International Speech Communication Association (2008)

  25. C. Zhang, J.H. Hansen, Advancements in whisper-island detection within normally phonated audio streams, in Tenth Annual Conference of the International Speech Communication Association (2009)

  26. C. Zhang, J.H. Hansen, Whisper-island detection based on unsupervised segmentation with entropy-based speech feature processing. IEEE Trans. Audio Speech Lang. Process. 19(4), 883–894 (2010)

Funding

This research was not funded.

Author information

Corresponding author

Correspondence to S. Johanan Joysingh.

Ethics declarations

Conflict of interest

The authors declare that there are no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Joysingh, S.J., Vijayalakshmi, P. & Nagarajan, T. Quartered Spectral Envelope and 1D-CNN-Based Classification of Normally Phonated and Whispered Speech. Circuits Syst Signal Process 42, 3038–3053 (2023). https://doi.org/10.1007/s00034-022-02263-5

  • DOI: https://doi.org/10.1007/s00034-022-02263-5
