Shouted and Normal Speech Classification Using 1D CNN

  • Shikha Baghel
  • Mrinmoy Bhattacharjee
  • S. R. M. Prasanna
  • Prithwijit Guha
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11942)


Automatic shouted speech detection systems usually model the spectral characteristics of shouted speech to differentiate it from normal speech. Most existing work has explored hand-crafted features for this task. However, many studies in audio processing suggest that approaches based on automatic feature learning are more robust than hand-crafted feature engineering. This work reaffirms that notion by proposing a 1D-CNN architecture for the shouted versus normal speech classification task. The CNN learns features from the magnitude spectrum of speech frames, and classification is performed by fully connected layers in the later stages of the network. The performance of the proposed architecture is evaluated on three datasets and validated against three existing approaches. As an additional contribution, the features learned by the CNN kernels are discussed with relevant visualizations.
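The pipeline sketched in the abstract — magnitude-spectrum frames passed through 1D convolutions, then fully connected layers for the two-class decision — can be illustrated with a toy, untrained forward pass. All sizes below (frame length, FFT size, kernel count and width, pooling) are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def magnitude_spectrum(frame, n_fft=512):
    """Magnitude spectrum (half spectrum) of one windowed speech frame."""
    return np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n=n_fft))

def conv1d(x, kernels):
    """Valid-mode 1D convolution: one output channel per kernel."""
    return np.array([np.convolve(x, k, mode="valid") for k in kernels])

# Toy forward pass: conv -> ReLU -> max-pool -> dense -> softmax
frame = rng.standard_normal(400)            # e.g. a 25 ms frame at 16 kHz
spec = magnitude_spectrum(frame)            # 257 spectral bins
kernels = rng.standard_normal((16, 5))      # 16 learnable spectral filters
feat = np.maximum(conv1d(spec, kernels), 0)         # (16, 253) after ReLU
pooled = feat[:, :252].reshape(16, 126, 2).max(-1)  # max-pool, stride 2
hidden = pooled.reshape(-1)                         # flatten for dense layer
W = rng.standard_normal((2, hidden.size)) * 0.01    # shouted vs. normal logits
logits = W @ hidden
probs = np.exp(logits) / np.exp(logits).sum()
print(probs.shape)  # (2,)
```

In a real system the kernels and dense weights would of course be learned by backpropagation over labeled shouted/normal frames; the point here is only the shape of the computation, with the convolution operating directly on spectral bins rather than on raw samples.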


Keywords: Shouted and normal speech classification · Shouted speech detection · 1D CNN · Convolution filter visualization



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Shikha Baghel (1)
  • Mrinmoy Bhattacharjee (1)
  • S. R. M. Prasanna (1, 2)
  • Prithwijit Guha (1)

  1. Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati, India
  2. Department of Electrical Engineering, Indian Institute of Technology Dharwad, Dharwad, India
