Multimedia Tools and Applications, Volume 78, Issue 5, pp 5571–5589

Deep features-based speech emotion recognition for smart affective services

  • Abdul Malik Badshah
  • Nasir Rahim
  • Noor Ullah
  • Jamil Ahmad
  • Khan Muhammad
  • Mi Young Lee
  • Soonil Kwon
  • Sung Wook Baik


Emotion recognition from speech signals is an interesting research area with several applications, such as smart healthcare, autonomous voice response systems, assessment of situational seriousness through analysis of callers' affective state in emergency centers, and other smart affective services. In this paper, we present a study of speech emotion recognition based on features extracted from spectrograms using a deep convolutional neural network (CNN) with rectangular kernels. Typically, CNNs have square-shaped kernels and pooling operators at various layers, which are suited to 2D image data. In spectrograms, however, the information is encoded in a slightly different manner: time is represented along the x-axis, frequency along the y-axis, and the amplitude of the speech signal by the intensity value at a particular position. To analyze speech through spectrograms, we propose rectangular kernels of varying shapes and sizes, along with max pooling in rectangular neighborhoods, to extract discriminative features. The proposed scheme effectively learns discriminative features from speech spectrograms and performs better than many state-of-the-art techniques when evaluated on the Emo-DB and Korean speech datasets.
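The idea of applying rectangular kernels and rectangular max pooling to a spectrogram can be illustrated with a minimal NumPy sketch. It is not the network described in the paper: the frame length, hop size, the 9 x 3 kernel, the 4 x 2 pooling window, and the synthetic 440 Hz tone are all assumptions chosen for illustration, and the kernel weights are random rather than learned.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: rows are frequency bins, columns are time frames."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # FFT of each frame, then transpose so frequency runs along axis 0 (y-axis)
    return np.abs(np.fft.rfft(frames, axis=1)).T

def conv2d_valid(image, kernel):
    """'Valid' 2D cross-correlation with a kernel of arbitrary height x width."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def max_pool(feature_map, pool_h, pool_w):
    """Non-overlapping max pooling over pool_h x pool_w neighborhoods."""
    h = (feature_map.shape[0] // pool_h) * pool_h
    w = (feature_map.shape[1] // pool_w) * pool_w
    trimmed = feature_map[:h, :w]  # drop rows/columns that do not fill a window
    blocks = trimmed.reshape(h // pool_h, pool_h, w // pool_w, pool_w)
    return blocks.max(axis=(1, 3))

# Synthetic one-second 440 Hz tone sampled at 8 kHz (stand-in for real speech)
sr = 8000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)

spec = spectrogram(signal)

# Hypothetical tall kernel: spans 9 frequency bins but only 3 time frames
rng = np.random.default_rng(0)
kernel = rng.standard_normal((9, 3))

features = np.maximum(conv2d_valid(spec, kernel), 0.0)  # ReLU activation
pooled = max_pool(features, pool_h=4, pool_w=2)         # rectangular pooling

print(spec.shape, features.shape, pooled.shape)
```

Because the two spectrogram axes carry different quantities, a non-square kernel lets the filter cover more context along one axis (here, frequency) than the other. Which aspect ratios work best is an empirical question that this sketch does not address.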


Speech emotion recognition · Convolutional neural network · Spectrogram · Rectangular kernels



This work was supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. R0126-15-1119, Development of a solution for situation-awareness based on the analysis of speech and environmental sounds).



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. Digital Contents Research Institute, Sejong University, Seoul, Republic of Korea
