Multimedia Tools and Applications, Volume 77, Issue 4, pp 4883–4907

Determining speaker attributes from stress-affected speech in emergency situations with hybrid SVM-DNN architecture

  • Jamil Ahmad
  • Muhammad Sajjad
  • Seungmin Rho
  • Soon-il Kwon
  • Mi Young Lee
  • Sung Wook Baik


Of the millions of emergency calls made each year, about a quarter report non-emergencies. To avoid dispatching responders to such situations, forensic examination of the reported situation, with the recorded speech as evidence, has become an indispensable requirement for emergency response centers. Caller profile information such as gender, age, emotional state, transcript, and contextual sounds determined from emergency calls can be highly beneficial for sophisticated forensic analysis. However, callers reporting emergencies often experience emotional stress, which causes variations in speech production. Furthermore, low voice quality and background noise make it very difficult to recognize caller attributes efficiently in such unconstrained environments. To overcome the limitations of traditional classification systems in these situations, a hybrid two-stage classification scheme is proposed in this paper. Our framework consists of an ensemble of support vector machines (e-SVM) and a deep neural network (DNN) in a cascade. The first-stage e-SVM consists of two models discriminatively trained on normal and stressful speech from emergency calls. The DNN, forming the second stage of the classification pipeline, is utilized only when the first stage yields ambiguous predictions. The adaptive nature of this two-stage scheme helps achieve both efficiency and high performance. Experiments conducted on a large dataset affirm the suitability of the proposed architecture for efficient real-time speaker attribute recognition. The framework is evaluated for gender recognition from emergency calls in the presence of emotions and background noise, and yields significant performance improvements over similar state-of-the-art gender recognition approaches.
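The cascade described above can be sketched in a few lines. This is a minimal illustration of the decision flow only: the margin threshold, the stand-in SVM scorers, and the fallback model below are hypothetical placeholders, not the authors' trained models or features.

```python
import numpy as np

MARGIN = 0.25  # hypothetical ambiguity threshold on the ensemble score

def esvm_score(x, models):
    """Average the decision scores of the SVM ensemble (stand-ins here)."""
    return float(np.mean([m(x) for m in models]))

def classify(x, svm_models, dnn):
    """Two-stage cascade: trust the e-SVM unless its score is ambiguous."""
    score = esvm_score(x, svm_models)
    if abs(score) >= MARGIN:            # confident first-stage prediction
        return "male" if score > 0 else "female"
    return dnn(x)                       # ambiguous: fall back to the DNN

# Toy stand-ins for trained models (illustrative linear scorers only).
svms = [lambda x: float(np.dot([0.8, -0.2], x)),
        lambda x: float(np.dot([0.6, 0.1], x))]
dnn = lambda x: "male" if x[0] > 0 else "female"

confident = classify(np.array([1.0, 0.5]), svms, dnn)   # e-SVM path
ambiguous = classify(np.array([0.05, 0.1]), svms, dnn)  # DNN fallback path
```

Because the DNN is invoked only for ambiguous inputs, most calls are resolved by the cheaper first stage, which is what gives the cascade its claimed efficiency.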


Keywords: Speaker attributes · Stress-affected speech · Deep neural network · Emergency calls · Hybrid classifier



This work was supported by the ICT R&D program of MSIP/IITP (No. R0126-15-1119, Development of a solution for situation-awareness based on the analysis of speech and environmental sounds).



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Jamil Ahmad (1)
  • Muhammad Sajjad (2)
  • Seungmin Rho (3)
  • Soon-il Kwon (1)
  • Mi Young Lee (1)
  • Sung Wook Baik (1)
  1. College of Electronics and Information Engineering, Sejong University, Seoul, Republic of Korea
  2. Department of Computer Science, Islamia College, Peshawar, Pakistan
  3. Department of Multimedia, Sungkyul University, Anyang, Republic of Korea
