Multimedia Tools and Applications, Volume 75, Issue 11, pp 6071–6089

Robust scream sound detection via sound event partitioning

  • Baiying Lei
  • Man-Wai Mak


This paper proposes a robust scream-sound detection scheme for acoustic surveillance applications. To enhance the discriminability between scream and non-scream sounds, a sound-event partitioning (SEP) method that facilitates the extraction of multiple acoustic vectors from a single sound event is developed. Regularized principal component analysis (PCA) and normalization are applied to the acoustic vectors, which are then classified by support vector machines (SVMs). Experimental results based on 1000 sound events show that the proposed scheme is effective even if there are severe mismatches between the training and testing conditions. The experimental results also show that the proposed scheme can reduce the equal error rate (EER) by up to 60% when compared to a classical approach that uses mel-frequency cepstral coefficients (MFCC) as features. Extensive analyses of the different processing stages of the proposed sound detection scheme also suggest that sound partitioning and feature normalization play important roles in boosting the detection performance.
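As a rough illustration of three ideas named in the abstract (deriving several acoustic vectors from one sound event, normalizing those vectors, and scoring a detector by its EER), the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names `partition_event`, `length_normalize`, and `equal_error_rate` are hypothetical, the contiguous-block averaging is only one assumed way to partition an event, unit-length scaling is only one assumed form of normalization, and the regularized PCA and SVM stages are omitted.

```python
import math


def partition_event(frames, num_parts):
    """Split one event's frame-level features into contiguous parts and
    average each part, giving several vectors per event.
    ASSUMPTION: contiguous blocks plus averaging; the paper's SEP may differ."""
    if not 1 <= num_parts <= len(frames):
        raise ValueError("num_parts must be between 1 and the number of frames")
    dim = len(frames[0])
    vectors = []
    for p in range(num_parts):
        start = p * len(frames) // num_parts
        end = (p + 1) * len(frames) // num_parts
        segment = frames[start:end]
        vectors.append([sum(f[d] for f in segment) / len(segment) for d in range(dim)])
    return vectors


def length_normalize(vector):
    """Scale a vector to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def equal_error_rate(scream_scores, other_scores):
    """Approximate the EER: sweep thresholds over the observed scores and
    return the mean of miss rate and false-alarm rate where they are closest."""
    thresholds = sorted(set(scream_scores) | set(other_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        miss = sum(s < t for s in scream_scores) / len(scream_scores)
        false_alarm = sum(s >= t for s in other_scores) / len(other_scores)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer


if __name__ == "__main__":
    # Toy frame-level features (4 frames, 2 dimensions); values are made up.
    frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    vectors = [length_normalize(v) for v in partition_event(frames, 2)]
    print(vectors)
    # Toy detector scores; higher means "more scream-like".
    print(equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1]))
```

In a fuller pipeline, the normalized vectors would be passed through a whitening transform and then to a classifier, and the classifier's scores on held-out scream and non-scream events would be the inputs to `equal_error_rate`.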


Keywords: Scream sound detection; Regularized PCA-whitening; Feature normalization; Sound event partitioning



This work was supported in part by the National Natural Science Foundation of China (No. 61402296), the Motorola Solutions Foundation (ID: 7186445), and The Hong Kong Polytechnic University (Grant No. G-YL78). The authors would like to thank Wing-Lung Leung for developing the sound recording system and part of the Android app.



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. Department of Biomedical Engineering, School of Medicine, National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, and Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, Shenzhen University, Shenzhen, China
  2. Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Kowloon, Hong Kong
