Multi-channel Convolutional Neural Networks with Multi-level Feature Fusion for Environmental Sound Classification

  • Dading Chong
  • Yuexian ZouEmail author
  • Wenwu Wang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11296)


Learning acoustic models directly from the raw waveform is an effective method for Environmental Sound Classification (ESC) where sound events often exhibit vast diversity in temporal scales. Convolutional neural networks (CNNs) based ESC methods have achieved the state-of-the-art results. However, their performance is affected significantly by the number of convolutional layers used and the choice of the kernel size in the first convolutional layer. In addition, most existing studies have ignored the ability of CNNs to learn hierarchical features from environmental sounds. Motivated by these findings, in this paper, parallel convolutional filters with different sizes in the first convolutional layer are designed to extract multi-time resolution features aiming at enhancing feature representation. Inspired by VGG networks, we build our deep CNNs by stacking 1-D convolutional layers using very small filters except for the first layer. Finally, we extend the model using multi-level feature aggregation technique to boost the classification performance. The experimental results on Urbansound 8k, ESC-50, and ESC-10 show that our proposed method outperforms the state-of-the-art end-to-end methods for environmental sound classification in terms of the classification accuracy.


Environmental sound classification Multi-channel deep convolutional neural networks End-to-end Multi-level feature fusion 



This project was partially supported by Shenzhen Science & Technology Fundamental Research Programs (No: JCYJ20170817160058246 and JCYJ20170306165153653) & Shenzhen Key Laboratory for Intelligent Multimedia and Virtual Reality (ZDSYS201703031405467). Special acknowledgements are given to Aoto-PKUSZ Joint Research Center of Artificial Intelligence on Scene Cognition & Technology Innovation for its support.


  1. 1.
    Virtanen, T., Plumbley, M.D., Ellis, D.: Computational Analysis of Sound Scenes and Events. Springer, Heidelberg (2018). Scholar
  2. 2.
    Boddapati, V., Petef, A., Rasmusson, J., Lundberg, L.: Classifying environmental sounds using image recognition networks. Proc. Comput. Sci. 112, 2048–2056 (2017)CrossRefGoogle Scholar
  3. 3.
    Piczak, K.J.: Environmental sound classification with convolutional neural networks. In: 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6. IEEE (2015)Google Scholar
  4. 4.
    Vacher, M., Serignat, J.-F., Chaillol, S.: Sound classification in a smart room environment: an approach using GMM and HMM methods. In: The 4th IEEE Conference on Speech Technology and Human-Computer Dialogue (SpeD 2007), Publishing House of the Romanian Academy (Bucharest), pp. 135–146 (2007)Google Scholar
  5. 5.
    Łopatka, K., Zwan, P., Czyżewski, A.: Dangerous sound event recognition using support vector machine classifiers. In: Nguyen, N.T., Zgrzywa, A., Czyżewski, A. (eds.) Advances in Multimedia and Network Information System Technologies, pp. 49–57. Springer, Heidelberg (2010). Scholar
  6. 6.
    Su, F., Yang, L., Lu, T., Wang, G.: Environmental sound classification for scene recognition using local discriminant bases and HMM. In: Proceedings of the 19th ACM International Conference on Multimedia, pp. 1389–1392. ACM (2011)Google Scholar
  7. 7.
    Saki, F., Kehtarnavaz, N.: Background noise classification using random forest tree classifier for cochlear implant applications. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3591–3595. IEEE (2014)Google Scholar
  8. 8.
    Sainath, T.N., Mohamed, A.-R., Kingsbury, B., Ramabhadran, B.: Deep convolutional neural networks for LVCSR. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8614–8618. IEEE (2013)Google Scholar
  9. 9.
    Abdel-Hamid, O., Mohamed, A.-R., Jiang, H., Deng, L., Penn, G., Yu, D.: Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 22, 1533–1545 (2014)CrossRefGoogle Scholar
  10. 10.
    Kong, Q., Sobieraj, I., Wang, W., Plumbley, M.: Deep neural network baseline for DCASE challenge 2016. In: Proceedings of DCASE 2016 (2016)Google Scholar
  11. 11.
    Cotton, C.V., Ellis, D.P.: Spectral vs. spectro-temporal features for acoustic event detection. In: 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 69–72. IEEE (2011)Google Scholar
  12. 12.
    Zhang, H., McLoughlin, I., Song, Y.: Robust sound event recognition using convolutional neural networks. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 559–563. IEEE (2015)Google Scholar
  13. 13.
    Zhang, X., Zou, Y., Shi, W.: Dilated convolution neural network with LeakyReLU for environmental sound classification. In: 2017 22nd International Conference on Digital Signal Processing (DSP), pp. 1–5. IEEE (2017)Google Scholar
  14. 14.
    Medhat, F., Chesmore, D., Robinson, J.: Masked conditional neural networks for audio classification. In: Lintas, A., Rovetta, S., Verschure, P.F.M.J., Villa, A.E.P. (eds.) ICANN 2017. LNCS, vol. 10614, pp. 349–358. Springer, Cham (2017). Scholar
  15. 15.
    Tokozume, Y., Harada, T.: Learning environmental sounds with end-to-end convolutional neural network. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2721–2725. IEEE (2017)Google Scholar
  16. 16.
    Tokozume, Y., Ushiku, Y., Harada, T.: Learning from between-class examples for deep sound recognition. arXiv preprint arXiv:1711.10282 (2017)
  17. 17.
    Dai, W., Dai, C., Qu, S., Li, J., Das, S.: Very deep convolutional neural networks for raw waveforms. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 421–425. IEEE (2017)Google Scholar
  18. 18.
    Hoshen, Y., Weiss, R.J., Wilson, K.W.: Speech acoustic modeling from raw multichannel waveforms. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4624–4628. IEEE (2015)Google Scholar
  19. 19.
    Zhao, B., Lu, H., Chen, S., Liu, J., Wu, D.: Convolutional neural networks for time series classification. J. Syst. Eng. Electron. 28, 162–169 (2017)CrossRefGoogle Scholar
  20. 20.
    Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
  21. 21.
    Lee, J., Nam, J.: Multi-level and multi-scale feature aggregation using pretrained convolutional neural networks for music auto-tagging. IEEE Signal Process. Lett. 24, 1208–1212 (2017)CrossRefGoogle Scholar
  22. 22.
    Salamon, J., Jacoby, C., Bello, J.P.: A dataset and taxonomy for urban sound research. In: Proceedings of the 22nd ACM International Conference on Multimedia, pp. 1041–1044. ACM (2014)Google Scholar
  23. 23.
    Piczak, K.J.: ESC: dataset for environmental sound classification. In: Proceedings of the 23rd ACM International Conference on Multimedia, pp. 1015–1018. ACM (2015)Google Scholar
  24. 24.
    Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: Proceedings of ICML, p. 3 (2013)Google Scholar
  25. 25.
    Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  26. 26.
    Salamon, J., Bello, J.P.: Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Process. Lett. 24, 279–283 (2017)CrossRefGoogle Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. 1.ADSPLAB, School of ECEPeking UniversityShenzhenChina
  2. 2.Peng Cheng LaboratoryShenzhenChina
  3. 3.Centre for Vision, Speech and Signal ProcessingUniversity of SurreyGuildfordUK

Personalised recommendations