Deep Convolutional Neural Network with Mixup for Environmental Sound Classification

  • Zhichao Zhang
  • Shugong Xu
  • Shan Cao
  • Shunqing Zhang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11257)

Abstract

Environmental sound classification (ESC) is an important and challenging problem. In contrast to speech, sound events have a noise-like nature and may be produced by a wide variety of sources. In this paper, we propose a novel deep convolutional neural network for ESC tasks. Our network architecture uses stacked convolutional and pooling layers to extract high-level feature representations from spectrogram-like features. Furthermore, we apply mixup to ESC tasks and explore its impact on classification performance and feature distribution. Experiments were conducted on the UrbanSound8K, ESC-50 and ESC-10 datasets. The results demonstrate that our ESC system achieves state-of-the-art performance (83.7%) on UrbanSound8K and competitive performance on ESC-50 and ESC-10.
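The mixup augmentation mentioned above interpolates pairs of training examples and their labels with a coefficient drawn from a Beta distribution. A minimal sketch of that idea for batches of spectrogram-like features is shown below; the function name, the one-hot label format, and the default `alpha=0.2` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def mixup_batch(x, y, alpha=0.2, rng=None):
    """Mix a batch of features and one-hot labels (hypothetical helper).

    x: (batch, ...) feature array, e.g. log-mel spectrograms.
    y: (batch, num_classes) one-hot label array.
    alpha: Beta-distribution parameter controlling interpolation strength.
    """
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)        # mixing coefficient lambda ~ Beta(alpha, alpha)
    perm = rng.permutation(len(x))      # pair each example with a shuffled partner
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y + (1.0 - lam) * y[perm]
    return x_mixed, y_mixed
```

Because the labels are interpolated with the same coefficient as the features, each mixed label row remains a valid probability distribution, which is what lets a standard cross-entropy loss be applied unchanged.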

Keywords

Environmental sound classification · Convolutional neural network · Mixup

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Zhichao Zhang (1)
  • Shugong Xu (1)
  • Shan Cao (1)
  • Shunqing Zhang (1)
  1. Shanghai Institute for Advanced Communication and Data Science, Shanghai University, Shanghai, China