
Environmental Sound Classification Based on Multi-temporal Resolution Convolutional Neural Network Combining with Multi-level Features

Conference paper in: Advances in Multimedia Information Processing – PCM 2018 (PCM 2018)

Part of the book series: Lecture Notes in Computer Science, volume 11165

Abstract

Motivated by the fact that the characteristics of different sound classes are highly diverse across temporal scales and hierarchical levels, a novel deep convolutional neural network (CNN) architecture is proposed for environmental sound classification. The network takes raw waveforms as input and employs a set of separate, parallel CNNs with different convolutional filter sizes and strides to learn feature representations at multiple temporal resolutions. In addition, the architecture aggregates hierarchical features from multiple CNN layers for classification via direct connections between convolutional layers, going beyond the single-level CNN features employed by the majority of previous studies. These direct connections also improve the flow of information and mitigate the vanishing-gradient problem, and the combination of multi-level features boosts classification performance significantly. Comparative experiments are conducted on two datasets: the environmental sound classification dataset ESC-50 and the DCASE 2017 acoustic scene classification dataset. The results demonstrate that the proposed method, by exploiting multi-temporal-resolution and multi-level features, is highly effective and outperforms previous methods that account only for single-level features.
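The two ideas in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the branch configurations, filter values, and pooling choices below are illustrative assumptions. Each parallel branch convolves the raw waveform with a different filter size and stride (multi-temporal resolution), and each branch contributes pooled features from more than one depth level (multi-level features), which are concatenated before classification.

```python
import numpy as np

def conv1d(x, kernel, stride):
    """Valid 1-D convolution (single channel) with the given stride."""
    k = len(kernel)
    n = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], kernel) for i in range(n)])

def branch_features(wave, kernel_size, stride, rng):
    """One parallel branch: convolution at a fixed temporal resolution,
    returning globally pooled features from two depth levels.
    Random filters stand in for learned weights (illustrative only)."""
    k1 = rng.standard_normal(kernel_size) / kernel_size
    level1 = np.maximum(conv1d(wave, k1, stride), 0.0)   # conv + ReLU, level 1
    k2 = rng.standard_normal(3) / 3
    level2 = np.maximum(conv1d(level1, k2, 1), 0.0)      # conv + ReLU, level 2
    # global average pooling at EACH level, then both are kept (multi-level)
    return np.array([level1.mean(), level2.mean()])

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)  # 1 s of raw waveform at 16 kHz
# three parallel branches with different filter sizes/strides
# -> features at three temporal resolutions (sizes chosen for illustration)
configs = [(8, 4), (32, 16), (128, 64)]
feats = np.concatenate([branch_features(wave, k, s, rng) for k, s in configs])
print(feats.shape)  # concatenated multi-branch, multi-level feature vector
```

In a trained network the filters would be learned and the concatenated vector would feed a fully connected classifier; the sketch only shows how differing filter sizes/strides yield different temporal resolutions and how features from more than one layer can be aggregated.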



Acknowledgments

This study was funded by the National Basic Research Program of China (973) under Grant No. 2014CB340303 and by the Scientific Research Project of NUDT (No. ZK17-03-31).

Author information

Corresponding author: Dezhi Wang.


Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Zhu, B., Xu, K., Wang, D., Zhang, L., Li, B., Peng, Y. (2018). Environmental Sound Classification Based on Multi-temporal Resolution Convolutional Neural Network Combining with Multi-level Features. In: Hong, R., Cheng, W.H., Yamasaki, T., Wang, M., Ngo, C.W. (eds) Advances in Multimedia Information Processing – PCM 2018. Lecture Notes in Computer Science, vol 11165. Springer, Cham. https://doi.org/10.1007/978-3-030-00767-6_49


  • DOI: https://doi.org/10.1007/978-3-030-00767-6_49

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00766-9

  • Online ISBN: 978-3-030-00767-6

  • eBook Packages: Computer Science, Computer Science (R0)
