Abstract
Motivated by the fact that the characteristics of different sound classes vary widely across temporal scales and hierarchical levels, a novel deep convolutional neural network (CNN) architecture is proposed for environmental sound classification. The network takes raw waveforms as input and employs a set of parallel CNNs with different convolutional filter sizes and strides to learn feature representations at multiple temporal resolutions. In addition, the architecture aggregates hierarchical features from multiple CNN layers for classification via direct connections between convolutional layers, going beyond the single-level CNN features used by the majority of previous studies. These connections also improve the flow of information and mitigate the vanishing gradient problem, and the combination of multi-level features boosts classification performance significantly. Comparative experiments are conducted on two datasets: the Environmental Sound Classification dataset (ESC-50) and the DCASE 2017 audio scene classification dataset. The results demonstrate that the proposed method is highly effective, thanks to its multi-temporal-resolution and multi-level features, and that it outperforms previous methods that account for only single-level features.
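To make the two core ideas concrete, the following is a minimal PyTorch sketch of (a) parallel 1-D convolutional branches with different filter sizes and strides applied to the raw waveform, and (b) aggregation of features pooled from multiple network depths for classification. The specific kernel sizes, strides, channel counts, and pooling lengths are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn


class MultiResCNN(nn.Module):
    """Sketch of a multi-temporal-resolution CNN with multi-level
    feature aggregation (hyperparameters are hypothetical)."""

    def __init__(self, n_classes: int = 50):
        super().__init__()
        # Parallel front-end branches: (kernel_size, stride) pairs give
        # fine, medium, and coarse temporal resolutions on the waveform.
        self.branches = nn.ModuleList(
            nn.Conv1d(1, 16, kernel_size=k, stride=s)
            for k, s in [(8, 4), (32, 16), (128, 64)]
        )
        # Pool every branch to a common length so channels can be concatenated.
        self.align = nn.AdaptiveMaxPool1d(256)
        # Two deeper convolutional blocks; their outputs are also tapped
        # for the multi-level aggregation below.
        self.block1 = nn.Sequential(
            nn.Conv1d(48, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(4)
        )
        self.block2 = nn.Sequential(
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(4)
        )
        # Classifier over concatenated per-level global features (48 + 64 + 128).
        self.fc = nn.Linear(240, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, samples) raw waveform.
        feats = torch.cat(
            [self.align(torch.relu(b(x))) for b in self.branches], dim=1
        )
        h1 = self.block1(feats)
        h2 = self.block2(h1)
        # Multi-level aggregation: global max-pool each level, then concatenate,
        # so shallow and deep features both reach the classifier directly.
        pooled = [f.max(dim=2).values for f in (feats, h1, h2)]
        return self.fc(torch.cat(pooled, dim=1))


model = MultiResCNN(n_classes=50)
logits = model(torch.randn(2, 1, 22050))  # two 1-second clips at 22.05 kHz
```

The direct connections from each level to the classifier are what shorten the gradient path, in the same spirit as the skip connections of residual and densely connected networks cited by the paper.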
Acknowledgments
This study was funded by the National Basic Research Program of China (973) under Grant No. 2014CB340303 and by the Scientific Research Project of NUDT (No. ZK17-03-31).
Copyright information
© 2018 Springer Nature Switzerland AG
Cite this paper
Zhu, B., Xu, K., Wang, D., Zhang, L., Li, B., Peng, Y. (2018). Environmental Sound Classification Based on Multi-temporal Resolution Convolutional Neural Network Combining with Multi-level Features. In: Hong, R., Cheng, WH., Yamasaki, T., Wang, M., Ngo, CW. (eds) Advances in Multimedia Information Processing – PCM 2018. PCM 2018. Lecture Notes in Computer Science(), vol 11165. Springer, Cham. https://doi.org/10.1007/978-3-030-00767-6_49
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00766-9
Online ISBN: 978-3-030-00767-6