Abstract
Motivated by the fact that the characteristics of different sound classes vary widely across temporal scales and hierarchical levels, a novel deep convolutional neural network (CNN) architecture is proposed for environmental sound classification. The network takes raw waveforms as input and employs a set of parallel CNNs with different convolutional filter sizes and strides to learn feature representations at multiple temporal resolutions. In addition, the architecture aggregates hierarchical features from multiple CNN layers for classification via direct connections between convolutional layers, going beyond the single-level CNN features used by the majority of previous studies. These connections also improve the flow of information and mitigate the vanishing gradient problem, and the combination of multi-level features boosts classification performance significantly. Comparative experiments are conducted on two datasets: the Environmental Sound Classification dataset (ESC-50) and the DCASE 2017 audio scene classification dataset. The results demonstrate that the proposed method is highly effective, thanks to its multi-temporal-resolution and multi-level features, and that it outperforms previous methods that account for only single-level features.
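To make the two core ideas concrete, the following is a minimal PyTorch sketch of (a) parallel 1-D convolutional branches with different filter sizes and strides applied to the raw waveform, and (b) aggregation of features pooled from multiple network depths for classification. The specific kernel sizes, strides, channel counts, and pooling lengths are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn


class MultiResCNN(nn.Module):
    """Sketch of a multi-temporal-resolution CNN with multi-level
    feature aggregation (hyperparameters are hypothetical)."""

    def __init__(self, n_classes: int = 50):
        super().__init__()
        # Parallel front-end branches: (kernel_size, stride) pairs give
        # fine, medium, and coarse temporal resolutions on the waveform.
        self.branches = nn.ModuleList(
            nn.Conv1d(1, 16, kernel_size=k, stride=s)
            for k, s in [(8, 4), (32, 16), (128, 64)]
        )
        # Pool every branch to a common length so channels can be concatenated.
        self.align = nn.AdaptiveMaxPool1d(256)
        # Two deeper convolutional blocks; their outputs are also tapped
        # for the multi-level aggregation below.
        self.block1 = nn.Sequential(
            nn.Conv1d(48, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(4)
        )
        self.block2 = nn.Sequential(
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(4)
        )
        # Classifier over concatenated per-level global features (48 + 64 + 128).
        self.fc = nn.Linear(240, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, samples) raw waveform.
        feats = torch.cat(
            [self.align(torch.relu(b(x))) for b in self.branches], dim=1
        )
        h1 = self.block1(feats)
        h2 = self.block2(h1)
        # Multi-level aggregation: global max-pool each level, then concatenate,
        # so shallow and deep features both reach the classifier directly.
        pooled = [f.max(dim=2).values for f in (feats, h1, h2)]
        return self.fc(torch.cat(pooled, dim=1))


model = MultiResCNN(n_classes=50)
logits = model(torch.randn(2, 1, 22050))  # two 1-second clips at 22.05 kHz
```

The direct connections from each level to the classifier are what shorten the gradient path, in the same spirit as the skip connections of residual and densely connected networks cited by the paper.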
Acknowledgments
This study was funded by the National Basic Research Program of China (973) under Grant No. 2014CB340303 and by the Scientific Research Project of NUDT (No. ZK17-03-31).
Copyright information
© 2018 Springer Nature Switzerland AG
Cite this paper
Zhu, B., Xu, K., Wang, D., Zhang, L., Li, B., Peng, Y. (2018). Environmental Sound Classification Based on Multi-temporal Resolution Convolutional Neural Network Combining with Multi-level Features. In: Hong, R., Cheng, WH., Yamasaki, T., Wang, M., Ngo, CW. (eds) Advances in Multimedia Information Processing – PCM 2018. PCM 2018. Lecture Notes in Computer Science(), vol 11165. Springer, Cham. https://doi.org/10.1007/978-3-030-00767-6_49
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00766-9
Online ISBN: 978-3-030-00767-6