Abstract
Effectively and accurately identifying sound events in real-world noisy environments remains a challenging problem. Traditional methods for robust sound event classification generally perform well under clean conditions but degrade markedly in noisy ones. Biological evidence shows that local temporal and spectral information can be exploited when processing noise-corrupted signals, which motivates our novel approach to sound recognition: combining this local information with a convolutional neural network (CNN), one of the most widely applied methods in acoustic processing. We use key-points (KPs) to construct a robust, sparse representation of the sound, and then train a CNN as the classifier. The RWCP database is used to evaluate the performance of our system. The results show that the proposed KP-CNN system is effective and efficient for robust sound event classification in both mismatched and multi-condition environments.
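The paper's exact key-point criterion is not reproduced in this abstract. As an illustrative sketch only, one common variant of sparse KP encoding (assumed here, not taken from the paper) keeps a time-frequency bin when it is a local maximum within a small neighbourhood along both the temporal and spectral axes and exceeds a relative energy threshold; all other bins are zeroed, yielding the sparse, noise-robust representation that would then be fed to a CNN. The function name, neighbourhood `radius`, and `rel_threshold` below are illustrative choices:

```python
import numpy as np

def extract_keypoints(spec, radius=2, rel_threshold=0.1):
    """Sparse key-point encoding of a spectrogram (illustrative sketch).

    Keeps only bins that (a) exceed rel_threshold * global max and
    (b) are local maxima within a (2*radius+1)^2 time-frequency
    neighbourhood; every other bin is set to zero.
    """
    F, T = spec.shape
    mask = np.zeros_like(spec, dtype=bool)
    thr = rel_threshold * spec.max()
    for f in range(F):
        for t in range(T):
            v = spec[f, t]
            if v < thr:
                continue  # too weak to be a key-point
            f0, f1 = max(0, f - radius), min(F, f + radius + 1)
            t0, t1 = max(0, t - radius), min(T, t + radius + 1)
            if v >= spec[f0:f1, t0:t1].max():
                mask[f, t] = True
    # Zero out everything except the detected key-points.
    return np.where(mask, spec, 0.0)
```

Because only a small fraction of bins survive, broadband background noise spread across the spectrogram is largely discarded, which is the intuition behind feeding such a sparse map, rather than the raw spectrogram, to the CNN classifier.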
Acknowledgements
This work was supported by the Natural Science Foundation of China (No. 61806139, 61771333), and the Natural Science Foundation of Tianjin (No. 18JCYBJC41700).
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Yao, Y., Yu, Q., Wang, L., Dang, J. (2019). Robust Sound Event Classification with Local Time-Frequency Information and Convolutional Neural Networks. In: Tetko, I., Kůrková, V., Karpov, P., Theis, F. (eds) Artificial Neural Networks and Machine Learning – ICANN 2019: Text and Time Series. ICANN 2019. Lecture Notes in Computer Science(), vol 11730. Springer, Cham. https://doi.org/10.1007/978-3-030-30490-4_29
Print ISBN: 978-3-030-30489-8
Online ISBN: 978-3-030-30490-4