Abstract
In deep neural networks with convolutional layers, all the neurons in a given layer typically have receptive fields (RFs) of the same size and resolution. Neurons with large RFs capture global information from the input features, while neurons with small RFs capture fine local detail at high resolution. In this work, we introduce novel deep multi-resolution fully convolutional neural networks (MR-FCNs), in which each layer contains neurons with a range of RF sizes, extracting multi-resolution features that capture both the global and the local information in that layer's input. We apply the proposed MR-FCN to separating the singing voice from mixtures of music sources. Experimental results show that the MR-FCN outperforms both feedforward deep neural networks (DNNs) and single-resolution deep fully convolutional neural networks (FCNs) on this audio source separation task.
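The core architectural idea, convolving the same input with filters of several receptive-field sizes in parallel and stacking the resulting feature maps, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the kernel sizes (3, 5, 7), the random filter initialisation, and the function names are illustrative assumptions.

```python
import numpy as np

def conv2d_same(x, kernel):
    """2-D correlation with zero padding so the output matches the input size."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel)
    return out

def multi_resolution_layer(x, kernel_sizes=(3, 5, 7), rng=None):
    """Apply one randomly initialised filter per receptive-field size and
    stack the feature maps: small kernels keep local detail, large kernels
    summarise wider context -- the multi-resolution idea in the abstract."""
    rng = np.random.default_rng(0) if rng is None else rng
    maps = []
    for k in kernel_sizes:
        kernel = rng.standard_normal((k, k)) / k   # one filter per RF size
        maps.append(np.maximum(conv2d_same(x, kernel), 0.0))  # ReLU
    return np.stack(maps)  # shape: (num_resolutions, H, W)

# Toy "spectrogram" input of 16 frames x 16 frequency bins.
spectrogram = np.random.default_rng(1).random((16, 16))
features = multi_resolution_layer(spectrogram)
print(features.shape)  # (3, 16, 16)
```

In a full network such a layer would use many learned filters per RF size, with the stacked maps fed to the next layer; here a single filter per size keeps the sketch short.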
Acknowledgement
This work is supported by grant EP/L027119/2 from the UK Engineering and Physical Sciences Research Council (EPSRC).
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
Cite this paper
Grais, E.M., Wierstorf, H., Ward, D., Plumbley, M.D. (2018). Multi-Resolution Fully Convolutional Neural Networks for Monaural Audio Source Separation. In: Deville, Y., Gannot, S., Mason, R., Plumbley, M., Ward, D. (eds.) Latent Variable Analysis and Signal Separation. LVA/ICA 2018. Lecture Notes in Computer Science, vol. 10891. Springer, Cham. https://doi.org/10.1007/978-3-319-93764-9_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93763-2
Online ISBN: 978-3-319-93764-9