Abstract
In deep neural networks with convolutional layers, all the neurons in a given layer typically have receptive fields (RFs) of the same size and resolution. Neurons with large RFs capture global information from the input features, while neurons with small RFs capture fine local detail at high resolution. In this work, we introduce novel deep multi-resolution fully convolutional neural networks (MR-FCNs), in which each layer contains neurons with a range of RF sizes, extracting multi-resolution features that capture both the global and the local information in that layer's input. We apply the proposed MR-FCN to separating the singing voice from mixtures of music sources. Experimental results show that the MR-FCN outperforms both feedforward deep neural networks (DNNs) and single-resolution deep fully convolutional neural networks (FCNs) on this audio source separation task.
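The core architectural idea, convolving the same input with filters of several receptive-field sizes in parallel and stacking the resulting feature maps, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the kernel sizes (3, 5, 7), the random filter initialisation, and the function names are illustrative assumptions.

```python
import numpy as np

def conv2d_same(x, kernel):
    """2-D correlation with zero padding so the output matches the input size."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel)
    return out

def multi_resolution_layer(x, kernel_sizes=(3, 5, 7), rng=None):
    """Apply one randomly initialised filter per receptive-field size and
    stack the feature maps: small kernels keep local detail, large kernels
    summarise wider context -- the multi-resolution idea in the abstract."""
    rng = np.random.default_rng(0) if rng is None else rng
    maps = []
    for k in kernel_sizes:
        kernel = rng.standard_normal((k, k)) / k   # one filter per RF size
        maps.append(np.maximum(conv2d_same(x, kernel), 0.0))  # ReLU
    return np.stack(maps)  # shape: (num_resolutions, H, W)

# Toy "spectrogram" input of 16 frames x 16 frequency bins.
spectrogram = np.random.default_rng(1).random((16, 16))
features = multi_resolution_layer(spectrogram)
print(features.shape)  # (3, 16, 16)
```

In a full network such a layer would use many learned filters per RF size, with the stacked maps fed to the next layer; here a single filter per size keeps the sketch short.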
Acknowledgement
This work is supported by grant EP/L027119/2 from the UK Engineering and Physical Sciences Research Council (EPSRC).
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
Cite this paper
Grais, E.M., Wierstorf, H., Ward, D., Plumbley, M.D. (2018). Multi-Resolution Fully Convolutional Neural Networks for Monaural Audio Source Separation. In: Deville, Y., Gannot, S., Mason, R., Plumbley, M., Ward, D. (eds.) Latent Variable Analysis and Signal Separation. LVA/ICA 2018. Lecture Notes in Computer Science, vol. 10891. Springer, Cham. https://doi.org/10.1007/978-3-319-93764-9_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93763-2
Online ISBN: 978-3-319-93764-9