
Multi-Resolution Fully Convolutional Neural Networks for Monaural Audio Source Separation

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10891)

Abstract

In deep neural networks with convolutional layers, all the neurons in each layer typically have receptive fields (RFs) of the same size, and hence the same resolution. Convolutional layers whose neurons have large RFs capture global information from the input features, while layers whose neurons have small RFs capture local details at high resolution. In this work, we introduce novel deep multi-resolution fully convolutional neural networks (MR-FCN), where each layer has a range of neurons with different RF sizes, extracting multi-resolution features that capture both the global and local information in its input. The proposed MR-FCN is applied to separate the singing voice from mixtures of music sources. Experimental results show that MR-FCN improves performance on the audio source separation problem compared to feedforward deep neural networks (DNNs) and single-resolution deep fully convolutional neural networks (FCNs).
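The core idea above, one layer containing parallel filters with different receptive-field sizes whose feature maps are stacked together, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the kernel sizes, the random filters, and the function names are illustrative assumptions.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Naive single-channel 2-D convolution with 'same' zero padding."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel)
    return out

def multi_resolution_layer(x, kernel_sizes=(3, 5, 9), rng=None):
    """One MR-style layer: apply parallel convolutions with different
    receptive-field sizes, then stack the maps along a channel axis.
    Small kernels keep local detail; large kernels see more context."""
    rng = rng or np.random.default_rng(0)
    maps = []
    for k in kernel_sizes:
        kernel = rng.standard_normal((k, k)) / k  # one random filter per RF size
        maps.append(conv2d_same(x, kernel))
    return np.stack(maps, axis=-1)  # shape: (H, W, number of RF sizes)

spectrogram = np.random.default_rng(1).random((16, 16))  # toy spectrogram patch
features = multi_resolution_layer(spectrogram)
print(features.shape)  # (16, 16, 3)
```

In a trainable network the random filters would be learned weights, and such layers would be stacked to form the full MR-FCN; here the point is only the shape of the computation: every spatial position receives features computed at several resolutions at once.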



Acknowledgement

This work is supported by grant EP/L027119/2 from the UK Engineering and Physical Sciences Research Council (EPSRC).

Author information


Correspondence to Emad M. Grais.


Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper


Cite this paper

Grais, E.M., Wierstorf, H., Ward, D., Plumbley, M.D. (2018). Multi-Resolution Fully Convolutional Neural Networks for Monaural Audio Source Separation. In: Deville, Y., Gannot, S., Mason, R., Plumbley, M., Ward, D. (eds) Latent Variable Analysis and Signal Separation. LVA/ICA 2018. Lecture Notes in Computer Science, vol. 10891. Springer, Cham. https://doi.org/10.1007/978-3-319-93764-9_32


  • DOI: https://doi.org/10.1007/978-3-319-93764-9_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-93763-2

  • Online ISBN: 978-3-319-93764-9
