A Lightweight Music Source Separation Model with Graph Convolution Network

Conference paper in Man-Machine Speech Communication (NCMMSC 2023)

Abstract

With the rapid advancement of deep neural networks, the performance of music source separation methods has improved significantly. However, most methods focus primarily on improving separation quality while ignoring model size, which matters in real-world environments. For such real-world applications, we propose a lightweight network combined with a Graph Convolutional Network Attention (GCN_A) module for Music Source Separation (G-MSS), comprising one Encoder and four Decoders, each of which outputs one target music source. The G-MSS network is trained with both time-domain and frequency-domain L1 losses. An ablation study verifies the effectiveness of the designed GCN_A module and the multiple Decoders, and we also provide a visualization analysis of the main components of the G-MSS network. Compared with 13 other methods on the MUSDB18 dataset, the proposed G-MSS achieves comparable separation performance while using fewer parameters.
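To make the abstract's description concrete, below is a minimal PyTorch sketch of the two ideas it names: a shared Encoder feeding four Decoders (one per MUSDB18 stem: vocals, drums, bass, other) through a GCN-based attention gate, and a loss that sums a time-domain and a frequency-domain L1 term. Every detail here (layer shapes, the soft-adjacency GCN_A formulation, the STFT settings n_fft=2048 / hop=512) is an illustrative assumption, not the authors' implementation.

```python
# Illustrative sketch only -- hypothetical shapes and settings, not the
# paper's implementation. Shows: one shared Encoder, a GCN-based attention
# gate (GCN_A), four Decoders (one per MUSDB18 stem), and the combined
# time-/frequency-domain L1 loss described in the abstract.
import torch
import torch.nn as nn


class GCNAttention(nn.Module):
    """Hypothetical GCN attention gate: one graph-convolution step over a
    soft frame-to-frame adjacency, used as a sigmoid gate on the features."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels)
        adj = torch.softmax(x @ x.transpose(1, 2), dim=-1)  # soft adjacency A
        h = torch.relu(self.proj(adj @ x))                  # GCN step: A X W
        return x * torch.sigmoid(h)                         # attention gating


class GMSS(nn.Module):
    """One shared Encoder, four Decoders -- one per target stem."""

    STEMS = ("vocals", "drums", "bass", "other")  # the MUSDB18 targets

    def __init__(self, channels: int = 64):
        super().__init__()
        self.encoder = nn.Conv1d(2, channels, kernel_size=16, stride=8, padding=4)
        self.attention = GCNAttention(channels)
        self.decoders = nn.ModuleDict({
            stem: nn.ConvTranspose1d(channels, 2, kernel_size=16, stride=8, padding=4)
            for stem in self.STEMS
        })

    def forward(self, mix: torch.Tensor) -> dict[str, torch.Tensor]:
        # mix: (batch, 2, samples); samples divisible by 8 keeps lengths exact
        z = torch.relu(self.encoder(mix))                   # (batch, C, frames)
        z = self.attention(z.transpose(1, 2)).transpose(1, 2)
        return {stem: dec(z) for stem, dec in self.decoders.items()}


def dual_domain_l1(est: torch.Tensor, ref: torch.Tensor,
                   n_fft: int = 2048, hop: int = 512) -> torch.Tensor:
    """Time-domain L1 plus L1 on STFT magnitudes (settings are assumptions)."""
    window = torch.hann_window(n_fft, device=est.device)

    def mag(x: torch.Tensor) -> torch.Tensor:
        flat = x.reshape(-1, x.shape[-1])  # fold batch/channel dims for stft
        return torch.stft(flat, n_fft, hop_length=hop, window=window,
                          return_complex=True).abs()

    return (est - ref).abs().mean() + (mag(est) - mag(ref)).abs().mean()
```

Under these assumptions, a training step would sum dual_domain_l1 over the four stem estimates returned by GMSS.forward, one term per target source.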

Acknowledgements

This work is supported by the Multi-lingual Information Technology Research Center of Xinjiang (ZDI145-21).

Author information

Correspondence to Ying Hu.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Zhu, M., Wang, L., Hu, Y. (2024). A Lightweight Music Source Separation Model with Graph Convolution Network. In: Jia, J., Ling, Z., Chen, X., Li, Y., Zhang, Z. (eds) Man-Machine Speech Communication. NCMMSC 2023. Communications in Computer and Information Science, vol 2006. Springer, Singapore. https://doi.org/10.1007/978-981-97-0601-3_3

  • DOI: https://doi.org/10.1007/978-981-97-0601-3_3

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-0600-6

  • Online ISBN: 978-981-97-0601-3

  • eBook Packages: Computer Science (R0)
