Abstract
With the rapid advancement of deep neural networks, music source separation methods have improved significantly. However, most of them focus primarily on separation performance while ignoring model size, which matters in real-world environments. Targeting such applications, this paper proposes a lightweight network for Music Source Separation combined with a Graph Convolutional Network Attention (GCN_A) module, named G-MSS, which consists of an Encoder and four Decoders, each outputting one target music source. The G-MSS network is trained with both time-domain and frequency-domain L1 losses. An ablation study verifies the effectiveness of the designed GCN_A module and the multiple Decoders, and a visualization analysis of the main components of the G-MSS network is also provided. Compared with 13 other methods on the MUSDB18 dataset, the proposed G-MSS achieves comparable separation performance with a lower parameter count.
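The abstract states that G-MSS is trained with both a time-domain and a frequency-domain L1 loss. A minimal NumPy sketch of such a combined objective is shown below; the weighting factor `alpha` and the use of magnitude spectra are assumptions for illustration, since the abstract does not give the exact formulation.

```python
import numpy as np

def separation_loss(est, ref, alpha=0.5):
    """Combined time- and frequency-domain L1 loss (illustrative sketch).

    `est` and `ref` are 1-D waveforms of equal length; `alpha` is a
    hypothetical weight between the two terms.
    """
    # Time-domain L1: mean absolute error on the raw waveforms.
    l_time = np.mean(np.abs(est - ref))
    # Frequency-domain L1: mean absolute error on magnitude spectra.
    l_freq = np.mean(np.abs(np.abs(np.fft.rfft(est)) -
                            np.abs(np.fft.rfft(ref))))
    return alpha * l_time + (1.0 - alpha) * l_freq
```

Combining the two domains penalizes both sample-level waveform errors and spectral distortions, which is a common motivation for dual-domain losses in source separation.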
Acknowledgements
This work is supported by the Multi-lingual Information Technology Research Center of Xinjiang (ZDI145-21).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Zhu, M., Wang, L., Hu, Y. (2024). A Lightweight Music Source Separation Model with Graph Convolution Network. In: Jia, J., Ling, Z., Chen, X., Li, Y., Zhang, Z. (eds) Man-Machine Speech Communication. NCMMSC 2023. Communications in Computer and Information Science, vol 2006. Springer, Singapore. https://doi.org/10.1007/978-981-97-0601-3_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0600-6
Online ISBN: 978-981-97-0601-3