Abstract
With the rapid advancement of deep neural networks, music source separation methods have improved significantly. However, most of them focus primarily on separation performance while ignoring model size, which matters in real-world environments. Targeting such applications, this paper proposes a lightweight network for Music Source Separation combined with a Graph Convolutional Network Attention (GCN_A) module, named G-MSS, which consists of an Encoder and four Decoders, each outputting one target music source. The G-MSS network is trained with both time-domain and frequency-domain L1 losses. An ablation study verifies the effectiveness of the designed GCN_A module and the multiple Decoders, and a visualization analysis of the main components of the G-MSS network is also provided. Compared with 13 other methods on the MUSDB18 dataset, the proposed G-MSS achieves comparable separation performance with a lower parameter count.
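The abstract states that G-MSS is trained with both a time-domain and a frequency-domain L1 loss. A minimal NumPy sketch of such a combined objective is shown below; the weighting factor `alpha` and the use of magnitude spectra are assumptions for illustration, since the abstract does not give the exact formulation.

```python
import numpy as np

def separation_loss(est, ref, alpha=0.5):
    """Combined time- and frequency-domain L1 loss (illustrative sketch).

    `est` and `ref` are 1-D waveforms of equal length; `alpha` is a
    hypothetical weight between the two terms.
    """
    # Time-domain L1: mean absolute error on the raw waveforms.
    l_time = np.mean(np.abs(est - ref))
    # Frequency-domain L1: mean absolute error on magnitude spectra.
    l_freq = np.mean(np.abs(np.abs(np.fft.rfft(est)) -
                            np.abs(np.fft.rfft(ref))))
    return alpha * l_time + (1.0 - alpha) * l_freq
```

Combining the two domains penalizes both sample-level waveform errors and spectral distortions, which is a common motivation for dual-domain losses in source separation.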
Acknowledgements
This work is supported by the Multi-lingual Information Technology Research Center of Xinjiang (ZDI145-21).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Zhu, M., Wang, L., Hu, Y. (2024). A Lightweight Music Source Separation Model with Graph Convolution Network. In: Jia, J., Ling, Z., Chen, X., Li, Y., Zhang, Z. (eds) Man-Machine Speech Communication. NCMMSC 2023. Communications in Computer and Information Science, vol 2006. Springer, Singapore. https://doi.org/10.1007/978-981-97-0601-3_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0600-6
Online ISBN: 978-981-97-0601-3