
Cycle GAN-Based Audio Source Separation Using Time–Frequency Masking


Abstract

Audio source separation is addressed using time–frequency filtering and conditional adversarial networks. First, pitch tracks in the mixed audio are estimated with a multi-pitch tracking algorithm, and a binary mask is generated for each pitch track. Time–frequency filtering is then applied to the spectrogram of the input audio using the generated binary masks, and each filtered spectrogram is enhanced with a conditional adversarial network. Individual audio sources are reconstructed from the refined spectrograms using the mixed-signal phase. Performance is assessed through objective and subjective evaluation and compared with that of a frequency-domain deep clustering model and a time-domain Conv-TasNet model. The proposed model performs competitively with these baselines.
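
The following is a minimal sketch of the masking-and-reconstruction steps the abstract describes, assuming librosa for the STFT. The mask-generation and enhancement stages are represented by hypothetical placeholders (`masks`, `refine_with_cgan`); the paper's own multi-pitch tracker and conditional-GAN implementations are not reproduced here.

```python
# Sketch: binary time-frequency masking of a mixture spectrogram, optional
# GAN-based refinement, and reconstruction with the mixture phase.
# `masks` and `refine_with_cgan` are hypothetical stand-ins for the paper's
# multi-pitch masking and conditional adversarial enhancement stages.
import numpy as np
import librosa

N_FFT, HOP = 1024, 256

def separate(mix, masks, refine_with_cgan=None):
    """Apply one binary mask per pitch track and reconstruct each source."""
    # Complex spectrogram of the mixture.
    S = librosa.stft(mix, n_fft=N_FFT, hop_length=HOP)
    mag, phase = np.abs(S), np.angle(S)

    sources = []
    for mask in masks:                    # mask shape must match `mag`
        filt = mag * mask                 # time-frequency filtering
        if refine_with_cgan is not None:  # optional enhancement stage
            filt = refine_with_cgan(filt)
        # Reconstruct using the mixture phase, as the abstract describes.
        sources.append(librosa.istft(filt * np.exp(1j * phase),
                                     hop_length=HOP))
    return sources
```

Reusing the mixture phase is a common simplification in mask-based separation, since estimating clean-source phase is itself a hard problem; the paper refines only the magnitude spectrogram before inversion.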


Data Availability

The datasets analysed in this manuscript are publicly available.


Acknowledgements

The authors thank their lab-mates for their help in conducting the subjective evaluation tests.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Author information


Corresponding author

Correspondence to Rajeev Rajan.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


Cite this article

Joseph, S., Rajan, R. Cycle GAN-Based Audio Source Separation Using Time–Frequency Masking. Circuits Syst Signal Process 42, 1163–1180 (2023). https://doi.org/10.1007/s00034-022-02178-1

