Abstract
Audio source separation is addressed using time–frequency filtering and conditional adversarial networks. First, pitch tracks in the mixed audio are estimated with a multi-pitch tracking algorithm, and a binary mask is generated for each pitch track. Time–frequency filtering is then applied to the spectrogram of the input audio using the generated binary masks, and the filtered spectrogram is enhanced with conditional adversarial networks. Individual audio sources are reconstructed from the refined spectrogram using the mixed-signal phase. Performance is assessed through objective and subjective evaluation and compared with that of the frequency-domain deep clustering model and the time-domain Conv-TasNet model. The proposed model performs competitively with these baselines.
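The mask-and-filter front end described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses `scipy.signal.stft`/`istft`, an oracle constant pitch track standing in for the multi-pitch tracker, a hypothetical `harmonic_binary_mask` helper, and it omits the conditional-GAN enhancement stage. Note that reconstruction reuses the mixture phase, as in the paper.

```python
import numpy as np
from scipy.signal import stft, istft

def harmonic_binary_mask(freqs, f0_track, n_harmonics=10, tol_hz=40.0):
    """Binary T-F mask: 1 where a frequency bin lies near a harmonic
    of the frame's estimated pitch, 0 elsewhere."""
    mask = np.zeros((len(freqs), len(f0_track)))
    for t, f0 in enumerate(f0_track):
        if f0 <= 0:  # unvoiced/untracked frame: pass nothing
            continue
        harmonics = f0 * np.arange(1, n_harmonics + 1)
        # distance from each bin to its nearest harmonic
        dist = np.abs(freqs[:, None] - harmonics[None, :]).min(axis=1)
        mask[:, t] = dist < tol_hz
    return mask

fs = 16000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 220 * t)       # harmonic source at 220 Hz
interferer = np.sin(2 * np.pi * 700 * t)   # tone off the 220 Hz harmonic grid
mix = target + interferer

freqs, _, Z = stft(mix, fs=fs, nperseg=1024)
f0_track = np.full(Z.shape[1], 220.0)      # stand-in for the multi-pitch tracker
mask = harmonic_binary_mask(freqs, f0_track)

# Filter the magnitude spectrogram; keep the MIXTURE phase for reconstruction
filtered = mask * np.abs(Z) * np.exp(1j * np.angle(Z))
_, estimate = istft(filtered, fs=fs, nperseg=1024)
```

In the full system, the filtered spectrogram would be passed through the adversarial enhancement network before the inverse STFT; here the masked spectrogram is inverted directly, so `estimate` retains the target tone while suppressing the interferer.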
Data Availability
The datasets analysed in this manuscript are publicly available.
References
D. Barry, G. Kearney, Localization quality assessment in source separation-based upmixing algorithms, in AES 35th International Conference (2009), pp. 2391–2395
P. Comon, Independent component analysis, a new concept? Signal Process. 36(3), 287–314 (1994). https://doi.org/10.1016/0165-1684(94)90029-9
C. Donahue, B. Li, R. Prabhavalkar, Exploring Speech Enhancement with Generative Adversarial Networks for robust Speech Recognition, in International Conference on Acoustics, Speech and Signal Processing (2018), pp. 5024–5028
C. Donahue, J. McAuley, M. Puckette, Adversarial audio synthesis, in Proceedings of ICLR (2019), pp. 1–16
C. Donahue, J.J. McAuley, M.S. Puckette, Synthesizing audio with generative adversarial networks, CoRR, vol. abs/1802.04208 (2018). [Online]. Available: http://arxiv.org/abs/1802.04208
Z. Duan, B. Pardo, C. Zhang, Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions. IEEE Trans. Audio Speech Lang. Process. 18(8), 2121–2133 (2010)
Z. Duan, J. Han, B. Pardo, Multi-pitch streaming of harmonic sound mixtures. IEEE/ACM Trans. Audio Speech Lang. Process. 22(1), 138–150 (2014)
Z.-C. Fan, Y.-L. Lai, J.-S.R. Jang, SVSGAN: Singing Voice Separation Via Generative Adversarial Network, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018), pp. 726–730. https://doi.org/10.1109/ICASSP.2018.8462091
C. Févotte, J. Idier, Algorithms for nonnegative matrix factorization with the β-divergence. Neural Comput. 23(9), 2421–2456 (2011)
I.J. Goodfellow et al., Generative adversarial nets. Adv. Neural Inform. Process. Syst. 27, 2672–2680 (2014)
M. Gover, Score-Informed Source Separation of Choral Music (McGill University, Thesis submitted to Department of Music Research Schulich School of Music, 2019)
E.M. Grais, M.U. Sen, H. Erdogan, Deep neural networks for single channel source separation, in IEEE International Conference on Acoustics, Speech and Signal Processing (2014), pp. 3734–3738
GTZAN Dataset: Music Genre Classification. https://www.kaggle.com/andradaolteanu/gtzan-dataset-music-genre-classification. Accessed 04 Jan 2022
J.R. Hershey, Z. Chen, J. Le Roux, S. Watanabe, Deep clustering: discriminative embeddings for segmentation and separation, in IEEE International Conference on Acoustics, Speech and Signal Processing (2016), pp. 31–35
Y. Ikemiya, K. Itoyama, K. Yoshii, Singing voice separation and vocal f0 estimation based on mutual combination of robust principal component analysis and subharmonic summation. IEEE/ACM Trans. Audio Speech Lang. Process. 24(11), 2084–2095 (2016)
S. Inoue, H. Kameoka, L. Li, S. Makino, Sepnet: a deep separation matrix prediction network for multichannel audio source separation, in ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021), pp. 191–195
P. Isola, J.-Y. Zhu, T. Zhou, A.A. Efros, Image-to-image translation with conditional adversarial networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), pp. 5967–5976
L. Le Magoarou, A. Ozerov, N.Q.K. Duong, Text-informed audio source separation using nonnegative matrix partial co-factorization, in 2013 IEEE International Workshop on Machine Learning for Signal Processing (MLSP) (2013), pp. 1–6. https://doi.org/10.1109/MLSP.2013.6661995
J. Le Roux, F.J. Weninger, J.R. Hershey, Sparse NMF half-baked or well done? Tech. Rep. TR2015-023 (MERL, Cambridge, 2015)
H. Li, S. Fu, Y. Tsao, J. Yamagishi, iMetricGAN: intelligibility enhancement for speech-in-noise using generative adversarial network-based metric learning, in Proceedings of Interspeech 2020, Shanghai, China, October 25–29 (2020), pp. 1336–1340
L. Li, H. Kameoka, S. Makino, Determined audio source separation with multichannel star generative adversarial network, in 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP) (2020), pp. 1–6. https://doi.org/10.1109/MLSP49062.2020.9231555
Y. Luo, N. Mesgarani, Real-time single-channel dereverberation and separation with time-domain audio separation network, in Proceedings of Interspeech (2018), pp. 342–346
Y. Luo, N. Mesgarani, TasNet: time-domain audio separation network for real-time, single-channel speech separation, in IEEE International Conference on Acoustics, Speech and Signal Processing (2018), pp. 696–700
Y. Luo, N. Mesgarani, Conv-TasNet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 27(8), 1256–1266 (2019)
J.H. McDermott, The cocktail party problem. Curr. Biol. 19(22), 1024–1027 (2009)
M. Mirza, S. Osindero, Conditional generative adversarial nets (2014). arXiv:1411.1784
B. Nasersharif, S. Abdali, Speech/music separation using non-negative matrix factorization with combination of cost functions, in The International Symposium on Artificial Intelligence and Signal Processing (2015), pp. 107–111
V. Panayotov, G. Chen, D. Povey, S. Khudanpur, Librispeech: an ASR corpus based on public domain audio books, in IEEE International Conference on Acoustics, Speech and Signal Processing (2015), pp. 5206–5210
S. Pascual, A. Bonafonte, J. Serrà, SEGAN: speech enhancement generative adversarial network, in Proceedings of Interspeech (2017), pp. 3642–3646
H. Phan et al., Improving GANs for speech enhancement. IEEE Signal Process. Lett. 27, 1700–1704 (2020). https://doi.org/10.1109/LSP.2020.3025020
Z. Rafii, A. Liutkus, F.-R. Stöter, S.I. Mimilakis, D. FitzGerald, B. Pardo, An overview of lead and accompaniment separation in music. IEEE/ACM Trans. Audio Speech Lang. Process. 26(8), 1307–1335 (2018). https://doi.org/10.1109/TASLP.2018.2825440
O. Ronneberger, P. Fischer, T. Brox, U-net: convolutional networks for biomedical image segmentation. Med. Image Comput. Comput. Assist. Intervent. 9351, 234–241 (2015)
D. Stoller, S. Ewert, S. Dixon, Adversarial semi-supervised audio source separation applied to singing voice extraction, in IEEE International Conference on Acoustics, Speech and Signal Processing (2018), pp. 2391–2395
F.-R. Stöter, A. Liutkus, N. Ito, The 2018 signal separation evaluation campaign, in Y. Deville, S. Gannot, R. Mason, M.D. Plumbley, D. Ward (eds.), 14th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA 2018) (2018), pp. 293–305
Y.C. Subakan, P. Smaragdis, Generative adversarial source separation, in IEEE International Conference on Acoustics, Speech and Signal Processing (2018), pp. 26–30
The LJ Speech Dataset, https://keithito.com/LJ-Speech-Dataset/
E. Tzinis, Z. Wang, P. Smaragdis, Sudo rm -rf: efficient networks for universal audio source separation, in IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP) (2020), pp. 1–6
S. Uhlich, F. Giron, Y. Mitsufuji, Deep neural network based instrument extraction from music, in IEEE International Conference on Acoustics, Speech and Signal Processing (2015), pp. 2135–2139
Z.-Q. Wang, J.L. Roux, J.R. Hershey, Alternative objective functions for deep clustering, in IEEE International Conference on Acoustics, Speech and Signal Processing (2018), pp. 686–690
O. Yilmaz, S. Rickard, Blind separation of speech mixtures via time–frequency masking. IEEE Trans. Signal Process. 52(7), 1830–1847 (2004)
J.-Y. Zhu, T. Park, P. Isola, A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, in Proceedings of IEEE International Conference on Computer Vision (2017), pp. 2242–2251
Acknowledgements
The authors thank their lab-mates for their help in conducting the subjective evaluation tests.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Joseph, S., Rajan, R. Cycle GAN-Based Audio Source Separation Using Time–Frequency Masking. Circuits Syst Signal Process 42, 1163–1180 (2023). https://doi.org/10.1007/s00034-022-02178-1