Abstract
Audio source separation is addressed using time–frequency filtering and conditional adversarial networks. First, pitch tracks in the mixed audio are estimated with a multi-pitch tracking algorithm, and a binary mask is generated for each pitch track. Time–frequency filtering is then applied to the spectrogram of the input audio using the generated binary masks, and the filtered spectrogram is enhanced with conditional adversarial networks. Individual audio sources are reconstructed from the refined spectrogram using the mixed-signal phase. Performance is assessed through objective and subjective evaluation and compared with that of the frequency-domain deep clustering model and the time-domain Conv-TasNet model. The proposed model performs competitively with these baselines.
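The mask-and-filter front end described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses `scipy.signal.stft`/`istft`, an oracle constant pitch track standing in for the multi-pitch tracker, a hypothetical `harmonic_binary_mask` helper, and it omits the conditional-GAN enhancement stage. Note that reconstruction reuses the mixture phase, as in the paper.

```python
import numpy as np
from scipy.signal import stft, istft

def harmonic_binary_mask(freqs, f0_track, n_harmonics=10, tol_hz=40.0):
    """Binary T-F mask: 1 where a frequency bin lies near a harmonic
    of the frame's estimated pitch, 0 elsewhere."""
    mask = np.zeros((len(freqs), len(f0_track)))
    for t, f0 in enumerate(f0_track):
        if f0 <= 0:  # unvoiced/untracked frame: pass nothing
            continue
        harmonics = f0 * np.arange(1, n_harmonics + 1)
        # distance from each bin to its nearest harmonic
        dist = np.abs(freqs[:, None] - harmonics[None, :]).min(axis=1)
        mask[:, t] = dist < tol_hz
    return mask

fs = 16000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 220 * t)       # harmonic source at 220 Hz
interferer = np.sin(2 * np.pi * 700 * t)   # tone off the 220 Hz harmonic grid
mix = target + interferer

freqs, _, Z = stft(mix, fs=fs, nperseg=1024)
f0_track = np.full(Z.shape[1], 220.0)      # stand-in for the multi-pitch tracker
mask = harmonic_binary_mask(freqs, f0_track)

# Filter the magnitude spectrogram; keep the MIXTURE phase for reconstruction
filtered = mask * np.abs(Z) * np.exp(1j * np.angle(Z))
_, estimate = istft(filtered, fs=fs, nperseg=1024)
```

In the full system, the filtered spectrogram would be passed through the adversarial enhancement network before the inverse STFT; here the masked spectrogram is inverted directly, so `estimate` retains the target tone while suppressing the interferer.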
Data Availability
The datasets analysed in this manuscript are publicly available.
References
D. Barry, G. Kearney, Localization quality assessment in source separation-based upmixing algorithms, in AES 35th International Conference (2009), pp. 2391–2395
P. Comon, Independent component analysis, a new concept? Signal Process. 36(3), 287–314 (1994). https://doi.org/10.1016/0165-1684(94)90029-9
C. Donahue, B. Li, R. Prabhavalkar, Exploring Speech Enhancement with Generative Adversarial Networks for robust Speech Recognition, in International Conference on Acoustics, Speech and Signal Processing (2018), pp. 5024–5028
C. Donahue, J. McAuley, M. Puckette, Adversarial audio synthesis, in Proceedings of ICLR (2019), pp. 1–16
C. Donahue, J.J. McAuley, M.S. Puckette, Synthesizing audio with generative adversarial networks, CoRR, vol. abs/1802.04208 (2018). [Online]. Available: http://arxiv.org/abs/1802.04208
Z. Duan, B. Pardo, C. Zhang, Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions. IEEE Trans. Audio Speech Lang. Process. 18(8), 2121–2133 (2010)
Z. Duan, J. Han, B. Pardo, Multi-pitch streaming of harmonic sound mixtures. IEEE/ACM Trans. Audio Speech Lang. Process. 22(1), 138–150 (2014)
Z.-C. Fan, Y.-L. Lai, J.-S.R. Jang, SVSGAN: Singing Voice Separation Via Generative Adversarial Network, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018), pp. 726–730. https://doi.org/10.1109/ICASSP.2018.8462091
C. Févotte, J. Idier, Algorithms for nonnegative matrix factorization with the β-divergence. Neural Comput. 23(9), 2421–2456 (2011)
I.J. Goodfellow et al., Generative adversarial nets. Adv. Neural Inform. Process. Syst. 27, 2672–2680 (2014)
M. Gover, Score-Informed Source Separation of Choral Music (McGill University, Thesis submitted to Department of Music Research Schulich School of Music, 2019)
E.M. Grais, M.U. Sen, H. Erdogan, Deep neural networks for single channel source separation, in IEEE International Conference on Acoustics, Speech and Signal Processing (2014), pp. 3734–3738
GTZAN Dataset: Music Genre Classification. https://www.kaggle.com/andradaolteanu/gtzan-dataset-music-genre-classification. Accessed 04 Jan 2022
J.R. Hershey, Z. Chen, J. Le Roux, S. Watanabe, Deep clustering: discriminative embeddings for segmentation and separation, in IEEE International Conference on Acoustics, Speech and Signal Processing (2016), pp. 31–35
Y. Ikemiya, K. Itoyama, K. Yoshii, Singing voice separation and vocal f0 estimation based on mutual combination of robust principal component analysis and subharmonic summation. IEEE/ACM Trans. Audio Speech Lang. Process. 24(11), 2084–2095 (2016)
S. Inoue, H. Kameoka, L. Li, S. Makino, Sepnet: a deep separation matrix prediction network for multichannel audio source separation, in ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021), pp. 191–195
P. Isola, J.-Y. Zhu, T. Zhou, A.A. Efros, Image-to-image translation with conditional adversarial networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), pp. 5967–5976
L. Le Magoarou, A. Ozerov, N.Q.K. Duong, Text-informed audio source separation using nonnegative matrix partial co-factorization, in 2013 IEEE International Workshop on Machine Learning for Signal Processing (MLSP) (2013), pp. 1–6. https://doi.org/10.1109/MLSP.2013.6661995
J. Le Roux, F.J. Weninger, J.R. Hershey, Sparse NMF half-baked or well done? Tech. Rep. TR2015-023 (MERL, Cambridge, 2015)
H. Li, S. Fu, Y. Tsao, J. Yamagishi, iMetricGAN: intelligibility enhancement for speech-in-noise using generative adversarial network-based metric learning, in Proceedings of Interspeech 2020, Shanghai, China, October 25–29 (2020), pp. 1336–1340
L. Li, H. Kameoka, S. Makino, Determined audio source separation with multichannel star generative adversarial network, in 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP) (2020), pp. 1–6. https://doi.org/10.1109/MLSP49062.2020.9231555
Y. Luo, N. Mesgarani, Real-time single-channel dereverberation and separation with time-domain audio separation network, in Proceedings of Interspeech (2018), pp. 342–346
Y. Luo, N. Mesgarani, TasNet: time-domain audio separation network for real-time, single-channel speech separation, in IEEE International Conference on Acoustics, Speech and Signal Processing (2018), pp. 696–700
Y. Luo, N. Mesgarani, Conv-TasNet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 27(8), 1256–1266 (2019)
J.H. McDermott, The cocktail party problem. Curr. Biol. 19(22), 1024–1027 (2009)
M. Mirza, S. Osindero, Conditional generative adversarial nets (2014). arXiv:1411.1784
B. Nasersharif, S. Abdali, Speech/music separation using non-negative matrix factorization with combination of cost functions, in The International Symposium on Artificial Intelligence and Signal Processing (2015), pp. 107–111
V. Panayotov, G. Chen, D. Povey, S. Khudanpur, Librispeech: an ASR corpus based on public domain audio books, in IEEE International Conference on Acoustics, Speech and Signal Processing (2015), pp. 5206–5210
S. Pascual, A. Bonafonte, J. Serrà, SEGAN: speech enhancement generative adversarial network, in Proceedings of Interspeech (2017), pp. 3642–3646
H. Phan et al., Improving GANs for speech enhancement. IEEE Signal Process. Lett. 27, 1700–1704 (2020). https://doi.org/10.1109/LSP.2020.3025020
Z. Rafii, A. Liutkus, F.-R. Stöter, S.I. Mimilakis, D. FitzGerald, B. Pardo, An overview of lead and accompaniment separation in music. IEEE/ACM Trans. Audio Speech Lang. Process. 26(8), 1307–1335 (2018). https://doi.org/10.1109/TASLP.2018.2825440
O. Ronneberger, P. Fischer, T. Brox, U-net: convolutional networks for biomedical image segmentation. Med. Image Comput. Comput. Assist. Intervent. 9351, 234–241 (2015)
D. Stoller, S. Ewert, S. Dixon, Adversarial semi-supervised audio source separation applied to singing voice extraction, in IEEE International Conference on Acoustics, Speech and Signal Processing (2018), pp. 2391–2395
F.-R. Stöter, A. Liutkus, N. Ito, The 2018 signal separation evaluation campaign, in Y. Deville, S. Gannot, R. Mason, M.D. Plumbley, D. Ward (eds.), 14th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA 2018) (2018), pp. 293–305
Y.C. Subakan, P. Smaragdis, Generative adversarial source separation, in IEEE International Conference on Acoustics, Speech and Signal Processing (2018), pp. 26–30
The LJ Speech Dataset, https://keithito.com/LJ-Speech-Dataset/
E. Tzinis, Z. Wang, P. Smaragdis, Sudo rm -rf: efficient networks for universal audio source separation, in IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP) (2020), pp. 1–6
S. Uhlich, F. Giron, Y. Mitsufuji, Deep neural network based instrument extraction from music, in IEEE International Conference on Acoustics, Speech and Signal Processing (2015), pp. 2135–2139
Z.-Q. Wang, J.L. Roux, J.R. Hershey, Alternative objective functions for deep clustering, in IEEE International Conference on Acoustics, Speech and Signal Processing (2018), pp. 686–690
O. Yilmaz, S. Rickard, Blind separation of speech mixtures via time–frequency masking. IEEE Trans. Signal Process. 52(7), 1830–1847 (2004)
J.-Y. Zhu, T. Park, P. Isola, A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, in Proceedings of IEEE International Conference on Computer Vision (2017), pp. 2242–2251
Acknowledgements
The authors thank their lab-mates for their help in conducting the subjective evaluation tests.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Joseph, S., Rajan, R. Cycle GAN-Based Audio Source Separation Using Time–Frequency Masking. Circuits Syst Signal Process 42, 1163–1180 (2023). https://doi.org/10.1007/s00034-022-02178-1