
A hybrid model for unsupervised single channel speech separation

Published in: Multimedia Tools and Applications

Abstract

The performance of any voice recognition platform in a real environment depends on how well the desired speech signal is separated from unwanted signals such as background noise or competing speakers. In this paper, we propose a three-stage hybrid model to separate two speakers from a single-channel speech mixture under unsupervised conditions. The proposed method combines three techniques: speech segmentation, NMF (Nonnegative Matrix Factorization), and masking. Speech segmentation groups the short speech frames belonging to individual speakers by identifying the speaker changeover points; however, the segmentation stage does not preserve continuity of the speech samples. A second stage based on NMF therefore follows. The NMF algorithm separates a speech mixture more effectively when parts of the individual speech signals are known a priori, and this requirement is satisfied by the segmentation stage. NMF further separates the individual speech signals in the mixture while maintaining continuity of the speech samples over time. To further improve the accuracy of the separated signals, masking methods such as TFR (Time-Frequency Ratio), SM (Soft Mask), and HM (Hard Mask) are applied. The separation results are compared with those of other unsupervised algorithms, and the proposed hybrid model produces promising results in unsupervised single-channel speech separation. The model can be applied at the front end of any voice recognition platform to improve recognition efficiency.
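As a rough illustration of the second and third stages described above (a minimal sketch, not the authors' implementation), the snippet below learns NMF bases for each speaker from segmented magnitude spectrograms, decomposes the mixture over the concatenated bases, and applies a soft mask formed from the ratio of the two reconstructions. The Euclidean multiplicative-update rules, the rank, and the iteration counts are illustrative assumptions; the paper's exact NMF variant and mask definitions may differ.

```python
import numpy as np

def nmf(V, rank, n_iter=200, seed=0):
    """Factor a nonnegative matrix V ~= W @ H with Euclidean multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, rank)) + 1e-3
    H = rng.random((rank, n)) + 1e-3
    eps = 1e-9
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def separate(V_mix, W1, W2, n_iter=200):
    """Decompose the mixture over fixed per-speaker bases, then soft-mask it."""
    W = np.hstack([W1, W2])                       # concatenated dictionary
    rng = np.random.default_rng(1)
    H = rng.random((W.shape[1], V_mix.shape[1])) + 1e-3
    eps = 1e-9
    for _ in range(n_iter):                       # update activations only; bases stay fixed
        H *= (W.T @ V_mix) / (W.T @ W @ H + eps)
    r1 = W1 @ H[:W1.shape[1]]                     # speaker-1 reconstruction
    r2 = W2 @ H[W1.shape[1]:]                     # speaker-2 reconstruction
    mask1 = r1 / (r1 + r2 + eps)                  # soft mask from the reconstruction ratio
    return mask1 * V_mix, (1.0 - mask1) * V_mix
```

Because the two soft masks sum to one at every time-frequency point, the two estimates add back up to the mixture spectrogram; a hard mask would instead assign each point entirely to the dominant speaker.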

[Figures 1–15 appear in the full-text article.]



Funding

No funding was received for conducting this study.

Author information


Corresponding author

Correspondence to MK Prasanna Kumar.

Ethics declarations

Competing interests

The authors have no competing interests to declare that are relevant to the content of this article.

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Prasanna Kumar, M., Kumaraswamy, R. A hybrid model for unsupervised single channel speech separation. Multimed Tools Appl 83, 13241–13259 (2024). https://doi.org/10.1007/s11042-023-16108-z

