Abstract
The performance of any voice recognition platform in a real environment depends on how well the desired speech signal is separated from unwanted signals such as background noise or background speakers. In this paper, we propose a three-stage hybrid model to separate two speakers from a single-channel speech mixture under unsupervised conditions. The proposed method combines three techniques: speech segmentation, nonnegative matrix factorization (NMF), and masking. Speech segmentation groups the short speech frames belonging to individual speakers by identifying speaker changeover points, but the resulting groups lack continuity of the speech samples over time. A second stage based on NMF addresses this shortcoming: the NMF algorithm separates a speech mixture more effectively when parts of the individual speech signals are known a priori, and this requirement is satisfied by the segmentation stage. NMF then separates the individual speech signals in the mixture while maintaining the continuity of speech samples over time. To further improve the accuracy of the separated speech signals, masking methods, namely the time-frequency ratio (TFR), soft mask (SM), and hard mask (HM), are applied. The separation results are compared with those of other unsupervised algorithms. The proposed hybrid model produces promising results for unsupervised single-channel speech separation, and it can be applied at the front end of any voice recognition platform to further improve recognition accuracy.
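The NMF and soft-masking stages described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it operates on magnitude spectrograms, uses basic Euclidean multiplicative updates, takes the per-speaker basis matrices (which the paper obtains from the segmentation stage) as given, and shows only the soft-mask (Wiener-style) variant of the masking step. All function and variable names are illustrative.

```python
import numpy as np

def nmf(V, rank, n_iter=200, seed=0):
    """Basic multiplicative-update NMF (Euclidean cost): V ~= W @ H."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + 1e-6
    H = rng.random((rank, m)) + 1e-6
    for _ in range(n_iter):
        # Standard Lee-Seung multiplicative updates keep W, H nonnegative.
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

def separate(V_mix, W1, W2, n_iter=200, seed=1):
    """Separate a mixture magnitude spectrogram V_mix given fixed
    speaker bases W1, W2 (e.g. learned from segmented frames),
    then refine with a soft (Wiener-style) mask."""
    W = np.hstack([W1, W2])           # concatenated speaker dictionaries
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V_mix.shape[1])) + 1e-6
    for _ in range(n_iter):
        # Only the activations H are updated; the bases stay fixed.
        H *= (W.T @ V_mix) / (W.T @ W @ H + 1e-9)
    k = W1.shape[1]
    V1_hat = W1 @ H[:k]               # speaker-1 spectrogram estimate
    V2_hat = W2 @ H[k:]               # speaker-2 spectrogram estimate
    mask1 = V1_hat / (V1_hat + V2_hat + 1e-9)   # soft mask (SM)
    return mask1 * V_mix, (1.0 - mask1) * V_mix
```

Because the two soft masks sum to one at every time-frequency bin, the two estimates add back up to the original mixture spectrogram; a hard mask would instead assign each bin entirely to the dominant speaker.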
Funding
No funding was received for conducting this study.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Competing interests
The authors have no competing interests to declare that are relevant to the content of this article.
Cite this article
Prasanna Kumar, M., Kumaraswamy, R. A hybrid model for unsupervised single channel speech separation. Multimed Tools Appl 83, 13241–13259 (2024). https://doi.org/10.1007/s11042-023-16108-z