A review on speech separation in cocktail party environment: challenges and approaches

Multimedia Tools and Applications

Abstract

The cocktail party problem, that is, tracking and identifying a specific speaker's speech while numerous speakers talk concurrently, is one of the crucial problems still to be addressed for automatic speech recognition (ASR) and speaker recognition. In this study, we thoroughly explore traditional methods for speech separation in a cocktail party environment. We analyze traditional single-channel methods, for example source-driven methods such as Computational Auditory Scene Analysis (CASA), data-driven methods such as non-negative matrix factorization (NMF), and model-driven methods; customary multi-channel methods such as beamforming and multi-channel blind source separation; and newly developed deep learning approaches such as meta-learning-based methods and self-supervised learning. The paper also highlights numerous datasets and evaluation metrics in the domain of speech processing and compares traditional methods with deep-learning-based methods for speech separation. This study provides a basic understanding and comprehensive knowledge of state-of-the-art research in the area of speech separation and serves as a brief overview for new researchers.

Data availability

The authors confirm that data sharing is not applicable to this article, as no datasets were generated during the current study.

Author information

Corresponding author

Correspondence to Jharna Agrawal.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Agrawal, J., Gupta, M. & Garg, H. A review on speech separation in cocktail party environment: challenges and approaches. Multimed Tools Appl 82, 31035–31067 (2023). https://doi.org/10.1007/s11042-023-14649-x

  • DOI: https://doi.org/10.1007/s11042-023-14649-x
