Speech Separation Using Deep Learning

  • P. Nandal
Conference paper
Part of the Lecture Notes on Data Engineering and Communications Technologies book series (LNDECT, volume 39)


Humans communicate primarily through speech. The target speech, that is, the speech of interest, is degraded by reverberation from surface reflections and by noise from additional sound sources. Speech separation means separating the voices of different speakers, or separating noise (background interference) from the original audio signal. It is useful in a wide range of applications, but building an automatic system for it is extremely challenging. Information about the speaker, or the sound source, and about the background noise is learned by training the machine on varied data using supervised machine learning. The work presented here is divided into three parts: mixing the audio files; applying the algorithm to isolate the individual audio signals and remove noise from them; and representing the isolated output as graphs, after which an attempt is made to convert the graphs back into audio signals.
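The three stages named in the abstract (mix, separate, inspect) can be illustrated with a toy two-channel sketch. It is not the paper's implementation: the keywords point to ICA with JADE, whereas the code below swaps in a much simpler stand-in that whitens the two mixtures and then grid-searches a single rotation angle to maximize the sum of squared excess kurtosis, a fourth-order contrast in the same spirit. The synthetic sources, the mixing matrix `A`, and every function name are assumptions made for illustration, and correlations are printed in place of graphs.

```python
import math

def mix(sources, A):
    """Mix sources sample by sample: x_i[t] = sum_j A[i][j] * s_j[t]."""
    n = len(sources[0])
    return [[sum(a * s[t] for a, s in zip(row, sources)) for t in range(n)]
            for row in A]

def center(x):
    """Subtract the mean from a signal."""
    m = sum(x) / len(x)
    return [v - m for v in x]

def whiten2(x1, x2):
    """Decorrelate two centered channels and scale each to unit variance."""
    n = len(x1)
    a = sum(u * u for u in x1) / n
    b = sum(u * v for u, v in zip(x1, x2)) / n
    d = sum(v * v for v in x2) / n
    # Principal-axis angle of the 2x2 covariance matrix [[a, b], [b, d]].
    phi = 0.5 * math.atan2(2.0 * b, a - d)
    c, s = math.cos(phi), math.sin(phi)
    p1 = [c * u + s * v for u, v in zip(x1, x2)]
    p2 = [-s * u + c * v for u, v in zip(x1, x2)]
    sd1 = math.sqrt(sum(v * v for v in p1) / n)
    sd2 = math.sqrt(sum(v * v for v in p2) / n)
    return [v / sd1 for v in p1], [v / sd2 for v in p2]

def kurtosis(y):
    """Excess kurtosis of a zero-mean signal."""
    n = len(y)
    m2 = sum(v * v for v in y) / n
    m4 = sum(v ** 4 for v in y) / n
    return m4 / (m2 * m2) - 3.0

def separate2(x1, x2, steps=180):
    """Whiten, then pick the rotation that maximizes squared kurtosis."""
    z1, z2 = whiten2(center(x1), center(x2))
    best_score, best = -1.0, None
    for k in range(steps):
        th = 0.5 * math.pi * k / steps  # angles in [0, pi/2) suffice
        c, s = math.cos(th), math.sin(th)
        y1 = [c * u + s * v for u, v in zip(z1, z2)]
        y2 = [-s * u + c * v for u, v in zip(z1, z2)]
        score = kurtosis(y1) ** 2 + kurtosis(y2) ** 2
        if score > best_score:
            best_score, best = score, (y1, y2)
    return best

def corr(a, b):
    """Pearson correlation between two signals."""
    a, b = center(a), center(b)
    num = sum(u * v for u, v in zip(a, b))
    return num / math.sqrt(sum(u * u for u in a) * sum(v * v for v in b))

# Step 1: mix two synthetic stand-ins for audio (a sine and a sawtooth).
n = 2000
s1 = [math.sin(2.0 * math.pi * 7 * t / n) for t in range(n)]
s2 = [2.0 * ((13 * t / n) % 1.0) - 1.0 for t in range(n)]
A = [[1.0, 0.6], [0.4, 1.0]]  # hypothetical mixing matrix
x1, x2 = mix([s1, s2], A)

# Step 2: separate the mixtures.
y1, y2 = separate2(x1, x2)

# Step 3: inspect the result; each output should match one source closely.
for name, y in (("y1", y1), ("y2", y2)):
    print(name, [round(abs(corr(y, s)), 3) for s in (s1, s2)])
```

Like any ICA-style method, this recovers the sources only up to ordering, sign, and scale, which is why the check compares absolute correlations rather than raw waveforms. Real recordings would also involve reverberation and more than two sources, which this sketch does not model.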


Keywords: Speech separation, ICA, JADE



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. MSIT, GGSIP University, Delhi, India
