Improving Reverberant Speech Separation with Binaural Cues Using Temporal Context and Convolutional Neural Networks

  • Alfredo Zermini
  • Qiuqiang Kong
  • Yong Xu
  • Mark D. Plumbley
  • Wenwu Wang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10891)


Given binaural features as input, such as the interaural level difference (ILD) and interaural phase difference (IPD), Deep Neural Networks (DNNs) have recently been used to localize sound sources in a mixture of speech signals and/or noise, and to create time-frequency masks for estimating the sound sources in reverberant rooms. Here, we explore a more advanced system in which the feed-forward DNNs are replaced by Convolutional Neural Networks (CNNs). In addition, the frames adjacent to each time frame (those occurring before and after it) are used to exploit contextual information, thus improving the localization and separation of each source. The quality of the separation results is evaluated in terms of the Signal-to-Distortion Ratio (SDR).
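
To make the two input ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation, of how interaural level difference and interaural phase difference could be computed from left and right short-time Fourier transforms, and how each frame could be stacked with its neighbouring frames to provide temporal context. The function names, the left-over-right sign convention, the `eps` regulariser, the edge padding, and the context size are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def binaural_cues(left_stft, right_stft, eps=1e-8):
    """Return (ILD in dB, IPD in radians) for each time-frequency bin.

    Both inputs are complex STFTs of shape (frames, freq_bins).
    The left/right ratio convention and `eps` are assumptions.
    """
    # Level difference: ratio of magnitudes, expressed in decibels.
    ild = 20.0 * np.log10((np.abs(left_stft) + eps) / (np.abs(right_stft) + eps))
    # Phase difference: angle of the cross-spectrum, wrapped to (-pi, pi].
    ipd = np.angle(left_stft * np.conj(right_stft))
    return ild, ipd


def stack_context(features, context):
    """Concatenate each frame with `context` frames before and after it.

    features: (frames, dims) -> (frames, (2 * context + 1) * dims).
    Edge frames are padded by repeating the first/last frame (an assumption).
    """
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    n_frames = features.shape[0]
    # Slice i holds, for every frame t, the frame at offset (i - context).
    windows = [padded[i:i + n_frames] for i in range(2 * context + 1)]
    return np.concatenate(windows, axis=1)
```

With `context=1`, for example, a feature matrix of shape (frames, dims) becomes (frames, 3 * dims), and the middle third of each row is the original frame. A CNN would more likely consume the context as a 2-D patch rather than a flat vector; the flat layout is used here only to keep the sketch short.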


Keywords: Convolutional Neural Networks, Binaural cues, Reverberant rooms, Speech separation, Contextual information



The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7-PEOPLE-2013-ITN) under grant agreement no. 607290 (SpaRTaN).



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Alfredo Zermini (1)
  • Qiuqiang Kong (1)
  • Yong Xu (1)
  • Mark D. Plumbley (1)
  • Wenwu Wang (1)

  1. Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK
