Audio Source Separation with Discriminative Scattering Networks

  • Pablo Sprechmann
  • Joan Bruna
  • Yann LeCun
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9237)


Many monaural signal decomposition techniques proposed in the literature operate on a feature space consisting of a time-frequency representation of the input data. A challenge faced by these approaches is to effectively exploit the temporal dependencies of the signals at scales larger than the duration of a time frame. In this work we propose to tackle this problem by modeling the signals using a time-frequency representation with multiple temporal resolutions. To this end, we use a signal representation that consists of a pyramid of wavelet scattering operators, which generalizes the Constant Q Transform (CQT) with extra layers of convolution and complex modulus. We first show that learning standard models in this multi-resolution setting improves source separation results over fixed-resolution methods. As a case study, we use Non-Negative Matrix Factorization (NMF), which has been widely used in many audio applications. We then investigate the inclusion of the proposed multi-resolution setting into a discriminative training regime. We discuss several alternatives using different deep neural network architectures, and our preliminary experiments suggest that, for this task, finite impulse, multi-resolution Convolutional Networks are a competitive baseline compared to recurrent alternatives.
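To make the NMF case study concrete, the following is a minimal sketch of supervised NMF separation on a generic non-negative time-frequency matrix. It is an illustration under stated assumptions, not the method of the paper: it uses a single fixed-resolution representation rather than scattering features, the Euclidean multiplicative updates of Lee and Seung rather than whatever divergence the authors chose, and the function names (`nmf`, `activations`, `separate`) are hypothetical.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (freq x time) as W @ H.

    Uses Euclidean multiplicative updates (an assumption made for this sketch).
    """
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_time)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def activations(V, W, n_iter=200, eps=1e-9, seed=0):
    """Fit only the activations H for a fixed dictionary W."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H

def separate(V_mix, W_a, W_b, n_iter=200, eps=1e-9):
    """Split a mixture using per-source dictionaries and a soft ratio mask."""
    W = np.hstack([W_a, W_b])          # concatenate the two dictionaries
    H = activations(V_mix, W, n_iter=n_iter, eps=eps)
    k = W_a.shape[1]
    V_a = W_a @ H[:k]                  # model of source A
    V_b = W_b @ H[k:]                  # model of source B
    mask_a = V_a / (V_a + V_b + eps)   # values in [0, 1)
    return mask_a * V_mix, (1.0 - mask_a) * V_mix

# Synthetic stand-ins for the training representations of two sources.
rng = np.random.default_rng(1)
train_a = rng.random((16, 40))
train_b = rng.random((16, 40))
W_a, _ = nmf(train_a, rank=4)
W_b, _ = nmf(train_b, rank=4)
est_a, est_b = separate(train_a + train_b, W_a, W_b)
```

Because the mask for the second source is the complement of the first, the two estimates always sum to the mixture. Each time frame is decomposed independently here, which is the limitation at scales longer than one frame that the abstract sets out to address.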


Source separation, Scattering, Non-negative matrix factorization, Deep learning



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Courant Institute of Mathematical Sciences, New York University, New York, USA
  2. Department of Statistics, University of California, Berkeley, USA
  3. Facebook AI Research, New York, USA
