Audio Source Separation with Discriminative Scattering Networks

Sprechmann, Pablo; Bruna, Joan; LeCun, Yann

doi:10.1007/978-3-319-22482-4_30

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9237)

Included in the following conference series:

International Conference on Latent Variable Analysis and Signal Separation

Abstract

Many monaural signal decomposition techniques proposed in the literature operate on a feature space consisting of a time-frequency representation of the input data. A challenge faced by these approaches is to effectively exploit the temporal dependencies of the signals at scales larger than the duration of a time frame. In this work, we propose to tackle this problem by modeling the signals using a time-frequency representation with multiple temporal resolutions. To this end, we use a signal representation that consists of a pyramid of wavelet scattering operators, which generalizes Constant Q Transforms (CQT) with extra layers of convolution and complex modulus. We first show that learning standard models in this multi-resolution setting improves source separation results over fixed-resolution methods. As a case study, we use Non-Negative Matrix Factorization (NMF), which has been widely used in many audio applications. We then investigate the inclusion of the proposed multi-resolution setting in a discriminative training regime. We discuss several alternatives using different deep neural network architectures, and our preliminary experiments suggest that, in this task, finite impulse, multi-resolution Convolutional Networks are a competitive baseline compared to recurrent alternatives.
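
To make the NMF case study concrete, the following is a minimal sketch of generic supervised NMF separation. It is not the authors' implementation: it uses the standard Euclidean multiplicative updates, synthetic non-negative matrices standing in for magnitude features (not scattering coefficients), and the function names `nmf` and `separate`, the rank, and the matrix sizes are all illustrative assumptions.

```python
import numpy as np

def nmf(V, rank=None, W=None, n_iter=200, seed=0, eps=1e-9):
    """Euclidean NMF (V ~ W @ H) via multiplicative updates.

    If W is given it is kept fixed and only the activations H are estimated.
    """
    rng = np.random.default_rng(seed)
    learn_W = W is None
    if learn_W:
        W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        if learn_W:
            W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def separate(V_mix, W1, W2, n_iter=200, eps=1e-9):
    """Explain the mixture with the stacked dictionaries, then soft-mask it."""
    W = np.hstack([W1, W2])
    _, H = nmf(V_mix, W=W, n_iter=n_iter)
    k1 = W1.shape[1]
    V1_hat = W1 @ H[:k1]          # part of the mixture explained by source 1
    V2_hat = W2 @ H[k1:]          # part of the mixture explained by source 2
    mask1 = V1_hat / (V1_hat + V2_hat + eps)
    return mask1 * V_mix, (1.0 - mask1) * V_mix

# Synthetic stand-ins for per-source training features (assumption: 64 bins).
rng = np.random.default_rng(0)
A1, A2 = rng.random((64, 4)), rng.random((64, 4))
train1 = A1 @ rng.random((4, 300))
train2 = A2 @ rng.random((4, 300))

# One dictionary per source, learned on that source's training data only.
W1, _ = nmf(train1, rank=4, seed=1)
W2, _ = nmf(train2, rank=4, seed=2)

# Mix two unseen signals and separate them.
S1 = A1 @ rng.random((4, 50))
S2 = A2 @ rng.random((4, 50))
V_mix = S1 + S2
E1, E2 = separate(V_mix, W1, W2)
```

Because the two estimates come from complementary soft masks, they are non-negative and sum back to the mixture. In this sketch each frame is decomposed independently, which is the limitation the abstract points to when it mentions temporal dependencies beyond a single time frame.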

Author information

Authors and Affiliations

Courant Institute of Mathematical Sciences, New York University, New York, USA
Pablo Sprechmann & Yann LeCun
Department of Statistics, University of California, Berkeley, USA
Joan Bruna
Facebook AI Research, New York, USA
Yann LeCun

Corresponding author

Correspondence to Pablo Sprechmann.

Editor information

Editors and Affiliations

Inria, Villers-les-Nancy, France
Emmanuel Vincent
Tel Aviv University, Tel-Aviv, Israel
Arie Yeredor
Technical University of Liberec, Liberec, Czech Republic
Zbyněk Koldovský
The Czech Academy of Sciences, Prague, Czech Republic
Petr Tichavský

About this paper

Cite this paper

Sprechmann, P., Bruna, J., LeCun, Y. (2015). Audio Source Separation with Discriminative Scattering Networks. In: Vincent, E., Yeredor, A., Koldovský, Z., Tichavský, P. (eds) Latent Variable Analysis and Signal Separation. LVA/ICA 2015. Lecture Notes in Computer Science, vol 9237. Springer, Cham. https://doi.org/10.1007/978-3-319-22482-4_30

DOI: https://doi.org/10.1007/978-3-319-22482-4_30
Published: 15 August 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22481-7
Online ISBN: 978-3-319-22482-4
eBook Packages: Computer Science, Computer Science (R0)
