Deep Neural Network Based Multichannel Audio Source Separation

  • Aditya Arie Nugraha
  • Antoine Liutkus
  • Emmanuel Vincent
Chapter
Part of the Signals and Communication Technology book series (SCT)

Abstract

This chapter presents a multichannel audio source separation framework in which deep neural networks (DNNs) model the source spectra and are combined with the classical multichannel Gaussian model to exploit the spatial information. The parameters are estimated in an iterative expectation-maximization (EM) fashion and used to derive a multichannel Wiener filter. Different design choices and their impact on performance are discussed, including the cost functions for DNN training, the number of parameter updates, the use of multiple DNNs, and the use of weighted parameter updates. Finally, we present the application of this framework to a speech enhancement task and a music separation task. The experimental results show the benefit of the multichannel DNN-based approach over a single-channel DNN-based approach and over a multichannel nonnegative matrix factorization (NMF) based iterative EM framework.
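
As a concrete illustration of the filtering step described above, the sketch below (hypothetical NumPy code under assumed array shapes, not the authors' implementation) applies the multichannel Wiener filter W_j(f,n) = v_j(f,n) R_j(f) [sum_k v_k(f,n) R_k(f)]^{-1} to each time-frequency bin, given source power spectra v_j(f,n), such as DNN outputs, and spatial covariance matrices R_j(f). In the full framework these parameters would first be refined over the EM iterations.

    import numpy as np

    def multichannel_wiener_filter(x, v, R, eps=1e-9):
        """Multichannel Wiener filtering under the multichannel Gaussian model.

        Illustrative conventions (assumptions, not the chapter's notation):
          x : (F, N, I) complex STFT of the I-channel mixture
          v : (J, F, N) nonnegative power spectra of the J sources (e.g. DNN outputs)
          R : (J, F, I, I) spatial covariance matrix of each source per frequency bin
        Returns the (J, F, N, I) estimated spatial images of the sources.
        """
        J, F, N = v.shape
        I = x.shape[-1]
        c = np.zeros((J, F, N, I), dtype=complex)
        for f in range(F):
            for n in range(N):
                # Mixture covariance: sum_j v_j(f,n) R_j(f), slightly regularized for inversion
                Rx = np.tensordot(v[:, f, n], R[:, f], axes=(0, 0)) + eps * np.eye(I)
                Rx_inv = np.linalg.inv(Rx)
                for j in range(J):
                    # Wiener gain for source j, applied to the mixture frame
                    W = v[j, f, n] * R[j, f] @ Rx_inv
                    c[j, f, n] = W @ x[f, n]
        return c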

Acknowledgements

The authors would like to thank the developers of Theano [74] and Kaldi [75]. Experiments presented in this article were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000.fr).

References

  1. S. Makino, H. Sawada, T.-W. Lee (eds.), Blind Speech Separation (Springer, Dordrecht, The Netherlands, 2007)
  2. M. Wölfel, J. McDonough, Distant Speech Recognition (Wiley, Chichester, West Sussex, UK, 2009)
  3. T. Virtanen, R. Singh, B. Raj (eds.), Techniques for Noise Robustness in Automatic Speech Recognition (Wiley, Chichester, West Sussex, UK, 2012)
  4. G.R. Naik, W. Wang (eds.), Blind Source Separation: Advances in Theory, Algorithms and Applications (Springer, Berlin, Germany, 2014)
  5. E. Vincent, N. Bertin, R. Gribonval, F. Bimbot, From blind to guided audio source separation: how models and side information can improve the separation of sound. IEEE Signal Process. Mag. 31(3), 107–115 (2014)
  6. L. Deng, D. Yu, Deep Learning: Methods and Applications, Found. Trends Signal Process., vol. 7, nos. 3–4 (Now Publishers Inc., Hanover, MA, USA, 2014)
  7. G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, B. Kingsbury, Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29(6), 82–97 (2012)
  8. F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J.R. Hershey, B. Schuller, Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR, in Proceedings of the International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA), Liberec, Czech Republic (2015)
  9. J. Chen, Y. Wang, D. Wang, A feature study for classification-based speech separation at low signal-to-noise ratios. IEEE/ACM Trans. Audio Speech Lang. Process. 22(12), 1993–2002 (2014)
  10. Y. Tu, J. Du, Y. Xu, L. Dai, C.-H. Lee, Speech separation based on improved deep neural networks with dual outputs of speech features for both target and interfering speakers, in Proceedings of the International Symposium on Chinese Spoken Language Processing (ISCSLP), Singapore (2014), pp. 250–254
  11. P.-S. Huang, M. Kim, M. Hasegawa-Johnson, P. Smaragdis, Singing-voice separation from monaural recordings using deep recurrent neural networks, in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Taipei, Taiwan (2014), pp. 477–482
  12. P.-S. Huang, M. Kim, M. Hasegawa-Johnson, P. Smaragdis, Deep learning for monaural speech separation, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy (2014), pp. 1562–1566
  13. P.-S. Huang, M. Kim, M. Hasegawa-Johnson, P. Smaragdis, Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Trans. Audio Speech Lang. Process. 23(12), 2136–2147 (2015)
  14. S. Araki, T. Hayashi, M. Delcroix, M. Fujimoto, K. Takeda, T. Nakatani, Exploring multi-channel features for denoising-autoencoder-based speech enhancement, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia (2015), pp. 2135–2139
  15. S. Uhlich, F. Giron, Y. Mitsufuji, Deep neural network based instrument extraction from music, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia (2015), pp. 2135–2139
  16. Y. Wang, D. Wang, Towards scaling up classification-based speech separation. IEEE Trans. Audio Speech Lang. Process. 21(7), 1381–1390 (2013)
  17. A. Narayanan, D. Wang, Ideal ratio mask estimation using deep neural networks for robust speech recognition, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada (2013), pp. 7092–7096
  18. Y. Jiang, D. Wang, R. Liu, Z. Feng, Binaural classification for reverberant speech segregation using deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 22(12), 2112–2121 (2014)
  19. F. Weninger, J. Le Roux, J.R. Hershey, B. Schuller, Discriminatively trained recurrent neural networks for single-channel speech separation, in Proceedings of the IEEE Global Conference on Signal and Information Processing (GlobalSIP), Atlanta, GA, USA (2014), pp. 577–581
  20. A. Narayanan, D. Wang, Improving robustness of deep neural network acoustic models via speech separation and joint adaptive training. IEEE/ACM Trans. Audio Speech Lang. Process. 23(1), 92–101 (2015)
  21. Y. Wang, D. Wang, A deep neural network for time-domain signal reconstruction, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia (2015), pp. 4390–4394
  22. D.S. Williamson, Y. Wang, D. Wang, Complex ratio masking for monaural speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 24(3), 483–492 (2016)
  23. H. Erdogan, J.R. Hershey, S. Watanabe, J. Le Roux, Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia (2015), pp. 708–712
  24. X. Xiao, S. Watanabe, H. Erdogan, L. Lu, J. Hershey, M. Seltzer, G. Chen, Y. Zhang, M. Mandel, D. Yu, Deep beamforming networks for multi-channel speech recognition, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2016), pp. 5745–5749
  25. J. Heymann, L. Drude, R. Haeb-Umbach, Neural network based spectral mask estimation for acoustic beamforming, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2016), pp. 196–200
  26. A.A. Nugraha, A. Liutkus, E. Vincent, Multichannel audio source separation with deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 24(9), 1652–1664 (2016)
  27. N.Q.K. Duong, E. Vincent, R. Gribonval, Under-determined reverberant audio source separation using a full-rank spatial covariance model. IEEE Trans. Audio Speech Lang. Process. 18(7), 1830–1840 (2010)
  28. N.Q.K. Duong, H. Tachibana, E. Vincent, N. Ono, R. Gribonval, S. Sagayama, Multichannel harmonic and percussive component separation by joint modeling of spatial and spectral continuity, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic (2011), pp. 205–208
  29. A. Ozerov, E. Vincent, F. Bimbot, A general flexible framework for the handling of prior information in audio source separation. IEEE Trans. Audio Speech Lang. Process. 20(4), 1118–1133 (2012)
  30. T. Gerber, M. Dutasta, L. Girin, C. Févotte, Professionally-produced music separation guided by covers, in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Porto, Portugal (2012), pp. 85–90
  31. M. Togami, Y. Kawaguchi, Simultaneous optimization of acoustic echo reduction, speech dereverberation, and noise reduction against mutual interference. IEEE/ACM Trans. Audio Speech Lang. Process. 22(11), 1612–1623 (2014)
  32. A. Liutkus, D. Fitzgerald, Z. Rafii, B. Pardo, L. Daudet, Kernel additive models for source separation. IEEE Trans. Signal Process. 62(16), 4298–4310 (2014)
  33. A. Liutkus, D. Fitzgerald, Z. Rafii, Scalable audio separation with light kernel additive modelling, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia (2015), pp. 76–80
  34. S. Sivasankaran, A.A. Nugraha, E. Vincent, J.A. Morales-Cordovilla, S. Dalmia, I. Illina, A. Liutkus, Robust ASR using neural network based speech enhancement and feature simulation, in Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Scottsdale, AZ, USA (2015), pp. 482–489
  35. A.A. Nugraha, A. Liutkus, E. Vincent, Multichannel music separation with deep neural networks, in Proceedings of the European Signal Processing Conference (EUSIPCO), Budapest, Hungary (2016), pp. 1748–1752
  36. J.O. Smith, Spectral Audio Signal Processing (W3K Publishing, 2011)
  37. E. Vincent, M.G. Jafari, S.A. Abdallah, M.D. Plumbley, M.E. Davies, Probabilistic modeling paradigms for audio source separation, in Machine Audition: Principles, Algorithms and Systems, ed. by W. Wang (IGI Global, Hershey, PA, USA, 2011), pp. 162–185 (ch. 7)
  38. D. Liu, P. Smaragdis, M. Kim, Experiments on deep learning for speech denoising, in Proceedings of ISCA INTERSPEECH, Singapore (2014), pp. 2685–2688
  39. Y. Xu, J. Du, L.-R. Dai, C.-H. Lee, An experimental study on speech enhancement based on deep neural networks. IEEE Signal Process. Lett. 21(1), 65–68 (2014)
  40. C. Févotte, N. Bertin, J.-L. Durrieu, Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis. Neural Comput. 21(3), 793–830 (2009)
  41. N. Bertin, C. Févotte, R. Badeau, A tempering approach for Itakura-Saito non-negative matrix factorization, with application to music transcription, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Taipei, Taiwan (2009), pp. 1545–1548
  42. A. Lefèvre, F. Bach, C. Févotte, Online algorithms for nonnegative matrix factorization with the Itakura-Saito divergence, in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA (2011), pp. 313–316
  43. C. Févotte, A. Ozerov, Notes on nonnegative tensor factorization of the spectrogram for audio source separation: statistical insights and towards self-clustering of the spatial cues, in Proceedings of the International Symposium on Computer Music Modeling and Retrieval, Málaga, Spain (2010), pp. 102–115
  44. A. Liutkus, D. Fitzgerald, R. Badeau, Cauchy nonnegative matrix factorization, in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA (2015), pp. 1–5
  45. J. McDonough, K. Kumatani, Microphone arrays, in Techniques for Noise Robustness in Automatic Speech Recognition, ed. by T. Virtanen, R. Singh, B. Raj (Wiley, Chichester, West Sussex, UK, 2012) (ch. 6)
  46. K. Kumatani, J. McDonough, B. Raj, Microphone array processing for distant speech recognition: from close-talking microphones to far-field sensors. IEEE Signal Process. Mag. 29(6), 127–140 (2012)
  47. J. Barker, R. Marxer, E. Vincent, S. Watanabe, The third 'CHiME' speech separation and recognition challenge: dataset, task and baselines, in Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Scottsdale, AZ, USA (2015), pp. 504–511
  48. A. Liutkus, F.-R. Stöter, Z. Rafii, D. Kitamura, B. Rivet, N. Ito, N. Ono, J. Fontecave, The 2016 signal separation evaluation campaign, in Proceedings of the International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA), Grenoble, France (2017), pp. 323–332
  49. X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier networks, in Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, FL, USA, vol. 15 (2011), pp. 315–323
  50. A.A. Nugraha, K. Yamamoto, S. Nakagawa, Single-channel dereverberation by feature mapping using cascade neural networks for robust distant speaker identification and speech recognition. EURASIP J. Audio Speech Music Process. 2014(13) (2014)
  51. X. Jaureguiberry, E. Vincent, G. Richard, Fusion methods for speech enhancement and audio source separation. IEEE/ACM Trans. Audio Speech Lang. Process. 24(7), 1266–1279 (2016)
  52. Y. Bengio, Practical recommendations for gradient-based training of deep architectures, in Neural Networks: Tricks of the Trade, ed. by G. Montavon, G. Orr, K.-R. Müller, Lecture Notes in Computer Science, vol. 7700 (Springer, Berlin, Germany, 2012), pp. 437–478 (ch. 19)
  53. P. Sprechmann, A.M. Bronstein, G. Sapiro, Supervised non-negative matrix factorization for audio source separation, in Excursions in Harmonic Analysis, ed. by R. Balan, M. Begué, J.J. Benedetto, W. Czaja, K.A. Okoudjou, vol. 4 (Springer, Switzerland, 2015), pp. 407–420
  54. K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: surpassing human-level performance on ImageNet classification (2015), arXiv e-prints http://arXiv.org/abs/1502.01852
  55. Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, Greedy layer-wise training of deep networks, in Proceedings of the Conference on Neural Information Processing Systems (NIPS), Vancouver, Canada (2006), pp. 153–160
  56. M.D. Zeiler, ADADELTA: an adaptive learning rate method (2012), arXiv e-prints http://arXiv.org/abs/1212.5701
  57. J. Garofalo, D. Graff, D. Paul, D. Pallett, CSR-I (WSJ0) Complete (Linguistic Data Consortium, Philadelphia, 2007)
  58. E. Vincent, R. Gribonval, C. Févotte, Performance measurement in blind audio source separation. IEEE Trans. Audio Speech Lang. Process. 14(4), 1462–1469 (2006)
  59. B. Loesch, B. Yang, Adaptive segmentation and separation of determined convolutive mixtures under dynamic conditions, in Proceedings of the International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA), Saint-Malo, France (2010), pp. 41–48
  60. C. Blandin, A. Ozerov, E. Vincent, Multi-source TDOA estimation in reverberant audio using angular spectra and clustering. Signal Process. 92(8), 1950–1960 (2012)
  61. Y. Salaün, E. Vincent, N. Bertin, N. Souviraà-Labastie, X. Jaureguiberry, D.T. Tran, F. Bimbot, The Flexible Audio Source Separation Toolbox version 2.0, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy (2014), Show & Tell, https://hal.inria.fr/hal-00957412
  62. T. Hori, Z. Chen, H. Erdogan, J.R. Hershey, J. Le Roux, V. Mitra, S. Watanabe, The MERL/SRI system for the 3rd CHiME challenge using beamforming, robust feature extraction, and advanced speech recognition, in Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Scottsdale, AZ, USA (2015), pp. 475–481
  63. M.J.F. Gales, Maximum likelihood linear transformations for HMM-based speech recognition. Comput. Speech Lang. 12(2), 75–98 (1998)
  64. K. Veselý, A. Ghoshal, L. Burget, D. Povey, Sequence-discriminative training of deep neural networks, in Proceedings of ISCA INTERSPEECH, Lyon, France (2013), pp. 2345–2349
  65. R. Kneser, H. Ney, Improved backing-off for M-gram language modeling, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Detroit, MI, USA, vol. 1 (1995), pp. 181–184
  66. T. Mikolov, M. Karafiát, L. Burget, J. Černocký, S. Khudanpur, Recurrent neural network based language model, in Proceedings of ISCA INTERSPEECH, Chiba, Japan (2010), pp. 1045–1048
  67. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014)
  68. N. Ono, D. Kitamura, Z. Rafii, N. Ito, A. Liutkus, The 2015 signal separation evaluation campaign (SiSEC2015), in Proceedings of the International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA), Liberec, Czech Republic (2015)
  69. J.-L. Durrieu, B. David, G. Richard, A musically motivated mid-level representation for pitch estimation and musical audio source separation. IEEE J. Sel. Top. Signal Process. 5(6), 1180–1191 (2011)
  70. P.-S. Huang, S.D. Chen, P. Smaragdis, M. Hasegawa-Johnson, Singing-voice separation from monaural recordings using robust principal component analysis, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan (2012), pp. 57–60
  71. Z. Rafii, B. Pardo, Repeating pattern extraction technique (REPET): a simple method for music/voice separation. IEEE Trans. Audio Speech Lang. Process. 21(1), 73–84 (2013)
  72. A. Liutkus, Z. Rafii, R. Badeau, B. Pardo, G. Richard, Adaptive filtering for music/voice separation exploiting the repeating musical structure, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan (2012), pp. 53–56
  73. Z. Rafii, B. Pardo, Music/voice separation using the similarity matrix, in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Porto, Portugal (2012), pp. 583–588
  74. Theano Development Team, Theano: a Python framework for fast computation of mathematical expressions (2016), arXiv e-prints http://arXiv.org/abs/1605.02688
  75. D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, K. Vesely, The Kaldi speech recognition toolkit, in Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Hawaii, USA (2011)

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • Aditya Arie Nugraha (1)
  • Antoine Liutkus (2)
  • Emmanuel Vincent (1)
  1. Inria Nancy, Villers-lès-Nancy, France
  2. Inria Sophia Antipolis, Montpellier, France
