Novel Deep Architectures in Speech Processing

Chapter in: New Era for Robust Speech Recognition

Abstract

Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model. In addition, unsupervised inference tasks such as adaptation and clustering are handled in a natural way. However, these benefits typically come at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, and discriminative training is relatively easy. However, their typically generic architectures often make it unclear how to incorporate specific problem knowledge or to perform flexible tasks such as unsupervised inference. This chapter introduces frameworks to provide the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and reinterpret inference iterations as layers in a deep network, while generalizing the parametrization to create a more powerful network. We show how such frameworks yield new understanding of conventional networks, and how they can result in novel networks for speech processing, including networks based on nonnegative matrix factorization, complex Gaussian microphone array signal processing, and a network inspired by efficient spectral clustering. We then discuss what has been learned in recent work and provide a prospectus for future research in this area.
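The unfolding idea described above can be made concrete with the NMF case: each multiplicative-update iteration of the inference algorithm becomes one layer of a network, and untying the basis matrix across layers yields trainable per-layer parameters. The sketch below is a minimal illustration of that construction, not the chapter's exact Deep NMF formulation; the function name, the Euclidean-style update, and the flat initialization are assumptions for illustration, and the discriminative training, divergence choice, and sparsity terms discussed in the chapter are omitted.

```python
import numpy as np

def unfolded_nmf(M, Ws, eps=1e-8):
    """Unfold K multiplicative-update iterations of NMF as network layers.

    M  : (F, T) nonnegative mixture spectrogram.
    Ws : list of K per-layer basis matrices, each (F, R). Untying them
         across layers turns the fixed iterative algorithm into a deep
         network whose per-layer "weights" could be trained further.
    Returns the activation matrix H of shape (R, T).
    """
    R = Ws[0].shape[1]
    T = M.shape[1]
    H = np.ones((R, T))  # flat nonnegative initialization
    for W in Ws:  # each inference iteration is reinterpreted as one layer
        V = W @ H  # current nonnegative reconstruction of the mixture
        H = H * (W.T @ M) / (W.T @ V + eps)  # multiplicative update
    return H

# Toy usage: a 3-layer unfolded network on random nonnegative data.
rng = np.random.default_rng(0)
M = np.abs(rng.standard_normal((16, 20)))
Ws = [np.abs(rng.standard_normal((16, 4))) for _ in range(3)]
H = unfolded_nmf(M, Ws)
```

Because every operation is differentiable, gradients can flow back through all K layers, which is what allows the generalized parametrization to be trained discriminatively rather than fixed by the generative model.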


Notes

  1. Indices k in superscript always refer to the iteration index (and similarly for l, defined later as the source index).


Author information

Corresponding author: John R. Hershey.


Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Hershey, J.R., Le Roux, J., Watanabe, S., Wisdom, S., Chen, Z., Isik, Y. (2017). Novel Deep Architectures in Speech Processing. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_6

  • Publisher: Springer, Cham

  • Print ISBN: 978-3-319-64679-4

  • Online ISBN: 978-3-319-64680-0
