Novel Deep Architectures in Speech Processing

Chapter in: New Era for Robust Speech Recognition

Abstract

Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model. In addition, unsupervised inference tasks such as adaptation and clustering are handled in a natural way. However, these benefits typically come at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, and discriminative training is relatively easy. However, their typically generic architectures often make it unclear how to incorporate specific problem knowledge or to perform flexible tasks such as unsupervised inference. This chapter introduces frameworks to provide the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and reinterpret inference iterations as layers in a deep network, while generalizing the parametrization to create a more powerful network. We show how such frameworks yield new understanding of conventional networks, and how they can result in novel networks for speech processing, including networks based on nonnegative matrix factorization, complex Gaussian microphone array signal processing, and a network inspired by efficient spectral clustering. We then discuss what has been learned in recent work and provide a prospectus for future research in this area.
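The unfolding idea described above can be made concrete with the NMF case: each multiplicative-update iteration of the inference algorithm becomes one layer of a network, and untying the basis matrix across layers yields trainable per-layer parameters. The sketch below is a minimal illustration of that construction, not the chapter's exact Deep NMF formulation; the function name, the Euclidean-style update, and the flat initialization are assumptions for illustration, and the discriminative training, divergence choice, and sparsity terms discussed in the chapter are omitted.

```python
import numpy as np

def unfolded_nmf(M, Ws, eps=1e-8):
    """Unfold K multiplicative-update iterations of NMF as network layers.

    M  : (F, T) nonnegative mixture spectrogram.
    Ws : list of K per-layer basis matrices, each (F, R). Untying them
         across layers turns the fixed iterative algorithm into a deep
         network whose per-layer "weights" could be trained further.
    Returns the activation matrix H of shape (R, T).
    """
    R = Ws[0].shape[1]
    T = M.shape[1]
    H = np.ones((R, T))  # flat nonnegative initialization
    for W in Ws:  # each inference iteration is reinterpreted as one layer
        V = W @ H  # current nonnegative reconstruction of the mixture
        H = H * (W.T @ M) / (W.T @ V + eps)  # multiplicative update
    return H

# Toy usage: a 3-layer unfolded network on random nonnegative data.
rng = np.random.default_rng(0)
M = np.abs(rng.standard_normal((16, 20)))
Ws = [np.abs(rng.standard_normal((16, 4))) for _ in range(3)]
H = unfolded_nmf(M, Ws)
```

Because every operation is differentiable, gradients can flow back through all K layers, which is what allows the generalized parametrization to be trained discriminatively rather than fixed by the generative model.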


Notes

  1. Indices k in superscript always refer to the iteration index (and similarly for l, defined later as the source index).


Author information

Corresponding author: John R. Hershey.


Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Hershey, J.R., Le Roux, J., Watanabe, S., Wisdom, S., Chen, Z., Isik, Y. (2017). Novel Deep Architectures in Speech Processing. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_6

  • Publisher: Springer, Cham

  • Print ISBN: 978-3-319-64679-4

  • Online ISBN: 978-3-319-64680-0
