Abstract
The deep hierarchical structure of human speech, with its multiple layers of hidden representation, is intrinsically connected to the dynamic characteristics manifested at all levels of speech production and perception. The desire to capitalize on even a partial understanding of this deep structure helped ignite the recent surge of interest in deep learning approaches to speech recognition and related applications, and a more thorough understanding of the deep structure of speech dynamics, together with the related computational representations, is expected to further advance research in speech technology. In this chapter, we first survey a series of studies on representing speech in a hidden space using dynamic systems and recurrent neural networks, emphasizing the different ways of learning the model parameters and, subsequently, the hidden feature representations of time-varying speech data. We analyze and organize this rich set of deep, dynamic speech models into two major categories: (1) top-down, generative models that adopt localist representations of speech classes and features in the hidden space; and (2) bottom-up, discriminative models that adopt distributed representations. Through detailed examination and comparison of these two types of models, we focus on localist versus distributed representations as their respective hallmarks and defining characteristics. Finally, we discuss potential strategies for leveraging the strengths of both the localist and distributed representations while overcoming their respective weaknesses, going beyond a blind integration of the two in which the generative model merely pre-trains the discriminative one, as in a popular method of training deep neural networks.
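To make the localist-versus-distributed distinction concrete, the sketch below (our illustration, not code from the chapter) contrasts the two hidden-space encodings the abstract refers to: a localist one-hot state vector, where each hidden unit stands for exactly one speech class, as in generative hidden dynamic models; and a distributed hidden vector, where every frame activates many units at once, as in the hidden layer of a simple recurrent neural network. The dimensions and the Elman-style recurrence are illustrative assumptions.

```python
import numpy as np

def localist_encode(class_index, num_classes):
    """Localist representation: one hidden unit per speech class.
    A frame belonging to class k is encoded as a one-hot vector."""
    v = np.zeros(num_classes)
    v[class_index] = 1.0
    return v

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """Distributed representation: one forward step of a simple
    (Elman-style) recurrent network; the hidden vector h is a dense
    pattern of activity spread over all units."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
num_classes, input_dim, hidden_dim = 40, 13, 64  # e.g. 40 phone classes, 13 MFCCs

# Localist: exactly one active unit identifies the class.
loc = localist_encode(7, num_classes)

# Distributed: run a few frames of (random stand-in) acoustics through the RNN.
W_xh = rng.standard_normal((input_dim, hidden_dim)) * 0.1
W_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
b_h = np.zeros(hidden_dim)
h = np.zeros(hidden_dim)
for _ in range(5):
    x_t = rng.standard_normal(input_dim)
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
```

In the localist vector, meaning is carried by *which* unit is on; in the distributed vector `h`, meaning is carried by the joint pattern of activity, which is what allows discriminative neural models to share structure across classes.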
Copyright information
© 2015 Springer Science+Business Media New York
About this chapter
Cite this chapter
Deng, L., Togneri, R. (2015). Deep Dynamic Models for Learning Hidden Representations of Speech Features. In: Ogunfunmi, T., Togneri, R., Narasimha, M. (eds) Speech and Audio Processing for Coding, Enhancement and Recognition. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-1456-2_6
DOI: https://doi.org/10.1007/978-1-4939-1456-2_6
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4939-1455-5
Online ISBN: 978-1-4939-1456-2
eBook Packages: Engineering (R0)