Abdel-Hamid O, Jiang H (2013) Rapid and effective speaker adaptation of convolutional neural network based models for speech recognition. In: Proceedings of the 14th Annual Conference of the International Speech Communication Association. Lyon, France
Abdel-Hamid O, Mohamed A, Jiang H, Penn G (2012) Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Kyoto, pp 4277–4280
Aleksic PS, Katsaggelos AK (2004) Comparison of low- and high-level visual features for audio-visual continuous automatic speech recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol 5, Montreal, pp 917–920
Barker J, Berthommier F (1999) Evidence of correlation between acoustic and visual features of speech. In: Proceedings of the 14th International Congress of Phonetic Sciences, San Francisco, pp 5–9
Bengio Y (2009) Learning deep architectures for AI. Found Trends Mach Learn 2(1):1–127
Bourlard H, Dupont S (1996) A new ASR approach based on independent processing and recombination of partial frequency bands. In: Proceedings of the 4th International Conference on Spoken Language Processing, vol 1, Philadelphia, pp 426–429
Bourlard H, Dupont S, Ris C (1996) Multi-stream speech recognition. IDIAP research report
Bourlard HA, Morgan N (1994) Connectionist speech recognition: a hybrid approach. Springer US, Boston
Brooke N, Petajan ED (1986) Seeing speech: Investigations into the synthesis and recognition of visible speech movements using automatic image processing and computer graphics. In: Proceedings of the International Conference on Speech Input and Output, Techniques and Applications, London, pp 104–109
Coates A, Huval B, Wang T, Wu DJ, Ng AY, Catanzaro B (2013) Deep learning with COTS HPC. In: Proceedings of the 30th International Conference on Machine Learning, Atlanta, pp 1337–1345
Cootes T, Edwards G, Taylor C (2001) Active appearance models. IEEE Trans Pattern Anal Mach Intell 23(6):681–685
Dahl GE, Yu D, Deng L, Acero A (2012) Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans Audio Speech Lang Process 20(1):30–42
Feng X, Zhang Y, Glass J (2014) Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Florence, pp 1759–1763
Gurban M, Thiran JP, Drugman T, Dutoit T (2008) Dynamic modality weighting for multi-stream HMMs in audio-visual speech recognition. In: Proceedings of the 10th International Conference on Multimodal Interfaces, Chania, pp 237–240
Heckmann M, Kroschel K, Savariaux C (2002) DCT-based video features for audio-visual speech recognition. In: Proceedings of the 7th International Conference on Spoken Language Processing, vol 3, Denver, pp 1925–1928
Hermansky H, Ellis D, Sharma S (2000) Tandem connectionist feature extraction for conventional HMM systems. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol 3, Istanbul, pp 1635–1638
Hinton G, Deng L, Yu D, Dahl G, Mohamed A, Jaitly N, Senior A, Vanhoucke V, Nguyen P, Sainath T, Kingsbury B (2012) Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Proc Mag 29:82–97
Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–7
Huang J, Kingsbury B (2013) Audio-visual deep learning for noise robust speech recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vancouver, pp 7596–7599
Janin A, Ellis D, Morgan N (1999) Multi-stream speech recognition: Ready for prime time? In: Proceedings of the 6th European Conference on Speech Communication and Technology. Budapest, Hungary
Krizhevsky A, Hinton GE (2011) Using very deep autoencoders for content-based image retrieval. In: Proceedings of the 19th European Symposium on Artificial Neural Networks. Bruges, Belgium
Krizhevsky A, Sutskever I, Hinton G (2012) ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems
Kuwabara H, Takeda K, Sagisaka Y, Katagiri S, Morikawa S, Watanabe T (1989) Construction of a large-scale Japanese speech database and its management system. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Glasgow, pp 560–563
Lan Y, Theobald BJ, Harvey R, Ong EJ, Bowden R (2010) Improving visual features for lip-reading. In: Proceedings of the International Conference on Auditory-Visual Speech Processing. Hakone, Japan
Le QV, Ranzato M, Monga R, Devin M, Chen K, Corrado GS, Dean J, Ng AY (2012) Building high-level features using large scale unsupervised learning. In: Proceedings of the 29th International Conference on Machine Learning, Edinburgh, pp 81–88
LeCun Y, Huang FJ, Bottou L (2004) Learning methods for generic object recognition with invariance to pose and lighting. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol 2, Washington, pp 97–104
LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
Lee H, Grosse R, Ranganath R, Ng AY (2009) Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In: Proceedings of the 26th International Conference on Machine Learning, Montreal, pp 609–616
Lee H, Pham P, Largman Y, Ng AY (2009) Unsupervised feature learning for audio classification using convolutional deep belief networks. In: Proceedings of the Advances in Neural Information Processing Systems 22, Vancouver, pp 1096–1104
Lerner B, Guterman H, Aladjem M, Dinstein I (1999) A comparative study of neural network based feature extraction paradigms. Pattern Recogn Lett 20(1):7–14
Luettin J, Thacker N, Beet S (1996) Visual speech recognition using active shape models and hidden Markov models. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol 2, Atlanta, pp 817–820
Maas AL, O’Neil TM, Hannun AY, Ng AY (2013) Recurrent neural network feature enhancement: The 2nd CHiME challenge. In: Proceedings of the 2nd International Workshop on Machine Listening in Multisource Environments. Vancouver, Canada
Martens J (2010) Deep learning via Hessian-free optimization. In: Proceedings of the 27th International Conference on Machine Learning, Haifa, pp 735–742
Matthews I, Cootes T, Bangham J, Cox S, Harvey R (2002) Extraction of visual features for lipreading. IEEE Trans Pattern Anal Mach Intell 24(2):198–213
Matthews I, Potamianos G, Neti C, Luettin J (2001) A comparison of model and transform-based visual features for audio-visual LVCSR. In: Proceedings of the IEEE International Conference on Multimedia and Expo. Tokyo, Japan
Mohamed A, Dahl GE, Hinton GE (2012) Acoustic modeling using deep belief networks. IEEE Trans Audio Speech Lang Process 20(1):14–22
Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng AY (2011) Multimodal deep learning. In: Proceedings of the 28th International Conference on Machine Learning
NVIDIA Corporation (2014) CUBLAS library version 6.0 user guide. CUDA Toolkit Documentation
Palaz D, Collobert R, Magimai.-Doss M (2013) Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks. In: Proceedings of the 14th Annual Conference of the International Speech Communication Association. Lyon, France
Pearlmutter B (1994) Fast exact multiplication by the Hessian. Neural Comput 6(1):147–160
Renals S, Morgan N, Bourlard H, Cohen M, Franco H (1994) Connectionist probability estimators in HMM speech recognition. IEEE Trans Speech Audio Process 2(1):161–174
Robert-Ribes J, Piquemal M, Schwartz JL, Escudier P (1996) Exploiting sensor fusion architectures and stimuli complementarity in AV speech recognition. In: Stork D, Hennecke M (eds) Speechreading by Humans and Machines. Springer, Berlin Heidelberg, pp 193–210
Sainath TN, Kingsbury B, Ramabhadran B (2012) Auto-encoder bottleneck features using deep belief networks. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Kyoto, pp 4153–4156
Scanlon P, Reilly R (2001) Feature analysis for automatic speechreading. In: Proceedings of the IEEE 4th Workshop on Multimedia Signal Processing, Cannes, pp 625–630
Schraudolph NN (2002) Fast curvature matrix-vector products for second-order gradient descent. Neural Comput 14(7):1723–38
Slaney M (1998) Auditory toolbox: A MATLAB toolbox for auditory modeling work version 2. Interval Research Corporation
Sutskever I, Martens J, Hinton G (2011) Generating text with recurrent neural networks. In: Proceedings of the 28th International Conference on Machine Learning, Bellevue, pp 1017–1024
Vincent P, Larochelle H, Bengio Y, Manzagol PA (2008) Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th International Conference on Machine Learning, New York, pp 1096–1103
Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol PA (2010) Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11:3371–3408
Yehia H, Rubin P, Vatikiotis-Bateson E (1998) Quantitative association of vocal-tract and facial behavior. Speech Comm 26:23–43
Yoshida T, Nakadai K, Okuno HG (2009) Automatic speech recognition improved by two-layered audio-visual integration for robot audition. In: Proceedings of the 9th IEEE-RAS International Conference on Humanoid Robots, Paris, pp 604–609
Young S, Evermann G, Gales M, Hain T, Liu XA, Moore G, Odell J, Ollason D, Povey D, Valtchev V, Woodland P (2009) The HTK Book (for HTK Version 3.4). Cambridge University Engineering Department