Automatic Speech Recognition Based on Neural Networks

  • Ralf Schlüter
  • Patrick Doetsch
  • Pavel Golik
  • Markus Kitza
  • Tobias Menne
  • Kazuki Irie
  • Zoltán Tüske
  • Albert Zeyer
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9811)

Abstract

In automatic speech recognition, as in many areas of machine learning, stochastic modeling increasingly relies on neural networks. In both acoustic and language modeling, neural networks today mark the state of the art for large vocabulary continuous speech recognition, providing large improvements over former approaches that were based solely on Gaussian mixture hidden Markov models and count-based language models. We give an overview of current activities in neural network based modeling for automatic speech recognition. This includes discussions of network topologies and cell types, training and optimization, choice of input features, adaptation and normalization, multitask training, as well as neural network based language modeling. Despite the clear progress obtained with neural network modeling in speech recognition, much remains to be done to arrive at a consistent and self-contained neural network based modeling approach that ties in with the former state of the art. We conclude with a discussion of open problems as well as potential future directions w.r.t. the integration of neural networks into automatic speech recognition systems.
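
To make the hybrid NN/HMM modeling idea at the core of this overview concrete, the following is a minimal illustrative sketch (ours, not taken from the paper): a neural network estimates HMM state posteriors p(s | x_t), and dividing out the state priors p(s) yields scaled likelihoods that can serve as HMM emission scores, as in the classical hybrid approach. NumPy is assumed; the function name scaled_log_likelihoods and the prior_scale parameter are hypothetical illustrative choices, not identifiers from the paper.

    import numpy as np

    def softmax(z):
        # Numerically stable softmax over the last axis.
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def scaled_log_likelihoods(logits, log_priors, prior_scale=1.0):
        # Hybrid NN/HMM conversion (illustrative):
        # log p(x_t | s) ~ log p(s | x_t) - prior_scale * log p(s) + const.
        # prior_scale is a commonly used tuning knob, labeled here as an assumption.
        log_post = np.log(softmax(logits) + 1e-12)
        return log_post - prior_scale * log_priors

    # Toy usage: 3 frames, 4 HMM states, uniform state priors.
    T, S = 3, 4
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(T, S))          # stand-in for NN output logits
    log_priors = np.log(np.full(S, 1.0 / S))  # uniform priors
    print(scaled_log_likelihoods(logits, log_priors).shape)  # -> (3, 4)

In a full system, these scaled likelihoods would replace the Gaussian mixture emission scores inside the HMM decoder; the sketch only shows the posterior-to-likelihood conversion step.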


Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

Ralf Schlüter · Patrick Doetsch · Pavel Golik · Markus Kitza · Tobias Menne · Kazuki Irie · Zoltán Tüske · Albert Zeyer

Lehrstuhl Informatik 6, RWTH Aachen University, Aachen, Germany