Improving Keyword Spotting with a Tandem BLSTM-DBN Architecture

  • Martin Wöllmer
  • Florian Eyben
  • Alex Graves
  • Björn Schuller
  • Gerhard Rigoll
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5933)


We propose a novel architecture for keyword spotting which is composed of a Dynamic Bayesian Network (DBN) and a bidirectional Long Short-Term Memory (BLSTM) recurrent neural net. The DBN uses a hidden garbage variable as well as the concept of switching parents to discriminate between keywords and arbitrary speech. Contextual information is incorporated by a BLSTM network, providing a discrete phoneme prediction feature for the DBN. Together with continuous acoustic features, the discrete BLSTM output is processed by the DBN which detects keywords. Due to the flexible design of our Tandem BLSTM-DBN recognizer, new keywords can be added to the vocabulary without having to re-train the model. Further, our concept does not require the training of an explicit garbage model. Experiments on the TIMIT corpus show that incorporating a BLSTM network into the DBN architecture can increase true positive rates by up to 10%.
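The tandem data flow described in the abstract can be illustrated with a minimal sketch. It assumes the discrete phoneme prediction is the index of the largest per-frame network output, which the abstract does not state explicitly. The function names, the toy score vectors standing in for BLSTM output, and the toy acoustic features are all hypothetical, and the DBN itself is not modelled here; the sketch only shows how a discrete prediction could be paired with continuous features as a joint observation.

```python
from typing import List, Sequence, Tuple


def discrete_phoneme_feature(scores: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame's phoneme score vector to the index of its largest entry.

    Assumption: the discrete feature is an argmax over per-frame outputs.
    Ties resolve to the lowest index.
    """
    return [max(range(len(frame)), key=frame.__getitem__) for frame in scores]


def tandem_observations(
    acoustic: Sequence[Sequence[float]],
    scores: Sequence[Sequence[float]],
) -> List[Tuple[Sequence[float], int]]:
    """Pair each continuous acoustic frame with its discrete phoneme index."""
    if len(acoustic) != len(scores):
        raise ValueError("acoustic features and network outputs must have equal length")
    return list(zip(acoustic, discrete_phoneme_feature(scores)))


# Toy data (hypothetical): 3 frames, 3 phoneme classes, 2 acoustic dimensions.
toy_scores = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.2, 0.2, 0.6]]
toy_acoustic = [[0.5, -1.2], [0.1, 0.3], [-0.4, 0.9]]

observations = tandem_observations(toy_acoustic, toy_scores)
```

Each element of `observations` is a (continuous features, discrete phoneme index) pair, the kind of joint per-frame input that a downstream keyword detector would consume.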


Keywords: Keyword Spotting, Long Short-Term Memory, Dynamic Bayesian Networks





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Martin Wöllmer (1)
  • Florian Eyben (1)
  • Alex Graves (2)
  • Björn Schuller (1)
  • Gerhard Rigoll (1)

  1. Institute for Human-Machine Communication, Technische Universität München, Germany
  2. Institute for Computer Science VI, Technische Universität München, Germany
