Cognitive Computation, Volume 2, Issue 3, pp 180–190

Bidirectional LSTM Networks for Context-Sensitive Keyword Detection in a Cognitive Virtual Agent Framework

  • Martin Wöllmer
  • Florian Eyben
  • Alex Graves
  • Björn Schuller
  • Gerhard Rigoll


Robust keyword detection in human speech is an important precondition for cognitive systems that aim to interact intelligently with users. Conventional keyword spotting techniques usually perform well on well-articulated read speech. However, natural, spontaneous, and emotionally colored speech remains challenging for today’s speech recognition systems and thus requires novel approaches with enhanced robustness. In this article, we propose a new architecture for vocabulary-independent keyword detection as needed for cognitive virtual agents such as the SEMAINE system. Our word spotting model is composed of a Dynamic Bayesian Network (DBN) and a bidirectional Long Short-Term Memory (BLSTM) recurrent neural network. The BLSTM network uses a self-learned amount of contextual information to provide a discrete phoneme prediction feature for the DBN, which is able to distinguish between keywords and arbitrary speech. We evaluate our Tandem BLSTM-DBN technique on both read speech and spontaneous emotional speech and show that it significantly outperforms conventional Hidden Markov Model-based approaches in both application scenarios.
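As a rough illustration of the tandem idea described above, the following minimal sketch shows how frame-wise phoneme scores from a network could be collapsed into one discrete symbol per frame, which a downstream model such as a DBN could then treat as an observation. It is not the authors' implementation: the phoneme inventory, the toy scores, and the function names are hypothetical, and it assumes the discrete feature is simply the index of the highest-scoring phoneme in each frame.

```python
import math

# Hypothetical phoneme inventory, for illustration only.
PHONEMES = ["sil", "ae", "k", "t"]


def softmax(scores):
    """Turn the raw output activations of one frame into posteriors."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def discrete_phoneme_feature(frame_scores):
    """Map each frame to the index of its most probable phoneme.

    Assumption: the discrete feature is the arg-max over the
    frame-wise phoneme posteriors.
    """
    features = []
    for scores in frame_scores:
        posteriors = softmax(scores)
        best = max(range(len(posteriors)), key=posteriors.__getitem__)
        features.append(best)
    return features


# Made-up activations standing in for the outputs of a BLSTM,
# one row per frame and one column per phoneme.
toy_scores = [
    [2.0, 0.1, 0.3, 0.2],
    [0.2, 0.4, 1.9, 0.1],
    [0.1, 2.2, 0.5, 0.3],
    [0.0, 0.3, 0.2, 1.7],
]

indices = discrete_phoneme_feature(toy_scores)
labels = [PHONEMES[i] for i in indices]
```

In a real system the scores would come from a trained bidirectional network that sees both past and future context for every frame; only the discretisation step is shown here.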


Keyword spotting, Long short-term memory, Dynamic Bayesian networks, Cognitive systems, Virtual agents



The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement No. 211486 (SEMAINE).



Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Martin Wöllmer (1)
  • Florian Eyben (1)
  • Alex Graves (2)
  • Björn Schuller (1)
  • Gerhard Rigoll (1)

  1. Institute for Human-Machine Communication, Technische Universität München, München, Germany
  2. Institute for Computer Science VI, Technische Universität München, München, Germany
