Very Fast Keyword Spotting System with Real Time Factor Below 0.01

  • Conference paper
Text, Speech, and Dialogue (TSD 2020)

Abstract

In this paper we present the architecture of a keyword spotting (KWS) system that is based on modern neural networks, performs well on various types of speech data, and runs very fast. We focus mainly on speed and propose optimizations for every step of a KWS design: signal processing and likelihood computation, Viterbi decoding, spot candidate detection, and confidence calculation. We present time- and memory-efficient modelling with bidirectional feedforward sequential memory networks (an alternative to recurrent nets), using either standard triphones or so-called quasi-monophones, and an entirely forward decoding of speech frames (with minimal need for look-back). Several variants of the proposed scheme are evaluated on three large Czech datasets (broadcast, internet, and telephone; 17 h in total), and their performance is compared using Detection Error Tradeoff (DET) diagrams and real-time (RT) factors. We demonstrate that the complete system can run in a single pass with an RT factor close to 0.001 if all optimizations (including a GPU for likelihood computation) are applied.
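To give a sense of scale for the RT factors quoted above, the following minimal sketch assumes the usual definition of the real-time factor as processing time divided by audio duration (the abstract does not spell the definition out). The function names are hypothetical and are not taken from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Assumed definition: processing time divided by audio duration.

    Values below 1.0 mean faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def processing_seconds_for(rt_factor: float, audio_hours: float) -> float:
    """Processing time implied by a given RT factor for audio_hours of audio."""
    return rt_factor * audio_hours * 3600.0


# The abstract reports 17 h of test audio and an RT factor close to 0.001.
AUDIO_HOURS = 17.0
at_0_001 = processing_seconds_for(0.001, AUDIO_HOURS)  # roughly one minute
at_0_01 = processing_seconds_for(0.01, AUDIO_HOURS)    # roughly ten minutes

print(f"RT 0.001 over {AUDIO_HOURS} h: {at_0_001:.1f} s")
print(f"RT 0.01  over {AUDIO_HOURS} h: {at_0_01:.1f} s")
```

Under this assumption, an RT factor of 0.001 corresponds to processing the whole 17 h collection in about a minute, while the 0.01 bound in the title corresponds to about ten minutes.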

Acknowledgments

This work was supported by the Technology Agency of the Czech Republic (Project No. TH03010018).

Corresponding author

Correspondence to Jan Nouza.

Copyright information

© 2020 Springer Nature Switzerland AG

Cite this paper

Nouza, J., Červa, P., Žďánský, J. (2020). Very Fast Keyword Spotting System with Real Time Factor Below 0.01. In: Sojka, P., Kopeček, I., Pala, K., Horák, A. (eds) Text, Speech, and Dialogue. TSD 2020. Lecture Notes in Computer Science, vol 12284. Springer, Cham. https://doi.org/10.1007/978-3-030-58323-1_46

  • DOI: https://doi.org/10.1007/978-3-030-58323-1_46

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58322-4

  • Online ISBN: 978-3-030-58323-1

  • eBook Packages: Computer Science (R0)
