Very Fast Keyword Spotting System with Real Time Factor Below 0.01

Nouza, Jan; Červa, Petr; Žďánský, Jindřich

doi:10.1007/978-3-030-58323-1_46

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12284)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

Abstract

In this paper, we present an architecture of a keyword spotting (KWS) system that is based on modern neural networks, yields good performance on various types of speech data, and runs very fast. We focus mainly on the last aspect and propose optimizations for all the steps required in a KWS design: signal processing and likelihood computation, Viterbi decoding, spot candidate detection, and confidence calculation. We present time- and memory-efficient modelling with bidirectional feedforward sequential memory networks (an alternative to recurrent nets), using either standard triphones or so-called quasi-monophones, and an entirely forward decoding of speech frames (with minimal need for look-back). Several variants of the proposed scheme are evaluated on 3 large Czech datasets (broadcast, internet, and telephone; 17 h in total), and their performance is compared by Detection Error Tradeoff (DET) diagrams and real-time (RT) factors. We demonstrate that the complete system can run in a single pass with an RT factor close to 0.001 if all optimizations (including a GPU for likelihood computation) are applied.
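
To put the reported RT factors in perspective, the following minimal Python sketch uses the standard definition of the real-time factor (processing time divided by audio duration) to estimate how long it would take to process the 17 h of evaluation data at the RT factors named in the title and the abstract. The function names `real_time_factor` and `processing_budget` are illustrative assumptions, not part of the authors' system, and the figures are simple arithmetic rather than measured results.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Standard RT factor: processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def processing_budget(audio_hours: float, rt_factor: float) -> float:
    """Seconds of processing implied by an RT factor for a given amount of audio."""
    return audio_hours * 3600.0 * rt_factor


if __name__ == "__main__":
    total_hours = 17.0  # total size of the three datasets stated in the abstract
    for rt in (0.01, 0.001):  # values named in the title and the abstract
        budget = processing_budget(total_hours, rt)
        print(f"RT factor {rt}: about {budget:.1f} s for {total_hours:g} h of audio")
```

Under these assumptions, an RT factor of 0.001 corresponds to roughly one minute of processing for the whole 17 h collection, and an RT factor of 0.01 to roughly ten minutes.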

Acknowledgments

This work was supported by the Technology Agency of the Czech Republic (Project No. TH03010018).

Author information

Authors and Affiliations

Institute of Information Technologies and Electronics, Technical University of Liberec, Studentska 2, 46117, Liberec, Czech Republic
Jan Nouza, Petr Červa & Jindřich Žďánský

Corresponding author

Correspondence to Jan Nouza.

Editor information

Editors and Affiliations

Faculty of Informatics, Masaryk University, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Ivan Kopeček
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Karel Pala
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Aleš Horák

Cite this paper

Nouza, J., Červa, P., Žďánský, J. (2020). Very Fast Keyword Spotting System with Real Time Factor Below 0.01. In: Sojka, P., Kopeček, I., Pala, K., Horák, A. (eds) Text, Speech, and Dialogue. TSD 2020. Lecture Notes in Computer Science, vol 12284. Springer, Cham. https://doi.org/10.1007/978-3-030-58323-1_46

DOI: https://doi.org/10.1007/978-3-030-58323-1_46
Published: 01 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58322-4
Online ISBN: 978-3-030-58323-1
eBook Packages: Computer Science, Computer Science (R0)
