Abstract
In this paper, we present an architecture of a keyword spotting (KWS) system that is based on modern neural networks, performs well on various types of speech data, and runs very fast. We focus mainly on the last aspect and propose optimizations for all the steps required in a KWS design: signal processing and likelihood computation, Viterbi decoding, spot candidate detection, and confidence calculation. We present time- and memory-efficient modelling by bidirectional feedforward sequential memory networks (an alternative to recurrent nets), using either standard triphones or so-called quasi-monophones, and an entirely forward decoding of speech frames (with minimal need for look-back). Several variants of the proposed scheme are evaluated on three large Czech datasets (broadcast, internet, and telephone; 17 h in total), and their performance is compared using Detection Error Tradeoff (DET) diagrams and real-time (RT) factors. We demonstrate that the complete system can run in a single pass with an RT factor close to 0.001 if all optimizations (including a GPU for likelihood computation) are applied.
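To give a sense of scale for the RT factor quoted above, the following minimal Python sketch assumes the common definition of the RT factor as processing time divided by audio duration. The function names are hypothetical and chosen only for illustration; they are not part of the described system, and the result is plain arithmetic on the figures stated in the abstract, not a measured value.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RT factor under the assumed definition: processing time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def processing_time(audio_seconds: float, rt_factor: float) -> float:
    """Processing time implied by a given RT factor (hypothetical helper)."""
    return audio_seconds * rt_factor


# Figures taken from the abstract: 17 h of audio, RT factor close to 0.001.
AUDIO_HOURS = 17.0
RT_FACTOR = 0.001

audio_seconds = AUDIO_HOURS * 3600.0
estimated_seconds = processing_time(audio_seconds, RT_FACTOR)

print(f"{AUDIO_HOURS:.0f} h of audio at RT factor {RT_FACTOR}: "
      f"about {estimated_seconds:.1f} s of processing")
```

Under this assumption, an RT factor of 0.001 would correspond to roughly one minute of processing for the whole 17 h evaluation set.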
Acknowledgments
This work was supported by the Technology Agency of the Czech Republic (Project No. TH03010018).
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Nouza, J., Červa, P., Žďánský, J. (2020). Very Fast Keyword Spotting System with Real Time Factor Below 0.01. In: Sojka, P., Kopeček, I., Pala, K., Horák, A. (eds) Text, Speech, and Dialogue. TSD 2020. Lecture Notes in Computer Science, vol 12284. Springer, Cham. https://doi.org/10.1007/978-3-030-58323-1_46
DOI: https://doi.org/10.1007/978-3-030-58323-1_46
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58322-4
Online ISBN: 978-3-030-58323-1
eBook Packages: Computer Science, Computer Science (R0)