Abstract
Keyword spotting is an important task for human-computer interaction (HCI). To preserve privacy, the recognition task should be performed at the edge, so the goal is to maximize accuracy within a limited computational budget. This paper proposes a new keyword spotting technique based on a convolutional neural network (CNN), built on densely connected convolutional networks (DenseNet). To make the model smaller, we replace the standard convolutions with group convolutions and depthwise separable convolutions. We add squeeze-and-excitation networks (SENet) to increase the weight of important features and thereby improve accuracy. To investigate the effect of different convolutions on DenseNet, we built two models: SpDenseNet and SpDenseNet-L. We validated the networks on the Google Speech Commands dataset. Our proposed network achieves better accuracy than other networks despite using fewer parameters and floating-point operations (FLOPs). SpDenseNet achieves an accuracy of 96.3% with 122.63 K trainable parameters and 142.7 M FLOPs, using only about 52% of the parameters and about 12% of the FLOPs of the benchmark works. In addition, we varied the depth and width of the network to build compact variants, which also outperform other compact models: SpDenseNet-L-narrow achieves an accuracy of 93.6% with 9.27 K trainable parameters and 3.47 M FLOPs. Compared to the benchmark works, SpDenseNet-L-narrow improves accuracy by 3.5% while using only about 47% of the parameters and about 48% of the FLOPs.
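The parameter savings from replacing a standard convolution with a depthwise separable one, as done in the proposed models, can be illustrated with a back-of-the-envelope calculation. The following is a minimal sketch in plain Python; the layer sizes are hypothetical examples, not taken from the paper:

```python
def standard_conv_params(k, c_in, c_out):
    """Parameters of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise k x k convolution (one filter per input channel)
    followed by a 1 x 1 pointwise convolution (bias ignored)."""
    return k * k * c_in + c_in * c_out

# Example: a 3 x 3 layer with 64 input and 64 output channels.
std = standard_conv_params(3, 64, 64)        # 36864
sep = depthwise_separable_params(3, 64, 64)  # 4672
print(std, sep, round(sep / std, 3))         # ratio ~0.127, i.e. roughly 8x fewer parameters
```

Group convolution gives a similar reduction by a factor of the number of groups, which is why combining these substitutions with DenseNet's feature reuse yields the small footprints reported in the abstract.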
Data availability
The datasets generated and/or analyzed during the present study are available from the corresponding author on reasonable request.
Funding
There are no funders to report for this submission.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Tsai, TH., Lin, XH. Speech densely connected convolutional networks for small-footprint keyword spotting. Multimed Tools Appl 82, 39119–39137 (2023). https://doi.org/10.1007/s11042-023-14617-5