Abstract
This manuscript studies the noise robustness of deep learning models on two audio classification tasks. The first is a speaker recognition application that identifies five different speakers; the second is a speech command identification task whose goal is to classify ten voice commands. Both tasks are central to making communication between humans and smart devices as smooth and natural as possible. The emergence of smart home devices such as personal assistants, together with the deployment of audio-based applications in noisy environments, raises new challenges and exposes the weaknesses of existing speech recognition systems. Despite the advances of deep learning in audio tasks, most proposed architectures are computationally inefficient and highly sensitive to noise. This research addresses these problems by proposing two neural architectures that incorporate a novel pooling operation, named entropy pooling, which is based on the principle of maximum entropy. A detailed ablation study evaluates entropy pooling against the classic max and average pooling layers. The developed networks build on two base architectures, convolutional networks and residual ones. The study shows that entropy-based feature pooling improves the robustness of both architectures in the presence of noise.
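The abstract does not reproduce the exact formulation of entropy pooling, so the sketch below is only an illustrative assumption of how an entropy-motivated pooling layer can differ from max and average pooling: the activations in a pooling window are normalised into a probability distribution (via a softmax), and the pooled value is the expectation under that distribution. The function name `entropy_pool` and the softmax weighting are hypothetical, not the authors' published operation.

```python
import numpy as np

def entropy_pool(window, eps=1e-12):
    """Hypothetical entropy-motivated pooling over a 1-D window.

    Normalises the activations into a probability distribution and
    returns the expected value under it, so the output reflects the
    whole window rather than a single extreme value (as max pooling
    does) -- one plausible way to soften sensitivity to noisy spikes.
    """
    w = np.asarray(window, dtype=float)
    p = np.exp(w - w.max())        # softmax, shifted for numerical stability
    p /= p.sum() + eps
    return float((p * w).sum())

# Comparison with max and average pooling on the same window
window = [0.2, 0.9, 0.1, 0.4]
print(max(window))                 # max pooling: keeps only the peak
print(sum(window) / len(window))   # average pooling: treats all values equally
print(entropy_pool(window))        # weighted value, lies between the two
```

For this example window the entropy-weighted output falls strictly between the average and the maximum, which illustrates the intuition of a pooling operation that neither discards context (max) nor dilutes salient activations (average).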
Funding
This research has been co-financed by the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH–CREATE–INNOVATE (project code: T1EDK-00343 (95699) - Energy Controlling Voice Enabled Intelligent Smart Home Ecosystem).
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Nalmpantis, C., Vrysis, L., Vlachava, D. et al. Noise invariant feature pooling for the internet of audio things. Multimed Tools Appl 81, 32057–32072 (2022). https://doi.org/10.1007/s11042-022-12931-y