
Noise invariant feature pooling for the internet of audio things

Published in: Multimedia Tools and Applications

Abstract

This manuscript discusses the robustness to noise of deep learning models for two audio classification tasks. The first is a speaker recognition task, identifying five different speakers. The second is a speech command identification task, classifying ten voice commands. Both tasks are central to making communication between humans and smart devices as smooth and natural as possible. The emergence of smart home devices such as personal assistants, together with the deployment of audio-based applications in noisy environments, raises new challenges and exposes the weaknesses of existing speech recognition systems. Despite the advances of deep learning in audio tasks, most proposed architectures are computationally inefficient and very sensitive to noise. This research addresses these problems by proposing two neural architectures that incorporate a novel pooling operation, named entropy pooling, which is based on the principle of maximum entropy. A detailed ablation study evaluates entropy pooling against the classic max and average pooling layers. The networks developed are based on two architectures: convolutional networks and residual ones. The study shows that entropy-based feature pooling improves the robustness of these architectures in the presence of noise.
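The entropy pooling operation itself is defined in the full article; as a rough illustration of the idea, the sketch below contrasts max and average pooling with a hypothetical entropy-weighted pooling of a window of activations. The weighting scheme here (information content under the window's empirical distribution) is an assumption for illustration only, not the paper's formulation.

```python
import math

def pool_window(window, mode="entropy"):
    """Pool a 1-D window of non-negative activations.

    'max' and 'avg' are the classic pooling operators. 'entropy' is a
    hypothetical maximum-entropy-inspired variant: each activation is
    weighted by its information content -p*log(p) under the window's
    empirical distribution, so no single value dominates the output.
    """
    if mode == "max":
        return max(window)
    if mode == "avg":
        return sum(window) / len(window)
    # Entropy-weighted pooling (illustrative assumption, not the
    # formulation from the article): normalize the window to a
    # probability distribution, then sum activations weighted by
    # their surprisal contribution.
    total = sum(window)
    if total == 0:
        return 0.0
    probs = [x / total for x in window]
    return sum(p * (-math.log(p)) * x for p, x in zip(probs, window) if p > 0)

window = [0.1, 0.9, 0.2, 0.8]
print(pool_window(window, "max"))   # 0.9
print(pool_window(window, "avg"))   # 0.5
print(pool_window(window, "entropy"))
```

Max pooling keeps only the loudest activation (and therefore any noise spike), average pooling dilutes salient features, while an entropy-driven pooling distributes influence across the window, which is the intuition behind the robustness claim above.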




Funding

This research has been co-financed by the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH–CREATE–INNOVATE (project code: T1EDK-00343 (95699) - Energy Controlling Voice Enabled Intelligent Smart Home Ecosystem).

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

Author information


Corresponding author

Correspondence to Christoforos Nalmpantis.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Nalmpantis, C., Vrysis, L., Vlachava, D. et al. Noise invariant feature pooling for the internet of audio things. Multimed Tools Appl 81, 32057–32072 (2022). https://doi.org/10.1007/s11042-022-12931-y

