
ORVAE: One-Class Residual Variational Autoencoder for Voice Activity Detection in Noisy Environment

Published in: Neural Processing Letters

Abstract

Detecting human speech is foundational for a wide range of emerging intelligent applications. However, accurately detecting speech is challenging, especially in the presence of unknown noise patterns. Deep learning-based methods have generally been shown to be more robust and accurate than statistical methods and other existing approaches. However, building a noise-robust and well-generalized deep learning-based voice activity detection system typically requires collecting an enormous amount of annotated audio data. In this work, we develop a generalized model trained on a limited set of human speech recordings with noisy backgrounds, yet it can detect human speech in the presence of various unseen noise types that were not present in the training set. To achieve this, we propose a one-class residual-connection-based variational autoencoder (ORVAE), which requires only a limited amount of human speech data with noisy backgrounds for training, thereby eliminating the need to collect data with diverse noise patterns. Evaluating ORVAE on three different datasets (synthesized TIMIT and NOISEX-92, synthesized LibriSpeech and NOISEX-92, and a publicly recorded dataset), our method outperforms other one-class baseline methods, achieving \(F_1\)-scores of over \(90\%\) at multiple signal-to-noise ratio levels.
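The paper's full architecture is not reproduced on this page, but the idea can be summarized as a variational autoencoder with residual (skip) connections trained on a single class of data, so that a reconstruction-error score separates speech from non-speech segments at test time. The sketch below is a minimal illustration of that pattern in PyTorch; the layer sizes, the log-mel input features, and the thresholding rule are assumptions for illustration only, not the authors' reported configuration.

```python
# Illustrative sketch only: a one-class VAE with residual blocks over
# log-mel spectrogram segments. Layer sizes, feature choice, and the
# reconstruction-error decision rule are assumptions, not the paper's
# exact ORVAE architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualBlock(nn.Module):
    """Two 1-D convolutions with a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        h = F.relu(self.conv1(x))
        h = self.conv2(h)
        return F.relu(x + h)  # residual (skip) connection


class OneClassResidualVAE(nn.Module):
    """VAE trained on one class; segments are scored by reconstruction error."""
    def __init__(self, n_mels=64, frames=32, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, 32, kernel_size=3, padding=1), nn.ReLU(),
            ResidualBlock(32),
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(32 * frames, latent_dim)
        self.fc_logvar = nn.Linear(32 * frames, latent_dim)
        self.fc_dec = nn.Linear(latent_dim, 32 * frames)
        self.decoder = nn.Sequential(
            ResidualBlock(32),
            nn.Conv1d(32, n_mels, kernel_size=3, padding=1),
        )
        self.frames = frames

    def forward(self, x):  # x: (batch, n_mels, frames)
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        h = self.fc_dec(z).view(-1, 32, self.frames)
        return self.decoder(h), mu, logvar


def vae_loss(x, x_hat, mu, logvar):
    """Standard VAE objective: reconstruction term plus KL divergence."""
    recon = F.mse_loss(x_hat, x, reduction="mean")
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kld


# Scoring at test time: each segment gets a reconstruction-error score,
# and a threshold tuned on a validation split assigns speech/non-speech.
model = OneClassResidualVAE()
segment = torch.randn(8, 64, 32)  # a batch of hypothetical log-mel segments
x_hat, mu, logvar = model(segment)
score = F.mse_loss(x_hat, segment, reduction="none").mean(dim=(1, 2))
```

In a one-class setup of this kind, the decision threshold on the score is typically tuned on held-out validation data rather than learned during training.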



Acknowledgements

This work was partly supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No. 2019-0-00421, AI Graduate School Support Program (Sungkyunkwan University); No. 2019-0-01343, Regional Strategic Industry Convergence Security Core Talent Training Business) and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2020R1C1C1006004). This research was also partly supported by IITP grants funded by the Korea government (MSIT) (No. 2021-0-00066 and No. 2021-0-00017, Core Technology Development of Artificial Intelligence Industry), and by the MSIT (Ministry of Science and ICT), Korea, under the High-Potential Individuals Global Training Program (2020-0-01550) supervised by the IITP.

Author information


Corresponding authors

Correspondence to Jong Hwan Ko or Simon S. Woo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Khalid, H., Tariq, S., Kim, T. et al. ORVAE: One-Class Residual Variational Autoencoder for Voice Activity Detection in Noisy Environment. Neural Process Lett 54, 1565–1586 (2022). https://doi.org/10.1007/s11063-021-10695-4

