Abstract
Voice Activity Detection (VAD) is a technique that classifies an audio signal into two parts, speech and background noise, and it is widely used in emerging speech technologies such as mobile communication, high-quality multimedia transmission, forensic science, and voice recognition applications. Because VAD is an integral part of a speech communication system, selecting an accurate VAD is challenging in terms of complexity, feature extraction, threshold selection, and percentage of correct decisions. Researchers have generally classified VAD methods into supervised and unsupervised systems and have introduced various feature-based algorithms to indicate the presence of speech. However, a comprehensive study is needed to guide the selection of appropriate techniques from existing VAD methods, together with their challenges and solutions, in order to set future research directions in the emerging area of voice recognition. This manuscript therefore presents an extensive study that weighs the obstacles against the performance of previously developed VAD methods. The authors believe that this review will be helpful to researchers working in the challenging domain of speech processing and recognition.
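To make the notions of feature extraction, threshold selection, and percentage of correctness concrete, the following is a minimal sketch of a frame-energy VAD in Python. It is not taken from the chapter or from any of the surveyed methods. The function names, the frame length of 160 samples, the noise-floor multiplier, and the assumption that the first few frames contain only noise are all illustrative choices.

```python
import math
import random


def frame_energies(samples, frame_len):
    """Feature extraction: mean-square energy of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def energy_vad(samples, frame_len=160, noise_frames=5, factor=4.0):
    """Label each frame True (speech) or False (background noise).

    Assumption (hypothetical, for illustration only): the first
    `noise_frames` frames contain noise alone, so their average energy
    can serve as a noise-floor estimate.
    """
    energies = frame_energies(samples, frame_len)
    if len(energies) < noise_frames:
        raise ValueError("signal too short to estimate the noise floor")
    noise_floor = sum(energies[:noise_frames]) / noise_frames
    threshold = factor * noise_floor  # threshold selection
    return [e > threshold for e in energies]


def accuracy(predicted, reference):
    """Percentage of frames whose decision matches the reference labels."""
    correct = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * correct / len(reference)


# Synthetic test signal: 10 noise frames, 10 "speech" frames (a 200 Hz tone
# plus noise at an assumed 8 kHz sampling rate), then 10 more noise frames.
rng = random.Random(0)
frame_len = 160


def noise():
    return rng.gauss(0.0, 0.01)


signal = [noise() for _ in range(10 * frame_len)]
signal += [0.5 * math.sin(2 * math.pi * 200 * n / 8000) + noise()
           for n in range(10 * frame_len)]
signal += [noise() for _ in range(10 * frame_len)]
reference = [False] * 10 + [True] * 10 + [False] * 10

decisions = energy_vad(signal, frame_len)
print(f"frame-level accuracy: {accuracy(decisions, reference):.1f}%")
```

A fixed multiple of an initial noise estimate is about the simplest possible threshold rule. It tends to break down when the noise level changes over time or the signal-to-noise ratio is low, which is one reason the literature explores richer features and adaptive or learned decision rules.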
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Sharma, S., Rattan, P., Sharma, A. (2021). Recent Developments, Challenges, and Future Scope of Voice Activity Detection Schemes—A Review. In: Kaiser, M.S., Xie, J., Rathore, V.S. (eds) Information and Communication Technology for Competitive Strategies (ICTCS 2020). Lecture Notes in Networks and Systems, vol 190. Springer, Singapore. https://doi.org/10.1007/978-981-16-0882-7_39
DOI: https://doi.org/10.1007/978-981-16-0882-7_39
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-0881-0
Online ISBN: 978-981-16-0882-7
eBook Packages: Intelligent Technologies and Robotics (R0)