Audio-visual voice activity detection

Liu, Peng; Wang, Zuo-ying

doi:10.1007/s11460-006-0081-5

Research Article
Published: December 2006

Volume 1, pages 425–430, (2006)
Frontiers of Electrical and Electronic Engineering in China

Liu Peng¹ & Wang Zuo-ying¹

Abstract

In speech signal processing systems, frame-energy based voice activity detection (VAD) can be disrupted by background noise and by the non-stationary behavior of the frame energy within voice segments. The purpose of this paper is to improve the performance and robustness of VAD by introducing visual information. A data-driven linear transformation is adopted for visual feature extraction, and a general statistical VAD model is designed. Using the general model and a two-stage fusion strategy presented in this paper, a concrete multimodal VAD system is built. Experiments show that multimodal VAD achieves a 55.0% relative reduction in frame error rate and a 98.5% relative reduction in sentence-breaking error rate compared with frame-energy based audio VAD. With the multimodal method, sentence-breaking errors are almost entirely avoided and frame-detection performance is clearly improved, which demonstrates the effectiveness of the visual modality in VAD.
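
To make the audio-only baseline concrete, the following is a minimal sketch of a frame-energy VAD. It is not the authors' system: the frame length, the noise-floor estimate taken from the first few frames, the 6 dB margin, and the names frame_log_energies, energy_vad, tone and hum are all assumptions chosen for illustration.

```python
import math


def frame_log_energies(samples, frame_len=160):
    """Split samples into non-overlapping frames; return each frame's log energy in dB."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(power + 1e-12))  # small floor avoids log(0)
    return energies


def energy_vad(samples, frame_len=160, noise_frames=5, margin_db=6.0):
    """Label each frame True (voice) or False (non-voice) by thresholding its energy.

    Assumption: the first `noise_frames` frames contain background noise only,
    so their mean log energy can serve as the noise floor.
    """
    energies = frame_log_energies(samples, frame_len)
    if not energies:
        return []
    lead = energies[:noise_frames]
    noise_floor_db = sum(lead) / len(lead)
    threshold_db = noise_floor_db + margin_db
    return [e > threshold_db for e in energies]


def tone(n, amplitude, period=40):
    """Synthetic stand-in for speech: a sine wave with the given period in samples."""
    return [amplitude * math.sin(2.0 * math.pi * i / period) for i in range(n)]


def hum(n, amplitude):
    """Synthetic stand-in for low-level background noise: an alternating-sign signal."""
    return [amplitude if i % 2 == 0 else -amplitude for i in range(n)]


# Five quiet frames, three loud frames, two quiet frames.
signal = hum(5 * 160, 0.001) + tone(3 * 160, 0.5) + hum(2 * 160, 0.001)
print(energy_vad(signal))
```

On this clean synthetic signal, only the three loud frames are marked as voice. The sketch also shows the weakness described in the abstract: each decision depends on a single energy threshold, so a noise burst above the threshold or a low-energy stretch inside an utterance would be misclassified. This is the kind of error that an additional visual cue is meant to reduce.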

Author information

Authors and Affiliations

Department of Electronic Engineering, Tsinghua University, Beijing, 100084, China
Liu Peng & Wang Zuo-ying

Corresponding author

Correspondence to Wang Zuo-ying.

Additional information

Translated from Journal of Tsinghua University (Science and Technology), 2005, 45(7): 896–899 (in Chinese)

About this article

Cite this article

Liu, P., Wang, Zy. Audio-visual voice activity detection. Front. Electr. Electron. Eng. China 1, 425–430 (2006). https://doi.org/10.1007/s11460-006-0081-5

Issue Date: December 2006
DOI: https://doi.org/10.1007/s11460-006-0081-5
