Research Article

Frontiers of Electrical and Electronic Engineering in China, Volume 1, Issue 4, pp 425-430

Audio-visual voice activity detection

  • Liu Peng, Department of Electronic Engineering, Tsinghua University
  • Wang Zuo-ying, Department of Electronic Engineering, Tsinghua University



In speech signal processing systems, voice activity detection (VAD) based on frame energy is susceptible to background noise and to the non-stationary behavior of frame energy within speech segments. This paper aims to improve the performance and robustness of VAD by introducing visual information. A data-driven linear transformation is adopted for visual feature extraction, and a general statistical VAD model is designed. Using this general model and the two-stage fusion strategy presented in the paper, a concrete multimodal VAD system is built. Experiments show that, compared with frame-energy based audio VAD, multimodal VAD achieves a 55.0% relative reduction in frame error rate and a 98.5% relative reduction in sentence-breaking error rate. With the multimodal method, sentence-breaking errors are almost eliminated and frame-detection performance is clearly improved, which demonstrates the effectiveness of the visual modality in VAD.
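
To make the audio baseline and the reported metric concrete, the following is a minimal sketch of a frame-energy VAD and of how a relative error reduction is computed. It is an illustration under stated assumptions, not the authors' system: the function names, the fixed threshold, the frame length, and the synthetic test signal are all hypothetical, and the statistical model and two-stage fusion described in the abstract are not reproduced here.

```python
import math


def frame_energies(samples, frame_len):
    """Split samples into non-overlapping frames; return mean energy per frame."""
    n_frames = len(samples) // frame_len
    return [
        sum(x * x for x in samples[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def energy_vad(samples, frame_len, threshold):
    """Label a frame as speech (True) when its mean energy exceeds the threshold.

    A fixed threshold is an assumption made for illustration; it is exactly
    the kind of rule that background noise can push into false detections.
    """
    return [e > threshold for e in frame_energies(samples, frame_len)]


def relative_reduction(baseline_error, new_error):
    """Relative error reduction: (baseline - new) / baseline."""
    return (baseline_error - new_error) / baseline_error


# Synthetic toy signal (not data from the paper): silence, a tone, silence.
frame_len = 80
silence = [0.0] * (2 * frame_len)
tone = [0.5 * math.sin(2 * math.pi * 0.05 * n) for n in range(3 * frame_len)]
signal = silence + tone + silence

# One boolean per frame: True where the frame is classified as speech.
labels = energy_vad(signal, frame_len, threshold=0.01)
```

Under this definition, a relative reduction of 55.0% means the multimodal frame error rate is 45% of the audio-only frame error rate. The absolute error rates are not given in the abstract, so none are assumed here.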


Keywords: speech recognition, voice activity detection, multimodal