Audio-visual voice activity detection

Research Article


In speech signal processing systems, frame-energy based voice activity detection (VAD) can be disrupted by background noise and by the non-stationary nature of frame energy within voice segments. This paper aims to improve the performance and robustness of VAD by introducing visual information. A data-driven linear transformation is adopted for visual feature extraction, and a general statistical VAD model is designed. Using this general model and the two-stage fusion strategy presented in the paper, a concrete multimodal VAD system is built. Experiments show that multimodal VAD yields a 55.0% relative reduction in frame error rate and a 98.5% relative reduction in sentence-breaking error rate compared with frame-energy based audio VAD. With the multimodal method, sentence-breaking errors are almost eliminated and frame-detection performance is clearly improved, which demonstrates the effectiveness of the visual modality in VAD.
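To make the audio-only baseline concrete, the following is a minimal sketch of a generic frame-energy VAD. It is not the detector evaluated in the paper: the function name `frame_energy_vad`, the frame length of 256 samples, and the fixed threshold of -30 dB are all assumptions chosen for illustration.

```python
import numpy as np


def frame_energy_vad(signal, frame_len=256, threshold_db=-30.0):
    """Label each non-overlapping frame as speech (True) or non-speech (False).

    Hypothetical baseline: a frame counts as speech when its mean energy,
    expressed in dB, exceeds a fixed threshold. All parameters are assumed.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    # Drop trailing samples that do not fill a whole frame.
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)
    # The small constant avoids log(0) on digitally silent frames.
    energy_db = 10.0 * np.log10(energy + 1e-12)
    return energy_db > threshold_db


# Two silent frames followed by two frames of a 440 Hz tone sampled at 8 kHz.
t = np.arange(512) / 8000.0
tone = 0.5 * np.sin(2.0 * np.pi * 440.0 * t)
clean = np.concatenate([np.zeros(512), tone])
labels = frame_energy_vad(clean)
```

A fixed threshold of this kind is what makes the approach fragile: a constant background offset of amplitude 0.1 has a frame energy of about -20 dB, so every frame, silent or not, would cross the assumed -30 dB threshold and be labelled as speech.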


speech recognition, voice activity detection, multimodal





Copyright information

© Higher Education Press and Springer-Verlag 2006

Authors and Affiliations

  1. Department of Electronic Engineering, Tsinghua University, Beijing, China
