Look at Who’s Talking: Voice Activity Detection by Automated Gesture Analysis

  • Marco Cristani
  • Anna Pesarin
  • Alessandro Vinciarelli
  • Marco Crocco
  • Vittorio Murino
Part of the Communications in Computer and Information Science book series (CCIS, volume 277)


This paper proposes an approach to Voice Activity Detection (VAD) based on the automatic measurement of gesturing. The main motivation is that gestures have been shown to be tightly correlated with speech and can therefore be considered reliable evidence that a person is talking. Using gestures rather than speech to perform VAD can be helpful in many situations (e.g., surveillance and monitoring in public spaces) where speech cannot be captured for technical, legal or ethical reasons. The results show that the proposed gesturing measurement approach achieves, on a frame-by-frame basis, an accuracy of 71 percent in distinguishing between speech and non-speech.
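As a reading aid for the frame-by-frame figure quoted in the abstract, the following is a minimal sketch of how frame-level accuracy is commonly computed: the fraction of frames whose predicted speech/non-speech label matches a reference label. The function name `frame_accuracy` and the toy label sequences are assumptions made for illustration; they are not taken from the paper and do not reproduce its data or its evaluation protocol.

```python
from typing import Sequence


def frame_accuracy(predicted: Sequence[bool], reference: Sequence[bool]) -> float:
    """Return the fraction of frames whose label matches the reference.

    True means "speech", False means "non-speech".
    """
    if len(predicted) != len(reference):
        raise ValueError("label sequences must have the same length")
    if len(reference) == 0:
        raise ValueError("at least one frame is required")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Toy labels invented for illustration only (not data from the paper).
reference = [True, True, False, False, True, False, False, True]
predicted = [True, False, False, False, True, True, False, True]

print(frame_accuracy(predicted, reference))  # 6 of 8 frames agree -> 0.75
```

Under this definition, an accuracy of 71 percent would mean that roughly 71 out of every 100 frames receive the correct speech/non-speech label.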


Audio Signal, Video Signal, Voice Activity Detection, Gesture Activity, Surveillance Scenario
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Marco Cristani (1, 2)
  • Anna Pesarin (1)
  • Alessandro Vinciarelli (3, 4)
  • Marco Crocco (2)
  • Vittorio Murino (1, 2)
  1. Dipartimento di Informatica, University of Verona, Italy
  2. Istituto Italiano di Tecnologia, Italy
  3. University of Glasgow, UK
  4. Idiap Research Institute, Switzerland
