Journal on Multimodal User Interfaces, Volume 7, Issue 1–2, pp 79–91

RAVEL: an annotated corpus for training robots with audiovisual abilities

  • Xavier Alameda-Pineda
  • Jordi Sanchez-Riera
  • Johannes Wienke
  • Vojtěch Franc
  • Jan Čech
  • Kaustubh Kulkarni
  • Antoine Deleforge
  • Radu Horaud
Original Paper

Abstract

We introduce RAVEL (Robots with Audiovisual Abilities), a publicly available data set covering examples of Human-Robot Interaction (HRI) scenarios. These scenarios were recorded using the audio-visual robot head POPEYE, equipped with two cameras and four microphones, two of which are plugged into the ears of a dummy head. All recordings were performed in a standard room with no special equipment, thus providing a challenging indoor scenario. This data set provides a basis to test and benchmark methods and algorithms for audio-visual scene analysis, with the ultimate goal of enabling robots to interact with people in the most natural way. The data acquisition setup, sensor calibration, data annotation and data content are fully detailed. Moreover, three examples of using the recorded data are provided, illustrating its suitability for carrying out a large variety of HRI experiments. The RAVEL data are publicly available at: http://ravel.humavips.eu/.

Keywords

Audio-visual data set · Binocular vision · Binaural hearing · Action/gesture recognition · Human-robot interaction · Audio-visual robotics


Copyright information

© OpenInterface Association 2012

Authors and Affiliations

  • Xavier Alameda-Pineda (1)
  • Jordi Sanchez-Riera (1)
  • Johannes Wienke (2)
  • Vojtěch Franc (3)
  • Jan Čech (1)
  • Kaustubh Kulkarni (1)
  • Antoine Deleforge (1)
  • Radu Horaud (1)

  1. INRIA Grenoble Rhône-Alpes, Montbonnot, France
  2. Universität Bielefeld, Bielefeld, Germany
  3. Czech Technical University, Prague, Czech Republic
