Autonomous Robots, Volume 32, Issue 2, pp 129–147

Two-handed gesture recognition and fusion with speech to command a robot

  • B. Burger
  • I. Ferrané
  • F. Lerasle
  • G. Infantes


Assistance is currently a pivotal research area in robotics, with huge societal potential. Since assistant robots interact directly with people, finding natural and easy-to-use interfaces is of fundamental importance. This paper describes a flexible multimodal interface, based on speech and gesture, for controlling our mobile robot Jido. The vision system uses a stereo head mounted on a pan-tilt unit and a bank of collaborative particle filters devoted to the extremities of the upper human body to track and recognize pointing and symbolic gestures, both mono-manual and bi-manual. This framework constitutes our first contribution: it is shown to handle natural artifacts properly (self-occlusion, hands leaving the camera's field of view, hand deformation) when 3D gestures are performed with either hand or with both. A speech recognition and understanding system based on the Julius engine is also developed and embedded in order to process deictic and anaphoric utterances. The second contribution is a probabilistic, multi-hypothesis interpreter framework that fuses the results from the speech and gesture components. This interpreter is shown to improve the classification rates of multimodal commands compared to using either modality alone. Finally, we report on successful live experiments in human-centered settings. Results are reported in the context of an interactive manipulation task, where users specify local motion commands to Jido and perform safe object exchanges.
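
The idea of fusing several speech hypotheses with several gesture hypotheses can be illustrated with a minimal Python sketch. It is not the interpreter described in the paper: the function name `fuse_hypotheses`, the compatibility table, the labels, and the probabilities are all hypothetical. It also assumes the two modalities are independent, so a joint score is simply the product of the two recognizer probabilities.

```python
from itertools import product


def fuse_hypotheses(speech_nbest, gesture_nbest, compatible):
    """Rank joint (speech, gesture) interpretations.

    speech_nbest, gesture_nbest: lists of (label, probability) pairs.
    compatible: function (speech_label, gesture_label) -> bool.

    Assumption (not from the paper): the modalities are independent,
    so the joint score is the product of the two probabilities.
    """
    scored = []
    for (s_label, s_prob), (g_label, g_prob) in product(speech_nbest, gesture_nbest):
        # Discard pairs that cannot form a meaningful command.
        if compatible(s_label, g_label):
            scored.append(((s_label, g_label), s_prob * g_prob))

    total = sum(score for _, score in scored)
    if total == 0:
        return []  # no compatible pair survived

    # Normalize so the retained joint hypotheses sum to 1, then sort best-first.
    ranked = [(pair, score / total) for pair, score in scored]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical recognizer outputs: speech alone slightly prefers "stop".
speech = [("stop", 0.55), ("go there", 0.45)]
gesture = [("pointing", 0.8), ("halt_sign", 0.2)]

# Hypothetical table of speech/gesture pairs that make sense together.
ALLOWED = {("go there", "pointing"), ("stop", "halt_sign")}

ranked = fuse_hypotheses(speech, gesture, lambda s, g: (s, g) in ALLOWED)
for pair, prob in ranked:
    print(pair, round(prob, 3))
```

In this toy case the speech recognizer alone ranks "stop" first, but the confident pointing gesture makes the deictic command "go there" the top joint hypothesis. This is the kind of correction that keeping multiple hypotheses per modality makes possible.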


Human-robot interaction; Multiple object tracking; Two-handed gesture recognition; Vision and speech probabilistic fusion


Supplementary material

(MP4 3.85 MB)



Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • B. Burger (1, 2), corresponding author
  • I. Ferrané (2, 3)
  • F. Lerasle (1, 3)
  • G. Infantes (4)

  1. CNRS, LAAS, Toulouse Cedex, France
  2. IRIT, Université de Toulouse, Toulouse Cedex, France
  3. Université de Toulouse; UPS, INSA, INP, ISAE; UT1, UTM, LAAS, Toulouse Cedex, France
  4. Onera, Toulouse Cedex 4, France
