
Two-handed gesture recognition and fusion with speech to command a robot


Abstract

Assistance is currently a pivotal research area in robotics, with huge societal potential. Since assistant robots directly interact with people, finding natural and easy-to-use user interfaces is of fundamental importance. This paper describes a flexible multimodal interface based on speech and gesture modalities for controlling our mobile robot, Jido. The vision system uses a stereo head mounted on a pan-tilt unit and a bank of collaborative particle filters devoted to the upper human body extremities to track and recognize pointing and symbolic gestures, whether mono- or bi-manual. This framework constitutes our first contribution: it is shown to handle the natural artifacts that arise when 3D gestures are performed with either hand or with both, such as self-occlusion, hands leaving the camera's field of view, and hand deformation. A speech recognition and understanding system based on the Julius engine is also developed and embedded on the robot to process deictic and anaphoric utterances. The second contribution is a probabilistic, multi-hypothesis interpreter framework that fuses the results of the speech and gesture components. This interpreter is shown to improve the classification rate of multimodal commands compared to using either modality alone. Finally, we report on successful live experiments in human-centered settings. Results are given in the context of an interactive manipulation task, where users specify local motion commands to Jido and perform safe object exchanges.
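
To make the fusion idea concrete, the sketch below shows one simple way of combining ranked speech and gesture hypotheses into a single command by weighting the confidence scores of the two modalities. It is a minimal illustration only: the Hypothesis class, the COMPATIBLE table, the fuse function, its weights, and the command labels are assumptions made for this example and are not the interpreter described in the paper.

# Illustrative sketch only: weighted-sum fusion of speech and gesture
# hypotheses. The class, table, weights and labels are assumptions for
# this example, not the authors' interpreter.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Hypothesis:
    label: str    # e.g. "take this" (speech) or "point" (gesture)
    score: float  # normalized confidence in [0, 1] from one modality

# Assumed compatibility table: which spoken commands a gesture class can
# complete (e.g. a deictic utterance needs a pointing gesture).
COMPATIBLE = {
    ("take this", "point"): "take <pointed object>",
    ("come here", "point"): "go to <pointed location>",
    ("stop", "halt"):       "stop",
}

def fuse(speech: List[Hypothesis], gesture: List[Hypothesis],
         w_speech: float = 0.6, threshold: float = 0.4) -> Optional[str]:
    """Return the best joint interpretation, or None if nothing is reliable."""
    best_cmd, best_score = None, threshold
    for s in speech:                        # multi-hypothesis speech output
        for g in gesture:                   # multi-hypothesis gesture output
            cmd = COMPATIBLE.get((s.label, g.label))
            if cmd is None:
                continue                    # modalities do not fit together
            score = w_speech * s.score + (1.0 - w_speech) * g.score
            if score > best_score:          # keep the highest-scoring pair
                best_cmd, best_score = cmd, score
    return best_cmd

# Example: an ambiguous utterance is disambiguated by a pointing gesture.
speech_hyps = [Hypothesis("take this", 0.7), Hypothesis("stop", 0.2)]
gesture_hyps = [Hypothesis("point", 0.8), Hypothesis("halt", 0.1)]
print(fuse(speech_hyps, gesture_hyps))  # -> take <pointed object>

Scoring joint (speech, gesture) pairs rather than trusting either modality alone is the intuition behind the improved classification rates the paper reports for multimodal commands.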



Author information


Corresponding author

Correspondence to B. Burger.

Electronic Supplementary Material

Below is the link to the electronic supplementary material.

(MP4 3.85 MB)


About this article

Cite this article

Burger, B., Ferrané, I., Lerasle, F. et al. Two-handed gesture recognition and fusion with speech to command a robot. Auton Robot 32, 129–147 (2012). https://doi.org/10.1007/s10514-011-9263-y

