
Two-handed gesture recognition and fusion with speech to command a robot


Abstract

Assistance is currently a pivotal research area in robotics, with huge societal potential. Since assistant robots directly interact with people, finding natural and easy-to-use user interfaces is of fundamental importance. This paper describes a flexible multimodal interface based on speech and gesture modalities for controlling our mobile robot, Jido. The vision system uses a stereo head mounted on a pan-tilt unit and a bank of collaborative particle filters devoted to the upper human body extremities to track and recognize pointing and symbolic gestures, whether mono- or bi-manual. This framework constitutes our first contribution: it is shown to handle the natural artifacts that arise when 3D gestures are performed with either hand or with both, such as self-occlusion, hands leaving the camera's field of view, and hand deformation. A speech recognition and understanding system based on the Julius engine is also developed and embedded on the robot to process deictic and anaphoric utterances. The second contribution is a probabilistic, multi-hypothesis interpreter framework that fuses the results of the speech and gesture components. This interpreter is shown to improve the classification rate of multimodal commands compared to using either modality alone. Finally, we report on successful live experiments in human-centered settings. Results are given in the context of an interactive manipulation task, where users specify local motion commands to Jido and perform safe object exchanges.
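
To make the fusion idea concrete, the sketch below shows one simple way of combining ranked speech and gesture hypotheses into a single command by weighting the confidence scores of the two modalities. It is a minimal illustration only: the Hypothesis class, the COMPATIBLE table, the fuse function, its weights, and the command labels are assumptions made for this example and are not the interpreter described in the paper.

# Illustrative sketch only: weighted-sum fusion of speech and gesture
# hypotheses. The class, table, weights and labels are assumptions for
# this example, not the authors' interpreter.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Hypothesis:
    label: str    # e.g. "take this" (speech) or "point" (gesture)
    score: float  # normalized confidence in [0, 1] from one modality

# Assumed compatibility table: which spoken commands a gesture class can
# complete (e.g. a deictic utterance needs a pointing gesture).
COMPATIBLE = {
    ("take this", "point"): "take <pointed object>",
    ("come here", "point"): "go to <pointed location>",
    ("stop", "halt"):       "stop",
}

def fuse(speech: List[Hypothesis], gesture: List[Hypothesis],
         w_speech: float = 0.6, threshold: float = 0.4) -> Optional[str]:
    """Return the best joint interpretation, or None if nothing is reliable."""
    best_cmd, best_score = None, threshold
    for s in speech:                        # multi-hypothesis speech output
        for g in gesture:                   # multi-hypothesis gesture output
            cmd = COMPATIBLE.get((s.label, g.label))
            if cmd is None:
                continue                    # modalities do not fit together
            score = w_speech * s.score + (1.0 - w_speech) * g.score
            if score > best_score:          # keep the highest-scoring pair
                best_cmd, best_score = cmd, score
    return best_cmd

# Example: an ambiguous utterance is disambiguated by a pointing gesture.
speech_hyps = [Hypothesis("take this", 0.7), Hypothesis("stop", 0.2)]
gesture_hyps = [Hypothesis("point", 0.8), Hypothesis("halt", 0.1)]
print(fuse(speech_hyps, gesture_hyps))  # -> take <pointed object>

Scoring joint (speech, gesture) pairs rather than trusting either modality alone is the intuition behind the improved classification rates the paper reports for multimodal commands.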



Author information


Corresponding author

Correspondence to B. Burger.

Electronic Supplementary Material

Below is the link to the electronic supplementary material.

(MP4 3.85 MB)


About this article

Cite this article

Burger, B., Ferrané, I., Lerasle, F. et al. Two-handed gesture recognition and fusion with speech to command a robot. Auton Robot 32, 129–147 (2012). https://doi.org/10.1007/s10514-011-9263-y

