Abstract
We present a new framework for multimodal gesture recognition that is based on a multiple hypotheses rescoring fusion scheme. We specifically deal with a demanding Kinect-based multimodal dataset, introduced in a recent gesture recognition challenge (CHALEARN 2013), in which multiple subjects freely perform multimodal gestures. We employ multiple modalities, namely visual cues, such as skeleton data, color and depth images, as well as audio, and we extract feature descriptors of the hands’ movement, handshape, and audio spectral properties. Using a common hidden Markov model framework, we build single-stream gesture models from which we generate multiple single-stream hypotheses for an unknown gesture sequence. By multimodally rescoring these hypotheses via constrained decoding and a weighted combination scheme, we end up with a multimodally selected best hypothesis. This is further refined by means of parallel fusion of the monomodal gesture models applied at a segmental level. In this setup, accurate gesture modeling proves to be critical and is facilitated by an activity detection system that is also presented. The overall approach achieves 93.3% gesture recognition accuracy on the CHALEARN Kinect-based multimodal dataset, significantly outperforming all recently published approaches on the same challenging multimodal gesture recognition task and providing a relative error rate reduction of at least 47.6%.
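To make the weighted rescoring step concrete, the following minimal Python sketch (not the system's actual implementation; all names and numbers are hypothetical) combines per-stream log-scores of candidate hypotheses, e.g. from an N-best list, with stream weights and returns the best-scoring hypothesis. In the full system the candidates come from single-stream HMM decoding and are additionally re-decoded under constraints before this weighted combination.

```python
def rescore_hypotheses(hypotheses, stream_scores, stream_weights):
    """Select the hypothesis with the highest weighted sum of per-stream log-scores.

    hypotheses:     list of candidate gesture label sequences (e.g. an N-best list)
    stream_scores:  dict mapping stream name -> list of log-scores, one per hypothesis
    stream_weights: dict mapping stream name -> weight in [0, 1]
    """
    best_hyp, best_score = None, float("-inf")
    for i, hyp in enumerate(hypotheses):
        score = sum(stream_weights[s] * stream_scores[s][i] for s in stream_weights)
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp, best_score


# Toy usage with made-up numbers: three streams, two candidate hypotheses.
hyps = [["GESTURE_A", "GESTURE_B"], ["GESTURE_A", "GESTURE_C"]]
scores = {
    "skeleton":  [-120.0, -118.5],
    "handshape": [-310.2, -315.0],
    "audio":     [-215.7, -240.3],
}
weights = {"skeleton": 0.3, "handshape": 0.2, "audio": 0.5}
print(rescore_hypotheses(hyps, scores, weights))
```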
Keywords
- Multimodal gesture recognition
- HMMs
- Speech recognition
- Multimodal fusion
- Activity detection
Editors: Isabelle Guyon, Vassilis Athitsos and Sergio Escalera
Notes
- 1.
For the case of video data an observation corresponds to a single image frame; for the audio modality it corresponds to a 25 ms window.
- 2.
That is, transformed to have zero mean and a standard deviation of one.
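A minimal NumPy sketch of this normalization, assuming the features are stacked row-wise, one observation per row:

```python
import numpy as np

def zscore(features, eps=1e-8):
    """Scale each feature dimension to zero mean and unit standard deviation."""
    features = np.asarray(features, dtype=float)
    return (features - features.mean(axis=0)) / (features.std(axis=0) + eps)
```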
- 3.
- 4.
These parameters are set after experimentation on a single video of the validation set that was annotated in terms of activity.
- 5.
Parameter ranges in the experiments for each modality are as follows. Audio: states 10–28, Gaussians 2–32; skeleton/handshape: states 7–15, Gaussians 2–10.
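As a rough illustration of such a parameter sweep (a sketch only: the chapter's models are built in an HTK-based HMM framework, whereas here the hmmlearn library stands in, and the data variables are hypothetical):

```python
from hmmlearn.hmm import GMMHMM  # stand-in for the HTK-based models used in the chapter

def sweep_hmm_params(train_X, train_lengths, dev_X, dev_lengths,
                     state_range, mixture_range):
    """Train one GMM-HMM per (states, mixtures) pair and keep the best on held-out data."""
    best = None
    for n_states in state_range:
        for n_mix in mixture_range:
            model = GMMHMM(n_components=n_states, n_mix=n_mix,
                           covariance_type="diag", n_iter=20)
            model.fit(train_X, train_lengths)
            log_lik = model.score(dev_X, dev_lengths)
            if best is None or log_lik > best[0]:
                best = (log_lik, n_states, n_mix, model)
    return best

# e.g. audio models: states 10-28, Gaussian mixtures 2-32, per the ranges above
# best = sweep_hmm_params(X_tr, len_tr, X_dev, len_dev, range(10, 29, 2), [2, 4, 8, 16, 32])
```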
- 6.
For the measurements we employed an AMD Opteron(tm) Processor 6386 at 2.80 GHz with 32 GB RAM.
- 7.
The weights take values in [0, 1] and sum to one across the modalities; these values are then scaled by 100 for the sake of numerical presentation. For the w stream weights we sampled [0, 1] with 12 samples for each modality, resulting in 1728 combinations. For the \(w'\) case, we sampled [0, 1] with 5, 5 and 21 samples for the gesture, handshape and speech modalities respectively, resulting in 525 combinations.
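A small sketch of how such a grid could be enumerated (hypothetical helper; with 12 points per modality the product grid has \(12^3 = 1728\) combinations, and with 5, 5 and 21 points it has 525):

```python
from itertools import product
import numpy as np

def weight_grid(samples_per_stream):
    """Enumerate candidate stream-weight combinations on a per-modality grid over [0, 1]."""
    names = list(samples_per_stream)
    grids = [np.linspace(0.0, 1.0, samples_per_stream[n]) for n in names]
    for combo in product(*grids):
        yield dict(zip(names, combo))

combos = list(weight_grid({"gesture": 12, "handshape": 12, "audio": 12}))
assert len(combos) == 12 ** 3  # 1728 combinations
```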
- 8.
Note that the Levenshtein distance takes values in [0, 1] and is equivalent to the word error rate.
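For reference, a short word-level edit-distance sketch; it is normalized here by the reference length in word-error-rate fashion (the challenge's exact normalization may differ):

```python
def levenshtein(ref, hyp):
    """Word-level edit distance between a reference and a hypothesized label sequence."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1]

def normalized_ld(ref, hyp):
    """Levenshtein distance normalized by the reference length (word-error-rate style)."""
    return levenshtein(ref, hyp) / max(len(ref), 1)
```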
- 9.
D1-3 notation refers to D1, D2 and D3 cases.
- 10.
All relative percentages, unless stated otherwise, refer to relative LD reduction (LDR). LDR is equivalent to the standard relative word error rate reduction.
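As a purely illustrative example (not the paper's numbers):

```python
# Hypothetical LD values for illustration only
ld_baseline, ld_proposed = 0.20, 0.12
ldr = (ld_baseline - ld_proposed) / ld_baseline  # = 0.40, i.e. a 40% relative LD reduction
```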
- 11.
Statistical significance tests are computed on the raw recognition values and not on the relative improvement scores.
- 12.
References
U. Agris, J. Zieren, U. Canzler, B. Bauer, K.-F. Kraiss, Recent developments in visual sign language recognition. Univers. Access Inf. Soc. 6, 323–362 (2008)
J. Alon, V. Athitsos, Q. Yuan, S. Sclaroff, A unified framework for gesture recognition and spatiotemporal gesture segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 31(9), 1685–1699 (2009)
A. Argyros, M. Lourakis, Real time tracking of multiple skin-colored objects with a possibly moving camera, in Proceedings of the European Conference on Computer Vision, 2004
B. Bauer, K.F. Kraiss, Towards an automatic sign language recognition system using subunits, in Proceedings of the International Gesture Workshop, vol. 2298, 2001, pp. 64–75
I. Bayer, S. Thierry, A multi modal approach to gesture recognition from audio and video data, in Proceedings of the 15th ACM International Conference on Multimodal Interaction (ACM, 2013), pp. 461–466
P. Bernardis, M. Gentilucci, Speech and gesture share the same communication system. Neuropsychologia 44(2), 178–190 (2006)
N.D. Binh, E. Shuichi, T. Ejima, Real-time hand tracking and gesture recognition system, in Proceedings of International Conference on Graphics, Vision and Image Processing (GVIP), 2005, pp. 19–21
A.F. Bobick, J.W. Davis, The recognition of human movement using temporal templates. IEEE Trans. Pattern Anal. Mach. Intell. 23(3), 257–267 (2001)
R.A. Bolt, “Put-that-there”: voice and gesture at the graphics interface, in Proceedings of the 7th Annual Conference on Computer Graphics and Interactive Techniques, vol. 14 (ACM, 1980)
H. Bourlard, S. Dupont, Subband-based speech recognition, in Proceedings of the International Conference on Acoustics, Speech and Signal Processings, vol. 2 (IEEE, Piscataway, 1997), pp. 1251–1254
K. Bousmalis, L. Morency, M. Pantic, Modeling hidden dynamics of multimodal cues for spontaneous agreement and disagreement recognition, in Proceedings of the International Conference on Automatic Face and Gesture Recognition (IEEE, Piscataway, 2011), pp. 746–752
P. Buehler, M. Everingham, A. Zisserman, Learning sign language by watching TV (using weakly aligned subtitles), in Proceedings of the International Conference on Computer Vision and Pattern Recognition, 2009
S. Celebi, A.S. Aydin, T.T. Temiz, T. Arici, Gesture recognition using skeleton data with weighted dynamic time warping. Comput. Vis. Theory Appl. 1, 620–625 (2013)
F.-S. Chen, C.-M. Fu, C.-L. Huang, Hand gesture recognition using a real-time tracking method and hidden Markov models. Image Vis. Comput. 21(8), 745–758 (2003)
X. Chen, M. Koskela, Online rgb-d gesture recognition with extreme learning machines, in Proceedings of the 15th ACM International Conference on Multimodal Interaction (ACM, 2013), pp. 467–474
Y. L. Chow, R. Schwartz, The n-best algorithm: An efficient procedure for finding top n sentence hypotheses, in Proceedings of the Workshop on Speech and Natural Language (Association for Computational Linguistics, 1989), pp. 199–202
S. Conseil, S. Bourennane, L. Martin, Comparison of Fourier descriptors and Hu moments for hand posture recognition, in Proceedings of the European Conference on Signal Processing, 2007
Y. Cui, J. Weng, Appearance-based hand sign recognition from intensity image sequences. Comput. Vis. Image Underst. 78(2), 157–176 (2000)
N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in Proceedings of the International Conference on Computer Vision and Pattern Recognition, 2005
W. Du, J. Piater, Hand modeling and tracking for video-based sign language recognition by robust principal component analysis, in Proceedings of the ECCV Workshop on Sign, Gesture and Activity, September 2010
S. Escalera, J. Gonzàlez, X. Baró, M. Reyes, I. Guyon, V. Athitsos, H. Escalante, L. Sigal, A. Argyros, C. Sminchisescu, R. Bowden, S. Sclaroff, Chalearn multi-modal gesture recognition 2013: grand challenge and workshop summary, in Proceedings of the 15th ACM on International Conference on Multimodal Interaction (ACM, 2013a), pp. 365–368
S. Escalera, J. Gonzàlez, X. Baró, M. Reyes, O. Lopes, I. Guyon, V. Athitsos, H.J. Escalante, Multi-modal gesture recognition challenge 2013: dataset and results, in 15th ACM International Conference on Multimodal Interaction (ICMI), ChaLearn Challenge and Workshop on Multi-modal Gesture Recognition (ACM, 2013b)
J. Foote, An overview of audio information retrieval. Multimedia Syst. 7(1):2–10 (1999), http://link.springer.com/article/10.1007/s005300050106
L. Gillick, S.J. Cox, Some statistical issues in the comparison of speech recognition algorithms, in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, vol. 1, May 1989, pp. 532–535
H. Glotin, D. Vergyr, C. Neti, G. Potamianos, J. Luettin, Weighting schemes for audio-visual fusion in speech recognition, in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, vol. 1 (IEEE, Piscataway, 2001), pp. 173–176
B. Habets, S. Kita, Z. Shao, A. Özyurek, P. Hagoort, The role of synchrony and ambiguity in speech-gesture integration during comprehension. J. Cogn. Neurosci. 23(8), 1845–1854 (2011)
J. Han, G. Awad, A. Sutherland, Modelling and segmenting subunits for sign language recognition based on hand motion analysis. Pattern Recognit. Lett. 30, 623–633 (2009)
H. Hermansky, Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. 87(4), 1738–1752 (1990)
A. Hernández-Vela, M.Á. Bautista, X. Perez-Sala, V. Ponce-López, S. Escalera, X. Baró, O. Pujol, C. Angulo, Probability-based dynamic time warping and bag-of-visual-and-depth-words for human gesture recognition in rgb-d. Pattern Recognit. Lett. (2013)
C.-L. Huang, S.-H. Jeng, A model-based hand gesture recognition system. Mach. Vis. Appl. 12(5), 243–258 (2001)
M. Isard, A. Blake, Condensation-conditional density propagation for visual tracking. Int. J. Comput. Vis. 29(1), 5–28 (1998)
J.M. Iverson, S. Goldin-Meadow, Why people gesture when they speak. Nature 396(6708), 228 (1998)
A. Jaimes, N. Sebe, Multimodal human-computer interaction: a survey. Comput. Vis. Image Underst. 108(1), 116–134 (2007)
S.D. Kelly, A. Özyürek, E. Maris, Two sides of the same coin speech and gesture mutually interact to enhance comprehension. Psychol. Sci. 21(2), 260–267 (2010)
A. Kendon, Gesture: Visible Action as Utterance (Cambridge University Press, New York, 2004)
W. Kong, S. Ranganath, Sign language phoneme transcription with rule-based hand trajectory segmentation. J. Signal Process. Syst. 59, 211–222 (2010)
I. Laptev, M. Marszalek, C. Schmid, B. Rozenfeld, Learning realistic human actions from movies, in Proceedings of the International Conference on Computer Vision and Pattern Recognition (IEEE, Piscataway, 2008), pp. 1–8
H.-K. Lee, J.-H. Kim, An HMM-based threshold model approach for gesture recognition. IEEE Trans. Pattern Anal. Mach. Intell. 21(10), 961–973 (1999)
J. Li, N.M. Allinson, Simultaneous gesture segmentation and recognition based on forward spotting accumulative hmms. Pattern Recognit. 40(11), 3012–3026 (2007)
J. Li, N.M. Allinson, A comprehensive review of current local features for computer vision. Neurocomputing 71(10), 1771–1787 (2008)
D. G. Lowe, Object recognition from local scale-invariant features, in Proceedings of the International Conference on Computer Vision, 1999, pp. 1150–1157
P. Maragos, P. Gros, A. Katsamanis, G. Papandreou, Cross-modal integration for performance improving in multimedia: a review, in Multimodal Processing and Interaction: Audio, Video, Text ed. by P. Maragos, A. Potamianos, and P. Gros, chapter 1 (Springer, New York, 2008), pp. 3–48
D. McNeill, Hand and Mind: What Gestures Reveal About Thought (University of Chicago Press, Chicago, 1992)
M. Miki, N. Kitaoka, C. Miyajima, T. Nishino, K. Takeda, Improvement of multimodal gesture and speech recognition performance using time intervals between gestures and accompanying speech. EURASIP J. Audio Speech Music Process. 2014(1), 17 (2014). doi:10.1186/1687-4722-2014-2
D. Morris, P. Collett, P. Marsh, M. O’Shaughnessy, Gestures: Their Origins and Distribution (Stein and Day, New York, 1979)
Y. Nam, K. Wohn, Recognition of space-time hand-gestures using hidden Markov model, in ACM Symposium on Virtual Reality Software and Technology, 1996, pp. 51–58
K. Nandakumar, K.W. Wan, S. Chan, W. Ng, J.G. Wang, W.Y. Yau, A multi-modal gesture recognition system using audio, video, and skeletal joint data, in Proceedings of the 15th ACM International Conference on Multimodal Interaction (ACM, 2013), pp. 475–482
N. Neverova, C. Wolf, G. Paci, G. Sommavilla, G. Taylor, F. Nebout, A multi-scale approach to gesture detection and recognition, in Proceedings of the IEEE International Conference on Computer Vision Workshop, 2013, pp. 484–491
E.-J. Ong, R. Bowden, A boosted classifier tree for hand shape detection, in Proceedings of the International Conference on Automation Face Gest Recognition (IEEE, Piscataway, 2004), pp. 889–894
M. Ostendorf, A. Kannan, S. Austin, O. Kimball, R. M. Schwartz, J. R. Rohlicek, Integration of diverse recognition methodologies through reevaluation of N-best sentence hypotheses, in HLT, 1991
S. Oviatt, P. Cohen, Perceptual user interfaces: multimodal interfaces that process what comes naturally. Commun. ACM 43(3), 45–53 (2000)
G. Papandreou, A. Katsamanis, V. Pitsikalis, P. Maragos, Adaptive multimodal fusion by uncertainty compensation with application to audiovisual speech recognition. IEEE Trans. Audio Speech Lang. Process. 17(3), 423–435 (2009)
V. Pitsikalis, S. Theodorakis, C. Vogler, P. Maragos, Advances in phonetics-based sub-unit modeling for transcription alignment and sign language recognition, in IEEE CVPR Workshop on Gesture Recognition, 2011
I. Poddar, Y. Sethi, E. Ozyildiz, R. Sharma, Toward natural gesture/speech HCI: A case study of weather narration, in Proceedings of the Workshop on Perceptual User Interfaces, 1998
V. Ponce-López, S. Escalera, X. Baró, Multi-modal social signal analysis for predicting agreement in conversation settings, in Proceedings of the 15th ACM International Conference on Multimodal Interaction (ACM, 2013), pp. 495–502
G. Potamianos, C. Neti, J. Luettin, I. Matthews, Audio-visual automatic speech recognition: an overview. Issues Vis. Audio Vis Speech Process. 22, 23 (2004)
L.R. Rabiner, B.H. Juang, Fundamentals of Speech Recognition (Prentice Hall, Upper Saddle River, 1993)
Z. Ren, J. Yuan, Z. Zhang, Robust hand gesture recognition based on finger-earth mover’s distance with a commodity depth camera, in Proceedings of the 19th ACM International Conference on Multimedia (ACM, 2011), pp. 1093–1096
R. C. Rose, Discriminant wordspotting techniques for rejecting non-vocabulary utterances in unconstrained speech, in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, vol. 2 (IEEE, Piscataway, 1992), pp. 105–108, http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=226109
R. C. Rose, D. B. Paul, A hidden Markov model based keyword recognition system, in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 1990, pp. 129–132, http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=115555
A. Roussos, S. Theodorakis, V. Pitsikalis, P. Maragos, Dynamic affine-invariant shape-appearance handshape features and classification in sign language videos. J. Mach. Learn. Res. 14(1), 1627–1663 (2013)
S. Ruffieux, D. Lalanne, E. Mugellini, ChAirGest: a challenge for multimodal mid-air gesture recognition for close HCI, in Proceedings of the 15th ACM International Conference on Multimodal Interaction, ICMI ’13 (ACM, New York, NY, USA, 2013), pp. 483–488
S. Ruffieux, D. Lalanne, E. Mugellini, O. A. Khaled, A survey of datasets for human gesture recognition, in Human-Computer Interaction. Advanced Interaction Modalities and Techniques (Springer, 2014), pp. 337–348
R. Sharma, M. Yeasin, N. Krahnstoever, I. Rauschert, G. Cai, I. Brewer, A.M. MacEachren, K. Sengupta, Speech-gesture driven multimodal interfaces for crisis management. Proc. IEEE 91(9), 1327–1354 (2003)
S. Shimojo, L. Shams, Sensory modalities are not separate modalities: plasticity and interactions. Curr. Opin. Neurobiol. 11(4), 505–509 (2001)
J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, R. Moore, Real-time human pose recognition in parts from single depth images. Commun. ACM 56(1), 116–124 (2013)
R. Schwartz, S. Austin, A comparison of several approximate algorithms for finding multiple N-best sentence hypotheses, in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 1991
Y. C. Song, H. Kautz, J. Allen, M. Swift, Y. Li, J. Luo, C. Zhang, A markov logic framework for recognizing complex events from multimodal data, in Proceedings of the 15th ACM International Conference on Multimodal Interaction (ACM, 2013), pp. 141–148
T. Starner, J. Weaver, A. Pentland, Real-time American Sign Language recognition using desk and wearable computer based video. IEEE Trans. Pattern Anal. Mach. Intell. 20(12), 1371–1375 (1998)
L. N. Tan, B. J. Borgstrom, A. Alwan, Voice activity detection using harmonic frequency components in likelihood ratio test, in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (IEEE, Piscataway, 2010), pp. 4466–4469
N. Tanibata, N. Shimada, Y. Shirai, Extraction of hand features for recognition of sign language words, in Proceedings of the International Conference on Vision, Interface, 2002, pp. 391–398
S. Theodorakis, V. Pitsikalis, P. Maragos, Dynamic-static unsupervised sequentiality, statistical subunits and lexicon for sign language recognition. Image Vis. Comput. 32(8), 533–549 (2014)
M. Turk, Multimodal interaction: a review. Pattern. Recognit. Lett. 36, 189–195 (2014)
C. Vogler, D. Metaxas, A framework for recognizing the simultaneous aspects of American Sign Language. Comput. Vis. Image Underst. 81, 358 (2001)
S. B. Wang, A. Quattoni, L. Morency, D. Demirdjian, T. Darrell, Hidden conditional random fields for gesture recognition, in Proceedings of the International Conference on Computer Vision and Pattern Recognition, vol. 2 (IEEE, Piscataway, 2006), pp. 1521–1527
D. Weimer, S. Ganapathy, A synthetic visual environment with hand gesturing and voice input, in ACM SIGCHI Bulletin, vol. 20 (ACM, 1989), pp. 235–240
L.D. Wilcox, M. Bush, Training and search algorithms for an interactive wordspotting system, in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, vol. 2 (IEEE, Piscataway, 1992), pp. 97–100
J. Wilpon, L.R. Rabiner, C.-H. Lee, E.R. Goldman, Automatic recognition of keywords in unconstrained speech using hidden Markov models. IEEE Trans. Acoustics Speech Signal Process. 38(11), 1870–1878 (1990)
A. Wilson, A. Bobick, Parametric hidden Markov models for gesture recognition. IEEE Trans. Pattern Anal. Mach. Intell. 21, 884–900 (1999)
J. Wu, J. Cheng, C. Zhao, H. Lu, Fusing multi-modal features for gesture recognition, in Proceedings of the 15th ACM International Conference on Multimodal Interaction (ACM, 2013), pp. 453–460
M.-H. Yang, N. Ahuja, M. Tabb, Extraction of 2d motion trajectories and its application to hand gesture recognition. IEEE Trans. Pattern Anal. Mach. Intell. 24(8), 1061–1074 (2002)
S. Young, G. Evermann, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, P. Woodland, The HTK Book (Entropic Cambridge Research Laboratory, Cambridge, 2002)
Acknowledgements
This research work was supported by the European Union under the project “MOBOT” with grant FP7-ICT-2011-9 2.1 - 600796. The authors gratefully thank G. Pavlakos for his contribution in earlier stages of this work. This work was done while V. Pitsikalis and S. Theodorakis were with the National Technical University of Athens; they are now with deeplab.ai, Athens, Greece.
Copyright information
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Pitsikalis, V., Katsamanis, A., Theodorakis, S., Maragos, P. (2017). Multimodal Gesture Recognition via Multiple Hypotheses Rescoring. In: Escalera, S., Guyon, I., Athitsos, V. (eds) Gesture Recognition. The Springer Series on Challenges in Machine Learning. Springer, Cham. https://doi.org/10.1007/978-3-319-57021-1_16
DOI: https://doi.org/10.1007/978-3-319-57021-1_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-57020-4
Online ISBN: 978-3-319-57021-1