Abstract
We present a study of multi-modal freehand gesture recognition that relies on three sensory modalities: RGB images, depth data, and acceleration data from an IMD attached to the hand. Using a new self-recorded dataset, we first establish that a deep Long Short-Term Memory (LSTM) network can correctly classify the individual data streams from each modality. Notably, classifying the IMD stream alone already yields very good results. We then investigate two different strategies for multi-modal fusion, since the literature does not agree on which strategy is preferable. Combining the modalities improves recognition performance. Most importantly, fusion considerably improves ahead-of-time classification, i.e., gesture class estimates made before a sequence is complete, for classes that are difficult to classify on their own.
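The abstract does not name the two fusion strategies. As a minimal illustration only, the sketch below assumes they resemble the common early (feature-level) and late (decision-level) variants, and shows how per-time-step class estimates allow a decision before a sequence ends. All function names, modality keys, and numbers are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

Vector = List[float]      # one feature or probability vector
Stream = List[Vector]     # one vector per time step


def early_fusion(features: Dict[str, Stream]) -> Stream:
    """Assumed early fusion: concatenate the modalities' feature
    vectors at each time step before they reach a classifier."""
    names = sorted(features)
    steps = min(len(features[n]) for n in names)
    return [sum((features[n][t] for n in names), []) for t in range(steps)]


def late_fusion(outputs: Dict[str, Stream]) -> Stream:
    """Assumed late fusion: average the class probabilities that
    separate per-modality classifiers emit at each time step."""
    names = sorted(outputs)
    steps = min(len(outputs[n]) for n in names)
    fused = []
    for t in range(steps):
        rows = [outputs[n][t] for n in names]
        fused.append([sum(col) / len(rows) for col in zip(*rows)])
    return fused


def ahead_of_time_labels(stream: Stream) -> List[int]:
    """Class estimate (argmax) after every time step, i.e. also
    before the sequence is complete."""
    return [max(range(len(p)), key=p.__getitem__) for p in stream]


# Made-up classifier outputs: 2 classes, 3 time steps, true class = 1.
rgb = [[0.6, 0.4], [0.55, 0.45], [0.3, 0.7]]
imd = [[0.2, 0.8], [0.1, 0.9], [0.1, 0.9]]

print(ahead_of_time_labels(rgb))                                   # [0, 0, 1]
print(ahead_of_time_labels(late_fusion({"rgb": rgb, "imd": imd})))  # [1, 1, 1]
```

In this toy example the RGB stream alone reaches the correct class only at the last step, whereas the averaged output is correct from the first step on, which is the kind of effect the abstract describes for ahead-of-time classification.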
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Schak, M., Gepperth, A. (2020). On Multi-modal Fusion for Freehand Gesture Recognition. In: Farkaš, I., Masulli, P., Wermter, S. (eds) Artificial Neural Networks and Machine Learning – ICANN 2020. Lecture Notes in Computer Science, vol 12396. Springer, Cham. https://doi.org/10.1007/978-3-030-61609-0_68
DOI: https://doi.org/10.1007/978-3-030-61609-0_68
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61608-3
Online ISBN: 978-3-030-61609-0
eBook Packages: Computer Science (R0)