
A framework for the fusion of visual and tactile modalities for improving robot perception




Robots should ideally perceive objects using human-like multi-modal sensing such as vision, tactile feedback, smell, and hearing. However, the feature representation differs for each sensing modality, and so do the feature extraction methods. Some modal features, such as visual features, which capture spatial properties, are static, whereas others, such as tactile features, which capture temporal patterns, are dynamic. It is therefore difficult to fuse these data at the feature level for robot perception. In this study, we propose a framework for the fusion of visual and tactile modal features. The framework comprises feature extraction, feature vector normalization and generation based on bag-of-systems (BoS), coding by robust multi-modal joint sparse representation (RM-JSR), and classification, thereby addressing the problem of fusing diverse modal data at the feature level for robot perception. Finally, comparative experiments are carried out to demonstrate the performance of this framework.
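As a rough illustration of the normalization step, the sketch below shows how a codebook histogram turns descriptor sets of different sizes into vectors of one fixed length. It is a simplification under stated assumptions: it uses a plain bag-of-words histogram with Euclidean nearest-codeword assignment rather than the bag-of-systems model named in the abstract, and the function name `bag_of_words_histogram`, the random codebook, and all array sizes are hypothetical.

```python
import numpy as np

def bag_of_words_histogram(descriptors, codebook):
    """Map an (n, d) descriptor set to a fixed-length, L1-normalized histogram.

    Assumption: codewords are compared with squared Euclidean distance.
    A bag-of-systems model would instead compare dynamical-system models.
    """
    # Pairwise squared distances between descriptors and codewords: shape (n, k)
    diffs = descriptors[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Index of the closest codeword for every descriptor
    nearest = np.argmin(dists, axis=1)
    # Count assignments per codeword, then normalize so the entries sum to 1
    hist = np.bincount(nearest, minlength=codebook.shape[0]).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))       # hypothetical codebook: 8 codewords, 4-D
short_sample = rng.normal(size=(15, 4))  # a sample with 15 local descriptors
long_sample = rng.normal(size=(60, 4))   # a sample with 60 local descriptors

h_short = bag_of_words_histogram(short_sample, codebook)
h_long = bag_of_words_histogram(long_sample, codebook)
print(h_short.shape, h_long.shape)
```

Both samples end up as vectors with one entry per codeword, regardless of how many descriptors they started with, which is what makes feature-level fusion across modalities feasible in the first place.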


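We propose a visual-tactile information fusion framework and a robust multi-modal joint sparse representation coding method. Together they address the difficulty of feature-level fusion that arises because the visual (static) and tactile (dynamic) cross-modal information used in robot perception has feature spaces of different dimensions. Specifically, the framework comprises visual and tactile feature extraction, normalization of feature vectors of different dimensions with a "bag-of-words" algorithm, robust multi-modal joint sparse representation coding, and classification via the visual-tactile fusion algorithm.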
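The coding and classification steps can be sketched in the same spirit. The code below is a generic joint sparse representation classifier, not the RM-JSR method itself, whose formulation is not given on this page. It assumes one dictionary per modality whose columns are aligned training samples, couples the modalities with a row-sparse (l2,1) penalty solved by proximal gradient descent, and assigns the class with the smallest summed reconstruction residual. All names (`joint_sparse_codes`, `classify`), dimensions, and the synthetic data are hypothetical.

```python
import numpy as np

def joint_sparse_codes(dicts, signals, lam=0.05, n_iter=500):
    """Minimize 0.5 * sum_m ||y_m - D_m x_m||^2 + lam * sum_k ||X[k, :]||_2."""
    n_atoms = dicts[0].shape[1]
    X = np.zeros((n_atoms, len(dicts)))  # column m holds the code for modality m
    # Step size from the largest squared spectral norm among the dictionaries
    step = 1.0 / max(np.linalg.norm(D, 2) ** 2 for D in dicts)
    for _ in range(n_iter):
        grad = np.column_stack([
            D.T @ (D @ X[:, m] - y)
            for m, (D, y) in enumerate(zip(dicts, signals))
        ])
        Z = X - step * grad
        # Row-wise soft thresholding: modalities share one set of active atoms
        row_norms = np.linalg.norm(Z, axis=1, keepdims=True)
        X = np.maximum(0.0, 1.0 - step * lam / np.maximum(row_norms, 1e-12)) * Z
    return X

def classify(dicts, signals, atom_labels, lam=0.05):
    """Return the class whose atoms give the smallest summed residual."""
    X = joint_sparse_codes(dicts, signals, lam)
    classes = np.unique(atom_labels)
    residuals = []
    for c in classes:
        mask = atom_labels == c
        residuals.append(sum(
            np.sum((y - D[:, mask] @ X[mask, m]) ** 2)
            for m, (D, y) in enumerate(zip(dicts, signals))
        ))
    return classes[int(np.argmin(residuals))]

rng = np.random.default_rng(1)
atom_labels = np.repeat([0, 1], 4)  # 8 training samples, 4 per class
# Hypothetical dictionaries: 16-D "visual" and 24-D "tactile" feature vectors
dicts = [rng.normal(size=(16, 8)), rng.normal(size=(24, 8))]
dicts = [D / np.linalg.norm(D, axis=0) for D in dicts]  # unit-norm atoms
# Test sample: a slightly noisy copy of training sample 5, which has class 1
signals = [D[:, 5] + 0.01 * rng.normal(size=D.shape[0]) for D in dicts]

predicted = classify(dicts, signals, atom_labels)
print(predicted)
```

The row-wise penalty is the design choice that makes this a fusion method rather than two independent classifiers: an atom is either used by both modalities or by neither, so evidence from vision and touch is pooled before the class decision.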





Author information

Correspondence to Fuchun Sun.


About this article


Cite this article

Zhang, W., Sun, F., Wu, H. et al. A framework for the fusion of visual and tactile modalities for improving robot perception. Sci. China Inf. Sci. 60, 012201 (2017). https://doi.org/10.1007/s11432-016-0158-2



  • multi-modal fusion
  • robot perception
  • vision
  • tactile
  • classification

