Multimedia Tools and Applications

, Volume 76, Issue 19, pp 20125–20148 | Cite as

Evaluation of regularized multi-task leaning algorithms for single/multi-view human action recognition

  • Z. GaoEmail author
  • S. H. Li
  • G. T. Zhang
  • Y. J. Zhu
  • C. Wang
  • H. ZhangEmail author


Regularized multi-task learning (MTL) algorithms have been exploited in the field of pattern recognition and computer vision gradually, which can fully excavate the relationships of different related tasks. Therefore, many dramatically favorable approaches based on regularized MTL have been proposed. In the past decades, although the promising results about human action recognition have been achieved, most of existing action recognition algorithms focus on action descriptors, single/multi-view and multi-modality action recognition, and few works are related with MTL, especial of lacking the systematic evaluation of existing MTL algorithms for human action recognition. Thus, in the paper, seven popular regularized MTL algorithms in which different actions are considered as different tasks, are systematically exploited on two public multi-view action datasets. In detail, dense trajectory features are firstly extracted for each view, and then the shared codebook are constructed for all views by k-means, and then each video is coded by the shared codebook. Moreover, according to different regularized MTL algorithms, all actions or part of actions are considered as related, and then these actions are set to different tasks in MTL. Further, the effectiveness of different number of training samples from different action views is also evaluated for MTL. Large scale experimental results show that: 1) Regularized MTL is very useful for action recognition which can dig the latent relationship among different actions; 2) Not of all human actions are related, if irrelative actions are put together in MTL, its performance will fall; 3) With the increase of the training samples from different views, the relationships about different actions can be fully exploited, and it promotes the accuracy improvement of action recognition.


Action recognition Dense trajectory Regularized multi-task learning MTL Multi-view 



This work was supported in part by the National Natural Science Foundation of China (No.61572357, No.61202168). Tianjin Research Program of Application Foundation and Advanced Technology (14JCZDJC31700 and 13JCQNJC0040). Tianjin Municipal Natural Science Foundation (No.13JCQNJC0040). Country China Scholarship Council (No.201608120021).


  1. 1.
    Caruana R (1997) Multitask learning. Mach Learn 28(1):41–75MathSciNetCrossRefGoogle Scholar
  2. 2.
    Dollar P, Rabaud V, Cottrell G, Belongie S (2005) Behavior recognition via sparse spatio-temporal features. in: VS-PETSGoogle Scholar
  3. 3.
    Doumanoglou A, Kim T-K, Zhao X, Malassiotis S (2014) Active random forests: an application to autonomous unfolding of clothes. In Proceedings of the European Conference on Computer Vision (ECCV)Google Scholar
  4. 4.
    Everts I, van Gemert J, Gevers T (2014) Evaluation of color spatio-temporal interest points for human action recognition, IEEE trans. Image Process 23(4):1569–1580MathSciNetCrossRefGoogle Scholar
  5. 5.
    Evgeniou T, Pontil M (2004) Regularized multi–task learning. in: KDDGoogle Scholar
  6. 6.
    Gao Z, Song JM, Zhang H, Liu AA, Xu GP, Xue YB (2013) Human action recognition via multi-modality information. J Elect Eng Technol 8(2):742–751Google Scholar
  7. 7.
    Gao Y, Wang M, Ji R, Wu X, Dai Q (2014a) 3D object retrieval with Hausdorff distance learning. IEEE Trans Ind Electron 61(4):2088–2098CrossRefGoogle Scholar
  8. 8.
    Gao Z, Zhang H, Liu AA, Xue YB, Xu GP (2014b) Human action recognition using pyramid histograms of oriented gradients and collaborative multi-task learning. KSII Trans Int Inf Syst 8(2):483–503Google Scholar
  9. 9.
    Gao Z, Zhang LF, Chen MY et al (2014c) Enhanced and hierarchical structure algorithm for data imbalance problem in semantic extraction under massive video dataset. Multimed Tools Appl 68(3):641–657CrossRefGoogle Scholar
  10. 10.
    Gao Z, Zhang H, Xu GP, Xue YB, Hauptmann AG (2015a) Multi-view discriminative and structured dictionary learning with group sparsity for human action recognition. Signal Process 112:83–97CrossRefGoogle Scholar
  11. 11.
    Z. Gao, H. Zhang, G.P Xu, Y.B Xue (2015b) Multi-perspective and multi-modality joint representation and recognition model for 3D action recognition, Neurocomputing, 151, Part 2, Pages 554–564.Google Scholar
  12. 12.
    Gao Z, Zhang H, Liu AA, Xu GP, Xue YB (2016a) Human action recognition on depth dataset. Neural Comput & Applic 27(7):2047–2054CrossRefGoogle Scholar
  13. 13.
    Gao Z, Zhang Y, Zhang H, Xue YB, Xu GP (2016b) Multi-dimensional human action recognition model based on image set and group sparisty. Neurocomputing 215:138–149. doi: 10.1016/j.neucom.2016.01.113
  14. 14.
    Gao Z, Nie WZ, Liu AA, Zhang H (2016c) Evaluation of local spatial–temporal features for cross-view action recognition. Neurocomputing, 173. Part 1:110–117Google Scholar
  15. 15.
    Gao Z, Wang D, Zhang H, Xue Y, Xu G (2016d) A fast 3D retrieval algorithm via class-statistic and pair-constraint model. Proceedings of the 2016 ACM on Multimedia Conference, 117–121Google Scholar
  16. 16.
    Ge L, Ju R, Ren T, Wu G (2015) Interactive RGB-D image segmentation using hierarchical graph cut and geodesic distance. Proceedings of Pacific Rim Conference on Multimedia (PCM'15), Gwangju, Korea, 114–124Google Scholar
  17. 17.
    Gorelick L, Blank M, Shechtman E, Irani M, Basri R (2007) Actions as space time shapes. IEEE Trans Pattern Anal Mach Intell:2247–2253Google Scholar
  18. 18.
    Guo Y (2013) Convex subspace representation learning from multi-view data. In AAAI:387–393Google Scholar
  19. 19.
    Guo W, Chen G (2015) Human action recognition via multi-task learning base on spatial–temporal feature. Inf Sci 320(1):418–428MathSciNetCrossRefGoogle Scholar
  20. 20.
    Guo J, Ren T, Bei J (2016) Salient object detection for RGB-D image via saliency evolution. Proceedings of IEEE International Conference on Multimedia and Expo (ICME'16), Seattle, USAGoogle Scholar
  21. 21.
    Hao T, Peng W, Wang Q, Wang B, Sun J-S (2016) Reconstruction and application of protein–protein interaction network. Int J Mol Sci 17:907CrossRefGoogle Scholar
  22. 22.
    Hu R, Xu H, Rohrbach M, Feng J, Saenko K, Darrell T (2015) Natural language object retrieval. arXiv preprint arXiv:1511.04164Google Scholar
  23. 23.
    Klaser A, Marszalek M, Schmid C (2008) A spatio-temporal descriptor based on 3d gradients. Proceedings of European Conference on Computer Vision 275:1–10Google Scholar
  24. 24.
    Konecny J, Hagara M (2013) One-shot-learning gesture recognition using HOG-HOF features. CoRR, abs/1312.4190Google Scholar
  25. 25.
    Kumar A, Daum’e H III (2011) A co-training approach for multi-view spectral clustering. In ICML 393–400Google Scholar
  26. 26.
    Laptev I, Lindeberg T (2003) Space-time interest points. in: ICCV’03, p 432–439Google Scholar
  27. 27.
    Laptev I, Marszałek M, Schmid C, Rozenfeld B (2009) Learning realistic human actions from movies. in Proc. CVPR'08Google Scholar
  28. 28.
    Li R, Tian T, Sclaroff S (2007) Simultaneous learning of nonlinear manifold and dynamical models for high-dimensional time series. in: ICCV'07, p 1–8Google Scholar
  29. 29.
    Lin L, Wang K, Zuo W, Wang M, Luo J, Zhang L (2015) A deep structured model with radius-margin bound for 3d human activity recognition. Int J Comput Vis 118:256MathSciNetCrossRefGoogle Scholar
  30. 30.
    Liu A, Wang Z, Nie W, Yuting S (2015a) Graph-based characteristic view set extraction and matching for 3D model retrieval. Inf Sci, doi: 10.1016/j.ins.2015.04.042
  31. 31.
    Liu A-A, Su Y-T, Jia P-P, Gao Z, Hao T, Yang Z-X (2015b) Multipe/single-view human action recognition via part-induced multitask structural learning. IEEE Trans Cybern 45(6):1194–1208CrossRefGoogle Scholar
  32. 32.
    Liu A-A, Xu N, Nie W, Su Y, Wong Y, Kankanhalli M (2016a) Benchmarking a multimodal and Multiview and interactive dataset for human action recognition. IEEE Transactions on Cybernetics 0(0):1–1Google Scholar
  33. 33.
    Liu A-A, Nie W-Z, Gao Y, Su Y-T (2016b) Multi-modal clique-graph matching for view-based 3D model retrieval. IEEE Trans Image Process 25(5):2103–2116MathSciNetCrossRefGoogle Scholar
  34. 34.
    Liu J, Ren T, Wang Y, Zhong S-H, Bei J, Chen S (2016c) Object proposal on RGB-D images via elastic edge boxes. Neurocomputing, doi: 10.1016/j.neucom.2016.09.111
  35. 35.
    Liu A-A, Su Y-T, Nie W-Z, Kankanhalli M (2017) Hierarchical clustering multi-task learning for joint human action grouping and recognition. IEEE Trans Pattern Anal Mach Intell 39(1):102–114Google Scholar
  36. 36.
    Mansur A, Makihara Y, Yagi Y (2013) Inverse dynamics for action recognition. IEEE Trans Cybern 43(4):1226–1236CrossRefGoogle Scholar
  37. 37.
    Marszalek M, Laptev I, Schmid C (2009) Actions in context. in: CVPR’09, p 2929–2936Google Scholar
  38. 38.
    Nie L, Wang M, Zha Z-J, Li G, Chua T-S (2011) Multimedia answering: enriching text QA with media information. SIGIR:695–704Google Scholar
  39. 39.
    Nie WZ, Liu AA, Gao Z, Su YT (2015) Clique-graph matching by preserving global & local structure. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4503–4510Google Scholar
  40. 40.
    Nie WZ, Liu AA, Li WH, Su YT (2016) Cross-view action recognition by cross-domain learning, Image and Vision Computing.Google Scholar
  41. 41.
    Onishi K, Takiguchi T, Ariki Y (2008) 3D human posture estimation using the HOG features from monocular image. in: ICPR, p 1–4Google Scholar
  42. 42.
    Rahmani H, Mian A (2016) 3D action recognition from novel viewpoints. In: CVPRGoogle Scholar
  43. 43.
    Ran J, Yang L, Ren T, Ge L, Wu G (2015) Depth-aware salient object detection using anisotropic center-surround difference. Signal Processing: Image Communication (SPIC) 38:115–126Google Scholar
  44. 44.
    Rodriguez MD, Ahmed J, Shah M (2008) Action match a spatio-temporal maximum average correlation height filter for action recognition. in: CVPR’08, p 1–8Google Scholar
  45. 45.
    Suk H, Jain AK, Lee S (2011) A network of dynamic probabilistic models for human interaction analysis. IEEE Trans Circuits Syst Video Technol 21(7):932–945CrossRefGoogle Scholar
  46. 46.
    Sun S (2013) A survey of multi-view machine learning. Neural Comput & Applic 23(Issue 7-8):2031–2038CrossRefGoogle Scholar
  47. 47.
    Wang H, Schmid C (2013) Action recognition with improved trajectories. ICCVGoogle Scholar
  48. 48.
    Wang H, Kläser A, Schmid C, Liu C-L (2011) Action recognition by dense trajectories. CVPR:3169–3176Google Scholar
  49. 49.
    Wang H, Klaser A, Schmid C, Liu C-L (2013) Dense trajectories and motion boundary descriptors for action recognition. IJCV 103(1):60–79MathSciNetCrossRefGoogle Scholar
  50. 50.
    Wang J, Nie X, Xia Y, Wu Y, Zhu S (2014a) Cross-view action modeling, learning and recognition. In CVPRGoogle Scholar
  51. 51.
    Wang J, Nie X, Xia Y, Wu Y, Zhu S-C (2014b) Cross-view action modeling, learning, and recognition. Proc of IEEE Conf on Computer Vision and Pattern Recognition (CVPR)Google Scholar
  52. 52.
    Weinland D, Boyer E, Ronfard R (2007) Action recognition from arbitrary views using 3d exemplars. ICCVGoogle Scholar
  53. 53.
    Xia L, Chen CC, Aggarwal JK (2012) View invariant human action recognition using histograms of 3D joints. In CVPRWGoogle Scholar
  54. 54.
    Xu C, Tao D, Xu C (2013) A survey on multi-view learning
  55. 55.
    Yao H, Zhang S, Zhang Y, Li J, Tian Q (2016) Coarse-to-fine description for fine-grained visual categorization. IEEE Trans Image Process 25(10):4858–4872MathSciNetCrossRefGoogle Scholar
  56. 56.
    Yuting S et al (2014) Coupled hidden conditional random fields for RGB-D human action recognition. Singal Process. doi: 10.1016/j.sigpro.2014.08.038 Google Scholar
  57. 57.
    Zhang X, Zhang H, Zhang Y, Yang Y, Wang M, Luan H-B, Li J, Chua T-S (2016) Deep fusion of multiple semantic cues for complex event recognition. IEEE Trans Image Process 25(3):1033–1046MathSciNetCrossRefGoogle Scholar
  58. 58.
    Zhou J, Chen J, Ye J (2012) MALSAR: multi-tAsk learning via structural regularization. Arizona State University,
  59. 59.
    Zhou Q, Wang G, Jia K, Zhao Q (2013) Learning to share latent tasks for action recognition. in: ICCVGoogle Scholar

Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. 1.Key Laboratory of Computer Vision and System, Ministry of EducationTianjin University of TechnologyTianjinChina
  2. 2.Tianjin Key Laboratory of Intelligence Computing and Novel Software TechnologyTianjin University of TechnologyTianjinChina
  3. 3.National Engineering Laboratory for Information Security TechnologiesInstitute of Information Engineering, Chinese Academy of SciencesBeijingChina
  4. 4.The Home DepotAtlantaUnited States

Personalised recommendations