MMA: a multi-view and multi-modality benchmark dataset for human action recognition

  • Zan Gao
  • Tao-tao Han
  • Hua Zhang
  • Yan-bing Xue
  • Guang-ping Xu

Abstract

Human action recognition is an active research topic in both the computer vision and machine learning communities, with broad applications including surveillance, biometrics and human-computer interaction. Although several well-known action datasets have been released over the past decades, they still suffer from limitations, including restricted numbers of action categories and samples, few camera views, and little variety of scenarios. Moreover, most of them are designed for only a subset of the relevant learning problems, such as the single-view, cross-view and multi-task learning problems. In this paper, we introduce a multi-view, multi-modality benchmark dataset for human action recognition (abbreviated to MMA). MMA consists of 7080 action samples from 25 action categories, comprising 15 single-subject actions and 10 double-subject interactive actions, captured from three views in two different scenarios. Furthermore, we systematically benchmark state-of-the-art approaches on MMA with respect to all three learning problems using different spatial-temporal feature representations. Experimental results demonstrate that MMA is challenging for all three learning problems because of significant intra-class variations, occlusions, view and scene variations, and multiple similar action categories. Meanwhile, we provide baselines for the evaluation of existing state-of-the-art algorithms.
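
As a rough illustration (not the dataset's actual API), the following Python sketch models how an MMA sample might be described and how the three benchmarked learning problems partition the data by camera view; the Sample fields and split helpers are hypothetical assumptions based only on the statistics given in the abstract.

    # Minimal sketch, assuming a hypothetical per-sample record for MMA;
    # field names and split logic are illustrative, not the official API.
    from dataclasses import dataclass

    @dataclass
    class Sample:
        path: str      # location of the clip (RGB or depth); hypothetical layout
        category: int  # 0..24: 15 single-subject + 10 double-subject interactive actions
        view: int      # 0..2: one of the three camera views
        scenario: int  # 0..1: one of the two recording scenarios

    def single_view_split(samples, view):
        """Single-view learning: train and test within one camera view."""
        return [s for s in samples if s.view == view]

    def cross_view_split(samples, source_view, target_view):
        """Cross-view learning: train on one view, test on a different one."""
        train = [s for s in samples if s.view == source_view]
        test = [s for s in samples if s.view == target_view]
        return train, test

    def multi_task_split(samples, n_views=3):
        """Multi-task learning: treat each view as a related task learned jointly."""
        return {v: [s for s in samples if s.view == v] for v in range(n_views)}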

Keywords

Action recognition · Benchmark dataset · Multi-view · Multi-modality · Cross-view · Multi-task · Cross-domain


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Zan Gao 1,2
  • Tao-tao Han 1,2
  • Hua Zhang 1,2
  • Yan-bing Xue 1,2
  • Guang-ping Xu 1,2

  1. Key Laboratory of Computer Vision and System (Ministry of Education), Tianjin University of Technology, Tianjin, China
  2. Tianjin Key Laboratory of Intelligence Computing and Novel Software Technology, Tianjin University of Technology, Tianjin, China