International Journal of Computer Vision, Volume 118, Issue 2, pp 256–273

A Deep Structured Model with Radius–Margin Bound for 3D Human Activity Recognition

  • Liang Lin
  • Keze Wang
  • Wangmeng Zuo (corresponding author)
  • Meng Wang
  • Jiebo Luo
  • Lei Zhang


Understanding human activity is very challenging even with the recently developed 3D/depth sensors. To solve this problem, this work investigates a novel deep structured model, which adaptively decomposes an activity instance into temporal parts using convolutional neural networks. Our model advances traditional deep learning approaches in two aspects. First, we incorporate latent temporal structure into the deep model, accounting for the large temporal variations of diverse human activities. In particular, we utilize latent variables to decompose the input activity into a number of temporally segmented sub-activities, and accordingly feed them into the parts (i.e. sub-networks) of the deep architecture. Second, we incorporate a radius–margin bound as a regularization term into our deep model, which effectively improves the generalization performance for classification. For model training, we propose a principled learning algorithm that iteratively (i) discovers the optimal latent variables (i.e. the ways of activity decomposition) for all training instances, (ii) updates the classifiers based on the generated features, and (iii) updates the parameters of the multi-layer neural networks. In the experiments, our approach is validated on several complex scenarios for human activity recognition and demonstrates superior performance over other state-of-the-art approaches.
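The iterative learning scheme above can be illustrated with a toy sketch. This is not the authors' implementation: a linear scorer over per-part mean features stands in for the deep sub-networks, the update is a simple margin-driven perceptron step rather than back-propagation with the radius–margin regularizer, and all names (`segmentations`, `part_feature`, `score`, `train`) are illustrative. It shows only the alternation: infer the best latent temporal decomposition under the current model, then update the model on the induced part features.

```python
import itertools

def segmentations(n_frames, n_parts):
    """Enumerate all ways to split n_frames into n_parts contiguous temporal parts."""
    for cuts in itertools.combinations(range(1, n_frames), n_parts - 1):
        bounds = (0,) + cuts + (n_frames,)
        yield [(bounds[i], bounds[i + 1]) for i in range(n_parts)]

def part_feature(video, start, end):
    """Stand-in for a sub-network: mean of the (scalar) frame features in a part."""
    return sum(video[start:end]) / (end - start)

def score(weights, video, seg):
    """Linear score over per-part features (proxy for the deep model's output)."""
    return sum(w * part_feature(video, s, e) for w, (s, e) in zip(weights, seg))

def train(videos, labels, n_parts=2, epochs=10, lr=0.1):
    """Alternate (i) latent-variable inference and (ii)/(iii) model updates."""
    weights = [0.0] * n_parts
    for _ in range(epochs):
        for video, y in zip(videos, labels):
            # (i) latent step: pick the decomposition the current model favors
            seg = max(segmentations(len(video), n_parts),
                      key=lambda s: y * score(weights, video, s))
            # (ii)/(iii) update step: margin-driven correction on part features
            if y * score(weights, video, seg) < 1:
                weights = [w + lr * y * part_feature(video, s, e)
                           for w, (s, e) in zip(weights, seg)]
    return weights
```

On a separable toy pair such as `train([[1, 1, 2, 2], [-1, -1, -2, -2]], [1, -1])`, the learned weights score every segmentation of the positive sequence above zero and of the negative one below zero; in the paper this alternation instead drives feature learning in the sub-networks.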


Keywords: Human action and activity · RGB-depth analysis · Structured model · Deep learning



This work was supported in part by the Hong Kong Scholar Program, in part by the HK PolyU's Joint Supervision Scheme with the Chinese Mainland, Taiwan and Macao Universities (Grant no. G-SB20), in part by the Guangdong Natural Science Foundation (Grant nos. S2013010013432 and S2013050014548), and in part by the Guangdong Science and Technology Program (Grant no. 2013B010406005).



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Liang Lin (1, 2)
  • Keze Wang (1, 2)
  • Wangmeng Zuo (3, corresponding author)
  • Meng Wang (4)
  • Jiebo Luo (5)
  • Lei Zhang (2)

  1. Sun Yat-sen University, Guangzhou, China
  2. Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China
  3. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
  4. Hefei University of Technology, Hefei, China
  5. Department of Computer Science, University of Rochester, Rochester, USA
