International Journal of Computer Vision, Volume 109, Issue 1–2, pp 60–73

Harnessing Lab Knowledge for Real-World Action Recognition

  • Zhigang Ma
  • Yi Yang
  • Feiping Nie
  • Nicu Sebe
  • Shuicheng Yan
  • Alexander G. Hauptmann


Much research on human action recognition has focused on improving performance on lab-collected datasets. Yet real-world videos are more diverse and contain more complicated actions, and often only a few of them are precisely labeled. Recognizing actions in these videos is therefore a difficult task. The paucity of labeled real-world videos motivates us to “borrow” strength from other resources. Specifically, since many lab datasets are available, we propose to harness them to facilitate action recognition in real-world videos, provided that the lab and real-world datasets are related. As their action categories are usually inconsistent, we design a multi-task learning framework that jointly optimizes the classifiers for both sides. The general Schatten \(p\)-norm is imposed on the two classifiers to explore the knowledge they share. In this way, our framework can mine the shared knowledge between the two datasets even when they have different action categories, which is a major virtue of our method. The shared knowledge is then used to improve action recognition in real-world videos. Extensive experiments on real-world datasets show promising results.
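The abstract names the regularizer but does not give the objective, so the following is only a minimal sketch of how a Schatten \(p\)-norm penalty on two jointly learned classifiers might look. The least-squares loss, the column-wise stacking of the two weight matrices, the weight `gamma`, and the function names `schatten_p_norm` and `joint_objective` are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def schatten_p_norm(W, p):
    """Schatten p-norm of W: (sum_i sigma_i^p)^(1/p) over its singular values."""
    sigma = np.linalg.svd(W, compute_uv=False)
    return float(np.sum(sigma ** p) ** (1.0 / p))


def joint_objective(X_lab, Y_lab, W_lab, X_real, Y_real, W_real, p=1.0, gamma=1.0):
    """Hypothetical joint objective for a lab task and a real-world task.

    X_* are (samples x features), W_* are (features x classes) and
    Y_* are (samples x classes). The two tasks may have different numbers
    of classes; they only need to share the feature dimension.
    """
    # Assumed loss: squared Frobenius error for each task.
    loss_lab = np.linalg.norm(X_lab @ W_lab - Y_lab) ** 2
    loss_real = np.linalg.norm(X_real @ W_real - Y_real) ** 2
    # Stack both classifiers side by side so the penalty acts on them jointly.
    W = np.hstack([W_lab, W_real])
    return loss_lab + loss_real + gamma * schatten_p_norm(W, p) ** p
```

With `p = 1` the penalty is the nuclear norm and with `p = 2` it is the Frobenius norm. Penalizing the singular values of the stacked matrix favors classifiers that lie in a common low-dimensional subspace, which is one way to read "shared knowledge" between tasks whose label sets differ.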


Action recognition · Lab to real-world · Transfer learning · General Schatten-p norm



This paper was partially supported by the US Department of Defense, the U.S. Army Research Office (W911NF-13-1-0277) and by the National Science Foundation under Grant No. IIS-1251187, the xLiMe EC project, the ARC Project DE130101311 and the Singapore National Research Foundation under its International Research Centre @Singapore Funding Initiative and administered by the IDM Programme Office. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of ARO, the National Science Foundation or the U.S. Government.



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Zhigang Ma (1)
  • Yi Yang (2)
  • Feiping Nie (3)
  • Nicu Sebe (4)
  • Shuicheng Yan (5)
  • Alexander G. Hauptmann (1)

  1. Carnegie Mellon University, Pittsburgh, USA
  2. ITEE, The University of Queensland, Brisbane, Australia
  3. University of Texas at Arlington, Arlington, USA
  4. University of Trento, Trento, Italy
  5. National University of Singapore, Singapore, Singapore
