
A Weakly Supervised Multi-task Ranking Framework for Actor–Action Semantic Segmentation


Modeling human behaviors and activity patterns has attracted significant research interest in recent years, and modeling them accurately requires fine-grained understanding of human activity in videos. Research in this area has shifted from action classification to detailed actor and action understanding, which provides compelling results for the perceptual needs of cutting-edge autonomous systems. However, current methods for detailed actor and action understanding have two significant limitations: they require large amounts of finely labeled data, and they fail to capture the internal relationships among actors and actions. To address these issues, we propose a novel Schatten p-norm robust multi-task ranking model for weakly supervised actor–action segmentation, in which only video-level tags are given for the training samples. The model shares useful information among different actors and actions while learning a ranking matrix that selects representative supervoxels for actors and for actions. Final segmentation results are generated by a conditional random field that considers the various ranking scores of video parts. Extensive experiments on both the Actor–Action Dataset and the YouTube-Objects dataset demonstrate that the proposed approach outperforms state-of-the-art weakly supervised methods and performs as well as the top-performing fully supervised method.
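For readers unfamiliar with the regularizer named in the abstract, the standard textbook definition of the Schatten p-norm is sketched below. The abstract does not state the model's objective, the value of p, or how the norm enters the formulation, so the symbols here (a generic matrix W of size m by n with singular values σ_i) are illustrative assumptions and not the authors' notation.

```latex
% Standard definition of the Schatten p-norm of a matrix W in R^{m x n},
% where \sigma_i(W) denotes the i-th singular value of W.
\[
  \|W\|_{S_p} \;=\; \Bigl( \sum_{i=1}^{\min(m,n)} \sigma_i(W)^{p} \Bigr)^{1/p}
\]
% Well-known special cases:
%   p = 1        : nuclear (trace) norm, the sum of singular values
%   p = 2        : Frobenius norm
%   p -> \infty  : spectral norm, the largest singular value
% For 0 < p < 1 the expression is a quasi-norm, not a norm.
```

With p = 1 this reduces to the nuclear norm, a common convex surrogate for matrix rank in multi-task learning. Values of p below 1 are often used as a tighter, non-convex approximation of rank. Which setting the paper adopts is not recoverable from the abstract alone.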






This research was partially supported by a University of Michigan MiBrain Grant (DC, JC), DARPA FA8750-17-2-0112 (JC), National Institute of Standards and Technology Grant 60NANB17D191 (JC, YY), NSF IIS-1741472 and IIS-1813709 (CX), NSF NeTS-1909185 and CSR-1908658 (YY), and a gift donation from Cisco Inc. (YY). This article solely reflects the opinions and conclusions of its authors and not those of the funding agents.

Author information

Correspondence to Yan Yan or Dawen Cai.

Additional information


Communicated by Xavier Alameda-Pineda, Elisa Ricci, Albert Ali Salah, Nicu Sebe, Shuicheng Yan.


About this article


Cite this article

Yan, Y., Xu, C., Cai, D. et al. A Weakly Supervised Multi-task Ranking Framework for Actor–Action Semantic Segmentation. Int J Comput Vis (2019). https://doi.org/10.1007/s11263-019-01244-7



  • Weakly supervised learning
  • Actor–action semantic segmentation
  • Multi-task ranking