Weakly Supervised Action Labeling in Videos under Ordering Constraints

  • Piotr Bojanowski
  • Rémi Lajugie
  • Francis Bach
  • Ivan Laptev
  • Jean Ponce
  • Cordelia Schmid
  • Josef Sivic
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8693)


We are given a set of video clips, each one annotated with an ordered list of actions, such as “walk” then “sit” then “answer phone” extracted from, for example, the associated text script. We seek to temporally localize the individual actions in each clip as well as to learn a discriminative classifier for each action. We formulate the problem as a weakly supervised temporal assignment with ordering constraints. Each video clip is divided into small time intervals and each time interval of each video clip is assigned one action label, while respecting the order in which the action labels appear in the given annotations. We show that the action label assignment can be determined together with learning a classifier for each action in a discriminative manner. We evaluate the proposed model on a new and challenging dataset of 937 video clips with a total of 787,720 frames containing sequences of 16 different actions from 69 Hollywood movies.
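The ordering constraint described above can be illustrated with a small dynamic program in the spirit of dynamic time warping. This is only a minimal sketch under stated assumptions, not the method of the paper: it assumes the per-interval costs are already fixed (in the paper the classifiers are learned jointly with the assignment), and the function name `ordered_assignment`, the cost matrix, and the requirement that every annotated action receives at least one interval are hypothetical choices made for illustration.

```python
import numpy as np


def ordered_assignment(costs):
    """Assign each of T time intervals to one of K ordered annotation slots.

    costs[t, k] is a hypothetical cost of giving interval t the k-th action
    of the annotated list. The assignment starts at slot 0, ends at slot
    K - 1, and advances by at most one slot per interval, so the annotated
    order is respected and every action gets at least one interval.
    """
    costs = np.asarray(costs, dtype=float)
    T, K = costs.shape
    if T < K:
        raise ValueError("need at least one interval per annotated action")

    total = np.full((T, K), np.inf)             # best cumulative cost at (t, k)
    advanced = np.zeros((T, K), dtype=bool)     # True if slot k was entered at t
    total[0, 0] = costs[0, 0]

    for t in range(1, T):
        for k in range(min(t + 1, K)):          # slot k is reachable only if k <= t
            stay = total[t - 1, k]
            advance = total[t - 1, k - 1] if k > 0 else np.inf
            advanced[t, k] = advance < stay
            total[t, k] = costs[t, k] + min(stay, advance)

    # Backtrack from the last interval, which must carry the last action.
    slots = np.empty(T, dtype=int)
    k = K - 1
    for t in range(T - 1, -1, -1):
        slots[t] = k
        if advanced[t, k]:
            k -= 1
    return slots, total[T - 1, K - 1]


# Toy example: six intervals, three annotated actions (made-up costs).
annotation = ["walk", "sit", "answer phone"]
costs = np.array([
    [0.1, 0.9, 0.9],
    [0.2, 0.8, 0.9],
    [0.9, 0.1, 0.8],
    [0.8, 0.2, 0.9],
    [0.9, 0.3, 0.7],
    [0.9, 0.9, 0.1],
])
slots, cost = ordered_assignment(costs)
labels = [annotation[k] for k in slots]
print(labels)
```

Because the slots index positions in the annotated list rather than action classes, the same sketch handles lists in which an action appears more than once, such as “walk” then “sit” then “walk”.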


Keywords: Video Clip · Action Recognition · Temporal Constraint · Dynamic Time Warping · Convex Relaxation



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Piotr Bojanowski (1)
  • Rémi Lajugie (1)
  • Francis Bach (1)
  • Ivan Laptev (1)
  • Jean Ponce (2)
  • Cordelia Schmid (1)
  • Josef Sivic (1)
  1. INRIA, France
  2. École Normale Supérieure, France
