International Journal of Computer Vision, Volume 126, Issue 2–4, pp 375–389

Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos

  • Serena Yeung
  • Olga Russakovsky
  • Ning Jin
  • Mykhaylo Andriluka
  • Greg Mori
  • Li Fei-Fei


Every moment counts in action recognition. A comprehensive understanding of human activity in video requires labeling every frame according to the actions occurring, placing multiple labels densely over a video sequence. To study this problem we extend the existing THUMOS dataset and introduce MultiTHUMOS, a new dataset of dense labels over unconstrained internet videos. Modeling multiple, dense labels benefits from temporal relations within and across classes. We define a novel variant of long short-term memory deep networks for modeling these temporal relations via multiple input and output connections. We show that this model improves action labeling accuracy and further enables deeper understanding tasks ranging from structured retrieval to action prediction.
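As a rough illustration of what dense, multi-label annotation means in practice, the sketch below turns a handful of temporal intervals into a frames-by-classes binary matrix, so that every frame carries zero, one, or several action labels. The function name `dense_labels`, the `(class_id, start_sec, end_sec)` interval format, the frame rate, and the example annotations are all assumptions made for illustration; they are not taken from the MultiTHUMOS release or from the model described above.

```python
import numpy as np


def dense_labels(annotations, num_frames, num_classes, fps):
    """Build a (num_frames, num_classes) binary matrix from labeled intervals.

    annotations: iterable of (class_id, start_sec, end_sec) tuples
                 (a hypothetical format, chosen for this sketch).
    """
    labels = np.zeros((num_frames, num_classes), dtype=np.uint8)
    for class_id, start_sec, end_sec in annotations:
        # Map the interval to a half-open range of frame indices, clipped to the video.
        first = max(0, int(np.floor(start_sec * fps)))
        last = min(num_frames, int(np.ceil(end_sec * fps)))
        labels[first:last, class_id] = 1
    return labels


# Hypothetical 4-second clip at 10 frames per second with three overlapping actions.
annotations = [(0, 0.0, 2.0), (1, 1.0, 1.5), (2, 0.5, 3.0)]
labels = dense_labels(annotations, num_frames=40, num_classes=3, fps=10)

# Number of simultaneously active labels at each frame.
labels_per_frame = labels.sum(axis=1)
```

In this representation a frame-level model would predict one row of the matrix per frame, which is what distinguishes dense labeling from assigning a single label to the whole clip.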



We would like to thank Andrej Karpathy and Amir Zamir for helpful comments and discussion.



Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  • Serena Yeung (1), corresponding author
  • Olga Russakovsky (1, 2)
  • Ning Jin (1)
  • Mykhaylo Andriluka (1, 3)
  • Greg Mori (4)
  • Li Fei-Fei (1)

  1. Stanford University, Stanford, USA
  2. Carnegie Mellon University, Pittsburgh, USA
  3. Max Planck Institute for Informatics, Saarbrücken, Germany
  4. Simon Fraser University, Burnaby, Canada
