Diagnosing Error in Temporal Action Detectors

  • Humam Alwassel
  • Fabian Caba Heilbron
  • Victor Escorcia
  • Bernard Ghanem
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11207)


Despite the recent progress in video understanding and the steady improvement in temporal action localization over the years, it is still unclear how far we are from (or how close we are to) solving the problem. To this end, we introduce a new diagnostic tool to analyze the performance of temporal action detectors in videos and to compare different methods beyond a single scalar metric. We exemplify the use of our tool by analyzing the performance of the top rewarded entries in the latest ActivityNet action localization challenge. Our analysis shows that the most impactful areas to work on are: strategies to better handle temporal context around the instances, improving robustness with respect to the absolute and relative size of instances, and strategies to reduce localization errors. Moreover, our experimental analysis finds that the lack of agreement among annotators is not a major roadblock to attaining progress in the field. Our diagnostic tool is publicly available to keep fueling the minds of other researchers with additional insights about their algorithms.


Temporal action detection · Error analysis · Diagnosis tool · Action localization



This publication is based upon work supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. OSR-CRG2017-3405.




Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Humam Alwassel (1)
  • Fabian Caba Heilbron (1)
  • Victor Escorcia (1)
  • Bernard Ghanem (1)
  1. King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
