End-to-End Joint Semantic Segmentation of Actors and Actions in Video

  • Jingwei Ji
  • Shyamal Buch
  • Alvaro Soto
  • Juan Carlos Niebles
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11208)

Abstract

Traditional video understanding tasks include human action recognition and actor/object semantic segmentation. However, the combined problem of providing semantic segmentation for different actor classes simultaneously with their action classes remains challenging yet necessary for many applications. In this work, we propose a new end-to-end architecture for tackling this task in videos. Our model effectively leverages multiple input modalities, contextual information, and multitask learning in the video to directly output semantic segmentations in a single unified framework. We train and benchmark our model on the Actor-Action Dataset (A2D) for joint actor-action semantic segmentation, and demonstrate state-of-the-art performance for both segmentation and detection. We also perform experiments verifying that our approach improves performance for zero-shot recognition, indicating the generalizability of our jointly learned feature space.

Keywords

Semantic segmentation · Actor · Action · Video · End-to-end · Zero-shot
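To make the joint-prediction idea in the abstract concrete, below is a minimal, purely hypothetical sketch of how a shared region feature can branch into an actor classifier, an action classifier, and a coarse segmentation mask. The label lists, feature dimension, mask resolution, and head structure are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical label spaces, loosely modeled on A2D's actor/action vocabulary.
ACTORS = ["adult", "baby", "ball", "bird", "car", "cat", "dog"]
ACTIONS = ["climbing", "crawling", "eating", "flying", "jumping",
           "rolling", "running", "walking", "none"]

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_heads(shared_feat, w_actor, w_action, w_mask):
    """Branch one pooled region feature into three task heads.

    shared_feat: (D,) feature vector for a single region proposal.
    Returns (actor_probs, action_probs, mask_logits).
    """
    actor_probs = softmax(shared_feat @ w_actor)          # (num_actors,)
    action_probs = softmax(shared_feat @ w_action)        # (num_actions,)
    mask_logits = (shared_feat @ w_mask).reshape(14, 14)  # coarse per-region mask
    return actor_probs, action_probs, mask_logits

D = 256  # assumed shared-feature dimension
feat = rng.standard_normal(D)
actor_p, action_p, mask = joint_heads(
    feat,
    rng.standard_normal((D, len(ACTORS))),
    rng.standard_normal((D, len(ACTIONS))),
    rng.standard_normal((D, 14 * 14)),
)
print(ACTORS[actor_p.argmax()], ACTIONS[action_p.argmax()], mask.shape)
```

Because all three heads read the same shared feature, gradients from each task shape a common representation during training, which is the multitask-learning intuition the abstract appeals to; the random weights here stand in for learned parameters.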

Acknowledgments

Toyota Research Institute (“TRI”) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. This work is also partially funded by the Millennium Institute for Foundational Research on Data. We also thank NVIDIA for their DGX-1 donation.

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Jingwei Ji (1)
  • Shyamal Buch (1)
  • Alvaro Soto (2)
  • Juan Carlos Niebles (1)

  1. Stanford Vision and Learning Lab, Stanford, USA
  2. Pontificia Universidad Católica de Chile, Santiago, Chile