SOS! Self-supervised Learning over Sets of Handled Objects in Egocentric Action Recognition

Conference paper

Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13673)

Abstract

Learning an egocentric action recognition model from video data is challenging due to distractors in the background, e.g., irrelevant objects. Explicitly integrating object information into the action model is hence beneficial. Existing methods often leverage a generic object detector to identify and represent the objects in the scene. However, several important issues remain. Object class annotations of good quality for the target domain (dataset) are still required to learn good object representations. Moreover, previous methods deeply couple existing action models with object representations and therefore need to retrain them jointly, leading to costly and inflexible integration. To overcome both limitations, we introduce Self-Supervised Learning Over Sets (SOS), an approach to pre-train a generic Objects In Contact (OIC) representation model from video object regions detected by an off-the-shelf hand-object contact detector. Instead of augmenting object regions individually, as in conventional self-supervised learning, we view the action process as a source of natural data transformations with unique spatiotemporal continuity and exploit the inherent relationships among per-video object sets. Extensive experiments on two datasets, EPIC-KITCHENS-100 and EGTEA, show that our OIC significantly boosts the performance of multiple state-of-the-art video classification models.
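The abstract sketches the core idea: treat the set of hand-object crops detected within one video as naturally related views of each other and learn the OIC representation by pulling their embeddings together, rather than relying only on synthetic augmentations of individual crops. Below is a minimal, hypothetical sketch of such a set-level objective in PyTorch; the names (OICEncoder, set_ssl_loss), the ResNet-50 backbone, and the SimSiam-style negative-cosine loss are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (NOT the authors' code): self-supervised learning over the
# set of object crops taken from one video. All names below are illustrative
# assumptions; the SimSiam-style objective is one plausible instantiation.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50


class OICEncoder(nn.Module):
    """Hypothetical Objects-In-Contact encoder: backbone + projector + predictor."""

    def __init__(self, feat_dim=2048, proj_dim=256):
        super().__init__()
        backbone = resnet50()                 # untrained ResNet-50
        backbone.fc = nn.Identity()           # expose pooled 2048-d features
        self.backbone = backbone
        self.projector = nn.Sequential(
            nn.Linear(feat_dim, proj_dim), nn.ReLU(),
            nn.Linear(proj_dim, proj_dim))
        self.predictor = nn.Sequential(
            nn.Linear(proj_dim, proj_dim), nn.ReLU(),
            nn.Linear(proj_dim, proj_dim))

    def forward(self, x):
        z = self.projector(self.backbone(x))  # projected embedding
        p = self.predictor(z)                 # prediction used on the online branch
        return z, p


def set_ssl_loss(crops, model):
    """Negative cosine similarity between every ordered pair of crops from the
    same video, with a stop-gradient on the target branch (SimSiam-style)."""
    z, p = model(crops)                       # crops: (N, 3, H, W) from one clip
    z = F.normalize(z.detach(), dim=1)        # stop-gradient targets
    p = F.normalize(p, dim=1)
    sim = p @ z.t()                           # (N, N) pairwise cosine similarities
    off_diag = ~torch.eye(len(crops), dtype=torch.bool)
    return -sim[off_diag].mean()              # attract all intra-video crop pairs


if __name__ == "__main__":
    model = OICEncoder()
    crops = torch.randn(4, 3, 224, 224)       # e.g. 4 detected OIC crops from one clip
    loss = set_ssl_loss(crops, model)
    loss.backward()
    print(float(loss))
```

In the paper, the crops come from an off-the-shelf hand-object contact detector and the pre-trained OIC representation is then combined with downstream action models; the sketch above only illustrates the per-video set objective.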



Author information

Corresponding author: Victor Escorcia.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 4047 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Escorcia, V., Guerrero, R., Zhu, X., Martinez, B. (2022). SOS! Self-supervised Learning over Sets of Handled Objects in Egocentric Action Recognition. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13673. Springer, Cham. https://doi.org/10.1007/978-3-031-19778-9_35


  • DOI: https://doi.org/10.1007/978-3-031-19778-9_35

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19777-2

  • Online ISBN: 978-3-031-19778-9

  • eBook Packages: Computer Science (R0)
