
My View is the Best View: Procedure Learning from Egocentric Videos

  • Conference paper

Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13673)

Abstract

Procedure learning involves identifying the key-steps of a task and determining their logical order. Existing approaches commonly use third-person videos for learning the procedure, making the manipulated object small in appearance and often occluded by the actor, which leads to significant errors. In contrast, we observe that videos obtained from first-person (egocentric) wearable cameras provide an unobstructed and clear view of the action. However, procedure learning from egocentric videos is challenging because (a) the camera view undergoes extreme changes due to the wearer’s head motion, and (b) the videos are unconstrained and therefore contain unrelated frames. Consequently, the assumption made by current state-of-the-art methods, namely that actions occur at approximately the same time and are of the same duration, does not hold. Instead, we propose to use the signal provided by the temporal correspondences between key-steps across videos. To this end, we present a novel self-supervised Correspond and Cut (CnC) framework for procedure learning. CnC identifies and utilizes the temporal correspondences between the key-steps across multiple videos to learn the procedure. Our experiments show that CnC outperforms the state-of-the-art on the benchmark ProceL and CrossTask datasets by \(5.2\%\) and \(6.3\%\), respectively. Furthermore, for procedure learning using egocentric videos, we propose the EgoProceL dataset, consisting of 62 hours of videos captured by 130 subjects performing 16 tasks. The source code and the dataset are available on the project page: https://sid2697.github.io/egoprocel/.
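The correspondence signal described above, matching frames of the same key-step across two videos even when the step occurs at different times and speeds, can be illustrated with a minimal, hypothetical sketch. This is not the paper's CnC implementation: the synthetic embeddings, the noise model, and the mutual nearest-neighbor (cycle-consistency) rule are all assumptions made purely for illustration.

```python
import numpy as np

def nearest_neighbors(a, b):
    """For each embedding in `a`, return the index of its nearest
    neighbor in `b` under Euclidean distance."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (len(a), len(b))
    return d.argmin(axis=1)

def cycle_consistent_pairs(a, b):
    """Frame pairs (i, j) where a[i] -> b[j] -> a[i] under mutual
    nearest-neighbor matching: a simple proxy for the temporal
    correspondences between key-steps across two videos."""
    ab = nearest_neighbors(a, b)
    ba = nearest_neighbors(b, a)
    return [(i, int(ab[i])) for i in range(len(a)) if ba[ab[i]] == i]

# Toy example: two "videos" whose frame embeddings share three
# key-steps, performed at different times and with different durations
# (the setting where same-time/same-duration assumptions break down).
rng = np.random.default_rng(0)
steps = rng.normal(size=(3, 4))  # embeddings of 3 shared key-steps
v1 = np.repeat(steps, [2, 3, 2], axis=0) + 0.01 * rng.normal(size=(7, 4))
v2 = np.repeat(steps, [3, 2, 4], axis=0) + 0.01 * rng.normal(size=(9, 4))
pairs = cycle_consistent_pairs(v1, v2)
```

In this toy setting, every cycle-consistent pair links frames of the same underlying key-step, even though the steps are misaligned in time between the two videos, which is the kind of signal a correspondence-based approach can exploit.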




Acknowledgements

The work was supported in part by the Department of Science and Technology, Government of India, under DST/ICPS/Data-Science project ID T-138. We acknowledge Pravin Nagar and Sagar Verma for sharing the PC Assembly and Disassembly videos recorded at IIIT Delhi. We also acknowledge Jehlum Vitasta Pandit and Astha Bansal for their help with annotating a portion of EgoProceL.

Author information

Correspondence to Siddhant Bansal.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2319 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Bansal, S., Arora, C., Jawahar, C.V. (2022). My View is the Best View: Procedure Learning from Egocentric Videos. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13673. Springer, Cham. https://doi.org/10.1007/978-3-031-19778-9_38


  • DOI: https://doi.org/10.1007/978-3-031-19778-9_38

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19777-2

  • Online ISBN: 978-3-031-19778-9

  • eBook Packages: Computer Science, Computer Science (R0)
