Self-supervised Multi-task Procedure Learning from Instructional Videos

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12362)

Abstract

We address the problem of unsupervised procedure learning from instructional videos of multiple tasks using Deep Neural Networks (DNNs). Unlike existing works, we assume that training videos come from multiple tasks without key-step annotations or grammars, and the goals are to classify a test video into its underlying task and to localize its key-steps. Our DNN learns task-dependent attention features from informative regions of each frame without ground-truth bounding boxes, and it learns to discover and localize key-steps without key-step annotations by using an unsupervised subset selection module as a teacher. It also learns to classify an input video from the discovered key-steps, via a learnable pooling mechanism that extracts and combines key-step-based features for task recognition. Through experiments on two instructional video datasets, we demonstrate the effectiveness of our method for unsupervised localization of procedure steps and for video classification.
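The abstract describes three learnable components: a task-dependent attention module over frame regions, a key-step scorer distilled from an unsupervised subset selection teacher, and a key-step feature pooling mechanism feeding a task classifier. Below is a minimal, self-contained PyTorch sketch of how such a pipeline could fit together. It is illustrative only: the module names, feature dimensions, sigmoid key-step scoring, normalized pooling, and the loss weight alpha are assumptions, not the authors' published architecture.

```python
# A hypothetical sketch of the pipeline outlined in the abstract:
# region attention -> key-step scoring (distilled from a subset-selection
# teacher) -> key-step-weighted pooling -> task classification.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProcedureLearner(nn.Module):
    def __init__(self, feat_dim=512, num_tasks=10):
        super().__init__()
        # Task-dependent spatial attention over frame regions (no boxes needed).
        self.region_att = nn.Linear(feat_dim, 1)
        # Per-frame key-step score, trained to mimic the subset-selection teacher.
        self.keystep_scorer = nn.Linear(feat_dim, 1)
        # Task classifier on the key-step-pooled video feature.
        self.classifier = nn.Linear(feat_dim, num_tasks)

    def forward(self, regions):
        # regions: (B, T, R, D) regional features for T frames, R regions each.
        att = torch.softmax(self.region_att(regions), dim=2)      # (B, T, R, 1)
        frames = (att * regions).sum(dim=2)                       # (B, T, D)
        key_scores = torch.sigmoid(self.keystep_scorer(frames))   # (B, T, 1)
        # Learnable key-step pooling: frames weighted by key-step likelihood.
        weights = key_scores / (key_scores.sum(dim=1, keepdim=True) + 1e-8)
        video = (weights * frames).sum(dim=1)                     # (B, D)
        return self.classifier(video), key_scores.squeeze(-1)

def training_loss(logits, key_scores, task_labels, teacher_scores, alpha=1.0):
    # Task classification loss plus a distillation term that pushes the
    # key-step scores toward the subset-selection teacher's selections.
    cls = F.cross_entropy(logits, task_labels)
    distill = F.binary_cross_entropy(key_scores, teacher_scores)
    return cls + alpha * distill
```

Here `teacher_scores` stands in for the per-frame selections of the unsupervised subset selection teacher, assumed to be precomputed soft pseudo-labels in [0, 1] of shape (B, T).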

Keywords

Procedure learning · Instructional videos · Subset selection · Self-supervised learning · Deep Neural Networks · Attention modeling

Notes

Acknowledgements

This work is partially supported by DARPA Young Faculty Award (D18AP00050), NSF (IIS-1657197), ONR (N000141812132) and ARO (W911NF1810300).

Supplementary material

Supplementary material 1: 504472_1_En_33_MOESM1_ESM.pdf (PDF, 6.6 MB)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

1. Khoury College of Computer Sciences, Northeastern University, Boston, USA
