Forecasting Human-Object Interaction: Joint Prediction of Motor Attention and Actions in First Person Video

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12346)


We address the challenging task of anticipating human-object interaction in first person videos. Most existing methods either ignore how the camera wearer interacts with objects, or simply consider body motion as a separate modality. In contrast, we observe that intentional hand movement reveals critical information about the future activity. Motivated by this observation, we adopt intentional hand movement as a feature representation, and propose a novel deep network that jointly models and predicts the egocentric hand motion, interaction hotspots and future action. Specifically, we consider the future hand motion as the motor attention, and model this attention using probabilistic variables in our deep model. The predicted motor attention is further used to select the discriminative spatial-temporal visual features for predicting actions and interaction hotspots. We present extensive experiments demonstrating the benefit of the proposed joint model. Importantly, our model produces new state-of-the-art results for action anticipation on both the EGTEA Gaze+ and EPIC-Kitchens datasets. Our project page is available at


First Person Vision, Action anticipation, Motor attention



Portions of this research were supported in part by National Science Foundation Award 1936970 and a gift from Facebook. YL acknowledges the support from the Wisconsin Alumni Research Foundation.




Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Georgia Institute of Technology, Atlanta, USA
  2. University of Wisconsin-Madison, Madison, USA
  3. ETH Zürich, Zürich, Switzerland
