Learning Object Permanence from Video

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12361)


Object Permanence allows people to reason about the location of non-visible objects by understanding that they continue to exist even when not perceived directly. Object Permanence is critical for building a model of the world, since objects in natural visual scenes dynamically occlude and contain each other. Intensive studies in developmental psychology suggest that object permanence is a challenging task that is learned through extensive experience.

Here we introduce the setup of learning Object Permanence from labeled videos. We explain why this learning problem should be dissected into four components, where objects are (1) visible, (2) occluded, (3) contained by another object, or (4) carried by a containing object. The fourth subtask, where a target object is carried by a containing object, is particularly challenging because it requires a system to reason about the moving location of an invisible object. We then present a unified deep architecture that learns to predict object location under these four scenarios. We evaluate the architecture on a new dataset based on CATER with per-frame labels, and find that it outperforms previous localization methods and various baselines.


Keywords: Object Permanence · Reasoning · Video Analysis



This study was funded by grants to GC from the Israel Science Foundation and Bar-Ilan University (ISF 737/2018, ISF 2332/18). AS is funded by the Israel Innovation Authority through the AVATAR consortium. AG received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant ERC HOLI 819080).

Supplementary material

Supplementary material 1 (pdf 2979 KB)

Supplementary material 2 (mp4 11424 KB)

Supplementary material 3 (mp4 15464 KB)

Supplementary material 4 (mp4 14424 KB)

Supplementary material 5 (mp4 11395 KB)


References

  1. Aguiar, A., Baillargeon, R.: 2.5-month-old infants' reasoning about when objects should and should not be occluded. Cogn. Psychol. 39(2), 116–157 (1999)
  2. Baillargeon, R., DeVos, J.: Object permanence in young infants: further evidence. Child Dev. 62(6), 1227–1246 (1991)
  3. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
  4. Fan, H., et al.: LaSOT: a high-quality benchmark for large-scale single object tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5374–5383 (2019)
  5. Fan, H., Ling, H.: Siamese cascaded region proposal networks for real-time visual tracking. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019
  6. Feichtenhofer, C., Pinz, A., Wildes, R.P.: Spatiotemporal residual networks for video action recognition. arXiv preprint arXiv:1611.02155 (2016)
  7. Gao, L., Guo, Z., Zhang, H., Xu, X., Shen, H.T.: Video captioning with attention-based LSTM and semantic consistency. IEEE Trans. Multimed. 19(9), 2045–2055 (2017)
  8. Girdhar, R., Ramanan, D.: CATER: a diagnostic dataset for compositional actions and temporal reasoning. arXiv preprint arXiv:1910.04744 (2019)
  9. Grabner, H., Matas, J., Van Gool, L., Cattin, P.: Tracking the invisible: learning where the object might be. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1285–1292. IEEE (2010)
  10. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
  11. Huang, Y., Essa, I.: Tracking multiple objects through occlusions. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 1051–1058. IEEE (2005)
  12. Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., Girshick, R.: CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910 (2017)
  13. Krishna, R., et al.: Visual Genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 123(1), 32–73 (2017)
  14. Kristan, M., Leonardis, A., Matas, J., et al.: The sixth visual object tracking VOT2018 challenge results. In: ECCV Workshops (2018)
  15. Li, B., Yan, J., Wu, W., Zhu, Z., Hu, X.: High performance visual tracking with Siamese region proposal network. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8971–8980 (2018)
  16. Liang, W., Zhu, Y., Zhu, S.C.: Tracking occluded objects and recovering incomplete trajectories by reasoning about containment relations and human actions. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
  17. Lu, C., Krishna, R., Bernstein, M., Fei-Fei, L.: Visual relationship detection with language priors. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 852–869. Springer, Cham (2016)
  18. Marvasti-Zadeh, S.M., Cheng, L., Ghanei-Yakhdan, H., Kasaei, S.: Deep learning for visual tracking: a comprehensive survey. arXiv preprint (2019)
  19. Papadourakis, V., Argyros, A.: Multiple objects tracking in the presence of long-term occlusions. Comput. Vis. Image Underst. 114(7), 835–846 (2010)
  20. Piaget, J.: The Construction of Reality in the Child (1954)
  21. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)
  22. Sadeghi, M.A., Farhadi, A.: Recognition using visual phrases. In: CVPR 2011, pp. 1745–1752. IEEE (2011)
  23. Sharma, S., Kiros, R., Salakhutdinov, R.: Action recognition using visual attention. arXiv preprint arXiv:1511.04119 (2015)
  24. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp. 568–576 (2014)
  25. Smitsman, A.W., Dejonckheere, P.J., De Wit, T.C.: The significance of event information for 6- to 16-month-old infants' perception of containment. Dev. Psychol. 45(1), 207 (2009)
  26. Song, S., Lan, C., Xing, J., Zeng, W., Liu, J.: An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In: Thirty-First AAAI Conference on Artificial Intelligence (2017)
  27. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497 (2015)
  28. Ullman, S., Dorfman, N., Harari, D.: A model for discovering 'containment' relations. Cognition 183, 67–81 (2019)
  29. Vaswani, A., et al.: Attention is all you need (2017)
  30. Wu, Y., Lim, J., Yang, M.H.: Object tracking benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 37(9), 1834–1848 (2015)
  31. Yi, K., et al.: CLEVRER: collision events for video representation and reasoning (2019)
  32. Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., Toderici, G.: Beyond short snippets: deep networks for video classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4694–4702 (2015)
  33. Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R.R., Smola, A.J.: Deep sets. In: Advances in Neural Information Processing Systems, pp. 3391–3401 (2017)
  34. Zhou, B., Andonian, A., Oliva, A., Torralba, A.: Temporal relational reasoning in videos. In: European Conference on Computer Vision (2018)
  35. Zhu, Z., Wang, Q., Bo, L., Wu, W., Yan, J., Hu, W.: Distractor-aware Siamese networks for visual object tracking. In: European Conference on Computer Vision (2018)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Bar-Ilan University, Ramat-Gan, Israel
  2. Tel Aviv University, Tel Aviv, Israel
  3. NVIDIA Research, Tel-Aviv, Israel
