Predicting Future Instance Segmentation by Forecasting Convolutional Features

  • Pauline Luc
  • Camille Couprie
  • Yann LeCun
  • Jakob Verbeek
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11213)


Anticipating future events is an important prerequisite for intelligent behavior. Video forecasting has been studied as a proxy task for this goal. Recent work has shown that, to predict the semantic segmentation of future frames, forecasting at the semantic level is more effective than forecasting RGB frames and then segmenting them. In this paper we consider the more challenging problem of future instance segmentation, which additionally segments out individual objects. To deal with a varying number of output labels per image, we develop a predictive model in the space of fixed-size convolutional features of the Mask R-CNN instance segmentation model. We apply the “detection head” of Mask R-CNN to the predicted features to produce the instance segmentation of future frames. Experiments show that this approach significantly improves over strong baselines based on optical flow and repurposed instance segmentation architectures.
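The idea of forecasting in a fixed-size feature space can be illustrated with a minimal sketch. It is not the model used in the paper: the function names, the flat-list layout of the feature maps, and the linear 1x1-convolution predictor are all assumptions made for illustration, and the Mask R-CNN detection head that would consume the predicted features is left out. The point is only that the predictor's output has the same shape as each input feature map, regardless of how many objects the scene contains.

```python
from typing import List

# Hypothetical layout: a feature map is C channels, each a flat list of H*W activations.
FeatureMap = List[List[float]]


def forecast_features(past: List[FeatureMap], weights: List[List[float]]) -> FeatureMap:
    """Predict the next feature map from T past feature maps.

    `weights` has one row per output channel and T*C entries per row, so it
    acts as a 1x1 convolution over the past frames stacked along channels.
    (Illustrative linear predictor, not the architecture from the paper.)
    """
    # Concatenate the T frames along the channel axis: T*C channels in total.
    stacked = [channel for frame in past for channel in frame]
    n_positions = len(stacked[0])
    return [
        [sum(w * ch[p] for w, ch in zip(row, stacked)) for p in range(n_positions)]
        for row in weights
    ]


def copy_last_weights(t: int, c: int) -> List[List[float]]:
    """Weights that simply reproduce the most recent frame (for demonstration)."""
    return [
        [1.0 if j == (t - 1) * c + i else 0.0 for j in range(t * c)]
        for i in range(c)
    ]


past = [
    [[0.0, 1.0], [2.0, 3.0]],  # older frame: C=2 channels, H*W=2 positions
    [[4.0, 5.0], [6.0, 7.0]],  # most recent frame
]
predicted = forecast_features(past, copy_last_weights(t=2, c=2))
```

With these demonstration weights, `predicted` equals the most recent feature map. A learned predictor would replace `copy_last_weights`, and its output would then be passed to the detection head to obtain a variable number of instance masks.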


Keywords: Video prediction, Instance segmentation, Deep learning, Convolutional neural networks



This work has been partially supported by the grant ANR-16-CE23-0006 “Deep in France” and LabEx PERSYVAL-Lab (ANR-11-LABX-0025-01). We thank Matthijs Douze, Xavier Martin, Ilija Radosavovic and Thomas Lucas for their valuable comments.




Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Pauline Luc (1, 2), corresponding author
  • Camille Couprie (1)
  • Yann LeCun (3, 4)
  • Jakob Verbeek (2)

  1. Facebook AI Research, Paris, France
  2. Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP (Institute of Engineering Univ. Grenoble Alpes), LJK, Grenoble, France
  3. New York University, New York, USA
  4. Facebook AI Research, New York, USA
