Joint Future Semantic and Instance Segmentation Prediction

  • Camille Couprie
  • Pauline Luc
  • Jakob Verbeek
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11131)


The ability to predict what will happen next from observing the past is a key component of intelligence. Methods that forecast future frames were recently introduced as a step towards better machine intelligence. However, predicting directly in the image color space seems an overly complex task, and predicting higher-level representations, using semantic or instance segmentation approaches, has been shown to be more accurate. In this work, we introduce a novel prediction approach that encodes instance and semantic segmentation information in a single representation based on distance maps. Our graph-based modeling of the instance segmentation prediction problem allows us to obtain temporal tracks of the objects as an optimal solution to a watershed algorithm. Our experimental results on the Cityscapes dataset show state-of-the-art semantic segmentation predictions, and instance segmentation results that outperform a strong baseline based on optical flow.
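To make the idea of a distance-map representation concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single frame given as a 2D grid of instance ids (0 for background), uses city-block distance, and decodes instances with a simple marker-based priority flood in the spirit of a watershed. The function names `encode_distance_map` and `decode_instances` and the parameter `seed_threshold` are hypothetical, introduced only for illustration.

```python
import heapq
from collections import deque

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))  # 4-connectivity


def encode_distance_map(ids):
    """City-block distance from each instance pixel to the nearest pixel
    with a different id. Background (id 0) stays at 0. Assumption: the
    image border itself is not treated as a boundary."""
    h, w = len(ids), len(ids[0])
    dist = [[0] * w for _ in range(h)]
    queue = deque()
    # Instance pixels that touch a different id are at distance 1.
    for r in range(h):
        for c in range(w):
            if ids[r][c] == 0:
                continue
            for dr, dc in NEIGHBORS:
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w and ids[rr][cc] != ids[r][c]:
                    dist[r][c] = 1
                    queue.append((r, c))
                    break
    # Breadth-first propagation towards the interior of each instance.
    while queue:
        r, c = queue.popleft()
        for dr, dc in NEIGHBORS:
            rr, cc = r + dr, c + dc
            if (0 <= rr < h and 0 <= cc < w
                    and ids[rr][cc] == ids[r][c] and dist[rr][cc] == 0):
                dist[rr][cc] = dist[r][c] + 1
                queue.append((rr, cc))
    return dist


def decode_instances(dist, seed_threshold):
    """Recover instance labels from a distance map.
    Seeds are connected components where dist >= seed_threshold
    (seed_threshold must be >= 1); they are then grown over all pixels
    with dist > 0, visiting pixels in order of decreasing distance."""
    h, w = len(dist), len(dist[0])
    labels = [[0] * w for _ in range(h)]
    heap = []
    next_label = 0
    # Step 1: label each connected seed region.
    for r in range(h):
        for c in range(w):
            if dist[r][c] < seed_threshold or labels[r][c]:
                continue
            next_label += 1
            labels[r][c] = next_label
            stack = [(r, c)]
            while stack:
                y, x = stack.pop()
                heapq.heappush(heap, (-dist[y][x], y, x))
                for dy, dx in NEIGHBORS:
                    yy, xx = y + dy, x + dx
                    if (0 <= yy < h and 0 <= xx < w and not labels[yy][xx]
                            and dist[yy][xx] >= seed_threshold):
                        labels[yy][xx] = next_label
                        stack.append((yy, xx))
    # Step 2: flood outwards from the seeds, highest distance first.
    while heap:
        _, y, x = heapq.heappop(heap)
        for dy, dx in NEIGHBORS:
            yy, xx = y + dy, x + dx
            if (0 <= yy < h and 0 <= xx < w and not labels[yy][xx]
                    and dist[yy][xx] > 0):
                labels[yy][xx] = labels[y][x]
                heapq.heappush(heap, (-dist[yy][xx], yy, xx))
    return labels


# Toy frame: two touching instances with background on both sides.
ids = [[0, 1, 1, 1, 1, 2, 2, 2, 2, 0] for _ in range(3)]
dist = encode_distance_map(ids)
decoded = decode_instances(dist, seed_threshold=2)
```

In this toy frame the two instances touch, yet the distance map drops to 1 along their shared boundary, so thresholding separates the seeds and the flood restores the full masks. A predicted distance map would be real-valued and noisy, so the threshold and the flooding rule would need tuning in practice.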



We thank Piotr Dollár and the anonymous reviewers for their valuable comments.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Facebook AI Research, Paris, France
  2. Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, Grenoble, France
