Learning to Segment Moving Objects

  • Pavel Tokmakov
  • Cordelia Schmid
  • Karteek Alahari


We study the problem of segmenting moving objects in unconstrained videos. Given a video, the task is to segment all the objects that exhibit independent motion in at least one frame. We formulate this as a learning problem and design our framework with three cues: (1) independent object motion between a pair of frames, which complements object recognition, (2) object appearance, which helps to correct errors in motion estimation, and (3) temporal consistency, which imposes additional constraints on the segmentation. The framework is a two-stream neural network with an explicit memory module. The two streams encode appearance and motion cues in a video sequence respectively, while the memory module captures the evolution of objects over time, exploiting the temporal consistency. The motion stream is a convolutional neural network trained on synthetic videos to segment independently moving objects in the optical flow field. The module to build a “visual memory” in video, i.e., a joint representation of all the video frames, is realized with a convolutional recurrent unit learned from a small number of training video sequences. For every pixel in a frame of a test video, our approach assigns an object or background label based on the learned spatio-temporal features as well as the “visual memory” specific to the video. We evaluate our method extensively on three benchmarks, DAVIS, Freiburg-Berkeley motion segmentation dataset and SegTrack. In addition, we provide an extensive ablation study to investigate both the choice of the training data and the influence of each component in the proposed framework.
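The data flow described above (per-frame appearance and motion features, fused and passed through a convolutional recurrent memory, then labelled per pixel) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes GRU-style gating for the convolutional recurrent unit (the abstract only says "convolutional recurrent unit"), uses random untrained weights, and feeds random arrays in place of real stream features. All names (`conv2d_same`, `ConvGRUCell`, `segment_video`, `w_out`) and all sizes are hypothetical.

```python
import numpy as np


def conv2d_same(x, w):
    """Zero-padded 'same' 2D cross-correlation.

    x: (C_in, H, W) feature map; w: (C_out, C_in, k, k) kernel with odd k.
    """
    c_out, _, k, _ = w.shape
    pad = k // 2
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((c_out, h, wd))
    for dy in range(k):
        for dx in range(k):
            patch = xp[:, dy:dy + h, dx:dx + wd]  # shifted view, (C_in, H, W)
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], patch)
    return out


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


class ConvGRUCell:
    """GRU-style recurrent unit whose gates are convolutions (assumed design)."""

    def __init__(self, in_ch, hid_ch, k=3, seed=0):
        rng = np.random.default_rng(seed)
        shape = (hid_ch, in_ch + hid_ch, k, k)
        self.w_z = rng.normal(0, 0.1, shape)  # update gate
        self.w_r = rng.normal(0, 0.1, shape)  # reset gate
        self.w_h = rng.normal(0, 0.1, shape)  # candidate state
        self.hid_ch = hid_ch

    def step(self, x, h):
        xh = np.concatenate([x, h], axis=0)
        z = sigmoid(conv2d_same(xh, self.w_z))
        r = sigmoid(conv2d_same(xh, self.w_r))
        xrh = np.concatenate([x, r * h], axis=0)
        cand = np.tanh(conv2d_same(xrh, self.w_h))
        return (1.0 - z) * h + z * cand  # blend old memory with new evidence


def segment_video(appearance, motion, cell, w_out):
    """appearance, motion: (T, C, H, W) features. Returns (T, H, W) probabilities."""
    t, _, h, w = appearance.shape
    state = np.zeros((cell.hid_ch, h, w))  # the memory starts empty
    probs = []
    for i in range(t):
        x = np.concatenate([appearance[i], motion[i]], axis=0)  # fuse the two streams
        state = cell.step(x, state)
        logits = conv2d_same(state, w_out)  # 1x1 conv to a single foreground score
        probs.append(sigmoid(logits[0]))
    return np.stack(probs)


# Toy usage: 4 frames, 8 appearance channels, 2 motion channels, 6x5 feature maps.
rng = np.random.default_rng(1)
app_feat = rng.normal(size=(4, 8, 6, 5))
mot_feat = rng.normal(size=(4, 2, 6, 5))
cell = ConvGRUCell(in_ch=10, hid_ch=16)
w_out = rng.normal(0, 0.1, (1, 16, 1, 1))
probs = segment_video(app_feat, mot_feat, cell, w_out)
masks = probs > 0.5  # object vs. background label per pixel
```

Because the memory state is carried from frame to frame, the label at a pixel depends on earlier frames as well as the current one, which is the temporal-consistency cue in this simplified form.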


Motion segmentation, Video object segmentation, Visual memory



This work was supported in part by the ERC advanced Grant ALLEGRO, a Google research award, the Inria-CMU associate team GAYA, a Facebook and an Intel gift. We gratefully acknowledge the support of NVIDIA with the donation of GPUs used for this work. We also thank Yeong Jun Koh for providing segmentation masks produced by their method (Koh et al. 2017) on the FBMS dataset, and the associate editor and the anonymous reviewers for their suggestions.


  1. Adelson, E. H. (2001). On seeing stuff: The perception of materials by humans and machines. In Proceedings of SPIE.
  2. Badrinarayanan, V., Galasso, F., & Cipolla, R. (2010). Label propagation in video sequences. In CVPR.
  3. Ballas, N., Yao, L., Pal, C., & Courville, A. (2016). Delving deeper into convolutional networks for learning video representations. In ICLR.
  4. Bideau, P., & Learned-Miller, E. G. (2016). It’s moving! A probabilistic model for causal motion segmentation in moving camera videos. In ECCV.
  5. Brendel, W., & Todorovic, S. (2009). Video object segmentation by tracking regions. In ICCV.
  6. Brox, T., & Malik, J. (2010). Object segmentation by long term analysis of point trajectories. In ECCV.
  7. Brox, T., & Malik, J. (2011). Large displacement optical flow: Descriptor matching in variational motion estimation. PAMI, 33(3), 500–513.
  8. Byeon, W., Breuel, T. M., Raue, F., & Liwicki, M. (2015). Scene labeling with lstm recurrent neural networks. In CVPR.
  9. Caelles, S., Pont-Tuset, J., Maninis, K. K., Leal-Taixé, L., Cremers, D., & Van Gool, L. (2017). One-shot video segmentation. In CVPR.
  10. Chen, J., Yang, L., Zhang, Y., Alber, M., & Chen, D. Z. (2016). Combining fully convolutional and recurrent neural networks for 3d biomedical image segmentation. In NIPS.
  11. Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2015). Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR.
  12. Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2017). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. PAMI, 40, 834–848.
  13. Cho, K., van Merrienboer, B., Gülçehre, Ç., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.
  14. Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In CVPR.
  15. Dosovitskiy, A., Fischer, P., Ilg, E., Häusser, P., Hazırbas, C., Golkov, V., van der Smagt, P., Cremers, D., & Brox, T. (2015). FlowNet: Learning optical flow with convolutional networks. In ICCV.
  16. Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2012). The PASCAL visual object classes challenge (VOC2012) results.
  17. Faktor, A., & Irani, M. (2014). Video segmentation by non-local consensus voting. In BMVC.
  18. Fayyaz, M., Saffar, M. H., Sabokrou, M., Fathy, M., Klette, R., & Huang, F. (2016). Stfcn: spatio-temporal fcn for semantic video segmentation. arXiv preprint arXiv:1608.05971.
  19. Finn, C., Goodfellow, I., & Levine, S. (2016). Unsupervised learning for physical interaction through video prediction. In NIPS.
  20. Fragkiadaki, K., Zhang, G., & Shi, J. (2012). Video segmentation by tracing discontinuities in a trajectory embedding. In CVPR.
  21. Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In AISTATS.
  22. Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
  23. Graves, A., Jaitly, N., & Mohamed, A. (2013). Hybrid speech recognition with deep bidirectional LSTM. In Workshop on automatic speech recognition and understanding.
  24. Graves, A., Mohamed, A., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In ICASSP.
  25. Graves, A., & Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5), 602–610.
  26. Grundmann, M., Kwatra, V., Han, M., & Essa, I. (2010). Efficient hierarchical graph based video segmentation. In CVPR.
  27. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Identity mappings in deep residual networks. In ECCV (pp. 630–645). Springer.
  28. Hochreiter, S. (1998). The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(2), 107–116.
  29. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
  30. Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of National Academy of Sciences, 79(8), 2554–2558.
  31. Huguet, F., & Devernay, F. (2007). A variational method for scene flow estimation from stereo sequences. In ICCV.
  32. Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., & Brox, T. (2017). Flownet 2.0: Evolution of optical flow estimation with deep networks. In CVPR.
  33. Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML.
  34. Jain, S. D., Xiong, B., & Grauman, K. (2017). Fusionseg: Learning to combine motion and appearance for fully automatic segmention of generic objects in videos. In CVPR.
  35. Keuper, M., Andres, B., & Brox, T. (2015). Motion trajectory segmentation via minimum cost multicuts. In ICCV.
  36. Khoreva, A., Galasso, F., Hein, M., & Schiele, B. (2015). Classifier based graph construction for video segmentation. In CVPR.
  37. Khoreva, A., Perazzi, F., Benenson, R., Schiele, B., & Sorkine-Hornung, A. (2017). Learning video object segmentation from static images. In CVPR.
  38. Koh, Y. J., & Kim, C. S. (2017). Primary object segmentation in videos based on region augmentation and reduction. In CVPR.
  39. Krähenbühl, P., & Koltun, V. (2011). Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS.
  40. Lee, Y. J., Kim, J., & Grauman, K. (2011). Key-segments for video object segmentation. In ICCV.
  41. Learning motion patterns in videos.
  42. Lezama, J., Alahari, K., Sivic, J., & Laptev, I. (2011). Track to the future: Spatio-temporal video segmentation with long-range motion cues. In CVPR.
  43. Li, F., Kim, T., Humayun, A., Tsai, D., & Rehg, J. M. (2013). Video segmentation by tracking many figure-ground segments. In ICCV.
  44. Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In ECCV.
  45. Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In CVPR.
  46. Mayer, N., Ilg, E., Häusser, P., Fischer, P., Cremers, D., Dosovitskiy, A., & Brox, T. (2016). A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR.
  47. Mikolov, T., Karafiát, M., Burget, L., Cernocký, J., & Khudanpur, S. (2010). Recurrent neural network based language model. In Interspeech.
  48. Narayana, M., Hanson, A. R., & Learned-Miller, E. G. (2013). Coherent motion segmentation in moving camera videos using optical flow orientations. In ICCV.
  49. Ng, J. Y., Hausknecht, M. J., Vijayanarasimhan, S., Vinyals, O., Monga, R., & Toderici, G. (2015). Beyond short snippets: Deep networks for video classification. In CVPR.
  50. Ochs, P., & Brox, T. (2012). Higher order motion models and spectral clustering. In CVPR.
  51. Ochs, P., Malik, J., & Brox, T. (2014). Segmentation of moving objects by long term video analysis. PAMI, 36(6), 1187–1200.
  52. Papazoglou, A., & Ferrari, V. (2013). Fast object segmentation in unconstrained video. In ICCV.
  53. Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In ICML.
  54. Patraucean, V., Handa, A., & Cipolla, R. (2016). Spatio-temporal video autoencoder with differentiable memory. In ICLR Workshop track.
  55. Perazzi, F., Pont-Tuset, J., McWilliams, B., van Gool, L., Gross, M., & Sorkine-Hornung, A. (2016). A benchmark dataset and evaluation methodology for video object segmentation. In CVPR.
  56. Pinheiro, P. O., Lin, T. Y., Collobert, R., & Dollár, P. (2016). Learning to refine object segments. In ECCV.
  57. Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS.
  58. Revaud, J., Weinzaepfel, P., Harchaoui, Z., & Schmid, C. (2015). EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In CVPR.
  59. Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In MICCAI.
  60. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
  61. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). Imagenet large scale visual recognition challenge. IJCV, 115(3), 211–252.
  62. Shi, X., Chen, Z., Wang, H., Yeung, D. Y., Wong, W., & Woo, W. (2015). Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS.
  63. Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In NIPS.
  64. Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In ICLR.
  65. Srivastava, N., Mansimov, E., & Salakhutdinov, R. (2015). Unsupervised learning of video representations using LSTMs. In ICML.
  66. Sundaram, N., Brox, T., & Keutzer, K. (2010). Dense point trajectories by GPU-accelerated large displacement optical flow. In ECCV.
  67. Taylor, B., Karasev, V., & Soatto, S. (2015). Causal video object segmentation from persistence of occlusions. In CVPR.
  68. Tieleman, T., & Hinton, G. (2012). RMSProp. COURSERA: Lecture 6.5 - Neural Networks for Machine Learning.
  69. Tokmakov, P., Alahari, K., & Schmid, C. (2017). Learning motion patterns in videos. In CVPR.
  70. Tokmakov, P., Alahari, K., & Schmid, C. (2017). Learning video object segmentation with visual memory. In ICCV.
  71. Torr, P. H. S. (1998). Geometric motion segmentation and model selection. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 356(1740), 1321–1340.
  72. Vedula, S., Baker, S., Rander, P., Collins, R., & Kanade, T. (2005). Three-dimensional scene flow. PAMI, 27(3), 475–480.
  73. Vogel, C., Schindler, K., & Roth, S. (2015). 3D scene flow estimation with a piecewise rigid scene model. IJCV, 115(1), 1–28.
  74. Wang, W., Shen, J., & Porikli, F. (2015). Saliency-aware geodesic video object segmentation. In CVPR.
  75. Wedel, A., Brox, T., Vaudrey, T., Rabe, C., Franke, U., & Cremers, D. (2011). Stereoscopic scene flow computation for 3D motion understanding. IJCV, 95(1), 29–51.
  76. Werbos, P. J. (1990). Backpropagation through time: What it does and how to do it. Proceedings of IEEE, 78(10), 1550–1560.
  77. Xu, C., & Corso, J. J. (2016). Libsvx: A supervoxel library and benchmark for early video processing. IJCV, 119(3), 272–290.
  78. Zhang, D., Javed, O., & Shah, M. (2013). Video object segmentation through spatially accurate and temporally dense extraction of primary object regions. In CVPR.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Robotics Institute at Carnegie Mellon University, Pittsburgh, USA
  2. Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, Grenoble, France
