STEm-Seg: Spatio-Temporal Embeddings for Instance Segmentation in Videos

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12356)


Existing methods for instance segmentation in videos typically involve multi-stage pipelines that follow the tracking-by-detection paradigm and model a video clip as a sequence of images. Multiple networks are used to detect objects in individual frames, and then associate these detections over time. Hence, these methods are often non-end-to-end trainable and highly tailored to specific tasks. In this paper, we propose a different approach that is well-suited to a variety of tasks involving instance segmentation in videos. In particular, we model a video clip as a single 3D spatio-temporal volume, and propose a novel approach that segments and tracks instances across space and time in a single stage. Our problem formulation is centered around the idea of spatio-temporal embeddings which are trained to cluster pixels belonging to a specific object instance over an entire video clip. To this end, we introduce (i) novel mixing functions that enhance the feature representation of spatio-temporal embeddings, and (ii) a single-stage, proposal-free network that can reason about temporal context. Our network is trained end-to-end to learn spatio-temporal embeddings as well as parameters required to cluster these embeddings, thus simplifying inference. Our method achieves state-of-the-art results across multiple datasets and tasks. Code and models are available at
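The idea of clustering pixels of a clip by their embeddings can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes, purely for illustration, that one instance is described by a center and a per-dimension variance in embedding space, and that a pixel belongs to the instance when an unnormalised Gaussian score exceeds a threshold. The names `cluster_instance`, `center`, `variance` and `threshold` are hypothetical and do not come from the abstract.

```python
import numpy as np


def cluster_instance(embeddings, center, variance, threshold=0.5):
    """Return a boolean (T, H, W) mask of pixels assigned to one instance.

    embeddings: float array of shape (T, H, W, E), one E-dim vector per pixel
                of the whole clip, treated as a single spatio-temporal volume.
    center, variance: float arrays of shape (E,); hypothetical cluster
                parameters assumed for this sketch.
    """
    diff = embeddings - center  # broadcasts over the (T, H, W) axes
    # Unnormalised Gaussian score in (0, 1]; it equals 1 exactly at the center.
    score = np.exp(-0.5 * np.sum(diff ** 2 / variance, axis=-1))
    return score > threshold


# Toy clip: 2 frames of 4x4 pixels with 3-dimensional embeddings.
T, H, W, E = 2, 4, 4, 3
emb = np.zeros((T, H, W, E))
emb[:, :2, :2] = 1.0  # the same top-left 2x2 block in every frame

mask = cluster_instance(emb, center=np.ones(E), variance=np.full(E, 0.1))
print(mask.shape, int(mask.sum()))
```

Because the mask is computed over the whole volume at once, the pixels selected in different frames share one instance identity without a separate association step, which is the property the abstract attributes to a single-stage formulation.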



This project was funded, in part, by ERC Consolidator Grant DeeVise (ERC-2017-COG-773161), EU project CROWDBOT (H2020-ICT-2017-779942) and the Humboldt Foundation through the Sofja Kovalevskaja Award. Computing resources for several experiments were granted by RWTH Aachen University under project ‘rwth0519’. We thank Sebastian Hennen for help with experiments and Francis Engelmann, Theodora Kontogianni, Paul Voigtlaender, Guillem Brasó and Aysim Toker for helpful discussions.

Supplementary material

Supplementary material 1 (pdf 39589 KB)

Supplementary material 2 (mp4 13330 KB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. RWTH Aachen University, Aachen, Germany
  2. Technical University of Munich, Munich, Germany
