Abstract
Existing methods for instance segmentation in videos typically involve multi-stage pipelines that follow the tracking-by-detection paradigm and model a video clip as a sequence of images. Multiple networks are used to detect objects in individual frames, and then associate these detections over time. Hence, these methods are often non-end-to-end trainable and highly tailored to specific tasks. In this paper, we propose a different approach that is well-suited to a variety of tasks involving instance segmentation in videos. In particular, we model a video clip as a single 3D spatio-temporal volume, and propose a novel approach that segments and tracks instances across space and time in a single stage. Our problem formulation is centered around the idea of spatio-temporal embeddings which are trained to cluster pixels belonging to a specific object instance over an entire video clip. To this end, we introduce (i) novel mixing functions that enhance the feature representation of spatio-temporal embeddings, and (ii) a single-stage, proposal-free network that can reason about temporal context. Our network is trained end-to-end to learn spatio-temporal embeddings as well as parameters required to cluster these embeddings, thus simplifying inference. Our method achieves state-of-the-art results across multiple datasets and tasks. Code and models are available at https://github.com/sabarim/STEm-Seg.
A. Athar and S. Mahadevan—Equal contribution.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
This notation denotes element-wise addition between \(\textit{\textbf{e}}_i\) and \([x_i, y_i]\).
- 2.
Results for various \(\phi (\cdot )\) on YT-VIS and KITTI-MOTS are given in supplementary.
References
Hu, A., Kendall, A., Cipolla, R.: Learning a spatio-temporal embedding for video instance segmentation. arxiv preprint arXiv:1912:08969v (2019)
Van den Bergh, M., Roig, G., Boix, X., Manen, S., Van Gool, L.: Online video seeds for temporal window objectness. In: ICCV (2013)
Berman, M., Blaschko, M.B.: Optimization of the Jaccard index for image segmentation with the Lovász hinge. In: CVPR (2018)
Berman, M., Rannen Triki, A., Blaschko, M.B.: The Lovász-Softmax loss: a tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In: CVPR (2018)
Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: the CLEAR MOT metrics. JIVP 2008, 1:1–1:10 (2008)
Bochinski, E., Eiselein, V., Sikora, T.: High-speed tracking-by-detection without using image information. In: AVSS (2017)
Brox, T., Malik, J.: Object segmentation by long term analysis of point trajectories. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6315, pp. 282–295. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15555-0_21
Butt, A.A., Collins, R.T.: Multi-target tracking by Lagrangian relaxation to min-cost network flow. In: CVPR (2013)
Caelles, S., Maninis, K.K., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., Van Gool, L.: One-shot video object segmentation. In: CVPR (2017)
Caelles, S., et al.: The 2018 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1803.00557 (2018)
Caelles, S., Pont-Tuset, J., Perazzi, F., Montes, A., Maninis, K., Gool, L.V.: The 2019 DAVIS challenge on VOS: unsupervised multi-object segmentation. arXiv arXiv:1905.00737 (2019)
Chen, L., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017)
Chen, X., Girshick, R., He, K., Dollár, P.: TensorMask: a foundation for dense object segmentation. In: ICCV (2019)
Chen, Y., Pont-Tuset, J., Montes, A., Van Gool, L.: Blazingly fast video object segmentation with pixel-wise metric learning. In: CVPR (2018)
Cho, D., Hong, S., Kim, J., Kang, S.: Key instance selection for unsupervised video object segmentation. In: The 2019 DAVIS Challenge on Video Object Segmentation - CVPR Workshops (2019)
Comaniciu, D., Meer, P.: Mean shift: a robust approach toward feature space analysis. PAMI 24(5), 603–619 (2002)
Dave, A., Tokmakov, P., Ramanan, D.: Towards segmenting everything that moves. arXiv preprint arXiv:1902.03715 (2019)
De Brabandere, B., Neven, D., Van Gool, L.: Semantic instance segmentation for autonomous driving. In: CVPR Workshops (2017)
De Brabandere, B., Neven, D., Van Gool, L.: Semantic instance segmentation with a discriminative loss function. arXiv preprint arXiv:1708.02551 (2017)
Dong, M., et al.: Temporal feature augmented network for video instance segmentation. In: ICCV Workshops (2019)
Elich, C., Engelmann, F., Schult, J., Kontogianni, T., Leibe, B.: 3D-BEVIS: birds-eye-view instance segmentation. In: German Conference on Pattern Recognition (GCPR) (2019)
Engelmann, F., Bokeloh, M., Fathi, A., Leibe, B., Nießner, M.: 3D-MPA: multi proposal aggregation for 3D semantic instance segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
Ester, M., Kriegel, H.P., Sander, J., Xu, X., et al.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: ACM Conference on Knowledge Discovery and Data Mining (KDD) (1996)
Everingham, M., Van Gool, L., Williams, C., Winn, J., Zisserman, A.: The pascal visual object classes (VOC) challenge. IJCV 88(2), 303–338 (2010)
Feichtenhofer, C., Pinz, A., Zisserman, A.: Detect to track and track to detect. In: ICCV (2017)
Feng, Q., Yang, Z., Li, P., Wei, Y., Yang, Y.: Dual embedding learning for video instance segmentation. In: ICCV Workshops (2019)
Fukunaga, K., Hostetler, L.: The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans. Inf. Theory 21(1), 32–40 (1975)
Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: CVPR (2012)
Gkioxari, G., Malik, J.: Finding action tubes. In: CVPR (2015)
Gori, M., Monfardini, G., Scarselli, F.: A new model for learning in graph domains. In: IJCNN (2005)
Han, W., et al.: Seq-NMS for video object detection. arXiv preprint arXiv:1602.08465 (2016)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Hou, R., Chen, C., Shah, M.: Tube convolutional neural network (T-CNN) for action detection in videos. In: ICCV (2017)
Hou, R., Chen, C., Sukthankar, R., Shah, M.: An efficient 3D CNN for action/object segmentation in video. In: BMVC (2019)
Hu, Y., Huang, J., Schwing, A.: MaskRNN: instance level video object segmentation. In: NIPS (2017)
Huang, C., Wu, B., Nevatia, R.: Robust object tracking by hierarchical association of detection responses. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008. LNCS, vol. 5303, pp. 788–801. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-88688-4_58
Jain, R., Nagel, H.H.: On the analysis of accumulative difference pictures from image sequences of real world scenes. PAMI 1, 206–214 (1979)
Jain, S., Xiong, B., Grauman, K.: FusionSeg: learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In: CVPR (2017)
Jiang, L., Zhao, H., Shi, S., Liu, S., Fu, C.W., Jia, J.: PointGroup: dual-set point grouping for 3D instance segmentation. In: CVPR (2020)
Kang, K., et al.: Object detection in videos with tubelet proposal networks. In: CVPR (2017)
Kong, S., Fowlkes, C.C.: Recurrent pixel embedding for instance grouping. In: CVPR (2018)
Kuhn, H.W., Yaw, B.: The Hungarian method for the assignment problem. Naval Res. Logist. Q. 2, 83–97 (1955)
Kwak, S., Cho, M., Laptev, I., Ponce, J., Schmid, C.: Unsupervised object discovery and tracking in video collections. In: ICCV (2015)
Leal-Taixé, L., Milan, A., Reid, I., Roth, S., Schindler, K.: MOTChallenge 2015: towards a benchmark for multi-target tracking. arXiv preprint arXiv:1504.01942 (2015)
Leibe, B., Leonardis, A., Schiele, B.: Robust object detection with interleaved categorization and segmentation. IJCV 77(1–3), 259–289 (2008)
Leibe, B., Schindler, K., Cornelis, N., Gool, L.V.: Coupled object detection and tracking from static cameras and moving vehicles. PAMI 30(10), 1683–1698 (2008)
Li, S., Seybold, B., Vorobyov, A., Fathi, A., Huang, Q., Kuo, C.C.J.: Instance embedding transfer to unsupervised video object segmentation. In: CVPR (2018)
Lin, T.-Y.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, R., et al.: An intriguing failing of convolutional neural networks and the CoordConv solution. In: NIPS (2018)
Liu, X., Ye, T.: Spatio-temporal attention network for video instance segmentation. In: ICCV Workshops (2019)
Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)
Luiten, J., Voigtlaender, P., Leibe, B.: PReMVOS: proposal-generation, refinement and merging for video object segmentation. In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11364, pp. 565–580. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20870-7_35
McInnes, L., Healy, J., Astels, S.: HDBSCAN: hierarchical density based clustering. J. Open Source Softw. 2(11), 205 (2017)
Milan, A., Leal-Taixé, L., Reid, I., Roth, S., Schindler, K.: MOT16: a benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831 (2016)
Milan, A., Leal-Taixé, L., Schindler, K., Reid, I.: Joint tracking and segmentation of multiple targets. In: CVPR (2015)
Neven, D., Brabandere, B.D., Proesmans, M., Gool, L.V.: Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth. In: CVPR (2019)
Newell, A., Huang, Z., Deng, J.: Associative embedding: end-to-end learning for joint detection and grouping. In: NIPS (2017)
Novotny, D., Albanie, S., Larlus, D., Vedaldi, A.: Semi-convolutional operators for instance segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11205, pp. 89–105. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01246-5_6
Ochs, P., Brox, T.: Higher order motion models and spectral clustering. In: CVPR (2012)
Oh, S.W., Lee, J.Y., Xu, N., Kim, S.J.: Video object segmentation using space-time memory networks. In: ICCV (2019)
Okuma, K., Taleghani, A., de Freitas, N., Little, J.J., Lowe, D.G.: A boosted particle filter: multitarget detection and tracking. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3021, pp. 28–39. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24670-1_3
Ošep, A., Mehner, W., Voigtlaender, P., Leibe, B.: Track, then decide: category-agnostic vision-based multi-object tracking. In: ICRA (2018)
Ošep, A., Voigtlaender, P., Luiten, J., Breuers, S., Leibe, B.: Large-scale object mining for object discovery from unlabeled video (2019)
Ošep, A., Voigtlaender, P., Weber, M., Luiten, J., Leibe, B.: 4D generic video object proposals. In: ICRA (2020)
Palmer, S.E.: Organizing objects and scenes. In: Foundations of Cognitive Psychology: Core Readings, pp. 189–211 (2002)
Palou, G., Salembier, P.: Hierarchical video representation with trajectory binary partition tree. In: CVPR (2013)
Paragios, N., Deriche, R.: Geodesic active contours and level sets for the detection and tracking of moving objects. PAMI 22, 266–280 (2000)
Pinheiro, P.O., Lin, T.-Y., Collobert, R., Dollár, P.: Learning to refine object segments. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 75–91. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_5
Pinheiro, P., Collobert, R., Dollár, P.: Learning to segment object candidates. In: NIPS (2015)
Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., Gool, L.V.: A benchmark dataset and evaluation methodology for video object segmentation. In: CVPR (2016)
Qi, C.R., Litany, O., He, K., Guibas, L.J.: Deep Hough voting for 3D object detection in point clouds. In: CVPR (2019)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS (2015)
Siam, M., et al.: Video segmentation using teacher-student adaptation in a human robot interaction (HRI) setting. In: ICRA (2018)
Song, H., Wang, W., Zhao, S., Shen, J., Lam, K.-M.: Pyramid dilated deeper ConvLSTM for video salient object detection. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11215, pp. 744–760. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01252-6_44
Teichman, A., Levinson, J., Thrun, S.: Towards 3D object recognition via classification of arbitrary object tracks. In: ICRA (2011)
Tokmakov, P., Alahari, K., Schmid, C.: Learning video object segmentation with visual memory. In: ICCV (2017)
Ventura, C., Bellver, M., Girbau, A., Salvador, A., Marqués, F., Gir’o i Nieto, X.: RVOS: end-to-end recurrent network for video object segmentation. CVPR (2019)
Voigtlaender, P., Chai, Y., Schroff, F., Adam, H., Leibe, B., Chen., L.C.: FEELVOS: fast end-to-end embedding learning for video object segmentation. In: CVPR (2019)
Voigtlaender, P., et al.: MOTS: multi-object tracking and segmentation. In: CVPR (2019)
Wang, H., Luo, R., Maire, M., Shakhnarovich, G.: Pixel consensus voting for panoptic segmentation. In: CVPR (2020)
Wang, L., Hua, G., Sukthankar, R., Xue, J., Zheng, N.: Video object discovery and co-segmentation with extremely weak supervision. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8692, pp. 640–655. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10593-2_42
Wang, Q., He, Y., Yang, X., Yang, Z., Torr, P.: An empirical study of detection-based video instance segmentation. In: ICCV Workshops (2019)
Wang, W., Lu, X., Shen, J., Crandall, D.J., Shao, L.: Zero-shot video object segmentation via attentive graph neural networks. In: The IEEE International Conference on Computer Vision (ICCV) (2019)
Wojke, N., Bewley, A., Paulus., D.: Onboard contextual classification of 3D point clouds with learned high-order Markov random fields. In: ICIP (2017)
Wren, C.R., Azarbayejani, A., Darrell, T., Pentland, A.: Pfinder: real-time tracking of the human body. PAMI 19, 780–785 (1997)
Wu, Y., He, K.: Group normalization. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11217, pp. 3–19. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01261-8_1
Wu, Z., Shen, C., van den Hengel, A.: Wider or deeper: revisiting the ResNet model for visual recognition. arXiv preprint arXiv:1611.10080 (2016)
Wug Oh, S., Lee, J.Y., Sunkavalli, K., Joo Kim, S.: Fast video object segmentation by reference-guided mask propagation. In: CVPR (2018)
Xiao, F., Jae Lee, Y.: Track and segment: an iterative unsupervised approach for video object proposals. In: CVPR (2016)
Xie, C., Xiang, Y., Harchaoui, Z., Fox, D.: Object discovery in videos as foreground motion clustering. In: CVPR (2019)
Xu, C.: Evaluation of super-voxel methods for early video processing. In: CVPR (2012)
Xu, N., et al.: YouTube-VOS: sequence-to-sequence video object segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11209, pp. 603–619. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01228-1_36
Yang, L., Fan, Y., Xu, N.: Video instance segmentation. In: ICCV (2019)
Yang, L., Wang, Y., Xiong, X., Yang, J., Katsaggelos, A.K.: Efficient video object segmentation via network modulation. In: CVPR (2018)
Yang, Z., Wang, Q., Bertinetto, L., Hu, W., Bai, S., Torr, P.H.S.: Anchor diffusion for unsupervised video object segmentation. In: ICCV (2019)
Jun Koh, Y., Kim, C.S.: Primary object segmentation in videos based on region augmentation and reduction. In: CVPR (2017)
Yu, J., Blaschko, M.: Learning submodular losses with the Lovász hinge. In: International Conference on Machine Learning (ICML) (2015)
Zeng, X., Liao, R., Gu, L., Xiong, Y., Fidler, S., Urtasun, R.: DMM-Net: differentiable mask-matching network for video object segmentation. In: ICCV (2019)
Zhang, D., Chun, J., Cha, S.K., Kim, Y.M.: Spatial semantic embedding network: fast 3D instance segmentation with deep metric learning. arXiv preprint arXiv:2007.03169 (2020)
Zulfikar, I.E., Luiten, J., Leibe, B.: UnOVOST: unsupervised offline video object segmentation and tracking for the 2019 unsupervised DAVIS challenge. In: The 2019 DAVIS Challenge on Video Object Segmentation - CVPR Workshops (2019)
Acknowledgements
This project was funded, in parts, by ERC Consolidator Grant DeeVise (ERC-2017-COG-773161), EU project CROWDBOT (H2020-ICT-2017-779942) and the Humboldt Foundation through the Sofja Kovalevskaja Award. Computing resources for several experiments were granted by RWTH Aachen University under project ‘rwth0519’. We thank Sebastian Hennen for help with experiments and Francis Engelmann, Theodora Kontogianni, Paul Voigtlaender, Gulliem Brasó and Aysim Toker for helpful discussions.
Author information
Authors and Affiliations
Corresponding authors
Editor information
Editors and Affiliations
1 Electronic supplementary material
Below is the link to the electronic supplementary material.
Supplementary material 2 (mp4 13330 KB)
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Athar, A., Mahadevan, S., Os̆ep, A., Leal-Taixé, L., Leibe, B. (2020). STEm-Seg: Spatio-Temporal Embeddings for Instance Segmentation in Videos. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12356. Springer, Cham. https://doi.org/10.1007/978-3-030-58621-8_10
Download citation
DOI: https://doi.org/10.1007/978-3-030-58621-8_10
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58620-1
Online ISBN: 978-3-030-58621-8
eBook Packages: Computer ScienceComputer Science (R0)