Learning What to Learn for Video Object Segmentation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12347)

Abstract

Video object segmentation (VOS) is a highly challenging problem, since the target object is only defined by a first-frame reference mask during inference. The problem of how to capture and utilize this limited information to accurately segment the target remains a fundamental research question. We address this by introducing an end-to-end trainable VOS architecture that integrates a differentiable few-shot learner. Our learner is designed to predict a powerful parametric model of the target by minimizing a segmentation error in the first frame. We further go beyond the standard few-shot learning paradigm by learning what our target model should learn in order to maximize segmentation accuracy. We perform extensive experiments on standard benchmarks. Our approach sets a new state-of-the-art on the large-scale YouTube-VOS 2018 dataset by achieving an overall score of 81.5, corresponding to a \(2.6\%\) relative improvement over the previous best result. The code and models are available at https://github.com/visionml/pytracking.
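The core idea the abstract describes, fitting a parametric target model by minimizing a segmentation error on the annotated first frame, can be sketched as follows. This is an illustrative toy version only: it uses a linear per-pixel model, a fixed squared loss, and plain gradient descent, whereas the paper's differentiable learner additionally *learns* what the target model should learn; the function and parameter names here are hypothetical, not the released API.

```python
import numpy as np

def few_shot_learn(features, mask, steps=100, lr=0.5):
    """Fit a linear per-pixel target model w by gradient descent,
    minimizing a squared segmentation error on the first frame.
    features: (N, D) per-pixel feature vectors; mask: (N,) labels in {0, 1}.
    """
    n, d = features.shape
    w = np.zeros(d)
    for _ in range(steps):
        pred = features @ w                    # per-pixel target scores
        grad = features.T @ (pred - mask) / n  # gradient of 0.5 * MSE
        w -= lr * grad
    return w

def segment(features, w, threshold=0.5):
    """Apply the learned target model to a new frame's features."""
    return (features @ w) > threshold
```

In this toy setup the inner loop is an ordinary optimizer; the paper's contribution is to make this inner optimization differentiable and to meta-learn the loss it minimizes, so that the predicted target model maximizes segmentation accuracy on later frames.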

Acknowledgments

This work was partly supported by the ETH Zürich Fund (OK), a Huawei Technologies Oy (Finland) project, an Amazon AWS grant, Nvidia, ELLIIT Excellence Center, the Wallenberg AI, Autonomous Systems and Software Program (WASP) and the SSF project Symbicloud.

Supplementary material

Supplementary material 1 (PDF, 5.9 MB)

Supplementary material 2 (MP4, 18.6 MB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. CVL, ETH Zürich, Zürich, Switzerland
  2. CVL, Linköping University, Linköping, Sweden
