Interactive Video Object Segmentation Using Global and Local Transfer Modules

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12362)


An interactive video object segmentation algorithm, which takes scribble annotations on query objects as input, is proposed in this paper. We develop a deep neural network, which consists of the annotation network (A-Net) and the transfer network (T-Net). First, given user scribbles on a frame, A-Net yields a segmentation result based on the encoder-decoder architecture. Second, T-Net transfers the segmentation result bidirectionally to the other frames, by employing the global and local transfer modules. The global transfer module conveys the segmentation information in an annotated frame to a target frame, while the local transfer module propagates the segmentation information in a temporally adjacent frame to the target frame. By applying A-Net and T-Net alternately, a user can obtain desired segmentation results with minimal efforts. We train the entire network in two stages, by emulating user scribbles and employing an auxiliary loss. Experimental results demonstrate that the proposed interactive video object segmentation algorithm outperforms the state-of-the-art conventional algorithms. Codes and models are available at


Video object segmentation Interactive segmentation Deep learning 



This work was supported in part by ‘The Cross-Ministry Giga KOREA Project’ grant funded by the Korea government (MSIT) (No. GK20P0200, Development of 4D reconstruction and dynamic deformable action model based hyper-realistic service technology), in part by Institute of Information & communications Technology Planning & evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-01441, Artificial Intelligence Convergence Research Center (Chungnam National University)) and in part by the National Research Foundation of Korea (NRF) through the Korea Government (MSIP) under Grant NRF-2018R1A2B3003896.

Supplementary material (29.6 mb)
Supplementary material 1 (zip 30354 KB)


  1. 1.
    Bai, X., Wang, J., Simons, D., Sapiro, G.: Video SnapCut: robust video object cutout using localized classifiers. ACM Trans. Graph. 28(3), 70 (2009)CrossRefGoogle Scholar
  2. 2.
    Benard, A., Gygli, M.: Interactive video object segmentation in the wild. arXiv:1801.00269 (2017)
  3. 3.
    Brox, T., Malik, J.: Object segmentation by long term analysis of point trajectories. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6315, pp. 282–295. Springer, Heidelberg (2010). Scholar
  4. 4.
    Caelles, S., Maninis, K.K., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., Van Gool, L.: One-shot video object segmentation. In: CVPR (2017)Google Scholar
  5. 5.
    Caelles, S., et al.: The 2018 DAVIS challenge on video object segmentation. arXiv:1803.00557 (2018)
  6. 6.
    Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv:1412.7062 (2014)
  7. 7.
    Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 833–851. Springer, Cham (2018). Scholar
  8. 8.
    Chen, Y., Pont-Tuset, J., Montes, A., Van Gool, L.: Blazingly fast video object segmentation with pixel-wise metric learning. In: CVPR (2018)Google Scholar
  9. 9.
    Cheng, J., Tsai, Y.H., Wang, S., Yang, M.H.: SegFlow: joint learning for video object segmentation and optical flow. In: ICCV (2017)Google Scholar
  10. 10.
    Fan, Q., Zhong, F., Lischinski, D., Cohen-Or, D., Chen, B.: JumpCut: non-successive mask transfer and interpolation for video cutout. ACM Trans. Graph. 34(6), 195:1–195:10 (2015)CrossRefGoogle Scholar
  11. 11.
    Gers, F.A., Schmidhuber, J., Cummins, F.: Learning to forget: continual prediction with LSTM. Neural Comput. 12(10), 2451–2471 (1999)CrossRefGoogle Scholar
  12. 12.
    Hariharan, B., Arbelaez, P., Bourdev, L., Maji, S., Malik, J.: Semantic contours from inverse detectors. In: ICCV (2011)Google Scholar
  13. 13.
    Heo, Y., Koh, Y.J., Kim, C.S.: Interactive video object segmentation using sparse-to-dense networks. In: CVPRW (2019)Google Scholar
  14. 14.
    Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: CVPR (2018)Google Scholar
  15. 15.
    Hu, Y.-T., Huang, J.-B., Schwing, A.G.: VideoMatch: matching based video object segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11212, pp. 56–73. Springer, Cham (2018). Scholar
  16. 16.
    Jain, S.D., Grauman, K.: Supervoxel-consistent foreground propagation in video. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8692, pp. 656–671. Springer, Cham (2014). Scholar
  17. 17.
    Jain, S.D., Xiong, B., Grauman, K.: FusionSeg: learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In: CVPR (2017)Google Scholar
  18. 18.
    Jang, W.D., Kim, C.S.: Semi-supervised video object segmentation using multiple random walkers. In: BMVC (2016)Google Scholar
  19. 19.
    Jang, W.D., Kim, C.S.: Online video object segmentation via convolutional trident network. In: CVPR (2017)Google Scholar
  20. 20.
    Jang, W.D., Kim, C.S.: Interactive image segmentation via backpropagating refinement scheme. In: CVPR (2019)Google Scholar
  21. 21.
    Jang, W.D., Lee, C., Kim, C.S.: Primary object segmentation in videos via alternate convex optimization of foreground and background distributions. In: CVPR (2016)Google Scholar
  22. 22.
    Koh, Y.J., Jang, W.D., Kim, C.S.: POD: Discovering primary objects in videos based on evolutionary refinement of object recurrence, background, and primary object models. In: CVPR (2016)Google Scholar
  23. 23.
    Koh, Y.J., Kim, C.S.: Primary object segmentation in videos based on region augmentation and reduction. In: CVPR (2017)Google Scholar
  24. 24.
    Koh, Y.J., Lee, Y.-Y., Kim, C.-S.: Sequential clique optimization for video object segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. LNCS, vol. 11218, pp. 537–556. Springer, Cham (2018). Scholar
  25. 25.
    Lempitsky, V.S., Kohli, P., Rother, C., Sharp, T.: Image segmentation with a bounding box prior. In: ICCV (2009)Google Scholar
  26. 26.
    Maninis, K.K., et al.: Video object segmentation without temporal information. IEEE Trans. Pattern Anal. Mach. Intell. 41(6), 1515–1530 (2018)CrossRefGoogle Scholar
  27. 27.
    Maninis, K.K., Caelles, S., Pont-Tuset, J., Van Gool, L.: Deep extreme cut: from extreme points to object segmentation. In: CVPR (2018)Google Scholar
  28. 28.
    Najafi, M., Kulharia, V., Ajanthan, T., Torr, P.: Similarity learning for dense label transfer. In: CVPRW (2018)Google Scholar
  29. 29.
    Oh, S.W., Lee, J.Y., Sunkavalli, K., Kim, S.J.: Fast video object segmentation by reference-guided mask propagation. In: CVPR (2018)Google Scholar
  30. 30.
    Oh, S.W., Lee, J.Y., Xu, N., Kim, S.J.: Fast user-guided video object segmentation by interaction-and-propagation networks. In: CVPR (2019)Google Scholar
  31. 31.
    Oh, S.W., Lee, J.Y., Xu, N., Kim, S.J.: Video object segmentation using space-time memory networks. In: ICCV (2019)Google Scholar
  32. 32.
    Papazoglou, A., Ferrari, V.: Fast object segmentation in unconstrained video. In: ICCV (2013)Google Scholar
  33. 33.
    Perazzi, F., Khoreva, A., Benenson, R., Schiele, B., Sorkine-Hornung, A.: Learning video object segmentation from static images. In: CVPR (2017)Google Scholar
  34. 34.
    Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: CVPR (2016)Google Scholar
  35. 35.
    Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., Van Gool, L.: The 2017 DAVIS challenge on video object segmentation. arXiv:1704.00675 (2017)
  36. 36.
    Price, B.L., Morse, B.S., Cohen, S.: LIVEcut: learning-based interactive video segmentation by evaluation of multiple propagated cues. In: ICCV (2009)Google Scholar
  37. 37.
    Ren, H., Yang, Y., Liu, X.: Robust multiple object mask propagation with efficient object tracking. In: CVPRW (2019)Google Scholar
  38. 38.
    Rother, C., Kolmogorov, V., Blake, A.: GrabCut: interactive foreground extraction using iterated graph cuts. ACM Trans. Graph. 23(3), 309–314 (2004)CrossRefGoogle Scholar
  39. 39.
    Shankar Nagaraja, N., Schmidt, F.R., Brox, T.: Video segmentation with just a few strokes. In: ICCV (2015)Google Scholar
  40. 40.
    Song, G., Myeong, H., Lee, K.M.: SeedNet: automatic seed generation with deep reinforcement learning for robust interactive segmentation. In: CVPR (2018)Google Scholar
  41. 41.
    Song, H., Wang, W., Zhao, S., Shen, J., Lam, K.-M.: Pyramid dilated deeper ConvLSTM for video salient object detection. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11215, pp. 744–760. Springer, Cham (2018). Scholar
  42. 42.
    Tang, M., Gorelick, L., Veksler, O., Boykov, Y.: GrabCut in one cut. In: ICCV (2013)Google Scholar
  43. 43.
    Tokmakov, P., Alahari, K., Schmid, C.: Learning motion patterns in videos. In: CVPR (2017)Google Scholar
  44. 44.
    Voigtlaender, P., Chai, Y., Schroff, F., Adam, H., Leibe, B., Chen, L.C.: FEELVOS: fast end-to-end embedding learning for video object segmentation. In: CVPR (2019)Google Scholar
  45. 45.
    Voigtlaender, P., Leibe, B.: Online adaptation of convolutional neural networks for video object segmentation. In: BMVC (2017)Google Scholar
  46. 46.
    Wang, J., Bhat, P., Colburn, R.A., Agrawala, M., Cohen, M.F.: Interactive video cutout. ACM Trans. Graph. 24(3), 585–594 (2005)CrossRefGoogle Scholar
  47. 47.
    Wang, W., Shen, J., Porikli, F.: Saliency-aware geodesic video object segmentation. In: CVPR (2015)Google Scholar
  48. 48.
    Wang, W., et al.: Learning unsupervised video object segmentation through visual attention. In: CVPR (2019)Google Scholar
  49. 49.
    Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: CVPR (2018)Google Scholar
  50. 50.
    Wu, J., Zhao, Y., Zhu, J.Y., Luo, S., Tu, Z.: MILCut: a sweeping line multiple instance learning paradigm for interactive image segmentation. In: CVPR (2014)Google Scholar
  51. 51.
    Xie, S., Tu, Z.: Holistically-nested edge detection. In: ICCV (2015)Google Scholar
  52. 52.
    Xu, N., Price, B., Cohen, S., Yang, J., Huang, T.S.: Deep interactive object selection. In: CVPR (2016)Google Scholar
  53. 53.
    Xu, N., et al.: YouTube-VOS: a large-scale video object segmentation benchmark. arXiv:1809.03327 (2018)
  54. 54.
    Yang, L., Wang, Y., Xiong, X., Yang, J., Katsaggelos, A.K.: Efficient video object segmentation via network modulation. In: CVPR (2018)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. 1.School of Electrical EngineeringKorea UniversitySeoulKorea
  2. 2.Department of Computer Science and EngineeringChungnam National UniversityDaejeonKorea

Personalised recommendations