ScribbleBox: Interactive Annotation Framework for Video Object Segmentation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12358)


Manually labeling video datasets for segmentation tasks is extremely time-consuming. We introduce ScribbleBox, an interactive framework that substantially improves the efficiency of annotating object instances with masks in videos. We split annotation into two steps: annotating objects with tracked boxes, and labeling masks inside these tracks. We introduce automation and interaction in both steps. Box tracks are annotated efficiently by approximating the trajectory with a parametric curve that has a small number of control points, which the annotator can interactively correct. Our approach tolerates a modest amount of noise in box placements, and thus typically requires only a few clicks to annotate a track to sufficient accuracy. Segmentation masks are corrected via scribbles, which are propagated through time. We show significant gains in annotation efficiency over past work: ScribbleBox reaches 88.92% J&F on DAVIS2017 with an average of 9.14 clicks per box track, and only 4 frames requiring scribble annotation in a video of 65.3 frames on average.
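The box-track step can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps in piecewise-linear interpolation (`numpy.interp`) for the parametric curve, and the function names, the IoU threshold of 0.8, the synthetic trajectory, and the simulated annotator that always corrects the worst frame are all assumptions made for illustration.

```python
import numpy as np

def interpolate_track(key_frames, key_boxes, num_frames):
    """Piecewise-linear box track from a few control points (stand-in curve).

    key_frames: sorted frame indices; key_boxes: (K, 4) boxes [x1, y1, x2, y2].
    """
    frames = np.arange(num_frames)
    key_boxes = np.asarray(key_boxes, dtype=float)
    return np.stack(
        [np.interp(frames, key_frames, key_boxes[:, d]) for d in range(4)],
        axis=1,
    )

def box_iou(a, b):
    """Per-frame IoU between two (T, 4) arrays of boxes."""
    x1 = np.maximum(a[:, 0], b[:, 0])
    y1 = np.maximum(a[:, 1], b[:, 1])
    x2 = np.minimum(a[:, 2], b[:, 2])
    y2 = np.minimum(a[:, 3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter)

def annotate_track(true_boxes, iou_threshold=0.8):
    """Simulated annotator (hypothetical): start from the first and last
    frame, then add a control point at the worst frame until every frame
    meets the IoU threshold."""
    num_frames = len(true_boxes)
    key_frames = [0, num_frames - 1]
    while True:
        track = interpolate_track(key_frames, true_boxes[key_frames], num_frames)
        ious = box_iou(track, true_boxes)
        worst = int(np.argmin(ious))
        if ious[worst] >= iou_threshold:
            return track, key_frames
        # Control points reproduce the true box exactly, so `worst` is new.
        key_frames = sorted(key_frames + [worst])

# Synthetic 65-frame trajectory: a 40x30 box drifting right on a wavy path.
t = np.arange(65)
cx, cy = 100 + 3 * t, 80 + 30 * np.sin(t / 10)
true_boxes = np.stack([cx - 20, cy - 15, cx + 20, cy + 15], axis=1)

track, key_frames_used = annotate_track(true_boxes)
print(f"{len(key_frames_used)} control points for {len(true_boxes)} frames")
```

Because the loop stops as soon as every frame clears the threshold rather than when the track is exact, it mirrors the tolerance to modest box noise described above: a looser threshold needs fewer corrections.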



This work was supported by NSERC. SF acknowledges the Canada CIFAR AI Chair award at the Vector Institute.




Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. University of Toronto, Toronto, Canada
  2. Vector Institute, Toronto, Canada
  3. NVIDIA, Santa Clara, USA
