
Click Carving: Interactive Object Segmentation in Images and Videos with Point Clicks

Published in: International Journal of Computer Vision



We present a novel form of interactive object segmentation called Click Carving which enables accurate segmentation of objects in images and videos with only a few point clicks. Whereas conventional interactive pipelines take the user’s initialization as a starting point, we show the value in the system taking the lead even during initialization. In particular, for a given image or video frame, the system precomputes a ranked list of thousands of possible segmentation hypotheses (also referred to as object region proposals) using appearance and motion cues. The user then looks at the top-ranked proposals and clicks on the object boundary to carve away erroneous ones. This process iterates (typically 2–3 times), with the system revising the top-ranked proposal set each time, until the user is satisfied with the resulting segmentation mask. In the case of images, this mask is the final object segmentation. In the case of videos, the object region proposals rely on motion as well, and the resulting segmentation mask in the first frame is further propagated across the video to obtain a complete spatio-temporal object tube. On six challenging image and video datasets, we provide extensive comparisons with both existing work and simpler alternative methods. In all, the proposed Click Carving approach strikes an excellent balance between accuracy and human effort. It outperforms all similarly fast methods, and is competitive with or better than those requiring 2–12 times the effort.
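The carving step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names (`mask_boundary`, `carve`), the penalty rule, and the pixel tolerance `tol` are all illustrative assumptions. The idea is simply that a boundary click should lie on (or very near) the boundary of a good proposal, so proposals whose boundaries miss the click are pushed down the ranking.

```python
import numpy as np

def mask_boundary(mask):
    """Return (row, col) coordinates of a binary mask's boundary pixels.

    A pixel is on the boundary if it is foreground and has at least one
    4-connected background neighbor (computed via padding and shifts).
    """
    padded = np.pad(mask, 1, mode="constant")
    interior = (
        padded[:-2, 1:-1] & padded[2:, 1:-1] &
        padded[1:-1, :-2] & padded[1:-1, 2:]
    )
    return np.argwhere(mask & ~interior)

def carve(proposals, clicks, scores, tol=10.0):
    """Re-rank proposal masks given the user's boundary clicks.

    Proposals whose boundary passes within `tol` pixels of every click
    keep their score; others are penalized by the miss distance and thus
    "carved away" from the top of the ranking. Returns proposals sorted
    by the adjusted score, best first.
    """
    new_scores = []
    for mask, score in zip(proposals, scores):
        boundary = mask_boundary(mask)
        penalty = 0.0
        for (r, c) in clicks:
            if len(boundary) == 0:
                penalty += 1e9  # empty proposal: discard outright
                break
            d = np.sqrt(((boundary - (r, c)) ** 2).sum(axis=1)).min()
            if d > tol:
                penalty += d  # boundary misses this click
        new_scores.append(score - penalty)
    order = np.argsort(new_scores)[::-1]
    return [proposals[i] for i in order]
```

In the actual system the penalty would be combined with appearance and motion cues, and the loop above would run once per user interaction, with the revised top-ranked proposals redisplayed after each click.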





  1. More details and videos can be found at:

  2. Code available at:

  3. The unsupervised NLC method (Faktor and Irani 2014) reports state-of-the-art results on a subset of the SegTrack-v2 dataset. We were unable to reproduce these results using the publicly available NLC code, potentially because of an OS incompatibility.

  4. IVID (Shankar Nagaraja et al. 2015) does not report annotation times for SegTrack-v2, and the VSB100 dataset was not used in their experiments.

  5. More details and videos can be found at:


  • Acuna, D., Ling, H., Kar, A., & Fidler, S. (2018). Efficient interactive annotation of segmentation datasets with Polygon-RNN++. In CVPR.

  • Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In CVPR.

  • Badrinarayanan, V., Galasso, F., & Cipolla, R. (2010). Label propagation in video sequences. In CVPR.

  • Bai, X., & Sapiro, G. (2007). Distancecut: Interactive segmentation and matting of images and videos. In 2007 IEEE international conference on image processing.

  • Bai, X., Wang, J., Simons, D., & Sapiro, G. (2009) Video snapcut: Robust video object cutout using localized classifiers. In SIGGRAPH.

  • Batra, D., Kowdle, A., Parikh, D., Luo, J., & Chen, T. (2010). iCoseg: Interactive co-segmentation with intelligent scribble guidance. In CVPR.

  • Bearman, A., Russakovsky, O., Ferrari, V., & Fei-Fei, L. (2015). What’s the point: Semantic segmentation with point supervision. ArXiv e-prints.

  • Bell, S., Upchurch, P., Snavely, N., & Bala, K. (2015). Material recognition in the wild with the materials in context database. In Computer Vision and Pattern Recognition (CVPR).

  • Boykov, Y., & Jolly, M. (2001). Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In CVPR.

  • Carreira, J., & Sminchisescu, C. (2012). CPMC: Automatic object segmentation using constrained parametric min-cuts. PAMI, 34(7), 1312–1328.


  • Castrejón, L., Kundu, K., Urtasun, R., & Fidler, S. (2017). Annotating object instances with a polygon-rnn. In CVPR.

  • Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2015). Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR.

  • Cheng, M.-M., Zhang, G.-X., Mitra, N. J., Huang, X., & Hu, S.-M. (2011). Global contrast based salient region detection. In CVPR (pp. 409–416).

  • Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Faktor, A., & Irani, M. (2014). Video segmentation by non-local consensus voting. In Proceedings of the British machine vision conference. BMVA Press.

  • Fathi, A., Balcan, M., Ren, X., & Rehg, J. (2011). Combining self training and active learning for video segmentation. In BMVC.

  • Fragkiadaki, K., Arbelaez, P., Felsen, P., & Malik, J. (2015). Learning to segment moving objects in videos. In CVPR.

  • Galasso, F., Nagaraja, N. S., Cardenas, T. J., Brox, T., & Schiele, B. (2013). A unified video segmentation benchmark: Annotation, metrics and analysis. In ICCV.

  • Godec, M., Roth, P. M., & Bischof, H. (2011). Hough-based tracking of non-rigid objects. In ICCV.

  • Grundmann, M., Kwatra, V., Han, M., & Essa, I. (2010). Efficient hierarchical graph based video segmentation. In CVPR.

  • Gulshan, V., Rother, C., Criminisi, A., Blake, A., & Zisserman, A. (2010). Geodesic star convexity for interactive image segmentation. In CVPR.

  • Jain, S., & Grauman, K. (2013). Predicting sufficient annotation strength for interactive foreground segmentation. In ICCV.

  • Jain, S. D., & Grauman, K. (2014). Supervoxel-consistent foreground propagation in video. In ECCV 2014. Lecture notes in computer science (pp. 656–671). Springer.

  • Jain, S. D., & Grauman, K. (2016). Click carving: Segmenting objects in video with point clicks. In AAAI conference on human computation and crowdsourcing (HCOMP).

  • Jiang, B., Zhang, L., Lu, H., Yang, C., & Yang, M.-H. (2013). Saliency detection via absorbing markov chain. In ICCV.

  • Karasev, V., Ravichandran, A., & Soatto, S. (2014). Active frame, location, and detector selection for automated and manual video annotation. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Kass, M., Witkin, A., & Terzopoulos, D. (1988). Snakes: Active contour models. In IJCV (pp. 321–331).

  • Kohli, P., Nickisch, H., Rother, C., & Rhemann, C. (2012). User-centric learning and evaluation of interactive segmentation systems. IJCV, 100(3), 261–274.


  • Krähenbühl, P., & Koltun, V. (2014). Geodesic object proposals. In ECCV 2014, Zurich, Switzerland, proceedings, part V (pp. 725–739). Cham: Springer.

  • Krause, A., & Guestrin, C. (2007). Near-optimal observation selection using submodular functions. In National conference on artificial intelligence (AAAI), nectar track.

  • Lee, Y. J., Kim, J., & Grauman, K. (2011). Key-segments for video object segmentation. In ICCV.

  • Lempitsky, V. S., Kohli, P., Rother, C., & Sharp, T. (2009). Image segmentation with a bounding box prior. In ICCV

  • Levinkov, E., Tompkin, J., Bonneel, N., Kirchhoff, S., Andres, B., & Pfister, H. (2016). Interactive multicut video segmentation. In Proceedings of the 24th Pacific conference on computer graphics and applications: Short papers (pp. 33–38).

  • Li, F., Kim, T., Humayun, A., Tsai, D., & Rehg, J. M. (2013). Video segmentation by tracking many figure-ground segments. In ICCV.

  • Li, X., Zhao, L., Wei, L., Yang, M.-H., Fei, W., Zhuang, Y., et al. (2016). DeepSaliency: Multi-task deep neural network model for salient object detection. IEEE TIP, 25(8), 3919–3930.


  • Li, Y., Hou, X., Koch, C., Rehg, J. M., & Yuille, A. L. (2014). The secrets of salient object segmentation. In CVPR.

  • Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In ECCV.

  • Liu, T., Yuan, Z., Sun, J., Wang, J., Zheng, N., Tang, X., et al. (2011). Learning to detect a salient object. PAMI, 33(2), 353–367.


  • Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In CVPR.

  • Ma, T., & Latecki, L. (2012). Maximum weight cliques with mutex constraints for video object segmentation. In CVPR.

  • Malisiewicz, T., & Efros, A. A. (2007). Spatial support for objects via multiple segmentations. In BMVC.

  • Malmberg, F., Strand, R., & Nyström, I. (2011). Generalized hard constraints for graph segmentation. In SCIA.

  • McGuinness, K., & O’Connor, N. E. (2010). A comparative evaluation of interactive segmentation algorithms. Pattern Recognition, 43(2), 434–444. Interactive Imaging and Vision.


  • Mortensen, E., & Barrett, W. (1995). Intelligent scissors for image composition. In SIGGRAPH.

  • Nickisch, H., Rother, C., Kohli, P., & Rhemann, C. (2010). Learning an interactive segmentation system. In Proceedings of the seventh Indian conference on computer vision, graphics and image processing, ICVGIP ’10 (pp. 274–281). New York, NY: ACM.

  • Noh, H., Hong, S., & Han, B. (2015). Learning deconvolution network for semantic segmentation. In 2015 IEEE international conference on computer vision (ICCV).

  • Oneata, D., Revaud, J., Verbeek, J., & Schmid, C. (2014). Spatio-temporal object detection proposals. In ECCV.

  • Papadopoulos, D., Uijlings, J., Keller, F., & Ferrari, V. (2017). Training object class detectors with click supervision. In CVPR.

  • Papazoglou, A., & Ferrari, V. (2013). Fast object segmentation in unconstrained video. In ICCV.

  • Perazzi, F., Krähenbühl, P., Pritch, Y., & Hornung, A. (2012). Saliency filters: Contrast based filtering for salient region detection. In CVPR (pp. 733–740).

  • Pinheiro, P. O., Collobert, R., & Dollár, P. (2015). Learning to segment object candidates. In NIPS

  • Pont-Tuset, J., Farré, M. A., & Smolic, A. (2015). Semi-automatic video object segmentation by advanced manipulation of segmentation hierarchies. In International workshop on content-based multimedia indexing (CBMI).

  • Ren, X., & Malik, J. (2007). Tracking as repeated figure/ground segmentation. In CVPR.

  • Rother, C., Kolmogorov, V., & Blake, A. (2004). Grabcut-interactive foreground extraction using iterated graph cuts. In SIGGRAPH.

  • Russakovsky, O., Li, L.-J., & Fei-Fei, L. (2015). Best of both worlds: Human–machine collaboration for object annotation. In CVPR.

  • Shankar Nagaraja, N., Schmidt, F. R., & Brox, T. (2015). Video segmentation with just a few strokes. In ICCV.

  • Sundberg, P., Brox, T., Maire, M., Arbelaez, P., & Malik, J. (2011). Occlusion boundary detection and figure/ground assignment from optical flow. In CVPR, Washington, DC, USA.

  • Tsai, D., Flagg, M., & Rehg, J. (2010). Motion coherent tracking with multi-label mrf optimization. In BMVC.

  • The OpenCV reference manual, April 2014.

  • Uijlings, J. R. R., van de Sande, K. E. A., Gevers, T., & Smeulders, A. W. M. (2013). Selective search for object recognition. International Journal of Computer Vision, 104(2), 154–171.


  • Vijayanarasimhan, S., & Grauman, K. (2012). Active frame selection for label propagation in videos. In ECCV.

  • Vondrick, C., & Ramanan, D. (2011). Video annotation and tracking with active learning. In NIPS.

  • Wang, J., Bhat, P., Colburn, A., Agrawala, M., & Cohen, M. F. (2005). Interactive video cutout. ACM Transactions on Graphics, 24(3), 585–594.


  • Wang, T., Han, B., & Collomosse, J. (2014). Touchcut: Fast image and video segmentation using single-touch interaction. Computer Vision and Image Understanding, 120, 14–30.


  • Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2015). Learning to detect motion boundaries. In CVPR 2015, Boston, United States.

  • Wen, L., Du, D., Lei, Z., Li, S. Z., & Yang, M.-H. (2015). Jots: Joint online tracking and segmentation. In CVPR.

  • Wu, Z., Li, F., Sukthankar, R., & Rehg, J. M. (2015). Robust video segment proposals with painless occlusion handling. In CVPR.

  • Xu, N., Price, B. L., Cohen, S., Yang, J., & Huang, T. S. (2016). Deep interactive object selection. CVPR (pp. 373–381).

  • Yu, G., & Yuan, J. (2015). Fast action proposals for human action detection and search. In CVPR.

  • Zhang, D., Javed, O., & Shah, M. (2013). Video object segmentation through spatially accurate and temporally dense extraction of primary object regions. In CVPR.

  • Zhao, R., Ouyang, W., Li, H., & Wang, X. (2015). Saliency detection by multi-context learning. In CVPR.

  • Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., et al. (2015). Conditional random fields as recurrent neural networks. In ICCV.



Acknowledgements

This research is supported in part by ONR PECASE N00014-15-1-2291, NSF IIS-1514118, a gift from Qualcomm and a gift from AWS Machine Learning. We would like to thank Shankar Nagaraja for providing the iVideoseg dataset timing data. We also thank all the participants in our user studies.

Author information



Corresponding author

Correspondence to Suyog Dutt Jain.

Additional information

Communicated by Jakob Verbeek.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Jain, S.D., Grauman, K. Click Carving: Interactive Object Segmentation in Images and Videos with Point Clicks. Int J Comput Vis 127, 1321–1344 (2019).

