Predicting Foreground Object Ambiguity and Efficiently Crowdsourcing the Segmentation(s)


We propose the ambiguity problem for the foreground object segmentation task and motivate the importance of estimating and accounting for this ambiguity when designing vision systems. Specifically, we distinguish between images that lead multiple annotators to segment different foreground objects (ambiguous) and images for which all annotators segment the same object with only minor inter-annotator differences (not ambiguous). Taking images from eight widely used datasets, we crowdsource labeling the images as “ambiguous” or “not ambiguous” to segment in order to construct a new dataset we call STATIC. Using STATIC, we develop a system that automatically predicts which images are ambiguous. Experiments demonstrate the advantage of our prediction system over existing saliency-based methods on images from vision benchmarks and images taken by blind people who are trying to recognize objects in their environment. Finally, we introduce a crowdsourcing system that achieves cost savings when collecting the diversity of all valid “ground truth” foreground object segmentations by soliciting extra segmentations only when ambiguity is expected. Experiments show our system eliminates up to 47% of human effort compared to existing crowdsourcing methods with no loss in capturing the diversity of ground truths.




  1.

    Also, sometimes referred to as salient object detection (Borji et al. 2015; Cheng et al. 2015).

  2.

    We excluded all images for which the majority of three crowd workers indicated the answer to their visual question could be recognized by text in the image.

  3.

    We explore classifiers for the task, treating ambiguity as a binary random variable. One could further consider ambiguity as a continuous value when sufficient annotator votes are available for training, in which case classifiers would be replaced with regression.
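A minimal sketch of the two framings described in this note, using synthetic features and scikit-learn models purely as stand-ins (the actual features and classifiers are not specified here):

```python
# Illustrative sketch: ambiguity as a binary classification target, with a
# regression variant when fractional annotator votes are available.
# All data and models below are synthetic placeholders, not the paper's system.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(0)

# Synthetic per-image features (stand-ins for real image descriptors).
X = rng.normal(size=(200, 5))

# Fraction of annotators who voted "ambiguous" for each image (synthetic).
vote_fraction = 1 / (1 + np.exp(-X @ np.array([1.5, -2.0, 0.5, 0.0, 1.0])))

# Binary framing: threshold the annotator vote at a majority.
y_binary = (vote_fraction > 0.5).astype(int)
clf = LogisticRegression().fit(X, y_binary)

# Continuous framing: regress the vote fraction directly.
reg = Ridge().fit(X, vote_fraction)

print("classifier training accuracy:", clf.score(X, y_binary))
```

The only change between the two framings is the target: a thresholded label for classification versus the raw vote fraction for regression.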

  4.

  5.

    Collecting annotations from multiple independent annotators is necessary to avoid annotator bias (e.g., Berkeley Segmentation Dataset, Martin et al. 2001; MSRA, Liu et al. 2011). As discussed in Sect. 2, this is in stark contrast to dataset collection systems that solicit redundant annotations by showing each new annotator all previously collected segmentations overlaid on the image (e.g., LabelMe, Russell et al. 2008; VOC, Everingham et al. 2010; MSCOCO, Lin et al. 2014). This design difference stems from different aims: the latter systems aim to annotate all objects in an image (possibly only for a pre-defined set of object categories), whereas the former focus on localizing the single object deemed most prominent according to human perception.

  6.

    This method requires a minimum of two segmentations per image and allocates a different number of additional annotations for different images.
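The allocation rule described in this note can be sketched as follows; the pairwise-IoU agreement test, the 0.5 threshold, and the fixed budget of extra annotations are illustrative assumptions, not the paper's exact criterion:

```python
# Illustrative sketch of budget allocation: start with two segmentations per
# image and request extra annotations only when the first two disagree.
# The IoU threshold (0.5) and extra budget (3) are assumptions for illustration.
import numpy as np

def iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection-over-union of two binary masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 1.0

def annotations_needed(seg1: np.ndarray, seg2: np.ndarray,
                       extra: int = 3, threshold: float = 0.5) -> int:
    """Return how many additional segmentations to collect for this image."""
    return extra if iou(seg1, seg2) < threshold else 0

# Two annotators segmenting different objects -> low IoU -> extra labels.
a = np.zeros((10, 10), bool); a[:5, :5] = True
b = np.zeros((10, 10), bool); b[6:, 6:] = True
print(annotations_needed(a, b))  # prints 3: disagreement triggers more labels
```

With this shape of rule, per-image cost varies: images whose first two segmentations agree stop at two, while suspected-ambiguous images receive the extra budget.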


  1. Achanta, R., Hemami, S., Estrada, F., & Susstrunk, S. (2009). Frequency-tuned salient region detection. In IEEE conference on computer vision and pattern recognition (CVPR).

  2. Alpert, S., Galun, M., Basri, R., & Brandt, A. (2007). Image segmentation by probabilistic bottom-up aggregation and cue integration. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1–8).

  3. Arbelaez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 328–335).

  4. Berg, T. L., & Berg, A. C. (2009). Finding iconic images. In The 2nd internet vision workshop at conference on computer vision and pattern recognition.

  5. Berg, A., Berg, T., Daume, H., Dodge, J., Goyal, A., & Han, X. et al. (2012). Understanding and predicting importance in images. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3562–3569).

  6. Biancardi, A. M., Jirapatnakul, A. C., & Reeves, A. P. (2010). A comparison of ground truth estimation methods. International Journal of Computer Assisted Radiology and Surgery, 5(3), 295–305.


  7. Bigham, J. P., Jayant, C., Ji, H., Little, G., Miller, A., & Miller, R. C. et al. (2010). VizWiz: Nearly real-time answers to visual questions. In ACM symposium on user interface software and technology (UIST) (pp. 333–342).

  8. Borenstein, E., & Ullman, S. (2008). Combined top-down/bottom-up segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(12), 2109–2125.


  9. Borji, A., Sihite, D. N., & Itti, L. (2013). What stands out in a scene? A study of human explicit saliency judgment. Vision Research, 91, 62.


  10. Borji, A., Cheng, M., Jiang, H., & Li, J. (2015). Salient object detection: A benchmark. IEEE Transactions on Image Processing, 24(12), 5706.


  11. Brady, E., Morris, M. R., Zhong, Y., White, S., & Bigham, J. P. (2013). Visual challenges in the everyday lives of blind people. In Proceedings of the SIGCHI conference on human factors in computing systems (pp. 2117–2126).

  12. Caselles, V., Kimmel, R., & Sapiro, G. (1997). Geodesic active contours. International Journal of Computer Vision, 22(1), 61–79.


  13. Chen, T., Cheng, M., Tan, P., Shamir, A., & Hu, S. (2009). Sketch2photo: Internet image montage. ACM Transactions on Graphics, 28(5), 124.


  14. Cheng, M., Mitra, N. J., Huang, X., Torr, P. H. S., & Hu, S. (2015). Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3), 569–582.


  15. Cholleti, S. R., Goldman, S. A., Blum, A., Politte, D. G., Don, S., Smith, K., et al. (2009). Veritas: Combining expert opinions without labeled data. International Journal on Artificial Intelligence Tools, 18(5), 633–651.


  16. Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE conference on computer vision and pattern recognition (CVPR).

  17. Dollar, P., & Zitnick, C. L. (2015). Fast edge detection using structured forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(8), 1558–1570.


  18. Everingham, M., Gool, L. V., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2), 303–338.


  19. Feng, J., Wei, Y., Tao, L., Zhang, C., & Sun, J. (2011). Salient object detection by composition. In IEEE International conference on computer vision (ICCV) (pp. 1028–1035).

  20. García-Pérez, M. A. (1989). Visual inhomogeneity and eye movements in multistable perception. Perception & Psychophysics, 46(4), 397–400.


  21. Gilbert, E. (2014). What if we ask a different question? Social inferences create product ratings faster. In CHI extended abstracts (pp. 2759–2762).

  22. Gulshan, V., Rother, C., Criminisi, A., Blake, A., & Zisserman, A. (2010). Geodesic star convexity for interactive image segmentation. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3129–3136).

  23. Gurari, D., Sameki, M., & Betke, M. (2016). Investigating the influence of data familiarity to improve the design of a crowdsourcing image annotation system. In AAAI conference on human computation and crowdsourcing (HCOMP) (pp. 59–68).



  26. Jain, S. D., & Grauman, K. (2013). Predicting sufficient annotation strength for interactive foreground segmentation. In IEEE international conference on computer vision (ICCV) (pp. 1313–1320). IEEE.

  27. Jas, M., & Parikh, D. (2015). Image specificity. In IEEE conference on computer vision and pattern recognition (CVPR).

  28. Jayant, C., Ji, H., White, S., & Bigham, J. P. (2011). Supporting blind photography. In ASSETS.

  29. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., & Girshick, R. (2014). Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.

  30. Jiang, H., Wang, J., Yuan, Z., Wu, Y., Zheng, N., & Li, S. (2013). Salient object detection: A discriminative regional feature integration approach. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2083–2090).

  31. Kohlberger, T., Singh, V., Alvino, C., Bahlmann, C., & Grady, L. (2012). Evaluating segmentation error without ground truth. In Medical image computing and computer assisted intervention (MICCAI) (pp. 528–536).

  32. Kovashka, A., Parikh, D., & Grauman, K. (2014). Discovering attribute shades of meaning with the crowd. International Journal of Computer Vision (IJCV), 115(2), 185–210.


  33. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems (NIPS) (pp. 1097–1105).

  34. Leopold, D., Wilke, M., Maier, A., Logothetis, N., Blake, A. R. et al. (2004). Binocular rivalry and the illusion of monocular vision. In Binocular rivalry (pp. 231–259). MIT Press.

  35. Li, S., Purushotham, S., Chen, C., Ren, Y., & Kuo, C. (2017). Measuring and predicting tag importance for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence.

  36. Lin, T., Maire, M., Belongie, S., Hays, J., Perona, P., & Ramanan, D. et al. (2014). Microsoft COCO: Common objects in context. In European conference on computer vision (ECCV) (pp. 740–755).

  37. Liu, D., Xiong, Y., Pulli, K., & Shapiro, L. (2011). Estimating image segmentation difficulty. In Machine learning and data mining in pattern recognition (pp. 484–495).

  38. Liu, T., Yuan, Z., Sun, J., Wang, J., Zheng, N., Tang, X., et al. (2011). Learning to detect a salient object. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(2), 353–367.


  39. Margolin, R., Zelnik-Manor, L., Tal, A. (2014). How to evaluate foreground maps? In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 248–255).

  40. Martin, D., Fowlkes, C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In IEEE international conference on computer vision (ICCV) (Vol. 2, pp. 416–423).


  41. Meger, D., Forssen, P., Lai, K., Helmer, S., McCann, S., Southey, T., et al. (2008). Curious george: An attentive semantic robot. Robotics and Autonomous Systems, 56(6), 503–511.


  42. Perronnin, F., Sánchez, J., & Mensink, T. (2010). Improving the fisher kernel for large-scale image classification. In European conference on computer vision (ECCV).

  43. Rother, C., Kolmogorov, V., & Blake, A. (2004). Grabcut: Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 23(3), 309–314.


  44. Russell, B. C., Torralba, A., Murphy, K. P., & Freeman, W. T. (2008). LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision, 77(1–3), 157–173.


  45. Shaw, A. D., Horton, J. J., & Chen, D. L. (2011). Designing incentives for inexpert human raters. In ACM conference on computer supported cooperative work (CSCW) (pp. 275–284).

  46. Sheshadri, A., & Lease, M. (2013). SQUARE: A benchmark for research on computing crowd consensus. In AAAI conference on human computation and crowdsourcing (HCOMP) (pp. 156–164).

  47. Shi, Y., & Karl, W. C. (2008). A real-time algorithm for the approximation of level-set based curve evolution. IEEE Transactions on Image Processing, 17(5), 645–656.


  48. Shotton, J., Winn, J., Rother, C., & Criminisi, A. (2006). Textonboost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In European conference on computer vision (ECCV) (pp. 1–15). Springer.

  49. Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

  50. Torralba, A., Murphy, K. P., Freeman, W. T., & Rubin, M. A. (2003). Context-based vision system for place and object recognition. In IEEE international conference on computer vision (ICCV).

  51. Vázquez, M., & Steinfeld, A. (2014). An assisted photography framework to help visually impaired users properly aim a camera. ACM Transactions on Computer-Human Interaction (TOCHI), 21, 25.


  52. Vijayanarasimhan, S., & Grauman, K. (2011). Cost-sensitive active visual category learning. International Journal of Computer Vision, 91, 24–44.


  53. Warfield, S. K., Zou, K. H., & Wells, W. M. (2004). Simultaneous truth and performance level estimation (STAPLE): An algorithm for the validation of image segmentation. IEEE Transactions on Medical Imaging, 23(7), 903–921.


  54. Welinder, P., & Perona, P. (2010). Online crowdsourcing: Rating annotators and obtaining cost-effective labels. In 2010 IEEE computer society conference on computer vision and pattern recognition workshops (CVPRW) (pp. 25–32).

  55. Welinder, P., Branson, S., Belongie, S., & Perona, P. (2010). The multidimensional wisdom of crowds. In Advances in neural information processing systems (NIPS) (pp. 2424–2432).

  56. Whitehill, J., Wu, T., Bergsma, J., Movellan, J. R. & Ruvolo, P. L. (2009). Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in neural information processing systems (NIPS) (pp. 2035–2043).

  57. Zhang, J., Ma, S., Sameki, M., Sclaroff, S., Betke, M., & Lin, Z. et al. (2015). Salient object subitizing. In IEEE conference on computer vision and pattern recognition (CVPR).

  58. Zhong, Y., Garrigues, P. J. & Bigham, J. P. (2013). Real time object scanning using a mobile phone and cloud-based visual search engine. In SIGACCESS conference on computers and accessibility (p. 20).



Acknowledgements

The authors gratefully acknowledge funding from the Office of Naval Research (ONR YIP N00014-12-1-0754) and National Science Foundation (IIS-1421943) and thank the anonymous crowd workers for participating in our experiments.

Author information



Corresponding author

Correspondence to Danna Gurari.

Additional information

Communicated by Deva Ramanan.


About this article


Cite this article

Gurari, D., He, K., Xiong, B. et al. Predicting Foreground Object Ambiguity and Efficiently Crowdsourcing the Segmentation(s). Int J Comput Vis 126, 714–730 (2018).



Keywords

  • Salient object detection
  • Segmentation
  • Crowdsourcing