We propose the ambiguity problem for the foreground object segmentation task and motivate the importance of estimating and accounting for this ambiguity when designing vision systems. Specifically, we distinguish between images which lead multiple annotators to segment different foreground objects (ambiguous) versus minor inter-annotator differences of the same object. Taking images from eight widely used datasets, we crowdsource labeling the images as “ambiguous” or “not ambiguous” to segment in order to construct a new dataset we call STATIC. Using STATIC, we develop a system that automatically predicts which images are ambiguous. Experiments demonstrate the advantage of our prediction system over existing saliency-based methods on images from vision benchmarks and images taken by blind people who are trying to recognize objects in their environment. Finally, we introduce a crowdsourcing system to achieve cost savings for collecting the diversity of all valid “ground truth” foreground object segmentations by collecting extra segmentations only when ambiguity is expected. Experiments show our system eliminates up to 47% of human effort compared to existing crowdsourcing methods with no loss in capturing the diversity of ground truths.
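The cost-saving idea in the abstract, collecting extra segmentations only when ambiguity is expected, can be illustrated with a minimal sketch. It is not the authors' system: the function names, the 0.5 threshold, the `base` and `extra` counts, and the uniform-collection baseline are all hypothetical choices made for illustration.

```python
from typing import List


def allocate_segmentations(p_ambiguous: List[float],
                           threshold: float = 0.5,
                           base: int = 1,
                           extra: int = 4) -> List[int]:
    """Assign a segmentation budget per image (hypothetical rule).

    Images whose predicted ambiguity probability reaches the threshold
    get `base + extra` segmentations; all others get only `base`.
    """
    return [base + extra if p >= threshold else base for p in p_ambiguous]


def effort_saved(allocation: List[int], base: int = 1, extra: int = 4) -> float:
    """Fraction of annotations saved relative to an assumed baseline
    that always collects `base + extra` segmentations per image."""
    uniform = len(allocation) * (base + extra)
    return 1 - sum(allocation) / uniform


# Four images with made-up predicted ambiguity probabilities.
budget = allocate_segmentations([0.9, 0.1, 0.2, 0.7])
saving = effort_saved(budget)
```

With these made-up inputs, two of the four images receive the full budget and the other two receive a single segmentation, so the saving depends entirely on how many images the predictor flags as ambiguous.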
We excluded all images for which the majority of three crowd workers indicated the answer to their visual question could be recognized by text in the image.
We explore classifiers for the task, treating ambiguity as a binary random variable. One could further consider ambiguity as a continuous value when sufficient annotator votes are available for training, in which case classifiers would be replaced with regression.
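The two framings in this note can be made concrete with a small sketch that derives both a binary label and a continuous score from the same set of annotator votes. The strict-majority rule and the function name are assumptions for illustration; the note does not specify how votes are aggregated.

```python
from typing import Sequence, Tuple


def ambiguity_targets(votes: Sequence[bool]) -> Tuple[int, float]:
    """Turn per-annotator "ambiguous" votes into two training targets.

    Returns (binary_label, continuous_score):
      - continuous_score is the fraction of annotators voting "ambiguous",
        usable as a regression target when enough votes exist;
      - binary_label is 1 under an assumed strict-majority rule, else 0,
        usable as a classification target.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    score = sum(votes) / len(votes)
    label = int(score > 0.5)
    return label, score


label, score = ambiguity_targets([True, True, False])
```

The continuous score is only as fine-grained as the number of votes allows (three votes yield four possible values), which is why the regression framing presupposes sufficient annotator votes.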
Collecting annotations from multiple independent annotators is necessary to avoid annotator bias [e.g., Berkeley Segmentation Dataset (Martin et al. 2001), MSRA (Liu et al. 2011)]. As discussed in Sect. 2, this is in stark contrast to dataset collection systems that solicit redundant annotations by showing each new annotator all previously collected segmentations overlaid on the image [e.g., LabelMe (Russell et al. 2008), VOC (Everingham et al. 2010), MSCOCO (Lin et al. 2014)]. This design difference stems from different aims. While the latter aims to annotate all objects in an image (possibly only for a pre-defined set of object categories), the former focuses on localizing every object that some annotator deems the single most prominent object according to human perception.
This method requires a minimum of two segmentations per image and allocates a different number of additional annotations for different images.
Achanta, R., Hemami, S., Estrada, F., & Susstrunk, S. (2009). Frequency-tuned salient region detection. In IEEE conference on computer vision and pattern recognition (CVPR).
Alpert, S., Galun, M., Basri, R., & Brandt, A. (2007). Image segmentation by probabilistic bottom-up aggregation and cue integration. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1–8).
Arbelaez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 328–335).
Berg, T. L., & Berg, A. C. (2009). Finding iconic images. In The 2nd internet vision workshop at conference on computer vision and pattern recognition.
Berg, A., Berg, T., Daume, H., Dodge, J., Goyal, A., & Han, X. et al. (2012). Understanding and predicting importance in images. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3562–3569).
Biancardi, A. M., Jirapatnakul, A. C., & Reeves, A. P. (2010). A comparison of ground truth estimation methods. International Journal of Computer Assisted Radiology and Surgery, 5(3), 295–305.
Bigham, J. P., Jayant, C., Ji, H., Little, G., Miller, A., & Miller, R. C. et al. (2010). Vizwiz: Nearly real-time answers to visual questions. In ACM symposium on user interface software and technology (UIST) (pp. 333–342).
Borenstein, E., & Ullman, S. (2008). Combined top-down/bottom-up segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(12), 2109–2125.
Borji, A., Sihite, D. N., & Itti, L. (2013). What stands out in a scene? A study of human explicit saliency judgment. Vision Research, 91, 62.
Borji, A., Cheng, M., Jiang, H., & Li, J. (2015). Salient object detection: A benchmark. IEEE Transactions on Image Processing, 24(12), 5706.
Brady, E., Morris, M. R., Zhong, Y., White, S., & Bigham, J. P. (2013). Visual challenges in the everyday lives of blind people. In Proceedings of the SIGCHI conference on human factors in computing systems (pp. 2117–2126).
Caselles, V., Kimmel, R., & Sapiro, G. (1997). Geodesic active contours. International Journal of Computer Vision, 22(1), 61–79.
Chen, T., Cheng, M., Tan, P., Shamir, A., & Hu, S. (2009). Sketch2photo: Internet image montage. ACM Transactions on Graphics, 28(5), 124.
Cheng, M., Mitra, N. J., Huang, X., Torr, P. H. S., & Hu, S. (2015). Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3), 569–582.
Cholleti, S. R., Goldman, S. A., Blum, A., Politte, D. G., Don, S., Smith, K., et al. (2009). Veritas: Combining expert opinions without labeled data. International Journal on Artificial Intelligence Tools, 18(5), 633–651.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE conference on computer vision and pattern recognition (CVPR).
Dollar, P., & Zitnick, C. L. (2015). Fast edge detection using structured forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(8), 1558–1570.
Everingham, M., Gool, L. V., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2), 303–338.
Feng, J., Wei, Y., Tao, L., Zhang, C., & Sun, J. (2011). Salient object detection by composition. In IEEE International conference on computer vision (ICCV) (pp. 1028–1035).
García-Pérez, M. A. (1989). Visual inhomogeneity and eye movements in multistable perception. Perception and Psychophysics, 46(4), 397–400.
Gilbert, E. (2014). What if we ask a different question? Social inferences create product ratings faster. In CHI extended abstracts (pp. 2759–2762).
Gulshan, V., Rother, C., Criminisi, A., Blake, A., & Zisserman, A. (2010). Geodesic star convexity for interactive image segmentation. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3129–3136).
Gurari, D., Sameki, M., & Betke, M. (2016). Investigating the influence of data familiarity to improve the design of a crowdsourcing image annotation system. In AAAI conference on human computation and crowdsourcing (HCOMP) (pp. 59–68).
Jain, S. D., & Grauman, K. (2013). Predicting sufficient annotation strength for interactive foreground segmentation. In IEEE international conference on computer vision (ICCV) (pp. 1313–1320).
Jas, M., & Parikh, D. (2015). Image specificity. In IEEE conference on computer vision and pattern recognition (CVPR).
Jayant, C., Ji, H., White, S., & Bigham, J. P. (2011). Supporting blind photography. In ASSETS.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., & Girshick, R. (2014). Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.
Jiang, H., Wang, J., Yuan, Z., Wu, Y., Zheng, N., & Li, S. (2013). Salient object detection: A discriminative regional feature integration approach. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2083–2090).
Kohlberger, T., Singh, V., Alvino, C., Bahlmann, C., & Grady, L. (2012). Evaluating segmentation error without ground truth. In Medical image computing and computer assisted intervention (MICCAI) (pp. 528–536).
Kovashka, A., Parikh, D., & Grauman, K. (2014). Discovering attribute shades of meaning with the crowd. International Journal of Computer Vision, 115(2), 185–210.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems (NIPS) (pp. 1097–1105).
Leopold, D., Wilke, M., Maier, A., Logothetis, N., Blake, A. R. et al. (2004). Binocular rivalry and the illusion of monocular vision. In Binocular rivalry (pp. 231–259). MIT Press.
Li, S., Purushotham, S., Chen, C., Ren, Y., & Kuo, C. (2017). Measuring and predicting tag importance for image retrieval. In IEEE transactions on pattern analysis and machine intelligence.
Lin, T., Maire, M., Belongie, S., Hays, J., Perona, P., & Ramanan, D. et al. (2014). Microsoft COCO: Common objects in context. In IEEE European conference on computer vision (ECCV) (pp. 740–755).
Liu, D., Xiong, Y., Pulli, K., & Shapiro, L. (2011). Estimating image segmentation difficulty. In Machine learning and data mining in pattern recognition (pp. 484–495).
Liu, T., Yuan, Z., Sun, J., Wang, J., Zheng, N., Tang, X., et al. (2011). Learning to detect a salient object. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(2), 353–367.
Margolin, R., Zelnik-Manor, L., & Tal, A. (2014). How to evaluate foreground maps? In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 248–255).
Martin, D., Fowlkes, C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In International Conference on Computer Vision (ICCV), 2, 416–423.
Meger, D., Forssen, P., Lai, K., Helmer, S., McCann, S., Southey, T., et al. (2008). Curious george: An attentive semantic robot. Robotics and Autonomous Systems, 56(6), 503–511.
Perronnin, F., Sánchez, J., & Mensink, T. (2010). Improving the Fisher kernel for large-scale image classification. In European conference on computer vision (ECCV).
Rother, C., Kolmogorov, V., & Blake, A. (2004). Grabcut: Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 23(3), 309–314.
Russell, B. C., Torralba, A., Murphy, K. P., & Freeman, W. T. (2008). LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision, 77(1–3), 157–173.
Shaw, A. D., Horton, J. J., & Chen, D. L. (2011). Designing incentives for inexpert human raters. In ACM conference on computer supported cooperative work (CSCW) (pp. 275–284).
Sheshadri, A., & Lease, M. (2013). SQUARE: A benchmark for research on computing crowd consensus. In AAAI conference on human computation and crowdsourcing (HCOMP) (pp. 156–164).
Shi, Y., & Karl, W. C. (2008). A real-time algorithm for the approximation of level-set based curve evolution. IEEE Transactions on Image Processing, 17(5), 645–656.
Shotton, J., Winn, J., Rother, C., & Criminisi, A. (2006). Textonboost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In European conference on computer vision (ECCV) (pp. 1–15). Springer.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Torralba, A., Murphy, K. P., Freeman, W. T., & Rubin, M. A. (2003). Context-based vision system for place and object recognition. In International conference on computer vision (ICCV).
Vázquez, M., & Steinfeld, A. (2014). An assisted photography framework to help visually impaired users properly aim a camera. ACM Transactions on Computer-Human Interaction (TOCHI), 21, 25.
Vijayanarasimhan, S., & Grauman, K. (2011). Cost-sensitive active visual category learning. International Journal of Computer Vision, 91, 24–44.
Warfield, S. K., Zou, K. H., & Wells, W. M. (2004). Simultaneous truth and performance level estimation (STAPLE): An algorithm for the validation of image segmentation. IEEE Transactions on Medical Imaging, 23(7), 903–921.
Welinder, P., & Perona, P. (2010). Online crowdsourcing: Rating annotators and obtaining cost-effective labels. In 2010 IEEE computer society conference on computer vision and pattern recognition workshops (CVPRW) (pp. 25–32).
Welinder, P., Branson, S., Belongie, S., & Perona, P. (2010). The multidimensional wisdom of crowds. In Advances in neural information processing systems (NIPS) (pp. 2424–2432).
Whitehill, J., Wu, T., Bergsma, J., Movellan, J. R., & Ruvolo, P. L. (2009). Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in neural information processing systems (NIPS) (pp. 2035–2043).
Zhang, J., Ma, S., Sameki, M., Sclaroff, S., Betke, M., Lin, Z., et al. (2015). Salient object subitizing. In IEEE conference on computer vision and pattern recognition (CVPR).
Zhong, Y., Garrigues, P. J., & Bigham, J. P. (2013). Real time object scanning using a mobile phone and cloud-based visual search engine. In SIGACCESS conference on computers and accessibility (p. 20).
The authors gratefully acknowledge funding from the Office of Naval Research (ONR YIP N00014-12-1-0754) and National Science Foundation (IIS-1421943) and thank the anonymous crowd workers for participating in our experiments.
Communicated by Deva Ramanan.
About this article
Cite this article
Gurari, D., He, K., Xiong, B. et al. Predicting Foreground Object Ambiguity and Efficiently Crowdsourcing the Segmentation(s). Int J Comput Vis 126, 714–730 (2018). https://doi.org/10.1007/s11263-018-1065-7
- Salient object detection