International Journal of Computer Vision, Volume 126, Issue 7, pp 714–730

Predicting Foreground Object Ambiguity and Efficiently Crowdsourcing the Segmentation(s)

  • Danna Gurari
  • Kun He
  • Bo Xiong
  • Jianming Zhang
  • Mehrnoosh Sameki
  • Suyog Dutt Jain
  • Stan Sclaroff
  • Margrit Betke
  • Kristen Grauman


We propose the ambiguity problem for the foreground object segmentation task and motivate the importance of estimating and accounting for this ambiguity when designing vision systems. Specifically, we distinguish between images that lead multiple annotators to segment different foreground objects (ambiguous) and images that yield only minor inter-annotator differences for the same object. Taking images from eight widely used datasets, we crowdsource labels marking each image as “ambiguous” or “not ambiguous” to segment, and use these labels to construct a new dataset we call STATIC. Using STATIC, we develop a system that automatically predicts which images are ambiguous. Experiments demonstrate the advantage of our prediction system over existing saliency-based methods on images from vision benchmarks and images taken by blind people who are trying to recognize objects in their environment. Finally, we introduce a crowdsourcing system that reduces the cost of collecting the diversity of all valid “ground truth” foreground object segmentations by collecting extra segmentations only when ambiguity is expected. Experiments show our system eliminates up to 47% of human effort compared to existing crowdsourcing methods with no loss in capturing the diversity of ground truths.
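
The idea of collecting extra segmentations only when ambiguity is expected can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 0.5 threshold, and the counts of one base segmentation plus four extra ones are hypothetical values chosen for illustration, and the ambiguity scores are assumed to come from some predictor that outputs values in [0, 1].

```python
from typing import List, Sequence


def allocate_segmentations(
    ambiguity_scores: Sequence[float],
    threshold: float = 0.5,  # hypothetical decision threshold
    base: int = 1,           # hypothetical: segmentations for every image
    extra: int = 4,          # hypothetical: additional ones if ambiguous
) -> List[int]:
    """Return how many segmentations to request for each image.

    Images whose predicted ambiguity score reaches the threshold get
    base + extra segmentations; all other images get only base.
    """
    return [base + extra if s >= threshold else base for s in ambiguity_scores]


def effort_saved(allocation: Sequence[int], per_image_baseline: int) -> float:
    """Fraction of annotation effort saved relative to a fixed baseline
    that requests per_image_baseline segmentations for every image."""
    if not allocation or per_image_baseline <= 0:
        raise ValueError("need at least one image and a positive baseline")
    baseline_total = per_image_baseline * len(allocation)
    return 1.0 - sum(allocation) / baseline_total


# Toy scores for four images (made up, not data from the paper).
scores = [0.9, 0.1, 0.2, 0.7]
plan = allocate_segmentations(scores)
saving = effort_saved(plan, per_image_baseline=5)
```

In this toy setting the saving depends entirely on how many images the predictor flags as ambiguous, which is why the quality of the ambiguity prediction drives the cost reduction.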


Salient object detection, Segmentation, Crowdsourcing



The authors gratefully acknowledge funding from the Office of Naval Research (ONR YIP N00014-12-1-0754) and National Science Foundation (IIS-1421943) and thank the anonymous crowd workers for participating in our experiments.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. University of Texas at Austin, School of Information, Austin, USA
  2. Computer Science Department, University of Texas at Austin, Austin, USA
  3. Computer Science Department, Boston University, Boston, USA
  4. Adobe Research, San Jose, USA
