Salient Object Subitizing

Abstract

We study the problem of salient object subitizing, i.e., predicting the existence and the number of salient objects in an image using holistic cues. This task is inspired by the ability of people to quickly and accurately identify the number of items within the subitizing range (1–4). To this end, we present a salient object subitizing image dataset of about 14K everyday images, annotated using an online crowdsourcing marketplace. We show that an end-to-end trained convolutional neural network (CNN) model achieves prediction accuracy comparable to human performance in identifying images with zero or one salient object. For images with multiple salient objects, our model also performs significantly better than chance without requiring any localization process. Moreover, we propose a method to improve the training of the CNN subitizing model by leveraging synthetic images. In experiments, we demonstrate the accuracy and generalizability of our CNN subitizing model and its applications in salient object detection and image retrieval.
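The abstract frames subitizing as predicting, from the whole image, whether salient objects are present and how many there are. The sketch below shows that framing as plain classification. It assumes five categories, 0, 1, 2, 3 and 4+ (counts of four or more merged into one class), and a model that emits one score per category; the function names are hypothetical, and no CNN or training code is included.

```python
from typing import Sequence

# Assumed label set: counts 0-3 kept as separate classes, 4 or more merged into "4+".
CATEGORIES = ("0", "1", "2", "3", "4+")


def count_to_category(num_salient_objects: int) -> str:
    """Map an annotated number of salient objects to a subitizing category."""
    if num_salient_objects < 0:
        raise ValueError("count must be non-negative")
    return CATEGORIES[min(num_salient_objects, 4)]


def predict_category(scores: Sequence[float]) -> str:
    """Return the category with the highest score (e.g. one softmax output per class).

    Ties go to the lower-count category because max() keeps the first maximum.
    """
    if len(scores) != len(CATEGORIES):
        raise ValueError(f"expected {len(CATEGORIES)} scores, got {len(scores)}")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return CATEGORIES[best]


def accuracy(predicted: Sequence[str], truth: Sequence[str]) -> float:
    """Fraction of images whose predicted category matches the ground truth."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("need equally sized, non-empty label lists")
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)
```

For example, `count_to_category(7)` yields `"4+"`, and `predict_category([0.1, 0.6, 0.2, 0.05, 0.05])` yields `"1"`. Merging large counts into one class keeps the label space inside the subitizing range the abstract describes, so no per-object localization is needed to produce a label.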

Notes

  1. http://www.cs.bu.edu/groups/ivc/Subitizing/.

  2. We use the subset of ImageNet images with bounding box annotations.

  3. The F-score is computed as \(\frac{2RP}{R+P}\), where R and P denote recall and precision, respectively.

  4. When evaluated on the test set used by Zhang et al. (2015a), our best method, GoogleNet_Syn_FT, achieves an mAP score of 85.0%.
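Notes 3 and 4 report results as F-score and mAP. The sketch below computes both from scratch. The F-score follows the formula in Note 3 directly. The average precision here is the common non-interpolated definition (mean precision at the rank of each positive), which is an assumption: the exact AP protocol used in the article (for example, PASCAL VOC-style interpolation) is not stated in this section. The function names are hypothetical.

```python
from typing import Sequence, Tuple


def f_score(recall: float, precision: float) -> float:
    """F-score 2RP / (R + P), the harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


def average_precision(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Non-interpolated AP: average of the precision at each positive's rank."""
    # Rank items by descending confidence score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(
    per_class: Sequence[Tuple[Sequence[float], Sequence[bool]]]
) -> float:
    """mAP: AP averaged over categories, one (scores, labels) pair per category."""
    aps = [average_precision(scores, labels) for scores, labels in per_class]
    return sum(aps) / len(aps)
```

For instance, with scores `[0.9, 0.8, 0.7]` and labels `[True, False, True]`, the positives sit at ranks 1 and 3, giving precisions 1 and 2/3 and an AP of 5/6. Averaging per-category APs treats every subitizing category equally, regardless of how many test images it contains.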

References

  1. Achanta, R., Hemami, S., Estrada, F., & Süsstrunk, S. (2009). Frequency-tuned salient region detection. In IEEE conference on computer vision and pattern recognition (CVPR).

  2. Anoraganingrum, D. (1999). Cell segmentation with median filter and mathematical morphology operation. In International conference on image analysis and processing.

  3. Arteta, C., Lempitsky, V., Noble, J. A., & Zisserman, A. (2014). Interactive object counting. In European conference on computer vision (ECCV).

  4. Atkinson, J., Campbell, F. W., & Francis, M. R. (1976). The magic number 4 ± 0: A new look at visual numerosity judgements. Perception, 5(3), 327–334.

  5. Berg, T. L., & Berg, A. C. (2009). Finding iconic images. In IEEE conference on computer vision and pattern recognition (CVPR) workshops.

  6. Borji, A., Sihite, D. N., & Itti, L. (2012). Salient object detection: A benchmark. In European conference on computer vision (ECCV).

  7. Boysen, S. T., & Capaldi, E. J. (2014). The development of numerical competence: Animal and human models. Hove: Psychology Press.

  8. Chan, A. B., & Vasconcelos, N. (2009). Bayesian Poisson regression for crowd counting. In IEEE international conference on computer vision (ICCV).

  9. Chan, A. B., Liang, Z.-S. J., & Vasconcelos, N. (2008). Privacy preserving crowd monitoring: Counting people without people models or tracking. In IEEE conference on computer vision and pattern recognition (CVPR).

  10. Chatfield, K., Lempitsky, V., Vedaldi, A., & Zisserman, A. (2011). The devil is in the details: An evaluation of recent feature encoding methods. In British Machine Vision Conference (BMVC).

  11. Cheng, M.-M., Zhang, G.-X., Mitra, N. J., Huang, X., & Hu, S.-M. (2011). Global contrast based salient region detection. In IEEE conference on computer vision and pattern recognition (CVPR).

  12. Cheng, M.-M., Mitra, N. J., Huang, X., Torr, P. H. S., & Hu, S.-M. (2015). Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3), 569–582.

  13. Choi, J., Jung, C., Lee, J., & Kim, C. (2014). Determining the existence of objects in an image and its application to image thumbnailing. IEEE Signal Processing Letters, 21(8), 957–961.

  14. Chua, T.-S., Tang, J., Hong, R., Li, H., Luo, Z., & Zheng, Y. (2009). NUS-WIDE: A real-world web image database from National University of Singapore. In Proceedings of the ACM international conference on image and video retrieval.

  15. Clements, D. H. (1999). Subitizing: What is it? Why teach it? Teaching Children Mathematics, 5, 400–405.

  16. Davis, H., & Pérusse, R. (1988). Numerical competence in animals: Definitional issues, current evidence, and a new research agenda. Behavioral and Brain Sciences, 11(4), 561–579.

  17. Dehaene, S. (2011). The number sense: How the mind creates mathematics. Oxford: Oxford University Press.

  18. Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2007). The PASCAL visual object classes challenge 2007 (VOC2007) results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.

  19. Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.

  20. Feng, J., Wei, Y., Tao, L., Zhang, C., & Sun, J. (2011). Salient object detection by composition. In IEEE international conference on computer vision (ICCV).

  21. Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE conference on computer vision and pattern recognition (CVPR).

  22. Gopalakrishnan, V., Hu, Y., & Rajan, D. (2009). Random walks on graphs to model saliency in images. In IEEE conference on computer vision and pattern recognition (CVPR).

  23. Gross, H. J. (2012). The magical number four: A biological, historical and mythological enigma. Communicative & Integrative Biology, 5(1), 1–2.

  24. Gross, H. J., Pahl, M., Si, A., Zhu, H., Tautz, J., & Zhang, S. (2009). Number-based visual generalisation in the honeybee. PLoS ONE, 4(1), e4263.

  25. Gurari, D., & Grauman, K. (2016). Visual question: Predicting if a crowd will agree on the answer. arXiv preprint arXiv:1608.08188.

  26. Heo, J.-P., Lin, Z., & Yoon, S.-E. (2014). Distance encoded product quantization. In IEEE conference on computer vision and pattern recognition (CVPR).

  27. Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Synthetic data and artificial neural networks for natural scene text recognition. In Workshop on deep learning, NIPS.

  28. Jansen, B. R. J., Hofman, A. D., Straatemeier, M., Bers, B. M. C. W., Raijmakers, M. E. J., & Maas, H. L. J. (2014). The role of pattern recognition in children’s exact enumeration of small numbers. British Journal of Developmental Psychology, 32(2), 178–194.

  29. Jevons, W. S. (1871). The power of numerical discrimination. Nature, 3, 281–282.

  30. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., et al. (2014). Caffe: Convolutional architecture for fast feature embedding. In ACM international conference on multimedia.

  31. Kaufman, E. L., Lord, M. W., Reese, T. W., & Volkmann, J. (1949). The discrimination of visual number. The American Journal of Psychology, 62, 498–525.

  32. Kazemzadeh, S., Ordonez, V., Matten, M., & Berg, T. L. (2014). ReferItGame: Referring to objects in photographs of natural scenes. In Conference on empirical methods in natural language processing (EMNLP).

  33. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems (NIPS).

  34. Lee, Y. J., Ghosh, J., & Grauman, K. (2012). Discovering important people and objects for egocentric video summarization. In IEEE conference on computer vision and pattern recognition (CVPR).

  35. Lempitsky, V., & Zisserman, A. (2010). Learning to count objects in images. In Advances in neural information processing systems (NIPS).

  36. Li, X., Uricchio, T., Ballan, L., Bertini, M., Snoek, C. G. M., & Bimbo, A. D. (2016). Socializing the semantic gap: A comparative survey on image tag assignment, refinement, and retrieval. ACM Computing Surveys, 49(1), 14:1–14:39.

  37. Li, Y., Hou, X., Koch, C., Rehg, J., & Yuille, A. (2014). The secrets of salient object segmentation. In IEEE conference on computer vision and pattern recognition (CVPR).

  38. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., et al. (2014). Microsoft COCO: Common objects in context. In European conference on computer vision (ECCV).

  39. Liu, T., Yuan, Z., Sun, J., Wang, J., Zheng, N., Tang, X., et al. (2011). Learning to detect a salient object. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(2), 353–367.

  40. Mandler, G., & Shebo, B. J. (1982). Subitizing: An analysis of its component processes. Journal of Experimental Psychology: General, 111(1), 1–22.

  41. Nath, S. K., Palaniappan, K., & Bunyak, F. (2006). Cell segmentation using coupled level sets and graph-vertex coloring. In Medical image computing and computer-assisted intervention (MICCAI).

  42. Pahl, M., Si, A., & Zhang, S. (2013). Numerical cognition in bees and other insects. Frontiers in Psychology, 4, 162.

  43. Peng, X., Sun, B., Ali, K., & Saenko, K. (2015). Learning deep object detectors from 3D models. In IEEE international conference on computer vision (ICCV).

  44. Piazza, M., & Dehaene, S. (2004). From number neurons to mental arithmetic: The cognitive neuroscience of number sense. The Cognitive Neurosciences (3rd ed.), pp. 865–877.

  45. Pinheiro, P. O., Lin, T.-Y., Collobert, R., & Dollár, P. (2016). Learning to refine object segments. In European conference on computer vision (ECCV).

  46. Pont-Tuset, J., Arbelaez, P., Barron, J. T., Marques, F., & Malik, J. (2017). Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1), 128–140.

  47. Razavian, A. S., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: An astounding baseline for recognition. In IEEE conference on computer vision and pattern recognition (CVPR), DeepVision Workshop.

  48. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3), 211–252.

  49. Scharfenberger, C., Waslander, S. L., Zelek, J. S., & Clausi, D. A. (2013). Existence detection of objects in images for robot vision using saliency histogram features. In IEEE international conference on computer and robot vision (CRV).

  50. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., & LeCun, Y. (2014). OverFeat: Integrated recognition, localization and detection using convolutional networks. In International conference on learning representations (ICLR).

  51. Shen, X., & Wu, Y. (2012). A unified approach to salient object detection via low rank matrix recovery. In IEEE conference on computer vision and pattern recognition (CVPR).

  52. Shin, D., He, S., Lee, G. M., Whinston, A. B., Cetintas, S., & Lee, K.-C. (2016). Content complexity, similarity, and consistency in social media: A deep learning approach. https://ssrn.com/abstract=2830377.

  53. Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International conference on learning representations (ICLR).

  54. Siva, P., Russell, C., Xiang, T., & Agapito, L. (2013). Looking beyond the image: Unsupervised learning for object saliency and detection. In IEEE conference on computer vision and pattern recognition (CVPR).

  55. Stark, M., Goesele, M., & Schiele, B. (2010). Back to the future: Learning shape models from 3D CAD data. In British Machine Vision Conference (BMVC).

  56. Stoianov, I., & Zorzi, M. (2012). Emergence of a visual number sense in hierarchical generative models. Nature Neuroscience, 15(2), 194–196.

  57. Subburaman, V. B., Descamps, A., & Carincotte, C. (2012). Counting people in the crowd using a generic head detector. In IEEE international conference on advanced video and signal-based surveillance (AVSS).

  58. Sun, B., & Saenko, K. (2014). From virtual to reality: Fast adaptation of virtual object detectors to real domains. In British Machine Vision Conference (BMVC).

  59. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., et al. (2015). Going deeper with convolutions. In IEEE conference on computer vision and pattern recognition (CVPR).

  60. Torralba, A., Murphy, K. P., Freeman, W. T., & Rubin, M. A. (2003). Context-based vision system for place and object recognition. In IEEE international conference on computer vision (ICCV).

  61. Trick, L. M., & Pylyshyn, Z. W. (1994). Why are small and large numbers enumerated differently? A limited-capacity preattentive stage in vision. Psychological Review, 101(1), 80–102.

  62. Vedaldi, A., & Fulkerson, B. (2008). VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/.

  63. Vuilleumier, P. O., & Rafal, R. D. (2000). A systematic study of visual extinction: Between- and within-field deficits of attention in hemispatial neglect. Brain, 123(6), 1263–1279.

  64. Wang, P., Wang, J., Zeng, G., Feng, J., Zha, H., & Li, S. (2012). Salient object detection for searched web images via global saliency. In IEEE conference on computer vision and pattern recognition (CVPR).

  65. Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In IEEE conference on computer vision and pattern recognition (CVPR).

  66. Xiong, B., & Grauman, K. (2014). Detecting snap points in egocentric video with a web photo prior. In European conference on computer vision (ECCV).

  67. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A. C., Salakhutdinov, R., et al. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning (ICML).

  68. Zhang, J., Ma, S., Sameki, M., Sclaroff, S., Betke, M., Lin, Z., et al. (2015a). Salient object subitizing. In IEEE conference on computer vision and pattern recognition (CVPR).

  69. Zhang, J., Sclaroff, S., Lin, Z., Shen, X., Price, B., & Měch, R. (2015b). Minimum barrier salient object detection at 80 fps. In IEEE international conference on computer vision (ICCV).

  70. Zhang, J., Sclaroff, S., Lin, Z., Shen, X., Price, B., & Měch, R. (2016). Unconstrained salient object detection via proposal subset optimization. In IEEE conference on computer vision and pattern recognition (CVPR).

  71. Zhao, R., Ouyang, W., Li, H., & Wang, X. (2015). Saliency detection by multi-context deep learning. In IEEE conference on computer vision and pattern recognition (CVPR).

  72. Zou, W. Y., & McClelland, J. L. (2013). Progressive development of the number sense in a deep neural network. In Annual conference of the Cognitive Science Society (CogSci).

Acknowledgements

This research was supported in part by US NSF Grants 0910908 and 1029430, and gifts from Adobe and NVIDIA.

Author information

Corresponding author

Correspondence to Jianming Zhang.

Additional information

Communicated by Antonio Torralba.

Electronic supplementary material

Supplementary material 1 (pdf 136 KB)

About this article

Cite this article

Zhang, J., Ma, S., Sameki, M. et al. Salient Object Subitizing. Int J Comput Vis 124, 169–186 (2017). https://doi.org/10.1007/s11263-017-1011-0

Keywords

  • Salient object
  • Subitizing
  • Deep learning
  • Convolutional neural network