The Ignorant Led by the Blind: A Hybrid Human–Machine Vision System for Fine-Grained Categorization


We present a visual recognition system for fine-grained visual categorization. The system is composed of a human and a machine working together and combines the complementary strengths of computer vision algorithms and (non-expert) human users. The human users provide two heterogeneous forms of information: object part clicks and answers to multiple-choice questions. The machine intelligently selects the most informative question to pose to the user in order to identify the object class as quickly as possible. By leveraging computer vision and analyzing the user responses, the overall amount of human effort required, measured in seconds, is minimized. Our formalism shows how to incorporate many different types of computer vision algorithms into a human-in-the-loop framework, including standard multiclass methods, part-based methods, and localized multiclass and attribute methods. We explore our ideas by building a field guide for bird identification. The experimental results demonstrate the strength of combining ignorant humans with poor-sighted machines: the hybrid system achieves quick and accurate bird identification on a dataset containing 200 bird species.
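
The question-selection idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a discrete class posterior, a table of answer likelihoods per question, and a rule that ranks questions by expected information gain per second of expected answer time. The function names, the likelihood tables, and the per-question timings are all hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def posterior(prior, likelihood, answer):
    """Bayes update of the class distribution after observing one answer.

    likelihood[c][answer] is the assumed probability that a user gives
    `answer` when the true class is c.
    """
    joint = [prior[c] * likelihood[c][answer] for c in range(len(prior))]
    z = sum(joint)
    return [j / z for j in joint] if z > 0 else list(prior)


def expected_information_gain(prior, likelihood):
    """Expected reduction in class entropy from asking one question."""
    h0 = entropy(prior)
    gain = 0.0
    for a in range(len(likelihood[0])):
        p_a = sum(prior[c] * likelihood[c][a] for c in range(len(prior)))
        if p_a > 0:
            gain += p_a * (h0 - entropy(posterior(prior, likelihood, a)))
    return gain


def pick_question(prior, questions, seconds):
    """Pick the question with the highest expected gain per second.

    Dividing by answer time is an assumption made for this sketch.
    """
    return max(
        questions,
        key=lambda q: expected_information_gain(prior, questions[q]) / seconds[q],
    )


# Hypothetical example: four equally likely classes and two yes/no questions.
prior = [0.25, 0.25, 0.25, 0.25]
questions = {
    "splits_classes": [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
    "uninformative": [[0.5, 0.5]] * 4,
}
seconds = {"splits_classes": 2.0, "uninformative": 1.0}
best = pick_question(prior, questions, seconds)
```

In this toy setting the first question halves the set of candidate classes, so it is preferred even though it is assumed to take longer to answer.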


  1. Our user model assumes binary or multinomial attributes; however, one could use continuous attribute values for the computer vision component described in this section.

  2. The integral in Eq. 26 involves a bottom-up traversal of \(T=(V,E)\), at each step convolving a spatial score map with a unary score map (this takes \(O(n \log n)\) time in the number of pixels).

  3. Maximum likelihood inference involves a bottom-up traversal of \(T\), performing a distance transform operation (Felzenszwalb et al. 2008) for each part in the tree (this takes \(O(n)\) time in the number of pixels).

  4. In practice, we also computed an average segmentation mask for each part-aspect and used that to weight each extracted patch; see the supplementary material.
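
The linear-time distance transform mentioned in the third note can be sketched in one dimension. This is an illustrative version of the standard lower-envelope algorithm for a quadratic cost, assuming finite input costs and a unit-weight penalty \((p-q)^2\); the paper's actual spatial cost and its 2D handling are not reproduced here, and the function name is hypothetical.

```python
import math


def distance_transform_1d(f):
    """Return d[p] = min over q of f[q] + (p - q)**2 in O(n) time.

    Assumes every entry of f is finite.
    """
    n = len(f)
    if n == 0:
        return []
    v = [0] * n            # positions of parabolas on the lower envelope
    z = [0.0] * (n + 1)    # boundaries between consecutive parabolas
    k = 0                  # index of the rightmost parabola on the envelope
    z[0], z[1] = -math.inf, math.inf
    for q in range(1, n):
        while True:
            p = v[k]
            # Intersection of the parabola rooted at q with the one at p.
            s = ((f[q] + q * q) - (f[p] + p * p)) / (2 * q - 2 * p)
            if s <= z[k]:
                k -= 1     # parabola p is hidden by q, so drop it
            else:
                break
        k += 1
        v[k] = q
        z[k] = s
        z[k + 1] = math.inf
    d = [0.0] * n
    k = 0
    for q in range(n):
        # Advance to the envelope segment that contains position q.
        while z[k + 1] < q:
            k += 1
        p = v[k]
        d[q] = (q - p) ** 2 + f[p]
    return d


# Hypothetical cost row, checked against the quadratic-time definition.
costs = [4.0, 1.0, 7.0, 0.0, 9.0, 3.0]
fast = distance_transform_1d(costs)
slow = [min(costs[q] + (p - q) ** 2 for q in range(len(costs)))
        for p in range(len(costs))]
```

For a 2D score map, the usual approach is to apply the same routine along every row and then along every column, which keeps the total cost linear in the number of pixels.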


  1. Belhumeur, P., Chen, D., Feiner, S., Jacobs, D., Kress, W., Ling, H., Lopez, I., Ramamoorthi, R., Sheorey, S., White, S. & Zhang, L. (2008). Searching the world’s herbaria. In ECCV.

  2. Berg, T. & Belhumeur, P.N. (2013). Poof: Part-based one-vs-one features for fine-grained categorization, face verification, and attribute estimation. In CVPR.

  3. Biederman, I., Subramaniam, S., Bar, M., Kalocsai, P., & Fiser, J. (1999). Subordinate-level object classification reexamined. Psychological Research, 63(2–3), 131–153.

  4. Bourdev, L. & Malik, J. (2009). Poselets: Body part detectors trained using 3d annotations. In ICCV.

  5. Branson, S., Perona, P. & Belongie, S. (2011). Strong supervision from weak annotation. In ICCV.

  6. Branson, S., Wah, C., Schroff, F., Babenko, B., Welinder, P., Perona, P. & Belongie, S. (2010). Visual recognition with humans in the loop. In ECCV.

  7. Chai, Y., Lempitsky, V. & Zisserman, A. (2011). Bicos: A bi-level co-segmentation method. In ICCV.

  8. Chai, Y., Lempitsky, V. & Zisserman, A. (2013). Symbiotic segmentation and part localization for fine-grained categorization. In ICCV.

  9. Chai, Y., Rahtu, E., Lempitsky, V., Van Gool, L. & Zisserman, A. (2012). Tricos. In ECCV.

  10. Cox, I.J., Miller, M.L., Minka, T.P., Papathomas, T.V. & Yianilos, P.N. (2000). The bayesian image retrieval system, pichunter: Theory, implementation, and psychophysical experiments. Image processing.

  11. Donahue, J. & Grauman, K. (2011). Annotator rationales for visual recognition. In ICCV.

  12. Douze, M., Ramisa, A. & Schmid, C. (2011). Combining attributes and fisher vectors for efficient image retrieval. In CVPR.

  13. Duan, K., Parikh, D., Crandall, D. & Grauman, K. (2012). Discovering localized attributes for fine-grained recognition. In CVPR.

  14. Fang, Y. & Geman, D. (2005). Experiments in mental face retrieval. In AVBPA.

  15. Farhadi, A., Endres, I. & Hoiem, D. (2010). Attribute-centric recognition for generalization. In CVPR.

  16. Farhadi, A., Endres, I., Hoiem, D. & Forsyth, D. (2009). Describing objects by attributes. In CVPR.

  17. Farrell, R., Oza, O., Zhang, N., Morariu, V., Darrell, T. & Davis, L. (2011). Birdlets. In ICCV.

  18. Felzenszwalb, P. & Huttenlocher, D. (2002). Efficient matching of pictorial structures. In CVPR.

  19. Felzenszwalb, P., McAllester, D. & Ramanan, D. (2008). A discriminatively trained, multiscale, deformable part model. In CVPR.

  20. Ferecatu, M. & Geman, D. (2007). Interactive search by mental matching. In ICCV.

  21. Ferecatu, M. & Geman, D. (2009). A statistical framework for image category search from a mental picture. In PAMI.

  22. Gavves, E., Fernando, B., Snoek, C., Smeulders, A. & Tuytelaars, T. (2013). Fine-grained categorization by alignments. In ICCV.

  23. Geman, D. & Jedynak, B. (1993). Shape recognition and twenty questions. Belmont: Wadsworth.

  24. Geman, D. & Jedynak, B. (1996). An active testing model for tracking roads in satellite images. In PAMI.

  25. Jedynak, B., Frazier, P. I., & Sznitman, R. (2012). Twenty questions with noise: Bayes optimal policies for entropy loss. Journal of Applied Probability, 49(1), 114–136.

  26. Khosla, A., Jayadevaprakash, N., Yao, B. & Li, F.F. (2011). Novel dataset for fgvc: Stanford dogs. San Diego: CVPR Workshop on FGVC.

  27. Kumar, N., Belhumeur, P., Biswas, A., Jacobs, D., Kress, W., Lopez, I. & Soares, J. (2012). Leafsnap: A computer vision system for automatic plant species identification. In ECCV.

  28. Kumar, N., Belhumeur, P. & Nayar, S. (2008). Facetracer: A search engine for large collections of images with faces. In ECCV.

  29. Kumar, N., Berg, A.C., Belhumeur, P.N. & Nayar, S.K. (2009). Attribute and simile classifiers for face verification. In ICCV.

  30. Lampert, C., Nickisch, H. & Harmeling, S. (2009). Learning to detect unseen object classes. In CVPR.

  31. Larios, N., Soran, B., Shapiro, L.G., Martinez-Munoz, G., Lin, J. & Dietterich, T.G. (2010). Haar random forest features and svm spatial matching kernel for stonefly species identification. In ICPR.

  32. Lazebnik, S., Schmid, C. & Ponce, J. (2005). A maximum entropy framework for part-based texture and object recognition. In ICCV.

  33. Levin, A., Lischinski, D. & Weiss, Y. (2007). A closed-form solution to natural image matting. In PAMI.

  34. Liu, J., Kanazawa, A., Jacobs, D. & Belhumeur, P. (2012). Dog breed classification using part localization. In ECCV.

  35. Lu, Y., Hu, C., Zhu, X., Zhang, H. & Yang, Q. (2000). A unified framework for semantics and feature based relevance feedback in image retrieval systems. In ACM Multimedia.

  36. Maji, S. (2012). Discovering a lexicon of parts and attributes. In ECCV Parts and Attributes.

  37. Maji, S. & Shakhnarovich, G. (2012). Part annotations via pairwise correspondence. In Conference on Artificial Intelligence Workshop.

  38. Martínez-Muñoz et al. (2009). Dictionary-free categorization of very similar objects. In CVPR.

  39. Mervis, C. B., & Crisafi, M. A. (1982). Order of acquisition of subordinate-, basic-, and superordinate-level categories. Child Development, 53(1), 256–266.

  40. Nilsback, M. & Zisserman, A. (2008). Automated flower classification. In ICVGIP.

  41. Nilsback, M.E. & Zisserman, A. (2006). A visual vocabulary for flower classification. In CVPR.

  42. Ott, P. & Everingham, M. (2011). Shared parts for deformable part-based models. In CVPR.

  43. Parikh, D. & Grauman, K. (2011). Interactively building a vocabulary of attributes. In CVPR.

  44. Parikh, D. & Grauman, K. (2011). Relative attributes. In ICCV.

  45. Parikh, D. & Grauman, K. (2013). Implied feedback: Learning nuances of user behavior in image search. In ICCV.

  46. Parikh, D. & Zitnick, C.L. (2011a). Finding the weakest link in person detectors. In CVPR.

  47. Parikh, D. & Zitnick, C.L. (2011b). Human-debugging of machines. In NIPS Wisdom of Crowds.

  48. Parkash, A. & Parikh, D. (2012). Attributes for classifier feedback. In ECCV.

  49. Parkhi, O., Vedaldi, A., Zisserman, A. & Jawahar, C. (2012). Cats and dogs. In CVPR.

  50. Parkhi, O.M., Vedaldi, A., Jawahar, C. & Zisserman, A. (2011). The truth about cats and dogs. In ICCV.

  51. Perronnin, F., Sánchez, J. & Mensink, T. (2010). Improving the fisher kernel. In ECCV.

  52. Platt, J.C. (1999). Probabilistic outputs for svms. In ALMC.

  53. Quinlan, J. R. (1993). C4.5: Programs for machine learning. Burlington: Morgan Kaufmann.

  54. Rasiwasia, N., Moreno, P.J. & Vasconcelos, N. (2007). Bridging the gap: Query by semantic example. In Multimedia.

  55. Rosch, E. (1999). Principles of categorization. In Concepts: Core readings.

  56. Rosch, E., Mervis, C.B., Gray, W.D., Johnson, D.M. & Boyes-Braem, P. (1976). Basic objects in natural categories. In Cognitive Psychology.

  57. Rother, C., Kolmogorov, V. & Blake, A. (2004). Grabcut: Interactive foreground extraction. In TOG.

  58. Settles, B. (2008). Curious machines: Active learning with structured instances.

  59. Stark, M., Krause, J., Pepik, B., Meger, D., Little, J.J., Schiele, B. & Koller, D. (2012). Fine-grained categorization for 3d scene understanding. In BMVC.

  60. Sznitman, R., Basu, A., Richa, R., Handa, J., Gehlbach, P., Taylor, R.H., Jedynak, B. & Hager, G.D. (2011). Unified detection and tracking in retinal microsurgery. In MICCAI.

  61. Sznitman, R. & Jedynak, B. (2010). Active testing for face detection and localization. In PAMI.

  62. Tsiligkaridis, T., Sadler, B. & Hero, A. (2013). A collaborative 20 questions model for target search with human-machine interaction. In ICASSP.

  63. Tsochantaridis, I., Joachims, T., Hofmann, T. & Altun, Y. (2006). Large margin methods for structured and interdependent output variables. In JMLR.

  64. Vijayanarasimhan, S. & Grauman, K. (2009). What’s It Going to Cost You? In CVPR.

  65. Vijayanarasimhan, S. & Grauman, K. (2011). Large-scale live active learning. In CVPR.

  66. Vondrick, C. & Ramanan, D. (2011). Video Annotation and Tracking with Active Learning. In NIPS.

  67. Vondrick, C., Ramanan, D. & Patterson, D. (2010). Efficiently scaling up video annotation. In ECCV.

  68. Wah, C., Branson, S., Perona, P. & Belongie, S. (2011). Multiclass recognition and part localization with humans in the loop. In ICCV.

  69. Wah, C., Branson, S., Welinder, P., Perona, P. & Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 Dataset. Tech. Rep. CNS-TR-2011-001, Pasadena: Caltech.

  70. Wang, G. & Forsyth, D. (2009). Joint learning of visual attributes, object classes. In ICCV.

  71. Wang, J., Markert, K. & Everingham, M. (2009). Learning models for object recognition from natural language descriptions. In BMVC.

  72. Wu, W. & Yang, J. (2006). SmartLabel: an object labeling tool. In Multimedia.

  73. Yang, Y. & Ramanan, D. (2011). Articulated pose estimation using mixtures of parts. In CVPR.

  74. Yao, B., Bradski, G. & Fei-Fei, L. (2012). A codebook and annotation-free approach for fgvc. In CVPR.

  75. Yao, B., Khosla, A. & Fei-Fei, L. (2011). Combining randomization and discrimination for fgvc. In CVPR.

  76. Zhang, N., Farrell, R. & Darrell, T. (2012). Pose pooling kernels for sub-category recognition. In CVPR.

  77. Zhang, N., Farrell, R., Iandola, F. & Darrell, T. (2013). Deformable part descriptors for fine-grained recognition and attribute prediction. In ICCV.

  78. Zhou, X. & Huang, T. (2003). Relevance feedback in image retrieval. In Multimedia.

Author information



Corresponding author

Correspondence to Steve Branson.

Additional information

Communicated by M. Hebert.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material ESM3 (MP4 3.28 KB)

Supplementary material ESM1 (PDF 25.6 KB)

Supplementary material ESM2 (PDF 4.27 KB)

About this article

Cite this article

Branson, S., Van Horn, G., Wah, C. et al. The Ignorant Led by the Blind: A Hybrid Human–Machine Vision System for Fine-Grained Categorization. Int J Comput Vis 108, 3–29 (2014).


  • Fine-grained categorization
  • Human-in-the-loop
  • Interactive
  • Parts
  • Attributes
  • Crowdsourcing
  • Deformable part models
  • Pose mixture models
  • Object recognition
  • Information gain
  • Birds