The Ignorant Led by the Blind: A Hybrid Human–Machine Vision System for Fine-Grained Categorization

  • Steve Branson (email author)
  • Grant Van Horn
  • Catherine Wah
  • Pietro Perona
  • Serge Belongie


We present a visual recognition system for fine-grained visual categorization. The system is composed of a human and a machine working together, and it combines the complementary strengths of computer vision algorithms and (non-expert) human users. The human users provide two heterogeneous forms of information: object part clicks and answers to multiple-choice questions. The machine intelligently selects the most informative question to pose to the user in order to identify the object class as quickly as possible. By leveraging computer vision and analyzing the user responses, the system minimizes the overall amount of human effort required, measured in seconds. Our formalism shows how to incorporate many different types of computer vision algorithms into a human-in-the-loop framework, including standard multiclass methods, part-based methods, and localized multiclass and attribute methods. We explore our ideas by building a field guide for bird identification. The experimental results demonstrate the strength of combining ignorant humans with poor-sighted machines: the hybrid system achieves quick and accurate bird identification on a dataset containing 200 bird species.
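
The question-selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a discrete class posterior (for example, one produced by a computer vision classifier), a known table of answer probabilities per class for each question, and a fixed time cost in seconds per question. The function names (`expected_information_gain`, `pick_question`) and all numbers are hypothetical, and part-click questions are not modeled here.

```python
import numpy as np


def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())


def posterior_update(prior, likelihood_column):
    """Bayes rule: p(class | answer) is proportional to p(answer | class) * p(class)."""
    unnorm = prior * likelihood_column
    return unnorm / unnorm.sum()


def expected_information_gain(prior, answer_given_class):
    """Expected reduction in class entropy from asking one question.

    prior: shape (C,), current class probabilities.
    answer_given_class: shape (C, A), row c holds p(answer | class c).
    """
    p_answer = prior @ answer_given_class  # marginal answer probabilities, shape (A,)
    gain = entropy(prior)
    for a, pa in enumerate(p_answer):
        if pa > 0:
            gain -= pa * entropy(posterior_update(prior, answer_given_class[:, a]))
    return gain


def pick_question(prior, questions, seconds):
    """Index of the question with the highest expected gain per second (assumed cost model)."""
    rates = [expected_information_gain(prior, q) / t
             for q, t in zip(questions, seconds)]
    return int(np.argmax(rates))


# Hypothetical example: 4 classes, with vision already favoring classes 0 and 1.
prior = np.array([0.4, 0.4, 0.1, 0.1])
# Question 0 separates class 0 from class 1; question 1 separates {0, 1} from {2, 3}.
q0 = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5], [0.5, 0.5]])
q1 = np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])

best_equal_cost = pick_question(prior, [q0, q1], seconds=[1.0, 1.0])
best_slow_q0 = pick_question(prior, [q0, q1], seconds=[10.0, 2.0])
```

Dividing the expected gain by the time cost is one simple way to account for effort measured in seconds rather than in number of questions. With equal costs the sketch prefers the question that splits the two classes vision cannot tell apart, and a much slower question can lose to a faster, less informative one.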


Fine-grained categorization, Human-in-the-loop, Interactive, Parts, Attributes, Crowdsourcing, Deformable part models, Pose mixture models, Object recognition, Information gain, Birds

Supplementary material

11263_2014_698_MOESM1_ESM.pdf (26 KB)
Supplementary material ESM1 (PDF 25.6 KB)
11263_2014_698_MOESM2_ESM.pdf (4.3 MB)
Supplementary material ESM2 (PDF 4.27 MB)

Supplementary material ESM3 (MP4 3.28 KB)


Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Steve Branson (1, email author)
  • Grant Van Horn (2)
  • Catherine Wah (2)
  • Pietro Perona (1)
  • Serge Belongie (2)

  1. Caltech, Pasadena, USA
  2. University of California, San Diego, La Jolla, USA