The Ignorant Led by the Blind: A Hybrid Human–Machine Vision System for Fine-Grained Categorization

Branson, Steve; Van Horn, Grant; Wah, Catherine; Perona, Pietro; Belongie, Serge

doi:10.1007/s11263-014-0698-4

The Ignorant Led by the Blind: A Hybrid Human–Machine Vision System for Fine-Grained Categorization

Published: 20 February 2014

Volume 108, pages 3–29, (2014)
Cite this article

International Journal of Computer Vision Aims and scope Submit manuscript

Steve Branson¹,
Grant Van Horn²,
Catherine Wah²,
Pietro Perona¹ &
…
Serge Belongie²

1764 Accesses
46 Citations
3 Altmetric
Explore all metrics

Abstract

We present a visual recognition system for fine-grained visual categorization. The system is composed of a human and a machine working together and combines the complementary strengths of computer vision algorithms and (non-expert) human users. The human users provide two heterogeneous forms of information object part clicks and answers to multiple choice questions. The machine intelligently selects the most informative question to pose to the user in order to identify the object class as quickly as possible. By leveraging computer vision and analyzing the user responses, the overall amount of human effort required, measured in seconds, is minimized. Our formalism shows how to incorporate many different types of computer vision algorithms into a human-in-the-loop framework, including standard multiclass methods, part-based methods, and localized multiclass and attribute methods. We explore our ideas by building a field guide for bird identification. The experimental results demonstrate the strength of combining ignorant humans with poor-sighted machines the hybrid system achieves quick and accurate bird identification on a dataset containing 200 bird species.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Can computer vision problems benefit from structured hierarchical classification?

Article Open access 06 May 2016

Bird Region Detection in Images with Multi-scale HOG Features and SVM Scoring

Which Looks Like Which: Exploring Inter-class Relationships in Fine-Grained Visual Categorization

Notes

Our user model assumes binary or multinomial attributes; however, one could use continuous attribute values for the computer vision component described in this section
The integral in Eq. 26 involves a bottom-up traversal of \(T=(V,E)\), at each step convolving a spatial score map with a unary score map (takes time \(O(n \log n)\) time in the number of pixels).
Maximum likelihood inference involves a bottom-up traversal of \(T\), doing a distance transform operation (Felzenszwalb et al. 2008) for each part in the tree (takes time \(O(n)\) time in the number of pixels).
in practice, we also computed an average segmentation mask for each part-aspect and used that to weight each extracted patch, see supplementary material
http://www.allaboutbirds.org/NetCommunity/page.aspx?pid=1053

References

Belhumeur, P., Chen, D., Feiner, S., Jacobs, D., Kress, W., Ling, H., Lopez, I., Ramamoorthi, R., Sheorey, S., White, S. & Zhang, L. (2008). Searching the world’s herbaria. In ECCV.
Berg, T. & Belhumeur, P.N. (2013). Poof: Part-based one-vs-one features for fine-grained categorization, face verification, and attribute estimation. In CVPR.
Biederman, I., Subramaniam, S., Bar, M., Kalocsai, P., & Fiser, J. (1999). Subordinate-level object classification reexamined. Psychological Research, 63(2–3), 131–153.
Article Google Scholar
Bourdev, L. & Malik, J. (2009). Poselets: Body part detectors trained using 3d annotations. In ICCV.
Branson, S., Perona, P. & Belongie, S. (2011). Strong supervision from weak annotation. In ICCV.
Branson, S., Wah, C., Schroff, F., Babenko, B., Welinder, P., Perona, P. & Belongie, S. (2010). Visual recognition with humans in the loop. In ECCV.
Chai, Y., Lempitsky, V. & Zisserman, A. (2011). Bicos: A bi-level co-segmentation method. In ICCV.
Chai, Y., Lempitsky, V. & Zisserman, A. (2013). Symbiotic segmentation and part localization for fine-grained categorization. In ICCV.
Chai, Y., Rahtu, E., Lempitsky, V., Van Gool, L. & Zisserman, A. (2012). Tricos. In ECCV.
Cox, I.J., Miller, M.L., Minka, T.P., Papathomas, T.V. & Yianilos, P.N. (2000). The bayesian image retrieval system, pichunter: Theory, implementation, and psychophysical experiments. Image processing.
Donahue, J. & Grauman, K. (2011). Annotator rationales for visual recognition. In ICCV.
Douze, M., Ramisa, A. & Schmid, C. (2011). Combining attributes and fisher vectors for efficient image retrieval. In CVPR.
Duan, K., Parikh, D., Crandall, D. & Grauman, K. (2012). Discovering localized attributes for fine-grained recognition. In CVPR.
Fang, Y. & Geman, D. (2005). Experiments in mental face retrieval. In AVBPA.
Farhadi, A., Endres, I. & Hoiem, D. (2010). Attribute-centric recognition for generalization. In CVPR.
Farhadi, A., Endres, I., Hoiem, D. & Forsyth, D. (2009). Describing objects by attributes. In CVPR.
Farrell, R., Oza, O., Zhang, N., Morariu, V., Darrell, T. & Davis, L. (2011). Birdlets. In ICCV.
Felzenszwalb, P. & Huttenlocher, D. (2002). Efficient matching of pictorial structures. In CVPR.
Felzenszwalb, P., McAllester, D. & Ramanan, D. (2008). A discriminatively trained, multiscale, deformable part model. In CVPR.
Ferecatu, M. & Geman, D. (2007). Interactive search by mental matching. In ICCV .
Ferecatu, M. & Geman, D. (2009). A statistical framework for image category search from a mental picture. In PAMI.
Gavves, E., Fernando, B., Snoek, C., Smeulders, A. & Tuytelaars, T. (2013). Fine-grained categorization by alignments. In ICCV.
Geman, D. & Jedynak, B. (1993). Shape recognition and twenty questions. Belmont: Wadsworth.
Geman, D. & Jedynak, B. (1996). An active testing model for tracking roads in satellite images. In PAMI.
Jedynak, B., Frazier, P. I., & Sznitman, R. (2012). Twenty questions with noise: Bayes optimal policies for entropy loss. Journal of Applied Probability, 49(1), 114–136.
Article MATH MathSciNet Google Scholar
Khosla, A., Jayadevaprakash, N., Yao, B. & Li, F.F. (2011). Novel dataset for fgvc: Stanford dogs. San Diego: CVPR Workshop on FGVC.
Kumar, N., Belhumeur, P., Biswas, A., Jacobs, D., Kress, W., Lopez, I. & Soares, J. (2012). Leafsnap: A computer vision system for automatic plant species identification. In ECCV.
Kumar, N., Belhumeur, P. & Nayar, S. (2008). Facetracer: A search engine for large collections of images with faces. In ECCV.
Kumar, N., Berg, A.C., Belhumeur, P.N. & Nayar, S.K. (2009). Attribute and simile classifiers for face verification. In ICCV.
Lampert, C., Nickisch, H. & Harmeling, S. (2009). Learning to detect unseen object classes. In CVPR.
Larios, N., Soran, B., Shapiro, L.G., Martinez-Munoz, G., Lin, J. & Dietterich, T.G. (2010). Haar random forest features and svm spatial matching kernel for stonefly species identification. In ICPR.
Lazebnik, S., Schmid, C. & Ponce, J. (2005). A maximum entropy framework for part-based texture and object recognition. In ICCV.
Levin, A., Lischinski, D. & Weiss, Y. (2007). A closed-form solution to natural image matting. In PAMI.
Liu, J., Kanazawa, A., Jacobs, D. & Belhumeur, P. (2012). Dog breed classification using part localization. In ECCV.
Lu, Y., Hu, C., Zhu, X., Zhang, H. & Yang, Q. (2000). A unified framework for semantics and feature based relevance feedback in image retrieval systems. In ACM Multimedia.
Maji, S. (2012). Discovering a lexicon of parts and attributes. In ECCV Parts and Attributes.
Maji, S. & Shakhnarovich, G. (2012). Part annotations via pairwise correspondence. In Conference on Artificial Intelligence Workshop.
Martınez-Munoz et al. (2009). Dictionary-free categorization of very similar objects. In CVPR.
Mervis, C. B., & Crisafi, M. A. (1982). Order of acquisition of subordinate-, basic-, and superordinate-level categories. Child Development, 53(1), 256–266.
Google Scholar
Nilsback, M. & Zisserman, A. (2008). Automated flower classification. In ICVGIP.
Nilsback, M.E. & Zisserman, A. (2006). A visual vocabulary for flower classification. In CVPR.
Ott, P. & Everingham, M. (2011). Shared parts for deformable part-based models. In CVPR.
Parikh, D. & Grauman, K. (2011). Interactively building a vocabulary of attributes. In CVPR.
Parikh, D. & Grauman, K. (2011). Relative attributes. In ICCV.
Parikh, D. & Grauman, K. (2013). Implied feedback: Learning nuances of user behavior in image search. In ICCV.
Parikh, D. & Zitnick, C.L. (2011a). Finding the weakest link in person detectors. In CVPR .
Parikh, D. & Zitnick, C.L. (2011b). Human-debugging of machines. In NIPS Wisdom of Crowds.
Parkash, A. & Parikh, D. (2012). Attributes for classifier feedback. In ECCV.
Parkhi, O., Vedaldi, A., Zisserman, A. & Jawahar, C. (2012). Cats and dogs. In CVPR.
Parkhi, O.M., Vedaldi, A., Jawahar, C. & Zisserman, A. (2011). The truth about cats and dogs. In ICCV.
Perronnin, F., Sánchez, J. & Mensink, T. (2010). Improving the fisher kernel. In ECCV.
Platt, J.C. (1999). Probabilistic outputs for svms. In ALMC.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. Burlington: Morgan Kaufmann.
Google Scholar
Rasiwasia, N., Moreno, P.J. & Vasconcelos, N. (2007). Bridging the gap: Query by semantic example. In Multimedia.
Rosch, E. (1999). Principles of categorization. In Concepts: Core readings.
Rosch, E., Mervis, C.B. & Gray, W.D., Johnson, D.M., Boyes-Braem, P. (1976). Basic objects in natural categories. In Cognitive Psychology.
Rother, C., Kolmogorov, V. & Blake, A. (2004). Grabcut: Interactive foreground extraction. In TOG.
Settles, B. (2008). Curious machines: Active learning with structured instances.
Stark, M., Krause, J., Pepik, B., Meger, D., Little, J.J., Schiele, B. & Koller, D. (2012). Fine-grained categorization for 3d scene understanding. In BMVC.
Sznitman, R., Basu, A., Richa, R., Handa, J., Gehlbach, P., Taylor, R.H., Jedynak, B. & Hager, G.D. (2011). Unified detection and tracking in retinal microsurgery. In MICCAI.
Sznitman, R. & Jedynak, B. (2010). Active testing for face detection and localization. In PAMI.
Tsiligkaridis, T., Sadler, B. & Hero, A. (2013). A collaborative 20 questions model for target search with human-machine interaction. In ICASSP.
Tsochantaridis, I., Joachims, T., Hofmann, T. & Altun, Y. (2006). Large margin methods for structured and interdependent output variables. In JMLR.
Vijayanarasimhan, S. & Grauman, K. (2009). What’s It Going to Cost You? In CVPR.
Vijayanarasimhan, S. & Grauman, K. (2011). Large-scale live active learning. In CVPR.
Vondrick, C. & Ramanan, D. (2011). Video Annotation and Tracking with Active Learning. In NIPS.
Vondrick, C., Ramanan, D. & Patterson, D. (2010). Efficiently scaling up video annotation. In ECCV.
Wah, C., Branson, S., Perona, P. & Belongie, S. (2011). Multiclass recognition and part localization with humans in the loop. In ICCV.
Wah, C., Branson, S., Welinder, P., Perona, P. & Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 Dataset. Tech. Rep. CNS-TR-2011-001, Pasadena: Caltech.
Wang, G. & Forsyth, D. (2009). Joint learning of visual attributes, object classes. In ICCV.
Wang, J., Markert, K. & Everingham, M. (2009). Learning models for object recognition from natural language descriptions. In BMVC.
Wu, W. & Yang, J. (2006). SmartLabel: an object labeling tool. In Multimedia.
Yang, Y. & Ramanan, D. (2011). Articulated pose estimation using mixtures of parts. In CVPR.
Yao, B., Bradski, G., Fei-Fei, L.: A codebook and annotation-free approach for fgvc. In: CVPR (2012)
Yao, B., Khosla, A. & Fei-Fei, L. (2011). Combining randomization and discrimination for fgvc. In CVPR.
Zhang, N., Farrell, R. & Darrell, T. (2012). Pose pooling kernels for sub-category recognition. In CVPR.
Zhang, N., Farrell, R., Iandola, F. & Darrell, T. (2013). Deformable part descriptors for fine-grained recognition and attribute prediction. In ICCV.
Zhou, X. & Huang, T. (2003). Relevance feedback in image retrieval. In Multimedia.

Download references

Author information

Authors and Affiliations

Caltech, Pasadena, CA, USA
Steve Branson & Pietro Perona
University of California, San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, USA
Grant Van Horn, Catherine Wah & Serge Belongie

Authors

Steve Branson
View author publications
You can also search for this author in PubMed Google Scholar
Grant Van Horn
View author publications
You can also search for this author in PubMed Google Scholar
Catherine Wah
View author publications
You can also search for this author in PubMed Google Scholar
Pietro Perona
View author publications
You can also search for this author in PubMed Google Scholar
Serge Belongie
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Steve Branson.

Additional information

Communicated by M. Hebert.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material ESM1 (PDF 25.6 KB)

Supplementary material ESM2 (PDF 4.27 KB)

Supplementary material ESM3 (MP4 3.28 KB)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Branson, S., Van Horn, G., Wah, C. et al. The Ignorant Led by the Blind: A Hybrid Human–Machine Vision System for Fine-Grained Categorization. Int J Comput Vis 108, 3–29 (2014). https://doi.org/10.1007/s11263-014-0698-4

Download citation

Received: 07 March 2013
Accepted: 08 January 2014
Published: 20 February 2014
Issue Date: May 2014
DOI: https://doi.org/10.1007/s11263-014-0698-4

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

The Ignorant Led by the Blind: A Hybrid Human–Machine Vision System for Fine-Grained Categorization

Abstract

Access this article

Similar content being viewed by others

Can computer vision problems benefit from structured hierarchical classification?

Bird Region Detection in Images with Multi-scale HOG Features and SVM Scoring

Which Looks Like Which: Exploring Inter-class Relationships in Fine-Grained Visual Categorization

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Electronic supplementary material

Supplementary material ESM1 (PDF 25.6 KB)

Supplementary material ESM2 (PDF 4.27 KB)

Rights and permissions

About this article

Cite this article

Keywords

Navigation

The Ignorant Led by the Blind: A Hybrid Human–Machine Vision System for Fine-Grained Categorization

Abstract

Access this article

Similar content being viewed by others

Can computer vision problems benefit from structured hierarchical classification?

Bird Region Detection in Images with Multi-scale HOG Features and SVM Scoring

Which Looks Like Which: Exploring Inter-class Relationships in Fine-Grained Visual Categorization

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Electronic supplementary material

Supplementary material ESM1 (PDF 25.6 KB)

Supplementary material ESM2 (PDF 4.27 KB)

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation