Abstract
It is widely recognized that human vision relies on contextual information, typically arising from each of many levels of analysis. Local gradient information, otherwise ambiguous, is seen as part of a smooth contour or sharp angle in the context of an object’s boundary or corner. A stroke or degraded letter, unreadable by itself, contributes to the perception of a familiar word in the context of the surrounding strokes and letters. The iconic Dalmatian dog stays invisible until a multitude of clues about body parts and posture, and figure and ground, are coherently integrated. Context is always based on knowledge about the composition of parts that make up a whole, as in the arrangement of strokes that make up a letter, the arrangement of body parts that make up an animal, or the poses and postures of individuals that make up a mob. From this point of view, the hierarchy of contextual information available to an observer derives from the compositional nature of the world being observed. We will formulate this compositional viewpoint in terms of probability distributions and examine the computational implications. Whereas achieving optimal recognition performance in this formulation is an NP-complete problem, we will give mathematical and experimental evidence that a properly orchestrated computational algorithm can achieve nearly optimal recognition within a feasible number of operations. We will interpret the notions of bottom-up and top-down processing as steps in the staging of one such orchestration.
Additional information
Partially supported by the Office of Naval Research under N000140610749, and the National Science Foundation under ITR-0427223 and DMS-1007593.
Cite this article
Chang, LB., Jin, Y., Zhang, W. et al. Context, Computation, and Optimal ROC Performance in Hierarchical Models. Int J Comput Vis 93, 117–140 (2011). https://doi.org/10.1007/s11263-010-0391-1
DOI: https://doi.org/10.1007/s11263-010-0391-1