Abstract
In this paper we present a hierarchical and contextual model for aerial image understanding. Our model organizes objects (cars, roofs, roads, trees, parking lots) in aerial scenes into hierarchical groups whose appearances and configurations are determined by statistical constraints (e.g. relative position and relative scale). Our hierarchy is a non-recursive grammar for objects in aerial images, composed of layers of nodes that can each decompose into a number of different configurations. This allows us to generate and recognize a vast number of scenes with relatively few rules. We present a minimax entropy framework for learning the statistical constraints between objects and show that this learned context allows us to rule out unlikely scene configurations and hallucinate undetected objects during inference. A similar algorithm was proposed for texture synthesis (Zhu et al. in Int. J. Comput. Vis. 2:107–126, 1998) but did not incorporate hierarchical information. We use a range of bottom-up detectors, namely AdaBoost (Freund and Schapire in J. Comput. Syst. Sci. 55, 1997), TextonBoost (Shotton et al. in Proceedings of the European Conference on Computer Vision, pp. 1–15, 2006), and Compositional Boosting (Wu et al. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8, 2007), to propose locations of objects in new aerial images, and employ a cluster sampling algorithm, C4 (Porway and Zhu, 2009), to choose the subset of detections that best explains the image according to our learned prior model. The C4 algorithm can quickly switch between alternative competing sub-solutions, for example whether an image patch is better explained by a parking lot with cars or by a building with vents. We also show that our model can predict the locations of objects our detectors missed.
We conclude by presenting parsed aerial images and experimental results showing that our cluster sampling and top-down prediction algorithms use the learned contextual cues from our model to improve detection results over traditional bottom-up detectors alone.
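The claim that a non-recursive grammar can cover many scenes with few rules can be illustrated with a minimal sketch. The grammar below is a hypothetical toy, not the grammar learned in the paper: the symbol names, the rules, and the helper `count_parses` are all assumptions made for illustration. Each non-terminal lists its alternative configurations, and because no symbol can expand into itself, the number of distinct derivations is finite and can be counted directly.

```python
from math import prod

# Hypothetical toy grammar (illustrative only, not taken from the paper).
# Each non-terminal maps to a list of alternative configurations (OR choice);
# each configuration is a list of child symbols that appear together (AND).
# Symbols that are not keys of GRAMMAR are treated as terminals.
GRAMMAR = {
    "scene": [["urban_block"], ["urban_block", "urban_block"]],
    "urban_block": [
        ["parking_lot", "building"],
        ["building", "road"],
        ["parking_lot", "road"],
    ],
    "parking_lot": [["car"], ["car", "car"], ["car", "car", "car"]],
    "building": [["roof"], ["roof", "vent"]],
    "road": [["road_segment"], ["road_segment", "car"]],
}


def count_parses(symbol):
    """Count the distinct derivations rooted at `symbol`.

    Terminates because the grammar is non-recursive: alternatives add up,
    and children within one configuration multiply.
    """
    if symbol not in GRAMMAR:  # terminal symbol
        return 1
    return sum(
        prod(count_parses(child) for child in config)
        for config in GRAMMAR[symbol]
    )


# Total number of rules = total number of listed configurations.
n_rules = sum(len(configs) for configs in GRAMMAR.values())
n_scenes = count_parses("scene")
print(f"{n_rules} rules generate {n_scenes} distinct derivations")
```

In this toy setting a dozen rules already yield a few hundred derivations. The sketch covers only the combinatorial structure; it leaves out the statistical constraints, the bottom-up detectors, and the C4 sampler described above.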
References
Barbu, A., & Zhu, S.-C. (2005). Generalizing Swendsen-Wang to sampling arbitrary posterior probabilities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27, 1239–1253.
Berg, A., Grabler, F., & Malik, J. (2007). Parsing images of architectural scenes. In IEEE 11th international conference on computer vision.
Chen, H., Xu, Z., Liu, Z., & Zhu, S.-C. (2006). Composite templates for cloth modeling and sketching. In Proceedings of the IEEE conference on computer vision and pattern recognition (Vol. 1, pp. 943–950).
Chi, Z., & Geman, S. (1998). Estimation of probabilistic context-free grammars. Computational Linguistics, 24(2).
Felzenszwalb, P., & Huttenlocher, D. (2005). Pictorial structures for object recognition. International Journal of Computer Vision, 61(1), 55–79.
Fischler, M., & Elschlager, R. (1973). The representation and matching of pictorial structures. IEEE Transactions on Computers, 22(1), 67–92.
Freund, Y., & Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55.
Fu, K. S. (1981). Syntactic pattern recognition and applications. New York: Prentice Hall.
Han, F., & Zhu, S.-C. (2005). Bottom-up and top-down image parsing by attribute graph grammar. In Proceedings of the international conference on computer vision (Vol. 2).
Hinz, S., & Baumgartner, A. (2000). Road extraction in urban areas supported by context objects. International Archives of Photogrammetry and Remote Sensing, 33.
Jin, Y., & Geman, S. (2006). Context and hierarchy in a probabilistic image model. In Proceedings of the IEEE conference on computer vision and pattern recognition (Vol. 2, pp. 2145–2152).
Johnson, M., Geman, S., Canon, S., Chi, Z., & Riezler, S. (1999). Estimators for stochastic unification-based grammars. In Proceedings ACL’99, Maryland.
Keselman, Y., & Dickinson, S. (2001). Generic model abstraction from examples. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27, 1141–1156.
Li, F.-F., & Perona, P. (2005). A Bayesian hierarchical model for learning natural scene categories. In Proceedings of the IEEE conference on computer vision and pattern recognition (Vol. 2, pp. 524–531).
Li, Y., Atmosukarto, I., Kobashi, M., Yuen, J., & Shapiro, L. (2005). Object and event recognition for aerial surveillance. In SPIE—the international society for optical engineering.
Maloof, M. A., Langley, P., Binford, T. O., Nevatia, R., & Sage, S. (2003). Improved rooftop detection in aerial images with machine learning. Machine Learning.
Matsuyama, T., & Hang, V. (1990). Sigma: A framework for image understanding integration of bottom-up and top-down analyses. New York: Plenum.
Moissinac, H., Matre, H., & Bloch, I. (1994). Urban aerial image understanding using symbolic data. In Image and signal processing for remote sensing, proc. SPIE.
Nicolas, B., Viglino, J., & Cocquerez, J. (2000). Knowledge based system for the automatic extraction of road intersections from aerial images. International Archives of Photogrammetry and Remote Sensing.
Ohta, Y. (1985). Knowledge-based interpretation of outdoor natural color scenes. London: Pitman.
Porway, J., & Zhu, S.-C. (2009). C4: Stochastic inference on graphical models with positive and negative edges for rapidly exploring competing solutions (Technical Report).
Porway, J., Wang, K., Yao, B., & Zhu, S.-C. (2008). A hierarchical and contextual model for aerial image understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Shotton, J., Winn, J., Rother, C., & Criminisi, A. (2006). TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In Proceedings of the European conference on computer vision (pp. 1–15).
Siddiqi, K., Shokoufandeh, A., Dickinson, S., & Zucker, S. W. (1999). Shock graphs and shape matching. International Journal of Computer Vision, 35(1), 13–32.
Singhal, A., Luo, J., & Zhu, W. (2003). Probabilistic spatial context models for scene content understanding. In IEEE computer society conference on computer vision and pattern recognition (Vol. 1).
Sivic, J., Russell, B., Efros, A., Zisserman, A., & Freeman, W. (2005). Discovering objects and their location in images. In Tenth IEEE international conference on computer vision.
Sudderth, E. B., Torralba, A., Freeman, W. T., & Willsky, A. S. (2005). Describing visual scenes using transformed Dirichlet processes. In Neural information processing systems.
Swendsen, R., & Wang, J. (1987). Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters.
Todorovic, S., & Ahuja, N. (2006). Extracting subimages of an unknown category from a set of images. In Proceedings of the IEEE conference on computer vision and pattern recognition (Vol. 1, pp. 927–934).
Tu, Z., & Zhu, S.-C. (2002). Image segmentation by data-driven Markov chain Monte Carlo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5), 657–673.
Ullman, S., Sali, E., & Vidal, M. (2001). A fragment-based approach to object representation and classification. In Proceedings of the 4th international workshop on visual form.
Vestri, C., & Devernay, F. (2001). Using robust methods for automatic extraction of buildings. In Proceedings of the IEEE conference on computer vision and pattern recognition (Vol. 1).
Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 511–518).
Wainwright, M., & Jordan, M. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1), 1–305.
Weber, M., Welling, M., & Perona, P. (2000). Towards automatic discovery of object categories. In Proceedings of the IEEE conference on computer vision and pattern recognition (Vol. 2, pp. 101–108).
Wei, L., & Prinet, V. (2005). Building detection from high-resolution satellite image using probability model. In Geoscience and remote sensing symposium, IGARSS (pp. 25–29).
Wu, T. F., Xia, G. S., & Zhu, S.-C. (2007). Compositional boosting for computing hierarchical image structures. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1–8).
Yao, B., Yang, X., & Zhu, S.-C. (2007). Introduction to a large scale general purpose groundtruth dataset: methodology, annotation tool, and benchmarks. Energy Minimization Methods in Computer Vision and Pattern Recognition, 4697, 169–183.
Zhao, T., & Nevatia, R. (2001). Car detection in low resolution aerial image. In IEEE international conference on computer vision (Vol. 1).
Zhu, S.-C., & Mumford, D. (2006). A stochastic grammar of images. Foundations and Trends in Computer Graphics and Vision, 2(4), 259–362.
Zhu, S.-C., Wu, Y.-N., & Mumford, D. (1998). Frame: Filters, random fields, and minimax entropy towards a unified theory for texture modeling. International Journal of Computer Vision, 2, 107–126.
Zhu, L., Lin, C., Huang, H., Chen, Y., & Yuille, A. (2008). Unsupervised structure learning: Hierarchical recursive composition, suspicious coincidence and competitive exclusion. In Proceedings of the 10th European conference on computer vision: Part II.
Rights and permissions
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
Porway, J., Wang, Q. & Zhu, S.C. A Hierarchical and Contextual Model for Aerial Image Parsing. Int J Comput Vis 88, 254–283 (2010). https://doi.org/10.1007/s11263-009-0306-1
DOI: https://doi.org/10.1007/s11263-009-0306-1