Describing Visual Scenes Using Transformed Objects and Parts

Sudderth, Erik B.; Torralba, Antonio; Freeman, William T.; Willsky, Alan S.

doi:10.1007/s11263-007-0069-5

Published: 09 August 2007

International Journal of Computer Vision, Volume 77, pages 291–330 (2008)

Erik B. Sudderth¹, Antonio Torralba², William T. Freeman² & Alan S. Willsky²

Abstract

We develop hierarchical, probabilistic models for objects, the parts composing them, and the visual scenes surrounding them. Our approach couples topic models originally developed for text analysis with spatial transformations, and thus consistently accounts for geometric constraints. By building integrated scene models, we may discover contextual relationships, and better exploit partially labeled training images. We first consider images of isolated objects, and show that sharing parts among object categories improves detection accuracy when learning from few examples. Turning to multiple object scenes, we propose nonparametric models which use Dirichlet processes to automatically learn the number of parts underlying each object category, and objects composing each scene. The resulting transformed Dirichlet process (TDP) leads to Monte Carlo algorithms which simultaneously segment and recognize objects in street and office scenes.
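
To illustrate how a Dirichlet process lets the number of parts or objects grow with the data instead of being fixed in advance, the following is a minimal Python sketch of the Chinese restaurant process, a standard sequential view of Dirichlet process clustering. It is not the authors' TDP sampler: the function name `sample_crp`, the concentration parameter `alpha`, and the seed are illustrative assumptions, and the spatial transformations and image features of the paper are omitted entirely.

```python
import random


def sample_crp(num_items, alpha, seed=0):
    """Draw cluster labels from a Chinese restaurant process (illustrative only).

    num_items: how many items (e.g. image features) to assign.
    alpha:     concentration parameter (> 0); larger values favor new clusters.
    Returns (assignments, counts), where counts[k] is the size of cluster k.
    """
    rng = random.Random(seed)
    assignments = []
    counts = []  # counts[k] = number of items already assigned to cluster k
    for _ in range(num_items):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is chosen with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, counts


if __name__ == "__main__":
    labels, sizes = sample_crp(200, alpha=1.0, seed=1)
    print("number of clusters:", len(sizes))
    print("cluster sizes:", sizes)
```

The number of clusters is an outcome of sampling, not an input: it tends to grow slowly with the number of items and faster for larger `alpha`. This is the property the abstract refers to when it says the number of parts and objects is learned automatically.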

Author information

Authors and Affiliations

Computer Science Division, University of California, Berkeley, USA
Erik B. Sudderth
Electrical Engineering & Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
Antonio Torralba, William T. Freeman & Alan S. Willsky

Corresponding author

Correspondence to Erik B. Sudderth.

About this article

Cite this article

Sudderth, E.B., Torralba, A., Freeman, W.T. et al. Describing Visual Scenes Using Transformed Objects and Parts. Int J Comput Vis 77, 291–330 (2008). https://doi.org/10.1007/s11263-007-0069-5

Received: 20 September 2005
Accepted: 29 May 2007
Published: 09 August 2007
Issue Date: May 2008
DOI: https://doi.org/10.1007/s11263-007-0069-5
