Describing Visual Scenes Using Transformed Objects and Parts

International Journal of Computer Vision

Abstract

We develop hierarchical, probabilistic models for objects, the parts composing them, and the visual scenes surrounding them. Our approach couples topic models originally developed for text analysis with spatial transformations, and thus consistently accounts for geometric constraints. By building integrated scene models, we may discover contextual relationships, and better exploit partially labeled training images. We first consider images of isolated objects, and show that sharing parts among object categories improves detection accuracy when learning from few examples. Turning to multiple object scenes, we propose nonparametric models which use Dirichlet processes to automatically learn the number of parts underlying each object category, and objects composing each scene. The resulting transformed Dirichlet process (TDP) leads to Monte Carlo algorithms which simultaneously segment and recognize objects in street and office scenes.

Author information

Corresponding author

Correspondence to Erik B. Sudderth.

About this article

Cite this article

Sudderth, E.B., Torralba, A., Freeman, W.T. et al. Describing Visual Scenes Using Transformed Objects and Parts. Int J Comput Vis 77, 291–330 (2008). https://doi.org/10.1007/s11263-007-0069-5

  • DOI: https://doi.org/10.1007/s11263-007-0069-5
