Describing Visual Scenes Using Transformed Objects and Parts

  • Erik B. Sudderth
  • Antonio Torralba
  • William T. Freeman
  • Alan S. Willsky
Article

Abstract

We develop hierarchical, probabilistic models for objects, the parts composing them, and the visual scenes surrounding them. Our approach couples topic models originally developed for text analysis with spatial transformations, and thus consistently accounts for geometric constraints. By building integrated scene models, we may discover contextual relationships, and better exploit partially labeled training images. We first consider images of isolated objects, and show that sharing parts among object categories improves detection accuracy when learning from few examples. Turning to multiple object scenes, we propose nonparametric models which use Dirichlet processes to automatically learn the number of parts underlying each object category, and objects composing each scene. The resulting transformed Dirichlet process (TDP) leads to Monte Carlo algorithms which simultaneously segment and recognize objects in street and office scenes.
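
As a rough illustration of how a Dirichlet process lets the number of components be inferred rather than fixed in advance, the sketch below samples cluster assignments from a Chinese restaurant process, a standard representation of the Dirichlet process prior. It is not the transformed Dirichlet process of this article: it has no spatial transformations, no hierarchy, and no image features. The function name `sample_crp_assignments` and its parameters are hypothetical, chosen only for this example.

```python
import random


def sample_crp_assignments(num_items, concentration, seed=0):
    """Sample cluster labels from a Chinese restaurant process (illustrative only).

    Item n (0-based) joins existing cluster k with probability
    counts[k] / (n + concentration) and opens a new cluster with
    probability concentration / (n + concentration).
    `concentration` must be positive.
    """
    rng = random.Random(seed)
    counts = []       # counts[k] = number of items currently in cluster k
    assignments = []  # assignments[n] = cluster label of item n
    for _ in range(num_items):
        # Unnormalized weights: one per existing cluster, plus one for a new cluster.
        weights = counts + [concentration]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, counts


if __name__ == "__main__":
    labels, sizes = sample_crp_assignments(num_items=200, concentration=2.0)
    print("number of clusters:", len(sizes))
    print("cluster sizes:", sizes)
```

The number of clusters is not an input: it grows with the amount of data, and larger values of `concentration` tend to produce more clusters. In the article's setting, the analogous quantities would be the parts of an object category or the objects in a scene.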

Keywords

  • Object recognition
  • Dirichlet process
  • Hierarchical Dirichlet process
  • Transformation
  • Context
  • Graphical models
  • Scene analysis

Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  • Erik B. Sudderth (1)
  • Antonio Torralba (2)
  • William T. Freeman (2)
  • Alan S. Willsky (2)
  1. Computer Science Division, University of California, Berkeley, USA
  2. Electrical Engineering & Computer Science, Massachusetts Institute of Technology, Cambridge, USA