International Journal of Computer Vision, Volume 81, Issue 1, pp 2–23

TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context

  • Jamie Shotton
  • John Winn
  • Carsten Rother
  • Antonio Criminisi
Article

Abstract

This paper details a new approach for learning a discriminative model of object classes, incorporating texture, layout, and context information efficiently. The learned model is used for automatic visual understanding and semantic segmentation of photographs. Our discriminative model exploits texture-layout filters, novel features based on textons, which jointly model patterns of texture and their spatial layout. Unary classification and feature selection are achieved using shared boosting to give an efficient classifier which can be applied to a large number of classes. Accurate image segmentation is achieved by incorporating the unary classifier in a conditional random field, which (i) captures the spatial interactions between class labels of neighboring pixels, and (ii) improves the segmentation of specific object instances. Efficient training of the model on large datasets is achieved by exploiting both random feature selection and piecewise training methods.

High classification and segmentation accuracy is demonstrated on four varied databases: (i) the MSRC 21-class database containing photographs of real objects viewed under general lighting conditions, poses and viewpoints, (ii) the 7-class Corel subset and (iii) the 7-class Sowerby database used in He et al. (Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 695–702, June 2004), and (iv) a set of video sequences of television shows. The proposed algorithm gives competitive and visually pleasing results for objects that are highly textured (grass, trees, etc.), highly structured (cars, faces, bicycles, airplanes, etc.), and even articulated (body, cow, etc.).
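
The texture-layout filters mentioned in the abstract can be illustrated with a small sketch. It assumes, as one plausible reading, that a filter pairs a texton index with a rectangle offset from the pixel of interest, and that its response is the fraction of the rectangle covered by that texton, computed with one integral image per texton. The function names, the `rect` offset convention, and the normalisation by the unclipped rectangle area are assumptions made for illustration, not details taken from this page.

```python
import numpy as np


def texton_integral_images(texton_map, num_textons):
    """Build one integral image per texton (hypothetical helper).

    ii[t, y, x] is the number of pixels with texton t in texton_map[:y, :x].
    """
    h, w = texton_map.shape
    ii = np.zeros((num_textons, h + 1, w + 1), dtype=np.int64)
    for t in range(num_textons):
        mask = (texton_map == t).astype(np.int64)
        ii[t, 1:, 1:] = mask.cumsum(axis=0).cumsum(axis=1)
    return ii


def filter_response(ii, t, y, x, rect):
    """Fraction of a rectangle, offset from pixel (y, x), covered by texton t.

    rect = (dy0, dx0, dy1, dx1) holds half-open offsets relative to (y, x),
    with dy1 > dy0 and dx1 > dx0. The rectangle is clipped to the image, but
    the count is divided by the unclipped area (an assumed convention).
    """
    h, w = ii.shape[1] - 1, ii.shape[2] - 1
    dy0, dx0, dy1, dx1 = rect
    y0, y1 = np.clip([y + dy0, y + dy1], 0, h)
    x0, x1 = np.clip([x + dx0, x + dx1], 0, w)
    area = (dy1 - dy0) * (dx1 - dx0)
    # Four lookups give the texton count inside the clipped rectangle.
    count = ii[t, y1, x1] - ii[t, y0, x1] - ii[t, y1, x0] + ii[t, y0, x0]
    return count / area
```

Because each response costs four array lookups regardless of rectangle size, a boosting procedure can evaluate many candidate (rectangle, texton) pairs cheaply, which is the kind of efficiency the abstract alludes to.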

Keywords

Image understanding; Object recognition; Segmentation; Texture; Layout; Context; Textons; Conditional random field; Boosting; Semantic image segmentation; Piecewise training

References

  1. Amit, Y., Geman, D., & Wilder, K. (1997). Joint induction of shape features and tree classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 1300–1305.
  2. Baluja, S., & Rowley, H. A. (2005). Boosting sex identification performance. In AAAI (pp. 1508–1513).
  3. Beal, M. J. (2003). Variational algorithms for approximate Bayesian inference. Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London.
  4. Beis, J. S., & Lowe, D. G. (1997). Shape indexing using approximate nearest-neighbour search in high-dimensional spaces. In Proceedings of IEEE conference on computer vision and pattern recognition (pp. 1000–1006). June 1997.
  5. Belongie, S., Malik, J., & Puzicha, J. (2002). Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4), 509–522.
  6. Berg, A. C., Berg, T. L., & Malik, J. (2005). Shape matching and object recognition using low distortion correspondences. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 1, pp. 26–33). June 2005.
  7. Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.
  8. Blake, A., Rother, C., Brown, M., Perez, P., & Torr, P. H. S. (2004). Interactive image segmentation using an adaptive GMMRF model. In T. Pajdla & J. Matas (Eds.), LNCS: Vol. 3021. Proceedings of European conference on computer vision (pp. 428–441). Prague, Czech Republic, May 2004. New York: Springer.
  9. Borenstein, E., Sharon, E., & Ullman, S. (2004). Combining top-down and bottom-up segmentations. In IEEE workshop on perceptual organization in computer vision (Vol. 4, p. 46).
  10. Boykov, Y., & Jolly, M.-P. (2001). Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In Proceedings of international conference on computer vision (Vol. 1, pp. 105–112). Vancouver, Canada, July 2001.
  11. Criminisi, A., Perez, P., & Toyama, K. (2004). Region filling and object removal by exemplar-based inpainting. IEEE Transactions on Image Processing, 13(9), 1200–1212.
  12. Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
  13. Dollár, P., Tu, Z., & Belongie, S. (2006). Supervised learning of edges and object boundaries. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 2, pp. 1964–1971).
  14. Duygulu, P., Barnard, K., de Freitas, N., & Forsyth, D. (2002). Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In A. Heyden, G. Sparr, & P. Johansen (Eds.), LNCS: Vol. 2353. Proceedings of European conference on computer vision (pp. 97–112). May 2002. New York: Springer.
  15. Elkan, C. (2003). Using the triangle inequality to accelerate k-means. In Proceedings of international conference on machine learning (pp. 147–153).
  16. Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In Proceedings of CVPR 2004. Workshop on generative-model based vision.
  17. Fergus, R., Perona, P., & Zisserman, A. (2003). Object class recognition by unsupervised scale-invariant learning. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 2, pp. 264–271). June 2003.
  18. Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2), 337–407.
  19. He, X., Zemel, R. S., & Carreira-Perpiñán, M. Á. (2004). Multiscale conditional random fields for image labeling. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 2, pp. 695–702). June 2004.
  20. He, X., Zemel, R. S., & Ray, D. (2006). Learning and incorporating top-down cues in image segmentation. In A. Leonardis, H. Bischof, & A. Pinz (Eds.), LNCS: Vol. 3951. Proceedings of European conference on computer vision (pp. 338–351). May 2006. New York: Springer.
  21. Johnson, M., Brostow, G., Shotton, J., Arandjelovic, O., Kwatra, V., & Cipolla, R. (2006). Semantic photo synthesis. Computer Graphics Forum, 25(3), 407–413.
  22. Jones, D. G., & Malik, J. (1992). A computational framework for determining stereo correspondence from a set of linear spatial filters. In Proceedings of European conference on computer vision (pp. 395–410).
  23. Julesz, B. (1981). Textons, the elements of texture perception, and their interactions. Nature, 290(5802), 91–97.
  24. Kohli, P., & Torr, P. H. S. (2005). Efficiently solving dynamic Markov random fields using graph cuts. In Proceedings of international conference on computer vision (Vol. 2, pp. 922–929). Beijing, China, October 2005.
  25. Kolmogorov, V., & Zabih, R. (2004). What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2), 147–159.
  26. Konishi, S., & Yuille, A. L. (2000). Statistical cues for domain specific image segmentation with performance analysis. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 1, pp. 125–132). June 2000.
  27. Kumar, M. P., Torr, P. H. S., & Zisserman, A. (2005). OBJ CUT. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 1, pp. 18–25). June 2005.
  28. Kumar, S., & Hebert, M. (2003). Discriminative random fields: A discriminative framework for contextual interaction in classification. In Proceedings of international conference on computer vision (Vol. 2, pp. 1150–1157). October 2003.
  29. Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of international conference on machine learning (pp. 282–289).
  30. Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In Proceedings of IEEE conference on computer vision and pattern recognition.
  31. Leibe, B., & Schiele, B. (2003). Interleaved object categorization and segmentation. In Proceedings of British machine vision conference (Vol. II, pp. 264–271).
  32. Lepetit, V., Lagger, P., & Fua, P. (2005). Randomized trees for real-time keypoint recognition. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 2, pp. 775–781). June 2005.
  33. Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
  34. Malik, J., Belongie, S., Leung, T., & Shi, J. (2001). Contour and texture analysis for image segmentation. International Journal of Computer Vision, 43(1), 7–27.
  35. Marszałek, M., & Schmid, C. (2007). Semantic hierarchies for visual object recognition. In Proceedings of IEEE conference on computer vision and pattern recognition. June 2007.
  36. Mikolajczyk, K., & Schmid, C. (2002). An affine invariant interest point detector. In A. Heyden, G. Sparr, & P. Johansen (Eds.), LNCS: Vol. 2350. Proceedings of European conference on computer vision (pp. 128–142). May 2002. New York: Springer.
  37. Pearl, J. (1988). Probabilistic reasoning in intelligent systems. San Mateo: Morgan Kaufmann.
  38. Porikli, F. M. (2005). Integral histogram: A fast way to extract histograms in Cartesian spaces. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 1, pp. 829–836). June 2005.
  39. Ren, X., Fowlkes, C., & Malik, J. (2006). Figure/ground assignment in natural images. In A. Leonardis, H. Bischof, & A. Pinz (Eds.), Proceedings of European conference on computer vision (Vol. 2, pp. 614–627). Graz, Austria, May 2006. New York: Springer.
  40. Rother, C., Kolmogorov, V., & Blake, A. (2004). GrabCut—interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 23(3), 309–314.
  41. Rother, C., Bordeaux, L., Hamadi, Y., & Blake, A. (2006). AutoCollage. ACM Transactions on Graphics, 25(3), 847–852.
  42. Russell, B. C., Torralba, A., Murphy, K. P., & Freeman, W. T. (2005). LabelMe: database and web-based tool for image annotation (Technical Report 25). MIT AI Lab, September 2005.
  43. Shotton, J., Winn, J., Rother, C., & Criminisi, A. (2006). TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In A. Leonardis, H. Bischof, & A. Pinz (Eds.), LNCS: Vol. 3951. Proceedings of European conference on computer vision (pp. 1–15). May 2006. New York: Springer.
  44. Sutton, C., & McCallum, A. (2005). Piecewise training of undirected models. In Proceedings of conference on uncertainty in artificial intelligence.
  45. Torralba, A., Murphy, K. P., & Freeman, W. T. (2007). Sharing visual features for multiclass and multiview object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5), 854–869.
  46. Tu, Z., Chen, X., Yuille, A. L., & Zhu, S. C. (2003). Image parsing: unifying segmentation, detection, and recognition. In Proceedings of international conference on computer vision (Vol. 1, pp. 18–25). Nice, France, October 2003.
  47. Varma, M., & Zisserman, A. (2005). A statistical approach to texture classification from single images. International Journal of Computer Vision, 62(1–2), 61–81.
  48. Viola, P., & Jones, M. J. (2001). Rapid object detection using a boosted cascade of simple features. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 1, pp. 511–518). December 2001.
  49. Winn, J., & Jojic, N. (2005). LOCUS: Learning object classes with unsupervised segmentation. In Proceedings of international conference on computer vision (Vol. 1, pp. 756–763). Beijing, China, October 2005.
  50. Winn, J., & Shotton, J. (2006). The layout consistent random field for recognizing and segmenting partially occluded objects. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 1, pp. 37–44). June 2006.
  51. Winn, J., Criminisi, A., & Minka, T. (2005). Categorization by learned universal visual dictionary. In Proceedings of international conference on computer vision (Vol. 2, pp. 1800–1807). Beijing, China, October 2005.
  52. Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2003). Understanding belief propagation and its generalizations. San Mateo: Morgan Kaufmann.
  53. Yin, P., Criminisi, A., Winn, J., & Essa, I. (2007). Tree based classifiers for bilayer video segmentation. In Proceedings of IEEE conference on computer vision and pattern recognition.

Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  • Jamie Shotton (1)
  • John Winn (2)
  • Carsten Rother (2)
  • Antonio Criminisi (2)

  1. Machine Intelligence Laboratory, University of Cambridge, Cambridge, UK
  2. Microsoft Research Cambridge, Cambridge, UK
