Visual Dictionary Learning for Joint Object Categorization and Segmentation

  • Aastha Jain
  • Luca Zappella
  • Patrick McClure
  • René Vidal
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7576)

Abstract

Representing objects using elements from a visual dictionary is widely used in object detection and categorization. Prior work on dictionary learning has shown improvements in the accuracy of object detection and categorization by learning discriminative dictionaries. However, none of these dictionaries is learnt for joint object categorization and segmentation. Moreover, dictionary learning is often done separately from classifier training, which reduces the discriminative power of the model. In this paper, we formulate the semantic segmentation problem as a joint categorization, segmentation and dictionary learning problem. To that end, we propose a latent conditional random field (CRF) model in which the observed variables are pixel category labels and the latent variables are visual word assignments. The CRF energy consists of a bottom-up segmentation cost, a top-down bag of (latent) words categorization cost, and a dictionary learning cost. Together, these costs capture relationships between image features and visual words, relationships between visual words and object categories, and spatial relationships among visual words. The segmentation, categorization, and dictionary learning parameters are learnt jointly using latent structural SVMs, and the segmentation and visual word assignments are inferred jointly using energy minimization techniques. Experiments on the Graz02 and CamVid datasets demonstrate the performance of our approach.
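
To make the structure of the model concrete, the energy described above can be sketched schematically as follows. The notation (c for pixel category labels, w for latent visual word assignments, x for image features, θ for the model parameters) and the names of the three terms are shorthand introduced here for illustration; the paper's precise potentials are not reproduced in this excerpt.

  % Schematic latent-CRF energy implied by the abstract (notation assumed):
  % a bottom-up segmentation term, a bag-of-(latent)-words categorization term,
  % and a dictionary learning term.
  E(c, w \mid x; \theta) \;=\; E_{\mathrm{seg}}(c, x; \theta) \;+\; E_{\mathrm{cat}}(c, w; \theta) \;+\; E_{\mathrm{dict}}(w, x; \theta)

  % Joint inference over category labels and latent visual word assignments:
  (\hat{c}, \hat{w}) \;=\; \arg\min_{c,\, w} \; E(c, w \mid x; \theta)

During training, θ is learnt with a latent structural SVM, treating w as the latent variable; at test time, c and w are inferred jointly by minimizing this energy, as stated in the abstract.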

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Aastha Jain (1)
  • Luca Zappella (1)
  • Patrick McClure (1)
  • René Vidal (1)

  1. Center for Imaging Science, Johns Hopkins University, USA
