Image Representation Learning by Deep Appearance and Spatial Coding

  • Bingyuan Liu
  • Jing Liu
  • Zechao Li
  • Hanqing Lu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9003)


The bag-of-features model is one of the most successful models for representing an image in classification tasks. However, the loss of discriminative information in local appearance coding and the lack of spatial information limit its performance. To address these problems, we propose a deep appearance and spatial coding model that builds a better image representation for classification. The proposed model is a hierarchical architecture consisting of three operations: appearance coding, max-pooling and spatial coding. First, given an input image, we extract a set of local descriptors and apply appearance coding to encode them into robust high-dimensional vectors. Max-pooling is then performed over the spatially partitioned grids to incorporate spatial information. After that, spatial coding progressively integrates the region vectors into a global image signature. Finally, the resulting image representation is used to train a one-versus-others SVM classifier. To learn the proposed model, we pre-train the network layer by layer and then perform supervised fine-tuning with image labels. Experiments on three image benchmark datasets (15-Scenes, PASCAL VOC 2007 and Caltech-256) demonstrate the effectiveness of the proposed model.
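
As an illustration of the max-pooling step described in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a regular grid partition, non-negative code values (for example, hidden-unit activations in [0, 1]), and descriptor positions given in pixels. The function name `max_pool_grid` and its parameters are hypothetical.

```python
from typing import List, Sequence, Tuple


def max_pool_grid(
    codes: Sequence[Sequence[float]],
    positions: Sequence[Tuple[float, float]],
    width: float,
    height: float,
    grid: int = 2,
) -> List[List[float]]:
    """Max-pool coded local descriptors inside each cell of a grid x grid partition.

    codes:     one code vector per local descriptor, all of the same length
    positions: (x, y) location of each descriptor, 0 <= x < width, 0 <= y < height
    Returns grid * grid region vectors in row-major order.
    Assumption: codes are non-negative, so empty cells stay at zero.
    """
    dim = len(codes[0])
    pooled = [[0.0] * dim for _ in range(grid * grid)]
    for code, (x, y) in zip(codes, positions):
        # Map the descriptor position to its grid cell, clamping at the border.
        col = min(int(x * grid / width), grid - 1)
        row = min(int(y * grid / height), grid - 1)
        cell = pooled[row * grid + col]
        # Keep the element-wise maximum of all codes that fall in this cell.
        for k, value in enumerate(code):
            if value > cell[k]:
                cell[k] = value
    return pooled


# Toy example: three 2-dimensional codes on a 10 x 10 image, pooled on a 2 x 2 grid.
codes = [[0.1, 0.9], [0.5, 0.2], [0.3, 0.3]]
positions = [(1, 1), (2, 3), (9, 9)]
regions = max_pool_grid(codes, positions, width=10, height=10, grid=2)
```

In this sketch the first two descriptors fall into the top-left cell and the third into the bottom-right cell. The resulting region vectors are what a later spatial coding stage would take as input.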


Keywords (machine-generated, not supplied by the authors): Hide Unit; Sparse Code; Spatial Code; Feature Code; Restrict Boltzmann Machine



This work was supported by the 863 Program (2014AA015104) and the National Natural Science Foundation of China (61332016, 61272329, 61472422, and 61273034).



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Institute of Automation, Chinese Academy of Sciences, Beijing, China
  2. School of Computer Science, Nanjing University of Science and Technology, Nanjing, China
