Robust Scene Classification with Cross-Level LLC Coding on CNN Features

  • Zequn Jie
  • Shuicheng Yan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9004)


Convolutional Neural Network (CNN) features have demonstrated outstanding performance as global representations for image classification, but they lack invariance to scale transformations, which makes them difficult to adapt to complex tasks such as scene classification. To strengthen the scale invariance of CNN features while retaining their discriminative power in scene classification, we propose a framework that combines cross-level Locality-constrained Linear Coding (LLC) with cascaded fine-tuned CNN features, abbreviated as cross-level LLC-CNN. Specifically, the framework first fine-tunes multi-level CNNs in a cascaded way, then extracts multi-level CNN features to learn a cross-level universal codebook, and finally performs LLC and max-pooling on the patches of all levels to form the final representation. Experiments verify that the LLC responses on the universal codebook outperform the CNN features and achieve state-of-the-art performance on the two currently largest scene classification benchmarks, MIT Indoor Scenes and SUN 397.
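
The coding stage described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: random vectors stand in for the CNN features of patches at three levels, the codebook is learned with a few plain k-means iterations (an assumption, since the abstract does not name the clustering method), and the function names `learn_codebook` and `llc_encode`, the neighbourhood size `k`, and the regulariser `beta` are hypothetical choices. The coding step follows the approximated LLC solution of Wang et al. (CVPR 2010), which codes each patch using only its k nearest codewords.

```python
import numpy as np


def learn_codebook(features, n_words, n_iter=10, seed=0):
    """Learn a codebook with a few Lloyd (k-means) iterations (assumed method)."""
    rng = np.random.default_rng(seed)
    picks = rng.choice(len(features), n_words, replace=False)
    centers = features[picks].copy()
    for _ in range(n_iter):
        # Squared distance from every feature to every centre: (n, n_words).
        d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for j in range(n_words):
            members = features[assign == j]
            if len(members):  # keep the old centre if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers


def llc_encode(patches, codebook, k=5, beta=1e-4):
    """Approximated LLC: reconstruct each patch from its k nearest codewords."""
    n, m = patches.shape[0], codebook.shape[0]
    d2 = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    codes = np.zeros((n, m))
    for i in range(n):
        idx = nearest[i]
        z = codebook[idx] - patches[i]  # neighbours shifted to the patch, (k, D)
        cov = z @ z.T                   # local covariance, (k, k)
        cov += (beta * np.trace(cov) + 1e-12) * np.eye(k)  # regularise
        w = np.linalg.solve(cov, np.ones(k))
        codes[i, idx] = w / w.sum()     # enforce the sum-to-one constraint
    return codes


# Stand-in data: 1, 4 and 9 patches at three levels, 16-D features each.
rng = np.random.default_rng(0)
levels = [rng.standard_normal((n, 16)) for n in (1, 4, 9)]
all_patches = np.vstack(levels)

# One universal codebook shared across levels, then LLC and max-pooling.
codebook = learn_codebook(all_patches, n_words=8)
codes = llc_encode(all_patches, codebook, k=3)
image_repr = codes.max(axis=0)  # max-pool over the patches of all levels
```

In this sketch the pooled vector has one entry per codeword, so its length is independent of how many patches each level contributes. Whether the original method pools all levels jointly or pools per level and concatenates is not stated in the abstract; joint pooling is assumed here.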


Recognition Accuracy, Convolutional Neural Network, Code Word, Scale Transformation, Scene Classification



This research is supported by the National Research Foundation, Prime Minister's Office, Singapore, under its International Research Centre @ Singapore Funding Initiative and administered by the Interactive & Digital Media Programme Office.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Keio-NUS CUTE Center, National University of Singapore, Singapore, Singapore
  2. Department of Electrical and Computer Engineering, National University of Singapore, Singapore, Singapore
