Unsupervised Learning of Semantics of Object Detections for Scene Categorization

  • Grégoire Mesnil
  • Salah Rifai
  • Antoine Bordes
  • Xavier Glorot
  • Yoshua Bengio
  • Pascal Vincent
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 318)


Classifying scenes (e.g. into “street”, “home” or “leisure”) is an important but difficult task, because images exhibit variability, ambiguity, and a wide range of illumination and scale conditions. Standard approaches build an intermediate representation of the global image and learn classifiers on it. Recently, it has been proposed to depict an image as an aggregation of the objects it contains: the representation on which classifiers are trained is composed of many heterogeneous feature vectors derived from various object detectors. In this paper, we study different approaches to efficiently learn contextual semantics from these object detections. We use the features provided by Object-Bank [24] (177 different object detectors producing 252 attributes each), and show on several scene categorization benchmarks that careful combinations, which take the structure of the data into account, greatly improve over the original results (from \(+5\) to \(+11\,\%\)) while drastically reducing the dimensionality of the representation by 97 % (from 44,604 to 1,000). We also show that the uncertainty of the object detectors hampers the use of external semantic knowledge to improve the combination of detectors, whereas our unsupervised learning approach is not affected by it.
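To make the dimensionality figures above concrete, here is a minimal sketch of one way a structure-aware reduction could look. It is not the method of the paper: plain PCA is swapped in as a stand-in for the learned combination. The sketch assumes that the 44,604 features are laid out detector by detector (177 consecutive blocks of 252 attributes). The function names `pca_fit`, `pca_apply` and `structured_reduce` are hypothetical, and the demo uses toy sizes and random data.

```python
import numpy as np

# Figures taken from the abstract.
N_DETECTORS = 177                        # object detectors in Object-Bank
N_ATTRIBUTES = 252                       # attributes produced by each detector
RAW_DIM = N_DETECTORS * N_ATTRIBUTES     # 44,604 raw features per image
TARGET_DIM = 1000                        # size of the reduced representation
REDUCTION = 1.0 - TARGET_DIM / RAW_DIM   # fraction of dimensions removed


def pca_fit(X, k):
    """Return (mean, components) for the top-k principal directions of X's rows."""
    n, d = X.shape
    if k > min(n, d):
        raise ValueError("k cannot exceed min(n_samples, n_features)")
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]


def pca_apply(X, mean, components):
    """Project the rows of X onto the given principal directions."""
    return (X - mean) @ components.T


def structured_reduce(X, n_detectors, n_attributes, per_detector_k, final_k):
    """Reduce each detector's block separately, then reduce the concatenation.

    Assumption: columns of X are grouped by detector, so reshaping to
    (n_samples, n_detectors, n_attributes) recovers one block per detector.
    """
    n = X.shape[0]
    blocks = X.reshape(n, n_detectors, n_attributes)
    pooled = []
    for d in range(n_detectors):
        mean, comps = pca_fit(blocks[:, d, :], per_detector_k)
        pooled.append(pca_apply(blocks[:, d, :], mean, comps))
    pooled = np.concatenate(pooled, axis=1)   # (n, n_detectors * per_detector_k)
    mean, comps = pca_fit(pooled, final_k)
    return pca_apply(pooled, mean, comps)     # (n, final_k)


# Toy demo with small, made-up sizes (5 detectors x 8 attributes, 60 images).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 5 * 8))
Z_toy = structured_reduce(X_toy, n_detectors=5, n_attributes=8,
                          per_detector_k=3, final_k=4)
print(RAW_DIM, Z_toy.shape)
```

With the real sizes, `RAW_DIM` evaluates to 44,604 and `REDUCTION` is a little under 0.98, in line with the roughly 97 % reduction quoted in the abstract. Reducing each detector's block before combining them is only one possible way of using the structure of the data.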


Unsupervised learning, Transfer learning, Deep learning, Scene categorization, Object detection



We would like to thank Gloria Zen for her helpful comments. This work was supported by NSERC, CIFAR, the Canada Research Chairs, Compute Canada and by the French ANR Project ASAP ANR-09-EMER-001. The code for the experiments was implemented using the Theano [4] machine learning library.


  1. Baldi, P., Hornik, K.: Neural networks and principal component analysis: learning from examples without local minima. Neural Netw. 2, 53–58 (1989)
  2. Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training of deep networks. Adv. Neural Inf. Proc. Sys. 19, 153–160 (2007)
  3. Bengio, Y.: Learning deep architectures for AI. Found. Trends Mach. Learn. 2(1), 1–127 (2009). Also published as a book. Now Publishers, 2009
  4. Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., Bengio, Y.: Theano: a CPU and GPU math expression compiler. In: Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral Presentation
  5. Bosch, A., Zisserman, A., Muñoz, X.: Scene classification via pLSA. In: Proceedings of the ECCV, pp. 517–530 (2006)
  6. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR09 (2009)
  7. Espinace, P., Kollar, T., Soto, A., Roy, N.: Indoor scene recognition through object detection. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Anchorage, AK (2010)
  8. Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., Lin, C.-J.: Liblinear: a library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)
  9. Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.: Describing objects by their attributes. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1778–1785 (2009)
  10. Fei-Fei, L., Perona, P.: A Bayesian hierarchical model for learning natural scene categories. In: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Volume 2, pp. 524–531. IEEE Computer Society (2005)
  11. Felzenszwalb, P., McAllester, D., Ramanan, D.: A discriminatively trained, multiscale, deformable part model. In: CVPR (2008)
  12. Gao, S., Tsang, I., Chia, L., Zhao, P.: Local features are not lonely Laplacian sparse coding for image classification. In: IEEE Conference on Computer Vision and Pattern Recognition (2010)
  13. Goodfellow, I., Le, Q., Saxe, A., Ng, A.: Measuring invariances in deep networks. In: NIPS’09, pp. 646–654 (2009)
  14. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Comput. 18, 1527–1554 (2006)
  15. Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. Mach. Learn. 42, 177–196 (2001)
  16. Hoiem, D., Efros, A.A., Hebert, M.: Automatic photo pop-up. SIGGRAPH 24(3), 577–584 (2005)
  17. Hotelling, H.: Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 24, 417–441, 498–520 (1933)
  18. Jarrett, K., Kavukcuoglu, K., Ranzato, M., LeCun, Y.: What is the best multi-stage architecture for object recognition? In: Proceedings of the International Conference on Computer Vision (ICCV’09), pp. 2146–2153. IEEE (2009)
  19. Kavukcuoglu, K., Ranzato, M., Fergus, R., LeCun, Y.: Learning invariant features through topographic filter maps. In: Proceedings of the CVPR’09, pp. 1605–1612. IEEE (2009)
  20. Larochelle, H., Bengio, Y., Louradour, J., Lamblin, P.: Exploring strategies for training deep neural networks. JMLR 10, 1–40 (2009)
  21. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: IEEE Conference on Computer Vision and Pattern Recognition (2006)
  22. LeCun, Y., Haffner, P., Bottou, L., Bengio, Y.: Object recognition with gradient-based learning. In: Shape, Contour and Grouping in Computer Vision, pp. 319–345. Springer (1999)
  23. Li, L.-J., Fei-Fei, L.: What, where and who? Classifying events by scene and object recognition. In: ICCV (2007)
  24. Li, L.-J., Su, H., Xing, E.P., Fei-Fei, L.: Object bank: a high-level image representation for scene classification and semantic feature sparsification. In: Proceedings of the Neural Information Processing Systems (NIPS) (2010)
  25. Li, L.-J., Su, H., Lim, Y., Fei-Fei, L.: Objects as attributes for scene classification. In: European Conference of Computer Vision (ECCV), International Workshop on Parts and Attributes, Crete, Greece, September 2010
  26. Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E., Muller, X., Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A., Bergstra, J.: Unsupervised and transfer learning challenge: a deep learning approach. In: Guyon, I., Dror, G., Lemaire, V., Taylor, G., Silver, D. (eds.) JMLR W & CP: Proceedings of the Unsupervised and Transfer Learning challenge and workshop, vol. 27, pp. 97–110 (2012)
  27. Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995)
  28. Oliva, A., Torralba, A.: Building the gist of a scene: the role of global image features in recognition. In: Visual Perception, Progress in Brain Research, vol. 155 (2006)
  29. Pandey, M., Lazebnik, S.: Scene recognition and weakly supervised object localization with deformable part-based models. In: ICCV (2011)
  30. Pearson, K.: On lines and planes of closest fit to systems of points in space. Phil. Mag. 2(6), 559–572 (1901)
  31. Quattoni, A., Torralba, A.: Recognizing indoor scenes. In: CVPR (2009)
  32. Ranzato, M., Poultney, C., Chopra, S., LeCun, Y.: Efficient learning of sparse representations with an energy-based model. In: NIPS’06 (2007)
  33. Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., Glorot, X.: Higher order contractive auto-encoder. In: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) (2011)
  34. Rifai, S., Vincent, P., Muller, X., Glorot, X., Bengio, Y.: Contractive auto-encoders: explicit invariance during feature extraction. In: Proceedings of the Twenty-Eighth International Conference on Machine Learning (ICML’11), June 2011
  35. Russell, B.C., Torralba, A., Murphy, K.P., Freeman, W.T.: Labelme: a database and web-based tool for image annotation. Int. J. Comput. Vision 77, 157–173 (2008)
  36. Serre, T., Wolf, L., Poggio, T.: Object recognition with features inspired by visual cortex. In: IEEE Conference on Computer Vision and Pattern Recognition (2005)
  37. Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval at the end of the early years. IEEE Trans. Pattern Anal. Mach. Intell. 22, 1349–1380 (2000)
  38. Torralba, A.: Contextual priming for object detection. Int. J. Comput. Vis. 53(2), 169–191 (2003)
  39. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.-A.: Extracting and composing robust features with denoising autoencoders. In: Cohen, W.W., McCallum, A., Roweis, S.T. (eds.) ICML’08, pp. 1096–1103. ACM (2008)
  40. Vogel, J., Schiele, B.: Natural scene retrieval based on a semantic modeling step. In: Proceedings of the International Conference on Image and Video Retrieval CIVR 2004, Dublin, Ireland, LNCS, vol. 3115, pp. 7 (2004)
  41. Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: SUN database: large-scale scene recognition from abbey to zoo. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3485–3492. IEEE, June 2010

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Grégoire Mesnil (1, 2)
  • Salah Rifai (1)
  • Antoine Bordes (3)
  • Xavier Glorot (1)
  • Yoshua Bengio (1)
  • Pascal Vincent (1)

  1. LISA, Université de Montréal, Québec, Canada
  2. LITIS, Université de Rouen, Rouen, France
  3. CNRS - Heudiasyc UMR 7253, Université de Technologie de Compiègne, Compiègne, France
