SDE: A Novel Selective, Discriminative and Equalizing Feature Representation for Visual Recognition

Abstract

The Bag of Words (BoW) model and the Convolutional Neural Network (CNN) are two milestones in visual recognition, and both rely on a feature pooling operation in their frameworks. In particular, max-pooling has been validated as an efficient and effective pooling method compared with alternatives such as average pooling and stochastic pooling. In this paper, we first evaluate different pooling methods and then propose a new feature pooling method, termed selective, discriminative and equalizing pooling (SDE). The SDE representation is a feature learning mechanism that jointly optimizes the pooled representations with the goal of learning more selective, discriminative and equalizing features. We solve the joint optimization problem with bilevel optimization. Experiments on seven benchmark datasets (including both single-label and multi-label ones) validate the effectiveness of our framework. In particular, we achieve state-of-the-art fused results (mAP) of 93.21% and 93.97% on the PASCAL VOC2007 and VOC2012 datasets, respectively.
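As background for the pooling methods compared above, here is a minimal toy sketch (not the paper's SDE implementation) of the three standard pooling operators applied to a set of local feature vectors; the values and function names are illustrative only:

```python
import random

def max_pool(acts):
    """Max-pooling: keep the strongest response per feature dimension."""
    return [max(col) for col in zip(*acts)]

def avg_pool(acts):
    """Average pooling: mean response per feature dimension."""
    return [sum(col) / len(col) for col in zip(*acts)]

def stochastic_pool(acts, seed=0):
    """Stochastic pooling: sample one response per dimension, with
    probability proportional to its (assumed non-negative) magnitude."""
    rng = random.Random(seed)
    pooled = []
    for col in zip(*acts):
        total = sum(col)
        pooled.append(rng.choices(col, weights=col, k=1)[0] if total > 0 else 0.0)
    return pooled

# Three local descriptors, each 2-dimensional (toy values).
acts = [[0.2, 0.9], [0.7, 0.1], [0.1, 0.4]]
print(max_pool(acts))  # [0.7, 0.9]
print(avg_pool(acts))
print(stochastic_pool(acts))
```

Max-pooling keeps only the single strongest activation per dimension, which is why it tends to be robust to clutter but discards how many local features responded; the SDE objective in the paper instead learns the pooled representation jointly rather than fixing one of these hand-designed operators.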

Notes

  1. http://host.robots.ox.ac.uk:/leaderboard/main_bootstrap.php.

Acknowledgements

This work was supported by the National Basic Research Program of China (973 Program) Grant 2012CB316302, the Strategic Priority Research Program of the CAS (Grant XDA06040102), the National Natural Science Foundation of China (NSFC) (Grant 61403380), and the Henan International Cooperation Project (Grant 152102410036).

Author information

Corresponding author

Correspondence to Cheng-Lin Liu.

Additional information

Communicated by Josef Sivic.

Cite this article

Xie, GS., Zhang, XY., Yan, S. et al. SDE: A Novel Selective, Discriminative and Equalizing Feature Representation for Visual Recognition. Int J Comput Vis 124, 145–168 (2017). https://doi.org/10.1007/s11263-017-1007-9

Keywords

  • Convolutional Neural Network
  • Feature learning
  • Pooling
  • Bag of Words
  • Bilevel optimization