Mining Mid-level Visual Patterns with Deep CNN Activations

International Journal of Computer Vision

Abstract

The purpose of mid-level visual element discovery is to find clusters of image patches that are representative of, and discriminate between, the contents of the relevant images. Here we propose a pattern-mining approach to the problem of identifying mid-level elements within images, motivated by the observation that such techniques have been very effective, and efficient, in achieving similar goals when applied to other data types. We show that Convolutional Neural Network (CNN) activations extracted from image patches typically possess two appealing properties that enable seamless integration with pattern mining techniques. The marriage between CNN activations and a pattern mining technique leads to fast and effective discovery of representative and discriminative patterns from a huge number of image patches, from which mid-level elements are retrieved. Given the patterns and retrieved mid-level visual elements, we propose two methods to generate image feature representations. The first encoding method uses the patterns as codewords in a dictionary, in a manner similar to the Bag-of-Visual-Words model; we thus label this a Bag-of-Patterns representation. The second relies on mid-level visual elements to construct a Bag-of-Elements representation. We evaluate the two encoding methods on object and scene classification tasks, and demonstrate that our approach outperforms or matches the state of the art on these tasks.
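To make the pipeline described in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each patch is converted into a "transaction" by keeping the indices of its k largest activations (the abstract does not specify the conversion), that a brute-force counter stands in for a real frequent-itemset miner such as Apriori, and that the thresholds `min_support` and `min_confidence` and all function names are hypothetical. The toy activation vectors are invented purely for illustration.

```python
from collections import Counter
from itertools import combinations


def to_transaction(activation, k=2):
    """Assumed conversion: keep the indices of the k largest activations as items."""
    order = sorted(range(len(activation)), key=lambda i: activation[i], reverse=True)
    return frozenset(order[:k])


def mine_patterns(transactions, labels, target, size=2,
                  min_support=2, min_confidence=0.8):
    """Brute-force stand-in for a pattern miner.

    A pattern is kept if it occurs often in the target class (representative)
    and mostly in the target class (discriminative).
    """
    in_target, overall = Counter(), Counter()
    for items, label in zip(transactions, labels):
        for combo in combinations(sorted(items), size):
            overall[combo] += 1
            if label == target:
                in_target[combo] += 1
    return sorted(
        combo for combo, count in in_target.items()
        if count >= min_support and count / overall[combo] >= min_confidence
    )


def bag_of_patterns(patch_transactions, patterns):
    """Histogram over patterns: count the patches that contain each pattern."""
    return [sum(1 for t in patch_transactions if set(p) <= t) for p in patterns]


# Toy 5-dimensional "activations" for six patches (illustrative values only).
activations = [
    [0.9, 0.8, 0.0, 0.1, 0.0],
    [0.7, 0.9, 0.2, 0.0, 0.0],
    [0.8, 0.6, 0.0, 0.0, 0.3],
    [0.0, 0.1, 0.9, 0.8, 0.0],
    [0.9, 0.0, 0.1, 0.8, 0.0],
    [0.0, 0.0, 0.7, 0.9, 0.2],
]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

transactions = [to_transaction(a) for a in activations]
cat_patterns = mine_patterns(transactions, labels, "cat")
dog_patterns = mine_patterns(transactions, labels, "dog")

# Encode a new image, given as three patch transactions, against all patterns.
image_patches = [frozenset({0, 1}), frozenset({0, 1}), frozenset({2, 3})]
histogram = bag_of_patterns(image_patches, cat_patterns + dog_patterns)
print(cat_patterns, dog_patterns, histogram)
```

In this sketch the mined patterns play the role of dictionary codewords, so the resulting histogram is analogous to a Bag-of-Visual-Words vector. A practical system would replace the exhaustive counting with a dedicated mining algorithm, since the number of candidate item sets grows combinatorially with pattern size.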

Notes

  1. Answer key: 1. aeroplane, 2. train, 3. cow, 4. motorbike, 5. bike, 6. sofa.

  2. http://www.borgelt.net/apriori.html.

Acknowledgments

This work was supported in part by an ARC Future Fellowship (FT120100969). Y. Li and L. Liu contributed equally to this work.

Author information

Correspondence to Chunhua Shen.

Additional information

Communicated by Josef Sivic.

Cite this article

Li, Y., Liu, L., Shen, C. et al. Mining Mid-level Visual Patterns with Deep CNN Activations. Int J Comput Vis 121, 344–364 (2017). https://doi.org/10.1007/s11263-016-0945-y

  • DOI: https://doi.org/10.1007/s11263-016-0945-y
