Coarse2Fine: a two-stage training method for fine-grained visual classification

  • Original Paper
Machine Vision and Applications

Abstract

Small inter-class and large intra-class variations are the key challenges in fine-grained visual classification. Objects from different classes share visually similar structures, and objects in the same class can appear in different poses and from different viewpoints. Extracting discriminative local features (e.g., a bird’s beak or a car’s headlight) is therefore crucial. Most recent successes on this problem are based on attention models that can localize and attend to the discriminative local parts of an object. In this work, we propose a training method for visual attention networks, Coarse2Fine, which creates a differentiable path from the attended feature maps to the input space. Coarse2Fine learns an inverse mapping function from the attended feature maps to the informative regions in the raw image, which guides the attention maps to better attend to the fine-grained features. In addition, we propose an initialization method for the attention weights. Our experiments show that Coarse2Fine reduces the classification error by up to 5.1% on common fine-grained datasets.
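
To make the term "attended feature maps" concrete, the following is a minimal sketch of generic attention-weighted pooling, in which each attention map selects a spatial region of a convolutional feature map. It is an illustration under stated assumptions, not the Coarse2Fine method itself: the function name `attention_pool`, the array shapes, and the sum-to-one normalization are all hypothetical choices that the abstract does not specify.

```python
import numpy as np

def attention_pool(features, attention):
    """Pool a feature map under each attention map (illustrative only).

    features:  (C, H, W) array of convolutional features.
    attention: (M, H, W) array of non-negative attention maps, one per part.
    Returns an (M, C) array: one attended feature vector per attention map.
    """
    C, H, W = features.shape
    M = attention.shape[0]
    # Normalize each attention map so its weights sum to ~1 over the grid
    # (an assumed convention; the small constant avoids division by zero).
    weights = attention.reshape(M, H * W)
    weights = weights / (weights.sum(axis=1, keepdims=True) + 1e-8)
    # Weighted spatial average of every channel under every attention map.
    return weights @ features.reshape(C, H * W).T

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4, 4))   # toy feature map, 8 channels
attention = np.zeros((2, 4, 4))
attention[0, 0, 0] = 1.0                # part 0 attends only to the top-left cell
attention[1] = 1.0                      # part 1 attends uniformly everywhere
pooled = attention_pool(features, attention)
```

In this toy setup, the first pooled vector equals the features at the top-left cell and the second equals the global spatial average, which shows how a sharper attention map isolates a local part while a diffuse one falls back to a coarse, global descriptor.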

Acknowledgements

This work was done during Amir Erfan Eshratifar's internship at Clarifai.

Author information

Corresponding author

Correspondence to Amir Erfan Eshratifar.

About this article

Cite this article

Eshratifar, A.E., Eigen, D., Gormish, M. et al. Coarse2Fine: a two-stage training method for fine-grained visual classification. Machine Vision and Applications 32, 49 (2021). https://doi.org/10.1007/s00138-021-01180-y

  • DOI: https://doi.org/10.1007/s00138-021-01180-y
