Pairwise Confusion for Fine-Grained Visual Classification

  • Abhimanyu Dubey (email author)
  • Otkrist Gupta
  • Pei Guo
  • Ramesh Raskar
  • Ryan Farrell
  • Nikhil Naik
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11216)

Abstract

Fine-Grained Visual Classification (FGVC) datasets contain small sample sizes, along with significant intra-class variation and inter-class similarity. While prior work has addressed intra-class variation using localization and segmentation techniques, inter-class similarity may also affect feature learning and reduce classification performance. In this work, we address this problem with a novel optimization procedure for end-to-end neural network training on FGVC tasks. Our procedure, called Pairwise Confusion (PC), reduces overfitting by intentionally introducing confusion in the activations. With PC regularization, we obtain state-of-the-art performance on six of the most widely-used FGVC datasets and demonstrate improved localization ability. PC is easy to implement, does not need excessive hyperparameter tuning during training, and does not add significant overhead at test time.
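The abstract does not spell out the regularizer, but the idea of "introducing confusion in the activations" can be sketched as a pairwise loss: standard cross-entropy on each image of a training pair, plus a penalty that pulls the two predicted class distributions toward each other. The sketch below is a minimal NumPy illustration under that assumption; the function names, the squared-Euclidean confusion term, and the weight `lam` are illustrative, not necessarily the authors' exact formulation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pairwise_confusion_loss(logits_a, logits_b, label_a, label_b, lam=1.0):
    """Cross-entropy on each image of the pair, plus a confusion term
    (squared Euclidean distance between the two softmax outputs) that
    discourages over-confident, easily separable predictions."""
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    ce = -np.log(p_a[label_a]) - np.log(p_b[label_b])
    confusion = np.sum((p_a - p_b) ** 2)
    return ce + lam * confusion
```

In this sketch, pairs whose predicted distributions already coincide incur no extra penalty, while confidently separated pairs are pushed back toward each other, which is one way to realize the "intentional confusion" the abstract describes.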

Notes

Acknowledgements

We would like to thank Dr. Ashok Gupta for his guidance on bird recognition, and Dr. Sumeet Agarwal, Spandan Madan and Ishaan Grover for their feedback at various stages of this work. RF and PG were supported in part by the National Science Foundation under Grant No. IIS1651832, and AD, OG, RR and NN acknowledge the generous support of the MIT Media Lab Consortium.

Supplementary material

474200_1_En_5_MOESM1_ESM.pdf — Supplementary material 1 (PDF, 253 KB)


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Abhimanyu Dubey (1) (email author)
  • Otkrist Gupta (1)
  • Pei Guo (2)
  • Ramesh Raskar (1)
  • Ryan Farrell (2)
  • Nikhil Naik (1, 3)
  1. Massachusetts Institute of Technology, Cambridge, USA
  2. Brigham Young University, Provo, USA
  3. Harvard University, Cambridge, USA
