Cascaded one-vs-rest detection network for fine-grained recognition without part annotations


Fine-grained recognition is a challenging task due to small intra-category variances. Most of the top-performing fine-grained recognition methods leverage object parts for better performance and therefore require part annotations, which are extremely expensive to obtain. In this paper, we propose a novel cascaded deep CNN detection framework for fine-grained recognition that is trained to detect the whole object without considering parts. However, most current top-performing detection networks use an (N + 1)-class softmax loss (N object categories plus background). The background category, which has far more training samples, dominates the feature learning process, so the learned features are poorly suited to categorising objects with fewer samples. To address this issue, we introduce two strategies: (1) we leverage a cascaded structure to eliminate the background, and (2) we introduce a novel one-vs-rest loss function to capture more minute variances among different subordinate categories. Experiments show that our proposed recognition framework achieves performance comparable to the state-of-the-art part-free fine-grained recognition methods on the CUB-200-2011 Bird dataset. Moreover, our method outperforms most existing part-annotation-based methods while requiring no part annotations at the training stage and no annotations of any kind at the test stage.
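The abstract does not give the exact form of the one-vs-rest loss or of the cascade, so the following is only a minimal sketch under stated assumptions. It takes the one-vs-rest loss to be a sum of N independent binary logistic losses (one per object category) and the cascade to be a first stage that discards proposals scored as background before classification. The function names, the `objectness` field, and the 0.5 threshold are hypothetical and are not taken from the paper.

```python
import math


def softmax_cross_entropy(logits, target):
    """Softmax loss over all classes, e.g. N objects plus background."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def one_vs_rest_loss(logits, target):
    """Assumed form: sum of N independent binary logistic losses."""
    loss = 0.0
    for k, z in enumerate(logits):
        y = 1.0 if k == target else 0.0
        # Stable binary cross-entropy on a raw logit:
        # max(z, 0) - z * y + log(1 + exp(-|z|))
        loss += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return loss


def cascade_filter(proposals, objectness_threshold=0.5):
    """Hypothetical first stage: drop proposals scored as background."""
    return [p for p in proposals if p["objectness"] >= objectness_threshold]


# Toy proposals with made-up scores over N = 3 object categories.
proposals = [
    {"objectness": 0.1, "logits": [0.2, -0.3, 0.1]},  # likely background
    {"objectness": 0.9, "logits": [2.0, -1.0, 0.5]},  # likely an object
]
kept = cascade_filter(proposals)
for p in kept:
    print(one_vs_rest_loss(p["logits"], target=0))
```

In this sketch the second stage never sees a background class, so each of the N binary terms only has to separate one subordinate category from the others. That is one plausible reading of how the two strategies fit together, not a statement of the authors' implementation.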






2014DFA10410. H. Zhou is supported by UK EPSRC under Grant EP/N011074/1 and Royal Society-Newton Advanced Fellowship under Grant NA160342.

Author information



Corresponding author

Correspondence to Shengke Wang.


About this article


Cite this article

Chen, L., Wang, S., Lam, K. et al. Cascaded one-vs-rest detection network for fine-grained recognition without part annotations. Multimed Tools Appl 78, 4381–4395 (2019).



  • Fine-grained Recognition
  • Detection
  • One-vs-rest
  • Without part annotations