Multimedia Tools and Applications, Volume 78, Issue 4, pp 4381–4395

Cascaded one-vs-rest detection network for fine-grained recognition without part annotations

  • Long Chen
  • Shengke Wang
  • Kin-Man Lam
  • Huiyu Zhou
  • Muwei Jian
  • Junyu Dong


Fine-grained recognition is a challenging task because the variances between subordinate categories are small. Most top-performing fine-grained recognition methods leverage object parts for better performance and therefore require part annotations, which are extremely expensive to obtain. In this paper, we propose a novel cascaded deep CNN detection framework for fine-grained recognition that is trained to detect the whole object without considering parts. However, most current top-performing detection networks use an (N + 1)-class softmax loss (N object categories plus background). The background category, which has far more training samples, dominates the feature learning process, so the learned features are poorly suited to categorising objects that have fewer samples. To address this issue, we introduce two strategies: (1) we leverage a cascaded structure to eliminate the background; (2) we introduce a novel one-vs-rest loss function to capture more minute variances between different subordinate categories. Experiments show that the proposed recognition framework achieves performance comparable to the state-of-the-art part-free fine-grained recognition methods on the CUB-200-2011 Bird dataset. Moreover, our method outperforms most existing part-annotation-based methods, needs no part annotations at the training stage, and requires no annotations at all at the test stage.
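To make the contrast in the abstract concrete, the following is a minimal sketch rather than the authors' implementation. It places a standard (N + 1)-class softmax cross-entropy next to a generic one-vs-rest loss built from N independent binary logistic terms, and adds a toy first-stage filter that drops background proposals. The function names, the convention that index 0 is background in the softmax case, and the 0.5 objectness threshold are all assumptions made for illustration; the abstract does not give the exact form of the proposed loss.

```python
import math


def softmax_loss(logits, target):
    """Cross-entropy over N + 1 classes.

    Assumption for this sketch: index 0 is the background class.
    """
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def one_vs_rest_loss(logits, target):
    """Sum of N independent binary logistic losses, one per object category.

    Generic one-vs-rest formulation; the paper's exact loss may differ.
    """
    total = 0.0
    for k, z in enumerate(logits):
        y = 1.0 if k == target else 0.0
        # Stable binary cross-entropy on a raw logit z:
        # max(z, 0) - z * y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total


def drop_background(proposals, objectness, threshold=0.5):
    """Hypothetical first cascade stage: keep proposals scored as objects."""
    return [p for p, s in zip(proposals, objectness) if s >= threshold]


if __name__ == "__main__":
    kept = drop_background(["box_a", "box_b", "box_c"], [0.9, 0.1, 0.7])
    print(kept)
    print(softmax_loss([0.0, 0.0, 0.0], 1))
    print(one_vs_rest_loss([0.0, 0.0], 0))
```

In this sketch the second stage sees only the proposals that survive `drop_background`, so the one-vs-rest terms are computed over object categories alone and no background class competes for probability mass.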


Fine-grained, Recognition, Detection, One-vs-rest, Without part annotations



2014DFA10410. H. Zhou is supported by UK EPSRC under Grant EP/N011074/1 and Royal Society-Newton Advanced Fellowship under Grant NA160342.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Ocean University of China, Qingdao, China
  2. Hong Kong Polytechnic University, Hung Hom, Hong Kong
  3. University of Leicester, Leicester, UK
