A survey on deep learning-based fine-grained object classification and semantic segmentation

  • Bo Zhao
  • Jiashi Feng
  • Xiao Wu
  • Shuicheng Yan
Review

Abstract

Deep learning has shown impressive performance in various vision tasks such as image classification, object detection and semantic segmentation. In particular, recent advances in deep learning techniques have brought encouraging performance to fine-grained image classification, which aims to distinguish subordinate-level categories such as bird species or dog breeds. This task is extremely challenging due to high intra-class and low inter-class variance. In this paper, we review four types of deep learning-based fine-grained image classification approaches: general convolutional neural networks (CNNs), part detection-based approaches, ensemble-of-networks-based approaches and visual attention-based approaches. Deep learning-based semantic segmentation approaches are also covered, with region proposal-based and fully convolutional network-based approaches introduced in turn.

Keywords

Deep learning, fine-grained image classification, semantic segmentation, convolutional neural network (CNN), recurrent neural network (RNN)

Copyright information

© Institute of Automation, Chinese Academy of Sciences and Springer-Verlag Berlin Heidelberg 2017

Authors and Affiliations

  1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China
  2. Department of Electrical and Computer Engineering, National University of Singapore, Singapore, Singapore
