Robust image features for classification and zero-shot tasks by merging visual and semantic attributes

  • Original Article
  • Published in Neural Computing and Applications

Abstract

We investigate visual-semantic representations that combine visual features and semantic attributes into a compact subspace containing the most relevant properties of each domain. This subspace better represents image features for recognition tasks and also allows results to be interpreted in light of the nature of the semantic attributes, offering a path toward explainable learning. Experiments were performed on four benchmark datasets and compared against state-of-the-art algorithms. The method remains robust under up to 20% degradation of the semantic attributes, which opens possibilities for future work on automatically gathering semantic data to improve representations for image classification. Additionally, empirical evidence suggests that the high-level concepts add linearity to the feature space, allowing, for example, PCA and SVM to perform well on the combined visual and semantic features. Finally, the representations support zero-shot learning, demonstrating the viability of merging semantic and visual data at both training and test time to learn aspects that transcend class boundaries and allow the classification of unseen classes.
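
To make the setting above concrete, the sketch below is a minimal illustration, not the authors' released pipeline: the data, dimensions, and variable names are assumptions chosen only for demonstration. It shows the kind of experiment the abstract refers to, in which visual features and semantic attribute vectors are concatenated, projected onto a compact subspace with PCA, and classified with a linear SVM.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)

    # Placeholder arrays standing in for deep visual features and per-image
    # semantic attribute vectors (all sizes here are arbitrary assumptions).
    n_samples, n_visual, n_attr, n_classes = 200, 512, 85, 10
    X_visual = rng.normal(size=(n_samples, n_visual))
    X_semantic = rng.normal(size=(n_samples, n_attr))
    y = rng.integers(0, n_classes, size=n_samples)

    # Merge the two domains into a single feature vector per image.
    X_combined = np.hstack([X_visual, X_semantic])

    # Compact subspace via PCA, then a linear SVM on the projected features.
    model = make_pipeline(StandardScaler(), PCA(n_components=64), LinearSVC())
    model.fit(X_combined, y)
    print("training accuracy:", model.score(X_combined, y))

In the zero-shot setting mentioned in the abstract, the merged representation would instead be compared against attribute signatures of classes unseen during training, rather than fed to a classifier trained on those classes.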

Acknowledgment

This work was supported by FAPESP grants #2018/22482-0 and #2019/07316-0, and by CNPq fellowship #304266/2020-5.

Author information

Corresponding author

Correspondence to Damares Crystina Oliveira de Resende.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Code availability

Source code is available in a public Git repository: https://github.com/MAP-VICG/VisualSemanticEncoder.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

de Resende, D.C.O., Ponti, M.A. Robust image features for classification and zero-shot tasks by merging visual and semantic attributes. Neural Comput & Applic 34, 4459–4471 (2022). https://doi.org/10.1007/s00521-021-06601-7

