Abstract
We investigate visual-semantic representations by combining visual features and semantic attributes into a compact subspace that retains the most relevant properties of each domain. This subspace better represents image features for recognition tasks and allows results to be interpreted in the light of the nature of the semantic attributes, offering a path toward explainable learning. Experiments were performed on four benchmark datasets and compared against state-of-the-art algorithms. The method remains robust under up to 20% degradation of the semantic attributes, opening possibilities for future work on the automatic gathering of semantic data to improve representations for image classification. Additionally, empirical evidence suggests that the high-level concepts add linearity to the feature space, allowing methods such as PCA and SVM to perform well on the combined visual and semantic features. The representations also enable zero-shot learning, demonstrating the viability of merging semantic and visual data at both training and test time to learn aspects that transcend class boundaries and allow the classification of unseen data.
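The general idea the abstract describes can be sketched as a simple early-fusion pipeline: concatenate visual features with semantic attributes, project the fused vectors onto a compact subspace with PCA, and classify with a linear SVM. This is an illustrative sketch on synthetic data, not the paper's actual encoder; the dimensionalities, the random class-dependent features, and the variable names are all assumptions made for the example.

```python
# Illustrative sketch (synthetic data, hypothetical dimensions):
# fuse visual features with semantic attributes, project to a compact
# subspace with PCA, and classify with a linear SVM.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_samples, n_classes = 300, 3
labels = rng.integers(0, n_classes, n_samples)

# Stand-ins for real data: 128-D "visual" features whose mean shifts
# with the class, and 16-D binary "semantic" attributes whose activation
# probability depends on the class.
visual = rng.normal(size=(n_samples, 128)) + labels[:, None]
attributes = (rng.random((n_samples, 16)) < (labels[:, None] + 1) / 4).astype(float)

# Early fusion: concatenate both modalities into one feature vector.
fused = np.hstack([visual, attributes])

X_tr, X_te, y_tr, y_te = train_test_split(fused, labels, random_state=0)

# Compact subspace: keep only the leading principal components.
pca = PCA(n_components=10).fit(X_tr)
clf = LinearSVC(max_iter=5000).fit(pca.transform(X_tr), y_tr)
acc = clf.score(pca.transform(X_te), y_te)
print(f"accuracy on fused features: {acc:.2f}")
```

On data where class structure is (near-)linear in the fused space, as the abstract suggests, such a simple linear pipeline can already perform well; the synthetic data above is constructed to have exactly that property.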
Acknowledgment
This work was supported by FAPESP grants #2018/22482-0 and #2019/07316-0; CNPq fellowship #304266/2020-5.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Code availability
Source code can be found at a public Git repository https://github.com/MAP-VICG/VisualSemanticEncoder.
Cite this article
de Resende, D.C.O., Ponti, M.A. Robust image features for classification and zero-shot tasks by merging visual and semantic attributes. Neural Comput & Applic 34, 4459–4471 (2022). https://doi.org/10.1007/s00521-021-06601-7