
Semantic Contrastive Bootstrapping for Single-Positive Multi-label Recognition

International Journal of Computer Vision

Abstract

Learning multi-label image recognition with incomplete annotation is gaining popularity due to its superior performance and significant labor savings when compared to training with fully labeled datasets. Existing literature mainly focuses on label completion and co-occurrence learning, while struggling with the most common single-positive setting, in which only one positive label is annotated per image. To tackle this problem, we present a semantic contrastive bootstrapping (Scob) approach that gradually recovers cross-object relationships by introducing class activations as semantic guidance. With this guidance, we then propose a recurrent semantic masked transformer to extract iconic object-level representations and delve into contrastive learning on multi-label classification tasks. We further propose a bootstrapping framework in an Expectation-Maximization fashion that iteratively optimizes the network parameters and refines the semantic guidance to alleviate possible disturbance caused by erroneous guidance. Extensive experimental results demonstrate that the proposed joint learning framework surpasses state-of-the-art models by a large margin on four public multi-label image recognition benchmarks. Code can be found at https://github.com/iCVTEAM/Scob.


Data Availability Statement

The datasets generated and/or analyzed during the current study are available in the original references, i.e., PASCAL VOC 2007/2012 (Everingham et al., 2012), http://host.robots.ox.ac.uk/pascal/VOC/; Microsoft COCO 2014 (Lin et al., 2014), https://cocodataset.org/; and CUB-200-2011 (Wah et al., 2011), https://www.vision.caltech.edu/datasets/cub_200_2011/. The source code and models corresponding to this study can be found at https://github.com/iCVTEAM/Scob.

References

  • Balcan, M. F. F., & Sharma, D. (2021). Data driven semi-supervised learning. In Advances in neural information processing systems (NeurIPS).

  • Bar, A., Wang, X., Kantorov, V., Reed, C. J., Herzig, R., Chechik, G., Rohrbach, A., Darrell, T., & Globerson, A. (2022). Detreg: Unsupervised pretraining with region priors for object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 14605–14615).

  • Bucak, S. S., Jin, R., & Jain, A. K. (2011). Multi-label learning with incomplete class assignments. In 2011 IEEE conference on computer vision and pattern recognition (CVPR). https://doi.org/10.1109/CVPR.2011.5995734.

  • Cabral, R., Torre, F., Costeira, J. P., & Bernardino, A. (2011). Matrix completion for multi-label image classification. In Advances in neural information processing systems (NeurIPS).

  • Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In European conference on computer vision (ECCV). https://doi.org/10.1007/978-3-030-58452-8_13.

  • Chattopadhay, A., Sarkar, A., Howlader, P., & Balasubramanian, V. N. (2018). Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE winter conference on applications of computer vision (WACV). https://doi.org/10.1109/WACV.2018.00097.

  • Chen, T., Xu, M., Hui, X., Wu, H., & Lin, L. (2019a). Learning semantic-specific graph representation for multi-label image recognition. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV).

  • Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In Proceedings of the 37th international conference on machine learning (ICML).

  • Chen, T., Pu, T., Wu, H., Xie, Y., & Lin, L. (2021). Structured semantic transfer for multi-label recognition with partial labels.

  • Chen, T., Lin, L., Chen, R., Hui, X., & Wu, H. (2022). Knowledge-guided multi-label few-shot learning for general image recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.


  • Chen, Z. M., Wei, X. S., Wang, P., & Guo, Y. (2019b). Multi-label image recognition with graph convolutional networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR).

  • Chu, X., Tian, Z., Wang, Y., Zhang, B., Ren, H., Wei, X., Xia, H., & Shen, C. (2021). Twins: Revisiting the design of spatial attention in vision transformers. In Advances in neural information processing systems (NeurIPS).

  • Cole, E., Mac Aodha, O., Lorieul, T., Perona, P., Morris, D., & Jojic, N. (2021). Multi-label learning from single positive labels. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR).

  • Dai, Z., Cai, B., Lin, Y., & Chen, J. (2021). Up-detr: Unsupervised pre-training for object detection with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 1601–1610).

  • Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR).

  • Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American Chapter of the Association for computational linguistics: Human Language Technologies.

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In International conference on learning representations.

  • Durand, T., Mehrasa, N., & Mori, G. (2019). Learning a deep convnet for multi-label classification with partial labels. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR).

  • Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2007). The PASCAL visual object classes challenge 2007 (VOC2007) results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html

  • Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2012). The PASCAL visual object classes challenge 2012 (VOC2012) results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html

  • Gao, B. B., & Zhou, H. Y. (2021). Learning to discover multi-class attentional regions for multi-label image recognition. IEEE Transactions on Image Processing, 30, 5920–5932.


  • Ge, W., Yang, S., & Yu, Y. (2018). Multi-evidence filtering and fusion for multi-label classification, object detection and semantic segmentation based on weakly supervised learning. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Gong, X., Yuan, D., & Bao, W. (2021). Understanding partial multi-label learning via mutual information. In Advances in neural information processing systems (NeurIPS).

  • Grill, J. B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., Piot, B., Kavukcuoglu, K., Munos, R., & Valko, M. (2020). Bootstrap your own latent: A new approach to self-supervised learning. In Advances in neural information processing systems (NeurIPS).

  • Guo, H., & Wang, S. (2021). Long-tailed multi-label visual recognition by collaborative training on uniform and re-balanced samplings. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR).

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In IEEE/CVF conference on computer vision and pattern recognition (CVPR).

  • He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377

  • Hu, H., Xie, L., Du, Z., Hong, R., & Tian, Q. (2020). One-bit supervision for image classification. In Advances in neural information processing systems (NeurIPS).

  • Huynh, D., & Elhamifar, E. (2020). Interactive multi-label CNN learning with partial labels. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR).

  • Jia, J., Chen, X., & Huang, K. (2021). Spatial and semantic consistency regularizations for pedestrian attribute recognition. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV).

  • Jiang, L., Zhou, Z., Leung, T., Li, L.J., & Fei-Fei, L. (2018). MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In Proceedings of the 35th international conference on machine learning (ICML).

  • Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., & Krishnan, D. (2020). Supervised contrastive learning. In Advances in neural information processing systems (NeurIPS).

  • Li, Z., Chen, Z., Yang, F., Li, W., Zhu, Y., Zhao, C., Deng, R., Wu, L., Zhao, R., Tang, M., & Wang, J. (2021). Mst: Masked self-supervised transformer for visual representation. In Advances in neural information processing systems (NeurIPS).

  • Li, Z., Zhu, Y., Yang, F., Li, W., Zhao, C., Chen, Y., Chen, Z., Xie, J., Wu, L., Zhao, R., Tang, M., & Wang, J. (2022). Univip: A unified framework for self-supervised visual pre-training.

  • Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In European conference on computer vision (ECCV).

  • Liu, H., Wang, R., Shan, S., & Chen, X. (2019). Deep supervised hashing for fast image retrieval. International Journal of Computer Vision, 127(9), 1217–1234. https://doi.org/10.1007/s11263-019-01174-4


  • Liu, Y., Sheng, L., Shao, J., Yan, J., Xiang, S., & Pan, C. (2018). Multi-label image classification via knowledge distillation from weakly-supervised detection. In Proceedings of the 26th ACM international conference on multimedia.

  • Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV).

  • Oord, A., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748

  • Pu, T., Chen, T., Wu, H., & Lin, L. (2022). Semantic-aware representation blending for multi-label image recognition with partial labels. In Proceedings of the AAAI conference on artificial intelligence.

  • Rao, Y., Zhao, W., Zhu, Z., Lu, J., & Zhou, J. (2021). Global filter networks for image classification. In Advances in neural information processing systems (NeurIPS).

  • Sener, O., & Koltun, V. (2018). Multi-task learning as multi-objective optimization. In Advances in neural information processing systems (NeurIPS).

  • Shao, Z., Bian, H., Chen, Y., Wang, Y., Zhang, J., Ji, X., & Zhang, Y. (2021). Transmil: Transformer based correlated multiple instance learning for whole slide image classification. In Advances in neural information processing systems (NeurIPS).

  • Shin, A., Ishii, M., & Narihira, T. (2022). Perspectives and prospects on transformer architecture for cross-modal tasks with language and vision. International Journal of Computer Vision, 130(2), 435–454. https://doi.org/10.1007/s11263-021-01547-8


  • Song, L., Liu, J., Sun, M., & Shang, X. (2021). Weakly supervised group mask network for object detection. International Journal of Computer Vision, 129(3), 681–702. https://doi.org/10.1007/s11263-020-01397-w


  • Tsipras, D., Santurkar, S., Engstrom, L., Ilyas, A., & Madry, A. (2020). From imagenet to image classification: Contextualizing progress on benchmarks. In Proceedings of the 37th international conference on machine learning (ICML).

  • Tsoumakas, G., & Katakis, I. (2009). Multi-label classification: An overview. International Journal of Data Warehousing and Mining. https://doi.org/10.4018/jdwm.2007070101


  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (NeurIPS).

  • Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The caltech-ucsd birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.

  • Wang, H., Xiao, R., Li, Y., Feng, L., Niu, G., Chen, G., & Zhao, J. (2022a). Pico: Contrastive label disambiguation for partial label learning.

  • Wang, J., Yang, Y., Mao, J., Huang, Z., Huang, C., & Xu, W. (2016). Cnn-rnn: A unified framework for multi-label image classification. In 2016 IEEE conference on computer vision and pattern recognition (CVPR). https://doi.org/10.1109/CVPR.2016.251.

  • Wang, X., Yu, Z., De Mello, S., Kautz, J., Anandkumar, A., Shen, C., & Alvarez, J.M. (2022b). Freesolo: Learning to segment objects without annotations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 14176–14186).

  • Wang, Y., Huang, R., Song, S., Huang, Z., & Huang, G. (2021). Not all images are worth 16 x 16 words: Dynamic transformers for efficient image recognition. In Advances in neural information processing systems (NeurIPS).

  • Wolfe, J. M., Horowitz, T. S., & Kenner, N. M. (2005). Rare items often missed in visual searches. Nature. https://doi.org/10.1038/435439a


  • Wu, B., Jia, F., Liu, W., Ghanem, B., & Lyu, S. (2018). Multi-label learning with missing labels using mixed dependency graphs. International Journal of Computer Vision.


  • Wu, B., Chen, W., Fan, Y., Zhang, Y., Hou, J., Liu, J., & Zhang, T. (2019). Tencent ML-images: A large-scale multi-label image database for visual representation learning. IEEE Access. https://doi.org/10.1109/access.2019.2956775


  • Xie, E., Ding, J., Wang, W., Zhan, X., Xu, H., Sun, P., Li, Z., & Luo, P. (2021a). Detco: Unsupervised contrastive learning for object detection. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 8392–8401).

  • Xie, J., Zhan, X., Liu, Z., Ong, Y. S., & Loy, C. C. (2021b). Unsupervised object-level representation learning from scene images. Advances in Neural Information Processing Systems 34, 28864–28876.

  • Xu, M., Jin, R., & Zhou, Z.H. (2013). Speedup matrix completion with side information: Application to multi-label learning. In Advances in neural information processing systems (NeurIPS).

  • Yang, H., Zhou, J. T., & Cai, J. (2016). Improving multi-label learning with missing labels by structured semantic correlations. In European conference on computer vision (ECCV).

  • Yuan, Y., Fu, R., Huang, L., Lin, W., Zhang, C., Chen, X., & Wang, J. (2021). Hrformer: High-resolution vision transformer for dense prediction. In Advances in neural information processing systems (NeurIPS).

  • Yun, S., Oh, S. J., Heo, B., Han, D., Choe, J., & Chun, S. (2021). Re-labeling imagenet: From single to multi-labels, from global to localized labels. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR).

  • Zhang, B., Wang, Y., Hou, W., Wu, H., Wang, J., Okumura, M., & Shinozaki, T. (2021a). Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling. In Advances in neural information processing systems (NeurIPS).

  • Zhang, D., Han, J., Zhao, L., & Meng, D. (2019). Leveraging prior-knowledge for weakly supervised object detection under a collaborative self-paced curriculum learning framework. International Journal of Computer Vision, 127(4), 363–380. https://doi.org/10.1007/s11263-018-1112-4


  • Zhang, W., Pang, J., Chen, K., & Loy, C.C. (2021b). K-net: Towards unified image segmentation. In Advances in neural information processing systems (NeurIPS).

  • Zhao, J., Yan, K., Zhao, Y., Guo, X., Huang, F., & Li, J. (2021). Transformer-based dual relation graph for multi-label image recognition. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV).


Acknowledgements

This work is partially supported by grants from the National Natural Science Foundation of China under Contracts No. 62132002 and No. 62202010, and by the China Postdoctoral Science Foundation under Grant No. 2022M710212.

Author information

Correspondence to Jia Li.

Additional information

Communicated by Jianfei Cai.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Fig. 10: Visualization of semantic masks on MS-COCO. In each group, the left and right images denote the input image and the masked class activation maps in our proposed Scob approach.

Table 9: The AP results of 10 small-scale classes in the MS-COCO dataset.

Appendix A: Visualizations on Class-specific Semantic Masks

Beyond the ablations on semantic masks in the main manuscript, we exhibit more semantic masks generated on the MS-COCO dataset in Fig. 10. In each group, the left and right images denote the input image and the masked class activation maps of our proposed Scob approach. As shown in the first two columns, our approach focuses on the semantic objects, e.g., airplane and bus, while filtering out background information. In the second to fourth groups of Fig. 10, two major challenges arise in distinguishing these objects: (1) the semantic objects are relatively small compared to the image size; (2) the objects show high dependencies on other objects, i.e., co-occurrences. As can be seen from the generated semantic guidance, our proposed Scob is able to distinguish a specific object from its related objects and then form representative features.

Fig. 11: The network architecture of the proposed Recurrent Semantic Masked Transformer. We leverage masked multi-head attention mechanisms with iterative semantic guidance generated by CAMs.

Appendix B: Performance on Recognizing Small Objects

Fig. 10 shows CAM results for objects of both large and small scales. To further show that our approach is robust to small objects, we list the AP results of 10 small-scale classes in the MS-COCO dataset in Table 9.

Appendix C: Details on Recurrent Semantic Masked Transformer

The proposed recurrent Semantic Masked Transformer (SMT) mainly consists of positional encoding, a semantic mask, and multiple multi-head attention units, as in previous work (Vaswani et al., 2017). Unlike those implementations, we use the feature maps of ResNet backbones as image feature patches (illustrated as image patches for better viewing). Inspired by the masked coding manner (Devlin et al., 2019) in natural language processing, we adopt the class activations generated in the previous optimization round as the guidance for semantic masked attention. We present the detailed network architecture in Fig. 11.

Semantic masked encoding Denote \(\textbf{F}=\Phi (\textbf{x};\theta _b) \in \mathbb R^{HW \times K}\) as the backbone feature extracted from the ResNet stages with parameters \(\theta _b \in \Theta \). \(\textbf{F}\) is split into \(H \times W\) patches \(\left\{ \textbf{F}_{i,j} \mid i \in \{1,2,\dots ,H\}, j \in \{1,2,\dots ,W\} \right\} \), where each patch has channel size K, as the input of the SMT. As mentioned in Section 3.2, the recurrent SMT applies a learnable positional encoding \(\Delta (\cdot ):\mathbb N^{W \times H} \mapsto \mathbb R^1\) and a semantic mask to \(\{ \textbf{F}_{i,j} \}\):

$$\begin{aligned} \mathcal M(\textbf{F}_{i,j})=(\textbf{F}_{i,j}+\Delta (i,j))\cdot (1-\textbf{M}^c), \end{aligned}$$
(14)

where the symbols are as defined above. The implementation of \(\Delta \) follows previous Transformer architectures, i.e., DETR (Carion et al., 2020). Let \(\Delta _h:\mathbb N^W \mapsto \mathbb R^1\) and \(\Delta _v:\mathbb N^H \mapsto \mathbb R^1\) be the horizontal and vertical encodings. Then \(\Delta \) is the concatenation of the two:

$$\begin{aligned} \Delta (i,j) = \Delta _h(i) \oplus \Delta _v(j), \end{aligned}$$
(15)

where \(\oplus \) denotes the feature concatenation operation. Following Devlin et al. (2019), the positional encoding and masks are applied to the query and key of the multi-head attention, providing global position information and constraining the extracted features \(\textbf{H}_{i,j}^c\) to respond strongly only to the specific semantic class associated with the masks.
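To make Eqs. (14) and (15) concrete, below is a minimal PyTorch sketch of the semantic masked encoding. The module name SemanticMaskedEncoding, the even split of the K channels between \(\Delta _h\) and \(\Delta _v\), and the tensor layout are our own illustrative assumptions, not the released Scob implementation.

```python
import torch
import torch.nn as nn

class SemanticMaskedEncoding(nn.Module):
    """Sketch of Eqs. (14)-(15): learnable 2-D positional encoding plus
    class-specific semantic masking of backbone feature patches."""

    def __init__(self, h: int, w: int, k: int):
        super().__init__()
        assert k % 2 == 0, "assume K splits evenly between Delta_h and Delta_v"
        self.row_embed = nn.Embedding(h, k // 2)  # Delta_h: row encoding
        self.col_embed = nn.Embedding(w, k // 2)  # Delta_v: column encoding

    def forward(self, feats: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # feats: (B, H, W, K) backbone patches F_{i,j}
        # mask:  (B, H, W) class activation mask M^c with values in [0, 1]
        b, h, w, k = feats.shape
        rows = self.row_embed(torch.arange(h, device=feats.device))  # (H, K/2)
        cols = self.col_embed(torch.arange(w, device=feats.device))  # (W, K/2)
        # Eq. (15): Delta(i, j) = Delta_h(i) ⊕ Delta_v(j), broadcast over the grid.
        pos = torch.cat([rows[:, None, :].expand(h, w, k // 2),
                         cols[None, :, :].expand(h, w, k // 2)], dim=-1)
        # Eq. (14): M(F_{i,j}) = (F_{i,j} + Delta(i, j)) * (1 - M^c)
        return (feats + pos[None]) * (1.0 - mask[..., None])
```

The masked patches can then be flattened to shape (B, HW, K) and fed to the multi-head attention units, with the mask influencing the query and key as described above.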

Multi-scale SMTs Since different network stages are sensitive to semantic objects of different scales, incorporating multi-scale and multi-level features benefits the final feature representations; handling scale variation is also one of the challenging problems in multi-label classification tasks. In our implementation, we attach two SMTs to Stage 3 and Stage 4 of ResNet-50 and combine their outputs to extract image features at multiple scales. In this manner, objects of small scale are more easily represented in earlier network stages without a significant loss of resolution, as sketched below.
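The sketch below shows one plausible wiring of this multi-scale design, assuming torchvision's ResNet-50 where Stage 3 and Stage 4 correspond to layer3 (1024 channels) and layer4 (2048 channels). The placeholder encoder layers (standing in for the recurrent SMT), the projection width, and the token concatenation are assumptions, as the exact fusion operator is not specified here.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MultiScaleSMT(nn.Module):
    """Sketch of the two-branch multi-scale design in Appendix C."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        r = resnet50(weights=None)
        # Backbone up to Stage 3 (layer3), then Stage 4 (layer4).
        self.stage3 = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool,
                                    r.layer1, r.layer2, r.layer3)
        self.stage4 = r.layer4
        self.proj3 = nn.Linear(1024, dim)   # project Stage-3 patches
        self.proj4 = nn.Linear(2048, dim)   # project Stage-4 patches
        # Plain encoder layers as stand-ins for the two SMTs.
        self.smt3 = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.smt4 = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f3 = self.stage3(x)                              # (B, 1024, H/16, W/16)
        f4 = self.stage4(f3)                             # (B, 2048, H/32, W/32)
        t3 = self.proj3(f3.flatten(2).transpose(1, 2))   # (B, N3, dim) tokens
        t4 = self.proj4(f4.flatten(2).transpose(1, 2))   # (B, N4, dim) tokens
        # Concatenate object-level tokens from both scales.
        return torch.cat([self.smt3(t3), self.smt4(t4)], dim=1)
```

Keeping the Stage-3 branch at a higher resolution is what lets small objects survive: its token grid is four times denser than the Stage-4 grid before the two streams are merged.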

Algorithm 2: Insert activated node \(v\) into \(\mathbb G\) and adjust the priority tree

Algorithm 3: Pop top nodes \(\mathcal N'\) from \(\mathbb G^i\) for negative sampling

Appendix E: Algorithms of Instance Priority Trees

An Instance Priority Tree (IPT) \(\mathbb {G}=\{\mathcal {E}, \mathcal {N}\}\) is a heap implemented as a complete binary tree, maintaining high-confidence instances to construct negative samples for contrastive learning. Each element \(u \in \mathcal N\) is a binary tuple \((\widetilde{H}_u, s_u)\), where \(\widetilde{H}_u\) is a semantic masked feature and \(s_u\) is the corresponding activation confidence score. Each edge \(e \in \mathcal {E}\) indicates an affiliation relationship \((u,v)\) in which the parent node \(u\) has higher confidence than its child node \(v\).

In our implementation, we adopt an array \([u_1,u_2,\dots ,u_n]\) to store the nodes \(\mathcal N\) and describe the edges \(\mathcal E\) by the indices in the array, i.e., \(u_i\) is the parent node of \(u_{2i}\) and \(u_{2i+1}\). The IPT relies on three operations: “Insert activated nodes”, “Adjust priority tree”, and “Pop top nodes”. The tree adjustment must be conducted after every Insert or Pop operation to maintain the heap structure; in other words, “Insert activated nodes” and “Adjust priority tree” always operate together. We elaborate the detailed algorithms in Algorithm 2 and Algorithm 3.
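For concreteness, here is a minimal Python sketch of the IPT as an array-backed max-heap, mirroring the Insert/Adjust (sift-up) and Pop/Adjust (sift-down) steps of Algorithms 2 and 3. The class names, the size cap, and the skip-when-full policy are our own illustrative choices.

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Node:
    score: float      # activation confidence s_u
    feat: Any = None  # semantic masked feature H~_u

class InstancePriorityTree:
    """Max-heap in an array: u_i is the parent of u_{2i} and u_{2i+1}
    (slot 0 is left unused to keep the 1-based indexing of the text)."""

    def __init__(self, max_size: int = 1024):
        self.heap: List[Node] = [Node(float("inf"))]  # unused sentinel slot
        self.max_size = max_size                      # size budget (Appendix E)

    def insert(self, node: Node) -> None:
        """Algorithm 2 (sketch): insert an activated node, then sift up."""
        if len(self.heap) - 1 >= self.max_size:
            return  # budget reached; a real implementation might evict low scores
        self.heap.append(node)
        i = len(self.heap) - 1
        while i > 1 and self.heap[i // 2].score < self.heap[i].score:
            self.heap[i // 2], self.heap[i] = self.heap[i], self.heap[i // 2]
            i //= 2

    def pop_top(self, m: int) -> List[Node]:
        """Algorithm 3 (sketch): pop the m most confident nodes."""
        out: List[Node] = []
        while len(self.heap) > 1 and len(out) < m:
            out.append(self.heap[1])        # current maximum
            self.heap[1] = self.heap[-1]    # move the last node to the root
            self.heap.pop()
            i = 1
            while True:                     # sift down to restore heap order
                l, r, best = 2 * i, 2 * i + 1, i
                if l < len(self.heap) and self.heap[l].score > self.heap[best].score:
                    best = l
                if r < len(self.heap) and self.heap[r].score > self.heap[best].score:
                    best = r
                if best == i:
                    break
                self.heap[i], self.heap[best] = self.heap[best], self.heap[i]
                i = best
        return out
```

In the full pipeline, one such tree \(\mathbb G^i\) would be maintained per semantic class \(i\), accumulating masked features as they are activated and popping the most confident nodes as negative samples for the contrastive loss.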

In addition, we keep the size of the instance priority tree \(\mathbb {G}\) within a fixed budget to reduce computation costs. In this manner, instances with low-confidence features are not added to the tree storage, which keeps the model efficient.


Cite this article

Chen, C., Zhao, Y. & Li, J. Semantic Contrastive Bootstrapping for Single-Positive Multi-label Recognition. Int J Comput Vis 131, 3289–3306 (2023). https://doi.org/10.1007/s11263-023-01849-z
