1 Introduction

Fig. 1 Inter-class correlation maps of "embeddings of class labels" for 20 categories on Kinetics-400. Left: the textual vectors of class labels extracted by the text encoder. Right: the "embeddings" taken from the learned classifier. The color thresholds are adjusted for better viewing; please zoom in for the best view

In the field of optimizing neural network training efficiency, knowledge transfer aims to provide pre-learned information to downstream tasks. For visual recognition tasks, the approach typically involves leveraging feature representations derived from a task-agnostic model optimized on large-scale universal datasets, followed by building a classifier on top of that model. Earlier studies placed more emphasis on learning the base model. Over the last decade, for example, the dominant approach involved training models on the ImageNet dataset (Deng et al., 2009) and subsequently transferring them to downstream tasks. Owing to dramatically increasing computational capacity, general-purpose pre-trained models with several orders of magnitude more parameters and FLOPs have been successfully trained in both fully/semi-supervised (Sun et al., 2017) and self-supervised (He et al., 2020, 2022) styles. Recently, contrastive vision-language models (Radford et al., 2021; Jia et al., 2021a; Yuan et al., 2021) have garnered increasing interest as pre-training models in transfer learning due to their superior capabilities and effectiveness for visual recognition tasks. These models, which benefit from the knowledge of the language modality, have shown improved performance on various visual tasks, such as zero-shot classification (Radford et al., 2021), captioning (Mokady et al., 2021), and image generation (Ramesh et al., 2021), to name a few.

In this study, we aim to enhance the transferability of vision-language pre-training models for downstream visual recognition tasks by revisiting the knowledge-transferring process from the perspective of the classifier. Specifically, we examine the properties of the pre-training models and propose a simple yet effective paradigm to enhance their transferability. Our findings demonstrate that these pre-training models hold three essential properties for our paradigm: (i) Semantic-rich representations, which are obtained by training large neural network architectures with extensive, weakly related image-text sample pairs. In contrast to supervised-style models learned on standard image-label datasets, semantic-rich representations are expected to contain more semantics and more diverse representations of concepts, which is crucial in unknown target-domain settings. (ii) Modality alignment, which aligns the representation vectors of a paired sample's visual and textual modalities in a semantic embedding space. This property provides an advantage at initialization when the samples for downstream tasks are limited, i.e., in zero-/few-shot scenarios, compared to the vision-only classifier fine-tuning approach. (iii) Intra-modality correlations. The contrastive training algorithm also provides weak intra-modality correlations; that is, the representation vectors of similar images or texts are close to each other (Radford et al., 2021; Sun, 2022). In contrast to the aforementioned properties, the influence of intra-modality correlations among samples is often overlooked. Concisely, a classifier with appropriately correlated targets, rather than one-hot labels, learns faster and performs better.

To demonstrate the importance of appropriately correlated classifier targets, we conduct a toy experiment to depict the intra-modality correlations in two scenarios. We employ the Kinetics video recognition dataset (Kay et al., 2017) for the analysis (the detailed configurations are provided in Sect. 4.3). In the first scenario, we extract the textual embedding vectors of the class label names using the textual encoder of CLIP (Radford et al., 2021) and then calculate the correlation among these textual embedding vectors. In the second scenario, we examine the final projection head of a vanilla fine-tuning framework. Precisely, we learn a classifier based on the visual encoder from the same CLIP model. The projection head of the classifier is a \(c \times d\) matrix used to compute the pre-softmax logits from the d-dimensional feature vectors for the c classes. We therefore treat its d-dimensional row vectors as the "embeddings" of the class labels. This non-rigorous setting allows us to explore the intra-modality correlation between these learned "embeddings". The results are plotted in Fig. 1. We observe clear correlations among the embeddings of category names, since some of them contain the same keywords (e.g., playing <something>). Interestingly, in the second scenario, the learned "embeddings" also reveal a similar correlation map after training, despite being initialized randomly and optimized without knowing any textual information (that is, optimized with the cross-entropy loss and one-hot labels).
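This toy analysis can be reproduced in a few lines. The sketch below assumes the open-source `clip` package; the class-name list is an illustrative subset and the classifier checkpoint path is a hypothetical placeholder.

```python
# Minimal sketch of the toy correlation analysis (assumes the open-source `clip`
# package; class names and the classifier checkpoint path are placeholders).
import torch
import clip

class_names = ["playing guitar", "playing piano", "driving car"]  # subset of Kinetics-400 labels

model, _ = clip.load("ViT-B/16", device="cpu")
with torch.no_grad():
    text_emb = model.encode_text(clip.tokenize(class_names)).float()  # (c, d)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
corr_text = text_emb @ text_emb.t()        # inter-class correlation of label embeddings (Fig. 1, left)

# Second scenario: correlate the rows of the learned projection head W (c x d)
# taken from a vanilla fine-tuned classifier.
W = torch.load("classifier_weight.pt")     # hypothetical checkpoint of the trained head
W = W / W.norm(dim=-1, keepdim=True)
corr_learned = W @ W.t()                   # correlation of the learned "embeddings" (Fig. 1, right)
```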

In summary, we take full advantage of the large-scale contrastive image-language pre-trained models and build a novel general paradigm for the transfer learning settings. Our main contributions are as follows:

  • We revisit the transfer learning pipeline from the perspective of classifiers and find that properly correlated targets and pre-aligned semantic knowledge are crucial for downstream visual recognition tasks.

  • We build a new paradigm to transfer textual knowledge for visual recognition using contrastively pre-trained vision-language models. Our paradigm accelerates the transfer learning process while taking full advantage of the pre-trained models.

  • Comprehensive experiments are conducted on 17 visual datasets that span three distinct data domains: image, video, and 3D point cloud. For video recognition, we evaluate our model on 6 well-known video benchmarks, covering single-label and multi-label recognition, and also verify its effectiveness in zero-shot and few-shot scenarios. For image classification, we perform experiments on 10 different image datasets, and the results demonstrate that our method is an effective few-shot learner. For 3D point cloud recognition, we validate our method on the ModelNet40 dataset and find that it outperforms the vision-only paradigm by a significant margin.

  • We open-source our code and models at https://github.com/whwu95/Text4Vis.

2 Related Works

2.1 Visual Recognition Tasks and Transfer Learning

Visual recognition is one of the most important tasks in the design of machine learning systems. From the perspective of the visual backbone, we could roughly divide the evolution of the system into two eras: i) The Convolutional Neural Network (CNN) based architectures for image (Krizhevsky et al., 2012; He et al., 2016; Simonyan & Zisserman, 2014; Ioffe & Szegedy, 2015) or video recognition (Carreira & Zisserman, 2017; Qiu et al., 2017; Xie et al., 2018; Tran et al., 2018; Wu et al., 2021a, b). ii) The Vision Transformer (ViT) based architectures for image (Dosovitskiy et al., 2020; Han et al., 2021; Liu et al., 2021) or video recognition (Bertasius et al., 2021; Arnab et al., 2021; Liu et al., 2022; Fan et al., 2021). As ViT models are challenging to train from scratch without large-scale datasets, transfer learning techniques have regained popularity.

Transfer learning aims to enhance target learners’ performance on target domains by transferring knowledge from related but different source domains (Tan et al., 2018; Ribani & Marengoni, 2019; Zhuang et al., 2020), thereby reducing the requirements of target domain data for learning the target model. A typical transfer learning system is built with a pre-trained model trained with source domain data and a classifier for the target domain data. This study discusses a sub-family of transfer learning systems that utilize large-scale task-agnostic models. Related studies on this sub-family are discussed in Sect. 2.3.

2.2 Image-Language Pre-training

The recent success of Contrastive Language-Image Pre-Training (CLIP) (Radford et al., 2021) has paved the way for coordinated vision-language pre-training models utilizing the image-text InfoNCE contrastive loss (Van den Oord et al., 2018). Several works have since been proposed that combine various learning tasks, including image-text matching and masked image/language modeling, such as ALIGN (Jia et al., 2021b), BLIP (Li et al., 2022b), Florence (Yuan et al., 2021), and CoCa (Yu et al., 2022). These contrastively learned models exhibit two essential properties for downstream tasks: rich visual feature representations and aligned textual feature representations. Another recent study (Yang et al., 2022) has incorporated the downstream classification task into the pre-training process, resulting in improved accuracy over the standard cross-entropy loss. These developments demonstrate the potential of coordinated pre-training of vision and language models and open up exciting opportunities for further advances in vision-language understanding.

2.3 Transferring CLIP for Downstream Tasks

The transfer of pre-trained CLIP to downstream tasks is a recent and emerging research direction. Several recent studies (Gao et al., 2021; Zhang et al., 2021b; Zhou et al., 2021, 2022) have investigated the efficient transfer of pre-trained CLIP to downstream image recognition tasks. In addition, CLIP has been leveraged to enhance dense prediction tasks such as object detection (Rao et al., 2022) and segmentation (Lüddecke & Ecker, 2022; Li et al., 2022a). In the video domain, CLIP has also benefited many text-video retrieval methods (Zhao et al., 2022; Luo et al., 2021). For video recognition, ActionCLIP (Wang et al., 2021b) and VideoPrompt (Ju et al., 2022) extend CLIP (Radford et al., 2021) to train a downstream video-text matching model with a contrastive loss and utilize the similarity between learned video and text embeddings during inference. Other methods, such as ST-Adapter (Pan et al., 2022) and EVL (Lin et al., 2022b), use only the visual encoder for unimodal transfer without involving textual knowledge. This study investigates the correlation between the linear classifier and efficient feature transfer in the standard visual recognition paradigm. We propose a direct transfer of visual and textual knowledge for visual recognition, without using contrastive-based methods.

Fig. 2 Illustration of transferring vision-language pre-trained models for visual recognition. a The widely-used standard vision-only tuning paradigm with cross-entropy loss. b The vision-language contrastive learning paradigm with contrastive loss, e.g., CLIP (Radford et al., 2021), ActionCLIP (Wang et al., 2021b). c Revisiting the role of the classifier to transfer knowledge from vision-language pre-trained models. c denotes the number of categories, b is the batch size, and d represents the dimension of embeddings

3 Methodology

3.1 Notations

In this paper, we use bold letters to denote vectors, while capital italic letters denote tensors or matrices. For example, we use \({\textbf{z}} \in {\mathbb {R}}^d\) to denote the feature vector of dimension d extracted from a pre-trained model, and \(W \in {\mathbb {R}}^{c\times d}\) to denote the projection matrix of the c-class linear classifier. Without ambiguity, we also use capital italic letters in subscripts to denote the modality. Specifically, we use V and T to denote the Visual modality and the Textual modality, respectively. We also use lowercase italic letters to denote functions or neural networks, such as \(g_V(\cdot , \varTheta _V)\) and \(g_T(\cdot , \varTheta _T)\), which represent the visual and textual encoders, respectively. Furthermore, we employ calligraphic letters, such as \({\mathcal {D}}\), to denote sets of elements.

3.2 Revisiting Existing Learning Paradigms

Standard Transfer Learning Paradigm In Fig. 2a, we depict the conventional scenario, where a visual encoder model \(g_V\) is trained on a large-scale dataset \({\mathcal {D}}\) containing visual samples, with or without ground-truth labels. On our labeled downstream dataset \({\tilde{\mathcal {D}}} = \{ ({\varvec{x}}_1, {\varvec{y}}_1), ({\varvec{x}}_2, {\varvec{y}}_2), \ldots \}\), our empirical learning target can be expressed as

$$\begin{aligned} g^*_V, W^* = \underset{\varTheta _V,W}{\textrm{arg min}}~ {{\mathbb {E}}}_{{{\varvec{x}},{\varvec{y}} \sim {\tilde{\mathcal {D}}}}}\big [ H({\varvec{y}} | \sigma (W \cdot g_V({\varvec{x}}))) \big ], \end{aligned}$$
(1)

where \(H({\hat{p}}|p)\) represents the cross-entropy between the predicted distribution p and the ground-truth distribution \({\hat{p}}\). The symbol \(\sigma \) denotes the softmax operation, and \(W \in {\mathbb {R}}^{c\times d}\) denotes the linear projection matrix for classification. The formulation in Eq. 1 is the standard visual feature transfer paradigm, in which the visual encoder \(g_V\) and the projection matrix W are learned jointly.
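For concreteness, a minimal sketch of this standard paradigm is given below. The backbone here is a stand-in module used purely for illustration, not the actual encoder used in the paper.

```python
import torch
import torch.nn as nn

d, c = 512, 400                                       # feature dimension and number of classes
visual_encoder = nn.Sequential(                       # stand-in for the pre-trained backbone g_V
    nn.Flatten(), nn.Linear(3 * 224 * 224, d))
classifier = nn.Linear(d, c, bias=False)              # W in Eq. 1, randomly initialized
criterion = nn.CrossEntropyLoss()                     # applies the softmax of Eq. 1 internally

x = torch.randn(4, 3, 224, 224)                       # dummy mini-batch of images
y = torch.randint(0, c, (4,))
loss = criterion(classifier(visual_encoder(x)), y)    # H(y | sigma(W · g_V(x)))
loss.backward()                                       # gradients flow into both g_V and W
```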

Fig. 3 Illustration of 6 types of projection matrix initialization, which develop different levels of correlation between the target embedding vectors. On the left: a trivial or no correlation between the target vectors; b correlation calculated from visual statistics; c correlation calculated from textual semantic knowledge. On the right: inter-class correlation maps obtained from the six types of initialization. Impressively, the correlation maps obtained from transferring visual statistics and from textual semantic knowledge yield a similar appearance. See Fig. 1 for more details

Vision-Language Contrastive Learning Paradigm As shown in Fig. 2b, we then review the contrastive learning paradigm of vision-language models, which has gained widespread use in vision-language pre-training, e.g., CLIP (Radford et al., 2021), and has been extended to video-text fine-tuning, e.g., ActionCLIP (Wang et al., 2021b) and CLIP4Clip (Luo et al., 2021).

Consider a dataset \({\mathcal {D}} = \{ ({\varvec{x}}_{V,1}, {\varvec{x}}_{T,1}), ({\varvec{x}}_{V,2}, {\varvec{x}}_{T,2}), \ldots \}\) consisting of weakly related vision-language pairs (e.g., image-text, video-text). With a slight abuse of notation, we employ \({\varvec{x}}_V,{\varvec{x}}_T\) to denote a mini-batch of size b; we then minimize the following target:

$$\begin{aligned} g^*_V, g^*_T = \underset{\varTheta _V,\varTheta _T}{\textrm{arg min}}~ {{\mathbb {E}}}_{{{\varvec{x}}_V,{\varvec{x}}_T \sim {\mathcal {D}}}}\big [H( {\mathcal {Q}} | \sigma (g_V({\varvec{x}}_V)^{\text {T}}\cdot g_T({\varvec{x}}_T))) \big ], \end{aligned}$$
(2)

where \({\mathcal {Q}}\) is the set of b one-hot labels of size b, in which the i-th label has its i-th element set to 1, representing the positive vision-language pairs on the diagonal. We note that the definition in Eq. 2 is not the rigorous form of the Noise-Contrastive Estimation (NCE) loss proposed in Van den Oord et al. (2018). Instead, we employ the cross-entropy implementation used in Radford et al. (2021) and Chen et al. (2021). The contrastive learning paradigm first projects the visual features \(g_V({\varvec{x}}_V)\) with the textual features \(g_T({\varvec{x}}_T)\) acting as a projection matrix, and then follows the standard transfer learning paradigm to match the resulting similarity matrix with the diagonal label set \({\mathcal {Q}}\).
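A compact sketch of this cross-entropy form is given below. The temperature value and the symmetric averaging over both directions follow CLIP's public implementation rather than the one-directional form written in Eq. 2.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(v_feat, t_feat, temperature=0.07):
    """Cross-entropy form of the contrastive objective for b paired (b, d) features."""
    v = F.normalize(v_feat, dim=-1)
    t = F.normalize(t_feat, dim=-1)
    logits = v @ t.t() / temperature                       # (b, b) similarity matrix
    labels = torch.arange(v.size(0))                       # diagonal targets, i.e., the set Q
    return 0.5 * (F.cross_entropy(logits, labels)          # vision -> text direction
                  + F.cross_entropy(logits.t(), labels))   # text -> vision direction
```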

3.3 Our Proposed Paradigm

As depicted in Fig. 2c, we propose a more generalized paradigm, built upon the classifier perspective, that replaces the learnable, randomly initialized linear projection matrix W with a pre-defined matrix \({\tilde{W}}\). Following Sect. 3.2, the training target can be formulated as:

$$\begin{aligned} g^*_V = \underset{\varTheta _V}{\textrm{arg min}}~ {{\mathbb {E}}}_{{{\varvec{x}},{\varvec{y}} \sim {\tilde{\mathcal {D}}}}}\big [ H({\varvec{y}} | \sigma ({\tilde{W}}\cdot g_V({\varvec{x}}))) \big ]. \end{aligned}$$
(3)

In the following subsections, we investigate different initialization methods for \({\tilde{W}}\).

3.4 Discussion on Initialization

To investigate the extent to which the correlation between the semantic information contained in the samples is helpful, we examine several types of initialization that represent different degrees of intra-modality (or, from the classifier's perspective, inter-class) correlation, as illustrated in Fig. 3.

3.4.1 Trivial Inter-class Correlation

Randomized Matrix We start with the simplest initialization method, which sets each row of \({\tilde{W}}\) to a random Gaussian vector with zero mean and unit variance. This can be denoted as follows:

$$\begin{aligned} {\tilde{W}} \sim {\mathcal {N}}({\varvec{0}}, I_d), \end{aligned}$$
(4)

where \(I_d\) denotes the identity matrix of dimension \(d\times d\). While this method generates trivial correlations between the rows of \({\tilde{W}}\) due to its stochasticity, these correlations cannot reflect the actual correspondence between the visual classes. Therefore, we expect the model to have inferior performance since it needs to avoid these incorrect correlations when learning the visual feature representation.

Randomized Orthogonal Matrix Next, we consider the case where correlations are removed from the projection matrix. We follow the approach of the randomized matrix and then remove the correlation by making the row vectors orthogonal, which is achieved via QR decomposition. Concretely, since \(d>c\), we first generate a random matrix of size \(d \times d\), orthogonalize it, and select the first c rows of the result as our projection matrix. Formally, we have,

$$\begin{aligned} \begin{aligned} {\tilde{W}}_j \sim \textrm{QR}(U)_j, j = 1, 2, \ldots , c, \\ U_i\sim {\mathcal {N}}({\varvec{0}}, I_d), i = 1, 2, \ldots , d, \end{aligned} \end{aligned}$$
(5)

where U is the intermediate randomized matrix, and \(\textrm{QR}(U)\) is the row-orthogonal matrix obtained through the QR decomposition. Similar to the randomized matrix, we expect this initialization to yield inferior performance: since the one-hot label vectors are also orthogonal to each other, projecting the visual feature vectors with an orthogonal matrix is not helpful and may increase the difficulty of learning meaningful visual features.
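Both trivial initializations can be written in a few lines. The sketch below follows Eqs. 4 and 5 directly; the values of d and c are chosen arbitrarily for illustration.

```python
import torch

d, c = 512, 400                  # embedding dimension and number of classes (d > c)

# Eq. 4: rows drawn from N(0, I_d); stochastic, hence only trivial inter-class correlation.
W_random = torch.randn(c, d)

# Eq. 5: orthogonalize a random d x d matrix via QR and keep its first c rows,
# so the class "embeddings" are mutually uncorrelated.
U = torch.randn(d, d)
Q, _ = torch.linalg.qr(U)        # Q is a d x d orthogonal matrix
W_orthogonal = Q[:c]             # (c, d) row-orthogonal projection matrix
```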

3.4.2 Correlation from Visual Statistic Knowledge

Class Center Projection To utilize the visual encoder’s statistical knowledge, we randomly select a small subset of labeled samples from the training dataset. For our experiments on the Kinetics-400 dataset, we sample 60 videos from each class, which is approximately 10% of the training data. Next, we compute the mean value of each class’s visual embeddings extracted from the visual encoder. These mean vectors are treated as the centers for each class and are used to initialize the classifier’s parameters. The class center initialization provides a basic approximation of the visual knowledge obtained from the pre-trained model. However, its effectiveness largely depends on the data used to compute the projection matrix, and when the data is limited, the estimated correlation among visual embeddings may be biased.
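A sketch of the class-center initialization is given below, assuming the visual embeddings of the labeled subset have already been extracted by the frozen visual encoder.

```python
import torch

def class_center_init(features, labels, num_classes):
    """Initialize each classifier row with the mean embedding of its class.

    features: (n, d) embeddings of the labeled subset (e.g., ~60 videos per
    Kinetics-400 class); labels: (n,) integer class indices.
    """
    W = torch.zeros(num_classes, features.size(1))
    for k in range(num_classes):
        W[k] = features[labels == k].mean(dim=0)     # class center as the target vector
    return W
```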

Linear Discriminant Projection We propose another approach to initializing the projection matrix using visual statistics. We use multi-class Fisher's linear discriminant analysis (LDA) to learn a linear classifier and employ the weight matrix of that classifier to initialize the projection matrix. Specifically, we first use the same visual embeddings as in the previous approach to compute the LDA coefficients, following previous work (Li et al., 2006). Then, we use the LDA coefficients to initialize \({\tilde{W}}\) and freeze it while fine-tuning the visual encoder on the dataset. Intuitively, LDA simultaneously maximizes the inter-class covariance and minimizes the intra-class covariance. We therefore term this the maximal-correlation initialization using visual statistic knowledge. However, the linear discriminant projection also suffers from the biased data sampling process.
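One possible implementation of this initialization uses scikit-learn; reusing the fitted coefficients as the projection weights is our reading of the procedure described above, not the authors' exact code.

```python
import numpy as np
import torch
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_init(features, labels):
    """Fit multi-class Fisher LDA on subset embeddings and reuse its weight matrix.

    features: (n, d) array of visual embeddings; labels: (n,) class indices.
    Returns a (c, d) tensor used to initialize, and then freeze, the classifier.
    """
    lda = LinearDiscriminantAnalysis()
    lda.fit(features, labels)
    return torch.from_numpy(np.asarray(lda.coef_, dtype=np.float32))
```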

3.4.3 Correlation from Textual Semantic Knowledge

Textual Embedding Vectors We now describe how we transfer textual semantic knowledge from a pre-trained textual encoder to initialize the projection weight \({\tilde{W}}\). Given a set of tokenized class labels \({\mathcal {L}} = \{ {\varvec{l}}_1, {\varvec{l}}_2, \ldots , {\varvec{l}}_c \}\), we initialize the i-th row vector in \({\tilde{W}}\) as follows:

$$\begin{aligned} {\tilde{W}}_i \sim g_T({\varvec{l}}_i), \qquad i = 1, 2, \ldots , c, \end{aligned}$$
(6)

where \(g_T\) is a function that maps a textual input to an embedded feature vector using a pre-trained textual encoder. In our experiments, we investigate two types of textual feature encoders: i) The encoder that is trained solely using textual samples on tasks such as masked language modeling, i.e., DistilBERT (Sanh et al., 2019); ii) The encoder that is trained with a visual encoder in the contrastive style, i.e., CLIP (Radford et al., 2021). Using the textual embeddings to initialize \({\tilde{W}}\) allows us to roughly pre-align the visual and textual embeddings in the same embedding space.
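A minimal sketch of this initialization is shown below, assuming the open-source `clip` package. Registering the matrix as a buffer rather than a parameter keeps it frozen, and the L2 normalization is an implementation choice we assume here rather than part of Eq. 3.

```python
import torch
import torch.nn as nn
import clip

class FrozenTextClassifier(nn.Module):
    """Classifier whose weight rows are CLIP textual embeddings of the class names (Eq. 6)."""

    def __init__(self, class_names, clip_name="ViT-B/16"):
        super().__init__()
        model, _ = clip.load(clip_name, device="cpu")
        with torch.no_grad():
            w = model.encode_text(clip.tokenize(class_names)).float()   # (c, d)
        w = w / w.norm(dim=-1, keepdim=True)          # assumed normalization
        self.register_buffer("weight", w)             # buffer, not Parameter: never updated

    def forward(self, visual_feat):                   # (b, d) features -> (b, c) logits
        visual_feat = visual_feat / visual_feat.norm(dim=-1, keepdim=True)
        return visual_feat @ self.weight.t()          # tilde{W} · g_V(x) as in Eq. 3
```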

3.5 Discussion on Parameter Frozen

It is worth mentioning that, in our paradigm, \({\tilde{W}}\) is not among the optimization targets; we freeze it during fine-tuning on the downstream tasks. We have the following reasons for this. First, since the textual knowledge is extracted by the textual encoder, freezing this part significantly decreases the computational resources required for fine-tuning. As we show in Sect. 4, freezing the parameters of \({\tilde{W}}\) shortens the training period. Second, freezing the parameters helps to reduce the biases introduced by the limited semantic knowledge of the class names. By keeping the feature embeddings distributed as they were learned on large-scale datasets, we improve the diversity of the representations and the stability of learning. Finally, this configuration also allows a direct comparison with former studies that employ textual information for vision transfer learning.

4 Experiments: Video Recognition

In this section, we transfer the image-language pre-trained model to the video modality, i.e., the video recognition task. To evaluate the effectiveness of the transferred model, we conduct experiments on six well-known video datasets, which include both trimmed and untrimmed video data. Specifically, the datasets are Kinetics-400 & 600 (Kay et al., 2017; Carreira et al., 2018), UCF-101 (Soomro et al., 2012), HMDB-51 (Kuehne et al., 2011), ActivityNet-v1.3 (Caba Heilbron et al., 2015), and Charades (Sigurdsson et al., 2016). These datasets are selected to represent a wide range of video recognition tasks, and are commonly used as benchmarks in this field.

We evaluate the transferred model in three distinct scenarios: zero-shot, few-shot, and regular video recognition. In the zero-shot scenario, the model has not been trained on the target dataset but is evaluated on it, allowing us to assess its ability to generalize to new data. In the few-shot scenario, the model is trained on a small subset of the target dataset and evaluated on the validation set, enabling us to explore its capacity to learn from limited labeled data. In the regular recognition scenario, the model is trained on the entire target dataset and evaluated on the validation set, allowing us to measure its performance in a standard supervised learning configuration. By evaluating the model in these three scenarios, we aim to provide a comprehensive assessment of its performance under different conditions.

4.1 Training

The video recognition task takes a video as input and feeds it into a learned encoder to estimate the action category of the video. Given a video, we first uniformly sample T (e.g., 8, 16, 32) frames over the entire video. We then utilize ResNet (He et al., 2016) or ViT (Dosovitskiy et al., 2020) as the video encoder. The classifier in our paradigm is initialized from the textual embeddings of the class names and then frozen (fixed), leaving only the parameters of the video encoder to be learned.
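A sketch of this training setup is given below; the frame encoder stands for any per-frame backbone (e.g., CLIP's visual encoder), and temporal average pooling is used here as the simplest aggregation.

```python
import torch
import torch.nn as nn

class VideoRecognizer(nn.Module):
    """Per-frame encoding, temporal average pooling, and a frozen textual classifier."""

    def __init__(self, frame_encoder, text_weight):   # text_weight: (c, d) class embeddings
        super().__init__()
        self.frame_encoder = frame_encoder            # trainable visual backbone
        self.register_buffer("W", text_weight)        # frozen classifier

    def forward(self, video):                         # video: (b, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.frame_encoder(video.flatten(0, 1))      # (b*T, d) frame features
        feats = feats.view(b, t, -1).mean(dim=1)             # temporal average pooling
        return feats @ self.W.t()                            # (b, c) logits
```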

Table 1 Default training details for video recognition
Table 2 The effects of different initializations for the frozen (offline) classifiers

Default Training Recipe Table 1 presents our training details for regular video recognition. We share the same recipe on all the video datasets, i.e., Kinetics-400, ActivityNet, HMDB-51, UCF-101, and Charades.

Few-Shot Video Recognition All training strategies are consistent with those presented in Table 1, with only one modification: the number of epochs is increased to 100.

Zero-Shot Video Recognition We use the Kinetics-400 pre-trained models to directly perform cross-dataset zero-shot video recognition without any additional training on other datasets, i.e., ActivityNet, HMDB-51, UCF-101 and Kinetics-600.

4.2 Inference

To trade off accuracy and speed, we consider two inference strategies: (1) Single View: using only a single clip per video with a center crop for efficient evaluation, as in the ablation studies of Sect. 4.3. (2) Multiple Views: a strategy widely used in previous works, which samples multiple clips per video with several spatial crops to improve accuracy. For comparison with state-of-the-art approaches, we use four clips with three crops ("4\(\times \)3 Views").
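The multi-view protocol simply averages the per-view scores, as sketched below; how the 4×3 views are sampled is left to the data pipeline and is not shown here.

```python
import torch

@torch.no_grad()
def multi_view_predict(model, views):
    """views: (V, T, 3, H, W) tensor holding V spatio-temporal views (e.g., 4 clips x 3 crops)."""
    probs = model(views).softmax(dim=-1)      # (V, c) per-view class probabilities
    return probs.mean(dim=0).argmax().item()  # fuse views by averaging, then predict
```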

4.3 Ablation Studies

In this section, we conduct extensive ablation experiments on the Kinetics-400 dataset. Unless specified otherwise, we use ViT-B/16 with 8 frames as the video backbone and a single view for testing. The default settings are marked in bold italics.

Different Initializations to the Offline Classifier We first examine how the initialization affects the learning of the classifier. We prepare our controlled environment using a classifier with parameters \(W \in {\mathbb {R}}^{c\times d}\), which operates on the temporally average-pooled feature representations of all the frames. Following Sect. 3.4, we evaluate the performance of the six types of initialization in both the few-shot and full-shot settings. For reference, we also provide the results of the standard vision-only fine-tuning (i.e., online) classifier with trainable weights.

Table 2 lists the results. Feeding the offline classifier a random c-by-d matrix drawn from a normal distribution leads to significantly reduced performance. Furthermore, removing the classifier's intra-modality correlation also results in inferior performance. From this family of initializations, we understand the necessity of proper correlation in the classifier targets. Next, we observe that providing correlation information from the visual side, using a small labeled subset, leads to improved performance: the classifier no longer guesses the results in the few-shot scenario and learns reasonably well in the full-shot scenario, compared to the vision-only online classifier. Compared to learnable correlation, pre-extracted proper correlation provides a more explicit target, making it a more efficient approach, especially for few-shot transfer learning. Notably, the class-center initialization performs better than the LDA initialization in the few-shot scenario, demonstrating that the CLIP encoder has naturally well-distributed feature embeddings.

Finally, we investigate the effect of the textual semantic family of initializations on the classifier's performance. We observe that the embeddings from the textual encoder of CLIP significantly improve both the few-shot and full-shot accuracy. Interestingly, the DistilBERT-based initialization also performs remarkably well, despite its semantics not being directly aligned with the visual modality. This result can be explained by the fact that both DistilBERT and CLIP are pre-trained with large-scale data and have strong language modeling capabilities, allowing them to generate good semantic targets. We therefore conclude that the visual embeddings benefit from the correlation of the semantic targets, and that the extra cross-modal alignment further boosts the learning process, reducing the need for a large number of samples.

We also provide the visualizations of these classifiers in Fig. 3. Clearly, the latter two families of initializations share the same patterns in their correlation maps, which can easily be distinguished from the random and orthogonal ones.

Table 3 Temporal modeling for video encoders
Table 4 Ours vs. Contrastive-based paradigm with ViT-B/16 on Kinetics-400

Temporal Modeling In this study, we explore several temporal modeling strategies for both ViT and ResNet, including:

  1. TAP: Temporal average pooling is a straightforward temporal modeling strategy that provides a simple baseline for comparison.

  2. T1D: Channel-wise temporal 1D convolutions, which are commonly used in previous works (Wu et al., 2021a; Wang et al., 2021a; Liu et al., 2020), are employed to facilitate efficient temporal interaction in the later stages (res\(_{4-5}\)) of ResNet.

  3. T-Trans: This strategy involves feeding the embeddings of frames to a multi-layer (e.g., 6-layer) temporal transformer encoder.

  4. TokenT1D: This approach involves using T1D to model temporal relations for the [class] token features that are aggregated from local features via attention in the vision transformer. We apply TokenT1D at multiple positions of the vision transformer to model temporal dependencies among the tokens.

Our experimental results are presented in Table 3. We observed that on both ViT and ResNet backbones, TAP provides a simple baseline for temporal modeling, and T-Trans achieves the best top-1 accuracy. Interestingly, we found that T1D does not appear to be effective in this scenario. This could be due to the potential for T1D to disrupt the strong representations learned by CLIP. In contrast, TokenT1D is another internal-backbone temporal modeling strategy that modifies only the global [class] token features instead of patch features. We observed that TokenT1D does not lead to a performance drop and even slightly improves the TAP baseline. We believe that this is because TokenT1D results in minimal modifications to the pre-trained features, which allows the model to retain the learned representations while incorporating temporal dependencies among the tokens.
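A sketch of the T-Trans variant is given below; the 6-layer depth follows the description above, while the head count and the positional-embedding scheme are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class TemporalTransformer(nn.Module):
    """T-Trans: a small transformer encoder over per-frame embeddings."""

    def __init__(self, dim=512, num_layers=6, num_heads=8, max_frames=32):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, max_frames, dim))   # learnable temporal positions
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats):              # frame_feats: (b, T, d)
        t = frame_feats.size(1)
        x = self.encoder(frame_feats + self.pos_embed[:, :t])
        return x.mean(dim=1)                     # pooled video-level feature
```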

Ours vs. Contrastive-Based Paradigm We compare our proposed approach with the contrastive-based tuning method ActionCLIP (Wang et al., 2021b), introduced in Sect. 2.3. This paradigm treats the video recognition task as a video-text matching problem with a contrastive loss, which requires batch gathering to collect the embeddings from all GPUs and compute the cosine similarities of a given batch against all other batches.

To ensure a fair comparison, we follow the official code and configurations of ActionCLIP (Wang et al., 2021b) in our experiments. In contrast to the contrastive-based paradigm, our recognition paradigm uses the cross-entropy loss to train the model, and we employ pre-extracted text embeddings as our classifier. Thus, the only learned part in our paradigm is the visual encoder, whereas the pre-trained textual encoder still needs to be updated in the contrastive-based paradigm, requiring larger GPU memory. In Table 4, we compare our approach with the contrastive-based paradigm and observe that the latter performs poorly without batch gathering. This is because contrastive learning favors a large batch size; e.g., CLIP (Radford et al., 2021) used 256 GPUs with a batch size of 128 per GPU to maintain a large 32768\(\times \)32768 similarity matrix. Moreover, involving batch gathering multiplies the training time.

Our results demonstrate that our proposed approach achieves the best accuracy-cost trade-off. Specifically, our method achieves a performance of 81.5% with ViT-B/16, takes only 10 h of training on 8 GPUs, and is 2\(\times \) faster than the matching counterpart. Our approach is thus more efficient and effective for video recognition tasks, especially in applications with limited computational resources. Please refer to Appendix §A.2 for further details on batch gathering.

Additionally, to mitigate the impact of different implementation details, we also incorporate the contrastive-style training loss into our codebase. As observed in Table 5, training with the contrastive loss reduces training efficiency without a significant performance improvement. Moreover, we further enhance the performance by incorporating a fixed offline classifier into the contrastive-style approach. This improvement can be attributed to the accelerated convergence achieved by the fixed textual target during training.

Table 5 More ablations on contrastive-style paradigm
Table 6 Study on various text input forms

Text Input Forms We investigate several text input forms in Table 6, including class names, single hard template, multiple hard templates, and learnable templates. The details are as follows:

  1. Class name. To generate textual embeddings, we utilize the category names of the dataset as the text input, such as "eating hotdog" or "driving car". The results show that using only the label text can already yield good performance.

  2. Single hard template. We use a hand-crafted template, "a video of a person {class name}.", as input. This template only slightly improves performance over the label-text baseline.

  3. Multiple hard templates. CLIP provides 28 templates for Kinetics, including the single template described above. During training, we use these templates as text augmentation by randomly selecting one at each iteration. Then, we evaluate the model using the single template as input. The performance decreases by 0.6% on Kinetics-400, which may be because the various prompt templates introduce extra noise during training.

  4. Learnable templates. We use the automated prompt method CoOp (Zhou et al., 2021) to describe a prompt's context with a set of learnable vectors (a minimal sketch follows after this list). Specifically, the prompt given to the text encoder is designed with the following form:

     $$\begin{aligned} {\varvec{t}} = [\text {V}]_1 [\text {V}]_2 \ldots [\text {V}]_M [\text {class name}], \end{aligned}$$
     (7)

     where each \([\text {V}]_m\) (\(m\!\in \!\{1, \ldots , M\}\)) is a vector of the same size as the word embeddings, and M is the number of context tokens. We set M to 4 in our experiments.

Our results suggest that different templates have little impact on our model’s performance.
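The sketch below illustrates the learnable-template construction of Eq. 7 in simplified form; the `token_embedding` output and the `text_encoder` interface of a CLIP-like text tower are assumptions, not the original implementation.

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """CoOp-style prompt of Eq. 7: M learnable context vectors prepended to class-name tokens."""

    def __init__(self, embed_dim=512, num_ctx=4):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(num_ctx, embed_dim) * 0.02)   # [V]_1 ... [V]_M

    def forward(self, class_token_embeds, text_encoder):
        # class_token_embeds: (c, L, d) word embeddings of the tokenized class names
        c = class_token_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(c, -1, -1)           # shared context for all classes
        prompts = torch.cat([ctx, class_token_embeds], dim=1)   # (c, M + L, d)
        return text_encoder(prompts)                            # (c, d) class embeddings
```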

Computational Cost and Efficiency Table 7 presents our models' computational cost and efficiency, measured in terms of throughput on a single NVIDIA A100 GPU with a batch size of 16, which aligns with standard inference settings. Our models exhibit 29\(\times \) faster throughput and 44\(\times \) fewer FLOPs than the previous transformer-based method ViViT (Arnab et al., 2021), while maintaining the same accuracy. These results confirm the high efficiency of our approach.

4.4 Main Results

Table 7 Analysis on throughput
Table 8 Comparison with previous works on Kinetics-400

Regular Video Recognition We evaluate the performance of our model on the Kinetics-400 dataset, a challenging benchmark for regular video recognition. Table 8 provides a comparison of our model with state-of-the-art methods that were pre-trained on large-scale datasets such as ImageNet-21K (Deng et al., 2009), IG-65M (Ghadiyaram et al., 2019), JFT-300M (Sun et al., 2017), FLD-900M (Yuan et al., 2021), and JFT-3B (Zhai et al., 2021). To date, none of the three largest datasets (JFT-300M, FLD-900M, and JFT-3B) is open-sourced, and pre-trained models are not provided. Hence, we utilize the publicly available CLIP (Radford et al., 2021) checkpoints, which have been trained on 400 million web image-text pairs (WIT-400M). Significantly, using the same CLIP pre-trained backbones, our model demonstrates substantial performance improvements over EVL (Lin et al., 2022b) and ST-Adapter (Pan et al., 2022). Furthermore, our method achieves superior performance compared to methods pre-trained on JFT-300M (Sun et al., 2017) or FLD-900M (Yuan et al., 2021), while requiring less computational cost or a smaller resolution. Moreover, with a significant scale-up of the pre-training data to 2 billion samples (namely Merged-2B (Sun et al., 2023), which merges 1.6 billion samples from the LAION-2B (Schuhmann et al., 2022) dataset with 0.4 billion samples from the COYO-700M (Byeon et al., 2022) dataset), our method achieves an outstanding top-1 accuracy of 89.4%, solidifying its position as a state-of-the-art approach.

Table 9 Comparisons with previous works on ActivityNet
Table 10 Mean class accuracy on UCF-101 and HMDB-51 achieved by different methods which are transferred from their Kinetics models with RGB modality

To verify the generalization ability of our method, we further evaluate its performance on the widely-used untrimmed video benchmark ActivityNet-v1.3. Specifically, we fine-tune the Kinetics-400 pre-trained models, with a ViT-L backbone and 16 frames, on the ActivityNet-v1.3 dataset. The top-1 accuracy and mean average precision (mAP) are reported using the official evaluation metrics. As shown in Table 9, our method outperforms recent state-of-the-art models by a clear margin, achieving an mAP of 96.9%.

Table 11 Comparisons with previous works on few-shot action recognition

We also evaluate our method on the UCF-101 and HMDB-51 datasets to demonstrate its capacity to generalize to smaller datasets. We fine-tune our models on these two datasets using the ViT-L model pre-trained on Kinetics-400 and report the mean class accuracy on split one, using 16 frames as input. As shown in Table 10, our model exhibits strong transferability, achieving a mean class accuracy of 98.2% on UCF-101 and 81.3% on HMDB-51.

Few-Shot Video Recognition In few-shot video recognition, where only a few training samples are available, we investigate a more challenging K-shot C-way situation, instead of the conventional 5-shot 5-way configuration. We aim to categorize all categories in the dataset with just K samples per category for training, where the lower and upper bounds are denoted by the terms “Zero-shot” and “All-shot”, respectively. Using the CLIP-pretrained ViT-L/14 with 8 frames and TAP for few-shot video recognition, we report the Top-1 accuracy for the four datasets in Table 11. Despite the limited amount of data, our method demonstrates remarkable transferability to diverse domain data. Furthermore, our approach outperforms previous methods significantly, showing robustness in these extremely data-poor situations. For instance, when comparing the accuracy on HMDB-51 with 2-shot, our method outperforms Swin (Liu et al., 2022) and X-Florence (Ni et al., 2022) by +52.6% and +21.9%, respectively.

Multi-Label Video Recognition We mainly focused on the single-label video recognition scenario in the previous experiments. To further validate the performance of our method, we conducted experiments on multi-label video recognition tasks. The Charades dataset is a multi-label untrimmed video dataset containing long-term activities with multiple actions. For this task, we utilized the Kinetics-400 pre-trained ViT-L backbone for training and evaluated our results using the Mean Average Precision (mAP) metric. As shown in Table 12, our method achieved the highest performance of 46.0 mAP, demonstrating its effectiveness in multi-label video classification.

Zero-Shot Video Recognition In addition, we conducted experiments in the open-set setting. We use our Kinetics-400 pre-trained models (i.e., ViT-L with 8 frames) to perform zero-shot evaluations on four other video datasets. For UCF-101, HMDB-51, and ActivityNet, we follow two evaluation protocols from E2E (Brattoli et al., 2020):

  1. To make a fair comparison with previous works, we randomly selected half of the test dataset's classes: 50 for UCF-101, 25 for HMDB-51, and 100 for ActivityNet, and evaluated our method on them. We repeated this process ten times and averaged the results for each test dataset. We refer to this setting as UCF\(^*\), HMDB\(^*\), and ANet\(^*\).

  2. In the second evaluation protocol, we directly evaluated the full datasets to obtain more realistic accuracy scores.

Table 12 Comparison with previous works on Multi-Label video dataset Charades
Table 13 Comparison with previous works on zero-shot video recognition

For Kinetics-600, we chose 220 new categories outside Kinetics-400 for evaluation. We used the three splits provided by Chen and Huang (2021) and sampled 160 categories for evaluation from the 220 categories in Kinetics-600 for each split. We reported the mean accuracy for the three splits. As shown in Table 13, our method demonstrates a strong cross-dataset generalization ability, achieving significant improvements over previous zero-shot video recognition methods (+27.1% on UCF-101, +17.0% on HMDB-51, +52.1% on ActivityNet, +26.8% on Kinetics-600).

5 Experiments: Image Recognition

In this work, we also apply our method to image recognition. We conduct a comprehensive evaluation on 10 datasets that represent a diverse set of visual recognition tasks, i.e., ImageNet (Deng et al., 2009), StanfordCars (Krause et al., 2013), Caltech101 (Fei-Fei et al., 2004), OxfordPets (Parkhi et al., 2012), Flowers102 (Nilsback & Zisserman, 2008), Food101 (Bossard et al., 2014), FGVCAircraft (Maji et al., 2013), SUN397 (Xiao et al., 2010), DTD (Cimpoi et al., 2014), and EuroSAT (Helber et al., 2019). These tasks include classifying generic objects, scenes, and fine-grained categories, as well as specialized tasks such as texture recognition and satellite imagery analysis.

5.1 Training

For the pre-trained CLIP model, we use the ResNet-50 (He et al., 2016) as the default backbone for the image encoder, and the image backbone is updated during training. We train the model using the AdamW optimizer with an initial learning rate of 5e-6 and a cosine annealing schedule to reduce the learning rate gradually. We also employ a warmup strategy of 5 epochs. The maximum number of training epochs is set to 150.
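The stated recipe translates into a standard optimizer/scheduler pair. The sketch below is one common way to realize the 5-epoch warmup with per-epoch stepping, not necessarily the authors' exact implementation.

```python
import torch

def build_optimization(model, epochs=150, warmup_epochs=5, base_lr=5e-6):
    """AdamW with linear warmup followed by cosine annealing (per-epoch stepping assumed)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)
    warmup = torch.optim.lr_scheduler.LinearLR(
        optimizer, start_factor=0.01, total_iters=warmup_epochs)
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs - warmup_epochs)
    scheduler = torch.optim.lr_scheduler.SequentialLR(
        optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs])
    return optimizer, scheduler
```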

5.2 Main Results

Results on 10 Image Datasets As illustrated in Fig. 4, the performance of our method, the vision-only method, and the zero-shot method is evaluated on 10 image datasets, all trained with 16 shots. The results for the 10 datasets are arranged from left to right, and the average over the 10 datasets is presented on the far right. Our findings reveal that CLIP shows strong zero-shot performance on all 10 datasets. However, the vision-only method exhibits poor performance on all datasets. We posit that this may be attributed to the absence of a suitable classifier target; consequently, it may be susceptible to biases in the small samples, which can disrupt the well-pretrained image encoder. Our approach demonstrates a substantial improvement in recognition accuracy over both the vision-only and zero-shot methods on all 10 datasets. Specifically, the average improvement over the 10 datasets is 41% and 18% compared to the vision-only method and the zero-shot method, respectively. This indicates the effectiveness of our method in enhancing few-shot learning performance.

Fig. 4 Comparison of few-shot learning performance on 10 image datasets. Assessment of zero-shot CLIP, the vision-only method, and the proposed method underlines the significance of incorporating a suitable classifier target to mitigate biases in small samples and achieve high accuracy on a diverse set of image datasets

Table 14 presents a further comparison of our method with two other transfer methods, linear probe and CoOp (Zhou et al., 2021), on the widely used ImageNet dataset. We implement the linear probe method as instructed in the original CLIP paper (Radford et al., 2021). Our findings indicate that CoOp improves over the zero-shot model by a significant 4.77%. Notably, our proposed approach surpasses this, improving over the zero-shot model by 8.33%, underscoring the effectiveness of incorporating an appropriate classifier target.

Table 14 Comparison of our method with other tuning methods on ImageNet (using 16 shots)

6 Experiments: 3D Point Cloud Recognition

We further extend our approach to 3D point cloud recognition and evaluate it on the ModelNet40 dataset (Wu et al., 2015). This dataset comprises 12,311 3D CAD models across 40 categories, such as airplanes, cars, plants, and lamps. The point clouds are normalized to a unit sphere and divided into 9,843 training models and 2,468 testing models. ModelNet40 is a widely used benchmark for point cloud recognition.

6.1 Training

For the visual encoder, we use the ResNet-101 architecture (He et al., 2016) as the default backbone and apply multi-view perspective projection to the input point cloud following SimpleView (Goyal et al., 2021). SimpleView projects the point cloud from six orthogonal views: front, right, back, left, top, and bottom. In addition, we include the views from the upper/bottom-front/back-left corners, based on the observation of Zhang et al. (2022) that the left view is the most informative for few-shot recognition. For each view, a point with a 3D coordinate is projected onto a pixel on the 2D image plane, and its depth value is used as the pixel intensity, which is repeated three times for the RGB channels. Finally, all the resulting images are upsampled to 224\(\times \)224 to align with CLIP's settings.
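A simplified front-view rasterization is sketched below to illustrate the depth-image idea; the exact projection and depth convention used by SimpleView may differ from this orthographic approximation.

```python
import torch

def depth_image_from_points(points, resolution=224):
    """Orthographic front-view depth map from a unit-sphere point cloud (simplified).

    points: (N, 3) tensor with coordinates roughly in [-1, 1]; returns a (3, H, W)
    image whose depth values are replicated across the RGB channels, as described above.
    """
    xy = ((points[:, :2] + 1) * 0.5 * (resolution - 1)).long().clamp(0, resolution - 1)
    depth = (points[:, 2] + 1) * 0.5                # map z to a [0, 1] pixel intensity
    order = depth.argsort()                         # draw larger depths last (simple z-ordering)
    img = torch.zeros(resolution, resolution)
    img[xy[order, 1], xy[order, 0]] = depth[order]  # scatter depths onto the image plane
    return img.unsqueeze(0).repeat(3, 1, 1)         # replicate to three RGB channels
```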

We train the model using the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 2e-4 and a cosine annealing schedule to reduce the learning rate gradually. We also employ a warmup strategy of 10 epochs; the maximum number of training epochs is set to 250.

Fig. 5 Results of 3D point cloud recognition on the ModelNet40 dataset. Comparison of different tuning methods in the few-shot scenario

6.2 Main Results

Comparison with the Vision-Only Paradigm in the Few-Shot Scenario We evaluate our method using the few-shot evaluation protocol adopted in CLIP (Radford et al., 2021), which involves training with 1, 2, 4, 8, and 16 shots and deploying the models on the full test set. As shown in Fig. 5, we first present our method's zero-shot result (14.9%), obtained by directly utilizing the CLIP model to classify each view and averaging the results of the 10 views. We then compare the performance of different tuning models on 3D point cloud recognition. Our results show that all models gradually improve in accuracy as the number of training samples increases. Notably, our method (green curve) outperforms the vision-only method (orange curve) by a large absolute margin of 30%-40%, which is consistent with our findings in image and video recognition and validates the effectiveness of our approach. Additionally, our method significantly outperforms the linear probe method (blue curve), known as a strong few-shot learning baseline, at all training sample levels. These results confirm the effectiveness and superiority of our proposed approach, which leverages textual knowledge to improve transferability.

7 Conclusion and Limitation

This study presents a new paradigm for enhancing the transferability of visual recognition tasks based on the knowledge from the textual encoder of a well-trained vision-language model. Specifically, we initialize the classifier with semantic targets from the textual encoder and freeze it during optimization. We conduct extensive experiments to examine how the paradigm functions. First, we demonstrate that proper correlation among the target initializations is beneficial. Second, we show that the alignment of visual and textual semantics is key to improving few-shot performance and shortening the learning process. Finally, we verify the effectiveness of our proposed paradigm on three types of visual recognition tasks (i.e., image, video, and 3D point cloud recognition) across 17 visual datasets.

The study still has some limitations worth exploring in future research. i) The performance of the proposed paradigm is restricted by how the category labels are represented. For instance, in tasks such as human re-identification, the labels are often numerical values such as 0, 1, 2, etc. In this case, we cannot transfer any semantic information from the textual encoders, while transferring visual statistic knowledge (i.e., the LDA classifier) could still be helpful. ii) The performance of the proposed paradigm relies on the capacity of the vision-language pre-training models. Although we use CLIP as our source model in this study, obtaining models with better performance remains an open problem. iii) The way category names are described also impacts performance. For example, in the action recognition dataset Something-Something, category names such as "Putting something into something" and "Covering something with something" lack a clear target subject. Consequently, leveraging the prior knowledge of pre-aligned vision-language models becomes challenging, resulting in subpar performance.