1 Introduction

Fig. 1 Inter-class correlation maps of "embeddings of class labels" for 20 categories on Kinetics-400. Left: the textual vectors of class labels extracted by the text encoder. Right: the "embeddings" taken from the learned classifier. The color thresholds are adjusted for better viewing; please zoom in for the best view

In the field of optimizing neural network training efficiency, knowledge transfer aims to provide pre-learned information to downstream tasks. For visual recognition tasks, the approach typically involves leveraging feature representations derived from a task-agnostic model optimized on large-scale universal datasets, followed by building a classifier on top of that model. Earlier studies placed more emphasis on learning the base model. Over the last decade, for example, the dominant approach involved training models on the ImageNet dataset (Deng et al., 2009) and subsequently transferring them to downstream tasks. Owing to dramatically increasing computational capacity, general-purpose pre-trained models with several orders of magnitude more parameters and FLOPs have been successfully trained in both fully/semi-supervised (Sun et al., 2017) and self-supervised (He et al., 2020, 2022) styles. Recently, contrastive vision-language models (Radford et al., 2021; Jia et al., 2021a; Yuan et al., 2021) have garnered increasing interest as pre-training models in transfer learning due to their superior capabilities and effectiveness for visual recognition tasks. These models, which benefit from the knowledge of the language modality, have shown improved performance on various visual tasks, such as zero-shot classification (Radford et al., 2021), captioning (Mokady et al., 2021), and image generation (Ramesh et al., 2021), to name a few.

In this study, we aim to enhance the transferability of vision-language pre-training models for downstream visual recognition tasks by revisiting the knowledge-transferring process from the perspective of the classifier. Specifically, we examine the properties of the pre-training models and propose a simple yet effective paradigm to enhance their transferability. Our findings demonstrate that these pre-training models hold three essential properties for our paradigm: (i) Semantic-rich representations, which are obtained by training large neural network architectures with extensive, weakly related image-text sample pairs. In contrast to supervised-style models learned on standard image-label datasets, semantic-rich representations are expected to contain more semantics and more diverse representations of concepts, which is crucial in unknown target-domain settings. (ii) Modality alignment, which aligns the representation vectors of a paired sample's visual and textual modalities in a semantic embedding space. This property provides an advantage at initialization when the samples for downstream tasks are limited, i.e., in zero-/few-shot scenarios, compared to the vision-only classifier fine-tuning approach. (iii) Intra-modality correlations. The contrastive training algorithm also provides weak intra-modality correlations; that is, the representation vectors of similar images or texts are close to each other (Radford et al., 2021; Sun, 2022). In contrast to the aforementioned properties, the influence of intra-modality correlations among samples is often overlooked. Concisely, a classifier with appropriately correlated targets, rather than one-hot labels, learns faster and performs better.

To demonstrate the importance of appropriately correlated classifier targets, we conduct a toy experiment to depict the intra-modality correlations in two scenarios. We employ the Kinetics video recognition dataset (Kay et al., 2017) for the analysis (the detailed configurations are provided in Sect. 4.3). In the first scenario, we extract the textual embedding vectors of the class label names using the textual encoder of CLIP (Radford et al., 2021) and then calculate the correlation among these textual embedding vectors. In the second scenario, we examine the final projection head of a vanilla fine-tuning framework. Precisely, we learn a classifier based on the visual encoder from the same CLIP model. The projection head of the classifier is a \(c \times d\) matrix used to compute the pre-softmax logits from the d-dimensional feature vectors for the c classes. We therefore treat its d-dimensional row vectors as the "embeddings" of the class labels. This non-rigorous setting allows us to explore the intra-modality correlation between these learned "embeddings". The results are plotted in Fig. 1. We observe clear correlations among the embeddings of category names, since some of them contain the same keywords (e.g., playing <something>). Interestingly, in the second scenario, the learned "embeddings" also reveal a similar correlation map after training, despite being initialized randomly and optimized without knowing any textual information (that is, optimized with the cross-entropy loss and one-hot labels).
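This toy analysis can be reproduced in a few lines. The sketch below assumes the open-source `clip` package; the class-name list is an illustrative subset and the classifier checkpoint path is a hypothetical placeholder.

```python
# Minimal sketch of the toy correlation analysis (assumes the open-source `clip`
# package; class names and the classifier checkpoint path are placeholders).
import torch
import clip

class_names = ["playing guitar", "playing piano", "driving car"]  # subset of Kinetics-400 labels

model, _ = clip.load("ViT-B/16", device="cpu")
with torch.no_grad():
    text_emb = model.encode_text(clip.tokenize(class_names)).float()  # (c, d)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
corr_text = text_emb @ text_emb.t()        # inter-class correlation of label embeddings (Fig. 1, left)

# Second scenario: correlate the rows of the learned projection head W (c x d)
# taken from a vanilla fine-tuned classifier.
W = torch.load("classifier_weight.pt")     # hypothetical checkpoint of the trained head
W = W / W.norm(dim=-1, keepdim=True)
corr_learned = W @ W.t()                   # correlation of the learned "embeddings" (Fig. 1, right)
```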

In summary, we take full advantage of the large-scale contrastive image-language pre-trained models and build a novel general paradigm for the transfer learning settings. Our main contributions are as follows:

  • We revisit the transfer learning pipeline from the perspective of classifiers and find that properly correlated targets and pre-aligned semantic knowledge are crucial for downstream visual recognition tasks.

  • We build a new paradigm to transfer textual knowledge for visual recognition using contrastively pre-trained vision-language models. Our paradigm accelerates the transfer learning process while taking full advantage of the pre-trained models.

  • Comprehensive experiments are conducted on 17 visual datasets that span three distinct data domains: image, video, and 3D point cloud. For video recognition, we evaluate our model on 6 well-known video benchmarks, covering single-label and multi-label recognition, and also verify its effectiveness in zero-shot and few-shot scenarios. For image classification, we perform experiments on 10 different image datasets, and the results demonstrate that our method is an effective few-shot learner. For 3D point cloud recognition, we validate our method on the ModelNet40 dataset and find that it outperforms the vision-only paradigm by a significant margin.

  • We open-source our code and models at https://github.com/whwu95/Text4Vis.

2 Related Works

2.1 Visual Recognition Tasks and Transfer Learning

Visual recognition is one of the most important tasks in the design of machine learning systems. From the perspective of the visual backbone, we could roughly divide the evolution of the system into two eras: i) The Convolutional Neural Network (CNN) based architectures for image (Krizhevsky et al., 2012; He et al., 2016; Simonyan & Zisserman, 2014; Ioffe & Szegedy, 2015) or video recognition (Carreira & Zisserman, 2017; Qiu et al., 2017; Xie et al., 2018; Tran et al., 2018; Wu et al., 2021a, b). ii) The Vision Transformer (ViT) based architectures for image (Dosovitskiy et al., 2020; Han et al., 2021; Liu et al., 2021) or video recognition (Bertasius et al., 2021; Arnab et al., 2021; Liu et al., 2022; Fan et al., 2021). As ViT models are challenging to train from scratch without large-scale datasets, transfer learning techniques have regained popularity.

Transfer learning aims to enhance target learners’ performance on target domains by transferring knowledge from related but different source domains (Tan et al., 2018; Ribani & Marengoni, 2019; Zhuang et al., 2020), thereby reducing the requirements of target domain data for learning the target model. A typical transfer learning system is built with a pre-trained model trained with source domain data and a classifier for the target domain data. This study discusses a sub-family of transfer learning systems that utilize large-scale task-agnostic models. Related studies on this sub-family are discussed in Sect. 2.3.

2.2 Image-Language Pre-training

The recent success of Contrastive Language-Image Pre-Training (CLIP) (Radford et al., 2021) has paved the way for coordinated vision-language pre-training models utilizing the image-text InfoNCE contrastive loss (Van den Oord et al., 2018). Several works have since been proposed that combine various learning tasks, including image-text matching and masked image/language modeling, such as ALIGN (Jia et al., 2021b), BLIP (Li et al., 2022b), Florence (Yuan et al., 2021), and CoCa (Yu et al., 2022). These contrastively learned models exhibit two essential properties for downstream tasks: rich visual feature representations and aligned textual feature representations. Another recent study (Yang et al., 2022) has incorporated the downstream classification task into the pre-training process, resulting in improved accuracy over the standard cross-entropy loss. These developments demonstrate the potential of coordinated pre-training of vision and language models and open up exciting opportunities for further advances in vision-language understanding.

2.3 Transferring CLIP for Downstream Tasks

The transfer of pre-trained CLIP to downstream tasks is a recent and emerging research direction. Several recent studies (Gao et al., 2021; Zhang et al., 2021b; Zhou et al., 2021, 2022) have investigated the efficient transfer of pre-trained CLIP to downstream image recognition tasks. In addition, CLIP has been leveraged to enhance dense prediction tasks such as object detection (Rao et al., 2022) and segmentation (Lüddecke & Ecker, 2022; Li et al., 2022a). In the video domain, CLIP has also benefited many text-video retrieval methods (Zhao et al., 2022; Luo et al., 2021). For video recognition, ActionCLIP (Wang et al., 2021b) and VideoPrompt (Ju et al., 2022) extend CLIP (Radford et al., 2021) to train a downstream video-text matching model with a contrastive loss and utilize the similarity between learned video and text embeddings during inference. Other methods, such as ST-Adapter (Pan et al., 2022) and EVL (Lin et al., 2022b), use only the visual encoder for unimodal transfer without involving textual knowledge. This study investigates the correlation between the linear classifier and efficient feature transfer in the standard visual recognition paradigm. We propose a direct transfer of visual and textual knowledge for visual recognition, without using contrastive-based methods.

Fig. 2 Illustration of transferring vision-language pre-trained models for visual recognition. a The widely-used standard vision-only tuning paradigm with cross-entropy loss. b The vision-language contrastive learning paradigm with contrastive loss, e.g., CLIP (Radford et al., 2021), ActionCLIP (Wang et al., 2021b). c Revisiting the role of the classifier to transfer knowledge from vision-language pre-trained models. c denotes the number of categories, b is the batch size, and d represents the dimension of embeddings

3 Methodology

3.1 Notations

In this paper, we use bold letters to denote vectors, while capital italic letters denote tensors or matrices. For example, we use \({\textbf{z}} \in {\mathbb {R}}^d\) to denote the feature vector of dimension d extracted from a pre-trained model, and \(W \in {\mathbb {R}}^{c\times d}\) to denote the projection matrix of the c-class linear classifier. Without ambiguity, we also use capital italic letters in subscripts to denote the modality. Specifically, we use V and T to denote the Visual modality and the Textual modality, respectively. We also use lowercase italic letters to denote functions or neural networks, such as \(g_V(\cdot , \varTheta _V)\) and \(g_T(\cdot , \varTheta _T)\), which represent the visual and textual encoders, respectively. Furthermore, we employ calligraphic letters, such as \({\mathcal {D}}\), to denote sets of elements.

3.2 Revisiting Existing Learning Paradigms

Standard Transfer Learning Paradigm In Fig. 2a, we depict the conventional scenario, where a visual encoder model \(g_V\) is trained on a large-scale dataset \({\mathcal {D}}\) containing visual samples, with or without ground-truth labels. On our labeled downstream dataset \({\tilde{\mathcal {D}}} = \{ ({\varvec{x}}_1, {\varvec{y}}_1), ({\varvec{x}}_2, {\varvec{y}}_2), \ldots \}\), our empirical learning target can be expressed as

$$\begin{aligned} g^*_V, W^* = \underset{\varTheta _V,W}{\textrm{arg min}}~ {{\mathbb {E}}}_{{{\varvec{x}},{\varvec{y}} \sim {\tilde{\mathcal {D}}}}}\big [ H({\varvec{y}} | \sigma (W \cdot g_V({\varvec{x}}))) \big ], \end{aligned}$$
(1)

where \(H({\hat{p}}|p)\) represents the cross-entropy between the predicted distribution p and the ground-truth distribution \({\hat{p}}\). The symbol \(\sigma \) denotes the softmax operation, and \(W \in {\mathbb {R}}^{c\times d}\) denotes the linear projection matrix for classification. The formulation in Eq. 1 is the standard visual feature transfer paradigm, in which the visual encoder \(g_V\) and the projection matrix W are learned jointly.
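For concreteness, a minimal sketch of this standard paradigm is given below. The backbone here is a stand-in module used purely for illustration, not the actual encoder used in the paper.

```python
import torch
import torch.nn as nn

d, c = 512, 400                                       # feature dimension and number of classes
visual_encoder = nn.Sequential(                       # stand-in for the pre-trained backbone g_V
    nn.Flatten(), nn.Linear(3 * 224 * 224, d))
classifier = nn.Linear(d, c, bias=False)              # W in Eq. 1, randomly initialized
criterion = nn.CrossEntropyLoss()                     # applies the softmax of Eq. 1 internally

x = torch.randn(4, 3, 224, 224)                       # dummy mini-batch of images
y = torch.randint(0, c, (4,))
loss = criterion(classifier(visual_encoder(x)), y)    # H(y | sigma(W · g_V(x)))
loss.backward()                                       # gradients flow into both g_V and W
```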

Fig. 3 Illustration of 6 types of projection matrix initialization, which develop different levels of correlation between the target embedding vectors. On the left: a trivial or no correlation between the target vectors; b correlation calculated from visual statistics; c correlation calculated from textual semantic knowledge. On the right: inter-class correlation maps obtained from the six types of initialization. Impressively, the correlation maps obtained from transferring visual statistics and from textual semantic knowledge yield a similar appearance. See Fig. 1 for more details

Vision-Language Contrastive Learning Paradigm As shown in Fig. 2b, we then review the contrastive learning paradigm of vision-language models, which has gained widespread use in vision-language pre-training, e.g., CLIP (Radford et al., 2021), and has been extended to video-text fine-tuning, e.g., ActionCLIP (Wang et al., 2021b) and CLIP4Clip (Luo et al., 2021).

Consider a dataset \({\mathcal {D}} = \{ ({\varvec{x}}_{V,1}, {\varvec{x}}_{T,1}), ({\varvec{x}}_{V,2}, {\varvec{x}}_{T,2}), \ldots \}\) consisting of weakly related vision-language pairs (e.g., image-text, video-text). With a slight abuse of notation, we employ \({\varvec{x}}_V,{\varvec{x}}_T\) to denote a mini-batch of size b; we then minimize the following target:

$$\begin{aligned} g^*_V, g^*_T = \underset{\varTheta _V,\varTheta _T}{\textrm{arg min}}~ {{\mathbb {E}}}_{{{\varvec{x}}_V,{\varvec{x}}_T \sim {\mathcal {D}}}}\big [H( {\mathcal {Q}} | \sigma (g_V({\varvec{x}}_V)^{\text {T}}\cdot g_T({\varvec{x}}_T))) \big ], \end{aligned}$$
(2)

where \({\mathcal {Q}}\) is the set of b one-hot labels of size b, in which the i-th label has its i-th element set to 1, representing the positive vision-language pairs on the diagonal. We note that the definition in Eq. 2 is not the rigorous form of the Noise-Contrastive Estimation (NCE) loss proposed in Van den Oord et al. (2018). Instead, we employ the cross-entropy implementation used in Radford et al. (2021) and Chen et al. (2021). The contrastive learning paradigm first projects the visual features \(g_V({\varvec{x}}_V)\) with the textual features \(g_T({\varvec{x}}_T)\) acting as a projection matrix, and then follows the standard transfer learning paradigm to match the resulting similarity matrix with the diagonal label set \({\mathcal {Q}}\).
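A compact sketch of this cross-entropy form is given below. The temperature value and the symmetric averaging over both directions follow CLIP's public implementation rather than the one-directional form written in Eq. 2.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(v_feat, t_feat, temperature=0.07):
    """Cross-entropy form of the contrastive objective for b paired (b, d) features."""
    v = F.normalize(v_feat, dim=-1)
    t = F.normalize(t_feat, dim=-1)
    logits = v @ t.t() / temperature                       # (b, b) similarity matrix
    labels = torch.arange(v.size(0))                       # diagonal targets, i.e., the set Q
    return 0.5 * (F.cross_entropy(logits, labels)          # vision -> text direction
                  + F.cross_entropy(logits.t(), labels))   # text -> vision direction
```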

3.3 Our Proposed Paradigm

As depicted in Fig. 2c, we propose a more generalized paradigm, built upon the classifier perspective, that replaces the learnable, randomly initialized linear projection matrix W with a pre-defined matrix \({\tilde{W}}\). Following Sect. 3.2, the training target can be formulated as:

$$\begin{aligned} g^*_V = \underset{\varTheta _V}{\textrm{arg min}}~ {{\mathbb {E}}}_{{{\varvec{x}},{\varvec{y}} \sim {\tilde{\mathcal {D}}}}}\big [ H({\varvec{y}} | \sigma ({\tilde{W}}\cdot g_V({\varvec{x}}))) \big ]. \end{aligned}$$
(3)

In the following subsections, we investigate different initialization methods for \({\tilde{W}}\).

3.4 Discussion on Initialization

To investigate the extent to which the correlation between the semantic information contained in the samples is helpful, we examine several types of initialization that represent different degrees of intra-modality (or, from the classifier's perspective, inter-class) correlation, as illustrated in Fig. 3.

3.4.1 Trivial Inter-class Correlation

Randomized Matrix We start with the simplest initialization method, which sets each row of \({\tilde{W}}\) to a random Gaussian vector with zero mean and unit variance. This can be denoted as follows:

$$\begin{aligned} {\tilde{W}} \sim {\mathcal {N}}({\varvec{0}}, I_d), \end{aligned}$$
(4)

where \(I_d\) denotes the identity matrix of dimension \(d\times d\). While this method generates trivial correlations between the rows of \({\tilde{W}}\) due to its stochasticity, these correlations cannot reflect the actual correspondence between the visual classes. Therefore, we expect the model to have inferior performance since it needs to avoid these incorrect correlations when learning the visual feature representation.

Randomized Orthogonal Matrix Next, we consider the case where correlations are removed from the projection matrix. We follow the approach of the randomized matrix and then remove the correlation by making the row vectors orthogonal, which is achieved via QR decomposition. Concretely, since \(d>c\), we first generate a random matrix of size \(d \times d\), orthogonalize it, and select the first c rows of the result as our projection matrix. Formally, we have,

$$\begin{aligned} \begin{aligned} {\tilde{W}}_j \sim \textrm{QR}(U)_j, j = 1, 2, \ldots , c, \\ U_i\sim {\mathcal {N}}({\varvec{0}}, I_d), i = 1, 2, \ldots , d, \end{aligned} \end{aligned}$$
(5)

where U is the intermediate randomized matrix, and \(\textrm{QR}(U)\) is the row-orthogonal matrix obtained through the QR decomposition. Similar to the randomized matrix, we expect this initialization to yield inferior performance: since the one-hot label vectors are also orthogonal to each other, projecting the visual feature vectors with an orthogonal matrix is not helpful and may increase the difficulty of learning meaningful visual features.
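Both trivial initializations can be written in a few lines. The sketch below follows Eqs. 4 and 5 directly; the values of d and c are chosen arbitrarily for illustration.

```python
import torch

d, c = 512, 400                  # embedding dimension and number of classes (d > c)

# Eq. 4: rows drawn from N(0, I_d); stochastic, hence only trivial inter-class correlation.
W_random = torch.randn(c, d)

# Eq. 5: orthogonalize a random d x d matrix via QR and keep its first c rows,
# so the class "embeddings" are mutually uncorrelated.
U = torch.randn(d, d)
Q, _ = torch.linalg.qr(U)        # Q is a d x d orthogonal matrix
W_orthogonal = Q[:c]             # (c, d) row-orthogonal projection matrix
```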

3.4.2 Correlation from Visual Statistic Knowledge

Class Center Projection To utilize the visual encoder’s statistical knowledge, we randomly select a small subset of labeled samples from the training dataset. For our experiments on the Kinetics-400 dataset, we sample 60 videos from each class, which is approximately 10% of the training data. Next, we compute the mean value of each class’s visual embeddings extracted from the visual encoder. These mean vectors are treated as the centers for each class and are used to initialize the classifier’s parameters. The class center initialization provides a basic approximation of the visual knowledge obtained from the pre-trained model. However, its effectiveness largely depends on the data used to compute the projection matrix, and when the data is limited, the estimated correlation among visual embeddings may be biased.
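A sketch of the class-center initialization is given below, assuming the visual embeddings of the labeled subset have already been extracted by the frozen visual encoder.

```python
import torch

def class_center_init(features, labels, num_classes):
    """Initialize each classifier row with the mean embedding of its class.

    features: (n, d) embeddings of the labeled subset (e.g., ~60 videos per
    Kinetics-400 class); labels: (n,) integer class indices.
    """
    W = torch.zeros(num_classes, features.size(1))
    for k in range(num_classes):
        W[k] = features[labels == k].mean(dim=0)     # class center as the target vector
    return W
```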

Linear Discriminant Projection We propose another approach to initializing the projection matrix using visual statistics. We use multi-class Fisher's linear discriminant analysis (LDA) to learn a linear classifier and employ the weight matrix of that classifier to initialize the projection matrix. Specifically, we first use the same visual embeddings as in the previous approach to compute the LDA coefficients, following previous work (Li et al., 2006). Then, we use the LDA coefficients to initialize \({\tilde{W}}\) and freeze it while fine-tuning the visual encoder on the dataset. Intuitively, LDA simultaneously maximizes the inter-class covariance and minimizes the intra-class covariance. We therefore term this the maximal-correlation initialization using visual statistic knowledge. However, the linear discriminant projection also suffers from the biased data sampling process.
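One possible implementation of this initialization uses scikit-learn; reusing the fitted coefficients as the projection weights is our reading of the procedure described above, not the authors' exact code.

```python
import numpy as np
import torch
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_init(features, labels):
    """Fit multi-class Fisher LDA on subset embeddings and reuse its weight matrix.

    features: (n, d) array of visual embeddings; labels: (n,) class indices.
    Returns a (c, d) tensor used to initialize, and then freeze, the classifier.
    """
    lda = LinearDiscriminantAnalysis()
    lda.fit(features, labels)
    return torch.from_numpy(np.asarray(lda.coef_, dtype=np.float32))
```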

3.4.3 Correlation from Textual Semantic Knowledge

Textual Embedding Vectors We now describe how we transfer textual semantic knowledge from a pre-trained textual encoder to initialize the projection weight \({\tilde{W}}\). Given a set of tokenized class labels \({\mathcal {L}} = \{ {\varvec{l}}_1, {\varvec{l}}_2, \ldots , {\varvec{l}}_c \}\), we initialize the i-th row vector in \({\tilde{W}}\) as follows:

$$\begin{aligned} {\tilde{W}}_i \sim g_T({\varvec{l}}_i), \qquad i = 1, 2, \ldots , c, \end{aligned}$$
(6)

where \(g_T\) is a function that maps a textual input to an embedded feature vector using a pre-trained textual encoder. In our experiments, we investigate two types of textual feature encoders: i) The encoder that is trained solely using textual samples on tasks such as masked language modeling, i.e., DistilBERT (Sanh et al., 2019); ii) The encoder that is trained with a visual encoder in the contrastive style, i.e., CLIP (Radford et al., 2021). Using the textual embeddings to initialize \({\tilde{W}}\) allows us to roughly pre-align the visual and textual embeddings in the same embedding space.
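A minimal sketch of this initialization is shown below, assuming the open-source `clip` package. Registering the matrix as a buffer rather than a parameter keeps it frozen, and the L2 normalization is an implementation choice we assume here rather than part of Eq. 3.

```python
import torch
import torch.nn as nn
import clip

class FrozenTextClassifier(nn.Module):
    """Classifier whose weight rows are CLIP textual embeddings of the class names (Eq. 6)."""

    def __init__(self, class_names, clip_name="ViT-B/16"):
        super().__init__()
        model, _ = clip.load(clip_name, device="cpu")
        with torch.no_grad():
            w = model.encode_text(clip.tokenize(class_names)).float()   # (c, d)
        w = w / w.norm(dim=-1, keepdim=True)          # assumed normalization
        self.register_buffer("weight", w)             # buffer, not Parameter: never updated

    def forward(self, visual_feat):                   # (b, d) features -> (b, c) logits
        visual_feat = visual_feat / visual_feat.norm(dim=-1, keepdim=True)
        return visual_feat @ self.weight.t()          # tilde{W} · g_V(x) as in Eq. 3
```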

3.5 Discussion on Parameter Frozen

It is worth mentioning that, in our paradigm, \({\tilde{W}}\) is not among the optimization targets; we freeze it during fine-tuning on the downstream tasks. We have the following reasons for this. First, since the textual knowledge is extracted by the textual encoder, freezing this part significantly decreases the computational resources required for fine-tuning. As we show in Sect. 4, freezing the parameters of \({\tilde{W}}\) shortens the training period. Second, freezing the parameters helps to reduce the biases introduced by the limited semantic knowledge of the class names. By keeping the feature embeddings distributed as they were learned on large-scale datasets, we improve the diversity of the representations and the stability of learning. Finally, this configuration also allows a direct comparison with former studies that employ textual information for vision transfer learning.

4 Experiments: Video Recognition

In this section, we transfer the image-language pre-trained model to the video modality, i.e., the video recognition task. To evaluate the effectiveness of the transferred model, we conduct experiments on six well-known video datasets, which include both trimmed and untrimmed video data. Specifically, the datasets are Kinetics-400 & 600 (Kay et al., 2017; Carreira et al., 2018), UCF-101 (Soomro et al., 2012), HMDB-51 (Kuehne et al., 2011), ActivityNet-v1.3 (Caba Heilbron et al., 2015), and Charades (Sigurdsson et al., 2016). These datasets are selected to represent a wide range of video recognition tasks, and are commonly used as benchmarks in this field.

We evaluate the transferred model in three distinct scenarios: zero-shot, few-shot, and regular video recognition. In the zero-shot scenario, the model has not been trained on the target dataset but is evaluated on it, allowing us to assess its ability to generalize to new data. In the few-shot scenario, the model is trained on a small subset of the target dataset and evaluated on the validation set, enabling us to explore its capacity to learn from limited labeled data. In the regular recognition scenario, the model is trained on the entire target dataset and evaluated on the validation set, allowing us to measure its performance in a standard supervised learning configuration. By evaluating the model in these three scenarios, we aim to provide a comprehensive assessment of its performance under different conditions.

4.1 Training

The video recognition task takes a video as input and feeds it into a learned encoder to estimate the action category of the video. Given a video, we first uniformly sample T (e.g., 8, 16, 32) frames over the entire video. We then utilize ResNet (He et al., 2016) or ViT (Dosovitskiy et al., 2020) as the video encoder. The classifier in our paradigm is initialized from the textual embeddings of the class names and then frozen (fixed), leaving only the parameters of the video encoder to be learned.
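A sketch of this training setup is given below; the frame encoder stands for any per-frame backbone (e.g., CLIP's visual encoder), and temporal average pooling is used here as the simplest aggregation.

```python
import torch
import torch.nn as nn

class VideoRecognizer(nn.Module):
    """Per-frame encoding, temporal average pooling, and a frozen textual classifier."""

    def __init__(self, frame_encoder, text_weight):   # text_weight: (c, d) class embeddings
        super().__init__()
        self.frame_encoder = frame_encoder            # trainable visual backbone
        self.register_buffer("W", text_weight)        # frozen classifier

    def forward(self, video):                         # video: (b, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.frame_encoder(video.flatten(0, 1))      # (b*T, d) frame features
        feats = feats.view(b, t, -1).mean(dim=1)             # temporal average pooling
        return feats @ self.W.t()                            # (b, c) logits
```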

Table 1 Default training details for video recognition
Table 2 The effects of different initializations for the frozen (offline) classifiers

Default Training Recipe Table 1 presents our training details for regular video recognition. We share the same recipe on all the video datasets, i.e., Kinetics-400, ActivityNet, HMDB-51, UCF-101, and Charades.

Few-Shot Video Recognition All training strategies are consistent with those presented in Table 1, with only one modification: the number of epochs is increased to 100.

Zero-Shot Video Recognition We use the Kinetics-400 pre-trained models to directly perform cross-dataset zero-shot video recognition without any additional training on other datasets, i.e., ActivityNet, HMDB-51, UCF-101 and Kinetics-600.

4.2 Inference

To trade off accuracy and speed, we consider two inference strategies: (1) Single View: using only a single clip per video with a center crop for efficient evaluation, as in the ablation studies of Sect. 4.3. (2) Multiple Views: a strategy widely used in previous works, which samples multiple clips per video with several spatial crops to improve accuracy. For comparison with state-of-the-art approaches, we use four clips with three crops ("4\(\times \)3 Views").
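The multi-view protocol simply averages the per-view scores, as sketched below; how the 4×3 views are sampled is left to the data pipeline and is not shown here.

```python
import torch

@torch.no_grad()
def multi_view_predict(model, views):
    """views: (V, T, 3, H, W) tensor holding V spatio-temporal views (e.g., 4 clips x 3 crops)."""
    probs = model(views).softmax(dim=-1)      # (V, c) per-view class probabilities
    return probs.mean(dim=0).argmax().item()  # fuse views by averaging, then predict
```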

4.3 Ablation Studies

In this section, we conduct extensive ablation experiments on the Kinetics-400 dataset. Unless specified otherwise, we use ViT-B/16 with 8 frames as the video backbone and a single view for testing. The default settings are marked in bold italics.

Different Initializations to the Offline Classifier We first examine how the initialization affects the learning of the classifier. We prepare our controlled environment using a classifier with parameters \(W \in {\mathbb {R}}^{c\times d}\), which operates on the temporally average-pooled feature representations of all the frames. Following Sect. 3.4, we evaluate the performance of the six types of initialization in both the few-shot and full-shot settings. For reference, we also provide the results of the standard vision-only fine-tuning (i.e., online) classifier with trainable weights.

Table 2 lists the results. Feeding the offline classifier a random c-by-d matrix drawn from a normal distribution leads to significantly reduced performance. Furthermore, removing the classifier's intra-modality correlation also results in inferior performance. From this family of initializations, we understand the necessity of proper correlation in the classifier targets. Next, we observe that providing correlation information from the visual side, using a small labeled subset, leads to improved performance: the classifier no longer guesses the results in the few-shot scenario and learns reasonably well in the full-shot scenario, compared to the vision-only online classifier. Compared to learnable correlation, pre-extracted proper correlation provides a more explicit target, making it a more efficient approach, especially for few-shot transfer learning. Notably, the class-center initialization performs better than the LDA initialization in the few-shot scenario, demonstrating that the CLIP encoder has naturally well-distributed feature embeddings.

Finally, we investigate the effect of the textual semantic family of initializations on the classifier's performance. We observe that the embeddings from the textual encoder of CLIP significantly improve both the few-shot and full-shot accuracy. Interestingly, the DistilBERT-based initialization also performs remarkably well, despite its semantics not being directly aligned with the visual modality. This result can be explained by the fact that both DistilBERT and CLIP are pre-trained with large-scale data and have strong language modeling capabilities, allowing them to generate good semantic targets. We therefore conclude that the visual embeddings benefit from the correlation of the semantic targets, and that the extra cross-modal alignment further boosts the learning process, reducing the need for a large number of samples.

We also provide the visualizations of these classifiers in Fig. 3. Clearly, the latter two families of initializations share the same patterns in their correlation maps, which can easily be distinguished from the random and orthogonal ones.

Table 3 Temporal modeling for video encoders
Table 4 Ours vs. Contrastive-based paradigm with ViT-B/16 on Kinetics-400

Temporal Modeling In this study, we explore several temporal modeling strategies for both ViT and ResNet, including:

  1. TAP: Temporal average pooling is a straightforward temporal modeling strategy that provides a simple baseline for comparison.

  2. T1D: Channel-wise temporal 1D convolutions, which are commonly used in previous works (Wu et al., 2021a; Wang et al., 2021a; Liu et al., 2020), are employed to facilitate efficient temporal interaction in the later stages (res\(_{4-5}\)) of ResNet.

  3. T-Trans: This strategy involves feeding the embeddings of frames to a multi-layer (e.g., 6-layer) temporal transformer encoder.

  4. TokenT1D: This approach involves using T1D to model temporal relations for the [class] token features that are aggregated from local features via attention in the vision transformer. We apply TokenT1D at multiple positions of the vision transformer to model temporal dependencies among the tokens.

Our experimental results are presented in Table 3. We observed that on both ViT and ResNet backbones, TAP provides a simple baseline for temporal modeling, and T-Trans achieves the best top-1 accuracy. Interestingly, we found that T1D does not appear to be effective in this scenario. This could be due to the potential for T1D to disrupt the strong representations learned by CLIP. In contrast, TokenT1D is another internal-backbone temporal modeling strategy that modifies only the global [class] token features instead of patch features. We observed that TokenT1D does not lead to a performance drop and even slightly improves the TAP baseline. We believe that this is because TokenT1D results in minimal modifications to the pre-trained features, which allows the model to retain the learned representations while incorporating temporal dependencies among the tokens.
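A sketch of the T-Trans variant is given below; the 6-layer depth follows the description above, while the head count and the positional-embedding scheme are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class TemporalTransformer(nn.Module):
    """T-Trans: a small transformer encoder over per-frame embeddings."""

    def __init__(self, dim=512, num_layers=6, num_heads=8, max_frames=32):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, max_frames, dim))   # learnable temporal positions
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats):              # frame_feats: (b, T, d)
        t = frame_feats.size(1)
        x = self.encoder(frame_feats + self.pos_embed[:, :t])
        return x.mean(dim=1)                     # pooled video-level feature
```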

Ours vs. Contrastive-Based Paradigm We compare our proposed approach with the contrastive-based tuning method ActionCLIP (Wang et al., 2021b), introduced in Sect. 2.3. This paradigm treats the video recognition task as a video-text matching problem with a contrastive loss, which requires batch gathering to collect the embeddings from all GPUs and compute the cosine similarities of a given batch against all other batches.

To ensure a fair comparison, we follow the official code and configurations of ActionCLIP (Wang et al., 2021b) in our experiments. In contrast to the contrastive-based paradigm, our recognition paradigm uses the cross-entropy loss to train the model, and we employ pre-extracted text embeddings as our classifier. Thus, the only learned part in our paradigm is the visual encoder, whereas the pre-trained textual encoder still needs to be updated in the contrastive-based paradigm, requiring larger GPU memory. In Table 4, we compare our approach with the contrastive-based paradigm and observe that the latter performs poorly without batch gathering. This is because contrastive learning favors a large batch size; e.g., CLIP (Radford et al., 2021) used 256 GPUs with a batch size of 128 per GPU to maintain a large 32768\(\times \)32768 similarity matrix. Moreover, involving batch gathering multiplies the training time.

Our results demonstrate that our proposed approach achieves the best accuracy-cost trade-off. Specifically, our method achieves a performance of 81.5% with ViT-B/16, takes only 10 h of training on 8 GPUs, and is 2\(\times \) faster than the matching counterpart. Our approach is thus more efficient and effective for video recognition tasks, especially in applications with limited computational resources. Please refer to Appendix §A.2 for further details on batch gathering.

Additionally, to mitigate the impact of different implementation details, we also incorporate the contrastive-style training loss into our codebase. As observed in Table 5, training with the contrastive loss reduces training efficiency without a significant performance improvement. Moreover, we further enhance the performance by incorporating a fixed offline classifier into the contrastive-style approach. This improvement can be attributed to the accelerated convergence achieved by the fixed textual target during training.

Table 5 More ablations on contrastive-style paradigm
Table 6 Study on various text input forms

Text Input Forms We investigate several text input forms in Table 6, including class names, single hard template, multiple hard templates, and learnable templates. The details are as follows:

  1. Class name. To generate textual embeddings, we utilize the category names of the dataset as the text input, such as "eating hotdog" or "driving car". The results show that using only the label text can already yield good performance.

  2. Single hard template. We use a hand-crafted template, "a video of a person {class name}.", as input. This template only slightly improves performance over the label-text baseline.

  3. Multiple hard templates. CLIP provides 28 templates for Kinetics, including the single template described above. During training, we use these templates as text augmentation by randomly selecting one at each iteration. Then, we evaluate the model using the single template as input. The performance decreases by 0.6% on Kinetics-400, which may be because the various prompt templates introduce extra noise during training.

  4. Learnable templates. We use the automated prompt method CoOp (Zhou et al., 2021) to describe a prompt's context with a set of learnable vectors (a minimal sketch follows after this list). Specifically, the prompt given to the text encoder is designed with the following form:

     $$\begin{aligned} {\varvec{t}} = [\text {V}]_1 [\text {V}]_2 \ldots [\text {V}]_M [\text {class name}], \end{aligned}$$
     (7)

     where each \([\text {V}]_m\) (\(m\!\in \!\{1, \ldots , M\}\)) is a vector of the same size as the word embeddings, and M is the number of context tokens. We set M to 4 in our experiments.

Our results suggest that different templates have little impact on our model’s performance.
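The sketch below illustrates the learnable-template construction of Eq. 7 in simplified form; the `token_embedding` output and the `text_encoder` interface of a CLIP-like text tower are assumptions, not the original implementation.

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """CoOp-style prompt of Eq. 7: M learnable context vectors prepended to class-name tokens."""

    def __init__(self, embed_dim=512, num_ctx=4):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(num_ctx, embed_dim) * 0.02)   # [V]_1 ... [V]_M

    def forward(self, class_token_embeds, text_encoder):
        # class_token_embeds: (c, L, d) word embeddings of the tokenized class names
        c = class_token_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(c, -1, -1)           # shared context for all classes
        prompts = torch.cat([ctx, class_token_embeds], dim=1)   # (c, M + L, d)
        return text_encoder(prompts)                            # (c, d) class embeddings
```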

Computational Cost and Efficiency Table 7 presents our models' computational cost and efficiency, measured in terms of throughput on a single NVIDIA A100 GPU with a batch size of 16, which aligns with standard inference settings. Our models exhibit 29\(\times \) faster throughput and 44\(\times \) fewer FLOPs than the previous transformer-based method ViViT (Arnab et al., 2021), while maintaining the same accuracy. These results confirm the high efficiency of our approach.

4.4 Main Results

Table 7 Analysis on throughput
Table 8 Comparison with previous works on Kinetics-400

Regular Video Recognition We evaluate the performance of our model on the Kinetics-400 dataset, a challenging benchmark for regular video recognition. Table 8 provides a comparison of our model with state-of-the-art methods that were pre-trained on large-scale datasets such as ImageNet-21K (Deng et al., 2009), IG-65M (Ghadiyaram et al., 2019), JFT-300M (Sun et al., 2017), FLD-900M (Yuan et al., 2021), and JFT-3B (Zhai et al., 2021). To date, none of the three largest datasets (JFT-300M, FLD-900M, and JFT-3B) is open-sourced, and pre-trained models are not provided. Hence, we utilize the publicly available CLIP (Radford et al., 2021) checkpoints, which have been trained on 400 million web image-text pairs (WIT-400M). Significantly, using the same CLIP pre-trained backbones, our model demonstrates substantial performance improvements over EVL (Lin et al., 2022b) and ST-Adapter (Pan et al., 2022). Furthermore, our method achieves superior performance compared to methods pre-trained on JFT-300M (Sun et al., 2017) or FLD-900M (Yuan et al., 2021), while requiring less computational cost or a smaller resolution. Moreover, with a significant scale-up of the pre-training data to 2 billion samples (namely Merged-2B (Sun et al., 2023), which merges 1.6 billion samples from the LAION-2B (Schuhmann et al., 2022) dataset with 0.4 billion samples from the COYO-700M (Byeon et al., 2022) dataset), our method achieves an outstanding top-1 accuracy of 89.4%, solidifying its position as a state-of-the-art approach.

Table 9 Comparisons with previous works on ActivityNet
Table 10 Mean class accuracy on UCF-101 and HMDB-51 achieved by different methods which are transferred from their Kinetics models with RGB modality

To verify the generalization ability of our method, we further evaluate its performance on the widely-used untrimmed video benchmark ActivityNet-v1.3. Specifically, we fine-tune the Kinetics-400 pre-trained models, with a ViT-L backbone and 16 frames, on the ActivityNet-v1.3 dataset. The top-1 accuracy and mean average precision (mAP) are reported using the official evaluation metrics. As shown in Table 9, our method outperforms recent state-of-the-art models by a clear margin, achieving an mAP of 96.9%.

Table 11 Comparisons with previous works on few-shot action recognition

We also evaluate our method on the UCF-101 and HMDB-51 datasets to demonstrate its capacity to generalize to smaller datasets. We fine-tune our models on these two datasets using the ViT-L model pre-trained on Kinetics-400 and report the mean class accuracy on split one, using 16 frames as input. As shown in Table 10, our model exhibits strong transferability, achieving a mean class accuracy of 98.2% on UCF-101 and 81.3% on HMDB-51.

Few-Shot Video Recognition In few-shot video recognition, where only a few training samples are available, we investigate a more challenging K-shot C-way situation, instead of the conventional 5-shot 5-way configuration. We aim to categorize all categories in the dataset with just K samples per category for training, where the lower and upper bounds are denoted by the terms “Zero-shot” and “All-shot”, respectively. Using the CLIP-pretrained ViT-L/14 with 8 frames and TAP for few-shot video recognition, we report the Top-1 accuracy for the four datasets in Table 11. Despite the limited amount of data, our method demonstrates remarkable transferability to diverse domain data. Furthermore, our approach outperforms previous methods significantly, showing robustness in these extremely data-poor situations. For instance, when comparing the accuracy on HMDB-51 with 2-shot, our method outperforms Swin (Liu et al., 2022) and X-Florence (Ni et al., 2022) by +52.6% and +21.9%, respectively.

Multi-Label Video Recognition We mainly focused on the single-label video recognition scenario in the previous experiments. To further validate the performance of our method, we conducted experiments on multi-label video recognition tasks. The Charades dataset is a multi-label untrimmed video dataset containing long-term activities with multiple actions. For this task, we utilized the Kinetics-400 pre-trained ViT-L backbone for training and evaluated our results using the Mean Average Precision (mAP) metric. As shown in Table 12, our method achieved the highest performance of 46.0 mAP, demonstrating its effectiveness in multi-label video classification.

Zero-Shot Video Recognition In addition, we conducted experiments in the open-set setting. We use our Kinetics-400 pre-trained models (i.e., ViT-L with 8 frames) to perform zero-shot evaluations on four other video datasets. For UCF-101, HMDB-51, and ActivityNet, we follow two evaluation protocols from E2E (Brattoli et al., 2020):

  1. To make a fair comparison with previous works, we randomly selected half of the test dataset's classes: 50 for UCF-101, 25 for HMDB-51, and 100 for ActivityNet, and evaluated our method on them. We repeated this process ten times and averaged the results for each test dataset. We refer to this setting as UCF\(^*\), HMDB\(^*\), and ANet\(^*\).

  2. In the second evaluation protocol, we directly evaluated the full datasets to obtain more realistic accuracy scores.

Table 12 Comparison with previous works on Multi-Label video dataset Charades
Table 13 Comparison with previous works on zero-shot video recognition

For Kinetics-600, we chose 220 new categories outside Kinetics-400 for evaluation. We used the three splits provided by Chen and Huang (2021) and sampled 160 categories for evaluation from the 220 categories in Kinetics-600 for each split. We reported the mean accuracy for the three splits. As shown in Table 13, our method demonstrates a strong cross-dataset generalization ability, achieving significant improvements over previous zero-shot video recognition methods (+27.1% on UCF-101, +17.0% on HMDB-51, +52.1% on ActivityNet, +26.8% on Kinetics-600).

5 Experiments: Image Recognition

In this work, we also apply our method to image recognition. We conduct a comprehensive evaluation on 10 datasets that represent a diverse set of visual recognition tasks, i.e., ImageNet (Deng et al., 2009), StanfordCars (Krause et al., 2013), Caltech101 (Fei-Fei et al., 2004), OxfordPets (Parkhi et al., 2012), Flowers102 (Nilsback & Zisserman, 2008), Food101 (Bossard et al., 2014), FGVCAircraft (Maji et al., 2013), SUN397 (Xiao et al., 2010), DTD (Cimpoi et al., 2014), and EuroSAT (Helber et al., 2019). These tasks include classifying generic objects, scenes, and fine-grained categories, as well as specialized tasks such as texture recognition and satellite imagery analysis.

5.1 Training

For the pre-trained CLIP model, we use the ResNet-50 (He et al., 2016) as the default backbone for the image encoder, and the image backbone is updated during training. We train the model using the AdamW optimizer with an initial learning rate of 5e-6 and a cosine annealing schedule to reduce the learning rate gradually. We also employ a warmup strategy of 5 epochs. The maximum number of training epochs is set to 150.
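The stated recipe translates into a standard optimizer/scheduler pair. The sketch below is one common way to realize the 5-epoch warmup with per-epoch stepping, not necessarily the authors' exact implementation.

```python
import torch

def build_optimization(model, epochs=150, warmup_epochs=5, base_lr=5e-6):
    """AdamW with linear warmup followed by cosine annealing (per-epoch stepping assumed)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)
    warmup = torch.optim.lr_scheduler.LinearLR(
        optimizer, start_factor=0.01, total_iters=warmup_epochs)
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs - warmup_epochs)
    scheduler = torch.optim.lr_scheduler.SequentialLR(
        optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs])
    return optimizer, scheduler
```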

5.2 Main Results

Results on 10 Image Datasets As illustrated in Fig. 4, the performance of our method, the vision-only method, and the zero-shot method is evaluated on 10 image datasets, all trained with 16 shots. The results for the 10 datasets are arranged from left to right, and the average over the 10 datasets is presented on the far right. Our findings reveal that CLIP shows strong zero-shot performance on all 10 datasets. However, the vision-only method exhibits poor performance on all datasets. We posit that this may be attributed to the absence of a suitable classifier target; consequently, it may be susceptible to biases in the small samples, which can disrupt the well-pretrained image encoder. Our approach demonstrates a substantial improvement in recognition accuracy over both the vision-only and zero-shot methods on all 10 datasets. Specifically, the average improvement over the 10 datasets is 41% and 18% compared to the vision-only method and the zero-shot method, respectively. This indicates the effectiveness of our method in enhancing few-shot learning performance.

Fig. 4 Comparison of few-shot learning performance on 10 image datasets. Assessment of zero-shot CLIP, the vision-only method, and the proposed method underlines the significance of incorporating a suitable classifier target to mitigate biases in small samples and achieve high accuracy on a diverse set of image datasets

Table 14 presents a further comparison of our method with two other transfer methods, linear probe and CoOp (Zhou et al., 2021), on the widely used ImageNet dataset. We implement the linear probe method as instructed in the original CLIP paper (Radford et al., 2021). Our findings indicate that CoOp improves over the zero-shot model by a significant 4.77%. Notably, our proposed approach surpasses this, improving over the zero-shot model by 8.33%, underscoring the effectiveness of incorporating an appropriate classifier target.

Table 14 Comparison of our method with other tuning methods on ImageNet (using 16 shots)

6 Experiments: 3D Point Cloud Recognition

We further extend our approach to 3D point cloud recognition and evaluate it on the ModelNet40 dataset (Wu et al., 2015). This dataset comprises 12,311 3D CAD models across 40 categories, such as airplanes, cars, plants, and lamps. The point clouds are normalized to a unit sphere and divided into 9,843 training models and 2,468 testing models. ModelNet40 is a widely used benchmark for point cloud recognition.

6.1 Training

For the visual encoder, we use the ResNet-101 architecture (He et al., 2016) as the default backbone and apply multi-view perspective projection to the input point cloud following SimpleView (Goyal et al., 2021). SimpleView projects the point cloud from six orthogonal views: front, right, back, left, top, and bottom. In addition, we include the views from the upper/bottom-front/back-left corners, based on the observation of Zhang et al. (2022) that the left view is the most informative for few-shot recognition. For each view, a point with a 3D coordinate is projected onto a pixel on the 2D image plane, and its depth value is used as the pixel intensity, which is repeated three times for the RGB channels. Finally, all the resulting images are upsampled to 224\(\times \)224 to align with CLIP's settings.
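A simplified front-view rasterization is sketched below to illustrate the depth-image idea; the exact projection and depth convention used by SimpleView may differ from this orthographic approximation.

```python
import torch

def depth_image_from_points(points, resolution=224):
    """Orthographic front-view depth map from a unit-sphere point cloud (simplified).

    points: (N, 3) tensor with coordinates roughly in [-1, 1]; returns a (3, H, W)
    image whose depth values are replicated across the RGB channels, as described above.
    """
    xy = ((points[:, :2] + 1) * 0.5 * (resolution - 1)).long().clamp(0, resolution - 1)
    depth = (points[:, 2] + 1) * 0.5                # map z to a [0, 1] pixel intensity
    order = depth.argsort()                         # draw larger depths last (simple z-ordering)
    img = torch.zeros(resolution, resolution)
    img[xy[order, 1], xy[order, 0]] = depth[order]  # scatter depths onto the image plane
    return img.unsqueeze(0).repeat(3, 1, 1)         # replicate to three RGB channels
```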

We train the model using the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 2e-4 and a cosine annealing schedule to reduce the learning rate gradually. We also employ a warmup strategy of 10 epochs; the maximum number of training epochs is set to 250.

Fig. 5 Results of 3D point cloud recognition on the ModelNet40 dataset. Comparison of different tuning methods in the few-shot scenario

6.2 Main Results

Comparison with the Vision-Only Paradigm in the Few-Shot Scenario We evaluate our method using the few-shot evaluation protocol adopted in CLIP (Radford et al., 2021), which involves training with 1, 2, 4, 8, and 16 shots and deploying the models on the full test set. As shown in Fig. 5, we first present our method's zero-shot result (14.9%), obtained by directly utilizing the CLIP model to classify each view and averaging the results of the 10 views. We then compare the performance of different tuning models on 3D point cloud recognition. Our results show that all models gradually improve in accuracy as the number of training samples increases. Notably, our method (green curve) outperforms the vision-only method (orange curve) by a large absolute margin of 30%-40%, which is consistent with our findings in image and video recognition and validates the effectiveness of our approach. Additionally, our method significantly outperforms the linear probe method (blue curve), known as a strong few-shot learning baseline, at all training sample levels. These results confirm the effectiveness and superiority of our proposed approach, which leverages textual knowledge to improve transferability.

7 Conclusion and Limitation

This study presents a new paradigm for enhancing the transferability of visual recognition tasks based on the knowledge from the textual encoder of a well-trained vision-language model. Specifically, we initialize the classifier with semantic targets from the textual encoder and freeze it during optimization. We conduct extensive experiments to examine how the paradigm functions. First, we demonstrate that proper correlation among the target initializations is beneficial. Second, we show that the alignment of visual and textual semantics is key to improving few-shot performance and shortening the learning process. Finally, we verify the effectiveness of our proposed paradigm on three types of visual recognition tasks (i.e., image, video, and 3D point cloud recognition) across 17 visual datasets.

The study still has some limitations worth exploring in future research. i) The performance of the proposed paradigm is restricted by how the category labels are represented. For instance, in tasks such as human re-identification, the labels are often numerical values such as 0, 1, 2, etc. In this case, we cannot transfer any semantic information from the textual encoders, while transferring visual statistic knowledge (i.e., the LDA classifier) could still be helpful. ii) The performance of the proposed paradigm relies on the capacity of the vision-language pre-training models. Although we use CLIP as our source model in this study, obtaining models with better performance remains an open problem. iii) The way category names are described also impacts performance. For example, in the action recognition dataset Something-Something, category names such as "Putting something into something" and "Covering something with something" lack a clear target subject. Consequently, leveraging the prior knowledge of pre-aligned vision-language models becomes challenging, resulting in subpar performance.