1 Introduction

Video representation learning is fundamental to various downstream video-related applications, e.g., action recognition [1, 2], spatio-temporal detection [3, 4], video retrieval [5, 6], etc. Traditional supervised learning schemes require costly human labeling, and the performance is usually restricted by the granularity of the annotations. More precisely, coarse-grained video-level annotations could lead the model to attend to the background [1, 7], while fine-grained annotations greatly facilitate general video analysis but are much more expensive [8, 9]. Unsupervised video representation learning has therefore begun to attract more attention. Some early works designed diverse pretext tasks to learn video characteristics in a self-supervised manner [10–15]. More recently, the formulation of contrastive learning has further improved performance by a large margin [16–19].

A prevalent approach to contrastive video representation learning is to sample several clips and regard those from the same video as positive pairs [17, 20–22]. However, this formulation has two drawbacks. On the one hand, these methods tend to be biased toward the static background [23, 24]. This is because the sampled clips mostly share the same background while subtle differences in motion probably exist. For example, the video in Fig. 1 contains a high-jump scene. The clip sampled at an early timestamp shows the running action, whereas the clip sampled at a later timestamp presents the jumping action. Pulling these two clips closer in the feature space will therefore lead the model to neglect their distinct motions and only attend to the background of the stadium. On the other hand, there remains an obvious gap between clip-level features and video-level representation. The sampled clips have a limited temporal receptive field and thus cannot provide comprehensive information. For example, Clip 1 in Fig. 1 only shows the momentary process of running. Only when we jointly leverage the two correctly ordered clips, i.e., the running action occurs before jumping, can we understand the original video. Motivated by these observations, we intend to address these problems from two aspects: detailed region-level correspondence and general long-term temporal perception.

Figure 1

An illustration of clip sampling and temporal correspondence. We present a high-jump video with two sampled clips. The two clips have the same background but different motions: one running and the other jumping. We provide an example of temporal correspondence between a clip and the video, where we coarsely divide the video (clip) into three segments. The value in the matrix indicates the intersection ratio. The spatial correspondence can be calculated similarly

In this paper, we propose a framework to learn comprehensive appearance and motion patterns in videos. Concretely, we develop a set of controllable augmentations to achieve this goal. First, we use constrained spatio-temporal cropping to sample several local clips from each video such that the clips cover diverse timestamps of the video. Then we generate dense spatio-temporal position-wise correspondences between the local clip and global video feature maps based on the cropping parameters. Figure 1 presents a toy example of the temporal correspondence; the spatial correspondence is established analogously. We employ these soft correspondence codes to align features in corresponding regions. In this way, we can match exactly the same appearance and motion content while avoiding the alignment of inconsistent motions across different timestamps. However, there also exist “shortcuts” for identifying the overlapping regions between local clips and global videos, e.g., low-level color statistics. These shortcuts could prevent the model from learning useful semantics. To avoid them, we define different intensity levels of color jitter and Gaussian blur augmentations, and regard the samples generated by the same-level augmentation as sharing similar low-level attributes. We then minimize the mutual information between them to mitigate the impact of low-level shortcuts on the extracted representation.

To further bridge the gap between clip-level and video-level representations, we intuitively introduce a learning objective to model temporal order dependency between local clips and global video. In particular, we have access to the temporal order of the sampled clips in accordance with the cropping parameters. With that, we aim to maximize the mutual information between correctly ordered clip features and the global video features. Through this operation, we facilitate the temporal awareness of the model in the pretraining stage.

In summary, our contributions are as follows:

1) We propose a unified framework to learn video representations from detailed local contrast and general long-term temporal modeling.

2) We develop controllable augmentations to match the visual contents in corresponding spatio-temporal positions for detailed content alignment, and perform mutual information minimization to avoid low-level shortcuts.

3) We introduce the temporal order dependency between the local clips and global video to enhance general temporal structure modeling.

4) We achieve superior results on downstream action recognition and video retrieval tasks, while capturing more accurate motion patterns.

2 Related work

2.1 Contrastive learning

Recently, contrastive learning [25–27] has revolutionized self-supervised learning. Its core idea is to discriminate between instances by attracting positive pairs and repelling negative pairs in the feature space [28, 29]. Following this, Wu et al. [30] formulated instance discrimination as a non-parametric classification problem. Van den Oord et al. [27] introduced mutual information estimation with the InfoNCE loss [29], which leads to easy optimization and fast convergence. Inspired by this, a line of works [25, 26, 31, 32] adopted this learning objective for image representation learning and showed significant improvements on downstream tasks. Later, Xie et al. [33] and Wang et al. [34] developed dense contrastive learning, which performs pixel-level contrast. Compared to instance-level discrimination, dense contrastive learning preserves richer characteristics and performs better on dense prediction tasks and visual correspondence learning. In our work, we focus on video representation learning. Considering that natural spatio-temporal correspondences exist in the video domain, we propose to utilize them as self-supervisory signals for spatio-temporal region contrast to learn more comprehensive video representations.

2.2 Video representation learning

Unlike images, videos contain internal temporal structures that are crucial for video content analysis. To this end, many works [11, 14, 35] designed various pretext tasks to leverage the natural spatio-temporal correspondence as self-supervisory signals. Typical pretext tasks include temporal ordering [11, 14, 19], spatio-temporal puzzles [12, 15], colorization [36], playback speed prediction [10, 13], temporal cycle-consistency [37–39] and future prediction [40, 41]. There are also works [42, 43] using cross-modal correspondence for self-supervised pretraining. Inspired by the success of contrastive learning in the image domain, a series of works [16–18, 44] extended this pipeline to the video domain. In particular, Han et al. [45, 46] employed the information noise contrastive estimation (InfoNCE) loss for dense future prediction, while Wang et al. [18] and Yang et al. [47] sampled clips at different rates as positive pairs for visual content learning. However, video contrastive learning could lead the model to place more emphasis on the static scene and focus less on motion [23]. To solve this problem, Chen et al. [48] and Jenni et al. [13] integrated contrastive learning with temporal pretext tasks to enhance temporal awareness. Han et al. [20] and Li et al. [49] used optical flow to assist motion modeling. Qian et al. [50] and Ding et al. [51] used static frames, frame differences and consecutive frames to balance appearance and motion perception. Liu et al. [52] and Ding et al. [53] carefully designed motion-focused augmentations to place more emphasis on dynamic motions. In our work, we do not resort to frame differences or optical flow to enhance motion learning and temporal modeling. Instead, we hypothesize that the underlying reason for the static scene bias lies in the positive pair formulation. Most existing works use either different frames [16, 19] or different clips [17, 21] from the same video as the positive pair, which usually share similar backgrounds but possess different motions. Hence, we propose to consider the corresponding regions within local and global views to form accurate positive pairs, together with low-level shortcut elimination, which captures the desired static and dynamic characteristics. In addition, we develop a temporal dependency between these views to bridge the gap between clip-level and video-level representations while learning robust temporal structures.

2.3 Local-global views for video representation

There have also been some works using local and global views for self-supervised video representation learning [21, 54–57]. The major difference between our work and these works lies in the definition of local-global views and their target. In our work, “local-global” refers to short clips and the long video, and the major target is to construct spatio-temporal overlaps and formulate a soft learning objective that guides detailed region-level video content alignment. In Ref. [54], local-global meant local fine-grained and global coarse-grained features, which were designed for general audio-visual correspondence. Recasens et al. [55] aimed to extrapolate the neighboring video content in the global view based on the observation from the local view. Dave et al. [56] designed a loss function to learn temporal correspondence between local and global clips, but still with hard positive assignment. Behrmann et al. [57] employed local-global views to decompose stationary and non-stationary features, and Kuang et al. [21] used them for segment-based positive sampling. Qing et al. [58] built hierarchical structures on videos and employed multi-level temporal consistency to guide local and global video representation learning.

3 Method

The core idea of our proposed framework is to enhance self-supervised video representation learning by comprehensive appearance and motion content modeling. As displayed in Fig. 2, we utilize a set of controllable augmentations to achieve detailed spatio-temporal region contrast, low-level shortcut elimination and general temporal dependency modeling.

Figure 2

An overview of the proposed local-global composition framework. We define a set of controllable augmentations, including position transformations \(\tau _{\text{p}}\) and low-level statistics augmentations \(\tau _{\text{l}}\), to generate the global video and local clip inputs. Based on the extracted features, we perform spatio-temporal region contrastive learning for accurate visual content alignment, and minimize the mutual information between samples with similar low-level statistics to eliminate the shortcut. We construct the local-global temporal order dependency to bridge the gap between clip-level and video-level features. Note that in this figure, we use cubes to represent videos or clips; similar cube color means that they derive from the same video, and similar color brightness means that the two cubes share similar low-level statistics

Specifically, we divide the augmentations into two parts: spatio-temporal position transformations \(\tau _{\text{p}}\) that include crop and horizontal flip, and low-level statistic transformations \(\tau _{\text{l}}\) that include color jitter and Gaussian blur. Following the data preprocessing pipeline, given a video v, we first use \(\tau _{\text{p}}\) to sample several local clips and then perform \(\tau _{\text{l}}\) to generate the input to the encoder.

3.1 Spatio-temporal region contrast

Given a video v with temporal length T, we first use a set of spatio-temporal position transformations \(\tau _{\text{p}}^{k}\in \{\tau _{\text{p}}^{1},\tau _{\text{p}}^{2}, \ldots ,\tau _{\text{p}}^{K}\}\) to sample K clips \(v_{k}\in \{v_{1},v_{2},\ldots ,v_{K}\}\), to provide the local feature descriptions. To let the sampled clips contain as much information as the original video, we manually constrain the temporal cropping parameters in \(\tau _{\text{p}}^{k}\) to control the central timestamp of \(v_{k}\) in the range of \([{(k-1)T}/{K},{kT}/{K} ]\). In this way, sampled clips cover different temporal segments and they jointly present the rich information in v. As mentioned in Sect. 1, there could be inconsistencies in motions between different local clips such that it is not optimal to align the representations between different clips. Hence, we need to determine the exact corresponding content for feature alignment. To this end, considering that there is a natural correspondence between local clips and global video, we leverage v and \(v_{k}\) as two views for feature matching.
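To make the sampling constraint concrete, the following is a minimal sketch of how the constrained temporal cropping could be implemented; the function name and the frame-index representation are illustrative assumptions, not the authors' released code.

```python
import torch


def sample_local_clip_indices(num_frames, num_clips=4, clip_len=16):
    """Sample K local clips whose central timestamps fall into K
    consecutive, equally sized temporal segments of the video.

    num_frames: total number of frames T in the video.
    Returns a list of K LongTensors holding the frame indices of each clip.
    """
    clips = []
    for k in range(num_clips):
        # constrain the clip center to [k * T / K, (k + 1) * T / K)
        lo = num_frames * k / num_clips
        hi = num_frames * (k + 1) / num_clips
        center = lo + (hi - lo) * torch.rand(1).item()
        start = int(round(center - clip_len / 2))
        start = max(0, min(start, num_frames - clip_len))
        clips.append(torch.arange(start, start + clip_len))
    return clips


# e.g., a 300-frame video split into 4 local 16-frame clips
indices = sample_local_clip_indices(300)
```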

For local clip feature extraction, we denote the feature extractor as \(f(\cdot )\) and the local clip feature map as \(f(v_{k})\in \mathbb{R}^{CT_{\text{c}}HW}\), where C, H and W denote the channel, height and width, and \(T_{\text{c}}\) denotes the temporal dimension of the clip feature map. For global video feature extraction, we perform sparse sampling to represent v, and set the temporal stride of some convolution layers to 1 so that \(f'(v)\in \mathbb{R}^{CT_{\text{v}}HW}\) possesses a higher temporal resolution, i.e., temporal dimension \(T_{\text{v}}>T_{\text{c}}\). Note that f and \(f'\) share the same architecture and only differ in the temporal stride. Details of the network settings are described in Sect. 4.2.

Based on \(f(v_{k})\) and \(f'(v)\), we refer to the augmentation parameters in \(\tau _{\text{p}}^{k}\) to calculate the dense spatio-temporal position correspondence. Specifically, we use \(S_{k}\in \mathbb{R}^{N_{\text{c}}\times N_{\text{v}}}\) to indicate the correspondence result, where \(N_{\text{c}}=T_{\text{c}}HW\), \(N_{\text{v}}=T_{\text{v}}HW\). \(S_{k}(i,j)\) reveals the correspondence score between the ith spatio-temporal grid in \(f(v_{k})\) and the jth grid in \(f'(v)\). Essentially, each grid on the feature map is equivalent to a tube covering a certain spatio-temporal area as illustrated in Fig. 2, and \(S_{k}(i,j)\) is measured by the ratio of the intersection of two tubes over the volume of tube \(f(v_{k})[i]\):

$$\begin{aligned} S_{k}(i,j)= \frac{\mathit{inter}(f(v_{k})[i],f'(v)[j])}{\mathit{vol}(f(v_{k})[i])}, \end{aligned}$$
(1)

where \([\cdot ]\) denotes the grid index, \(\mathit{vol}(\cdot )\) measures the spatio-temporal volume of the given feature tube, and \(\mathit{inter}(\cdot )\) measures the intersecting volume between two tubes. The detailed computation process is illustrated in Sect. 4.2. In this formulation, each row of \(S_{k}\) sums to 1, i.e., \(\sum_{j=1}^{N_{\text{v}}}S_{k}(i,j)=1\) for every i. This indicates that each row in \(S_{k}\) can be treated as a probability distribution describing the correspondence between \(f(v_{k})[i]\) and each grid in \(f'(v)\).

Therefore, we utilize the calculated correspondence matrix \(S_{k}\) as the reference distribution to guide spatio-temporal region feature contrast for accurate visual content alignment. Specifically, we take \(f(v_{k})[i]\) as a query for illustration. Recall that the InfoNCE loss can be written as the cross-entropy between a prior distribution, i.e., the indicator function, and the feature similarity distribution:

$$\begin{aligned} \mathcal{L}_{\text{nce}}(i) = -\sum_{j} \mathbb{I}_{ij}\log \frac{{\mathit{sim}}(\boldsymbol{q}_{i},\boldsymbol{k}_{j})}{\sum_{l}{\mathit{sim}}(\boldsymbol{q}_{i},\boldsymbol{k}_{l})}, \end{aligned}$$
(2)

where q and k respectively denote the query and key features in contrastive learning, \(\mathbb{I}_{ij}=1\) if \(i=j\) and \(\mathbb{I}_{ij}=0\) otherwise, and \(\mathit{sim}(\cdot ,\cdot )=\exp (\cos (\cdot ,\cdot )/\tau )\) measures the feature similarity. In our formulation, we replace the prior \(\mathbb{I}_{ij}\) with the soft distribution \(S_{k}(i,j)\) for accurate region contrast. Since the correspondence between \(v_{k}\) and clips from other videos naturally equals 0, we can intuitively enlarge the negative pool by introducing features from other videos. Thus, the spatio-temporal region contrast loss over all \(f(v_{k})[i]\) can be formulated as

$$\begin{aligned} &\mathcal{L}_{\text{rc}} = -\sum_{k=1}^{K} \sum_{i=1}^{N_{\text{c}}} \sum _{j=1}^{N_{\text{v}}}S_{k}(i,j)\log p_{k}^{ij}, \end{aligned}$$
(3)
$$\begin{aligned} &p_{k}^{ij} = \frac{\mathit{sim}(f(v_{k})[i],f'(v)[j])}{\sum_{j=1} ^{N_{\text{v}}}\mathit{sim}(f(v_{k})[i],f'(v)[j])+\sum_{\boldsymbol{n}}\mathit{sim}(f(v_{k})[i],\boldsymbol{n})}, \end{aligned}$$
(4)

where \(\boldsymbol{n}\in \mathbb{R}^{C}\) denotes the negative features sampled from other videos in the mini-batch. Note that we sample the global views of other videos to form the negative pairs by default, and we include an ablation study in the experimental part. In this way, we are able to align the exact corresponding visual contents including both static appearance and dynamic motions in videos.
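As a concrete illustration of Eqs. (3) and (4), the following PyTorch-style sketch computes the soft region contrast loss from flattened feature grids; the tensor shapes and the helper name are assumptions for illustration, not the exact implementation.

```python
import torch
import torch.nn.functional as F


def region_contrast_loss(clip_feat, video_feat, corr, neg_feat, tau=0.1):
    """Soft spatio-temporal region contrast (Eqs. (3)-(4)).

    clip_feat:  (N_c, C) flattened local clip feature grids f(v_k)[i]
    video_feat: (N_v, C) flattened global video feature grids f'(v)[j]
    corr:       (N_c, N_v) correspondence matrix S_k, rows sum to 1
    neg_feat:   (N_n, C) negative grids sampled from other videos
    tau:        temperature of the cosine similarity
    """
    q = F.normalize(clip_feat, dim=-1)
    k_pos = F.normalize(video_feat, dim=-1)
    k_neg = F.normalize(neg_feat, dim=-1)

    sim_pos = torch.exp(q @ k_pos.t() / tau)          # (N_c, N_v)
    sim_neg = torch.exp(q @ k_neg.t() / tau).sum(-1)  # (N_c,)

    # p_k^{ij}: positive similarity over all positives plus negatives
    p = sim_pos / (sim_pos.sum(-1, keepdim=True) + sim_neg.unsqueeze(-1))

    # cross-entropy against the soft prior S_k instead of a one-hot indicator
    return -(corr * torch.log(p + 1e-8)).sum(-1).mean()
```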

3.2 Low-level shortcut elimination

However, the local-global spatio-temporal correspondence used for region feature contrast admits a “shortcut” that relies merely on low-level statistics, e.g., the color distribution, to identify the overlapping areas. This shortcut could prevent the model from learning meaningful semantic features. To this end, we aim to mitigate the impact of low-level statistics on the extracted representations.

An intuitive way to solve this problem is by utilizing strong augmentations. However, we find that this is not enough in the video domain. Unlike images, the temporal continuity between sampled frames could provide extra cues to learn these shortcuts. For example, the continuous change in illumination helps to determine the corresponding segments in the local-global view. It is nontrivial to design augmentations to decouple such low-level information from the final representations. Motivated by adversarial learning, a promising approach is to learn a low-level information estimator from semantically inconsistent samples that share similar low-level statistics. Then, we let the encoder minimize this estimated information.

We note that the color and blur augmentations \(\tau _{\text{l}}\) directly control the distortion of low-level statistics. In other words, similar augmentations generate samples that share similar low-level characteristics. Hence, we define several intensity levels of \(\tau _{\text{l}}\) by constraining the augmentation parameters to certain ranges. As such, we can use the controlled \(\tau _{\text{l}}\) to generate frame sequences that possess distinct semantics but similar low-level statistics. Then, we build a mutual information estimator on top of the extracted feature representation for low-level information extraction. Note that there are several ways to approximate the mutual information; we compare different estimation methods in Sect. 4.4. For illustration, we take MINE [59] as an example. Following Ref. [59], we approximate the mutual information between two variables by

$$\begin{aligned} I_{\Theta}(X;Y) = \sup_{\theta \in \Theta}\mathcal{E}_{\mathcal{P}_{XY}}[G_{ \theta}]- \log \bigl(\mathcal{E}_{\mathcal{P}_{X}\otimes \mathcal{P}_{Y}}\bigl[ \text{e}^{G_{\theta}}\bigr]\bigr), \end{aligned}$$
(5)

where \(\mathcal{P}\) denotes a probability distribution and \(\mathcal{E}\) represents the expectation over the corresponding distribution. X and Y are the feature representations extracted by encoder f. The projection function \(G_{\theta}\) maps a pair of variables X and Y sampled from the spaces \(\mathcal{X}\) and \(\mathcal{Y}\) to a scalar value, i.e., \(G_{\theta}:\mathcal{X}\times \mathcal{Y}\rightarrow \mathbb{R}\). It is instantiated by a neural network with parameters \(\theta \in \Theta \), where Θ is the parameter set for optimization. Empirically, we instantiate \(G_{\theta}\) as a two-layer multi-layer perceptron (MLP). We regard the features of sample pairs generated from the same intensity level of \(\tau _{\text{l}}\) as samples of the joint distribution \(\mathcal{P}_{XY}\), and the features of arbitrary sample pairs as samples of the marginal \(\mathcal{P}_{X}\otimes \mathcal{P}_{Y}\), where ⊗ denotes the product of the two marginal distributions. During training, we formulate the learning objective as

$$\begin{aligned} \mathcal{L}_{\text{mi}} = \min_{f}\max _{\theta }\mathcal{E}_{ \mathcal{P}_{XY}}[G_{\theta}]-\log \bigl( \mathcal{E}_{\mathcal{P}_{X} \otimes \mathcal{P}_{Y}}\bigl[\text{e}^{G_{\theta}}\bigr]\bigr). \end{aligned}$$
(6)

We maximize Eq. (6) with respect to the MLP parameters θ to obtain a reliable low-level information extractor, but reverse the gradient back-propagated to the encoder f so that f minimizes Eq. (6). With the learned low-level information estimator \(G_{\theta}\), we further apply it to the aforementioned local-global pairs, \(f(v_{k})\) and \(f'(v)\), to minimize the low-level shortcut by optimizing f while keeping θ fixed. In this way, we minimize the impact of low-level statistics on the spatio-temporal region feature contrast and facilitate detailed semantic alignment.
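The adversarial optimization of Eq. (6) can be realized with a gradient-reversal layer, so that a single backward pass updates the MLP head \(G_{\theta}\) to maximize the bound while the encoder receives the reversed gradient and minimizes it. The sketch below assumes pooled feature vectors and a standard gradient-reversal trick; it is an illustrative approximation, not the exact training code.

```python
import math
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -grad


class MINEHead(nn.Module):
    """Two-layer MLP G_theta that scores a pair of features (Eq. (5))."""

    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, 1)
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)


def mine_adversarial_loss(head, feat_x, feat_y):
    """Adversarial MINE objective (Eqs. (5)-(6)).

    feat_x, feat_y: (B, C) encoder features of sample pairs produced by the
    same intensity level of tau_l (treated as the joint distribution);
    shuffling feat_y within the batch approximates the marginal.
    Returns the *negative* MINE estimate, so minimizing it drives the head
    G_theta to maximize the bound, while the gradient-reversal layer makes
    the encoder minimize the estimated mutual information.
    """
    feat_x = GradReverse.apply(feat_x)
    feat_y = GradReverse.apply(feat_y)
    joint = head(feat_x, feat_y).mean()
    perm = torch.randperm(feat_y.size(0))
    marginal = torch.logsumexp(head(feat_x, feat_y[perm]), dim=0) \
        - math.log(feat_y.size(0))
    return marginal - joint
```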

3.3 Local-global temporal dependency

Now, we have learned robust clip features from the detailed region semantic contrast, and the remaining task is to bridge the gap between clip-level and video-level representations. Considering that internal temporal relationships exist between the sampled local clips and are naturally contained in the global video, we propose to model the temporal order dependency between \(f(v_{k})\), \(k=\{1,2,\ldots ,K\}\), and \(f'(v)\) to enhance video-level understanding.

Similar to Sect. 3.2, we also use mutual information to measure the local-global temporal order dependency. The target is to maximize the mutual information between correctly ordered clip-level features and the video-level representation. Mathematically, we define the sequentially ordered clip features as \(\overline{f}(v)= [f(v_{1})\circ f(v_{2})\circ \cdots \circ f(v_{K}) ]\), where ∘ denotes concatenation operation, and the arbitrarily ordered features as \(\widetilde{f}(v)\). To model the temporal dependency, we regard \(\overline{f}(v)\) and \(f'(v)\) as sampled from the joint distribution \(\mathcal{P}_{XY}\), and \(\widetilde{f}(v)\) and \(f'(v)\) as sampled from the marginal distribution \(\mathcal{P}_{X}\otimes \mathcal{P}_{Y}\). In this formulation, the learning objective can be written as

$$\begin{aligned} \mathcal{L}_{\text{td}} = \max_{f,\psi}\mathcal{E}_{\mathcal{P}_{XY}}[G_{ \psi}]- \log \bigl(\mathcal{E}_{\mathcal{P}_{X}\otimes \mathcal{P}_{Y}}\bigl[ \text{e}^{G_{\psi}}\bigr]\bigr), \end{aligned}$$
(7)

where \(G_{\psi}\) is the mutual information estimation head with parameters ψ. Several alternatives exist to instantiate \(G_{\psi}\), and we discuss this in Sect. 4.4.
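A minimal sketch of the temporal dependency objective in Eq. (7) is given below, assuming pooled per-clip features and a scoring head \(G_{\psi}\) that returns one scalar per (clip sequence, video) pair; the batch-wise shuffling used to approximate the marginal is an assumption for illustration.

```python
import math
import torch


def temporal_dependency_loss(head, clip_feats, video_feat):
    """Local-global temporal order dependency (Eq. (7)).

    head:       a scoring network G_psi over (clip sequence, video) pairs
                that returns one scalar per sample (see Sect. 4.4)
    clip_feats: (B, K, C) pooled clip features in their true temporal order
    video_feat: (B, C) global video features f'(v)
    Correctly ordered sequences paired with f'(v) act as the joint
    distribution; randomly permuted sequences act as the marginal.
    """
    b, k, c = clip_feats.shape
    perm = torch.stack([torch.randperm(k) for _ in range(b)])      # shuffled orders
    shuffled = torch.gather(
        clip_feats, 1, perm.unsqueeze(-1).expand(-1, -1, c)
    )                                                              # f_tilde(v)

    joint = head(clip_feats, video_feat).mean()
    marginal = torch.logsumexp(head(shuffled, video_feat), dim=0) - math.log(b)
    # maximized with respect to both the encoder f and the head parameters psi
    return -(joint - marginal)
```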

There are several alternatives for establishing the marginal distribution \(\mathcal{P}_{X}\otimes \mathcal{P}_{Y}\) in Eq. (7). By default, we formulate it as a uniform distribution over differently ordered video clips \(\widetilde{f}(v)\). Empirically, this formulation places equal emphasis on all order combinations. However, among the shuffled orders, some are quite trivial to discriminate while others are difficult to perceive. To this end, we refer to the transformation parameters \(\tau _{\text{p}}\) in our controllable augmentations and evaluate the difficulty of each order so as to pay more attention to the hard examples. In particular, given an order indicator \(\mathcal{O}\in \mathbb{N}^{K}\), where \(\mathcal{O}[k]=k\) for the correct order, we denote the central timestamp of the kth clip as \(t_{\mathcal{O}[k]}\), and the oracle central timestamp is \(\hat{t}_{k}={(2k-1)T}/{(2K)}\). We calculate the sum of the central timestamp deviations to produce the difficulty score of order \(\mathcal{O}\):

$$\begin{aligned} D(\mathcal{O}) = -\sum_{k=1}^{K} \vert t_{\mathcal{O}[k]}-\hat{t}_{k} \vert . \end{aligned}$$
(8)

A lower deviation indicates a higher learning difficulty. We apply softmax normalization to the difficulty scores to generate the sampling probability of each order combination. In this way, we formulate a marginal distribution that emphasizes hard examples to improve learning efficiency.
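The difficulty-aware sampling can be sketched as follows: we enumerate all K! orders, score each with Eq. (8), and turn the scores into sampling probabilities via a temperature-controlled softmax. The helper below is hypothetical and only illustrates the computation.

```python
import itertools
import torch


def order_sampling_probs(centers, video_len, temperature=1.0):
    """Difficulty-aware sampling distribution over clip orders (Eq. (8)).

    centers:   (K,) central timestamps of the K sampled clips
    video_len: video length T
    Returns all K! orders and their sampling probabilities; orders whose
    clip centers deviate less from the oracle centers are harder and are
    sampled more often.
    """
    k = centers.numel()
    oracle = (2 * torch.arange(1, k + 1) - 1) * video_len / (2 * k)
    orders = list(itertools.permutations(range(k)))
    scores = torch.stack(
        [-(centers[list(o)] - oracle).abs().sum() for o in orders]
    )
    probs = torch.softmax(scores / temperature, dim=0)
    return orders, probs


# example: 4 clips centered at frames 20, 30, 60 and 90 of a 100-frame video
orders, probs = order_sampling_probs(torch.tensor([20., 30., 60., 90.]), 100.)
```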

It is worth noting that some previous works use temporal order to build pretext tasks for self-supervised learning [11, 14, 35]. The major difference is that our approach incorporates the video-level feature to determine whether the clips are correctly ordered, while Refs. [11, 14, 35] have no access to the global feature. In this way, our formulation avoids the ambiguity that arises when the temporal structure cannot be determined solely from local clips. For example, in a complex gymnastics scene, it is difficult to determine the temporal order of gymnastic actions from local clips alone, but it is practical to recover the correct order with reference to the global video feature. Thus, our local-global mutual temporal order constraint is a better way to embed video-level temporal structures into the extracted representations.

3.4 Training

We jointly train our model with the aforementioned three objectives:

$$\begin{aligned} \mathcal{L} = \mathcal{L}_{\text{rc}} + \alpha \mathcal{L}_{\text{mi}} + \beta \mathcal{L}_{\text{td}}, \end{aligned}$$
(9)

where α and β serve as balancing hyper-parameters. We set \(\alpha =\beta =1\) by default, and find that the performance is fairly robust to these hyper-parameters.

In addition, we also explore applying a curriculum evolving strategy to the parameters of the controllable augmentations to adjust the training process. Intuitively, motivated by Refs. [60, 61], it is promising to learn from easier samples first and then gradually expand to more difficult tasks. This effect is more pronounced in self-supervised learning due to the absence of human annotations [62]. To this end, we design an evolving strategy for the augmentation parameters that allows the model to learn in an easy-to-hard manner. Specifically, at the beginning of training, we strictly constrain the cropping parameters in \(\tau _{\text{p}}\) to construct clearly distinct local clips, reducing the difficulty of dense region contrast and temporal dependency modeling. Additionally, we define well-separated intensity levels in \(\tau _{\text{l}}\) so that low-level statistics are easier to capture. As training proceeds, we gradually relax the constraints on the augmentation parameters to increase the learning difficulty. We present the detailed formulation of the dynamic parameter evolving process in the implementation details, and show an empirical comparison with constant augmentation parameters in the ablation study.

4 Experiment

4.1 Datasets

We use 4 video action recognition datasets, Kinetics-400 [1], UCF-101 [7], HMDB-51 [63] and Diving-48 [9]. Kinetics-400 [1] is a large-scale dataset consisting of 240 K video clips with 400 human action classes. UCF-101 [7] contains over 13 K clips covering 101 action classes. HMDB-51 [63] covers 51 action categories and approximately 7 K annotated clips. Diving-48 [9] contains 48 different diving actions, which mainly vary in motion patterns and share similar backgrounds. In our experiments, we use the training set of UCF-101 or Kinetics-400 for self-supervised pretraining. For the downstream tasks, following Refs. [10, 46, 56], we use split 1 of UCF-101 and HMDB-51, and the test split V1 of Diving-48 for evaluation.

4.2 Implementation details

Self-supervised pretraining

For global video input, we sparsely sample 16 frames with weak spatial cropping. For local clip input, we constrain the temporal cropping parameters to make K 16-frame clips approximately uniformly distributed in the video. The local clips are spatially cropped within the global view to ensure position-wise correspondence. For low-level augmentations, we define a set of color jitter and Gaussian blur parameters to form different intensity-level transformations. We resize the input frame sequence into \(16\times 112\times 112\), and use R3D-18 [64] as the video encoder. For local clip feature extraction, we follow the default setting and the feature resolution is \(2\times 4\times 4\). For global video feature extraction, we set the temporal stride of the last 3 stages to 1, so that the feature resolution is \(8\times 4\times 4\). We calculate the spatio-temporal correspondence matrix between local and global feature maps based on the cropping and flipping parameters for optimization.

In terms of training settings, we use a batch size of 128 and set the number of local clips K to 4 by default. We train our model on UCF-101 for 200 epochs and on Kinetics-400 for 100 epochs. We use the Adam optimizer with an initial learning rate of \(1.0\times 10^{-3}\) and a weight decay of \(1.0\times 10^{-5}\). The learning rate is decayed by a factor of 10 at epoch 70 for Kinetics-400 and at epoch 150 for UCF-101.

Action recognition

We load the pretrained video encoder parameters except for the last fully-connected layer. There are two protocols: (1) end-to-end finetuning of the whole network with action labels; (2) freezing the encoder and only training the linear classifier, also known as the linear probe. For evaluation, we follow Refs. [14, 18] and uniformly sample 10 clips per video, which are center cropped and resized to \(112\times 112\). We average the softmax probabilities of the clips as the final prediction and report the Top-1 accuracy.
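A minimal sketch of this clip-averaged evaluation protocol is given below, assuming the model maps a batch of clips to class logits; it is illustrative rather than the exact evaluation script.

```python
import torch


@torch.no_grad()
def evaluate_top1(model, videos, labels, num_clips=10):
    """Clip-averaged Top-1 accuracy.

    model:  maps a (num_clips, C, 16, 112, 112) tensor to class logits
    videos: iterable yielding the 10 uniformly sampled, center-cropped
            clips of each test video
    labels: iterable of integer class labels
    """
    correct = 0
    total = 0
    for clips, label in zip(videos, labels):
        probs = torch.softmax(model(clips), dim=-1).mean(0)  # average over clips
        correct += int(probs.argmax().item() == label)
        total += 1
    return correct / total
```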

Video retrieval

We directly use the pretrained model to extract video features without finetuning. Following Refs. [14, 65], we regard videos in the test set as queries, and retrieve nearest neighbors from the training set. Similar to action recognition, we average the features of ten uniformly sampled clips as the global representation. We report Top-k recall R@k.
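The retrieval metric can be sketched as follows, assuming pre-extracted, clip-averaged features for the query (test) and gallery (training) sets; the helper function is hypothetical.

```python
import torch
import torch.nn.functional as F


def recall_at_k(query_feats, query_labels, gallery_feats, gallery_labels,
                ks=(1, 5, 10, 20)):
    """Top-k recall (R@k) for video retrieval.

    Test videos serve as queries and training videos form the gallery;
    each feature is the average over ten uniformly sampled clips.
    """
    q = F.normalize(query_feats, dim=-1)
    g = F.normalize(gallery_feats, dim=-1)
    sim = q @ g.t()                                      # (num_query, num_gallery)
    ranked = gallery_labels[sim.argsort(dim=-1, descending=True)]
    hits = ranked == query_labels.unsqueeze(-1)
    return {k: hits[:, :k].any(-1).float().mean().item() for k in ks}
```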

Controllable augmentations

We also provide a detailed illustration of our controllable augmentations. We describe the implementations of random spatial crop, random temporal crop, random horizontal flip, color jitter and Gaussian blur, respectively. We use the default setting of 4 local clips and 512 low-level augmentation intensity levels for illustration, and provide the detailed evolving progress of the dynamic augmentation parameters; a code sketch of the parameter evolving follows the list below.

1) Random temporal crop. For global video, we do not perform temporal cropping, but uniformly sample 16 frames. For local clips, we constrain the central frame to respectively be located in the timespan \([0.00,0.25]\), \([0.25,0.50]\), \([0.50,0.75]\) and \([0.75,1.00]\) to cover the whole video. For the dynamic augmentation parameter design, we constrain the time interval to \([0.10,0.15]\), \([0.35,0.40]\), \([0.60,0.65]\) and \([0.85,0.90]\) at the beginning to generate clips with more easily distinguishable temporal boundaries. Then we linearly expand the time interval to the default \([0.00,0.25]\), \([0.25,0.50]\), \([0.50,0.75]\) and \([0.75,1.00]\) in each training iteration.

2) Random spatial crop. For global video, we perform weak spatial cropping, with the area ratio in the range \([0.8,1.0]\). For local clips, we perform strong spatial cropping relative to the cropped global video, with an area ratio of \([0.4,0.8]\). Similarly, for the dynamic curriculum training, we first constrain the local clip cropping ratio to \([0.7,0.8]\) for more discriminative region contrast. Then we linearly relax the range to the default \([0.4,0.8]\) to increase the learning difficulty with more diverse and noisier dense correspondence.

3) Random horizontal flip. For both global video and local clips, we perform random horizontal flipping with a probability of 0.5.

4) Color jitter. Referring to the default settings in Refs. [20, 22], the brightness (B), contrast (C) and saturation (S) are in the range \([0.0,0.4]\), and the hue (H) is in the range \([0,0.1]\). We uniformly divide B, C and S into 4 groups each, with intensities from weak to strong in the ranges \([0.0,0.1]\), \([0.1,0.2]\), \([0.2,0.3]\) and \([0.3,0.4]\), and divide H into 2 groups in the ranges \([0.00,0.05]\) and \([0.05,0.10]\). In the dynamic training, we initialize the intensity levels as \([0.04,0.06]\), \([0.14,0.16]\), \([0.24,0.26]\) and \([0.34,0.36]\) so that differences in low-level statistics are easy to capture. Along the training progress, we linearly expand the intensity levels to the default setting, \([0.0,0.1]\), \([0.1,0.2]\), \([0.2,0.3]\) and \([0.3,0.4]\).

5) Gaussian blur. We adopt 2 different radii, 7 and 11, and 2 different sigma ranges, \([0.1,0.5]\) and \([0.5,2.0]\), resulting in 4 combinations. In the same manner, we initially set the sigma ranges to \([0.25,0.35]\) and \([1.2,1.3]\), then linearly relax them to \([0.1,0.5]\) and \([0.5,2.0]\) to formulate an easy-to-hard training.
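All of the dynamic schedules above are linear interpolations between an initial and a final parameter range. A minimal sketch of such an evolving range is given below; the helper name and the interpolation form are assumptions consistent with the description above.

```python
def evolve_range(initial, final, progress):
    """Linearly interpolate an augmentation parameter range.

    initial, final: (low, high) tuples at the start / end of training
    progress:       current training progress in [0, 1]
    """
    lo = initial[0] + (final[0] - initial[0]) * progress
    hi = initial[1] + (final[1] - initial[1]) * progress
    return lo, hi


# e.g., the temporal window of the first local clip grows from [0.10, 0.15]
# toward the default [0.00, 0.25] over the course of training
print(evolve_range((0.10, 0.15), (0.00, 0.25), progress=0.5))
```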

With the help of these controllable augmentations, the spatio-temporal correspondence is calculated by the ratio of the intersection of two tubes. As demonstrated in Fig. 3, we have global video feature map \(F_{\text{v}}\) and local clip feature map \(F_{\text{c}}\). We aim to calculate the spatio-temporal correspondence matrix S, where \(S[i,j]\) indicates the correspondence score between the ith grid in \(F_{\text{c}}\) and the jth grid in \(F_{\text{v}}\). For better illustration, we assume \(F_{\text{c}}[i]\) covers the area \([(t_{\text{c}}^{1},t_{\text{c}}^{2}),(h_{\text{c}}^{1},h_{\text{c}}^{2}),(w_{ \text{c}}^{1},w_{\text{c}}^{2})]\), \(F_{\text{v}}[j]\) covers the area \([(t_{\text{v}}^{1},t_{\text{v}}^{2}),(h_{\text{v}}^{1},h_{\text{v}}^{2}),(w_{ \text{v}}^{1},w_{\text{v}}^{2})]\). Then the intersection can be easily written as

$$\begin{aligned}& \textit{inter\_t} = \max \bigl(\min \bigl(t_{\text{c}}^{2},t_{\text{v}}^{2} \bigr)- \max \bigl(t_{\text{c}}^{1},t_{\text{v}}^{1} \bigr),0\bigr), \end{aligned}$$
(10)
$$\begin{aligned}& \textit{inter\_h} = \max \bigl(\min \bigl(h_{\text{c}}^{2},h_{\text{v}}^{2} \bigr)- \max \bigl(h_{\text{c}}^{1},h_{\text{v}}^{1} \bigr),0\bigr), \end{aligned}$$
(11)
$$\begin{aligned}& \textit{inter\_w} = \max \bigl(\min \bigl(w_{\text{c}}^{2},w_{\text{v}}^{2} \bigr)- \max \bigl(w_{\text{c}}^{1},w_{\text{v}}^{1} \bigr),0\bigr), \end{aligned}$$
(12)
$$\begin{aligned}& \mathit{inter} = \textit{inter\_t}\times \textit{inter\_h}\times \textit{inter\_w}. \end{aligned}$$
(13)
Figure 3

An example of global video and local clip feature maps. \(F_{\text{v}}\) and \(F_{\text{c}}\) denote video-level and clip-level features, respectively. Each feature grid covers a certain spatio-temporal area, which equals a tube in the video. We calculate the spatio-temporal correspondence matrix based on the intersection of tubes. The number of feature grids in the figure is only for illustration

\(S[i,j]\) is the ratio of the intersection over \(F_{\text{c}}[i]\):

$$\begin{aligned} S[i,j] = \frac{\mathit{inter}}{(t_{\text{c}}^{2}-t_{\text{c}}^{1})(h_{\text{c}}^{2}-h_{\text{c}}^{1})(w_{\text{c}}^{2}-w_{\text{c}}^{1})}. \end{aligned}$$
(14)
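Putting Eqs. (10)–(14) together, the correspondence matrix can be computed in a vectorized form as sketched below, assuming each feature grid is represented by its tube boundaries in the coordinate frame of the global video; this is an illustrative sketch rather than the released implementation.

```python
import torch


def tube_correspondence(clip_tubes, video_tubes):
    """Dense spatio-temporal correspondence (Eqs. (10)-(14)).

    clip_tubes:  (N_c, 6) tubes of the local clip feature grids, each row
                 [t1, t2, h1, h2, w1, w2] in the global video coordinates
    video_tubes: (N_v, 6) tubes of the global video feature grids
    Returns S with S[i, j] = intersection(tube_i, tube_j) / volume(tube_i).
    """
    c = clip_tubes.unsqueeze(1)    # (N_c, 1, 6)
    v = video_tubes.unsqueeze(0)   # (1, N_v, 6)

    inter = torch.ones(clip_tubes.size(0), video_tubes.size(0))
    for lo, hi in [(0, 1), (2, 3), (4, 5)]:          # t, h, w axes
        overlap = torch.minimum(c[..., hi], v[..., hi]) - \
                  torch.maximum(c[..., lo], v[..., lo])
        inter = inter * overlap.clamp(min=0)

    vol = ((clip_tubes[:, 1] - clip_tubes[:, 0]) *
           (clip_tubes[:, 3] - clip_tubes[:, 2]) *
           (clip_tubes[:, 5] - clip_tubes[:, 4])).unsqueeze(-1)
    return inter / vol
```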

4.3 Comparison with existing works

Action recognition

We first present the comparison between our method and recent video representation learning approaches on action recognition in Table 1. We report Top-1 accuracy on UCF-101 and HMDB-51 under both linear probe and finetuning. We exclude methods that use different evaluation settings and much deeper backbones, such as Refs. [17, 49, 71], or that rely on audio and text modalities, such as Refs. [72, 73]. In Table 1, we use ‘V+F’ to denote the use of both red green blue (RGB) frames and optical flow in the self-supervised pretraining stage. All evaluation results are obtained using only RGB at test time.

Table 1 Comparison results for the action recognition downstream task. We provide the training setting of each method, including the backbone encoder, pretraining dataset, spatio-temporal resolution and modality, where ‘V’ means RGB frames, ‘F’ means optical flow. We use freeze (✓) to indicate the linear probe, while no freeze (✗) denotes end-to-end fine-tuning. For fairness, we exclude methods that use different evaluation settings, much deeper backbones or other modalities, such as audio and text. ‘*’ denotes 200 epochs pretraining on Kinetics-400. The bold font indicates the highest accuracy among the compared methods

Under the linear probe, our method outperforms other RGB-only approaches by a large margin. The superiority over RSPNet [48], which integrates a temporal pretext task with contrastive learning, demonstrates the effectiveness of our general temporal structure learning scheme. Note that our method also dramatically narrows the gap between RGB-only and RGB-flow based methods, indicating that it significantly improves motion pattern modeling. Under finetuning, our method achieves promising results when pretrained on UCF-101, even surpassing RGB-flow based architectures [15, 20] and methods trained at higher resolutions [52]. When pretrained on Kinetics-400, our method is generally comparable with state-of-the-art RGB-flow approaches [15, 20], motion-focused methods [52, 53], and probabilistic and hierarchical pretraining [58, 68]. This indicates that our controllable augmentations have the potential to cover long-range temporal dynamics and lead to a comprehensive perception of appearance and motion cues. In addition, due to limited computational resources, we do not compare with works using very large backbones, such as Refs. [17, 71], but we present an ablation in the bottom three lines. The results indicate that our method has the potential to scale to longer training epochs, deeper backbones and larger resolutions.

In addition, we also provide results on Diving-48 [9], a dataset that mainly relies on dynamic motions to distinguish different action categories. We compare the results of both supervised (the second to the fourth row) and self-supervised methods (bottom three rows) in Table 2. Since the appearance is similar across different videos, the Top-1 accuracy well reflects the ability in motion understanding. We observe that in this case, semantic label supervision is not effective, and our method improves the performance by a notable margin. This demonstrates that our learning approach is superior in capturing motion patterns, with less reliance on background information.

Table 2 Action recognition results on the Diving-48 dataset. We compare different Top-1 accuracies based on the test split V1 action labels. The bold font indicates the highest accuracy among the compared methods

Video retrieval

Table 3 depicts the comparison on video retrieval with R@k. The model is pretrained on UCF-101. Our method remarkably outperforms most RGB-based approaches. Note that some methods, especially PCL [77], achieve impressive results when k increases to 20. This is because when k is large, it becomes likely that the background serves as a shortcut to retrieve videos of the same category. We reach comparable or even better performance even though STS [15] and CoCLR [20] adopt both RGB and optical flow. This demonstrates once again that our integration of detailed local feature alignment and general long-term temporal modeling is effective in enhancing motion pattern modeling without resorting to motion-biased input data.

Table 3 Comparison results for the video retrieval downstream task. We report R@k (\(k=1\),5,10,20) on the UCF-101 and HMDB-51 datasets. The bold font indicates the highest recall among the compared methods

Visualization analysis

We also display some visualization results to analyze the learned feature representations in Fig. 4. We employ class-agnostic activation maps (CAAM) [78] to reveal the spatio-temporal distributions of the extracted features. Generally, vanilla contrastive learning based on SimCLR [26] leads the model to focus on representative background cues, e.g., the soccer field, swimming pool and fitness equipment. In contrast, our pretrained model focuses on the moving foregrounds that contain actions, such as the moving human body and moving boat.

Figure 4
figure 4

Class-agnostic activation maps (CAAM) visualization of spatio-temporal feature maps. We compare the results of our method and the contrastive learning baseline. Our method focuses on the moving objects, while the baseline tends to emphasize background regions

4.4 Ablation study

In this section, we provide several ablation studies to analyze our video representation learning framework. Unless specifically mentioned, all models are pretrained on UCF-101 for 150 epochs, with R3D-18 as the backbone.

Local-global sampling

We first explore the impact of the local-global settings. Two aspects are investigated: the number of local clips K and the temporal resolution \(T_{\text{v}}\) of the global video feature, which is obtained by adjusting the temporal convolution stride. We present the results in Table 4. By varying the number of local clips K from 1 to 4, we find that more local clips tend to improve the performance due to more fine-grained feature alignment. It is worth noting that when the ratio \(T_{\text{v}}/(KT_{\text{c}})<1\), the granularity of the local-global correspondence becomes too coarse, which limits the performance. Overall, accurate spatio-temporal region correspondence provides a reliable reference for appearance and motion pattern matching, and significantly improves action recognition.

Table 4 Ablation study on local-global sampling. We present the results with different clip numbers and the temporal resolution of global video features. The first line equals the baseline. We report linear probe Top-1 accuracy on UCF-101 and HMDB-51. The bold font indicates the highest accuracy among the compared settings

Negative pair formulation

With the local-global sampling, there are several alternatives for formulating the negative feature pool in Eq. (4). In detail, given the query local clip feature \(f(v_{k})[i]\), we sample the matched global view features as positive pairs, and by default sample the global view features from other videos to form the negative pairs. In this ablation study, we compare with two other variants: (1) sampling local clip features from other videos as negative pairs; (2) combining both local and global view features from other videos as negative samples. We show the empirical comparison in Table 5. We observe that sampling negative features from the global view works much better than sampling from the local views, and integrating both local and global features as negative pairs leads to negligible improvements. This is because the global view features provide richer visual contexts with larger receptive fields, and serve as more informative reference signals in dense region contrast.

Table 5 Ablation study on the negative pair formulation in Eq. (4). We use ✓ to indicate the sampling mode. We sample negative features from the global view by default. The bold font indicates the highest accuracy among the compared settings

Random perturbations in dense correspondence

Recalling Eq. (1), we calculate the intersection ratio of two spatio-temporal tubes to produce the dense correspondence matrix that indicates the positive pairs. This strict correspondence is based purely on geometric positions and might neglect some visual contexts that help high-level understanding. To this end, we compare adding independent random Gaussian noise and applying a spatio-temporal Gaussian blur to the dense correspondence matrix in Table 6. We find that random Gaussian noise leads to a substantial performance drop since it destroys the correspondence relations. In contrast, Gaussian blur largely maintains the original correspondence distribution and slightly expands the positive sampling area. It improves the performance by including more diverse visual contexts as positive features.
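The exact form of the spatio-temporal Gaussian blur used in this ablation is not specified; one plausible implementation, sketched below, reshapes each row of the correspondence matrix into its \((T_{\text{v}},H,W)\) volume, convolves it with a 3D Gaussian kernel, and renormalizes the rows. The kernel size and sigma are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def blur_correspondence(S, tv, h, w, kernel_size=3, sigma=1.0):
    """Spatio-temporal Gaussian blur of the correspondence matrix.

    S: (N_c, N_v) correspondence matrix with N_v = tv * h * w. Each row is
    reshaped into a (tv, h, w) volume, smoothed with a separable 3D
    Gaussian kernel, and renormalized so that rows still sum to 1.
    """
    coords = torch.arange(kernel_size).float() - kernel_size // 2
    g1d = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    kernel = g1d[:, None, None] * g1d[None, :, None] * g1d[None, None, :]
    kernel = (kernel / kernel.sum()).view(1, 1, kernel_size, kernel_size, kernel_size)

    vol = S.view(-1, 1, tv, h, w)
    blurred = F.conv3d(vol, kernel, padding=kernel_size // 2)
    blurred = blurred.view(S.size(0), -1)
    return blurred / blurred.sum(-1, keepdim=True).clamp(min=1e-8)
```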

Table 6 Ablation study on applying random perturbations on the dense correspondence matrix. The bold font indicates the highest accuracy among the compared settings

Low-level augmentation levels

We also explore the setting of the intensity levels of the low-level augmentations. We follow conventional implementations: for color jitter, the controllable parameters of brightness, contrast, saturation and hue are set to \((B,C,S,H)= (0.4,0.4,0.4,0.1)\) by default [20, 22]; for Gaussian blur, we control the radius and sigma. We set different numbers of intensity levels for each controllable parameter, as shown in Table 7. Note that since B, C and S share the same default value, we also set the same number of levels for them. The total number of predefined intensity levels equals the number of combinations across all parameters, i.e., 32 for the first row, 512 for the second row, etc. For consistency, in each iteration we randomly sample 32 intensity levels from all possible levels, resulting in 32 groups of features that share similar low-level statistics for mutual information minimization. We observe that too few or too many levels both lead to a performance drop, because more levels lead to less difference between groups while fewer levels mean more difference within each group. We conclude that a trade-off exists and must be balanced to achieve the best training.

Table 7 Ablation study on low-level augmentation settings. # denotes the number of intensity levels of brightness, contrast, saturation, hue and Gaussian blur. We report Top-1 accuracy of the linear probe on UCF-101 and HMDB-51. The bold font indicates the highest accuracy among the compared settings

Mutual information estimation

We also investigate several methods for mutual information estimation. For low-level shortcut elimination, we need to force the encoder to minimize the estimated mutual information. Theoretically, we should minimize an estimated upper bound. However, we find that training is difficult to converge with CLUB [79], an upper bound estimation approach. Therefore, we also adopt the lower bound estimation methods MINE [59], JS [80] and InfoNCE [27] for comparison. These methods provide tight lower bound estimates that lead to easier convergence and thus achieve superior performance on action recognition. Compared to the baseline, as illustrated in Table 8, our method obtains significant improvement, especially when pretrained on mini-Kinetics and evaluated on UCF-101. This indicates that mutual information minimization helps mitigate low-level shortcuts and enhances the generalization ability.

Table 8 Ablation study on low-level mutual information estimation. None means the baseline without mutual information minimization. We report Top-1 accuracy of the linear probe on UCF-101 with a model pretrained on UCF-101 and mini-Kinetics. The bold font indicates the highest accuracy among the compared settings

Temporal dependency head

To further examine the implementation of the temporal dependency head, we compare three typical variants: (1) MLP: concatenate \(f'(v)\) and \(\overline{f}(v)\) or \(\widetilde{f}(v)\) and pass them through a multi-layer perceptron (MLP) to obtain a scalar value. (2) GRU: use a gated recurrent unit (GRU) to process the clip feature sequence, and calculate the cosine similarity between \(f'(v)\) and the GRU output. (3) GRU+MLP: the GRU processes the clip feature sequence, whose output is then concatenated with \(f'(v)\) and passed through an MLP to obtain a scalar value. The results are listed in Table 9. Compared with no temporal constraint, all three implementations gain significant improvements. We note that our MLP implementation is similar to VCOP [14] but differs in the learning objective. This improvement reveals that introducing the global video feature as a reference could enhance temporal structure modeling.
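For illustration, a minimal sketch of the GRU+MLP variant is given below; the hidden sizes and layer configuration are assumptions, since the paper only describes the head at a high level.

```python
import torch
import torch.nn as nn


class GRUMLPHead(nn.Module):
    """Sketch of the 'GRU+MLP' temporal dependency head: a GRU summarizes
    the (possibly shuffled) clip feature sequence, the summary is then
    concatenated with the global video feature and scored by an MLP.
    """

    def __init__(self, dim, hidden=256):
        super().__init__()
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, 1)
        )

    def forward(self, clip_seq, video_feat):
        # clip_seq: (B, K, C) clip features; video_feat: (B, C)
        _, last = self.gru(clip_seq)                 # last hidden state (1, B, C)
        summary = last.squeeze(0)
        return self.mlp(torch.cat([summary, video_feat], dim=-1)).squeeze(-1)
```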

Table 9 Ablation study on the temporal dependency head. None denotes the baseline without temporal constraints, which follows VCOP [14] for comparison. We report Top-1 accuracy of the linear probe on UCF-101 and HMDB-51. The bold font indicates the highest accuracy among the compared settings

Marginal distribution formulation

In addition to the mutual information estimation head, we also compare different marginal distribution formulations for temporal dependency modeling. By default, we instantiate it as a uniform distribution over the shuffled orders. We also compare it with a difficulty-aware marginal distribution that places more emphasis on hard examples. The results are presented in Table 10. The temperature hyper-parameter controls the concentration of the softmax normalization over the difficulty scores in Eq. (8). A lower temperature leads to a sharper distribution and thus repeated sampling of difficult examples. From the comparison, we observe that a lower temperature impairs performance while a smoother distribution brings improvements. This indicates the necessity of a small number of easy examples to guide the model to discriminate temporal relations and prevent it from falling into ambiguities. Only on this basis can sampling more hard examples further facilitate temporal perception.

Table 10 Ablation study on the marginal distribution formulation in the mutual information estimation for temporal dependency modeling. We use the uniform distribution by default and compare it with softmax normalization over the difficulty scores with different temperature hyper-parameters. The bold font indicates the highest accuracy among the compared settings

Dynamic augmentation parameters

Here, we provide a quantitative comparison between the default augmentation parameter setting and the dynamic parameter evolving in Table 11. In the dynamic learning stage, we manually control the augmentation parameters to construct the training samples in a curriculum manner. It is clear that the dynamic augmentation parameter evolving in an easy-to-hard manner leads to performance improvement. This dynamic setting contributes to determining optimal augmentation parameter combinations that facilitate video representation learning.

Table 11 Ablation study on the controllable augmentation parameters. We compare using the static parameters over the whole training stage and dynamic parameter evolving on both spatio-temporal position transformations \(\tau _{\text{p}}\) and low-level augmentations \(\tau _{\text{l}}\). The bold font indicates the highest accuracy among the compared settings

Training efficiency

We also report the training efficiency of the variants of our method and of other works. For a fair comparison, we pretrain the R3D-18 backbone with a resolution of \(112\times 112\) and a clip length of 16 frames on Kinetics-400 on a server with 8 NVIDIA 3090 GPUs, and report the training time in hours as well as the total GPU hours. For the variants of our method, we compare different numbers of local clips K and present the results in Table 12. We observe that sampling more local clips leads to faster convergence, so the training time does not increase linearly. Compared with the other two baselines [51, 67], based on our reimplementations, our method achieves better performance with fewer training hours, demonstrating the high training efficiency of our framework.

Table 12 Ablation study on training efficiency. We show the comparison of different numbers of local clips K in our method and the comparison with other baselines. We report the training time and finetuning results on UCF-101 and HMDB-51. The bold font indicates the highest accuracy among the compared settings. ‘x’ denotes the training time ratio compared to our method using 2 local clips

Overall learning objectives

We finally show the ablation of the designed learning objectives in Table 13, where \(\mathcal{L}_{\text{nce}}\) is the standard contrastive loss used in existing works. We observe that the integration of \(\mathcal{L}_{\text{rc}}\) and \(\mathcal{L}_{\text{mi}}\) significantly outperforms \(\mathcal{L}_{\text{nce}}\), which indicates that the detailed region contrast with low-level shortcut elimination is more effective than naive global contrast. In addition, \(\mathcal{L}_{\text{td}}\) further enables the model to go beyond local clips and establish long-term relationships. The improvement demonstrates that our method well integrates detailed region-level contrast and general long-term temporal perception.

Table 13 Ablation study on all learning objectives. We use ✓ to indicate which loss functions are used. Note that \(\mathcal{L}_{\text{nce}}\) is the standard contrastive loss function used in previous works. The bold font indicates the highest accuracy among the compared settings

5 Conclusion

In this paper, we propose a framework that leverages local clips and the global video to enhance self-supervised video representation learning. We employ a set of controllable augmentations to crop local clips and to generate groups of samples that share similar low-level attributes. We then use the soft codes computed from the crop and flip parameters to guide detailed spatio-temporal region contrastive learning, and minimize the mutual information within each low-level group to avoid shortcuts. We also incorporate a local-global temporal dependency to embed general temporal structures into the extracted video representations. Experiments on the downstream tasks of action recognition and video retrieval demonstrate the superiority of our formulation, especially in modeling dynamic motion patterns.