Controllable Augmentations for Video Representation Learning

This paper focuses on self-supervised video representation learning. Most existing approaches follow the contrastive learning pipeline to construct positive and negative pairs by sampling different clips. However, this formulation tends to be biased towards static background and has difficulty establishing global temporal structures. The major reason is that the positive pairs, i.e., different clips sampled from the same video, have a limited temporal receptive field, and usually share similar background but differ in motions. To address these problems, we propose a framework that jointly utilizes local clips and global videos to learn from detailed region-level correspondence as well as general long-term temporal relations. Based on a set of controllable augmentations, we achieve accurate appearance and motion pattern alignment through soft spatio-temporal region contrast. Our formulation avoids the low-level redundancy shortcut through mutual information minimization, which improves generalization. We also introduce local-global temporal order dependency to further bridge the gap between clip-level and video-level representations for robust temporal modeling. Extensive experiments demonstrate that our framework is superior on three video benchmarks in action recognition and video retrieval, capturing more accurate temporal dynamics.


Introduction
Video representation learning is fundamental to video understanding applications, e.g., action recognition [10,69], spatio-temporal detection [19,9], video retrieval [40,45], etc. Traditional supervised learning schemes require large-scale human labeling, and the performance is usually restricted by the granularity of annotations. More precisely, coarse-grained video-level annotations could lead the model to attend to the background [54,10], while fine-grained annotations greatly facilitate general video analysis but are much more expensive [18,39]. To solve this problem, unsupervised video representation learning has begun to attract more attention. Some early works design diverse pretext tasks to learn the video characteristics in a self-supervised manner [8,46,32,31,71,61]. Recently, the formulation of contrastive learning has further improved the performance by a large margin [17,52,62,73].
A prevalent way for contrastive video representation learning is to sample several clips and regard those from the same video as positive pairs [52,24,33,48]. However, this formulation has two drawbacks. On one hand, these methods tend to be biased towards static background [64,63]. This is because the sampled clips mostly share the same background, but there probably exist subtle differences in motions. For example, in Fig. 1, the video contains a high jump scene. The clip sampled at an early timestamp shows the running action, but the clip sampled at a later timestamp presents the jumping action. Thus, pulling these two clips closer in the feature space will lead the model to neglect their distinct motions and only attend to the background of the stadium. On the other hand, there remains an obvious gap between clip-level features and video-level representation. The sampled clips have a limited temporal receptive field, and thus cannot provide comprehensive information. For example, Clip 1 in Fig. 1 only shows the momentary process of running. Only when we jointly leverage the two correctly ordered clips, i.e., the running action happens before jumping, are we able to understand the original video. Motivated by these observations, we intend to address these problems from two aspects: one is detailed region-level correspondence, the other is general long-term temporal perception.
Fig. 1. An illustration of clip sampling and temporal correspondence. We show a high-jump video with two sampled clips. The two clips have the same background but different motions: one running and the other jumping. We provide an example of temporal correspondence between clip and video, where we coarsely divide the video (clip) into three segments; the value in the matrix indicates the intersection ratio. The spatial correspondence can be calculated similarly.
In this paper, we propose a framework to learn comprehensive appearance and motion patterns in videos. Concretely, we develop a set of controllable augmentations to achieve this goal. First, we use constrained spatio-temporal cropping to sample several local clips from each video such that the clips cover diverse timestamps of the video. Then, based on the cropping parameters, we generate dense spatio-temporal position-wise correspondences between the local clip and global video feature maps. In Fig. 1, we show a toy example of temporal correspondence; the spatial correspondence is established in the same way. We employ these correspondences as soft codes to align features in corresponding regions. In this way, we can match the exact same appearance and motion content, while avoiding aligning inconsistent motions between various timestamps. However, there also exist "shortcuts" for identifying the overlapping regions between local clips and global videos, e.g., the low-level color statistics; these shortcuts could prevent the model from learning useful semantics. To avoid them, we define different intensity levels of color jitter and Gaussian blur augmentations, and regard the samples generated by the same level of augmentation as sharing similar low-level attributes. We then minimize the mutual information between them to mitigate the impact of low-level shortcuts on the extracted representation.
To further bridge the gap between clip-level and video-level representations, we introduce a learning objective to model temporal order dependency between local clips and the global video. Particularly, we have access to the temporal order of the sampled clips according to the cropping parameters. With that, we aim to maximize the mutual information between correctly ordered clip features and the global video feature. Through this operation, we facilitate the model's temporal awareness in the pretraining stage.
In summary, our contributions are as follows:
- We propose a unified framework to learn video representations from detailed local contrast and general long-term temporal modeling.
- We develop controllable augmentations to match the visual contents in corresponding spatio-temporal positions for detailed content alignment, and perform mutual information minimization to avoid low-level shortcuts.
- We introduce the temporal order dependency between the local clips and global video to enhance general temporal structure modeling.
- We achieve superior results on downstream action recognition and video retrieval tasks, while capturing more accurate motion patterns.

Related Work
Contrastive Learning. Recently, contrastive learning has revolutionized self-supervised learning [26,12,47]. Its core idea is to discriminate different instances by attracting the positive pairs and repelling the negative pairs in feature space [21,20]. Following this, [68] formulates instance discrimination as a non-parametric classification problem. [47] proposes to estimate mutual information with the InfoNCE loss [20], which leads to easy optimization and fast convergence. Inspired by this, a line of works [57,12,26,27] adopts this learning objective for image representation learning and shows significant improvement on downstream tasks.
Later, [70,67] develop dense contrastive learning, which performs pixel-level contrast. Compared to instance-level discrimination, dense contrastive learning preserves richer characteristics, and performs better on dense prediction tasks and visual correspondence learning. In our work, we focus on video representation learning. Considering that natural spatio-temporal correspondences exist in the video domain, we propose to utilize them as a self-supervisory signal for spatio-temporal region contrast to learn more comprehensive video representations.
Video Representation Learning. Unlike images, videos contain internal temporal structures that are crucial for video content analysis. To this end, many works [46,36,71] have designed various pretext tasks to leverage the natural spatio-temporal correspondence as self-supervisory signals. Some typical pretext tasks include temporal ordering [46,71,73], spatio-temporal puzzles [32,61], colorization [60], playback speed prediction [31,8], temporal cycle-consistency [66,29,38], and future prediction [59,58,43,5]. There are also some works using cross-modal correspondence for self-supervised pretraining [1,50]. Inspired by the success of contrastive learning in the image domain, a series of works extend this pipeline to the video domain [17,52,62,41]. Particularly, [22,23] employ the InfoNCE loss for dense future prediction, while [62,72] sample clips of different rates as positive pairs for visual content learning. However, video contrastive learning could lead the model to lay more emphasis on the static scene and focus less on motion [64]. To solve this problem, [11,31] propose to integrate contrastive learning with temporal pretext tasks to enhance temporal awareness, and [24,37] use optical flow to assist motion modeling. In our work, we do not resort to optical flow to enhance motion learning and temporal modeling. Instead, we hypothesize that the underlying reason for static scene bias lies in the positive pair formulation. That is, most existing works use either different frames [73,17] or different clips [52,33] from the same video as the positive pair, which usually have similar background but possess different motions. Hence, we propose to consider the corresponding regions within local and global views to form accurate positive pairs, together with low-level shortcut elimination, so as to capture the desired static and dynamic characteristics. In addition, we develop a temporal dependency between these views to bridge the gap between local clip and global video representations, while learning robust temporal structures.
Local-global Views for Video Representation. Some works have also used local and global views for self-supervised video representation learning [44,53,14,4,33]. The major difference between our work and those works lies in the concept of local-global views and its target. In our work, "local-global" means short and long video clips, and the major target is to construct spatio-temporal overlaps and formulate a soft learning objective, which guides detailed region-level video content alignment. In [44], local-global means local fine-grained and global coarse-grained features, which are designed for general audio-visual correspondence. [53] aims to extrapolate the neighboring video content in the global view based on the observation from the local view. TCLR [15] designs a loss function to learn temporal correspondence between local and global clips but still with hard positive assignment. [6] employs local-global views to decompose stationary and non-stationary features and [34] uses them for segment-based positive sampling.

Method
The core idea of our proposed framework is to enhance self-supervised video representation learning by comprehensive appearance and motion content modeling.
As shown in Fig. 2, we utilize a set of controllable augmentations to achieve (1) detailed spatio-temporal region contrast, (2) low-level shortcut elimination and (3) general temporal dependency modeling. Specifically, we divide the augmentations into two parts: one is the spatio-temporal position transformations $\tau_p$ (including crop and horizontal flip), the other is the low-level statistic transformations $\tau_l$ (including color jitter and Gaussian blur). Following the data preprocessing pipeline, given a video $v$, we first use $\tau_p$ to sample several local clips and then perform $\tau_l$ to generate the input to the encoder.

Spatio-temporal Region Contrast
Given a video $v$ with temporal length $T$, we first use spatio-temporal cropping to sample $K$ clips, i.e., $v_k \in \{v_1, v_2, \ldots, v_K\}$, to provide the local feature descriptions. In order to let the sampled clips contain as much information as the original video, we manually constrain the temporal cropping parameters in $\tau_p^k$ so that the central timestamp of $v_k$ lies in the range $\left[\frac{(k-1)T}{K}, \frac{kT}{K}\right]$. In this way, the sampled clips cover different temporal segments and jointly present the rich information in $v$. As mentioned in Sec. 1, there could be inconsistencies in motions between different local clips, so it is not optimal to align the representations between different clips. Hence, we need to figure out the exact corresponding content for feature alignment. To this end, considering that there is natural correspondence between local clips and the global video, we leverage $v$ and $v_k$ as two views for feature matching.
For local clip feature extraction, we denote the feature extractor as $f(\cdot)$, and the local clip feature map as $f(v_k) \in \mathbb{R}^{C \times T_c \times H \times W}$, where $C$, $H$, $W$ denote the channel, height and width respectively, and $T_c$ denotes the temporal dimension of the clip feature map. For global video feature extraction, we perform sparse sampling to represent $v$, and set the temporal stride of some convolution layers to 1 so that $f'(v) \in \mathbb{R}^{C \times T_v \times H \times W}$ possesses higher temporal resolution, i.e., temporal dimension $T_v > T_c$. Note that $f$ and $f'$ share the same architecture and only differ in the temporal stride. Details of the network settings are described in Sec. 4.2.
Based on $f(v_k)$ and $f'(v)$, we refer to the augmentation parameters in $\tau_p^k$ to calculate the dense spatio-temporal position correspondence. Specifically, we use $S_k \in \mathbb{R}^{N_c \times N_v}$ to indicate the correspondence result, where $N_c = T_c H W$ and $N_v = T_v H W$. $S_k(i, j)$ gives the correspondence score between the $i$-th spatio-temporal grid in $f(v_k)$ and the $j$-th grid in $f'(v)$. Essentially, each grid on the feature map is equivalent to a tube covering a certain spatio-temporal area (see Fig. 2), and $S_k(i, j)$ is measured by the ratio of the intersection of two tubes over the volume of tube $f(v_k)[i]$:
$$S_k(i, j) = \frac{\mathrm{inter}\big(f(v_k)[i],\, f'(v)[j]\big)}{\mathrm{vol}\big(f(v_k)[i]\big)},$$
where $[\cdot]$ denotes the grid index, $\mathrm{vol}(\cdot)$ measures the spatio-temporal volume of the given feature tube, and $\mathrm{inter}(\cdot)$ measures the intersecting volume between two tubes. The detailed computation process is illustrated in the Supplementary Material. In this formulation, every row of $S_k$ sums to 1, i.e., $\sum_{j=1}^{N_v} S_k(i, j) = 1$ for each $i \in \{1, \ldots, N_c\}$. This indicates that each row in $S_k$ can be treated as a probability distribution that describes the correspondence between $f(v_k)[i]$ and each grid in $f'(v)$. Therefore, we utilize the calculated correspondence matrix $S_k$ as the reference distribution to guide spatio-temporal region feature contrast for accurate visual content alignment. To be specific, we take $f(v_k)[i]$ as the query for illustration. Recall that the InfoNCE loss can be written as the cross-entropy between a prior distribution, i.e., the indicator function, and the feature similarity distribution:
$$\mathcal{L}_{nce} = -\sum_{j} I_{ij} \log \frac{\mathrm{sim}\big(f(v_k)[i],\, f'(v)[j]\big)}{\sum_{j'} \mathrm{sim}\big(f(v_k)[i],\, f'(v)[j']\big)},$$
where $I_{ij} = 1$ if $i = j$ and $I_{ij} = 0$ otherwise, and $\mathrm{sim}(\cdot, \cdot) = \exp(\cos(\cdot, \cdot)/\tau)$ measures the feature similarity. In our formulation, we replace the prior $I_{ij}$ with the soft distribution $S_k(i, j)$ for accurate region contrast. Since the correspondence between $v_k$ and clips from other videos naturally equals 0, we can enlarge the negative pool by introducing features from other videos. Thus, the spatio-temporal region contrast loss w.r.t. $f(v_k)[i]$ can be formulated as
$$\mathcal{L}_{rc} = -\sum_{j=1}^{N_v} S_k(i, j) \log \frac{\mathrm{sim}\big(f(v_k)[i],\, f'(v)[j]\big)}{\sum_{j'=1}^{N_v} \mathrm{sim}\big(f(v_k)[i],\, f'(v)[j']\big) + \sum_{n} \mathrm{sim}\big(f(v_k)[i],\, n\big)},$$
where $n$ denotes the negative features sampled from other videos in the mini-batch. In this way, we are able to align the exact corresponding appearance and motion content in videos.
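The constrained temporal cropping can be illustrated with a minimal sketch. It assumes the clip centre is drawn uniformly inside its segment; the text only states the range, so the uniform choice and the helper name `sample_clip_centers` are hypothetical.

```python
import random

def sample_clip_centers(num_frames, num_clips, rng=None):
    """Draw one clip centre per temporal segment of a video.

    The k-th centre (k = 1..K) lies in [(k-1)T/K, kT/K]. Uniform sampling
    inside each segment is an assumption made for illustration.
    """
    rng = rng or random.Random()
    centers = []
    for k in range(1, num_clips + 1):
        lo = (k - 1) * num_frames / num_clips
        hi = k * num_frames / num_clips
        centers.append(rng.uniform(lo, hi))
    return centers

# Example: a 160-frame video split into K = 4 segments of 40 frames each.
centers = sample_clip_centers(num_frames=160, num_clips=4, rng=random.Random(0))
```

Because each centre is confined to its own segment, the sampled clips are ordered in time by construction, which is what later makes the clip order available without extra labels.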
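As a rough illustration of the correspondence matrix and the soft region contrast, the NumPy sketch below treats every feature grid as an axis-aligned tube in normalized video coordinates. It is not the authors' implementation: it ignores horizontal flipping, assumes the clip crop lies inside the global view, and the temperature 0.1 and all function names are hypothetical.

```python
import numpy as np

def axis_overlap(c_lo, c_hi, n_c, g_lo, g_hi, n_g):
    """Overlap ratio along one axis: rows are clip cells, columns are video cells."""
    c_edges = np.linspace(c_lo, c_hi, n_c + 1)
    g_edges = np.linspace(g_lo, g_hi, n_g + 1)
    lo = np.maximum(c_edges[:-1, None], g_edges[None, :-1])
    hi = np.minimum(c_edges[1:, None], g_edges[None, 1:])
    inter = np.clip(hi - lo, 0.0, None)        # intersection length per cell pair
    return inter / np.diff(c_edges)[:, None]   # normalize by clip-cell length

def correspondence_matrix(clip_box, video_box, clip_shape, video_shape):
    """S with shape (T_c*H*W, T_v*H*W); boxes are ((t0, t1), (y0, y1), (x0, x1))."""
    ratios = [axis_overlap(c[0], c[1], n_c, g[0], g[1], n_g)
              for c, g, n_c, n_g in zip(clip_box, video_box, clip_shape, video_shape)]
    # Tube intersection volume factorizes over the axes, hence the Kronecker product.
    return np.kron(np.kron(ratios[0], ratios[1]), ratios[2])

def soft_region_contrast(query, keys, negatives, s_row, tau=0.1):
    """Cross-entropy between one row of S and the similarity distribution."""
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    q = unit(query)
    pos = np.exp(unit(keys) @ q / tau)        # similarities to global video grids
    neg = np.exp(unit(negatives) @ q / tau)   # similarities to other-video features
    log_prob = np.log(pos) - np.log(pos.sum() + neg.sum())
    return float(-(s_row * log_prob).sum())

# Global view covers the unit cube; the local clip is a crop inside it.
video_box = ((0.0, 1.0), (0.0, 1.0), (0.0, 1.0))
clip_box = ((0.2, 0.5), (0.1, 0.7), (0.3, 0.9))
S = correspondence_matrix(clip_box, video_box, (2, 4, 4), (8, 4, 4))

rng = np.random.default_rng(0)
C = 16  # hypothetical channel dimension
clip_feat = rng.normal(size=(S.shape[0], C))
video_feat = rng.normal(size=(S.shape[1], C))
other_feat = rng.normal(size=(64, C))
loss = soft_region_contrast(clip_feat[0], video_feat, other_feat, S[0])
```

With the feature resolutions 2 × 4 × 4 and 8 × 4 × 4 used here, S has shape 32 × 128 and every row sums to 1, so each row can serve directly as the soft target distribution.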

Low-level Shortcut Elimination
However, the local-global spatio-temporal correspondence used for region feature contrast admits a "shortcut": the model can rely merely on low-level statistics, e.g., color distribution, to identify the overlapping areas. This shortcut could prevent the model from learning meaningful semantic features. To this end, we aim to mitigate the impact of low-level statistics on the extracted representations.
An intuitive way to solve this problem is by utilizing strong augmentations. However, we find that this is not enough in the video domain. Unlike images, the temporal continuity between sampled frames could provide extra cues to learn these shortcuts. For example, the continuous change in illumination helps to determine the corresponding segments in the local-global views. It is nontrivial to design augmentations to decouple such low-level information from the final representations. Motivated by adversarial learning, a promising approach is to learn a low-level information estimator from semantically inconsistent samples that share similar low-level statistics. Then, we let the encoder minimize this estimated information.
We note that the color and blur augmentation $\tau_l$ directly alters low-level statistics. In other words, similar augmentations could generate samples that share similar low-level characteristics. Hence, we define several different intensity levels of $\tau_l$ by constraining the augmentation parameters to a certain range. As such, we could generate frame sequences that possess distinct semantics but similar low-level statistics using the controlled $\tau_l$. Then, we build a mutual information estimator on top of the extracted feature representation for low-level information extraction. Note that there are several ways to approximate the mutual information; we compare different estimation methods in Sec. 7. For illustration, we take MINE [7] as an example. Following [7], we approximate the mutual information between two variables by
$$I_\Theta(X; Y) = \sup_{\theta \in \Theta} \; \mathbb{E}_{P_{XY}}\big[G_\theta(x, y)\big] - \log \mathbb{E}_{P_X \otimes P_Y}\big[e^{G_\theta(x, y)}\big],$$
where $X$ and $Y$ are the feature representations extracted by the encoder $f$, and $G_\theta : X \times Y \to \mathbb{R}$ is parameterized by a neural network with $\theta \in \Theta$. We instantiate $G_\theta$ as a two-layer MLP. We regard the features of sample pairs generated from the same intensity level of $\tau_l$ as drawn from the joint distribution $P_{XY}$, and the features of arbitrary sample pairs as drawn from the marginal $P_X \otimes P_Y$. During training, we formulate the learning objective as:
$$\mathcal{L}_{mi} = \mathbb{E}_{P_{XY}}\big[G_\theta(x, y)\big] - \log \mathbb{E}_{P_X \otimes P_Y}\big[e^{G_\theta(x, y)}\big].$$
We maximize Eq. 6 with respect to the MLP parameters $\theta$ to obtain a reliable low-level information extractor, but reverse the gradient back-propagated to the encoder $f$ to minimize Eq. 6. With the learned low-level information estimator $G_\theta$, we further apply it to the aforementioned local-global pairs, $f(v_k)$ and $f'(v)$, to minimize the low-level shortcut by optimizing $f$ without updating $\theta$. In this way, we minimize the impact of low-level statistics on the spatio-temporal region feature contrast, and facilitate detailed semantic alignment.
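The MINE-style estimate can be sketched in a few lines of NumPy. This is a forward-only illustration with random, untrained critic weights: the two-layer MLP follows the text, but the hidden width, feature dimension, grouping layout and all names are assumptions, and the adversarial update (gradient ascent on the critic, reversed gradient to the encoder) is not shown.

```python
import numpy as np

def mlp_critic(x, y, params):
    """Two-layer MLP G(x, y) -> scalar score per pair."""
    w1, b1, w2, b2 = params
    hidden = np.maximum(np.concatenate([x, y], axis=1) @ w1 + b1, 0.0)  # ReLU
    return (hidden @ w2 + b2).squeeze(-1)

def mine_lower_bound(x_joint, y_joint, x_marg, y_marg, params):
    """Donsker-Varadhan bound: E_joint[G] - log E_marginal[exp(G)]."""
    t_joint = mlp_critic(x_joint, y_joint, params)
    t_marg = mlp_critic(x_marg, y_marg, params)
    shift = t_marg.max()  # subtract the max for a stable log-mean-exp
    log_mean_exp = shift + np.log(np.mean(np.exp(t_marg - shift)))
    return float(t_joint.mean() - log_mean_exp)

rng = np.random.default_rng(0)
num_levels, per_level, dim, hidden_dim = 8, 8, 16, 32  # hypothetical sizes
feats = rng.normal(size=(num_levels, per_level, dim))

# Joint samples: pairs of features produced with the same intensity level.
x_joint = feats[:, :-1].reshape(-1, dim)
y_joint = feats[:, 1:].reshape(-1, dim)
# Marginal samples: arbitrary pairs, obtained here by shuffling one side.
x_marg = x_joint
y_marg = y_joint[rng.permutation(len(y_joint))]

params = (rng.normal(scale=0.1, size=(2 * dim, hidden_dim)), np.zeros(hidden_dim),
          rng.normal(scale=0.1, size=(hidden_dim, 1)), np.zeros(1))
estimate = mine_lower_bound(x_joint, y_joint, x_marg, y_marg, params)
```

In training, the critic parameters would be updated to increase this estimate while the encoder is updated to decrease it; a constant critic yields an estimate of exactly zero, which is a convenient sanity check.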

Local-global Temporal Dependency
Now that we have learned robust clip features from the detailed region semantic contrast, the remaining task is to bridge the gap between clip-level and video-level representations. Considering that there exist internal temporal relationships between the sampled local clips (which are naturally contained in the global video), we propose to model the temporal order dependency between $f(v_k)$, $k \in \{1, 2, \ldots, K\}$, and $f'(v)$ to enhance video-level understanding. Similar to Sec. 3.2, we also use mutual information to measure the local-global temporal order dependency. The target is to maximize the mutual information between correctly ordered clip-level features and the video-level representation. Mathematically, we denote the sequentially ordered clip features as $f(\hat{v}) = [f(v_1), f(v_2), \ldots, f(v_K)]$ and the arbitrarily ordered features as $f(\tilde{v})$.
To model the temporal dependency, we regard $f(\hat{v})$ and $f'(v)$ as sampled from the joint distribution $P_{XY}$, and $f(\tilde{v})$ and $f'(v)$ as sampled from the marginal distribution $P_X \otimes P_Y$. In this formulation, the learning objective, which we maximize, can be written as
$$\mathcal{L}_{td} = \mathbb{E}_{P_{XY}}\big[G_\psi(x, y)\big] - \log \mathbb{E}_{P_X \otimes P_Y}\big[e^{G_\psi(x, y)}\big],$$
where $G_\psi$ is the mutual information estimation head. There exist several alternatives to instantiate $G_\psi$, and we discuss this in Sec. 4.4.
It is worth noting that some previous works use temporal order to build pretext tasks for self-supervised learning [46,36,71]. The major difference is that our approach incorporates the video-level feature to determine whether the clips are correctly ordered, while [46,36,71] have no access to the global feature. In this way, our formulation could avoid the ambiguity problem when encountering a temporal structure that cannot be determined solely by local clips. For example, in a complex gymnastics scene, it is difficult to determine the temporal order of gymnastic actions only with local clips. But with reference to the global video feature, it is practical to reach the correct order. Thus, our local-global mutual temporal order constraint is a better way to embed the video-level temporal structures into the extracted representations.
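The construction of correctly ordered and arbitrarily ordered clip sequences can be sketched as follows. This is a minimal illustration that pools each clip into a single vector and concatenates the result with the video feature, roughly matching an MLP-style head; the pooling, the dimensions and the helper name `ordered_and_shuffled` are assumptions.

```python
import numpy as np

def ordered_and_shuffled(clip_feats, rng):
    """Return the in-order sequence, a reordered sequence, and the permutation.

    clip_feats has shape (K, C), one pooled feature per clip, already sorted
    by time. The permutation is redrawn until it differs from the identity so
    that the reordered sample really breaks the temporal order.
    """
    num_clips = clip_feats.shape[0]
    if num_clips < 2:
        raise ValueError("need at least two clips to change their order")
    identity = np.arange(num_clips)
    perm = identity
    while np.array_equal(perm, identity):
        perm = rng.permutation(num_clips)
    return clip_feats.reshape(-1), clip_feats[perm].reshape(-1), perm

rng = np.random.default_rng(0)
K, C = 4, 16  # K = 4 clips as in the default setting; C is hypothetical
clip_feats = rng.normal(size=(K, C))
video_feat = rng.normal(size=(C,))

ordered, shuffled, perm = ordered_and_shuffled(clip_feats, rng)
# Inputs to an estimation head: ordered clips + video feature play the role of
# joint samples, reordered clips + video feature play the role of marginal samples.
joint_input = np.concatenate([ordered, video_feat])
marginal_input = np.concatenate([shuffled, video_feat])
```

Both inputs share the same video feature, so the head can only separate them by checking whether the clip order agrees with the global video, which is the dependency the objective is meant to capture.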

Implementation Details
Self-supervised Pretraining. For global video input, we sparsely sample 16 frames with weak spatial cropping. For local clip input, we constrain the temporal cropping parameters so that the K 16-frame clips are (approximately) uniformly distributed in the video. The local clips are spatially cropped within the global view to ensure position-wise correspondence. For low-level augmentations, we define a set of color jitter and Gaussian blur parameters to form different intensity-level transformations. We resize the input frame sequence to 16 × 112 × 112, and use R3D-18 [25] as the video encoder. For local clip feature extraction, we follow the default setting and the feature resolution is 2 × 4 × 4. For global video feature extraction, we set the temporal stride of the last 3 stages to 1, so that the feature resolution is 8 × 4 × 4. We calculate the spatio-temporal correspondence matrix between local and global feature maps based on the cropping and flipping parameters for optimization.
In terms of training settings, we use a batch size of 128, and set the number of local clips K to 4 by default. We train our model on UCF-101 for 200 epochs, and on Kinetics-400 for 100 epochs. We use the Adam optimizer with an initial learning rate of 10^-3 and weight decay of 10^-5. The learning rate is decayed by a factor of 10 at 70 epochs for Kinetics-400, and at 150 epochs for UCF-101.
Action Recognition. We load the pretrained video encoder parameters except the last fully-connected layer. There are two protocols: 1) end-to-end finetuning of the whole network with action labels; 2) freezing the encoder and training only the linear classifier, also known as linear probe. For evaluation, we follow [71,62] to uniformly sample 10 clips for each video, which are center cropped and resized to 112 × 112. We average the softmax probabilities of the clips as the final prediction, and report the Top-1 accuracy.
Video Retrieval. We directly use the pretrained model to extract video features without finetuning. Following [71,42], we regard videos in the test set as queries, and retrieve nearest neighbors from the training set. Similar to action recognition, we average the features of 10 uniformly sampled clips as the global representation. We report Top-k recall R@k.
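The clip-averaged prediction used at evaluation time can be written as a short NumPy sketch. The logits below are synthetic and the function names are hypothetical; only the averaging of per-clip softmax probabilities comes from the text.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def video_prediction(clip_logits):
    """Average per-clip softmax probabilities and return (class index, probabilities)."""
    probs = softmax(clip_logits).mean(axis=0)
    return int(probs.argmax()), probs

# Synthetic logits for 10 clips and 3 classes, with class 2 favored in every clip.
rng = np.random.default_rng(0)
clip_logits = rng.normal(scale=0.1, size=(10, 3))
clip_logits[:, 2] += 3.0
pred, probs = video_prediction(clip_logits)
```

Averaging probabilities rather than logits keeps each clip's vote bounded, so a single over-confident clip cannot dominate the video-level decision.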
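The retrieval metric can be sketched as follows. A query counts as a hit at k when at least one of its k nearest training videos shares its label. Cosine similarity is an assumption here (the text only says nearest neighbors), and the function name `recall_at_k` is hypothetical.

```python
import numpy as np

def recall_at_k(query_feats, query_labels, gallery_feats, gallery_labels, ks=(1, 5, 10, 20)):
    """Fraction of queries with a same-class item among their top-k neighbors."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    order = np.argsort(-(q @ g.T), axis=1)           # most similar first
    hits = gallery_labels[order] == query_labels[:, None]
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}

# Tiny example: the closest gallery item has the wrong label, the second is correct.
gallery_feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]])
gallery_labels = np.array([0, 1, 1])
query_feats = np.array([[1.0, 0.0]])
query_labels = np.array([1])
result = recall_at_k(query_feats, query_labels, gallery_feats, gallery_labels, ks=(1, 2))
```

In the toy example the query misses at k = 1 and hits at k = 2, which shows why recall necessarily grows with k.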

Comparison with Existing Works
Action Recognition. We first present the comparison between our method and recent video representation learning approaches on action recognition in Table 1. We report Top-1 accuracy on UCF-101 and HMDB-51 under linear probe and finetune. We exclude the methods that use different evaluation settings and much deeper backbones like [52,37,16], or those that rely on audio and text modalities like [2,49]. In Table 1, we use 'V+F' to denote the use of both RGB and optical flow in the self-supervised pretraining stage. All evaluation results are obtained using only RGB at test time.
Under linear probe, our method outperforms other RGB-only approaches by a large margin. The superiority over RSPNet [11], which integrates a temporal pretext task with contrastive learning, demonstrates the effectiveness of our general temporal structure learning scheme. Note that our method also dramatically narrows the gap between RGB-only and RGB-flow based methods. This indicates that our method significantly improves motion pattern modeling. Under finetune, our method achieves the best result when pretrained on UCF-101, even surpassing RGB-flow based methods. When pretrained on Kinetics-400, ours is also comparable with state-of-the-art RGB-flow approaches. Besides, due to limited computation resources, we do not compare with works using very large backbones like [52,16], but we show the ablation in the bottom three lines. The results indicate that our method has the potential to scale to longer training, deeper backbones or larger resolution.

Besides, we also provide the results on Diving-48 [39], a dataset which mainly relies on dynamic motions to distinguish different action categories. We show the comparison results between both supervised (between dashed lines) and self-supervised methods (bottom three) in Table 2. Since the appearance is similar across different videos, the Top-1 accuracy can well reflect the ability in motion understanding. We observe that in this case, semantic label supervision is not that effective, and our method improves the performance by a notable margin. This shows that our learning approach is superior in capturing motion patterns, with less reliance on background information.

Method          Backbone        Top-1
Random Init.    R3D-18          13.4
- - - - - - - - - - - - - - - - - - -
RESOUND [39]    C3D             16.4
TSN [65]        BN-Inception    16.8
Debiased [13]   R3D-18          20.5
- - - - - - - - - - - - - - - - - - -
SimCLR [12]     R3D-18          20.1
TCLR [14]       R3D-18          22.9
Ours            R3D-18          25.4

Video Retrieval. Table 3 shows the comparison on video retrieval with R@k. The model is pretrained on UCF-101. Our method remarkably outperforms most RGB-based approaches. Note that some methods, especially PCL [56], achieve impressive results when k increases to 20. This is because when k is large, it becomes likely to rely on background as a shortcut to retrieve videos of the same category. Though STS [61] and CoCLR [24] adopt RGB and optical flow, we reach comparable or even better performance. This again demonstrates that our integration of detailed local feature alignment and general long-term temporal modeling is effective in enhancing motion pattern modeling without resorting to motion-biased input data.
Table 3. Video retrieval comparison. Columns: Method, Backbone, and R@1, R@5, R@10, R@20 on UCF-101 and HMDB-51.
Visualization Analysis. We also show some visualization results to analyze the learned feature representations in Fig. 3. We employ class-agnostic activation maps (CAAM) [3] to reveal the spatio-temporal distributions of the extracted features. Generally, the vanilla contrastive learning based on SimCLR [12] leads the model to focus on some representative background cues, e.g., the soccer field, swimming pool and fitness equipment. On the contrary, our pretrained model focuses on the moving foregrounds that contain actions, like the moving human body and moving boat.

Ablation Study
In this section, we provide several ablation studies to analyze our video representation learning framework.If not specifically mentioned, all models are pretrained on UCF-101 for 150 epochs, with R3D-18 as backbone.
Local-global Sampling. We first explore the impact of local-global settings. Two aspects were investigated: the number of local clips K, and the temporal resolution T_v of the global video feature.
Low-level Augmentation Levels. We also explore the setting of the intensity levels on low-level augmentations. We follow conventional implementations: for color jitter, there are controllable parameters of brightness, contrast, saturation and hue, which are set as (B,C,S,H)=(0.4,0.4,0.4,0.1) by default [24,48]. For Gaussian blur, we control the radius and sigma. We set different intensity levels for each controllable parameter as in Table 5. Note that since B, C, S share the same default value, we also set the same number of levels for them. The total number of predefined intensity levels equals the number of combinations across all parameters, i.e., 32 for the first row, 512 for the second row, etc. For consistency, in each iteration, we randomly sample 32 intensity levels from all possible levels, resulting in 32 groups of features that share similar low-level statistics for mutual information minimization. We observe that both too few and too many levels lead to a performance drop. This is because more levels lead to less difference between different groups, while fewer levels mean more difference within each group. We conclude that there exists a trade-off that requires balancing to achieve the best possible training.
Temporal Dependency Head. To further examine the feasibility of the temporal dependency head implementation, we compare three typical examples: 1) MLP: concatenate f′(v) with the correctly ordered or arbitrarily ordered clip features and pass the result through an MLP to obtain a scalar value. 2) GRU: use a GRU to process the clip feature sequence, and calculate the cosine similarity between f′(v) and the GRU output. 3) GRU+MLP: use a GRU to process the clip feature sequence, then concatenate the output with f′(v) and pass it through an MLP to get a scalar value. The results are listed in Table 6. Compared with no temporal constraint, all three implementations show significant improvements. We note that our MLP implementation is similar to VCOP [71], but differs in the learning objective. This improvement reveals that introducing the global video feature as a reference could enhance temporal structure modeling.
Overall Learning Objectives. We finally show the ablation on the designed learning objectives in Table 7, where L_nce is the standard contrastive loss used in existing works. We observe that the integration of L_rc and L_mi significantly outperforms L_nce, which indicates that the detailed region contrast with low-level shortcut elimination is more effective than naive global contrast. Besides, L_td further enables the model to go beyond local clips and establish long-term relationships. The improvement demonstrates that our method integrates detailed region-level contrast and general long-term temporal perception well.
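The intensity-level construction from the low-level augmentation ablation can be sketched as below. The per-parameter level counts are hypothetical (two levels for each of five parameters is merely one setting that yields 32 combinations; the actual counts are those of Table 5), and splitting each parameter range into equal-width bins is also an assumption.

```python
import itertools
import random

def build_intensity_levels(levels_per_param):
    """Enumerate every combination of per-parameter level indices."""
    names = list(levels_per_param)
    ranges = [range(levels_per_param[name]) for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*ranges)]

def level_range(max_value, n_levels, level):
    """Sub-range of [0, max_value] covered by one level (equal-width bins assumed)."""
    width = max_value / n_levels
    return level * width, (level + 1) * width

# Hypothetical setting: two levels for each of the five controllable parameters.
levels_per_param = {"brightness": 2, "contrast": 2, "saturation": 2, "hue": 2, "blur": 2}
all_levels = build_intensity_levels(levels_per_param)

# Each iteration draws 32 levels; samples augmented under the same level form one group.
sampled = random.Random(0).sample(all_levels, 32)
# Brightness sub-range for level 1 when the default strength 0.4 is split in two.
lo, hi = level_range(0.4, 2, 1)
```

Finer bins make neighboring groups harder to tell apart, while coarser bins make each group less homogeneous, which mirrors the trade-off reported in the ablation.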

Conclusion
In this paper, we propose a framework that leverages local clips and global video to enhance self-supervised video representation learning. We employ a set of controllable augmentations to crop local clips and generate groups of samples that share similar low-level attributes. We then use the soft codes computed from the crop and flip parameters to guide detailed spatio-temporal region contrastive learning, and minimize the mutual information within the same low-level group to avoid shortcuts. Meanwhile, we also incorporate local-global temporal dependency to embed general temporal structures into the extracted video representations. Experiments on downstream tasks of action recognition and video retrieval demonstrate the superiority of our formulation, especially in modeling dynamic motion patterns.

Fig. 2. An overview of the proposed local-global composition framework. We define a set of controllable augmentations to generate the global video and local clip input. Based on the extracted features, we perform spatio-temporal region contrastive learning for accurate visual content alignment, and minimize the mutual information between samples that share similar low-level statistics to eliminate the shortcut. We also construct the local-global temporal order dependency to bridge the gap between clip-level and video-level features. Note that in this figure, we use cubes to represent videos or clips; similar cube color means they derive from the same video, and similar color brightness means two cubes share similar low-level statistics.
Fig. 3. CAAM visualization of spatio-temporal feature maps. We compare the results of our method and the contrastive learning baseline. Ours focuses on the moving objects while the baseline tends to emphasize background regions. (a) Results of baseline video contrastive learning. (b) Results of our learning approach.

Table 2. Action recognition results on the Diving-48 dataset. We compare Top-1 accuracy based on V1 action labels.

Table 1. Comparison results for the action recognition downstream task. We provide the training setting of each method, including backbone encoder, pretraining dataset, spatio-temporal resolution and the modality, where 'V' means RGB frames and 'F' means optical flow. We use freeze (tick) to indicate linear probe, while no freeze (cross) denotes end-to-end fine-tuning. For fairness, note that we exclude methods that use different evaluation settings, much deeper backbones or other modalities like audio and text. '*' denotes 200 epochs of pretraining on Kinetics-400.

Table 4. Ablation study on local-global sampling. We show the results with different clip numbers and temporal resolutions of the global video feature. The first line equals the baseline. We report linear probe Top-1 accuracy on UCF-101 and HMDB-51.
The temporal resolution T_v of the global video feature is obtained by adjusting the temporal convolution stride. We show the results in Table 4. By varying the number of local clips K from 1 to 4, we find that having more local clips tends to improve the performance due to more fine-grained feature alignment. It is worth noting that when the ratio T_v/(K T_c) < 1, the granularity of local-global correspondence becomes too coarse, which restricts the performance. Overall, accurate spatio-temporal region correspondence does provide a reliable reference for appearance and motion pattern matching, and significantly improves action recognition.

Table 5. Ablation study on low-level augmentation settings. # denotes the number of intensity levels in Brightness, Contrast, Saturation, Hue and Gaussian Blur. We report linear probe Top-1 accuracy on UCF-101 and HMDB-51.

Table 6. Ablation study on temporal dependency head. None denotes the baseline without temporal constraint, and VCOP follows [71] for comparison.

Table 7. Ablation study on all learning objectives, with columns L_nce, L_rc, L_mi, L_td and results on UCF-101 and HMDB-51. Note that L_nce is the standard contrastive loss function in previous works.