Similarity contrastive estimation for image and video soft contrastive self-supervised learning

Denize, Julien; Rabarisoa, Jaonary; Orcesi, Astrid; Hérault, Romain

doi:10.1007/s00138-023-01444-9

Similarity contrastive estimation for image and video soft contrastive self-supervised learning

Original Paper
Open access
Published: 26 September 2023

Volume 34, article number 111, (2023)
Cite this article

Download PDF

You have full access to this open access article

Machine Vision and Applications Aims and scope Submit manuscript

Similarity contrastive estimation for image and video soft contrastive self-supervised learning

Download PDF

1094 Accesses
3 Citations
3 Altmetric
Explore all metrics

Abstract

Contrastive representation learning has proven to be an effective self-supervised learning method for images and videos. Most successful approaches are based on Noise Contrastive Estimation (NCE) and use different views of an instance as positives that should be contrasted with other instances, called negatives, that are considered as noise. However, several instances in a dataset are drawn from the same distribution and share underlying semantic information. A good data representation should contain relations between the instances, or semantic similarity and dissimilarity, that contrastive learning harms by considering all negatives as noise. To circumvent this issue, we propose a novel formulation of contrastive learning using semantic similarity between instances called Similarity Contrastive Estimation (SCE). Our training objective is a soft contrastive one that brings the positives closer and estimates a continuous distribution to push or pull negative instances based on their learned similarities. We validate empirically our approach on both image and video representation learning. We show that SCE performs competitively with the state of the art on the ImageNet linear evaluation protocol for fewer pretraining epochs and that it generalizes to several downstream image tasks. We also show that SCE reaches state-of-the-art results for pretraining video representation and that the learned representation can generalize to video downstream tasks. Source code is available here: https://github.com/juliendenize/eztorch.

Delving into Inter-Image Invariance for Unsupervised Visual Representations

Article 24 September 2022

RegionCL: Exploring Contrastive Region Pairs for Self-supervised Representation Learning

Motion Sensitive Contrastive Learning for Self-supervised Video Representation

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Self-Supervised learning (SSL) is an unsupervised learning procedure in which the data provide its own supervision to learn a practical representation of the data. A pretext task is designed to make this supervision. The pretrained model is then fine-tuned on downstream tasks, and several works have shown that a self-supervised pretrained network can outperform its supervised counterpart for image [1,2,3] and video [4, 5]. It has been successfully applied to various image and video applications such as image classification, action classification, object detection and action localization.

Contrastive learning is a state-of-the-art self-supervised paradigm based on Noise Contrastive Estimation (NCE) [6] whose most successful applications rely on instance discrimination [7,8,9,10]. Pairs of views from same images or videos are generated by carefully designed data augmentations [4, 8, 11]. Elements from the same pairs are called positives, and their representations are pulled together to learn view invariant features. Other instances called negatives are considered as noise, and their representations are pushed away from positives. Frameworks based on contrastive learning paradigm require a procedure to sample positives and negatives to learn a good data representation. Videos add the time dimension that offers more possibilities than images to generate positives such as sampling different clips as positives [4, 12], using different temporal context [13,14,15].

A large number of negatives are essential [16], and various strategies have been proposed to enhance the number of negatives [7, 8, 17, 18]. Sampling hard negatives [18,19,20,21,22] improves the representations but can be harmful if they are semantically false negatives which causes the “class collision problem” [23,24,25].

Other approaches that learn from positive views without negatives have been proposed by predicting pseudo-classes of different views [1, 3, 26], minimizing the feature distance of positives [2, 4, 27] or matching the similarity distribution between views and other instances [28]. These methods free the mentioned problem of sampling hard negatives.

Based on the weaknesses of contrastive learning using negatives, we introduce a self-supervised soft contrastive learning approach called Similarity Contrastive Estimation (SCE) that contrasts positive pairs with other instances and leverages the push of negatives using the inter-instance similarities. Our method computes relations defined as a sharpened similarity distribution between augmented views of a batch. Each view from the batch is paired with a differently augmented query. Our objective function will maintain for each query the relations and contrast its positive with other images or videos. A memory buffer is maintained to produce a meaningful distribution. Experiments on several datasets show that our approach outperforms our contrastive and relational baselines MoCov2 [29] and ReSSL [28] on images. We also demonstrate using relations for video representation learning is better than contrastive learning.

Our contributions can be summarized as follows:

We propose a self-supervised soft contrastive learning approach called Similarity Contrastive Estimation (SCE) that contrasts pairs of augmented instances with other instances and maintains relations among instances for either image or video representation learning.
We demonstrate that SCE outperforms on several benchmarks its baselines MoCov2 [29] and ReSSL [28] on images on the same architecture.
We show that our proposed SCE is competitive with the state of the art on the ImageNet linear evaluation protocol and generalizes to several image downstream tasks.
We show that our proposed SCE reaches state-of-the-art results for video representation learning by pretraining on the Kinetics400 dataset as we beat or match previous top-1 accuracy for finetuning on HMDB51 and UCF101 for ResNet3D-18 and ResNet3D-50. We also demonstrate it generalizes to several video downstream tasks.

2 Related work

2.1 Image self-supervised learning

Early self-supervised learning In early works, different pretext tasks to perform Self-Supervised Learning have been proposed to learn a good data representation. They consist in transforming the input data or part of it to perform supervision such as: instance discrimination [30], patch localization [31], colorization [32], jigsaw puzzle [33], counting [34], angle rotation prediction [35].

Contrastive learning Contrastive learning is a learning paradigm [1, 2, 7, 8, 11, 16, 17, 21, 22, 36,37,38,39] that outperformed previously mentioned pretext tasks. Most successful methods rely on instance discrimination with a positive pair of views from the same image contrasted with all other instances called negatives. Retrieving lots of negatives is necessary for contrastive learning [16], and various strategies have been proposed. MoCo(v2) [7, 29] uses a small batch size and keeps a high number of negatives by maintaining a memory buffer of representations via a momentum encoder. Alternatively, SimCLR [8, 40] and MoCov3 [41] use a large batch size without a memory buffer, and without a momentum encoder for SimCLR.

Sampler for contrastive learning All negatives are not equal [23], and hard negatives, negatives that are difficult to distinguish with positives, are the most important to sample to improve contrastive learning. However, they are potentially harmful to the training because of the “class collision” problem [23,24,25]. Several samplers have been proposed to alleviate this problem such as debiasing negatives sampling [25] further improved by selecting hard negatives [19], or using the nearest neighbor as positive for NNCLR [22]. Truncated-triplet [39] optimizes a triplet loss using the k-th similar element as negative that showed significant improvement. It is also possible to generate views by adversarial learning as AdCo [21] showed. Some other works [42, 43] proposed a denoised contrastive loss that reduces or reverses the gradient for medium and highly similar negatives. They use hard margins between different categories of negatives. Instead, we propose a soft contrastive loss that seeks to estimate relations between instances and consider all negatives equally.

Contrastive learning without negatives Various siamese frameworks perform contrastive learning without the use of negatives to avoid the class collision problem. BYOL [2] trains an online encoder to predict the output of a momentum updated target encoder. SwAV [1] enforces consistency between online cluster assignments from learned prototypes. DINO [3] proposes a self-distillation paradigm to match distribution on pseudo class from an online encoder to a momentum target encoder. Barlow-Twins [44] aligns the cross-correlation matrix between two paired outputs to the identity matrix that VICReg [45] stabilizes by adding an intra-batch decorrelation loss function.

Regularized contrastive learning Several works regularize contrastive learning by optimizing a contrastive objective along with an objective that considers the similarities among instances. CO2 [24] adds a consistency regularization term that matches the distribution of similarity for a query and its positive. PCL [46] and WCL [47] combines unsupervised clustering with contrastive learning to tighten representations of similar instances.

Relational learning and knowledge distillation Contrastive learning implicitly learns the relations, also called semantic similarity, between instances based on the meaning or semantics they convey by optimizing alignment and matching a prior distribution [48, 49]. ReSSL [28] introduces an explicit relational learning objective by maintaining consistency of pairwise similarities between strong and weak augmented views. The pairs of views are not directly aligned which harms the discriminative performance. Other approaches relied on self-supervised knowledge distillation [50,51,52] for which a student model seeks to predict the distribution of similarities among instances computed by a larger pretrained teacher. As such, in opposition with contrastive and relational learning and therefore our approach, knowledge distillation is not an end-to-end approach and requires a former pretraining.

Masked modeling Masked modeling [53, 54] has shown impressive results in Natural Language Processing tasks using the transformer architecture [55]. More recently, it has been successfully applied to the vision domain thanks to advances on vision transformers [56, 57] which use attentions on tokens made by projecting patches of images in a token space. Specifically designed pretext tasks relying on mask modeling for images have been proposed [58,59,60]. The general idea of mask modeling is masking a part of the input and predicting the masked parts either at token level or at pixel level. It has shown competitive performance on transformer architectures with contrastive learning.

In our work, we optimize a contrastive learning objective using negatives that alleviate class collision by pulling related instances. We do not use a regularization term but directly optimize a soft contrastive learning objective that leverages the contrastive and relational aspects. As we performed a study using convolutional networks, we did not perform a comparative study with Mask Modeling approaches which rely on transformers that require supplementary computational resources.

2.2 Video self-supervised learning

Video Self-Supervised Learning follows the advances of Image Self-Supervised Learning and often picked ideas from the image modality with adjustment and improvement to make it relevant for videos and make best use of it.

Pretext tasks As for images, in early works several pretext tasks have been proposed on videos. Some were directly picked from images such as rotation [61], solving Jigsaw puzzles [62], but others have been designed specifically for videos. These specific pretext-tasks include predicting motion and appearance [63], the shuffling of frame [64, 65] or clip [66, 67] order, predicting the speed of the video [68, 69]. These methods have been replaced over time by more performing approaches that are less limited by a specific pretext task to learn a good representation. Recently, TransRank [5] introduced a new paradigm to perform temporal and spatial pretext tasks prediction on a clip relatively to other transformations to the same clip and showed promising results.

Contrastive learning Video Contrastive Learning [4, 9, 10, 12,13,14,15, 70,71,72] has been widely studied in the recent years as it gained interest after its better performance than standard pretext tasks in images. Several works studied how to form positive views from different clips [4, 10, 12, 13] to directly apply contrastive methods from images. CVRL [12] extended SimCLR to videos and propose a temporal sampler for creating temporally overlapped but not identical positive views which can avoid spatial redundancy. Also, [4] extended SimCLR, MoCo, SwaV and BYOL to videos and studied the effect of using random sampled clips from a video to form views. They pushed further the study to sample several positives to generalize the Multi-crop procedure introduced for images by [1]. Some works focused on combining contrastive learning and predicting a pretext task [73,74,75,76,77, 82]. To help better represent the time dimension, several approaches were designed to use different temporal context width [13,14,15] for the different views.

Multi-modal learning To improve self-supervised representation learning, several approaches made use of several modalities to better capture the spatio-temporal information provided by a video. It can be from text [78, 79], audio [14, 73, 80], and optical flow [10, 14, 26, 70, 73, 81, 82].

Masked modeling Transformers have been extended from images to videos for learning spatio-temporal representations [83, 84]. Approaches on videos for Masked Modeling [85,86,87] essentially converted pretext tasks from images to videos by considering spatio-temporal masking of tokens instead of simply spatial tokens.

In our work, we propose a soft contrastive learning objective using only RGB frames that directly generalizes our approach from image with changes related to data processing and architectures. To the best of our knowledge, we are the first to introduce the concept of soft contrastive learning using relations for video self-supervised representation learning. As for images, we did not perform a thorough comparative study with Mask Modeling as these methods rely on transformers and we worked with convolutional networks.

3 Methodology

In this section, we will introduce our baselines: MoCov2 [29] for the contrastive aspect and ReSSL [28] for the relational aspect. We will then present our self-supervised soft contrastive learning approach called Similarity Contrastive Estimation (SCE). All these methods share the same architecture illustrated in Fig. 1a. We provide the pseudo-code of our algorithm in Appendix B.

3.1 Contrastive and relational learning

Consider ${\textbf{x}}=\{\mathbf {x_k}\}_{k\in \{1,..., N\}}$ a batch of N images. Siamese momentum methods based on Contrastive and Relational learning, such as MoCo [7] and ReSSL [28], respectively, produce two views of ${\textbf{x}}$, $\mathbf {x^1} = t^1({\textbf{x}})$ and $\mathbf {x^2} = t^2({\textbf{x}})$, from two data augmentation distributions $T^1$ and $T^2$ with $t^1 \sim T^1$ and $t^2 \sim T^2$. For ReSSL, $T^2$ is a weak data augmentation distribution compared to $T^1$ to maintain relations. $\mathbf {x^1}$ passes through an online network $f_s$ followed by a projector $g_s$ to compute $\mathbf {z^1} = g_s(f_s(\mathbf {x^1}))$. A parallel target branch containing a projector $g_t$ and an encoder $f_t$ computes $\mathbf {z^2} = g_t(f_t(\mathbf {x^2}))$. $\mathbf {z^1}$ and $\mathbf {z^2}$ are both $l_2$-normalized.

The online branch parameters $\theta _s$ are updated by gradient ($\nabla $) descent to minimize a loss function $ {\mathcal {L}}$. The target branch parameters $\theta _t$ are updated at each iteration by exponential moving average of the online branch parameters with the momentum value m, also called keep rate, to control the update such as:

$$\begin{aligned}&\theta _s \leftarrow optimizer(\theta _s, \nabla _{\theta _s}{\mathcal {L}}), \end{aligned}$$

(1)

$$\begin{aligned}&\theta _t \leftarrow m\theta _t + (1 - m) \theta _s. \end{aligned}$$

(2)

MoCo uses the InfoNCE loss, a similarity-based function scaled by the temperature $\tau $ that maximizes agreement between the positive pair and push negatives away:

$$\begin{aligned} L_{InfoNCE} = - \frac{1}{N} \sum _{i=1}^N \log \left( \frac{\exp (\mathbf {z^1_i} \cdot \mathbf {z^2_i} / \tau )}{\sum _{j=1}^N\exp (\mathbf {z^1_i} \cdot \mathbf {z^2_j} / \tau )}\right) . \end{aligned}$$

(3)

ReSSL computes a target similarity distribution $\mathbf {s^2}$ that represents the relations between weak augmented instances, and the distribution of similarity $\mathbf {s^{1}}$ between the strongly augmented instances with the weak augmented ones. Temperature parameters are applied to each distribution: $\tau $ for $\mathbf {s^{1}}$ and $\tau _m$ for $\mathbf {s^{2}}$ with $\tau > \tau _m$ to eliminate noisy relations. Indeed, as the temperature decreases, it exponentially increases softmax values for highly similar instances and decreases exponentially values for low similar instances which makes them negligible in the target distribution. The loss function is the cross-entropy between $\textbf{s}^{2}$ and ${\textbf{s}}^{1}$:

$$\begin{aligned}{} & {} s^{1}_{ik} = \frac{\mathbb {1}_{i \ne k} \cdot \exp (\textbf{z}^\textbf{1}_\textbf{i} \cdot \textbf{z}^\textbf{2}_\textbf{k} / \tau )}{\sum _{j=1}^{N}{\mathbb {1}}_{i \ne j} \cdot \exp (\textbf{z}^\textbf{1}_\textbf{i} \cdot \textbf{z}^\textbf{2}_\textbf{j} / \tau )}, \end{aligned}$$

(4)

$$\begin{aligned}{} & {} s^{2}_{ik} = \frac{{\mathbb {1}}_{i \ne k} \cdot \exp (\textbf{z}^\textbf{2}_\textbf{i} \cdot \textbf{z}^\textbf{2}_\textbf{k} / \tau _m)}{\sum _{j=1}^{N}{\mathbb {1}}_{i \ne j} \cdot \exp (\textbf{z}^\textbf{2}_\textbf{i} \cdot \textbf{z}^\textbf{2}_\textbf{j} / \tau _m)}, \end{aligned}$$

(5)

$$\begin{aligned}{} & {} L_{ReSSL} = - \frac{1}{N} \sum _{i=1}^N\sum _{\begin{array}{c} k=1 \\ k\ne i \end{array}}^N s^{2}_{ik} \log \left( s^{1}_{ik}\right) . \end{aligned}$$

(6)

A memory buffer of size $M>> N$ filled by $\mathbf {z^2}$ is maintained for both methods.

3.2 Similarity contrastive estimation

Contrastive Learning methods damage relations among instances which Relational Learning correctly build. However, Relational Learning lacks the discriminating features that contrastive methods can learn. If we take the example of a dataset composed of cats and dogs, we want our model to be able to understand that two different cats share the same appearance, but we also want our model to learn to distinguish details specific to each cat. Based on these requirements, we propose our approach called Similarity Contrastive Estimation (SCE).

We argue that there exists a true distribution of similarity $\mathbf {w_i^*}$ between a query $\mathbf {q_i}$ and the instances in a batch of N images ${\textbf{x}}=\{\textbf{x}_\textbf{k}\}_{k\in \{1,..., N\}}$, with $\textbf{x}_\textbf{i}$ a positive view of $\mathbf {q_i}$. If we had access to $\textbf{w}_\textbf{i}^\mathbf {*}$, our training framework would estimate the similarity distribution $\mathbf {p_i}$ between $\mathbf {q_i}$ and all instances in ${\textbf{x}}$, and minimize the cross-entropy between $\mathbf {w_i^*}$ and $\mathbf {p_i}$ which is a soft contrastive learning objective:

$$\begin{aligned} L_{SCE^*} = - \frac{1}{N}\sum _{i=1}^N\sum _{k=1}^N w^*_{ik}\log \left( p_{ik}\right) . \end{aligned}$$

(7)

$L_{SCE^*}$ is a soft contrastive approach that generalizes InfoNCE and ReSSL objectives. InfoNCE is a hard contrastive loss that estimates $\mathbf {w_i^*}$ with a one-hot label and ReSSL estimates $\mathbf {w_i^*}$ without the contrastive component.

We propose an estimation of $\mathbf {w_i^*}$ based on contrastive and relational learning. We consider $\mathbf {x^1} = t^1({\textbf{x}})$ and $\mathbf {x^2} = t^2({\textbf{x}})$ generated from ${\textbf{x}}$ using two data augmentations $t^1 \sim T^1$ and $t^2 \sim T^2$. Both augmentation distributions should be different to estimate different relations for each view as shown in Sect. 4.1.1. We compute $\mathbf {z^1} = h_s(g_s(f_s(\mathbf {x^1})))$ from the online encoder $f_s$, projector $g_s$ and optionally a predictor $h_s$ [2, 41]). We also compute $\mathbf {z^2} = g_t(f_t(\mathbf {x^2}))$ from the target encoder $f_t$ and projector $g_t$. $\mathbf {z^1}$ and $\mathbf {z^2}$ are both $l_2$-normalized.

The similarity distribution $\mathbf {s^2_i}$ that defines relations between the query and other instances is computed via Eq. (5). The temperature $\tau _m$ sharpens the distribution to only keep relevant relations. A weighted positive one-hot label is added to $\mathbf {s^2_i}$ to build the target similarity distribution $\mathbf {w^2_i}$:

$$\begin{aligned} w^2_{ik} = \lambda \cdot \mathbb {1}_{i=k} + (1 - \lambda ) \cdot s^2_{ik}. \end{aligned}$$

(8)

The online similarity distribution $\mathbf {p^1_i}$ between $\mathbf {z^1_i}$ and $\mathbf {z^2}$, including the target positive representation in opposition with ReSSL, is computed and scaled by the temperature $\tau $ with $\tau > \tau _m$ to build a sharper target distribution:

$$\begin{aligned} p^1_{ik} = \frac{\exp (\mathbf {z^1_i} \cdot \mathbf {z^2_k} / \tau )}{\sum _{j=1}^{N}\exp (\mathbf {z^1_i} \cdot \mathbf {z^2_j} / \tau )}. \end{aligned}$$

(9)

The objective function illustrated in Fig. 1b is the cross-entropy between each $\mathbf {w^2}$ and $\mathbf {p^1}$:

$$\begin{aligned} L_{SCE} = - \frac{1}{N} \sum _{i=1}^N\sum _{k=1}^N w^2_{ik} \log \left( p^1_{ik}\right) . \end{aligned}$$

(10)

The loss can be symmetrized by passing $\mathbf {x^1}$ and $\mathbf {x^2}$ through the momentum and online encoders and averaging the two losses computed.

A memory buffer of size $M>> N$ filled by $\mathbf {z^2}$ is maintained to better approximate the similarity distributions.

The following proposition explicitly shows that SCE optimizes a contrastive learning objective while maintaining inter-instance relations:

Proposition 1

$L_{SCE}$ defined in Eq. (10) can be written as:

$$\begin{aligned} L_{SCE} = \lambda \cdot L_{InfoNCE} + \mu \cdot L_{ReSSL} + \eta \cdot L_{Ceil}, \end{aligned}$$

(11)

with $\mu = \eta = 1 - \lambda $ and

$$\begin{aligned} L_{Ceil} = - \frac{1}{N} \sum _{i=1}^{N}\log \left( \frac{\sum _{j=1}^{N}\mathbb {1}_{i \ne j} \cdot \exp (\textbf{z}^\textbf{1}_\textbf{i} \cdot \textbf{z}^\textbf{2}_\textbf{j} / \tau )}{\sum _{j=1}^{N} \exp (\textbf{z}^\textbf{1}_\textbf{i} \cdot \textbf{z}^\textbf{2}_\textbf{j} / \tau )}\right) . \end{aligned}$$

The proof separates the positive term and negatives. It can be found in Appendix C. $L_{Ceil}$ leverages how similar the positives should be with hard negatives. Because our approach is a soft contrastive learning objective, we optimize the formulation in Eq. (10) and have the constraint $\mu = \eta = 1 - \lambda $. It frees our implementation from having three losses to optimize with two hyperparameters $\mu $ and $\eta $ to tune. Still, we performed a small study of the objective defined in Eq. (11) without this constraint to check if $L_{Ceil}$ improves results in Sect. 4.1.1.

Table 1 Different distributions of data augmentations applied to SCE

Similarity contrastive estimation for image and video soft contrastive self-supervised learning

Abstract

Similar content being viewed by others

Delving into Inter-Image Invariance for Unsupervised Visual Representations

RegionCL: Exploring Contrastive Region Pairs for Self-supervised Representation Learning

Motion Sensitive Contrastive Learning for Self-supervised Video Representation

1 Introduction

2 Related work

2.1 Image self-supervised learning

2.2 Video self-supervised learning

3 Methodology

3.1 Contrastive and relational learning

3.2 Similarity contrastive estimation

Proposition 1

4 Empirical study

4.1 Image study

4.1.1 Ablation study

4.1.2 Comparison with our baselines

4.1.3 ImageNet linear evaluation

4.1.4 Transfer learning

4.2 Video study

4.2.1 Ablation study

4.2.2 Comparison with the state of the art

5 Conclusion

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Supplementary Information

Supplementary file 1 (pdf 342 KB)

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation