
1 Introduction

Convolutional neural networks (CNNs) have been shown to learn powerful visual representations from large amounts of tediously labeled training data [23]. However, since visual data is cheap to acquire but costly to label, there has recently been great interest in learning compelling features from unlabeled data. Without any annotations, self-supervision based on surrogate tasks, for which the target value can be obtained automatically, is commonly pursued [2, 8, 9, 16, 17, 26, 27, 29,30,31, 33, 34, 38, 44]. In colorization [26], for instance, the color information is stripped from an image and serves as the target value, which has to be recovered. Various surrogate tasks have been proposed, including predicting a sequence of basic motions [29], counting parts within regions [34] or embedding images into text topic spaces [38].

The key competence for visual understanding is to recognize structure in visual data. Thus, breaking the order of visual patterns and training a network to recover the structure provides a rich training signal. This general framework of permuting the input data and learning a feature representation, from which the inverse permutation (and thus the correct order) can be inferred, is a widely applicable strategy. It has been pursued on still images [8,9,10, 33, 35] by employing spatial shuffling of images (especially permuting jigsaw-like tile grids) and on videos [5, 16, 27, 31] by utilizing temporally shuffled sequences. Since spatial and temporal shuffling are both ordering tasks, which differ only in the ordering dimension, they should be addressed jointly.

We observe that there has been unused potential in self-supervision based on ordering: previous work [5, 16, 27, 33, 35] has randomly selected the permutations used for training the CNN. However, can we not find permutations that are of higher utility for improving a CNN representation than the random set? For instance, given a \(3\times 3\) jigsaw grid, shuffling two neighboring image patches, shuffling two patches in faraway corners, or shuffling all patches simultaneously teaches structure of different granularity. Thus, diverse permutations affect the CNN in different ways. Moreover, the effect of the permutations on the CNN changes during training, since the state of the network evolves. During learning we can examine the previous errors the network has made when recovering order and then identify a set of best suited permutations. Therefore, wrapped around the standard back-propagation training of the CNN, we have a reinforcement learning algorithm that acts by proposing permutations for the CNN training. To learn the function for proposing permutations, we simultaneously train the policy and the self-supervised network, utilizing the improvement of the CNN over time as a reward signal.

2 Related Work

We first present previous work on self-supervised learning using a single surrogate task or a combination of several. Then we introduce curriculum learning procedures and discuss meta-learning for deep neural networks.

Self-supervised Representation Learning: In self-supervision, the feature representation is learned indirectly by solving a surrogate task. For that matter, visual data such as images [8,9,10, 17, 26, 40, 49, 52, 55] or videos [5, 16, 27, 29, 31, 39, 51, 52] are utilized as the source of information, but so are text [38] and audio [37]. In contrast to the majority of recent self-supervised learning approaches, Doersch et al. [10] and Wang et al. [52] combine surrogate tasks to train a multi-task network. Doersch et al. choose four surrogate tasks and evaluate a naive and a mediated combination of them. Wang et al., besides a naive multi-task combination of self-supervision tasks, use the learned features to build a graph of semantically similar objects, which is then used to train a triplet loss. Since they combine heterogeneous tasks, both methods use an additional technique on top of the self-supervised training to exploit the full potential of their approach. Our model combines two directly related ordering tasks, which are complementary without the need for additional adjustment approaches.

Curriculum Learning: In 2009 Bengio et al. [3] proposed curriculum learning (CL) to enhance the learning process by gradually increasing the complexity of the task during training. CL has been utilized by different deep learning methods [6, 18, 47], with the limitation that the complexity of samples and their scheduling during training typically have to be established a priori. Kumar et al. [25] define the sample complexity from the perspective of the classifier, but still manually define the scheduling. In contrast, our policy dynamically selects the permutations based on the current state of the network.

Meta-learning for Deep Neural Networks: Recently, methods have been proposed to improve upon the classical training of neural networks by, for example, automating the selection of hyper-parameters [1, 7, 15, 36, 41, 56]. Andrychowicz et al. [1] train a recurrent neural network acting as an optimizer which makes informed decisions based on the state of the network. Fan et al. [15] propose a system to improve the final performance of the network using a reinforcement learning approach which schedules training samples during learning. Opitz et al. [36] use the gradient of the last layer to select uncorrelated samples and improve performance. Similar to [1, 15, 36], we propose a method which affects the training of a network to push towards better performance. In contrast to these supervised methods, where the image labels are fixed, our policy has substantial control over the training of the main network since it can directly alter the input data by proposing permutations.

Fig. 1. (A) Deep RL of a policy for sampling permutations. (B) Permuting training images/videos by the proposed actions of (A) to provide self-supervision for our network architecture (C). (D) Evaluating the updated network (C) on validation data to receive reward and state.

3 Approach

We now present a method for training two self-supervised tasks simultaneously to learn a general and meaningful feature representation. We then introduce a deep reinforcement learning approach to learn a policy that proposes the best suited permutations at a given stage of training.

3.1 Self-supervised Spatiotemporal Representation Learning

We learn a CNN feature representation (CaffeNet [21] architecture up to pool5) for images and individual frames of a video using spatiotemporal self-supervision (see Fig. 1C). Training starts from scratch with a randomly initialized network. To obtain training samples for the spatial ordering task, we divide images into an \(m \times m\) regular grid of tiles as suggested by [33] (Fig. 1B top). For temporal ordering of u frames from a video sequence (Fig. 1B bottom), shuffling is performed on frame level and with augmentation (detailed in Sect. 4.1). Note that we do not require an object-of-interest detection, as done for example in [27, 31] using motion (optical flow), since our approach randomly samples the frames from a video.

For the remainder of this section, we refer to a sample x in general, meaning either a sequence of frames (temporal task) or a partitioned image (spatial task). Let \(x=\left( x_1,x_2,\dots \right) \) be the sample that is to be shuffled by permuting its parts with some index permutation \(\psi _i = (\psi _{i,1},\psi _{i,2}, \dots )\),

$$\begin{aligned} \psi _i(x) := \left( x_{\psi _{i,1}}, x_{\psi _{i,2}}, \dots \right) . \end{aligned}$$
(1)

The set of all possible permutations \(\varPsi ^\star \) contains u! or \((m\cdot m)!\) elements; for \(u=8\), for example, the total number of possible permutations equals \(8!=40320\). For practical reasons, a pre-processing step reduces the set of all possible permutations, following [33], by sampling a set \(\varPsi \subset \varPsi ^\star \) of maximally diverse permutations \(\psi _i \in \varPsi \): we iteratively include the permutation with the maximum Hamming distance to the already chosen ones. Each self-supervised task has its own set of permutations. For simplicity, we explain our approach based on a general \(\varPsi \) without referring to a specific task. To solve the ordering task of undoing the shuffling based on the pool5 features we want to learn (Fig. 1(C)), we need a classifier that can identify the permutation. The classifier begins with an fc6 layer. For spatial ordering, the fc6 outputs of all tiles are stacked in an fc7 layer; for temporal ordering, the fc6 outputs of the frames are combined in a recurrent neural network implemented as an LSTM [19] (see Fig. 1(C) and Sect. 4.1 for implementation details). The output of fc7 or of the LSTM is then processed by a final fully connected classification layer. This last fc layer estimates the permutation \(\psi _i\) applied to the input sample and is trained using a cross-entropy loss. The output activation \( \varphi _i, i \in \{ 1, \dots, |\varPsi |\}\) of the classifier corresponds to the permutation \(\psi _i \in \varPsi \) and indicates how certain the network is that the permutation applied to the input x is \(\psi _i\). The network is trained in parallel with two batches, one of spatially permuted tiles and one of temporally shuffled frames. Back-propagation then provides two gradients, one from the spatial and one from the temporal task, which propagate through the entire network down to conv1.
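
As an illustration of this pre-processing step, the following is a minimal Python sketch of applying a permutation as in Eq. (1) and of a greedy maximal-Hamming-distance selection in the spirit of [33]; all function names are our own and, to keep the sketch short, candidates are drawn at random instead of enumerating all of \(\varPsi ^\star \).

```python
import numpy as np

def hamming(a, b):
    """Number of positions at which two permutations differ."""
    return int(np.sum(np.asarray(a) != np.asarray(b)))

def sample_diverse_permutations(n_parts=8, n_perms=1000, n_candidates=500):
    """Greedily collect n_perms diverse permutations of n_parts elements.
    At each step, random candidates are drawn and the one with the largest minimal
    Hamming distance to the already chosen permutations is kept."""
    chosen = [tuple(range(n_parts))]                      # start from the identity
    while len(chosen) < n_perms:
        candidates = [tuple(np.random.permutation(n_parts)) for _ in range(n_candidates)]
        best = max(candidates, key=lambda c: min(hamming(c, p) for p in chosen))
        chosen.append(best)
    return chosen

def apply_permutation(parts, psi):
    """Eq. (1): reorder the tiles of an image or the frames of a clip according to psi."""
    return [parts[j] for j in psi]
```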

The question now is which permutation to apply to which training sample.

3.2 Finding an Optimal Permutation Strategy by Reinforcement Learning

In previous works [5, 16, 27, 31, 33], for each training sample one permutation is randomly selected from a large set of candidate permutations \(\psi _i \in \varPsi \). Selecting the data permutation independently of the input data is beneficial, as it avoids overfitting to the training data (permutations triggered only by specific samples). However, permutations should be selected conditioned on the state of the network being trained, so that new permutations are sampled according to their utility for learning the CNN representation.

A Markov Decision Process for Proposing Permutations: We need to learn a function that proposes permutations conditioned on the network state and independent of the samples x to avoid overfitting. Obviously, the state of the network cannot be represented directly by the network weights, as their dimensionality would be too high for learning to be feasible. To capture the network state at time step t in a compact state vector s, we measure the performance of the network on a set of validation samples \(x \in X_{val}\). Each x is permuted by some \(\psi _i \in \varPsi \). A forward pass through the network then leads to activations \(\varphi _i\) and a softmax activation of the network,

$$\begin{aligned} y_i^\star&= \frac{exp(\varphi _{i})}{\sum _{k} exp(\varphi _{k})}. \end{aligned}$$
(2)

Given all the samples, the output of the softmax function indicates how well a permutation \(\psi _i\) can already be reconstructed and which ones are hard to recover (low \(y_i^\star \)). Thus, it reflects the complexity of a permutation from the viewpoint of the network, and \(y_i^\star \) can be utilized to capture the network state s. To be precise, we measure the network’s confidence regarding its classification using the ratio of the correct class l vs. the second highest prediction p (or the highest if the true label l is not classified correctly):

$$\begin{aligned} y_l(x) = \frac{y_l^\star (x)+1}{y_p^\star (x)+1}, \end{aligned}$$
(3)

where \(x \in X_{val}\); adding 1 ensures \(0.5 \le y_l \le 2\), so that \(y_l>1\) indicates a correct classification. The state s is then defined as

$$\begin{aligned} s = \begin{bmatrix} y_1(x_1) & \dots & y_1(x_{|X_{val}|}) \\ \vdots & & \vdots \\ y_{|\varPsi |}(x_1) & \dots & y_{|\varPsi |}(x_{|X_{val}|}) \end{bmatrix}, \end{aligned}$$
(4)

where one row contains the softmax ratios of a permutation \(\psi _i\) applied to all samples \(x \in X_{val}\) (see Fig. 1(D)). Using a validation set for determining the state has the advantage of obtaining the utility for all permutations \(\psi _i\) and not only for the ones applied in the previous training phase. Moreover, it guarantees comparability between validations performed at different time points, independent of the policy. The action \(a=(x,\psi _i) \in A = X \times \varPsi \) of training the network by applying a permutation \(\psi _i\) to a random training sample x changes the state s (in practice we sample an entire mini-batch of tuples for one training iteration rather than only one). Training changes the network state s at time point t into \(s'\) according to some transition probability \(T(s^\prime |s,a)\). To evaluate the chosen action a we need a reward signal \(r_t\) given the revised state \(s^\prime \). The challenge is now to find the action which maximizes the expected reward

$$\begin{aligned} R(s,a) = \mathbb {E}[r_{t} | s_t = s,a], \end{aligned}$$
(5)

given the present state of the network. The underlying problem of finding suitable permutations and training the network can be formulated as a Markov Decision Process (MDP) [48], a 5-tuple \(\langle S,A,T,R,\gamma \rangle \), where S is a set of states \(s_t\), A is a set of actions \(a_t\), \(T(s^\prime |s,a)\) the transition probability, R(s, a) the reward and \(\gamma \in [0,1]\) the discount factor which scales future rewards against present ones.
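
To make the construction of s concrete, the following is a minimal sketch of Eqs. (2)-(4) in PyTorch; it assumes a callable classifier that returns the \(|\varPsi |\) output activations \(\varphi \) for a single permuted validation sample, and it reuses the hypothetical apply_permutation helper sketched in Sect. 3.1.

```python
import torch
import torch.nn.functional as F

def network_state(classifier, X_val, permutations, apply_permutation):
    """State matrix s of Eq. (4): one row per permutation, one column per validation
    sample; each entry is the softmax ratio of Eq. (3)."""
    s = torch.zeros(len(permutations), len(X_val))
    with torch.no_grad():
        for l, psi in enumerate(permutations):
            for j, x in enumerate(X_val):
                logits = classifier(apply_permutation(x, psi))   # activations phi_i
                y_star = F.softmax(logits, dim=-1)               # Eq. (2)
                competitors = y_star.clone()
                competitors[l] = -1.0                            # exclude the true class l
                p = competitors.argmax()                         # strongest competing class
                s[l, j] = (y_star[l] + 1.0) / (y_star[p] + 1.0)  # Eq. (3)
    return s
```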

Defining a Policy: As a reward \(r_t\) we need a score which measures the impact the chosen permutations have had on the overall performance in the previous training phase. For that, the error

$$\begin{aligned} \mathcal {E} := 1 - \frac{1}{|\varPsi |\cdot |X_{val}|}\sum \limits _{l=1}^{|\varPsi |}\sum \limits _{x\in X_{val}} \delta _{\,l,\,\mathop {\mathrm {argmax}}\limits _{p\in \{1,\dots ,|\varPsi |\}} y_p^\star (x)} \end{aligned}$$
(6)

with \(\delta \) the Kronecker delta, can be used to assess the influence of a permutation. To make the reward more informative, we compare this value against a baseline (BL), which results from simply extrapolating the error of previous iterations, i.e. \(\mathcal {E}^{BL}_{t+1} = 2\mathcal {E}_t-\mathcal {E}_{t-1}\). We then seek an action that improves upon this baseline. Thus, the reward \(r_t\) obtained at time point \(t+1\) (we use the index t for r at time step \(t+1\) to indicate the connection to \(a_t\)) is defined as

$$\begin{aligned} r_t := \mathcal {E}^{BL}_{t+1} - \mathcal {E}_{t+1}. \end{aligned}$$
(7)

We determine the error using the same validation set as already employed for obtaining the state. In this way no additional computational effort is required.
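
A corresponding sketch of the validation error of Eq. (6) and the reward of Eq. (7), again with hypothetical helper names:

```python
import torch

def validation_error(classifier, X_val, permutations, apply_permutation):
    """Eq. (6): fraction of (permutation, validation sample) pairs classified incorrectly."""
    wrong, total = 0, 0
    with torch.no_grad():
        for l, psi in enumerate(permutations):
            for x in X_val:
                pred = classifier(apply_permutation(x, psi)).argmax().item()
                wrong += int(pred != l)
                total += 1
    return wrong / total

def reward(err_prev, err_curr, err_next):
    """Eq. (7): improvement over the linearly extrapolated baseline error."""
    err_baseline = 2.0 * err_curr - err_prev      # E^BL_{t+1} = 2 E_t - E_{t-1}
    return err_baseline - err_next
```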

Given the earlier defined state s of the network and the actions A we seek to learn a policy function

$$\begin{aligned} \pi (a|s,\theta ) = P(a_t=a|s_t=s,\theta _t = \theta ), \end{aligned}$$
(8)

that, given the parameters \(\theta \) of the policy, proposes an action \(a=(x,\psi _i)\) for a randomly sampled training data point x based on the state s, where \(\pi (a|s,\theta )\) is the probability of applying action \(a \in A\) at time point t given the state s. The parameters \(\theta \) can be learned by maximizing the reward signal r. Neural networks have proven capable of learning powerful approximations of \(\pi \) [32, 45, 48]. However, since the objective (maximizing the reward) is not differentiable, Reinforcement Learning (RL) [48] has become the standard approach for learning \(\pi \) in this setting.

Fig. 2. Training procedure of \(\pi \). The policy proposes actions \([a_{t,k}]^K_{k=1}\) to permute the data X, used for training the unsupervised network. The improvement of the network is then used as reward r to update the policy.

Policy Gradient: There are two main approaches for attacking deep RL problems: Q-Learning and Policy Gradient. We require a policy which models action probabilities to prevent the policy from converging to a small subset of permutations. Thus, we utilize a Policy Gradient (PG) algorithm which learns a stochastic policy and additionally guarantees convergence (at least to a local optimum) as opposed to Q-Learning. The objective of a PG algorithm is to maximize the expected cumulative reward (Eq. 5) by iteratively updating the policy weights through back-propagation. One update at time point \(t+1\) with learning rate \(\alpha \) is given by

$$\begin{aligned} \theta _{t+1} = \theta _t + \alpha \Bigl ( \sum _{t^\prime \ge t} \gamma ^{t^\prime - t} r_{t^\prime }\Bigr ) \nabla \log \pi (a|s,\theta ). \end{aligned}$$
(9)
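
A minimal sketch of this update for a small policy network is given below; the discounted return is shown explicitly, while the moving-average baseline mentioned in Sect. 4.1 is omitted for brevity, and all names are illustrative.

```python
import torch

def discounted_returns(rewards, gamma=0.9):
    """Sum_{t' >= t} gamma^(t'-t) * r_{t'} for every step t of an episode."""
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    return returns

def reinforce_update(optimizer, log_probs, returns, entropies=None, entropy_weight=0.01):
    """One policy-gradient step (Eq. 9): each log-probability is weighted by its return.
    An optional entropy bonus encourages exploration (Sect. 4.1)."""
    returns = torch.as_tensor(returns, dtype=torch.float32)
    loss = -(torch.stack(log_probs) * returns).sum()
    if entropies is not None:
        loss = loss - entropy_weight * torch.stack(entropies).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```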

Action Space: The complexity of deep RL increases significantly with the number of actions. Asking the policy to permute a sample x given the full space \(\varPsi \) leads to a large action space. Thus, we dynamically group the permutations into |C| groups based on the state of the spatiotemporal network. Permutations which are equally difficult or equally easy to classify are grouped together at time point t, and this grouping changes over time according to the state of the network. We utilize the state s (Eq. 4) as input to the grouping approach, where one row \(s_i\) represents the embedding of permutation \(\psi _i\). The policy then proposes one group \(c_j \in C\) of permutations and randomly selects one instance \(\psi _i \in c_j\) of the group. Then a training data point x is randomly sampled and shuffled by \(\psi _i\). This constitutes an action \(a=(x,\psi _i)\). Rather than directly proposing individual permutations \(\psi _i\), this strategy only proposes a set of related permutations \(c_j\). Since \(|C|\ll |\varPsi |\), the effective dimensionality of actions is significantly reduced and learning a policy becomes feasible.
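
A possible implementation of the grouping and of turning a proposed group into an action is sketched below; the paper specifies K-means clustering (Sect. 4.1), while the use of scikit-learn and the helper names are our own choices.

```python
import random
import numpy as np
from sklearn.cluster import KMeans

def group_permutations(s, n_groups=10):
    """Cluster permutations by their row of the state matrix s (Eq. 4), so that
    permutations of similar current difficulty share a group."""
    labels = KMeans(n_clusters=n_groups, n_init=10).fit_predict(np.asarray(s))
    return [np.flatnonzero(labels == j).tolist() for j in range(n_groups)]

def action_from_group(j, groups, permutations, train_set):
    """Action a = (x, psi_i): a random permutation from the proposed group j,
    together with a randomly drawn training sample x."""
    psi = permutations[random.choice(groups[j])]
    x = random.choice(train_set)
    return x, psi
```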

Network State: To obtain a more concise representation \(\hat{s}=[\hat{s}_{j}]_{j=1}^{|C|}\) of the state of the spatiotemporal network (the input to the policy), we aggregate the characteristics of all permutations within a group \(c_j\). Since the actions are directly linked to the groups, the features should contain the statistics of \(c_j\) based on the state of the network. Therefore we utilize per group (i) the number of permutations belonging to \(c_j\) and (ii) the median of the softmax ratios (Eq. 3) over the \((\psi _i,x)\) pairs with \(\psi _i \in c_j\) and \(x \in X_{val}\)

$$\begin{aligned} \hat{s} = \bigl [\,|c_j|,\ \mathrm {median}\left( [s_i]_{\psi _i\in c_j}\right) \bigr ]_{j=1}^{|C|}. \end{aligned}$$
(10)

The median over the softmax ratios reflects how well the spatiotemporal network can classify the set of permutations which are grouped together. Including the size \(|c_j|\) of the groups helps the policy avoid selecting very small groups, which could lead to overfitting of the self-supervised network on certain permutations. The proposed \(\hat{s}\) has proven to be an effective and efficient representation of the state. Including global features, such as the iteration or learning rate utilized in previous work [14, 15], does not help in our scenario; it rather increases the complexity of the state and hinders policy learning. Figure 1(D) depicts the validation process, including the calculation of state \(\hat{s}\) and the reward r.
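
In code, Eq. (10) reduces to a few lines (a sketch, reusing s and groups from the snippets above):

```python
import numpy as np

def compact_state(s, groups):
    """Eq. (10): each group is summarized by its size and the median softmax ratio
    of its (permutation, validation sample) pairs."""
    s = np.asarray(s)
    return np.asarray([[len(members), float(np.median(s[members]))] for members in groups],
                      dtype=np.float32)                 # shape (|C|, 2), input to the policy
```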

Training Algorithm: We train the self-supervised network and the policy simultaneously, where the training can be divided into two phases: the self-supervised training and the policy update (see Fig. 2 and Algorithm 1 in section A of the Supplementary Material). The total training runs for T steps. Between two steps t and \(t+1\), solely the self-supervised network is trained (\(\pi \) is fixed) using SGD for several iterations with the permutations proposed by \(\pi \). Then, \(\hat{s}\) is updated using the validation procedure explained above. At each time step t an episode (one update of \(\pi \)) is performed. During episode t, the policy proposes a batch of K actions \([a_{t,k}]^K_{k=1}\), based on the updated state \(\hat{s}_t\), which are utilized to train the self-supervised network for a small number of iterations. At the end of the episode, another validation pass determines the reward \(r_t\) for updating \(\pi \) (Eq. 9). The two phases alternate until the end of training.
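
The overall schedule can be summarized by the following schematic sketch, which ties together the hypothetical helpers introduced above; network.train_step and the policy network (which maps the compact state to a torch Categorical distribution over groups, cf. the sketch in Sect. 4.1) are placeholders, and all hyperparameter values are illustrative.

```python
import numpy as np

def train(network, policy_net, policy_optimizer, permutations, train_set, X_val,
          T=100, n_sgd_iters=1000, K=64, n_groups=10):
    """Alternating schedule of Fig. 2: long self-supervised phases, short policy episodes."""
    groups = [list(range(j, len(permutations), n_groups)) for j in range(n_groups)]
    dist = None                                   # no policy proposal before the first episode
    err_prev = err_curr = validation_error(network, X_val, permutations, apply_permutation)
    for t in range(T):
        # phase 1: several SGD iterations of the self-supervised network, policy fixed
        for _ in range(n_sgd_iters):
            group_ids = (np.random.randint(n_groups, size=K) if dist is None
                         else dist.sample((K,)).tolist())
            network.train_step([action_from_group(int(j), groups, permutations, train_set)
                                for j in group_ids])

        # validation: refresh state matrix, grouping and error estimate
        s = network_state(network, X_val, permutations, apply_permutation).numpy()
        groups = group_permutations(s, n_groups)
        err_curr = validation_error(network, X_val, permutations, apply_permutation)

        # episode t: K actions from the updated compact state, brief training,
        # reward against the extrapolated baseline, one REINFORCE step (Eq. 9)
        dist = policy_net(compact_state(s, groups))
        group_ids = dist.sample((K,))
        network.train_step([action_from_group(int(j), groups, permutations, train_set)
                            for j in group_ids])
        err_next = validation_error(network, X_val, permutations, apply_permutation)
        reinforce_update(policy_optimizer, [dist.log_prob(group_ids).sum()],
                         [reward(err_prev, err_curr, err_next)],
                         entropies=[dist.entropy()])
        err_prev = err_curr
```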

Computational Extra Costs during Training: With respect to the basic self-supervised training, the extra cost for training the policy derives only from the total number of episodes \(\times \) the time needed for performing an episode. If the number of SGD iterations between two policy updates t and \(t+1\) is significantly higher than the number of steps within an episode, the computational extra cost for training the policy is small in comparison to the basic training. Fortunately, sparse policy updates are possible in our scenario, since the policy network improves significantly faster than the self-supervised network. We observed a computational extra cost of \(\sim \)40% with the optimal parameters. Previous work [14, 56], which utilizes deep RL for meta-learning, needs to repeat the full training of the network several times to learn the policy and is therefore several times slower.

4 Experiments

In this section, we provide additional details regarding the self-supervised training of our approach which we evaluate quantitatively and qualitatively using nearest neighbor search. Then, we validate the transferability of our trained feature representation on a variety of contrasting vision tasks, including image classification, object detection, object segmentation and action recognition (Sect. 4.2). We then perform an ablation study to analyze the gain of the proposed reinforcement learning policy and of combining both self-supervision tasks.

4.1 Self-supervised Training

We first describe all implementation details, including the network architecture and the preprocessing of the training data. We then utilize two different datasets for the evaluation of the feature representation trained only with self-supervision.

Implementation Details: Our shared basic model of the spatiotemporal network up to pool5 has the same architecture as CaffeNet [21] with batch normalization [20] between the conv layers. To train the policy we use the Policy Gradient algorithm REINFORCE (with moving-average subtraction for variance reduction) and add the entropy of the policy to the objective function, which improves exploration and therefore prevents overfitting (proposed by [53]). The policy network contains 2 FC layers, where the hidden layer has 16 dimensions. We use K-means clustering to group the permutations into 10 groups. The validation set contains 100 samples (\(|X_{val}|=100\)) randomly sampled from the training set (and then excluded from training). The still images utilized for the spatial task are chosen from the training set of the Imagenet dataset [43]. For training our model with the temporal task, we utilize the frames from split1 of the human action dataset UCF-101 [46]. We use 1000 initial permutations for both tasks (\(|\varPsi | = 1000\)). Further technical details can be found in the supplementary material, section B.
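
Given these specifications (two FC layers, a 16-dimensional hidden layer, a distribution over the 10 groups), a sketch of such a policy module might look as follows; the ReLU non-linearity and the use of a torch Categorical are our assumptions.

```python
import torch
import torch.nn as nn

class PermutationPolicy(nn.Module):
    """Two fully connected layers (16-dim hidden), producing a categorical
    distribution over the |C| = 10 permutation groups."""
    def __init__(self, n_groups=10, n_features=2, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_groups * n_features, hidden),
            nn.ReLU(),                                  # assumed non-linearity
            nn.Linear(hidden, n_groups),
        )

    def forward(self, s_hat):
        """s_hat: compact state of Eq. (10), shape (|C|, n_features)."""
        logits = self.net(torch.as_tensor(s_hat, dtype=torch.float32).flatten())
        return torch.distributions.Categorical(logits=logits)
```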

Table 1. Quantitative evaluation of our self-supervised trained feature representation using nearest neighbor search on split1 of UCF-101 and the Pascal VOC 2007 dataset. The distance measure is the cosine distance of pool5 features. For UCF-101, 10 frames per video are extracted. Images of the test set are used as queries and the images of the training set as the retrieval targets. We report mean accuracies [%] over all chosen test frames. If the class of a test sample appears within the Topk, it is considered correctly predicted. We compare the results obtained by (i) a random initialization, (ii) a spatial approach [33], (iii) a temporal method [27], and (iv) our model. For extracting the features based on the weights of (ii) and (iii) we utilize their published models

Nearest Neighbor Search: To evaluate unsupervised representation learning, for which no labels are provided, nearest neighbor search is the method of choice. For that, we utilize two different datasets: split1 of the human action dataset UCF-101 and the Pascal VOC 2007 dataset. UCF-101 contains 101 different action classes and over 13k clips. We extract 10 frames per video for computing the nearest neighbors. The Pascal VOC 2007 dataset consists of 9,963 images, containing 24,640 annotated objects which are divided into 20 classes. Based on the default split, 50% of the images belong to the training/validation set and 50% to the testing set. We use the provided bounding boxes of the dataset to extract the individual objects, whereas patches with fewer than 10k pixels are discarded. We use the model trained with our self-supervised approach to extract the pool5 features of the training and testing set; the images have an input size of \(227\times 227\). Then, for every test sample we compute the Topk nearest neighbors in the training set using the cosine distance. A test sample is considered correctly predicted if its class can be found within the Topk nearest neighbors. The final accuracy is then determined by computing the mean over all testing samples. Table 1 shows the accuracy for \(k=1,5,10,20,50\) computed on UCF-101 and Pascal VOC 2007, respectively. It can be seen that our model achieves the highest accuracy for all k, meaning that our method produces more informative features for object/video classification. Note that especially the Top1 accuracy is much higher than that of the other approaches.
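
For reference, this Topk retrieval protocol reduces to the following sketch (features and labels are assumed to be NumPy arrays; all names are illustrative):

```python
import numpy as np

def topk_nn_accuracy(train_feats, train_labels, test_feats, test_labels, ks=(1, 5, 10, 20, 50)):
    """Top-k nearest-neighbor accuracy with cosine distance on pool5 features."""
    train_feats = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test_feats = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    order = np.argsort(-test_feats @ train_feats.T, axis=1)      # neighbors by similarity
    accuracies = {}
    for k in ks:
        neighbor_labels = train_labels[order[:, :k]]             # labels of the k nearest
        hits = (neighbor_labels == test_labels[:, None]).any(axis=1)
        accuracies[k] = 100.0 * hits.mean()                      # accuracy in percent
    return accuracies
```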

We additionally evaluate our features qualitatively by depicting the Top5 nearest neighbors in the training set given a query image from the test set (see Fig. 3). We compare our results with [27, 33], a random initialization, and a network with supervised training using the Imagenet dataset.

Fig. 3. Unsupervised evaluation of the feature representation by nearest neighbor search on the VOC07 dataset. For every test sample we show the Top5 nearest neighbors from the training set (Top1 to Top5 from left to right) using the cosine distance of the pool5 features. We compare the models from (i) supervised training with the Imagenet classification task, (ii) our spatiotemporal approach, (iii) OPN as a temporal approach [27], (iv) Jigsaw as a spatial method [33] and (v) a random initialization.

4.2 Transfer Capabilities of the Self-supervised Representation

Subsequently, we evaluate how well our self-trained representation can transfer to different tasks and also to other datasets. For the following experiments we initialize all networks with our trained model up to conv5 and fine-tune on the specific task using standard evaluation procedures.

Imagenet [43]: The Imagenet benchmark consists of \(\sim \)1.3M images divided into 1000 object categories. The unsupervised features are tested by training a classifier on top of the frozen conv layers. Two experiments are performed: one introduced by [54] using a linear classifier, and one using a two-layer neural network proposed by [33]. Table 2 shows that our features gain more than 2% over the best model with a comparable architecture, and almost 4% on the linear task. The modified CaffeNet introduced by [17] is not directly comparable to our model since it has 60% more parameters due to larger conv layers (group parameter of the caffe framework [21]).

Table 2. Test accuracy [%] of the Imagenet classification task. A Linear [54] and Non-linear [33] classifier are trained over the frozen features (pool5) of the methods shown in the left column. (*: indicates our implementation of the model, +: indicates bigger architecture due to missing groups in the conv layers)
Table 3. Transferability of features learned using self-supervision to action recognition. The network is initialized until conv5 with the approach shown in the left column and fine-tuned on UCF-101 and HMDB-51. Accuracies [%] are reported for each approach. ‘*’: Jigsaw (Noroozi et al. [33]) does not provide results for this task; we replicate their results using our PyTorch implementation

Action Recognition: For evaluating our unsupervised pre-trained network on the action recognition task we use the three splits of two different human action datasets: UCF-101 [46], with 101 different action classes and over 13k clips, and HMDB-51 [24], with 51 classes and around 7k clips. The supervised training is performed using single frames as input, and the network is trained and tested on every split separately. If not mentioned otherwise, all classification accuracies presented in this paragraph are computed as the mean over the three splits of the corresponding dataset. For training and testing we utilize the PyTorch implementation provided by Wang et al. [50] for data augmentation and for the fine-tuning and evaluation steps, but network architecture and hyperparameters are retained from our model. Table 3 shows that we outperform the state-of-the-art by 2.3% on UCF-101 and 2.9% on HMDB-51. During its self-supervised training our network has never seen videos from the HMDB-51 dataset, showing that our model transfers well to another dataset.

Table 4. Evaluating the transferability of representations learned using self-supervision to three tasks on Pascal VOC. We initialize the network until conv5 with the method shown in the left column and fine-tune for (i) multi-label image classification [22], (ii) object detection using Fast R-CNN [42] and (iii) image segmentation [28]. (i) and (ii) are evaluated on PASCAL VOC’07, (iii) on PASCAL VOC’12. For (i) and (ii) we show the mean average precision (mAP), for (iii) the mean intersection over union (mIoU). The fine-tuning has been performed using the standard CaffeNet, without batch normalization and with the group parameter set to 2 for conv2, conv4 and conv5. (‘+’: significantly larger conv layers)

Pascal VOC: We evaluate the transferability of the unsupervised features by fine-tuning on three different tasks: multi-class object classification and object detection on Pascal VOC 2007 [12], and object segmentation on Pascal VOC 2012 [13]. In order to be comparable to previous work, we fine-tune the model without batch normalization, using the standard CaffeNet with groups in conv2, conv4 and conv5. Previous methods using deeper networks, such as [10, 52], are omitted from Table 4. For object classification we fine-tune our model on the dataset using the procedure described in [22]. We do not require the pre-processing and initialization method described in [22] for any of the shown experiments. For object detection we train Fast R-CNN [42] following the experimental protocol described in [42]. We use FCN [28] to fine-tune our features on the segmentation task. The results in Table 4 show that we significantly improve upon the other approaches. Our method even outperforms [17] in object classification and segmentation, although [17] uses batch normalization during fine-tuning as well and a larger network due to the group parameter in the conv layers.

Fig. 4. Permutations chosen by the policy in each training episode. For legibility, \(\psi _i\) are grouped by validation error into four groups. The policy, updated after every episode, learns to sample hard permutations more often in later iterations.

Fig. 5. The test accuracy from Top1 nearest neighbor search evaluation on VOC07 is used for comparing different ablations of our architecture during training. The curves show a faster improvement of the features when the policy (P) is used.

4.3 Ablation Study

In this section, we compare the performances of the combined spatiotemporal (S+T) model with the single tasks (S,T) and show the improvements achieved by training the networks with the permutations proposed by the policy (P).

Table 5. We compare the different models on the multi-object classification task using Pascal VOC07 and on the action recognition task using UCF-101. (S): Spatial task, (T): Temporal task, (S+T): Spatial and Temporal task simultaneously, (S+P): Spatial task + Policy, (T+P): Temporal task + Policy, (S&T): first solely Spatial task, followed by solely Temporal task, (S+T+P): all approaches simultaneously

Unsupervised Feature Evaluation: In Fig. 5 the models are evaluated on the Pascal VOC object classification task without any further fine-tuning by extracting pool5 features and computing cosine similarities for nearest neighbor search as described in Sect. 4.1. This unsupervised evaluation shows how well the unsupervised features generalize to a primary task, such as object classification. Figure 5 illustrates that the combined spatiotemporal model (S+T) clearly outperforms the networks trained on only one task (by 7% over the spatial and 14% over the temporal model). Furthermore, the combined network improves faster, which may be explained by the regularization effect that the temporal task has on the spatial task and vice versa. Figure 5 also shows that each of the three models gains substantially when the CNN is trained using the policy. Our final model, the spatiotemporal task with policy (S+T+P), almost reaches the level of the supervised features (the “imagenet” line in Fig. 5).

Supervised Fine-Tuning: In Table 5, a supervised evaluation has been performed starting from the self-supervised features. Each model is fine-tuned on the multi-class object classification task on Pascal VOC 2007 and on video classification using UCF-101. The results are consistent with the unsupervised evaluation, showing that the features of the spatiotemporal model (S+T) outperform both single-task models and that the methods with the RL policy (S+P and T+P) improve over their baseline models. The combination of the two tasks has been performed in parallel (S+T) and in a serial manner (S&T), by initializing the temporal task with the features trained on the spatial task. Training the permutation tasks in parallel provides a large gain over the serial version, showing that the two tasks benefit from each other and should be trained together.

Policy Learning: Figure 4 shows the permutations chosen by the policy as it is trained over different episodes (x-axis). The aim of this experiment is to analyze the learning behavior of the policy. For this reason we initialize the policy network randomly and the CNN model from an intermediate checkpoint (average validation error 72.3%). Per episode, the permutations are divided into four complexity levels (based on the validation error) and the relative count of permutations selected by the policy is shown per complexity. The policy selects the permutations uniformly in the first three episodes, but then learns to sample more frequently from the hard permutations (high error; top red) and less from the easy permutations (bottom purple), without overfitting to a specific complexity but mixing the hard classes with intermediate ones.

Figure 6 depicts the spatial validation error over the whole training process of the spatiotemporal network with and without the policy. The results are consistent with the unsupervised evaluation, showing a faster improvement when training with the permutations proposed by the policy than with random permutations. Note that (B) in Fig. 6 shows a uniform improvement over all permutations, whereas (A) demonstrates the selection process of the policy with a non-uniform decrease in error.

Fig. 6. Error over time of the spatial task, computed using the validation set and sorted by the average error. Each row shows how the error for one permutation evolves over time. (A): with policy, (B): without policy.

5 Conclusion

We have brought together the two directly related self-supervision tasks of spatial and temporal ordering. To sample the data permutations, which are at the core of any surrogate ordering task, we have proposed an RL-based policy that requires only a relatively small computational extra cost during training compared to the basic training. The sampling policy adapts to the state of the network that is being trained; as a result, permutations are sampled according to their expected utility for improving representation learning. In experiments on diverse tasks ranging from image classification and segmentation to action recognition in videos, our adaptive policy for spatiotemporal permutations has shown favorable results compared to the state-of-the-art.