Abstract
Finding general evaluation metrics for unsupervised representation learning techniques is a challenging open research question, which has recently become more pressing due to the increasing interest in unsupervised methods. Even though these methods promise beneficial representation characteristics, most approaches currently suffer from the objective function mismatch. This mismatch states that the performance on a desired target task can decrease when the unsupervised pretext task is learned too long, especially when both tasks are ill-posed. In this work, we build upon the widely used linear evaluation protocol and define new general evaluation metrics to quantitatively capture the objective function mismatch and the more generic metrics mismatch. We discuss the usability and stability of our protocols on a variety of pretext and target tasks and study mismatches in a wide range of experiments. Thereby we disclose dependencies of the objective function mismatch across several pretext and target tasks with respect to the pretext model’s representation size, target model complexity, pretext and target augmentations as well as pretext and target task types. In our experiments, we find that the objective function mismatch reduces performance by \(\sim\)0.1–5.0% for Cifar10, Cifar100 and PCam in many setups, and by up to \(\sim\)25–59% in extreme cases for the 3dshapes dataset.
1 Introduction
Unsupervised Representation Learning is a promising approach to learn useful features from huge amounts of data without human annotation effort. A common evaluation pattern is to train an unsupervised pretext model on different datasets and then test its performance on several target tasks. Because of the huge variety of target tasks and preferred representation characteristics, the evaluation of these methods is challenging. In recent work, a large number of evaluation metrics have been proposed [24, 39, 40, 45], but because of the fast changes in unsupervised learning methodologies, only a few of them can be used across the wide spectrum of promising approaches. This is one reason why the linear evaluation protocol is now commonly used [9, 14, 16, 21, 32, 37, 46], which trains a linear model for a target task on top of the representations of an unsupervised pretext model. In this work, we show that simply training a target model for different layers of the pretext model does not yield the entire picture of the training process and leads to a loss of useful temporal information about learning. It is already known in the literature that succeeding in a pretext task can be the reason why the model fails on the target task. Here we propose that the linear evaluation protocol does not capture this properly. Therefore, we extend this protocol and address the question of when succeeding in a pretext task hurts performance and how much. We train target models on representations obtained from different training steps or epochs of the pretext model and plot target and pretext model metrics in comparison, as shown in Fig. 1. Thereby we observe that training an unsupervised pretext model too long can lead to an objective function mismatch [41, 58] between the objectives used to train both models. This mismatch leads to a drop in performance on the target task, while the pretext model and the target models still converge correctly, as can be seen in Fig. 1.
To quantify our results, we formally define soft and hard versions of two simple and general evaluation metrics: the metrics mismatch and the objective function mismatch. With these metrics, we then evaluate different image-based pretext task types for self-supervised learning using the linear evaluation protocol.
Our contributions can be summarized as follows:

We propose hard and soft versions of general metrics to measure and compare mismatches of (unsupervised) representation learning methods across different target tasks (Sect. 3 and 4). To the best of our knowledge, this has not been done before.

We discuss the usability and stability of our protocols on a variety of pretext and target tasks (Sect. 6.2).

In our experiments we qualitatively show dependencies of the objective function mismatch with respect to the pretext model’s representation size (Sect. 6.3), target model complexity (Sect. 6.4), pretext and target augmentations (Sect. 6.5) as well as pretext and target task types (Sect. 6.6).

We find that the objective function mismatch can reduce performance on various benchmarks. Specifically, we observe a performance decrease by \(\sim\)0.1–5.0% for Cifar10, Cifar100 and PCam, and up to \(\sim\)25–59% in extreme cases for the 3dshapes dataset (Sect. 6).
2 Related work
2.1 Unsupervised representation learning
Many unsupervised representation learning algorithms are based on self-supervised learning [26, 52, 53], which obtains labels directly from data without human annotation to define a pretext task. There are several approaches to self-supervision:
Generation-based self-supervision examines the generation of an arbitrary output from a learned representation of the given input. One line of work improves on autoencoders [51] and variational autoencoders [31] by defining generation-based pretext tasks which lead to representations valuable for required target tasks (e.g., object classification or detection). Examples are denoising [8, 61], colorization [34, 35, 73], or inpainting of images [47, 67]. Recently, a second line of work based on GANs [17] emerged, which adjusts their latent space for representation learning, for example by constraining [50] or changing [14] the architecture. In a third line of research, an autoregressive, transformer-based model achieved state-of-the-art performance on visual representation learning by sequential image generation [20]. Generation-based self-supervision is applied to other modalities as well, e.g., audio [20] or video [57, 62].
Context-based self-supervision has recently moved more and more away from autoencoding data: Early approaches utilize spatial context structure by defining pretext tasks for context generation, like image inpainting [47] or denoising, as a weak form of inpainting [8, 61]. In contrast, approaches for context prediction do not create any image and, for example, try to leverage the knowledge obtained by predicting patch positions [12, 43]. Spatial context can also be encoded by predicting transformations, which has led to a line of research focusing on autoencoding transformations rather than data [16, 37, 49, 72]. Recently, the context-based similarity approach of contrastive learning [19], which utilizes context information between negative and positive pairs, gained popularity and achieved promising results [9, 11, 21, 36, 56]. Contrastive learning has been linked to mutual information maximization [38, 68], which in ongoing work is used to define pretext tasks through context-based similarity as well [4, 24]. Context similarity by pseudo-labeling through clustering methods is another line of research [7, 71]. Self-supervised relational reasoning combines context-based similarity and context-based structure by discriminating how entities relate to themselves and to other entities and has also been linked to mutual information maximization [46]. Context-based approaches are applied to other data modalities as well, e.g., point clouds [69] or video [15, 28].
Other unsupervised representation learning methods, for example, combine multiple self-supervised approaches [10, 29, 48, 63], use meta-learning [25, 27, 41, 54] or metric learning [6, 65] to learn unsupervised learning rules, or rely on self-organization [1, 58].
2.2 Analyzing unsupervised representation learning
Changing the underlying model is one common way to compare different unsupervised learning techniques [18]. Here, a well-known finding is that a larger representation size significantly and consistently increases the quality of the learned visual representations [32].
Varying the amount of data samples has led to interesting observations as well [42, 59]. For example, in [3] it is shown that unsupervised learning is capable of learning features of early layers from a single image.
Analyzing self-supervised learning across target domains is another way to define and evaluate benchmarks for unsupervised approaches [8, 42, 64]. Zhai et al. [70] define good representations as those that adapt to diverse, unseen tasks with few examples.
Furthermore, there exist works where the underlying model, the amount of data samples, and the target domain are analyzed collectively [18, 44].
Other investigations of unsupervised learning focus on the effect of multi-task pretext learning [13, 55], evaluate the disentanglement of representations [39], investigate the positive effects of unsupervised learning regarding robustness [23], or provide a theoretical analysis of contrastive learning [2, 66].
The objective function mismatch in unsupervised learning is not unknown. Some works directly or indirectly observed that learning a pretext task too long may hurt target task performance, but made no further investigations into this topic [32, 39, 64]. Other works sometimes showed performances of linear target models over training epochs, but did not examine or define the objective function mismatch in detail [58, 70]. Instead, unsupervised multi-task learning and meta-learning have been proposed as approaches to reduce the objective function mismatch [13, 41]. In contrast, this work focuses on defining general protocols to measure mismatches of metrics over the course of pretext task training when a target task is trained on top of the pretext model’s representations. To the best of our knowledge, this has not been done before. Furthermore, we highlight important properties of our evaluation protocols and interesting dependencies of the objective function mismatch.
3 Hard metrics mismatch
With the objective function mismatch, we want to measure the mismatch of two objectives while training a model on an (unsupervised) pretext task and using its representations to train another model on a target task. In general, we can measure the mismatch of two comparable metrics if one metric is captured during training of a single pretext model and the other is captured for each target model fully trained on the representations of different steps or epochs of the pretext model. Two comparable metrics are, for example, classification accuracies for the pretext and target task, because they use the same measurement unit and scale. As illustrated in Fig. 2, the metric values of the target models form a curve over the course of learning. Between a metric value on this curve and the corresponding metric value on the pretext model curve, we can define the metrics mismatch (\(\mathrm {M3}\)) for a certain step (or epoch) in training by calculating their distance.
More formally, let \(M^{P}=(m_{1}^{P}, ... , m_{n}^{P})\) denote an ntuple of values from a metric used to measure pretext model P for different steps \(S=(s_1, ..., s_n)\). The length n of the tuple is usually given by a convergence criterion C on the metric of model P during training. Furthermore, let \(M^{T}=(m_{1}^{T}, ... , m_{n}^{T})\) denote an ntuple of values from a comparable metric used to measure target model T. \(M^{T}\) is of the same length and order as \(M^{P}\) and all values are calculated at the same training steps S of \(M^{P}\). Thereby the target model T is fully retrained for every step \(s_i\) in S on the representations of model P at this step before we measure \(m_{i}^{T}\).
Definition 1
The hard Metrics MisMatch (\(\mathrm {M3}\)) between \(m_{i}^{T}\) and \(m_{i}^{P}\) at step \(s_i\) is defined as:
\(\mathrm {M3}(m_{i}^{T}, m_{i}^{P}) = m_{i}^{P} - m_{i}^{T},\)
where \(m_{i}^{T}\) and \(m_{i}^{P}\) are single values measured with comparable metrics at step \(s_i\).^{Footnote 1}
If \(\mathrm {M3}>0\) the performance of the target model is lower than the performance of the pretext model at step \(s_i\). In contrast, \(\mathrm {M3}\le 0\) represents the desired case in unsupervised representation learning, where target model performance is the same or above the pretext model performance at step \(s_i\). In our case, we measure \(m_{i}^{T}\) and \(m_{i}^{P}\) over the entire evaluation dataset for every step \(s_i\) in S. We plot \(\mathrm {M3}(m_{i}^{T}, m_{i}^{P})\) for the pretext task of predicting rotations and the target task of Cifar10 classification during training in Fig. 2. This shows that our metric captures the behavior of the target task performance regarding the pretext task performance, and we observe an increasing mismatch as training progresses. To capture the mismatch of the entire training procedure with respect to the target task in a single value, we can now define the mean hard metrics mismatch (\(\mathrm {MM3}\)) as the mean bias error between \(M^{T}\) and \(M^{P}\).
Definition 2
The Mean hard Metrics MisMatch (\(\mathrm {MM3}\)) between \(M^{T}\) and \(M^{P}\) is defined as:
\(\mathrm {MM3}(M^{T}, M^{P}) = \frac{1}{n}\sum _{i=1}^{n} \mathrm {M3}(m_{i}^{T}, m_{i}^{P}),\)
where \(M^{T}\) and \(M^{P}\) are tuples measured with comparable metrics until the pretext model converges at step \(s_n\).
\(\mathrm {MM3}\) measures the bias of the target model metric relative to the pretext model metric. For positive or negative values of \(\mathrm {MM3}\), we can make similar observations as for \(\mathrm {M3}\), but they now account for the tendency of the entire training process and not for a single step \(s_i\). In general, the mean bias error can convey useful information, but it should be interpreted cautiously because there are special cases where positive and negative values cancel each other out. In our case, this can happen, for example, when learning the pretext task is very useful for the target task early in training but hurts the target performance equally strongly later on, when the pretext task is sufficiently solved. We capture this behavior by measuring and plotting \(\mathrm {M3}\) individually for the metric values of each step \(s_i\) as in Fig. 2, analogous to the way a loss is measured and plotted during training.
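To make the computation concrete, here is a minimal sketch of \(\mathrm {M3}\) and \(\mathrm {MM3}\) for a higher-is-better metric such as accuracy. The function names and example values are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def m3(m_t, m_p):
    """Hard metrics mismatch at one step: positive when the target
    model performs worse than the pretext model (for a
    higher-is-better metric such as accuracy)."""
    return m_p - m_t

def mm3(M_t, M_p):
    """Mean hard metrics mismatch: the mean bias error between the
    pretext and target metric curves over all measured steps."""
    return float(np.mean([m3(t, p) for t, p in zip(M_t, M_p)]))

# Pretext accuracy keeps rising while target accuracy degrades late in training.
pretext_acc = [0.30, 0.60, 0.80, 0.90]
target_acc = [0.40, 0.70, 0.65, 0.60]
m3_curve = [m3(t, p) for t, p in zip(target_acc, pretext_acc)]  # per-step mismatch
mean_mismatch = mm3(target_acc, pretext_acc)
```

In this toy example the per-step mismatch starts negative (the desired case) and turns positive once the target accuracy drops, so the sign of the curve tells us when training the pretext task began to hurt.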
3.1 Hard objective function mismatch
Naively, we could compare the objective functions of the target and pretext task by using \(\mathrm {M3}\), which we define as the hard objective function mismatch. In most cases, however, the objective functions used to train the pretext model and the target models are not directly comparable. This is due to the usage of different objective functions for both model types, which, for example, use different (non-)linearities. But for some pretext tasks, simple, comparable metrics can be defined. These metrics can be used as a proxy to measure the objective function mismatch in a general and comparable manner. A well-known example is the accuracy metric, which can be used on the self-supervised task of predicting rotations [16] and the state-of-the-art approach of contrastive learning [9]. But comparable metric pairs cannot always be found easily. For example, if we train a variational autoencoder and later use its representation for a classification target task, it does not make sense to define a pixel-wise error between the given and generated images as a comparable pretext task metric. To achieve a comparable measurement for this situation, and on the loss curves in general, one could think of individual normalization techniques between objective function pairs. However, we want to be practical and define a measure that can be used independently of the objective function pairs for every pretext and target model combination. Furthermore, in practice, we might be especially interested in how much the target task mismatches with the pretext task if a mismatch decreases target performance. This is why we define soft versions of our measurements.
4 Soft metrics mismatch
To bypass objective function pair normalization, we define the soft metrics mismatch (\(\mathrm {SM3}\)) directly on the target metric. Thereby, we no longer take the exact improvement of the pretext metric into account; we only care about its convergence. Since we now have no exact information about the pretext metric curve, we define \(\mathrm {SM3}\) for the current step \(s_i\) between the current target metric value and the minimal target metric value that has occurred up to this step:
Definition 3
The Soft Metrics Mismatch (\(\mathrm {SM3}\)) between \(M^{T}\) and \(M^{P}\) at step \(s_i\) is defined as:
\(\mathrm {SM3}(m_{i}^{T}) = m_{i}^{T} - \min _{0<j\le i}(m_{j}^{T}),\)
where \(\min _{0<j\le i}(m_{j}^{T})\) is the minimal target metric value that has occurred up to step \(s_i\).
\(\mathrm {SM3}\) has a slightly different meaning compared to \(\mathrm {M3}\): It equals zero if \(m_{i}^{T}\) is a minimal metric value and is positive if \(m_{i}^{T}\) is higher than the previously occurred minimal metric value. We want to point out that the only way we incorporate the pretext metrics into this measurement is by making sure that the pretext model does not overfit and has not yet converged. Again, we measure \(m_{i}^{T}\) and \(m_{i}^{P}\) over the entire evaluation dataset for every step \(s_i\) in S and plot \(\mathrm {SM3}(m_{i}^{T})\). A common case is shown in Fig. 3, which captures the behavior of pretext model training with respect to the target model. Here we observe zero soft mismatch early in training followed by increasing soft mismatch until pretext model convergence. Again, we can capture the mismatch of the pretext task with respect to the target task for the entire training process until pretext model convergence as the mean bias error of every metric value \(m_{i}^{T}\) and its minimal metric value:
Definition 4
The Mean Soft Metrics Mismatch (\(\mathrm {MSM3}\)) between \(M^{T}\) and \(M^{P}\) is defined as:
\(\mathrm {MSM3}(M^{T}) = \frac{1}{n}\sum _{i=1}^{n} \mathrm {SM3}(m_{i}^{T}),\)
where the pretext model converges at step \(s_n\).
\(\mathrm {MSM3}\) can either be zero, if no mismatch occurs, or positive, if there is a mismatch. Therefore, using \(\mathrm {MSM3}\) brings the benefit that positive values cannot be canceled out by negative values. Furthermore, we define the maximum occurring mismatch \(\mathrm {mSM3} = \max _{0<i\le n}\mathrm {SM3}(m_{i}^{T})\) and the mismatch at pretext model convergence \(\mathrm {cSM3} = \mathrm {SM3}(m_{n}^{T})\). We are especially interested in \(\mathrm {cSM3}\), since it measures the representations one would naively take for the target task.
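The whole soft mismatch family can be sketched in a few lines, assuming a loss-like target metric where lower is better; the function names are illustrative, not from the paper's code.

```python
import numpy as np

def sm3_curve(M_t):
    """Soft metrics mismatch per step: distance of the current target
    metric value to the running minimum of the target curve
    (assumes a loss-like metric where lower is better)."""
    M_t = np.asarray(M_t, dtype=float)
    return M_t - np.minimum.accumulate(M_t)

def soft_mismatches(M_t):
    """MSM3 (mean), mSM3 (maximum) and cSM3 (mismatch at pretext
    model convergence, i.e., the last measured step)."""
    curve = sm3_curve(M_t)
    return {"MSM3": float(curve.mean()),
            "mSM3": float(curve.max()),
            "cSM3": float(curve[-1])}

# Target loss improves, then partly degrades as pretext training continues.
stats = soft_mismatches([1.0, 0.5, 0.75, 0.6])
```

Note that only the target curve enters the computation; the pretext model is involved only through the choice of the convergence step \(s_n\), as described above.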
4.1 Soft objective function mismatch
Now we can use \({\mathrm {SM3}}\) to measure a soft form of the objective function mismatch on the loss curve obtained by the target models. However, the values of these measurements lie in a range, which depends on the target objective function. Therefore, they are not directly comparable to the measurements on loss curves from other target tasks. This is why we normalize the measurements of the target metric to percentage range and define the objective functions mismatch (\({\mathrm {OFM}}\)) as follows:
Definition 5
The Soft Objective Function Mismatch (\({\mathrm {OFM}}\)) between \(M^{T}\) and \(M^{P}\) at step \(s_i\) is defined as:
\(\mathrm {OFM}(m_{i}^{T}) = \mathrm {SM3}({\hat{m}}_{i}^{T}),\)
using the normalized target metric values
\({\mathrm {N}}(m_{i}^{T}) = \frac{m_{i}^{T} - m_{b}^{T}}{m_{1}^{T} - m_{b}^{T}} \cdot 100,\)
where \(m_{1}^{T}\) is the loss value of the target model trained on an untrained pretext model (\(s_1=0\)) and \(b={\mathrm {argmin}}_{0<i\le n}(m^{T}_i)\) denotes the index of the minimal target loss value. We then use \({\hat{M}}^{T} = ({\mathrm {N}}(m^{T}_1),...,{\mathrm {N}}(m^{T}_n))\) to calculate the \(\mathrm {OFM}\).
The intuition behind this normalization is that we declare \(m_{b}^{T}\) as the value where the pretext model has learned everything about the target objective it was able to learn (with this setting) and \(m_{1}^{T}\) as the value where the model has learned nothing about the target objective. With \(\mathrm {OFM}(m^{T}_i)\), we now measure by what percentage the learning of a pretext objective hurts the maximum achieved target performance at step \(s_i\). An example is illustrated in Fig. 3. Furthermore, we can normalize the other soft measurements from Eqns. 5 and 6 analogously to Eq. 7.
The \(\mathrm {OFM}\) is a general measure that can be used for pretext and target models where no good proxy metrics can be defined. With the \(\mathrm {OFM}\), we are able to compare mismatches across different pretext and target task objectives and their combinations. We propose these measurements to obtain quantitative and therefore comparable results for individual pretext tasks. To get the best information about the training process, we encourage plotting the curves formed by our metrics as well. We want to point out that our metrics are not intended to measure target task performance; they measure how much the performance on a target task can decrease when an (ill-posed) pretext task is learned too long. Now, to understand the \(\mathrm {OFM}\) further, we take a look at some cases:
\(\mathrm {OFM}(m^{T}_i)=0\): In this case, solving the pretext task objective did not hurt the performance of the target task objective at this point in training.
\(\mathrm {OFM}(m^{T}_i)=x\): Solving the pretext objective did hurt the performance of the target objective at this point in training by \(x\%\) of what the model has learned. Therefore, we should have stopped training earlier. It is not guaranteed that longer training would hurt performance even more, but a growing \(\mathrm {OFM}\) curve or \(\mathrm {MOFM}\) is a good indicator for that.
\(\mathrm {OFM}(m^{T}_i)>100\): The target objective performance is worse than for the untrained model at this point in training.
\(\mathrm {MOFM}(M^{T})=\infty\): Solving the pretext objective hurts the performance of the target objective from the point of initialization. Because the model has learned essentially 0% about the target objective during the training process, there is no interval to be used for normalization. Therefore, we interpret this case as if the model has an infinite mismatch as soon as it forgets something about the target objective.
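The cases above can be checked against a small sketch of the normalization and the resulting \(\mathrm {OFM}\) curve; the function name and the explicit handling of the infinite-mismatch case are our illustrative choices, not the paper's implementation.

```python
import numpy as np

def ofm_curve(M_t):
    """Soft objective function mismatch in percent: normalize the target
    loss curve so its best value maps to 0% and the untrained value
    (step s_1) to 100%, then take the distance to the running minimum."""
    M_t = np.asarray(M_t, dtype=float)
    best = M_t.min()
    span = M_t[0] - best  # how much pretext training helped the target task
    if span == 0:
        # Nothing was learned about the target objective: any forgetting
        # yields an infinite mismatch (MOFM = inf), otherwise zero.
        return np.where(M_t > M_t[0], np.inf, 0.0)
    normalized = (M_t - best) / span * 100.0
    return normalized - np.minimum.accumulate(normalized)
```

For a target loss curve like `[1.0, 0.5, 0.75]`, the last step forgets half of what was learned, giving a 50% mismatch; a final loss above the untrained value yields values above 100, matching the cases listed above.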
5 Experimental setup
In our experiments, we focus on image-based self-supervised learning. However, it is likely that other target domains show mismatches as well, e.g., [13].
Pretext tasks For generation-based self-supervision, we evaluate the approaches of autoencoding data with autoencoders (CAE) and color restoration (CCAE) as suggested by Chen et al. [9]. To evaluate context structure generation, we use denoising autoencoders (DCAE). Spatial context structure is evaluated via autoencoding transformations by predicting rotations [16] (RCAE). For context-based similarity methods, we follow the state-of-the-art contrastive learning approach from Chen et al. [9] (SCLCAE). We refer to the literature for first glances into mismatches for VAEs, meta-learning [41] and self-organization [58].
Pretext models Unless stated otherwise, we use a four-layer CNN as encoder. For the autoencoding data approaches, we use a four-layer decoder with transpose convolutions, for rotation prediction a single dense layer, and for contrastive learning a nonlinear head, as suggested in [9]. We show that mismatches occur for other architectures as well by carrying out additional evaluations using ResNets [22] in Table 3 and Appendix C.
Target tasks We evaluate our metrics on image-based target tasks. For coarse-grained classification, we use Cifar10, Cifar100 [33] and the coarse-grained labels of 3dshapes [5]. For fine-grained classification, we use the PCam dataset [60] and the fine-grained labels of 3dshapes.
Target models Following the linear evaluation protocol, we use a single, linear dense layer (FC) as the target model with a softmax activation. To evaluate our metrics for other target models, we use a two-layer MLP (2FC) and a three-layer MLP (3FC).
Augmentations To ensure we compare pretext tasks rather than augmentations, we follow Chen et al. [9] for our base augmentations, to which we add the pretext-task-specific augmentations for pretext task training and evaluation. For the target task, we use the base training and evaluation augmentations of Chen et al. [9].
Optimization Our models are trained using the Adam optimizer [30] with standard parameters and a batch size of 2048, without any regularization apart from batch normalization. For our ResNets, we additionally use a weight decay of \(1\mathrm {e}{-4}\).
Mismatch evaluation All reported values are determined by 5-fold cross-validation. We use standard early stopping (from tf.keras) as the convergence criterion on the pretext evaluation curve with a minimum delta (threshold) of 0 and a patience of 3. We change the patience in some experiments of Tables 1 and 3 to get a reasonable convergence epoch. For more details, we refer to Appendix B. When calculating our metrics, we estimate target values of missing epochs with linear interpolation to save computation time. In our case, \(\mathrm {SM3}\) and \(\mathrm {MM3}\) are measured on the target task accuracy.
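The linear interpolation of missing target epochs can be sketched with `np.interp`; the epoch spacing and metric values here are illustrative, not the paper's settings.

```python
import numpy as np

# Target models were retrained only at every fourth pretext epoch;
# values for the remaining epochs are estimated by linear interpolation.
epochs_measured = [0, 4, 8]
target_metric = [1.0, 0.6, 0.8]
all_epochs = np.arange(9)
estimated = np.interp(all_epochs, epochs_measured, target_metric)
```

This keeps the number of fully retrained target models small while still providing a metric value for every pretext epoch when computing the mismatch curves.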
Implementation Our implementation is available at https://github.com/BonifazStuhr/OFM.
6 Evaluation
In the following, we show results of most pretext and target tasks we have evaluated. We refer to Appendix C for additional, more detailed evidence. Since we capture our metrics during training, all mismatches are measured on the evaluation dataset.
6.1 Mismatch and convergence
For our measurements, we make sure to use metric value pairs from models that do not overfit. We achieve this by applying a convergence criterion on the pretext task and by using the best metric values from each target model evaluation curve. As shown in Appendix C, most observations in our experiments are independent of the use of a convergence criterion if pretext models are trained long enough and without overfitting. Furthermore, we observe a common behavior in Fig. 1: Target models trained on higher epochs of the pretext model tend to converge faster. This indicates that longer training of the pretext task tends to create more easily separable representations, which may mismatch with the class labels.
6.2 Stability
To evaluate the stability of our measurements, we show the mismatches of the entire training process and their range (\(+,-\)) using 5-fold cross-validation in Figs. 2 and 3. The range for all other models we have trained is shown in Appendix C. We observe that \(\mathrm {M3}\) generally seems more stable than \(\mathrm {SM3}\) or the \(\mathrm {OFM}\), since it does not rely as heavily on the target metric values, which can be quite unstable. The instability of the target task mismatch is captured in \(\mathrm {M3}\), but does not matter that much in the overall measurement for most cases. This is favourable if a stable value is desired and unfavourable if one wants to capture the instability of the target task training process explicitly. Furthermore, \(\mathrm {M3}\) is able to compensate target fluctuations with pretext fluctuations. In general, we observe that as long as we calculate the \(\mathrm {OFM}\) across a fair number of cross-validation folds (in our case 5), we can make statements about the mismatch. We measure our metrics on the mean losses during 5-fold cross-validation instead of calculating them five times and taking the average. For \(\mathrm {M3}\), both variants are equivalent, and for the \(\mathrm {OFM}\), measuring on the mean losses leads to a lower bound in the case where all models converge at step \(s_n\) (see Appendix A for the simple proofs). We prefer to measure our metrics on the mean losses, since this avoids mismatches occurring just in some validation cycles due to small fluctuations of the underlying training procedure. An example is shown in translucent red in Fig. 3 at the beginning of training. We want to point out that the training and validation data differ slightly in every round because of the cross-validation setup. This increases the instability, but shows the general behavior of the metrics for the underlying data distribution.
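The equivalence of the two variants for \(\mathrm {M3}\) follows from the linearity of the mean; a quick numerical check with synthetic fold data (not the paper's results) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(0)
pretext_folds = rng.random((5, 10))  # 5 folds x 10 pretext training steps
target_folds = rng.random((5, 10))

# Variant 1: M3 computed on the mean metric curves over the folds.
m3_on_means = pretext_folds.mean(axis=0) - target_folds.mean(axis=0)
# Variant 2: mean over the folds of the per-fold M3 curves.
mean_of_m3 = (pretext_folds - target_folds).mean(axis=0)

equivalent = np.allclose(m3_on_means, mean_of_m3)  # holds by linearity
```

No such identity holds for the \(\mathrm {OFM}\), whose running minimum is nonlinear, which is why measuring on the mean losses only yields a lower bound there.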
In Appendix C, we compare the instability of partially measured mismatches using linear interpolation with mismatches measured for every pretext training epoch and observe a similar instability. However, when using the \(\mathrm {OFM}\) in practice to compare models on a finer scale, we recommend searching for the actual minimal target metric value, since the \(\mathrm {OFM}\) relies on this value at each step. When tuning a model for maximum performance, one searches for this value anyway, and looking at the \(\mathrm {OFM}\) curve gives good indications of the interval in which one should search. This makes the protocol useful for performance tuning if enough computational power is available.
6.3 Dependence on representation size
We hypothesize that large representation sizes tend to lower the \(\mathrm {OFM}\), which could be one reason why representation sizes are large in unsupervised learning. To support this hypothesis empirically, we train our pretext models with varying representation sizes on different target tasks while fixing all other model parameters. Figure 4 and Table 1 show that the \(\mathrm {OFM}\) tends to decrease when we enlarge the representation size. A reason for that might be that target models can exploit the high-dimensional space of large representations to find better fitting clusters for their target task. We found an exception to this behavior, where we use larger representations of SCLCAE for the easy task of object hue prediction. Here, the target models trained on the untrained pretext models with larger representation sizes already achieve high performance due to a larger number of color-selective, random features. Further learning of the pretext model, in this case, does not lead to a high performance gain, and forgetting these sensitive random features during training leads to a high mismatch. Additionally, we observe that mismatches decrease when we decrease the representation size for generation-based methods. A reason could be that the pretext models are forced to generalize to solve the target task for small representation sizes due to the limited number of features in the bottleneck, or simply underfit on the pretext task.
6.4 Dependence on target model complexity
In Fig. 4 and Table 1, we observe an \(\mathrm {OFM}\) spike early in training for more complex target models. This spike probably occurs because nonlinear target models make better sense of specific random features at pretext task initialization than the linear target model does. Besides early spikes, mismatches tend to decrease when we add complexity to the target model. A model with increased nonlinearity has more freedom to disentangle representations that do not fit properly with the target task. Again, we found an exception where the \(\mathrm {MOFM}\) is lower for linear models when predicting the object hue after contrastive learning, which can be attributed to the color-selective, random features of the untrained pretext model.
6.5 Dependence on augmentations
We vary the augmentations used for the pretext and target model by successively removing the color jitter and the image flip from our base augmentations. Figure 4 shows that augmentations can have a positive or negative impact on the mismatch. E.g., when predicting the object hue, the ill-posed color jitter augmentation increases the mismatch significantly.
6.6 Dependence on target task type
Here we use our metrics to examine findings stated in [13, 32] and [64, 70], where it is argued that some pretext tasks are better suited for different target tasks. We fix the underlying data distribution by using the 3dshapes dataset and train our target models for the different tasks. These tasks require a generic understanding of the scene, like coarse-grained knowledge about object type and hue and fine-grained knowledge about shapes, positions and scales. In Fig. 5 and Table 2, we observe that pretext models tend to learn pretext-task-specific features and, during training, discard features that are not needed to solve the pretext task. These models therefore mismatch with ill-posed target tasks. For example, rotation prediction discards features corresponding to the hue while it learns much about the orientation of the object.
6.7 Applying our metrics to ResNet models
In Fig. 6, we apply our metrics to ResNet models for several pretext and target tasks. For contrastive learning we observe a small \(\mathrm {OFM}\) for Cifar10 and Cifar100, which occurs late in training, after pretext model convergence. However, when we use contrastive learning as the pretext task for fine-grained tumor detection on the PCam dataset, we observe a mismatch before pretext model convergence. For the well-known rotation prediction pretext task, we observe a high mismatch on Cifar10 classification early in pretext training. In Table 3, we show the corresponding mismatches measured until pretext model convergence.
7 Future work
In future work, our metrics can be used to create, tune and evaluate (self-supervised) representation learning methods for different target tasks and datasets. These metrics make it possible to quantify the extent to which a pretext task matches a target task, and to determine whether the pretext task learns the right kind of representation throughout the entire training process. This enables a comparison of methods on benchmarks across different pretext tasks and models. The dependencies of the objective function mismatch on different parts of the self-supervised setup (e.g., the representation size) can be explored by future work in more detail, to further evaluate our findings and to create pretext tasks and model architectures that are robust against mismatches. Our metrics are defined generally for setups where target models are trained on pretext model representations. Therefore, they can also be applied to other representation learning areas such as supervised, semi-supervised, few-shot, or biologically plausible representation learning.
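As a toy illustration of such a setup (not the authors' implementation; the "pretext model", the probe, and the data below are hypothetical stand-ins), the following sketch trains a separate linear probe on the frozen representations of each saved pretext checkpoint and records the entire target metric curve, rather than a single end-of-training score:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target data: two Gaussian classes (a stand-in for a real target dataset).
X = np.vstack([rng.normal(-1, 1, (50, 8)), rng.normal(1, 1, (50, 8))])
y = np.array([0] * 50 + [1] * 50)

def represent(x, W):
    """Frozen 'pretext model': here just a random linear map with ReLU."""
    return np.maximum(x @ W, 0.0)

def probe_accuracy(Z, y):
    """Minimal linear probe: nearest class mean in representation space."""
    means = np.stack([Z[y == c].mean(axis=0) for c in (0, 1)])
    pred = np.argmin(((Z[:, None, :] - means[None]) ** 2).sum(-1), axis=1)
    return (pred == y).mean()

# Evaluate a fresh probe at each saved pretext checkpoint (random weight
# matrices stand in for real checkpoints) and keep the whole metric curve.
checkpoints = [rng.normal(0, 1, (8, 16)) for _ in range(5)]
curve = [probe_accuracy(represent(X, W), y) for W in checkpoints]
print(curve)  # inspect where the target metric peaks during pretext training
```

Keeping the full curve (instead of only the final checkpoint's score) is what allows mismatch metrics to compare the best achievable target performance with the performance after pretext convergence.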
8 Conclusion
In this work, we have used the linear evaluation protocol as a basis to define and discuss metrics that measure the metrics mismatch and the objective function mismatch. With soft and hard versions of our metrics, we collected evidence of how these mismatches relate to the pretext model's representation size, target model complexity, pretext and target augmentations, as well as pretext and target task types. Furthermore, we observe that the epoch of the target task peak performance varies strongly across datasets and pretext tasks. This highlights the importance of the protocol and shows that comparing approaches after a fixed number of epochs does not yield the entire picture of their capability. Our protocols make it possible to define benchmarks across different target tasks, where the goal is to achieve the best possible performance while avoiding mismatches with the target metrics.
Data availability
All datasets are available in the public domain.
Code availability
Our implementation is available at https://github.com/BonifazStuhr/OFM.
Notes
Note that we define our measurements only for the case where lower metric values correspond to better performance. The definition for the opposite case arises naturally by swapping maximum and minimum operations and/or subtraction orders.
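As a minimal sketch of this convention (a hypothetical helper, not the paper's exact \(\mathrm {OFM}\)/\(\mathrm {MOFM}\) formulas), the following shows how a mismatch-style gap can be computed for a lower-is-better metric such as error rate, and how swapping the minimum/maximum and the subtraction order covers the higher-is-better case:

```python
from typing import List, Sequence

def mismatch_curve(target_metric: Sequence[float],
                   lower_is_better: bool = True) -> List[float]:
    """Illustrative mismatch per pretext checkpoint: the gap between the
    target metric at checkpoint t and the best value seen up to t.
    NOTE: hypothetical helper; see the paper for the exact definitions."""
    curve = []
    best = target_metric[0]
    for value in target_metric:
        if lower_is_better:
            best = min(best, value)
            curve.append(value - best)   # >= 0, grows when performance degrades
        else:
            best = max(best, value)
            curve.append(best - value)   # swapped min/max and subtraction order
    return curve

# Target error rate over pretext checkpoints: first improves, then degrades.
errors = [0.50, 0.30, 0.25, 0.28, 0.35]
print(mismatch_curve(errors))                  # last entry: 0.35 - 0.25 = ~0.10
accuracies = [0.50, 0.70, 0.75, 0.72, 0.65]
print(mismatch_curve(accuracies, lower_is_better=False))
```

A nonzero tail of this curve indicates that continuing pretext training past the target task's peak hurts target performance, which is exactly the situation the note's sign convention is meant to capture.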
References
Ackley DH, Hinton GE, Sejnowski TJ (1985) A learning algorithm for Boltzmann machines. Cogn Sci
Arora S, Khandeparkar H, Khodak M, Plevrakis O, Saunshi N (2019) A theoretical analysis of contrastive unsupervised representation learning. CoRR
Asano YM, Rupprecht C, Vedaldi A (2020) A critical analysis of self-supervision, or what we can learn from a single image. In: International conference on learning representations
Bachman P, Hjelm RD, Buchwalter W (2019) Learning representations by maximizing mutual information across views. In: Advances in neural information processing systems
Burgess C, Kim H (2018) 3d shapes dataset. https://github.com/deepmind/3dshapesdataset/
Cao X, Chen BC, Lim SN (2019) Unsupervised deep metric learning via auxiliary rotation loss. CoRR
Caron M, Bojanowski P, Joulin A, Douze M (2018) Deep clustering for unsupervised learning of visual features. In: Proceedings of the European Conference on Computer Vision (ECCV)
Charte D, Charte F, del Jesus MJ, Herrera F (2020) An analysis on the use of autoencoders for representation learning: fundamentals, learning task case studies, explainability and challenges. Neurocomputing
Chen T, Kornblith S, Norouzi M, Hinton GE (2020) A simple framework for contrastive learning of visual representations. In: International conference on machine learning
Chen T, Zhai X, Ritter M, Lucic M, Houlsby N (2019) Self-supervised GANs via auxiliary rotation loss. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Chen X, Fan H, Girshick RB, He K (2020) Improved baselines with momentum contrastive learning. CoRR
Doersch C, Gupta A, Efros AA (2015) Unsupervised visual representation learning by context prediction. In: Proceedings of the IEEE international conference on computer vision
Doersch C, Zisserman A (2017) Multi-task self-supervised visual learning. In: Proceedings of the IEEE international conference on computer vision
Donahue J, Simonyan K (2019) Large scale adversarial representation learning. In: Advances in neural information processing systems
Fernando B, Bilen H, Gavves E, Gould S (2017) Self-supervised video representation learning with odd-one-out networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Gidaris S, Singh P, Komodakis N (2018) Unsupervised representation learning by predicting image rotations. CoRR
Goodfellow I, PougetAbadie J, Mirza M, Xu B, WardeFarley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems
Goyal P, Mahajan D, Gupta A, Misra I (2019) Scaling and benchmarking self-supervised visual representation learning. In: Proceedings of the IEEE international conference on computer vision
Hadsell R, Chopra S, LeCun Y (2006) Dimensionality reduction by learning an invariant mapping. In: 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06)
Haque KN, Rana RK, Schuller B (2020) Guided generative adversarial neural network for representation learning and high fidelity audio generation using fewer labelled audio data. CoRR
He K, Fan H, Wu Y, Xie S, Girshick R (2020) Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Hendrycks D, Mazeika M, Kadavath S, Song D (2019) Using self-supervised learning can improve model robustness and uncertainty. In: Advances in neural information processing systems
Hjelm RD, Fedorov A, LavoieMarchildon S, Grewal K, Trischler A, Bengio Y (2019) Learning deep representations by mutual information estimation and maximization. CoRR
Hochreiter S, Younger AS, Conwell PR (2001) Learning to learn using gradient descent. In: International conference on artificial neural networks
Jing L, Tian Y (2020) Self-supervised visual feature learning with deep neural networks: a survey. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/TPAMI.2020.2992393
Khodadadeh S, Boloni L, Shah M (2019) Unsupervised meta-learning for few-shot image classification. In: Advances in neural information processing systems
Kim D, Cho D, Kweon IS (2019) Self-supervised video representation learning with space-time cubic puzzles. In: Proceedings of the AAAI conference on artificial intelligence
Kim D, Cho D, Yoo D, Kweon IS (2018) Learning image representations by completing damaged jigsaw puzzles. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV)
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. CoRR
Kingma DP, Welling M (2014) Auto-encoding variational Bayes. CoRR
Kolesnikov A, Zhai X, Beyer L (2019) Revisiting self-supervised visual representation learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Krizhevsky A (2009) Learning multiple layers of features from tiny images. Tech Rep, University of Toronto
Larsson G, Maire M, Shakhnarovich G (2016) Learning representations for automatic colorization. In: European conference on computer vision
Larsson G, Maire M, Shakhnarovich G (2017) Colorization as a proxy task for visual understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Li J, Zhou P, Xiong C, Socher R, Hoi SCH (2020) Prototypical contrastive learning of unsupervised representations. CoRR
Lin F, Xu H, Li H, Xiong H, Qi GJ (2019) AETv2: autoencoding transformations for self-supervised representation learning by minimizing geodesic distances in Lie groups. CoRR
Linsker R (1988) Self-organization in a perceptual network. Computer
Locatello F, Bauer S, Lucic M, Gelly S, Schölkopf B, Bachem O (2018) Challenging common assumptions in the unsupervised learning of disentangled representations. CoRR
Lorena AC, Garcia LP, Lehmann J, Souto MC, Ho TK (2019) How complex is your classification problem? A survey on measuring classification complexity. ACM Computing Surveys (CSUR)
Metz L, Maheswaranathan N, Cheung B, SohlDickstein J (2018) Learning unsupervised learning rules. CoRR
Newell A, Deng J (2020) How useful is selfsupervised pretraining for visual tasks? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Noroozi M, Favaro P (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In: European Conference on Computer Vision. Springer
Oliver A, Odena A, Raffel CA, Cubuk ED, Goodfellow I (2018) Realistic evaluation of deep semi-supervised learning algorithms. In: Advances in neural information processing systems
Palacio-Niño J, Berzal F (2019) Evaluation metrics for unsupervised learning algorithms. CoRR
Patacchiola M, Storkey AJ (2020) Self-supervised relational reasoning for representation learning. CoRR
Pathak D, Krahenbuhl P, Donahue J, Darrell T, Efros AA (2016) Context encoders: Feature learning by inpainting. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Patrick M, Asano YM, Kuznetsova P, Fong R, Henriques JF, Zweig G, Vedaldi A (2020) Multi-modal self-supervision from generalized data transformations. CoRR
Qi GJ, Zhang L, Chen CW, Tian Q (2019) AVT: unsupervised learning of transformation equivariant representations by autoencoding variational transformations. In: Proceedings of the IEEE international conference on computer vision
Radford A, Metz L, Chintala S (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR
Rumelhart DE, McClelland JL (1987) Learning internal representations by error propagation. American Association for the Advancement of Science
Schmidhuber J (1987) Evolutionary principles in self-referential learning on learning how to learn: The meta-meta-meta...-hook. Diploma thesis, Technische Universität München
Schmidhuber J (1990) Making the world differentiable: on using self-supervised fully recurrent neural networks for dynamic reinforcement learning and planning in non-stationary environments. Tech rep, Technische Universität München
Schmidhuber J (1995) On learning how to learn learning strategies. Tech rep, Technische Universität München
Shukla A, Petridis S, Pantic M (2020) Does visual self-supervision improve learning of speech representations? CoRR
Srinivas A, Laskin M, Abbeel P (2020) CURL: contrastive unsupervised representations for reinforcement learning. CoRR
Srivastava N, Mansimov E, Salakhutdinov R (2015) Unsupervised learning of video representations using LSTMs. In: International conference on machine learning
Stuhr B, Brauer J (2019) CSNNs: unsupervised, backpropagation-free convolutional neural networks for representation learning. In: 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA)
Su JC, Maji S, Hariharan B (2019) When does self-supervision improve few-shot learning? CoRR
Veeling BS, Linmans J, Winkens J, Cohen T, Welling M (2018) Rotation equivariant CNNs for digital pathology. In: International conference on medical image computing and computer-assisted intervention
Vincent P, Larochelle H, Bengio Y, Manzagol PA (2008) Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th international conference on machine learning
Vondrick C, Shrivastava A, Fathi A, Guadarrama S, Murphy K (2018) Tracking emerges by colorizing videos. In: Proceedings of the European Conference on Computer Vision (ECCV)
Voynov A, Morozov S, Babenko A (2020) Big GANs are watching you: towards unsupervised object segmentation with off-the-shelf generative models. CoRR
Wallace B, Hariharan B (2020) Extending and analyzing self-supervised learning across domains. CoRR
Weinberger KQ, Blitzer J, Saul LK (2006) Distance metric learning for large margin nearest neighbor classification. In: Advances in neural information processing systems
Wen Z (2020) Convergence of end-to-end training in deep unsupervised contrastive learning. CoRR
Wolf S, Hamprecht FA, Funke J (2020) Instance separation emerges from inpainting. CoRR
Wu M, Zhuang C, Mosse M, Yamins DLK, Goodman ND (2020) On mutual information in contrastive learning for visual representations. CoRR
Xie S, Gu J, Guo D, Qi CR, Guibas L, Litany O (2020) PointContrast: unsupervised pre-training for 3D point cloud understanding. In: European conference on computer vision
Zhai X, Puigcerver J, Kolesnikov A, Ruyssen P, Riquelme C, Lucic M, Djolonga J, Pinto AS, Neumann M, Dosovitskiy A et al (2019) A large-scale study of representation learning with the visual task adaptation benchmark. CoRR
Zhan X, Xie J, Liu Z, Ong YS, Loy CC (2020) Online deep clustering for unsupervised representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Zhang L, Qi GJ, Wang L, Luo J (2019) AET vs. AED: unsupervised representation learning by autoencoding transformations rather than data. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Zhang R, Isola P, Efros AA (2016) Colorful image colorization. In: European conference on computer vision
Acknowledgments
We kindly thank our colleagues from the University of Applied Sciences Kempten and the Autonomous University of Barcelona for the helpful discussions about this topic. In particular, we thank Markus Klenk for proofreading the draft and Jordi Gonzàlez for his much-appreciated feedback on this work.
Funding
Open Access Funding provided by Universitat Autonoma de Barcelona.
Ethics declarations
Conflicts of interest
Bonifaz Stuhr and Jürgen Brauer are members of the University of Applied Sciences Kempten. Bonifaz Stuhr is a member of the Autonomous University of Barcelona.
Cite this article
Stuhr, B., Brauer, J. Don’t miss the mismatch: investigating the objective function mismatch for unsupervised representation learning. Neural Comput & Applic 34, 11109–11121 (2022). https://doi.org/10.1007/s00521022070319