Reviving autoencoder pretraining

The pressing need for pretraining algorithms has been diminished by numerous advances in regularization, architectures, and optimizers. Despite this trend, we revisit the classic idea of unsupervised autoencoder pretraining and propose a modified variant that relies on a full reverse pass trained in conjunction with a given training task. This yields networks that are as-invertible-as-possible and share mutual information across all constrained layers. We additionally establish links between singular value decomposition and pretraining, and show how the decomposition can be leveraged to gain insights about the learned structures. Most importantly, we demonstrate that our approach yields improved performance for a wide variety of relevant learning and transfer tasks, ranging from fully connected networks and residual networks to generative adversarial networks. Our results demonstrate that unsupervised pretraining has not lost its practical relevance in today's deep learning environment.

In this paper, we develop an algorithm that reformulates autoencoder pretraining in a global way to arrive at a method that efficiently extracts general, dominant features from datasets. These features in turn improve performance for new tasks. Our approach is also inspired by techniques for orthogonalization [3,38,50,57]. Hence, we propose a modified variant that relies on a full reverse pass trained in conjunction with a given training task. A key insight is that there is no need for "greediness," i.e., layer-wise decompositions of the network structure, and that it is additionally beneficial to take a specific problem domain into account at the time of pretraining. We establish links between singular value decomposition (SVD) and pretraining, and show how our approach yields an embedding of problem-aware dominant features in the weight matrices. An SVD can then be leveraged to conveniently gain insights about the learned structures. Unlike orthogonalization techniques, we focus on embedding the dominant features of a dataset into the weights of a network. This is achieved via a reverse pass network that is generic, simple to construct, and directly relates to model performance, instead of, e.g., constraining the orthogonality of the weights. Most importantly, we demonstrate that the proposed pretraining yields improved performance for a variety of learning and transfer tasks, while incurring only a minor extra computational cost from the reverse pass.
The structure of our networks is influenced by invertible network architectures that have received significant attention in recent years [24,34,36,85]. However, these approaches rely heavily on specific network architectures. Instead of aiming for a bijective mapping that reproduces inputs, we strive for learning a general representation by constraining the network to represent an as-reversible-as-possible process for all intermediate layer activations. Thus, even for cases where a classifier can, e.g., rely on color for inference of an object type, the model is encouraged to learn a representation that can recover the input. Hence, not only the color of the input should be retrieved, but also, e.g., its shape, so that more dominant features of the input dataset are embedded into the network. In contrast to most structures for invertible networks, our approach does not impose architectural restrictions. We demonstrate the benefits of our pretraining for a variety of architectures, from fully connected layers to convolutional neural networks (CNNs) [46], over networks with batch normalization or dropout regularization, to generative adversarial network (GAN) architectures [25].
Below, we will first give an overview of our formulation and its connection to singular values, before evaluating our model in the context of transfer learning. For a regular, i.e., a non-transfer task, the goal usually is to train a network that gives optimal performance for one specific goal. During a regular training run, the network naturally exploits any observed correlations between input and output distribution. An inherent difficulty in this setting is that typically no knowledge about the specifics of the new data and task domains is available when training the source model. Hence, it is common practice to target broad and difficult tasks hoping that this will result in features that are applicable in new domains [14,26,82]. Motivated by autoencoder pretraining, we instead leverage a pretraining approach that takes into account the data distribution of the inputs and drives the network to extract dominant features from the datasets, which differs from regular training for optimal performance of one specific goal. We demonstrate that our approach boosts the model accuracy for original and new tasks for a wide range of applications, from image classification to data-driven weather forecasting.

Related work
Greedy layer-wise pretraining was first proposed by Bengio et al. [4], and influenced a large number of follow-up works, providing a crucial method for feature extraction and enabling stable training runs of deeper networks. A detailed evaluation was performed by Erhan et al. [18], also highlighting cases where it can be detrimental. These problems were later on detailed in other works [1]. Principal component analysis (PCA) [29,77] is a popular approach for dimensionality reduction and feature extraction, and was extended to, e.g., handle nonlinear relationships between variables [33,51], separate interpretable components [5], and improve robustness in the presence of outliers [80]. However, PCA is computationally intensive in both memory and run time for larger datasets. Clustering is another popular alternative [6,22,65,84,89]. As these methods rely on data similarities, they yield a high complexity when the dataset size increases [7]. Sharing similarities with our approach, Rasmus et al. [58] combined supervised and unsupervised learning objectives, but focused on denoising autoencoders and a layer-wise approach without weight sharing.

[Fig. 1 caption: Our pretraining (denoted as RR) yields improvements for numerous applications: (a) for difficult shape classification tasks, it outperforms existing approaches (Std_TS, Ort_TS, Pre_TS): the RR_TS model classifies the airplane shape with significantly higher confidence; (b) our approach establishes mutual information between input and output distributions; (c) for CIFAR-10 classification with a ResNet 110, RR_C10 yields substantial practical improvements over the state of the art; (d) learned weather forecasting likewise benefits from our pretraining, with RR yielding 13.7% improvements in terms of latitude-weighted RMSE for the ERA dataset [31]. Pressure is shown for 2019-08-09, 22:00 UTC, together with the mean absolute error (MAE) for the Std and RR models.]
Unsupervised approaches for representation learning [23,37,42,48,81], especially contrastive learning, such as SimCLR [8], MoCo-v2 [10], ProtoNCE [49], and PaCo [13], similarly aim for learning generic features from a given dataset, but typically necessitate sophisticated training algorithms. We demonstrate the importance of leveraging state-of-the-art methods for training deep networks, i.e., without decomposing or modifying the network structure. This not only improves performance, but also very significantly simplifies the adoption of the pretraining pass in new application settings.
Extending the classic viewpoint of unsupervised autoencoder pretraining, regularization techniques have also been commonly developed to improve the properties of neural networks [45,47]. Several prior methods employed "hard" orthogonality constraints to improve weight orthogonality via SVD at training time [35,38,57]. Bansal et al. [3] additionally investigated efficient formulations of the orthogonality constraints. Orthogonal convolutional neural networks (OCNNs) [75] reformulate the orthogonality constraints so that they can be computed efficiently for a network's convolutional layers. In practice, these constraints are difficult to satisfy and are correspondingly only weakly imposed. In addition, all of these methods focus on improving performance for a known, given task. This means the training process only extracts features that the network considers useful for the current task, without necessarily improving generalization or transfer performance [70]. While our approach shares similarities with SVD-based constraints, it can be realized with a very efficient $L_2$-based formulation, and it takes the full input distribution into account.
Recovering all input information from hidden representations of a network is generally very difficult [15,53,54], due to the loss of information throughout the layer transformations. In this context, [69] proposed the information bottleneck principle, which states that for an optimal representation, information unrelated to the current task is omitted. This highlights the common specialization of conventional training approaches.
Reversed network architectures were proposed in previous work [2,24,36,39], but mainly focus on how to make a network fully invertible via augmenting the network with special structures. As a consequence, the path from input to output is different from the reverse path that translates output to input. Besides, the augmented structures of these approaches can be challenging to apply to general network architectures. In contrast, our approach fully preserves an existing architecture for the backward path, and does not require any operations that were not part of the source network. As such, it can easily be applied in new settings, e.g., adversarial training [25]. While methods using reverse connections were previously proposed [67,85], these modules primarily focus on transferring information between layers for a given task, and on autoencoder structures for domain adaptation, respectively.

Method
With state-of-the-art deep learning methods [27,88], there is no need for breaking down the training process into single layers. Hence, we consider approaches that target whole networks, and employ orthogonalization regularizers as a starting point [35]. Orthogonality constraints were shown to yield improved training performance in various settings [3], and for an n-layer network they can be formulated as

$$L_{\mathrm{ort}} = \sum_{m=1}^{n} \left\| M_m^T M_m - I \right\|_2^2, \qquad (1)$$

i.e., enforcing the transpose of the weight matrix $M_m \in \mathbb{R}^{s_m^{out} \times s_m^{in}}$ of every layer m to yield its inverse when multiplied with the original matrix. $I$ denotes the identity matrix with $I = (e_m^1, \ldots, e_m^{s_m^{in}})$, where $e_m^j$ denotes the j-th column unit vector. Theoretically, $L_{\mathrm{ort}} = 0$ cannot be perfectly fulfilled because of the information imbalance between inputs and outputs in most deep learning settings [69]. We will first analyze the influence of the loss function $L_{\mathrm{ort}}$ assuming that it can be fulfilled, before applying the analysis to our full pretraining method.
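As a minimal illustration, the orthogonality regularizer of Eq. (1) can be sketched in a few lines of NumPy. This is a hedged sketch, not the paper's implementation; the helper name `orthogonality_loss` is ours, and real training would evaluate this on framework tensors inside the loss.

```python
import numpy as np

def orthogonality_loss(weights):
    """Sum over layers of ||M^T M - I||^2, cf. Eq. (1)."""
    loss = 0.0
    for M in weights:
        k = M.shape[1]                       # s_in of this layer
        gram = M.T @ M                       # (s_in, s_in)
        loss += float(np.sum((gram - np.eye(k)) ** 2))
    return loss

# A matrix with orthonormal columns gives (near-)zero loss; a random one does not.
Q, _ = np.linalg.qr(np.random.randn(8, 4))   # Q^T Q = I_4
print(orthogonality_loss([Q]))               # ~0.0
print(orthogonality_loss([np.random.randn(8, 4)]) > 0)  # True
```

Note that the loss only inspects the weights themselves; no training data enters the computation, which is exactly the property the next paragraphs contrast with.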
Minimizing Eq. (1), i.e., driving $M_m^T M_m - I$ toward zero, is mathematically equivalent to

$$M_m^T M_m\, e_m^j - e_m^j = 0, \quad j = 1, \ldots, s_m^{in}, \qquad (2)$$

with $\mathrm{rank}(M_m^T M_m) = s_m^{in}$, and the $e_m^j$ as eigenvectors of $M_m^T M_m$ with eigenvalues of 1. This formulation highlights that Eq. (2) does not depend on the training data, and instead only targets the content of $M_m$. Instead, we will design a constraint that jointly considers data and trainable weights, allowing us to learn the dominant features of the training dataset directly. We would naturally like to recover all features of the dataset with a learning task, but finite network capacity makes this infeasible in practice. Instead, we aim for extracting the features that contribute most to achieving a minimal loss value for our designed constraint. As a result, the features that appear most often, i.e., the dominant features, will be extracted. In this section, we introduce our constraint and analyze how it guides the weights to learn dominant features from the dataset. Then, we illustrate how we incorporate the constraint into training via a reverse pass network.
Inspired by classical unsupervised pretraining, we reformulate the orthogonality constraint in a data-driven manner to take into account the set of inputs $D_m$ for the current layer (either the activations of the previous layer or the training data $D_1$), and instead minimize

$$L_{RR}^m = \sum_{i} \left\| M_m^T M_m\, d_m^i - d_m^i \right\|_2^2, \quad d_m^i \in D_m. \qquad (3)$$

Due to its reversible nature, we will denote our approach with an RR subscript in the following. In contrast to classical autoencoder pretraining, we minimize this loss jointly for all layers of a network, and while orthogonality only focuses on $M_m$, our formulation allows for minimizing the loss by extracting the dominant features of the input data.
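The data-aware constraint of Eq. (3) can be sketched for a single plain linear layer in NumPy. This is an illustrative sketch under simplifying assumptions (no bias, no activation); the helper name `rr_layer_loss` is ours, not the paper's.

```python
import numpy as np

def rr_layer_loss(M, D):
    """Eq. (3) for one layer: sum_i ||M^T M d_i - d_i||^2, the d_i being columns of D."""
    residual = M.T @ (M @ D) - D
    return float(np.sum(residual ** 2))

rng = np.random.default_rng(0)
B, _ = np.linalg.qr(rng.standard_normal((8, 3)))   # orthonormal basis of a 3-dim subspace
D = B @ rng.standard_normal((3, 100))              # all 100 inputs lie in span(B)

# A layer whose rows span exactly this subspace makes the data as-invertible-as-
# possible: M^T M projects onto span(B), so the loss vanishes even though M is
# not square, while a random layer of the same shape incurs a large loss.
print(rr_layer_loss(B.T, D))                            # ~0.0
print(rr_layer_loss(rng.standard_normal((3, 8)), D) > 1.0)  # True
```

Unlike Eq. (1), the minimum here depends on the data: only the directions actually present in $D_m$ need to be preserved.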
Let q denote the number of linearly independent entries in $D_m$, i.e., its dimension, and t the size of the training data, i.e., $|D_m| = t$, usually with $q < t$. For every single datum, Eq. (3) requires

$$M_m^T M_m\, d_m^i - d_m^i = 0, \qquad (4)$$

and hence the $d_m^i$ are eigenvectors of $M_m^T M_m$ with corresponding eigenvalues of 1. Thus, instead of the generic constraint $M_m^T M_m = I$ that is completely agnostic to the data at hand, the proposed formulation of Eq. (4) is aware of the training data, which improves the generality of the learned representation, as we will demonstrate in detail below.
The result of applying layer m of a network represents the features extracted by this layer via its weight matrix $M_m$. The singular vectors of the SVD of $M_m$ can be regarded as input filters, and we can thus analyze the effect of $M_m$ by focusing on its singular vectors. We employ the SVD to identify which features are extracted by the parameters in $M_m$. As, by construction, $\mathrm{rank}(M_m) = r \le \min(s_m^{in}, s_m^{out})$, the SVD of $M_m$ yields

$$M_m = U_m \Sigma_m V_m^T, \qquad (5)$$

with $U_m$ and $V_m$ containing the left and right singular vectors, respectively [73]. Here, especially the right singular vectors in $V_m^T$ are important, as they determine which structures of the input are processed by the transformation $M_m$. The original orthogonality constraint of Eq. (2) yields r unit vectors $e_m^j$ as the eigenvectors of $M_m^T M_m$. Hence, the influence of Eq. (2) on $V_m$ is completely independent of training data and learning objectives.
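The filter interpretation of Eq. (5) is easy to check numerically. A small NumPy sketch (the layer weights here are random placeholders, not trained values):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 6))          # hypothetical layer weights, s_out=4, s_in=6
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# Rows of Vt are the right singular vectors v_m^i: the input directions
# ("filters") the layer responds to, ordered by decreasing singular value.
print(np.allclose(np.linalg.norm(M @ Vt[0]), S[0]))   # True: strongest input filter
print(np.allclose(M, U @ np.diag(S) @ Vt))            # True: M = U Sigma V^T
```

Feeding the i-th right singular vector through the layer yields a response of magnitude exactly $\sigma_i$, which is why inspecting $V_m$ reveals the structures the layer picks up from its input.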
Next, we show that $L_{RR}$ facilitates learning dominant features from a given dataset. For this, we consider an arbitrary basis for spanning the space of inputs $D_m$ of layer m. Let $B_m: \{w_m^1, \ldots, w_m^q\}$ denote a set of q orthonormal basis vectors obtained via a Gram-Schmidt process, with $t > q > r$, and $D_m$ denoting the matrix of the vectors in $B_m$. As we show in more detail in the Appendix, our constraint from Eq. (4) requires the eigenvectors of $M_m^T M_m$ to be the $w_m^i$, with $V_m$ containing r orthogonal vectors $(v_m^1, v_m^2, \ldots, v_m^r)$ from $D_m$ and $(s_m^{in} - r)$ vectors from the null space of $M_m$. We are especially interested in how $M_m$ changes w.r.t. the input in terms of $D_m$, i.e., we express $L_{RR}$ in terms of $D_m$. By construction, each input $d_m^i$ can be represented as a linear combination $D_m c_m^i$ with a vector of coefficients $c_m^i$, so that the loss $L_{RR}$ of layer m can be rewritten as

$$L_{RR}^m = \left\| M_m^T M_m D_m c_m - D_m c_m \right\|_2^2 = \left\| V_m \Sigma_m^T \Sigma_m V_m^T D_m c_m - D_m c_m \right\|_2^2, \qquad (6)$$

where we can assume that the coefficient vector $c_m$ is accumulated over the training dataset of size t via $c_m = \sum_{i=1}^{t} c_m^i$, since eventually every single datum in $D_m$ contributes to $L_{RR}^m$. The central component of Eq. (6) is $V_m^T D_m$. For a successful minimization, $V_m$ needs to retain those $w_m^i$ with the largest $c_m$ coefficients. As $V_m$ is typically severely limited in its representational capabilities by the number of adjustable weights in a network, it needs to focus on the most important eigenvectors in terms of $c_m$ in order to establish a small distance to $D_m c_m$. Thus, features that appear most often in the input data, with correspondingly large factors in $c_m$, contribute most strongly to minimizing $L_{RR}^m$. Above, $D_m$ is only used implicitly to analyze the different approaches, and we do not impose any explicit requirements on it. Since a fixed dataset determines the corresponding $D_m$ only up to the choice of decomposition, different orthogonal decompositions via Gram-Schmidt lead to different orthonormal bases.
However, these different orthonormal bases can be aligned via rotations, and all span the same vector space. Thus, regardless of the particular orthonormal basis that is used, our method always focuses on extracting the dominant features that appear most frequently in the dataset. This means the components of $D_m$ which contribute most to minimizing the loss are embedded in the neural network. More in-depth discussions are provided in Appendix A.3.
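The dominance argument can be demonstrated with a toy dataset: if one feature direction appears far more often than another, a capacity-limited (here rank-1) layer minimizing $L_{RR}$ must keep the frequent one. This sketch uses the top singular direction of the data as the optimal rank-1 choice; the setup (unit-vector features `w1`, `w2`, 90/10 split) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
w1, w2 = np.eye(8)[0], np.eye(8)[1]       # two candidate features (unit vectors)
# w1 appears in 90 of 100 samples, w2 only in 10: w1 is the dominant feature.
samples = [w1 * rng.standard_normal() for _ in range(90)] + \
          [w2 * rng.standard_normal() for _ in range(10)]
D = np.stack(samples, axis=1)             # (8, 100) data matrix

# The best rank-1 subspace for reconstructing D is spanned by its top singular
# vector, i.e., the direction with the largest accumulated coefficient.
U, S, Vt = np.linalg.svd(D, full_matrices=False)
print(abs(U[:, 0] @ w1))                  # ~1.0: the dominant feature is retained
```

With more capacity (larger r), additional mutually orthogonal directions are retained in order of their accumulated coefficients, mirroring the role of $c_m$ in Eq. (6).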
Comparing our constraint from Eq. (3) with the orthogonal constraint in Eq. (1), we can see that our formulation is actually stricter. As a consequence, our method can retain the advantages of orthogonal constraints while simultaneously embedding dominant features into the weight matrices.
To summarize, $V_m$ is driven toward containing r orthogonal vectors $w_m^i$ that represent the most frequent features of the input data, i.e., the dominant features. Additionally, as the column vectors of $V_m$ are mutually orthogonal, $M_m$ is encouraged to extract different features from the input. Being distinct and representative of the dataset, these features have the potential to be useful for new inference tasks. The feature vectors embedded in $M_m$ can be extracted from the network weights in practical settings, as we will demonstrate below.

Regular training typically starts with a chosen network structure and trains the model weights for a given task via a suitable loss function. Our approach fully retains this setup and adds a second pass that reverses the initial structure while reusing all weights and biases. For instance, for a typical fully connected layer in the forward pass with $d_{m+1} = M_m d_m + b_m$, the reverse pass operation is given by $d'_m = M_m^T (d'_{m+1} - b_m)$.
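The forward/reverse pair for a single fully connected layer can be sketched directly in NumPy (a linear toy example without activation or BN; weights and bias are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(3)
M, b = rng.standard_normal((4, 4)), rng.standard_normal(4)
d = rng.standard_normal(4)

d_next = M @ d + b                        # forward pass of the layer
d_back = M.T @ (d_next - b)               # reverse pass, reusing the same M and b

# For an exactly orthogonal matrix the reverse pass recovers the input; the RR
# loss drives generic weights toward this as-invertible-as-possible state.
Q, _ = np.linalg.qr(M)                    # an exactly orthogonal stand-in
print(np.allclose(Q.T @ ((Q @ d + b) - b), d))   # True
print(np.allclose(d_back, d))                    # False for generic weights
```

No new parameters are introduced by the reverse pass; it is built entirely from the transposed weights and the shared bias of the forward layer.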

Realization in neural networks
Our goal with the reverse pass is to transpose all operations of the forward pass to obtain identical intermediate activations between the layers with matching dimensionality. We can then constrain the intermediate results of each layer of the forward pass to match the results of the backward pass, as illustrated in Fig. 2. While the construction of the reverse pass is straightforward for all standard operations, i.e., fully connected layers, convolutions, pooling, etc., slight adjustments are necessary for nonlinear activation functions (AFs) and batch normalization (BN). It is crucial for our formulation that $d_m$ and $d'_m$ contain the same latent-space content in terms of range and dimensionality, such that they can be compared in the loss. Hence, we use the BN parameters and the activation of layer m-1 from the forward pass, i.e., $f_{m-1}$ and $BN_{m-1}$, for layer m in the reverse pass.
Unlike greedy layer-wise autoencoder pretraining, which trains each layer separately and only constrains $d_1$ and $d'_1$, we jointly train all layers and constrain all intermediate results. Due to the symmetric structure of the two passes, we can use a simple $L_2$ difference to drive the network toward aligning the results:

$$L_{RR} = \sum_{m=1}^{n} \lambda_m \left\| d_m - d'_m \right\|_2^2. \qquad (7)$$

Here, $d_m$ denotes the input of layer m in the forward pass and $d'_m$ the output of layer m in the reverse pass. $\lambda_m$ denotes a scaling factor for the loss of layer m, which, however, is typically constant across all layers in our tests. Note that with our notation, $d_1$ and $d'_1$ refer to the input data and the reconstructed input, respectively.
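A compact sketch of the joint loss of Eq. (7) for a linear toy network, assuming fully connected layers without AFs or BN. The function name `rr_loss` and the uniform weighting `lam` are our illustrative choices:

```python
import numpy as np

def rr_loss(weights, biases, x, lam=1.0):
    """Joint RR loss: lam * sum_m ||d_m - d'_m||^2 over all layers, cf. Eq. (7)."""
    # Forward pass, storing every intermediate activation (d_1 = input).
    acts = [x]
    for M, b in zip(weights, biases):
        acts.append(M @ acts[-1] + b)
    # Full reverse pass from the output, reusing the transposed weights.
    d_rev, loss = acts[-1], 0.0
    for M, b, d_fwd in zip(reversed(weights), reversed(biases), reversed(acts[:-1])):
        d_rev = M.T @ (d_rev - b)
        loss += lam * float(np.sum((d_fwd - d_rev) ** 2))
    return loss

rng = np.random.default_rng(4)
Ws = [rng.standard_normal((6, 6)) for _ in range(3)]
bs = [rng.standard_normal(6) for _ in range(3)]
x = rng.standard_normal(6)
# Orthogonal weights make every layer exactly invertible: zero loss.
Qs = [np.linalg.qr(W)[0] for W in Ws]
print(rr_loss(Qs, bs, x))          # ~0.0
print(rr_loss(Ws, bs, x) > 0)      # True
```

In practice this term is simply added to the task loss during pretraining, with gradients flowing through both passes.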
Next, we show how this setup realizes the regularization from Eq. (3). For clarity, we use a fully connected layer with bias. In a neural network with n hidden layers, the forward process for a layer m is given by

$$d_{m+1} = M_m d_m + b_m,$$

with $d_1$ and $d_{n+1}$ denoting input and output, respectively. Neural networks can be classified according to whether a unique full reverse pass can be built from output to input, and we correspondingly distinguish two variants of our pretraining in the implementation: full network pretraining and localized pretraining.

Full network pretraining
For networks where a unique path from output to input exists, we build a reverse pass network with transposed operations:

$$d'_m = M_m^T (d'_{m+1} - b_m). \qquad (8)$$

As the reverse pass activation $d'_m$ depends on $d'_{m+1}$, this formulation yields a full reverse pass from output to input, which we use for most training runs below. We analyze the influence of Eq. (7) during training by assuming $L_{RR} = 0$ at the minimum. We then obtain intermediate activations during the reverse pass that reconstruct the values computed in the forward pass, i.e., $d'_{m+1} = d_{m+1}$ holds. In this case,

$$\left\| d_m - d'_m \right\|_2^2 = \left\| d_m - M_m^T (M_m d_m + b_m - b_m) \right\|_2^2 = \left\| (M_m^T M_m - I)\, d_m \right\|_2^2, \qquad (9)$$

which means that Eq. (7) is consistent with Eq. (3).
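The consistency argument above is purely algebraic and can be verified numerically for one layer (again a bias-only linear sketch with random placeholder weights):

```python
import numpy as np

rng = np.random.default_rng(5)
M, b = rng.standard_normal((5, 5)), rng.standard_normal(5)
d = rng.standard_normal(5)

d_next = M @ d + b                              # forward: d_{m+1}
# Assume the deeper layers reconstruct perfectly, i.e., d'_{m+1} = d_{m+1}:
d_back = M.T @ (d_next - b)                     # reverse step of layer m
layer_term = float(np.sum((d - d_back) ** 2))   # this layer's term of Eq. (7)
eq3_term = float(np.sum(((M.T @ M - np.eye(5)) @ d) ** 2))  # Eq. (3)
print(np.isclose(layer_term, eq3_term))         # True
```

The bias cancels in the reverse step, leaving exactly the $(M_m^T M_m - I) d_m$ residual that the analysis in the previous section is built on.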

Localized pretraining
For architectures where the reverse path is not unique, e.g., in the presence of additive residual connections, we cannot uniquely determine b and c in a = b + c given only a. In such cases, we use a local formulation, where the forward activation $d_{m+1}$ is used directly as the input of the reverse path of layer m. Eq. (8) then becomes

$$d'_m = M_m^T (d_{m+1} - b_m). \qquad (10)$$

In summary, Eq. (7) drives the network toward a state that is as-invertible-as-possible for the given input dataset. Comparing the two variants, full network pretraining establishes a stronger relationship among the loss terms of different layers, and allows earlier layers to decrease the accumulated loss of later layers. Localized pretraining, on the other hand, remains valid for cases where the reverse path from output to input is not unique.
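The localized variant of Eq. (10) differs from the full reverse pass only in where the reverse input comes from: each layer is reversed from its own forward output rather than from the reconstruction of the next layer. A sketch under the same simplifying linear assumptions as before (`local_rr_loss` is our illustrative name):

```python
import numpy as np

def local_rr_loss(weights, biases, x, lam=1.0):
    """Localized RR loss: each layer reversed from its own forward output d_{m+1}."""
    loss, d = 0.0, x
    for M, b in zip(weights, biases):
        d_next = M @ d + b                # forward step
        d_back = M.T @ (d_next - b)       # Eq. (10): reverse this layer in isolation
        loss += lam * float(np.sum((d - d_back) ** 2))
        d = d_next
    return loss

rng = np.random.default_rng(6)
Ws = [rng.standard_normal((6, 6)) for _ in range(3)]
bs = [rng.standard_normal(6) for _ in range(3)]
x = rng.standard_normal(6)
Qs = [np.linalg.qr(W)[0] for W in Ws]     # orthogonal weights: every layer invertible
print(local_rr_loss(Qs, bs, x))           # ~0.0
```

Because no information flows between the reverse steps, this formulation remains well defined for residual blocks, where the full reverse path would be ambiguous.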
Up to now, the discussion has focused on simplified neural networks with convolutional operations, which are crucial for feature extraction, but without AFs or extensions such as BN, which are applied to increase model nonlinearity. While we leave a more detailed theoretical analysis of these extensions for future work, we apply these nonlinear extensions in all of our tests in Sects. 4 and 5. Thus, our experiments demonstrate that our method works in conjunction with BN and AFs. They consistently show that the inherent properties of our pretraining remain valid: even in the nonlinear setting, our approach successfully extracts dominant structures and yields improved generalization.
In Appendix, we give details on how to ensure that the latent space content for forward and reverse pass is aligned such that differences can be minimized, and we give practical examples of full and localized pretraining architectures.
To summarize, we realize the loss formulation of Eq. (7) to minimize $\sum_{m=1}^{n} \|(M_m^T M_m - I)\, d_m\|_2^2$ without explicitly having to construct $M_m^T M_m$. Following the notation above, we will refer to networks trained with the added reverse structure and the additional loss terms as RR variants. We consider two variants for the reverse pass: a local pretraining via Eq. (10), using the datum $d_{m+1}$ of a given layer, and a full version via Eq. (8), which uses $d'_{m+1}$ incoming from the next layer during the reverse pass. It is worth pointing out that our constraint is only used during the pretraining stage; pretrained models are used as a starting point for the fine-tuning stage, where the search space for parameters is the same as in standard training, i.e., training is not constrained by our approach.

Embedding singular values
In the following, we evaluate networks trained with different methodologies. We distinguish our pretraining approach RR (shown in green in the graphs below), regular autoencoder pretraining Pre (in gray), and orthogonality constraints Ort (in blue). In addition, Std denotes a regular training run, i.e., models trained without autoencoder pretraining, orthogonality regularization, or our proposed method. A subscript denotes the task variant a model was trained for, such as Std_T for task T. While we typically apply our constraints to all layers of a network, a reduced variant that we compare to below only applies the constraint to the input data, i.e., m = 1. A network trained with this variant, denoted by RR^1_A, is effectively trained to only reconstruct the input; it contains no constraints for the inner activations and layers of the network. For the Ort models, we use the Spectral Restricted Isometry Property algorithm [3].
We verify that the column vectors of $V_m$ of models from RR training contain the dominant features of the input with the help of a classification test, employing a single fully connected layer, i.e., $d_2 = M_1 d_1$, with BN and activation. To quantify the similarity between the singular vectors $v_m^i$ and the training data, we compute the Learned Perceptual Image Patch Similarity (LPIPS) [86] between them (lower values being better).
We employ a training dataset constructed from two dominant classes (a peak in the top-left and bottom-right quadrant, respectively), augmented with noise in the form of random scribbles, as shown in Fig. 3. Based on the analysis above, we expect the RR training to extract the two dominant peaks during training. The LPIPS measurements confirm our SVD argumentation above, with average scores of 0.217 ± 0.022 for RR, 0.319 ± 0.114 for Pre, 0.495 ± 0.006 for Ort, and 0.500 ± 0.002 for Std, i.e., the RR model fares significantly better than the others. At the same time, the peaks are clearly visible for RR models, while the other models fail to extract structures that resemble the input. Thus, by training with the full network and the original training objective, our pretraining yields structures that are interpretable and can be inspected by humans.
The results above experimentally confirm our formulation of the RR loss and its ability to extract dominant and generalizing structures from the training data. In addition, they give the first indication that this still holds when nonlinear components such as AFs are present. Next, we will focus on quantified metrics and turn to measurements in terms of mutual information to illustrate the behavior of our pretraining for deeper networks.

Evaluation in terms of mutual information
Mutual information (MI) measures the dependence of two random variables, i.e., higher MI means that more information is shared between the two. As our approach hinges on the introduction of the reverse pass, we will show that it succeeds in establishing MI between the input and the constrained intermediates inside a network. More formally, the MI I(X; Y) of random variables X and Y measures how different the joint distribution of X and Y is from the product of their marginal distributions, via the Kullback-Leibler divergence $I(X; Y) = D_{KL}[P_{(X,Y)} \,\|\, P_X P_Y]$. The authors of [69] proposed the MI plane to analyze trained models, which shows the MI between the input X and the activations $D_m$ of a layer, i.e., $I(X; D_m)$, against $I(D_m; Y)$, the MI of layer $D_m$ with the output Y. These two quantities indicate how much information about the input and output distributions is retained at each layer, and we use them to show to which extent our pretraining succeeds at incorporating information about the inputs throughout training.
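For discrete variables, the KL-based definition of MI can be computed directly from a joint probability table. A small self-contained sketch (the function name is ours; values are in nats):

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) = D_KL(P_{X,Y} || P_X P_Y) for a discrete joint distribution, in nats."""
    px = joint.sum(axis=1, keepdims=True)      # marginal P_X as a column
    py = joint.sum(axis=0, keepdims=True)      # marginal P_Y as a row
    mask = joint > 0                           # skip zero-probability cells
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

# Perfectly correlated bits share log(2) nats (one bit); independent bits share none.
correlated = np.array([[0.5, 0.0], [0.0, 0.5]])
independent = np.array([[0.25, 0.25], [0.25, 0.25]])
print(mutual_information(correlated))    # ~0.693 (= log 2)
print(mutual_information(independent))   # ~0.0
```

Estimating $I(X; D_m)$ for continuous activations additionally requires binning or a dedicated estimator, which is the harder part in practice [74].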
The following tests employ networks with six fully connected layers and nonlinear AFs, with the objective to learn the mapping from 12 binary inputs to 2 binary output digits [64], with results accumulated over five runs. Experimental details are given in the Appendix. We compare the versions Std_A, Pre_A, Ort_A, and RR_A. We visualize model comparisons with MI planes, the content of which is summarized in Fig. 4a. The horizontal and vertical axes of the MI plane denote $I(X; D_m)$ and $I(D_m; Y)$, respectively, measuring the amount of information the m-th layer $D_m$ shares with X and Y after training. This depicts how much information about the input and output distributions is retained at each layer, as well as how these relationships change throughout the network.
For regular training, the information bottleneck principle [69] states that early layers contain more information about the input, i.e., show high values for both $I(X; D_m)$ and $I(D_m; Y)$. As a result, these layers are frequently found in the top-right corner of MI plane visualizations. After training, later layers typically share a large amount of information with the output, i.e., show high $I(D_m; Y)$ values, and correlate less with the input (low $I(X; D_m)$); they typically appear in the top-left corner of MI plane graphs. The graph in Fig. 4b highlights that training with the RR loss (RR_A) correlates input and output distributions across all layers: the cluster of green points in the center of the graph indicates that all layers contain balanced MI between the input as well as the output and the activations of each layer. Std_A and Ort_A almost exclusively focus on the output, with $I(D_m; Y)$ being close to one, while information dropped layer by layer leads to a low $I(X; D_7)$ value. Pre_A instead only focuses on reconstructing inputs: the early layers cluster in the upper-right corner, while the last layer fails to align with the outputs, i.e., $I(D_7; Y)$ remains low. Once we continue to fine-tune these models without regularization, the MI naturally shifts toward the output, as illustrated in Fig. 4c. Here, RR_AA outperforms the other models in terms of final performance. Furthermore, we design a transfer task B with switched output digits, i.e., the original two binary output digits, e.g., (1, 0), are swapped to (0, 1). This change of the dataset results in significantly different mapping relationships between inputs and outputs compared with the original task A. Likewise, RR_AB performs best for transfer task B, as shown in Fig. 4d; the final performance for both tasks across all runs is summarized in Table 1.
The graph demonstrates that the proposed pretraining succeeds in robustly establishing mutual information between inputs and targets across a full network while extracting reusable features. The nonlinearity of the underlying network architectures does not impede the performance of the RR models. It is worth pointing out that Std and Ort exhibit high performance variance in transfer task B, but not in base task A, because Std_A and Ort_A were trained solely to improve task A performance; the extracted features are not guaranteed to be useful for task B, and task B performance is therefore not consistent across training runs. RR_A, on the other hand, focuses on extracting dominant features from the dataset rather than a specific task, which significantly improves the stability of training across different runs for tasks A and B.
Comparing Fig. 4b and d, we can see that after pretraining via our approach, balanced MI is obtained between the input as well as the output and the activations of each layer, indicating that our model extracts balanced features from both input and output. After transfer learning for task B, all layers are located in the top part of the graph with high $I(D_m; Y)$ values, indicating that the model aims to improve the performance for one specific task.
We also compare the mutual information of three variants of our pretraining: the local variant lRR_A, the full version RR_A, and a variant of the latter, RR^1_A, i.e., a version where only the input $d_1$ is constrained to be reconstructed. Figure 5 shows the MI planes for these three models. Although only one layer is constrained with our formulation in RR^1_A, the last two layers of the model are already located in the middle of the MI plane (Fig. 5a), and the influence is in line with our full version RR_A. Despite its local nature, lRR_A manages to establish MI for the majority of the layers, as indicated by the cluster of layers in the center of the MI plane. Only the first layer moves toward the upper-right corner, and the second layer is affected slightly; in other words, these layers exhibit a stronger relationship with the distribution of the outputs. Despite this, the overall performance when fine-tuning or for the task transfer remains largely unaffected; e.g., lRR_AA/AB still clearly outperforms RR^1_AA/AB. This confirms our choice to use the full pretraining when network connectivity permits, and to employ the local version in all other cases. Accuracy comparisons among the different models are given in Table 1: RR_AA/AB yields the highest performance, while lRR_AA/AB performs similarly.
In summary, from the MI tests we can conclude that training with our formulation (RR_A and lRR_A) is useful for correlating input and output distributions across all layers. Furthermore, this correlation is strengthened when more layers are constrained with our formulation, as the comparison of RR_A with RR^1_A shows. Moreover, models pretrained with our formulation, e.g., RR_A and lRR_A, achieve the highest values of I(D_7; Y) and the best performance for the source task A and the transfer task B after fine-tuning.
MI has received attention recently as a learning objective, e.g., in the form of the InfoGAN approach [9] for learning disentangled and interpretable latent representations. While MI is typically challenging to assess and estimate [74], the results above show that our approach provides a straightforward and robust way for including it as a learning objective. In this way, we can easily, e.g., reproduce the disentangling results from [9] without explicitly calculating mutual information, as shown in Fig. 1c. A generative model with our pretraining extracts intuitive latent dimensions for the different digits, line thickness, and orientation without any additional modifications to the loss function. The joint training of the full network with the proposed reverse structure, including nonlinearities and normalization, yields a natural and intuitive decomposition.

Experimental results
We now turn to a broad range of network structures, i.e., CNNs, autoencoders, and GANs, with a variety of datasets and tasks to show that our approach succeeds in improving inference accuracy and generality for modern applications and architectures. All tests use nonlinear activations, and several of them include BN. Experimental details are provided in the Appendix.

CIFAR-100 classification
We first focus on orthogonalization for a CIFAR-100 classification task with a ResNet 18 network, and compare the performance of RR with the variants Std and Ort, in addition to an OCNN network [75] (in light blue). The CNN architecture has ca. 11 million trainable parameters in each case. Pre is not included in this comparison due to its incompatibility with ResNet architectures. The resulting performance of the different variants (evaluated over 3 runs each) is shown in Fig. 6. For CIFAR-100, the orthogonal regularizations (Ort and OCNN) result in noticeable performance gains of 0.33% and 0.337%, but RR clearly outperforms both with an improvement of 1.2%. Despite being different formulations, both Ort and OCNN represent orthogonal regularizers that aim for the same goal of weight orthogonality. Hence, their performance is on par, and we focus on the more generic Ort variant in the following evaluations.

Transfer learning benchmarks
We evaluate our approach with two state-of-the-art benchmarks for transfer learning (Fig. 7). The first one uses the texture-shape dataset from [21], which contains challenging images of various shapes combined with patterns and textures to be classified. The results below are given for 10 runs each. For the stylized data shown in Fig. 8a, the accuracy of Pre_TS is low with 20.8%. This result is in line with observations in previous work and confirms the detrimental effect of classical pretraining. Std_TS yields a performance of 44.2%, and Ort_TS improves the performance to 47.0%, while RR_TS yields a performance of 54.7% (see Fig. 8b). Thus, the accuracy of RR_TS is 162.98% higher than Pre_TS, 23.76% higher than Std_TS, and 16.38% higher than Ort_TS. To assess generality, we also apply the models to new data without re-training, i.e., an edge and a filled dataset, also shown in Fig. 8a. For the edge dataset, RR_TS outperforms Pre_TS, Std_TS, and Ort_TS by 178.82%, 50%, and 16.75%, respectively.
Exemplary test-accuracy curves over training time for Std_TS, Ort_TS, and RR_TS are shown in Fig. 7. Pre_TS is not included, since its layer-wise curriculum precludes a direct comparison. The graph shows that RR_TS converges faster than Std_TS and Ort_TS from the very beginning: it matches the performance of Std_TS and Ort_TS with ca. 1/3 and 1/2 of the number of training epochs, respectively. Achieving comparable performance with less training effort, together with a higher final performance, supports the reasoning given in Sect. 3: RR_TS with its reverse pass is more efficient at extracting relevant features from the training data. Over the course of our tests, we observed a similar convergence behavior for a wide range of other runs.
It is worth pointing out that the additional constraints of our training approach lead to moderately increased requirements for memory and computation, e.g., 41.86% more time per epoch than regular training for the texture-shape test. As this test employs a small network with only ca. 1.2 × 10^4 trainable weights, the computations for our approach still make a noticeable difference in training time. However, as we show below, the difference becomes negligible for larger networks. On the other hand, it allows us to train smaller models: we can reduce the weight count by 32% for the texture-shape case while still being on par with Ort_TS in terms of classification performance. By comparison, regular layer-wise pretraining requires significant overhead and fundamental changes to the training process. Our pretraining fully integrates with existing training methodologies and can easily be deactivated via λ_m = 0. More details on runtime performance and training behavior are given in the Appendix.
As a second test case, we use a CIFAR-based task transfer [61] that measures how well models trained on the original CIFAR 10 generalize to a new dataset (CIFAR 10.1) collected according to the same principles as the original one. Here, we use a ResNet 110 with 110 layers and 1.7 million parameters. Due to the consistently low performance of the Pre models [1], we focus on Std, Ort, and RR for this test case. In terms of accuracy across 5 runs, Ort_C10 outperforms Std_C10 by 0.39%, while RR_C10 outperforms Ort_C10 by another 0.28% in terms of absolute test accuracy (Fig. 9). This increase for RR training matches the gains reported for orthogonality in previous work [3], thus showing that our approach yields substantial practical improvements over the latter. It is especially interesting how well performance for CIFAR 10 translates into transfer performance for CIFAR 10.1. Here, RR_C10 still outperforms Ort_C10 and Std_C10 by 0.22% and 0.95%, respectively. Hence, the models from our pretraining successfully translate gains in performance from the original task to the new one, indicating that they have learned a set of more general features. To summarize, both benchmark cases confirm that the proposed pretraining benefits generalization.

Smoke generation
In this section, we employ our pretraining in the context of generative models for transferring from synthetic to real-world data from the ScalarFlow dataset [17]. As super-resolution task A, we first use a fully convolutional generator network, adversarially trained with a discriminator network on the synthetic flow data. While regular pretraining is more amenable to generative tasks than orthogonal regularization, it cannot be directly combined with adversarial training. Hence, we instead pretrain a model Pre for a high-resolution reconstruction task without a discriminator. Figure 10a demonstrates that our method works well in conjunction with GAN training: as shown in the bottom row, the trained generator succeeds in recovering the input via the reverse pass without modifications. A regular model Std_A only yields a black image in this case. For Pre_A, the layer-wise nature of the pretraining severely limits its capability to learn the correct data distribution [88], leading to low performance.
We now mirror the generator model from the previous task to evaluate an autoencoder structure that we apply to two different datasets: the synthetic smoke data used for the GAN training (task B_1), and a real-world RGB dataset of smoke clouds (task B_2). Thus, both variants represent transfer tasks, the second one being more difficult due to the changed data distribution. The resulting losses, summarized in Fig. 10b, show that RR training performs best for both autoencoder tasks: the L_2 loss of RR_AB1 is 68.88% lower than that of Std_AB1, while it is 13.3% lower for task B_2. The proposed pretraining also clearly outperforms the Pre variant, as shown in Fig. 11, where RR_AB1 provides the best performance among these models. Figure 12 provides visual comparisons for transfer task B_2, where RR_AB2 generates results that are closer to the reference. Within this series of tests, the RR performance for task B_2 is especially encouraging, as this task represents a synthetic-to-real transfer.

Weather forecasting
Pretraining is particularly attractive in situations where the amount of training data is severely limited. Weather forecasting is such a case, as accurate, real-world data for many relevant quantities are only available for approximately 50 years. We use the ERA dataset [31] consisting of assimilated measurements, and additionally evaluate our models with simulated data from the CMIP database [19]. We replicate the architecture and training procedure of the WeatherBench benchmark [59]. Hence, we use prognostic variables at seven vertical levels, together with several surface and constant fields at the current time t as well as at t − 6h and t − 12h as input, and target three-day forecasts of the 500 hPa geopotential (Z500), the 2-meter temperature (T2M), and the 850 hPa temperature (T850), as shown in Fig. 13. Across all cases, irrespective of whether observation or simulation data is used, the RR models clearly outperform the regular models and yield consistent improvements. This also indicates that our approach is compatible with other forms of regularization, such as dropout and L_2 regularization. The RR models yield performance improvements of 6%–8% for the CMIP cases and the ERA case with dropout. Here, the re-trained Std version is on par with the data reported in [59], while our RR model exhibits a performance improvement of 6.3% on average. For the ERA dataset without dropout regularization, the RR model decreases the loss even more strongly, by 13.7%.
Visualizations of an inference result for 9 Aug. 2019, 22:00 from the ERA dataset without dropout regularization are shown in Figs. 1d and 14a. The predictions of RR yield lower errors and are closer to the reference. The same conclusions can be drawn from the example for 26 June 2014, 0:00 from the CMIP dataset without dropout regularization in Fig. 14b.

Conclusions
We have proposed a novel pretraining approach inspired by classic methods for unsupervised autoencoder pretraining and orthogonality constraints. In contrast to the classical methods, we employ a constrained reverse pass for the full nonlinear network structure and include the original learning objective. An SVD of the weight matrices is applied to visually analyze and interpret the learned structures, showing that our method is more capable of extracting dominant features from the training dataset. We have shown for a wide range of scenarios, from mutual information over transfer learning benchmarks to weather forecasting, that the proposed pretraining yields networks with improved performance and better generalizing capabilities. Our training approach is general, easy to integrate, and imposes no requirements regarding network structure or training methods. As a whole, our results show that unsupervised pretraining has not lost its relevance in today's deep learning environment.
As future work, we believe it will be exciting to evaluate our approach in additional contexts, e.g., for temporal predictions [12,32], and for training explainable and interpretable models [9,16,83].

A.1 Pretraining and singular value decomposition
In this section, we give a more detailed derivation of our loss formulation, extending Sect. 3 of the main paper. As explained there, our loss formulation aims for minimizing

L^m_RR = Σ_{i=1}^{t} || M_m^T M_m d_m^i − d_m^i ||_2^2 ,   (A1)

where M_m ∈ R^{s^out_m × s^in_m} denotes the weight matrix of layer m, and data from the input dataset D_m is denoted by d_m^i with i ∈ {1, ..., t}. Here, t denotes the number of samples in the input dataset. Minimizing Eq. (A1) is mathematically equivalent to fulfilling

M_m^T M_m d_m^i = d_m^i for all d_m^i ∈ D_m ,   (A2)

i.e., requiring all eigenvalues of M_m^T M_m that act on the data to be 1.   (A3)

Expressing each datum in terms of an orthonormal basis B_m = {w_m^1, ..., w_m^q}, i.e.,

d_m^i = Σ_h c_m^{i,h} w_m^h ,   (A4)

and substituting Eq. (A4) into Eq. (A1) via an SVD of the matrix M_m = U_m R_m V_m^T, we obtain

L^m_RR = || V_m R_m^T R_m V_m^T Σ_h w_m^h c_m^h − Σ_h w_m^h c_m^h ||_2^2 ,   (A5)

where the coefficient vector c_m is accumulated over the training dataset size t via c_m = Σ_{i=1}^{t} c_m^i. Here, we assume that over the course of a typical training run eventually every single datum in D_m will contribute to L^m_RR. This form of the loss highlights that minimizing L_RR requires an alignment of V_m R_m^T R_m V_m^T w_m^h c_m^h and w_m^h c_m^h. By construction, R_m contains the square roots of the eigenvalues of M_m^T M_m as its diagonal entries. The matrix has rank r = rank(M_m^T M_m), and since all eigenvalues are required to be 1 by Eq. (A3), the multiplication with R_m in Eq. (A5) effectively performs a selection of r column vectors from V_m. Hence, we can focus on the interaction between the basis vectors w_m^h and the r active column vectors of V_m: if these column vectors coincide with r of the basis vectors, i.e., v_m^h = w_m^h, we trivially fulfill the constraint M_m^T M_m w_m^h = w_m^h for those vectors. However, due to r being smaller than q in practice, V_m typically cannot include all vectors from B_m.
As a consequence, the constraint Eq. (A2) is only partially fulfilled: as the w_m^h have unit length, the factors c_m determine the contribution of a datum to the overall loss. A feature w_m^h that appears multiple times in the input data will have a correspondingly larger factor in c_m and hence will contribute more strongly to L_RR. The L_2 formulation of Eq. (A1) leads to the largest contributors being minimized most strongly, and hence the repeating features of the data, i.e., the dominant features, need to be represented in V_m to minimize the loss. Interestingly, this argumentation holds when additional loss terms are present, e.g., a loss term for classification. In such a case, the factors c_m will be skewed toward those components that fulfill the additional loss terms, i.e., favor basis vectors w_m^h that contain information about the loss terms. This, e.g., leads to clear digit structures being embedded in the weight matrices for the MNIST example below.
In summary, to minimize L_RR, V_m is driven toward containing r orthogonal vectors w_m^h which represent the most frequent features of the input data, i.e., the dominant features. It is worth emphasizing that B_m above is only an auxiliary basis, i.e., the derivation does not depend on any particular choice of B_m.
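This argument can be checked numerically. The following numpy sketch (not part of the original experiments; dimensions, learning rate, and step count are hypothetical) minimizes Eq. (A1) for a single matrix by gradient descent on a dataset dominated by two feature vectors, and then verifies via an SVD that the active right-singular vectors of M span those dominant features:

```python
import numpy as np

rng = np.random.default_rng(0)
q, r, t = 8, 2, 256            # data dimension q, layer width r < q, samples t

# Build a dataset dominated by two feature directions plus small noise
basis = np.linalg.qr(rng.normal(size=(q, q)))[0]     # auxiliary orthonormal basis B
coeffs = rng.normal(size=(t, q)) * np.array([5.0, 5.0] + [0.1] * (q - 2))
data = coeffs @ basis.T                              # rows are samples d_i

M = rng.normal(size=(r, q)) * 0.1
lr = 1e-3
for _ in range(5000):
    # L_RR = sum_i ||M^T M d_i - d_i||^2; gradient w.r.t. M is 2 M (S + S^T),
    # with S = sum_i d_i err_i^T (constant factor absorbed into lr)
    err = data @ M.T @ M - data                      # t x q residuals
    M -= lr * M @ (err.T @ data + data.T @ err) / t

# The r active right-singular vectors of M should span the dominant subspace
_, _, Vt = np.linalg.svd(M)
overlap = Vt[:r] @ basis[:, :2]                      # r x 2 overlap matrix
print(np.round(np.abs(np.linalg.det(overlap)), 2))
```

The absolute determinant of the overlap matrix approaches 1 exactly when the two active columns of V coincide (up to rotation) with the two dominant feature directions, mirroring the selection argument above.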

A.2 Examples of network architectures with pretraining
While the proposed pretraining is significantly easier to integrate into training pipelines than classic autoencoder pretraining, there are subtleties w.r.t. the order of operations in the reverse pass, which we clarify with examples in the following sections. To specify NN architectures, we use the following notation: C(k, l, q) and D(k, l, q) denote convolutional and deconvolutional operations, respectively, while fully connected layers are denoted by F(l), where k, l, q denote kernel size, output channels, and stride size, respectively. The bias of a CNN layer is denoted by b. I/O(z) denote input/output, with their dimensionality given by z. I_r denotes the input of the reverse pass network. tanh, relu, and lrelu denote hyperbolic tangent, ReLU, and leaky ReLU activation functions (AFs), where we typically use a leaky slope of 0.2 for the negative half-space. UP, MP, and BN denote 2× nearest-neighbor up-sampling, max pooling with 2×2 filters and stride 2, and batch normalization, respectively.
Below we provide additional examples of how to realize the pretraining loss L_RR in a neural network architecture. As explained in the main document, the constraint Eq. (A1) is formulated via

L_RR = Σ_m λ_m || d_m − d'_m ||_2^2 ,

with d_m and λ_m denoting the vector of activated intermediate data in layer m from the forward pass, and a scaling factor, respectively. Computing the reconstruction d'_m from the subsequent reconstruction d'_{m+1} yields the full reverse pass. On the other hand, computing d'_m directly from d_{m+1} of the forward pass yields a variant that ensures local reversibility of each layer, and yields a very similar performance, as we will demonstrate below. We employ this local loss for networks without a unique, i.e., bijective, connection between two layers; intuitively, when inputs cannot be reliably reconstructed from outputs.
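As a minimal illustration, the following numpy sketch (layer sizes, activations, and λ values are hypothetical, not those of any experiment in the paper) evaluates the loss Σ_m λ_m ||d_m − d'_m||_2^2 for a two-layer fully connected network with a full reverse pass that reuses the transposed weight matrices:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(1)
M1 = rng.normal(size=(16, 8)) * 0.1    # layer-1 weights
M2 = rng.normal(size=(4, 16)) * 0.1    # layer-2 weights
lam = [1.0, 1.0]                       # scaling factors lambda_m

d1 = rng.random(8)                     # input (non-negative in this example)
# Forward pass: activated intermediate states d_m
d2 = relu(M1 @ d1)
d3 = relu(M2 @ d2)

# Full reverse pass: reuse the weights in transposed form and mirror the AFs;
# each d'_m is computed from the deeper reconstruction d'_{m+1}
d2_rev = relu(M2.T @ d3)               # d'_2, reconstructed from d3
d1_rev = relu(M1.T @ d2_rev)           # d'_1; output AF is valid since d1 >= 0

# L_RR = sum_m lambda_m * ||d_m - d'_m||_2^2
L_RR = lam[0] * np.sum((d1 - d1_rev) ** 2) + lam[1] * np.sum((d2 - d2_rev) ** 2)
```

In an actual training run, this loss would simply be added to the original learning objective, with gradients flowing through both forward and reverse passes into the shared weights.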
Full network pretraining: An illustration of a CNN structure with AFs, BN, and the full loss is shown in Fig. 2 of the main paper. To illustrate this setup, we consider an example network employing convolutions with mixed AFs, BN, and MP. Let the network receive a field of 32² scalar values as input. From this input, 20, 40, and 60 feature maps are extracted in the first three layers, while the kernel sizes decrease from 5×5 to 3×3. To clarify the structure, we use a ReLU activation for the first convolution, while the second one uses a hyperbolic tangent, and the third one a sigmoid function. With the notation outlined above, the first three layers of the network are

I(32, 32, 1) − b_1 → relu(C(5, 20, 1)) → MP → BN − b_2 → tanh(C(4, 40, 1)) → MP → BN − b_3 → sigmoid(C(3, 60, 1)) → ...

The reverse pass for evaluating the loss reuses all weights of the forward pass and ensures that all intermediate vectors of activations, d_m and d'_m, have the same size and content in terms of normalization and nonlinearity. We always consider states after activation for L_RR. Thus, d_m denotes activations before pooling in the forward pass, and d'_m contains data after up-sampling in the reverse pass, in order to ensure matching dimensionality. The de-convolutions D_x in the reverse network share weights with C_x in the forward network, i.e., the 4×4×20×40 weight matrix of C_2 is reused in its transposed form as a 4×4×40×20 matrix in D_2. Additionally, the AF and BN of layer 3 from the forward pass do not appear in the last three layers of the reverse pass. This is caused by the fact that both are required to establish the latent space of the fourth layer. Instead, d'_3 in our example represents the activations after the second layer (with BN_2 and tanh), and hence the reverse pass recovers d'_3 via BN_2 and tanh before applying D_2. For the reverse pass, we additionally found it beneficial to employ an AF for the very last layer if the output space has suitable content.
For instance, for inputs in the form of RGB data, we employ an additional ReLU activation for the output to ensure the network generates only positive values.

Localized pretraining: In the example above, we use the full pretraining, where d'_{m+1} is used to reconstruct the activations d'_m. However, if the architecture of the original network makes use of operations between layers that are not bijective, e.g., residual connections, we instead use the local loss. Note that our loss formulation has no problems with irreversible operations within a layer, e.g., most convolutional or fully connected layers are not fully invertible. In all these cases, the loss will drive the network toward a state that is as-invertible-as-possible for the given input dataset. However, this requires a reliable vector of target activations in order to apply the constraints. If the connection between layers is not bijective, we cannot reconstruct this target for the constraints, as in the examples given above. In such cases, we regard every layer as an individual unit to which we apply the constraints by building a localized reverse pass: given, e.g., a convolutional layer with d_2 = relu(BN(C_1(d_1))) in the forward pass, we calculate d'_1 = relu(BN(D_1(d_2))), where D_1 reuses the weights of C_1 in transposed form. We use this local loss, e.g., in the ResNet 110 network below. It is important to note that, despite being closer to regular autoencoder pretraining, this formulation still incorporates all nonlinearities of the original network structure, and jointly trains full networks while taking into account the original learning objective.
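The local variant can be sketched in the same fashion as the full loss above (again with hypothetical layer sizes, using a fully connected layer in place of a convolution for brevity): each layer is treated as an individual unit whose input is reconstructed directly from its own forward output.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(2)
M1 = rng.normal(size=(16, 8)) * 0.1
d1 = rng.random(8)                     # non-negative input
d2 = relu(M1 @ d1)                     # forward pass of one layer

# Local loss: d'_1 is computed directly from the forward activation d2,
# reusing M1 in transposed form; no deeper layers are involved
d1_rev = relu(M1.T @ d2)
L_local = np.sum((d1 - d1_rev) ** 2)
```

Because each layer only needs its own forward output, this variant remains applicable when skip or residual connections make the layer-to-layer mapping non-bijective.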

A.3 MNIST and peak tests
Below we give details for the peak tests from Sect. 3 of the main paper and show additional tests with the MNIST dataset.
Peak test: For the peak test, we generated a dataset of 110 images, shown in Fig. 15. 55 images contain a peak located in the upper-left corner of the image; the other 55 contain a peak located in the bottom-right corner. We added random scribbles to the images to complicate the task. All 110 images were labeled with a one-hot encoding of the two possible peak positions. We use 100 images as the training dataset, and the remaining 10 for testing. All peak models are trained for 5000 epochs with a learning rate of 0.0001, with λ = 1e−6 for RR_A. To draw reliable conclusions, we show results for five repeated runs. The neural network in this case contains one fully connected layer, with BN and ReLU activation. The results are shown in Fig. 16: both peak modes are consistently embedded into the weight matrix of RR_A, while regular training, autoencoder pretraining, and orthogonal training show primarily random singular vectors. For this test, the dataset is constructed such that the two Gaussian peaks are the dominant features of the dataset. No matter which orthonormal basis the network converges to, the two dominant peaks are included in V_m with our approach. Specifically, after training, we can see via an SVD that the network consistently learns to encode these two peaks in its parameters, since they contribute most to the reconstruction loss.
We also use different network architectures in Fig. 17 to verify that the dominant features are successfully extracted when using more complex network structures. Even for two layers with BN and ReLU activations, our pretraining clearly extracts the two modes of the training data. The visual resemblance is slightly reduced in this case, as the network has the freedom to embed the features in both layers. Across all three cases, our pretraining clearly outperforms regular training and the orthogonality constraint in terms of extracting and embedding the dominant structures of the training dataset in the weight matrix. It also yields lower LPIPS scores than autoencoder pretraining, which indicates that the features embedded in RR models represent the training data better.

MNIST Test
We additionally verify that the column vectors of V_m of models from RR training contain the dominant features of the input with MNIST tests, which employ a single fully connected layer, i.e., d_2 = M_1 d_1. In the first MNIST test, the training data consists of only 2 different images. All MNIST models are trained for 1000 epochs with a learning rate of 0.0001, and λ = 1e−5 for RR_A. After training, we compute the SVD of M_1. The SVDs of the weight matrices of the trained models can be seen in Fig. 18. The LPIPS scores show that the features embedded in the weights of RR are consistently closer to the training dataset than for all other methods, i.e., regular training Std, classic autoencoder pretraining Pre, and regularization via orthogonalization Ort, while the vectors of Std and Ort contain no recognizable structures.
Overall, our experiments confirm the motivation of our pretraining formulation. They additionally show that employing an SVD of the network weights after our pretraining provides a convenient tool for gaining insights into the learned structures.

Appendix B Mutual information
This section gives details of the mutual information and disentangled representation tests from Sect. 4 of the main paper.

B.1 Mutual information test
Mutual information (MI) measures the dependence of two random variables, i.e., higher MI means that more information is shared between the two. More formally, the mutual information I(X; Y) of random variables X and Y measures how different their joint distribution is from the product of their marginal distributions, i.e., I(X; Y) = KL[P_(X,Y) || P_X P_Y], where KL denotes the Kullback-Leibler divergence. Let I(X; D_m) denote the mutual information between the activations of a layer D_m and the input X. Similarly, I(D_m; Y) denotes the MI between layer m and the output Y. In the main paper, we use MI planes, which show I(X; D_m) and I(D_m; Y) in a 2D graph for the activations of each layer D_m of a network after training. Training details: We use the same numerical studies as in [64] as task A, i.e., a regular feed-forward neural network with 6 fully connected layers. The input variable X contains 12 binary digits that represent 12 uniformly distributed points on a 2D sphere. The learning objective is to discover binary decision rules which are invariant under O(3) rotations of the sphere. X has 4096 different patterns, which are divided into 64 disjoint orbits of the rotation group, forming a minimal sufficient partition for spherically symmetric rules [41]. To generate the input-output distribution P(X, Y), we apply the stochastic rule p(y = 1 | x) = Ψ(f(x) − θ) for x ∈ X, y ∈ Y, where Ψ is a standard sigmoidal function Ψ(u) = 1 / (1 + exp(−γu)), following [64]. We then use a spherically symmetric real-valued function of the pattern, f(x), evaluated through its spherical harmonics power spectrum [41], and compare it with a threshold θ, which is selected such that p(y = 1) = Σ_x p(y = 1 | x) p(x) ≈ 0.5, with uniform p(x). γ is chosen high enough to keep the mutual information I(X; Y) ≈ 0.99 bits.
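For discrete distributions, the MI definition above is straightforward to evaluate directly. The following numpy sketch (the joint distribution is an illustrative made-up example, not the one from the test above) computes I(X; Y) as the KL divergence between the joint distribution and the product of its marginals, in bits:

```python
import numpy as np

# Joint distribution P(X, Y) of two binary variables (illustrative values)
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px = pxy.sum(axis=1)                   # marginal P(X)
py = pxy.sum(axis=0)                   # marginal P(Y)

# I(X; Y) = KL[P_(X,Y) || P_X P_Y], evaluated in bits via log base 2
mi = np.sum(pxy * np.log2(pxy / np.outer(px, py)))
print(round(mi, 3))                    # -> 0.278
```

Independent variables (pxy equal to the outer product of the marginals) would yield exactly 0 bits; estimating MI for the continuous activations of a trained network is far harder, which is what makes methods such as [74] necessary.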
For the transfer-learning task B, we invert the output labels to check whether the model learned specific or generalizing features, e.g., if the output is [0, 1] in the original dataset, we swap the entries to [1, 0]. 80% of the data (3277 data pairs) are used for training and the rest (819 data pairs) for testing. For the MI comparison in Fig. 4, we discuss models before and after fine-tuning separately, in order to illustrate the effects of regularization. We include a model with greedy layer-wise pretraining Pre_A, a regular model Std_A, one with orthogonality constraints Ort_A, and our regular model RR_A, all before fine-tuning. For the model RR_A, all layers are constrained to be recovered in the backward pass. We additionally include the version RR^1_A, i.e., a model trained with only one loss term λ_1 || d_1 − d'_1 ||_2^2, which means that only the input is constrained to be recovered. Thus, RR^1_A represents a simplified version of our approach which receives no constraints that intermediate results of the forward and backward pass should match. All models are trained for 20000 epochs. For Ort_A, we used the Spectral Restricted Isometry Property (SRIP) regularization [3], i.e., β · σ(W^T W − I), where W is the kernel, I denotes an identity matrix, and β represents the regularization coefficient. σ(W) = sup_{z ∈ R^n, z ≠ 0} ||Wz|| / ||z|| denotes the spectral norm of W. As explained in the main text, all layers of the first stage, i.e., from RR_A, RR^1_A, Ort_A, Pre_A, and Std_A, are reused for training the fine-tuned models without regularization, i.e., RR_AA, RR^1_AA, Ort_AA, Pre_AA, and Std_AA. Likewise, all layers of the transfer-task models RR_AB, RR^1_AB, Ort_AB, Pre_AB, and Std_AB are initialized from the models of the first training stage.

Analysis of results
We first compare the version constraining only input reconstruction (RR^1_A) and the full-loss version RR_A. Fig. 4b of the main paper shows that all points of RR_A are located in a central region of the MI plane, which means that all layers successfully encode information about the inputs as well as the outputs. This also indicates that every layer contains a similar amount of information about X and Y, and that the path from input to output is similar to the path from output to input. The points of RR^1_A, on the other hand, form a diagonal line, i.e., this network has different amounts of mutual information across its layers, and potentially a very different path for each direction. This difference in behavior is caused by the difference of the constraints in these two versions: RR^1_A is only constrained to be able to regenerate its input, while RR_A constrains the intermediate activations of all layers to match between forward and reverse pass. The layers of Std_A and Ort_A, in contrast, move toward high I(D; Y) and low I(X; D) values. This indicates that the outputs were successfully encoded and that increasing amounts of information about the inputs are discarded. Hence, more specific features about the given output dataset are learned by Std_A and Ort_A. This shows that both models are highly specialized for the given task, and potentially perform worse when applied to new tasks. Pre_A only focuses on decreasing the reconstruction loss, which results in high I(X; D) values for early layers, and low I(D; Y) values for later layers.
During the fine-tuning phase for task A (i.e., with regularizers disabled), all models focus on the output and maximize I(D; Y). There are differences in the distributions of the points along the y-axis, i.e., in how much MI with the output is retained, as shown in Fig. 4c of the main paper. For model RR_AA, the I(D; Y) value is higher than for Std_AA, Ort_AA, Pre_AA, and RR^1_AA, which means that the activations of RR_AA are more closely related to the outputs, i.e., the ground-truth labels for task A. Thus, RR_AA outperforms the other variants for the original task.
In the fine-tuning phase for task B, Std_AB stands out with a very low accuracy in Table 1 of the main paper. This model from a regular training run has large difficulties adapting to the new task. Pre_A aims at extracting features from the inputs and reconstructing them; Pre_AB outperforms Std_AB, which means that features helpful for task B are extracted by Pre_A. However, it is hard to guide this feature-extraction process. Model Ort_AB also performs worse than Std_B. RR_AB shows the best performance in this setting, demonstrating that our loss formulation yields more generic features, improving the performance for related tasks such as the inverted outputs of B. Unlike regular training, where MI consistently decreases from the first to the last layer, the MI of layers produced by our formulation can be higher than the MI of preceding layers. During our pretraining stage, information can be transported from the first layer to the last layer as in a regular training process. However, it can also be transported from the last layer to previous layers via the reverse pass network. This allows earlier layers to be adjusted via later layers, resulting in an increased MI. Compared to regular training, our pretraining achieves a stronger correlation between the input and output distributions across all layers. The fine-tuning stage afterward aims to increase I(D_7; Y) for a higher accuracy. As a result of the strong correlation between all layers, increasing I(D_7; Y) leads to inner layers exhibiting an increase in MI.

B.2 Disentangled representations
The InfoGAN approach [9] demonstrated the possibility of controlling the output of generative models via maximizing the mutual information between outputs and structured latent variables. However, mutual information is very hard to estimate in practice [74]. The previous section and Fig. 4b of the main paper demonstrated that models from our pretraining (both RR^1_A and RR_A) increase the mutual information between network inputs and outputs. Intuitively, the pretraining explicitly constrains the model to recover an input given an output, which directly translates into an increase in mutual information between input and output distributions compared to regular training runs. To highlight how our pretraining can yield disentangled representations (as discussed in the later paragraphs of Sect. 4 of the main text), we follow the experimental setup of InfoGAN [9]: the input dimension of our network is 74, containing one ten-dimensional category code c_1, two continuous latent codes c_2, c_3 ~ U(−1, 1), and 62 noise variables. Here, U denotes a uniform distribution.
Training details As InfoGAN focuses on structuring latent variables and thus only increases the mutual information between latent variables and the output, we also focus the pretraining on the corresponding latent variables, i.e., the goal is to maximize their mutual information with the output of the generative model. Hence, we train a model RR^1 for which only the latent dimensions c_1, c_2, c_3 of the input layer are involved in the loss. We still employ a full reverse pass structure in the neural network architecture. c_1 is a ten-dimensional category code, which is used for controlling the output digit category, while c_2 and c_3 are continuous latent codes that represent (previously unknown) key properties of the digits, such as orientation or thickness. Building a relationship between c_1 and the outputs is more difficult than for c_2 or c_3, since the 10 different digit outputs need to be encoded in the single code c_1. Thus, for the corresponding loss term for c_1 we use a slightly larger λ factor (by 33%) than for c_2 and c_3. Details of our results are shown in Fig. 19. Models are trained using a GAN loss [25] as the loss function for the outputs.
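The latent-only reverse-pass loss with the larger weight on c_1 can be sketched as follows; the function name, the squared-error form, and the absolute λ value are illustrative assumptions, only the 33% ratio comes from the text:

```python
# Hypothetical sketch of the reverse-pass loss restricted to the latent
# codes: the category code c1 receives a lambda 33% larger than c2, c3;
# the 62 noise dimensions are left unconstrained.
import numpy as np

def latent_reverse_loss(z, z_rec, lam=1.0):
    err = (z - z_rec) ** 2        # squared recovery error (assumed form)
    loss_c1 = err[:, :10].mean()  # ten-dimensional category code c1
    loss_c23 = err[:, 10:12].mean()  # continuous codes c2, c3
    return 1.33 * lam * loss_c1 + lam * loss_c23

z = np.zeros((4, 74))
assert latent_reverse_loss(z, z) == 0.0  # perfect recovery gives zero loss
```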
Analysis of results In Fig. 19, we show additional results for the disentangling test case. It is visible that our pretraining of the RR^1 model yields distinct and meaningful latent space dimensions for c_1,2,3. While c_1 controls the digit, c_2,3 control the style and orientation of the digits. For comparison, a regular training run with model Std does not result in meaningful or visible changes when adjusting the latent space dimensions. This illustrates how strongly the pretraining can shape the latent space and, in addition to an intuitive embedding of dominant features, yield a disentangled representation.

Appendix C Details of experimental results
To ensure reproducibility, source code and data for all tests will be published. Runtimes were measured on a machine with Nvidia GeForce GTX 1080 Ti GPUs and an Intel Core i7-6850K CPU.

Training details
All training data of the texture-shape tests were obtained from [21]. The stylized dataset contains 1280 images; 1120 are used as training data and 160 as test data. The edge and filled datasets contain 160 images each, all of which are used for testing only. All three sets (stylized, edge, and filled) contain data for 16 different classes.
Analysis of results For a detailed comparison, we list the per-class accuracy of stylized data training runs for Ort_TS, Std_TS, Pre_TS, and RR_TS in Fig. 20. RR_TS outperforms the other three models for most of the classes. RR_TS requires 41.86% more training time than Std_TS, but yields a 23.76% higher performance. (Training times for these models are given in the supplementary document.)
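The per-class accuracies of Fig. 20 follow the usual definition; a minimal sketch (function name and dictionary output are illustrative):

```python
# Sketch of per-class accuracy: for each of the 16 texture-shape classes,
# accuracy is evaluated only on the samples belonging to that class.
import numpy as np

def per_class_accuracy(labels, preds, num_classes=16):
    acc = {}
    for c in range(num_classes):
        mask = labels == c
        if mask.any():  # skip classes absent from the test set
            acc[c] = float((preds[mask] == c).mean())
    return acc

labels = np.array([0, 0, 1, 1, 2])
preds = np.array([0, 1, 1, 1, 2])
assert per_class_accuracy(labels, preds, num_classes=3) == {0: 0.5, 1: 1.0, 2: 1.0}
```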

C.2 Smoke generation
Training details The dataset of the smoke simulation was generated with a Navier-Stokes solver from an open-source library [68]. We generated 20 randomized simulations with 120 frames each, with 10% of the data being used for training. The low-resolution data were downsampled from the high-resolution data by a factor of 4. Data augmentation, such as flipping and rotation, was used in addition. As outlined in the main text, we consider building an autoencoder model for the synthetic data as task B_1, and generating samples from a real-world smoke dataset as task B_2. The smoke capture dataset for B_2 contains 2500 smoke images from the ScalarFlow dataset [17], and we again used 10% of these images as training data.
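The data preparation described above can be sketched as follows; the pooling type (averaging) and the restriction to 90-degree rotations are assumptions for illustration:

```python
# Sketch of the smoke data preparation: 4x downsampling of high-resolution
# frames plus flip/rotation augmentation.
import numpy as np

def downsample4(field):
    """Average-pool a (H, W) field by a factor of 4 in each dimension."""
    h, w = field.shape
    return field.reshape(h // 4, 4, w // 4, 4).mean(axis=(1, 3))

def augment(field, rng):
    """Random horizontal flip and random 90-degree rotation."""
    if rng.random() < 0.5:
        field = np.flip(field, axis=1)
    return np.rot90(field, k=rng.integers(0, 4))

hi = np.arange(64 * 64, dtype=float).reshape(64, 64)
lo = downsample4(hi)
assert lo.shape == (16, 16)
```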

Task A We use a fully convolutional CNN-based architecture for the generator and discriminator networks. Note that the inputs of the discriminator contain high-resolution data (64, 64, 1) as well as low-resolution data (16, 16, 1), which is up-sampled to (64, 64, 1) and concatenated with the high-resolution data. In line with previous work [79], RR_A and Std_A are trained with a non-saturating GAN loss, feature space loss, and L2 loss as base loss function. All generator layers are involved in the pretraining loss. As greedy layer-wise autoencoder pretraining is not compatible with adversarial training, we pretrain Pre_A for reconstructing the high-resolution data instead (loss: 9.42e7 ± 6.11e7). We believe the reason is that initializing both the encoder and decoder parts makes it more difficult to adjust the parameters for a new dataset that is very different from the dataset of the source task.
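The discriminator input assembly described above can be sketched as follows; nearest-neighbor upsampling is an assumption (the paper does not state the interpolation mode):

```python
# Sketch of the discriminator input: the (16, 16, 1) low-resolution field is
# upsampled to (64, 64, 1) and concatenated with the (64, 64, 1)
# high-resolution field along the channel axis.
import numpy as np

def disc_input(hi_res, lo_res):
    up = lo_res.repeat(4, axis=0).repeat(4, axis=1)  # nearest-neighbor 4x
    return np.concatenate([hi_res, up], axis=-1)     # shape (64, 64, 2)

hi = np.zeros((64, 64, 1))
lo = np.ones((16, 16, 1))
assert disc_input(hi, lo).shape == (64, 64, 2)
```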

Analysis of results Example outputs of Pre_AB1, Std_AB1, and RR_AB1 are displayed in Fig. 11. We can observe that the results of Pre_AB1 are blurry, indicating that features learned from task A with greedy layer-wise pretraining are not successfully transferred to task B_1. Likewise, Std_AB1 cannot reproduce the correct details of the smoke frame, while RR_AB1 produces the results closest to the reference. We similarly illustrate the behavior of the transfer learning task B_2 for images of real-world fluids. This example likewise uses an autoencoder structure. Visual comparisons are provided in Fig. 12 in the main paper. Similar to task B_1, Pre_AB2 and Std_AB2 cannot recover the smoke details properly, e.g., there are noisy colors in the results of Std_AB2. On the other hand, the results of RR_AB2 are closer to the reference.
Overall, these findings demonstrate the benefits of our pretraining for GANs, as well as its potential to obtain more generic features from synthetic datasets that can be used for tasks involving real-world data.

C.3 Weather forecasting
Training details The weather forecasting scenario discussed in the main text follows the methodology of the WeatherBench benchmark [60]. This benchmark contains 40 years of data from the ERA reanalysis project [31], which was re-sampled to a 5.625° resolution, yielding 32 × 64 grid points in ca. two-hour intervals. Data from the years 1979 to 2015 (i.e., 324192 samples) are used for training. The benchmark also contains 156 years of historical simulation data from [19], of which data from the years 1850 to 2005 (i.e., 224672 samples) are used for training. All RMSE measurements are latitude-weighted to account for area distortions from the spherical projection. The neural networks for the forecasting tasks employ a ResNet architecture with 19 layers, all of which contain 128 features with 3 × 3 kernels (apart from 7 × 7 in the first layer). All layers use batch normalization, leaky ReLU activation (slope 0.3), and dropout with strength 0.1. As inputs, the model receives feature-wise concatenated data from the WeatherBench data for 3 consecutive time steps, i.e., t, t − 6h, and t − 12h, yielding 117 channels in total. The last convolution jointly generates all three output fields, i.e., pressure at 500 hPa (Z500), temperature at 850 hPa (T850), and the 2-meter temperature (T2M). Following [59], the learning rate was decreased by a factor of 5 when the loss did not decrease for two epochs, and training is terminated after 5 epochs without improvements. It is worth pointing out that for networks with large sizes, such as this weather forecasting test with 6.36M trainable parameters, the training time difference between RR and Std is negligible, at about 68.01 vs. 68.44 min/epoch, respectively.
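The plateau schedule described above can be sketched as a small state machine; the class name and exact bookkeeping are illustrative assumptions, the factor-5 drop and the 2/5-epoch patience values come from the text:

```python
# Sketch of the training schedule: divide the learning rate by 5 after two
# epochs without improvement; stop after five epochs without improvement.
class PlateauSchedule:
    def __init__(self, lr, drop_patience=2, stop_patience=5, factor=5.0):
        self.lr, self.factor = lr, factor
        self.drop_patience, self.stop_patience = drop_patience, stop_patience
        self.best = float("inf")
        self.stale = 0  # epochs since the loss last improved

    def step(self, loss):
        """Record the epoch loss; return False once training should stop."""
        if loss < self.best:
            self.best, self.stale = loss, 0
        else:
            self.stale += 1
            if self.stale % self.drop_patience == 0:
                self.lr /= self.factor
        return self.stale < self.stop_patience

sched = PlateauSchedule(lr=1e-3)
for loss in [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]:
    running = sched.step(loss)
assert not running and sched.lr < 1e-3  # stopped, lr dropped twice
```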

Analysis of results
In addition to the quantitative results given in the main text, Fig. 21 contains additional example visualizations from the test dataset, together with a visualization of the spatial error distribution w.r.t. the ground truth. It becomes apparent that our pretraining achieves reduced errors across the whole range of samples. Both temperature targets contain a larger number of smaller-scale features than the pressure fields. The improvements in MAE from our pretraining approach are substantial (ca. 3%–10% across all cases), especially considering that the learning objective is highly non-trivial and that the improvements were achieved with the same limited set of training data. As the approach is very easy to integrate into existing training pipelines, these results indicate that the proposed pretraining methodology has the potential to yield improved learning results for a wide range of problem settings.
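The latitude weighting of the error measurements mentioned in the training details can be sketched as follows; the cos(latitude) weighting normalized to mean 1 is the standard WeatherBench convention, while the function name is illustrative:

```python
# Sketch of a latitude-weighted RMSE: grid cells are weighted by
# cos(latitude), normalized to mean 1, to account for the shrinking cell
# area toward the poles on the 32 x 64 lat-lon grid.
import numpy as np

def lat_weighted_rmse(pred, target, lats_deg):
    """pred/target: (n_lat, n_lon) fields; lats_deg: (n_lat,) latitudes."""
    w = np.cos(np.deg2rad(lats_deg))
    w = w / w.mean()                  # normalize weights to mean 1
    se = (pred - target) ** 2
    return float(np.sqrt((w[:, None] * se).mean()))

lats = np.linspace(-87.1875, 87.1875, 32)  # 5.625-degree grid centers
pred = np.zeros((32, 64))
target = np.ones((32, 64))
assert abs(lat_weighted_rmse(pred, target, lats) - 1.0) < 1e-9
```

With a constant error field the weighted RMSE reduces to the plain RMSE, which makes the mean-1 normalization easy to sanity-check.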
Funding Open Access funding enabled and organized by Projekt DEAL. This work was funded by the ERC-2019-COG-863850 SpaTe project.
Data availability All data generated or analyzed during this study are included in this published article and its supplementary material.

Declarations
Conflict of interest The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.