Reviving autoencoder pretraining

Xie, You; Thuerey, Nils

doi:10.1007/s00521-022-07892-0


Original Article
Open access
Published: 26 October 2022

Volume 35, pages 4587–4619, (2023)

Neural Computing and Applications


You Xie¹ &
Nils Thuerey¹


Abstract

The pressing need for pretraining algorithms has been diminished by numerous advances in terms of regularization, architectures, and optimizers. Despite this trend, we re-visit the classic idea of unsupervised autoencoder pretraining and propose a modified variant that relies on a full reverse pass trained in conjunction with a given training task. This yields networks that are as-invertible-as-possible and share mutual information across all constrained layers. We additionally establish links between singular value decomposition and pretraining and show how it can be leveraged for gaining insights about the learned structures. Most importantly, we demonstrate that our approach yields an improved performance for a wide variety of relevant learning and transfer tasks ranging from fully connected networks over residual neural networks to generative adversarial networks. Our results demonstrate that unsupervised pretraining has not lost its practical relevance in today’s deep learning environment.


1 Introduction

While approaches such as greedy layer-wise autoencoder pretraining [4, 18, 72, 78] paved the way for many fundamental concepts of today’s methodologies in deep learning, the pressing need for pretraining neural networks has been diminished in recent years. An inherent problem is the lack of a global view: layer-wise pretraining is limited to adjusting individual layers one at a time. Thus, bottom layers that are optimized first cannot be adjusted to correct errors in higher layers [11, 87]. In addition, numerous advancements in regularization [28, 43, 66, 76], network architectures [30, 63, 71], and improved optimization algorithms [44, 52, 62] have decreased the demand for layer-wise pretraining. Despite these advances, training deep neural networks that generalize well to a wide range of previously unseen tasks remains a fundamental challenge [20, 40, 55, 56] (Fig. 1).

In this paper, we develop an algorithm that reformulates autoencoder pretraining in a global way to arrive at a method that efficiently extracts general, dominant features from datasets. These features in turn improve performance for new tasks. Our approach is also inspired by techniques for orthogonalization [3, 38, 50, 57]. Hence, we propose a modified variant that relies on a full reverse pass trained in conjunction with a given training task. A key insight is that there is no need for “greediness,” i.e., layer-wise decompositions of the network structure, and it is additionally beneficial to take into account a specific problem domain at the time of pretraining. We establish links between singular value decomposition (SVD) and pretraining, and show how our approach yields an embedding of problem-aware dominant features in the weight matrices. An SVD can then be leveraged to conveniently gain insights about learned structures. Unlike orthogonalization techniques, we focus on embedding the dominant features of a dataset into the weights of a network. This is achieved via a reverse pass network. This reverse pass is generic, simple to construct, and directly relates to model performance, instead of, e.g., constraining the orthogonality of weights. Most importantly, we demonstrate that the proposed pretraining yields an improved performance for a variety of learning and transfer tasks, while incurring only a minor extra computational cost from the reverse pass.

The structure of our networks is influenced by invertible network architectures that have received significant attention in recent years [24, 34, 36, 85]. However, these approaches rely heavily on specific network architectures. Instead of aiming for a bijective mapping that reproduces inputs, we strive for learning a general representation by constraining the network to represent an as-reversible-as-possible process for all intermediate layer activations. Thus, even for cases where a classifier can, e.g., rely on color for inference of an object type, the model is encouraged to learn a representation that can recover the input. Hence, not only the color of the input should be retrieved, but also, e.g., its shape, so that more dominant features of the input dataset are embedded into the networks. In contrast to most structures for invertible networks, our approach does not impose architectural restrictions. We demonstrate the benefits of our pretraining for a variety of architectures, from fully connected layers to convolutional neural networks (CNNs) [46], over networks with batch normalization or dropout regularization, to generative adversarial network (GAN) architectures [25].

Below, we will first give an overview of our formulation and its connection to singular values, before evaluating our model in the context of transfer learning. For a regular, i.e., a non-transfer task, the goal usually is to train a network that gives optimal performance for one specific goal. During a regular training run, the network naturally exploits any observed correlations between input and output distribution. An inherent difficulty in the transfer setting is that typically no knowledge about the specifics of the new data and task domains is available when training the source model. Hence, it is common practice to target broad and difficult tasks, hoping that this will result in features that are applicable in new domains [14, 26, 82]. Motivated by autoencoder pretraining, we instead leverage a pretraining approach that takes into account the data distribution of the inputs and drives the network to extract dominant features from the datasets, which differs from regular training for optimal performance of one specific goal. We demonstrate that our approach boosts the model accuracy for original and new tasks for a wide range of applications, from image classification to data-driven weather forecasting.

2 Related work

Greedy layer-wise pretraining was first proposed by Bengio et al. [4], and influenced a large number of follow-up works, providing a crucial method for feature extraction and enabling stable training runs of deeper networks. A detailed evaluation was performed by Erhan et al. [18], also highlighting cases where it can be detrimental. These problems were later detailed in other works [1]. Principal component analysis (PCA) [29, 77] is a popular approach for dimensionality reduction and feature extraction, and variants were proposed to, e.g., handle nonlinear relationships between variables [33, 51], separate interpretable components [5], and improve robustness in the presence of outliers [80]. However, PCA is computationally intensive in both memory and run time for larger datasets. Clustering is another popular alternative [6, 22, 65, 84, 89]. As these methods rely on data similarities, they incur a high complexity as the dataset size increases [7]. Sharing similarities with our approach, Rasmus et al. [58] combined supervised and unsupervised learning objectives, but focused on denoising autoencoders and a layer-wise approach without weight sharing.

Unsupervised approaches for representation learning [23, 37, 42, 48, 81], especially contrastive learning, such as SimCLR [8], MoCo-v2 [10], ProtoNCE [49], and PaCo [13], similarly aim for learning generic features from a given dataset, but typically necessitate sophisticated training algorithms. We demonstrate the importance of leveraging state-of-the-art methods for training deep networks, i.e., without decomposing or modifying the network structure. This not only improves performance, but also very significantly simplifies the adoption of the pretraining pass in new application settings.

Extending the classic viewpoint of unsupervised autoencoder pretraining, regularization techniques have also been commonly developed to improve the properties of neural networks [45, 47]. Several prior methods employed “hard orthogonal constraints” to improve weight orthogonality via SVD at training time [35, 38, 57]. Bansal et al. [3] additionally investigated efficient formulations of the orthogonality constraints. Orthogonal convolutional neural networks (OCNN) [75] reformulate the orthogonality constraints so that they can be computed efficiently for convolutional layers. In practice, these constraints are difficult to satisfy, and correspondingly only weakly imposed. In addition, all of these methods focus on improving performance for a known, given task. This means the training process only extracts features that the network considers useful for improving the performance of the current task, not necessarily improving generalization or transfer performance [70]. While our approach shares similarities with SVD-based constraints, it can be realized with a very efficient $L^2$-based formulation, and takes the full input distribution into account.

Recovering all input information from hidden representations of a network is generally very difficult [15, 53, 54], due to the loss of information throughout the layer transformations. In this context, [69] proposed the information bottleneck principle, which states that for an optimal representation, information unrelated to the current task is omitted. This highlights the common specialization of conventional training approaches.

Reversed network architectures were proposed in previous work [2, 24, 36, 39], but mainly focus on how to make a network fully invertible via augmenting the network with special structures. As a consequence, the path from input to output is different from the reverse path that translates output to input. Besides, the augmented structures of these approaches can be challenging to apply to general network architectures. In contrast, our approach fully preserves an existing architecture for the backward path, and does not require any operations that were not part of the source network. As such, it can easily be applied in new settings, e.g., adversarial training [25]. While methods using reverse connections were previously proposed [67, 85], these modules primarily focus on transferring information between layers for a given task, and on autoencoder structures for domain adaptation, respectively.

3 Method

With state-of-the-art deep learning methods [27, 88], there is no need for breaking down the training process into single layers. Hence, we consider approaches that target whole networks, and employ orthogonalization regularizers as a starting point [35]. Orthogonality constraints were shown to yield improved training performance in various settings [3], and for an n-layer network, they can be formulated as:

$$\begin{aligned} \mathcal {L}_{\text {ort}} = \sum _{m=1}^{n}\left\| M_{m}^{T} M_{m} - I\right\| _F^2 , \end{aligned}$$

(1)

i.e., enforcing the transpose of the weight matrix $M_m\in \mathbb {R}^{s_{m}^{\text {out}}\times s_{m}^{\text {in}}}$ for all layers m to yield its inverse when being multiplied with the original matrix. I denotes the identity matrix with $I=(\mathbf {e}_{m}^{1},\ldots ,\mathbf {e}_{m}^{s_{m}^{\text {in}}})$, $\mathbf {e}_{m}^{j}$ denoting the $j$-th column unit vector. Theoretically, $\mathcal {L}_{\text {ort}} = 0$ cannot be perfectly fulfilled because of the information imbalance between inputs and outputs in most deep learning cases [69]. We will first analyze the influence of the loss function $\mathcal {L}_{\text {ort}}$ assuming that it can be fulfilled, before applying the analysis to our full pretraining method.
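As a minimal sketch of Eq. (1), the following NumPy snippet evaluates the orthogonality loss for a list of weight matrices. The function name `orthogonality_loss` and the example matrices are illustrative assumptions, not part of the original method. The second example shows one simple reason why the loss cannot always vanish: for a layer with $s^{\text{out}} < s^{\text{in}}$, $M^{T}M$ has rank at most $s^{\text{out}}$, so the loss is bounded below by $s^{\text{in}} - s^{\text{out}}$.

```python
import numpy as np

def orthogonality_loss(weights):
    """Sum over layers of ||M^T M - I||_F^2, following Eq. (1)."""
    total = 0.0
    for M in weights:
        s_in = M.shape[1]                 # M has shape (s_out, s_in)
        gram = M.T @ M                    # shape (s_in, s_in)
        total += np.sum((gram - np.eye(s_in)) ** 2)
    return total

# Illustrative case 1: a matrix with orthonormal columns gives (numerically) zero loss.
Q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(5, 3)))
loss_orthonormal = orthogonality_loss([Q])

# Illustrative case 2: a "wide" layer (s_out = 2 < s_in = 4) cannot reach zero,
# because M^T M has at least s_in - s_out = 2 zero eigenvalues.
M_wide = np.random.default_rng(1).normal(size=(2, 4))
loss_wide = orthogonality_loss([M_wide])
```

Note that the loss only inspects the weights; no training data enters the computation.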

Minimizing Eq. (1), i.e., $M_{m}^{T} M_{m} - I=\mathbf{0}$ is mathematically equivalent to:

$$\begin{aligned} M_{m}^{T} M_{m}\mathbf {e}_{m}^{j} - \mathbf {e}_{m}^{j}=\mathbf{0}, j=1,2,\ldots ,s_{m}^{\text {in}},m=1,2,\ldots ,n, \end{aligned}$$

(2)

with $\operatorname{rank}(M_{m}^{T}M_{m})=s_{m}^{\text {in}}$, and $\mathbf {e}_{m}^{j}$ as eigenvectors of $M_{m}^{T}M_{m}$ with eigenvalues of 1. This formulation highlights that Eq. (2) does not depend on the training data, and only targets the content of $M_{m}$. In contrast, we will design a constraint that jointly considers data and the trainable weights, allowing us to learn the dominant features of the training dataset directly. We naturally would like to recover all the features of the dataset with a learning task, but finite network capacity makes this infeasible in practice. Instead, we aim for extracting the features that contribute the most in order to achieve a minimum loss value for our designed constraint. As a result, the features that appear most frequently, i.e., dominant features, will be extracted. In this section, we will introduce our constraint and analyze how it guides the weights to learn dominant features from the dataset. Then, we will illustrate how we insert our constraint into training with a reverse pass network.

Inspired by the classical unsupervised pretraining, we reformulate the orthogonality constraint in a data-driven manner to take into account the set of inputs $\mathcal {D}_{m}$ for the current layer (either activation from a previous layer or the training data $\mathcal {D}_{1}$), and instead minimize

$$\begin{aligned} \begin{aligned} \mathcal {L}_{\text {RR}}&= {\sum _{m=1}^{n}\left\| M_{m}^{T} M_{m}\mathbf {d}^{i}_{m} - \mathbf {d}^{i}_{m}\right\| _2^2} \\&{= \sum _{m=1}^{n}\left\| (M_{m}^{T} M_{m} - I) \mathbf {d}^{i}_{m}\right\| _2^2,} \\ \end{aligned} \end{aligned}$$

(3)

where $\mathbf {d}^{i}_{m} \in \mathcal {D}_{m} \subset \mathbb {R}^{s_{m}^{\text {in}}}$. Due to its reversible nature, we will denote our approach with an RR subscript in the following. In contrast to classical autoencoder pretraining, we are minimizing this loss jointly for all layers of a network, and while orthogonality only focuses on $M_m$, our formulation allows for minimizing the loss by extracting the dominant features of the input data.

Let q denote the number of linearly independent entries in $\mathcal {D}_{m}$, i.e., its dimension, and t the size of the training data, i.e., $|\mathcal {D}_{m}|=t$, usually with $q<t$. For every single datum $\mathbf {d}^{i}_{m}, i=1,2,\ldots ,t$, driving Eq. (3) to zero requires

$$\begin{aligned} M_{m}^{T} M_{m} \mathbf {d}^{i}_{m} - \mathbf {d}^{i}_{m}=\mathbf{0},m=1,2,\ldots ,n, \end{aligned}$$

(4)

and hence $\mathbf {d}^{i}_{m}$ are eigenvectors of $M_{m}^{T}M_{m}$ with corresponding eigenvalues being 1. Thus, instead of the generic constraint $M_{m}^{T} M_{m}=I$ that is completely agnostic to the data at hand, the proposed formulation of Eq. (4) is aware of the training data, which improves the generality of the learned representation, as we will demonstrate in detail below.
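The difference between Eq. (1) and Eq. (3) can be illustrated with a small NumPy sketch for a single layer. It assumes a hypothetical toy dataset that lies in a two-dimensional subspace of $\mathbb{R}^{4}$, and a weight matrix whose rows span exactly that subspace; the helper name `rr_loss_layer` is an assumption made for this example. In this setup, the data-driven loss vanishes although $M^{T}M \ne I$.

```python
import numpy as np

def rr_loss_layer(M, data):
    """Sum over data of ||M^T M d - d||_2^2 for one layer; data has shape (t, s_in)."""
    recon = data @ M.T @ M            # row i equals (M^T M d^i)^T
    return np.sum((recon - data) ** 2)

rng = np.random.default_rng(0)

# Hypothetical data: t = 100 samples in a 2-D subspace of R^4 spanned by the columns of W.
W, _ = np.linalg.qr(rng.normal(size=(4, 2)))   # orthonormal basis of the data subspace
coeffs = rng.normal(size=(100, 2))
data = coeffs @ W.T

# A 2 x 4 weight matrix whose rows are the basis vectors: M^T M = W W^T (a projector).
M = W.T

loss_rr = rr_loss_layer(M, data)                     # data-driven loss of Eq. (3)
loss_ort = np.sum((M.T @ M - np.eye(4)) ** 2)        # data-agnostic loss of Eq. (1)
```

Here every datum is an eigenvector of $M^{T}M$ with eigenvalue 1, as in Eq. (4), while the orthogonality loss stays at $s^{\text{in}} - r = 2$.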

The result of applying layer m of a network represents the features extracted by this layer via its weight matrix $M_m$. The singular vectors of the SVD of $M_m$ can be regarded as input filters, and we can thus analyze the result of $M_m$ by focusing on its singular vectors. We employ SVD to identify what features are extracted by the parameters in $M_m$. As, by construction, $\operatorname{rank}(M_{m})=r\leqslant \min (s_{m}^{\text {in}},s_{m}^{\text {out}})$, the SVD of $M_{m}$ yields:

$$\begin{aligned} \begin{aligned}&M_{m}=U_{m} \Sigma _{m} V_{m}^{T}, m=1,2,\ldots ,n,\\&\text {with} \left\{ \begin{matrix} U_{m}=(\mathbf {u}_{m}^{1},\mathbf {u}_{m}^{2},\ldots ,\mathbf {u}_{m}^{r}, \mathbf {u}_{m}^{r+1},\ldots ,\mathbf {u}_{m}^{s_{m}^{\text {out}}})\in \mathbb {R}^{s_{m}^{\text {out}}\times s_{m}^{\text {out}}} , \\ V_{m}=(\mathbf {v}_{m}^{1},\mathbf {v}_{m}^{2},\ldots ,\mathbf {v}_{m}^{r}, \mathbf {v}_{m}^{r+1},\ldots ,\mathbf {v}_{m}^{s_{m}^{\text {in}}})\in \mathbb {R}^{s_{m}^{\text {in}}\times s_{m}^{\text {in}}} , \end{matrix}\right. \end{aligned} \end{aligned}$$

(5)

with left and right singular vectors in $U_{m}$ and $V_{m}$, respectively, and $\Sigma _{m}$ having square roots of the r eigenvalues of $M_{m}^{T} M_{m}$ on its diagonal. $\mathbf {u}_{m}^{k}$ and $\mathbf {v}_{m}^{k} (k=1,\ldots ,r)$ are the eigenvectors of $M_{m}M_{m}^{T}$ and $M_{m}^{T}M_{m}$, respectively [73]. Here, especially the right singular vectors in $V_{m}^{T}$ are important, as they determine which structures of the input are processed by the transformation $M_{m}$. The original orthogonality constraint with Eq. (2) yields r unit vectors $\mathbf {e}_{m}^{j}$ as the eigenvectors of $M_{m}^{T}M_{m}$. Hence, the influence of Eq. (2) on $V_{m}$ is completely independent of training data and learning objectives.
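The relations stated for Eq. (5) can be checked numerically. The sketch below uses `numpy.linalg.svd` on a random matrix (an arbitrary stand-in for a trained $M_m$, chosen only for illustration) and verifies that the squared singular values are the nonzero eigenvalues of $M^{T}M$, that the first r columns of V are the corresponding eigenvectors, and that the remaining columns lie in the null space of M.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 5))                        # s_out = 3, s_in = 5, rank r = 3

# Full SVD: U is (3, 3), S holds the r singular values, Vt is (5, 5).
U, S, Vt = np.linalg.svd(M, full_matrices=True)
V = Vt.T
r = len(S)

# Eigenvalues of M^T M in descending order; the top r equal the squared singular values.
eigvals = np.linalg.eigvalsh(M.T @ M)[::-1]
squares_match = np.allclose(S ** 2, eigvals[:r])

# Right singular vectors are eigenvectors: M^T M v^k = sigma_k^2 v^k for k = 1..r.
eigvec_match = np.allclose(M.T @ M @ V[:, :r], V[:, :r] * S ** 2)

# The remaining s_in - r columns of V span the null space of M.
null_match = np.allclose(M @ V[:, r:], 0.0)
```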

Next, we show that $\mathcal {L}_{\text {RR}}$ facilitates learning dominant features from a given dataset. For this, we consider an arbitrary basis for spanning the space of inputs $\mathcal {D}_{m}$ for layer m. Let $\mathcal {B}_{m}:\left\langle \mathbf {w}_{m}^{1},\ldots ,\mathbf {w}_{m}^{q}\right\rangle$ denote a set of q orthonormal basis vectors obtained via a Gram–Schmidt process, with $t\!\geqslant \!q\!\geqslant \!r$, and $D_{m}$ denoting the matrix of the vectors in $\mathcal {B}_{m}$. As we show in more detail in the Appendix, our constraint from Eq. (4) requires eigenvectors of $M_m^{T}M_m$ to be $\mathbf {w}_{m}^{i}$, with $V_{m}$ containing r orthogonal vectors $(\mathbf {v}_{m}^{1},\mathbf {v}_{m}^{2},\ldots ,\mathbf {v}_{m}^{r})$ from $\mathcal {D}_{m}$ and $(s_{m}^{\text {in}}-r)$ vectors from the null space of $M_{m}$.

We are especially interested in how $M_{m}$ changes w.r.t. input in terms of $D_{m}$, i.e., we express $\mathcal {L}_{\text {RR}}$ in terms of $D_{m}$. By construction, each input $\mathbf {d}^{i}_{m}$ can be represented as a linear combination via a vector of coefficients $\mathbf {c}_{m}^{i}$ that multiplies $D_{m}$ so that $\mathbf {d}^{i}_{m}\!=\!D_{m}\mathbf {c}_{m}^{i}$. Since $M_{m} \mathbf {d}_{m}=U_{m} \Sigma _{m} V_{m}^{T}\mathbf {d}_{m}$, the loss $\mathcal {L}_{\text {RR}}$ of layer m can be rewritten as

$$\begin{aligned} \begin{aligned} \mathcal {L}_{\text {RR}_m}&= {\left\| M_{m}^{T} M_{m}\mathbf {d}_{m} - \mathbf {d}_{m}\right\| _2^2}\\&={\left\| V_{m} \Sigma _{m}^{T}\Sigma _{m} V_{m}^{T}\mathbf {d}_{m}- \mathbf {d}_{m}\right\| _2^2}\\&={\left\| V_{m} \Sigma _{m}^{T}\Sigma _{m} V_{m}^{T}D_{m}\mathbf {c}_{m}- D_{m}\mathbf {c}_{m}\right\| _2^2, m =1,2,\ldots ,n,} \end{aligned} \end{aligned}$$

(6)

where we can assume that the coefficient vector $\mathbf {c}_{m}$ is accumulated over the training dataset size t via $\mathbf {c}_{m}=\sum _{i=1}^{t}\mathbf {c}_{m}^{i}$, since eventually every single datum in $\mathcal {D}_{m}$ will contribute to $\mathcal {L}_{\text {RR}_{m}}$. The central component of Eq. (6) is $V_{m}^{T}D_{m}$. For a successful minimization, $V_{m}$ needs to retain those $\mathbf {w}_{m}^{i}$ with the largest $\mathbf {c}_{m}$ coefficients. As $V_{m}$ is typically severely limited in terms of its representational capabilities by the number of adjustable weights in a network, it needs to focus on the most important eigenvectors in terms of $\mathbf {c}_{m}$ in order to establish a small distance to $D_{m}\mathbf {c}_{m}$. Thus, features that appear most often in the input data, and hence have a large corresponding factor in $\mathbf {c}_{m}$, will more strongly contribute to minimizing $\mathcal {L}_{\text {RR}_m}$. Above, $D_{m}$ is only used implicitly to analyze different approaches, and we do not specify any explicit requirements for $D_{m}$. For a fixed dataset, different orthogonal decompositions via Gram–Schmidt can lead to different orthonormal bases $D_m$. However, these different orthonormal bases can be aligned via rotation, and all span the same vector space. Thus, regardless of the particular orthonormal basis that is used, our method always focuses on extracting dominant features that appear most frequently in the dataset. This means the components of $D_m$ which contribute most to minimizing the loss will be embedded in the neural network. More in-depth discussions are provided in Appendix A.3.
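The argument around Eq. (6) can be made concrete with a small numerical sketch. It rests on simplifying assumptions that are not part of the original analysis: the data are generated from q = 3 orthonormal directions with very different coefficient magnitudes, the layer has rank r = 1, and $M^{T}M$ is taken to be a projector onto a single basis vector (singular value 1). The loss is summed per sample rather than expressed through an accumulated coefficient vector. Under these assumptions, the loss is smallest when the retained direction is the dominant one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: q = 3 orthonormal basis vectors w^1, w^2, w^3 in R^6.
W, _ = np.linalg.qr(rng.normal(size=(6, 3)))      # columns are w^1, w^2, w^3
scales = np.array([5.0, 1.0, 0.1])                # w^1 dominates the data
coeffs = rng.normal(size=(200, 3)) * scales       # row i is the coefficient vector c^i
data = coeffs @ W.T                               # row i is d^i = D c^i

def rr_loss_for_kept_direction(k):
    """Loss if M^T M is the rank-1 projector onto w^k."""
    v = W[:, k:k + 1]                             # V restricted to a single column
    MtM = v @ v.T
    residual = data @ MtM - data                  # rows: ((M^T M - I) d^i)^T
    return np.sum(residual ** 2)

losses = [rr_loss_for_kept_direction(k) for k in range(3)]
best_direction = int(np.argmin(losses))           # index of the direction worth keeping
```

Keeping direction k leaves exactly the squared coefficients of the other directions as residual, so the direction with the largest coefficients is the one a capacity-limited layer benefits most from retaining.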

Comparing our constraint from Eq. (3) with the orthogonal constraint in Eq. (1), we can see that our formulation is actually stricter. As a consequence, our method can retain the advantages of orthogonal constraints while simultaneously embedding dominant features into the weight matrices.

To summarize, $V_{m}$ is driven toward containing r orthogonal vectors $\mathbf{w}_{m}^{i}$ that represent the most frequent features of the input data, i.e., the dominant features. Additionally, due to the column vectors of $V_{m}$ being mutually orthogonal, $M_{m}$ is encouraged to extract different features from the input. Because these features are distinct and representative of the dataset, they have the potential to be useful for new inference tasks. The feature vectors embedded in $M_m$ can be extracted from the network weights in practical settings, as we will demonstrate below.

3.1 Realization in neural networks

Calculating $M_{m}^{T} M_{m} \mathbf {d}^{i}_{m}$ in Eq. (3) directly is usually very expensive due to the dimensionality of $M_{m}$. Instead, we reuse $M_{m} \mathbf {d}^{i}_{m}$ in the forward pass network and build an extra reverse pass network to calculate $M_{m}^{T} M_{m} \mathbf {d}^{i}_{m}$ by reusing parameters from the forward pass network. In the following, we will explain how to constrain the intermediate results of the network to efficiently realize Eq. (3) when training.

Regular training typically starts with a chosen network structure and trains the model weights for a given task via a suitable loss function. Our approach fully retains this setup and adds a second pass that reverses the initial structure while reusing all weights and biases. For instance, for a typical fully connected layer in the forward pass with $\mathbf {d}_{m+1} = M_{m} \mathbf {d}_{m} + \mathbf {b}_{m}$, the reverse pass operation is given by $\mathbf {d}^{'}_{m} = M^{T}_{m} (\mathbf {d}_{m+1}-\mathbf {b}_{m})$, where $\mathbf {d}^{'}_{m}$ denotes the reconstructed input.

Our goal with the reverse pass is to transpose all operations of the forward pass to obtain identical intermediate activations between the layers with matching dimensionality. We can then constrain the intermediate results of each layer of the forward pass to match the results of the backward pass, as illustrated in Fig. 2. While the construction of the reverse pass is straightforward for all standard operations, i.e., fully connected layers, convolutions, pooling, etc., slight adjustments are necessary for nonlinear activation functions (AFs) and batch normalization (BN). It is crucial for our formulation that $\mathbf {d}_{m}$ and $\mathbf {d}^{'}_{m}$ contain the same latent space content in terms of range and dimensionality, such that they can be compared in the loss. Hence, we use the BN parameters and the activation of layer $m-1$ from the forward pass, i.e., $f_{m-1}$ and $BN_{m-1}$, for layer m in the reverse pass.

Unlike greedy layer-wise autoencoder pretraining, which trains each layer separately and only constrains $\mathbf {d}_{1}$ and $\mathbf {d}^{'}_{1}$, we jointly train all layers and constrain all intermediate results. Due to the symmetric structure of the two passes, we can use a simple $L^2$ difference to drive the network toward aligning the results:

$$\begin{aligned} \mathcal {L}_{\text {RR}}={\sum _{m=1}^{n}{\lambda _{m}\left\| \mathbf {d}_{m} - \mathbf {d}^{'}_{m}\right\| }_2^2.} \end{aligned}$$

(7)

Here, $\mathbf {d}_{m}$ denotes the input of layer m in the forward pass and $\mathbf {d}^{'}_{m}$ the output of layer m for the reverse pass. $\lambda _{m}$ denotes a scaling factor for the loss of layer m, which, however, is typically constant in our tests across all layers. Note that with our notation, $\mathbf {d}_{1}$ and $\mathbf {d}_{1}^{'}$ refer to the input data, and the reconstructed input, respectively.

Next, we show how this setup realizes the regularization from Eq. (3). For clarity, we use a fully connected layer with bias. In a neural network with n hidden layers, the forward process for a layer m is given by $\mathbf {d}_{m+1}=M_{m} \mathbf {d}_{m}+\mathbf {b}_{m}$, with $\mathbf {d}_{1}$ and $\mathbf {d}_{n+1}$ denoting input and output, respectively. Networks can be divided according to whether a full reverse pass can be built from the output to the input. Accordingly, we distinguish two implementations of our pretraining: full network pretraining and localized pretraining.

3.1.1 Full network pretraining

For networks where a unique path from output to input exists, we build a reverse pass network with transposed operations starting with the final output where $\mathbf {d}_{n+1}=\mathbf {d}^{'}_{n+1}$, and the intermediate results $\mathbf {d}^{'}_{m+1}$:

$$\begin{aligned} { \mathbf {d}^{'}_{m}=M_{m}^{T}(\mathbf {d}^{'}_{m+1}-\mathbf {b}_{m}) , m=1,2,\ldots ,n,} \end{aligned}$$

(8)

where the reverse pass activation $\mathbf {d}^{'}_{m}$ depends on $\mathbf {d}^{'}_{m+1}$. This formulation yields a full reverse pass from output to input, which we use for most training runs below. Here, we analyze the influence of Eq. (7) during training by assuming $\mathcal {L}_{\text {RR}}=0$ during the minimization. We then obtain activated intermediate content during the reverse pass that reconstructs the values computed in the forward pass, i.e., $\mathbf {d}^{'}_{m+1}=\mathbf {d}_{m+1}$ holds. In this case

$$\begin{aligned} \begin{aligned} \mathbf {d}^{'}_{m}&=M_{m}^{T}(\mathbf {d}^{'}_{m+1}-\mathbf {b}_{m})\\&=M_{m}^{T}(\mathbf {d}_{m+1}-\mathbf {b}_{m}) =M_{m}^{T}M_{m}\mathbf {d}_{m} , m=1,2,\ldots ,n, \end{aligned} \end{aligned}$$

(9)

which means that Eq. (7) is consistent with Eq. (3).
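A minimal NumPy sketch of the full reverse pass of Eq. (8) and the loss of Eq. (7) is given below. It covers only linear fully connected layers with bias, without activation functions or batch normalization, and the function names `forward_pass`, `full_reverse_pass`, and `rr_loss` are assumptions made for illustration. As a sanity check, weights with orthonormal columns ($M^{T}M=I$) reconstruct every intermediate activation, so the loss vanishes, whereas random weights give a positive loss.

```python
import numpy as np

def forward_pass(weights, biases, d1):
    """Return [d_1, ..., d_{n+1}] with d_{m+1} = M_m d_m + b_m (linear layers only)."""
    acts = [d1]
    for M, b in zip(weights, biases):
        acts.append(M @ acts[-1] + b)
    return acts

def full_reverse_pass(weights, biases, d_out):
    """Return [d'_1, ..., d'_{n+1}] via d'_m = M_m^T (d'_{m+1} - b_m), with d'_{n+1} = d_{n+1}."""
    rev = [d_out]
    for M, b in zip(reversed(weights), reversed(biases)):
        rev.append(M.T @ (rev[-1] - b))
    return rev[::-1]

def rr_loss(acts, rev, lambdas):
    """Eq. (7): sum over layer inputs m = 1..n of lambda_m * ||d_m - d'_m||_2^2."""
    return sum(lam * np.sum((d - dr) ** 2)
               for lam, d, dr in zip(lambdas, acts[:-1], rev[:-1]))

rng = np.random.default_rng(0)
biases = [rng.normal(size=4), rng.normal(size=5)]
d1 = rng.normal(size=3)
lambdas = [1.0, 1.0]                               # constant scaling across layers

# Weights with orthonormal columns (shapes 4x3 and 5x4), so M^T M = I for both layers.
orth_weights = [np.linalg.qr(rng.normal(size=(4, 3)))[0],
                np.linalg.qr(rng.normal(size=(5, 4)))[0]]
acts = forward_pass(orth_weights, biases, d1)
loss_orth = rr_loss(acts, full_reverse_pass(orth_weights, biases, acts[-1]), lambdas)

# Random weights of the same shapes do not reconstruct the activations.
rand_weights = [rng.normal(size=(4, 3)), rng.normal(size=(5, 4))]
acts_r = forward_pass(rand_weights, biases, d1)
loss_rand = rr_loss(acts_r, full_reverse_pass(rand_weights, biases, acts_r[-1]), lambdas)
```

The reverse pass reuses the forward weights and biases and introduces no additional trainable parameters.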

3.1.2 Localized pretraining

For architectures whose reverse path is not unique, e.g., in the presence of additive residual connections, the summands b and c in $a=b+c$ cannot be uniquely determined given only a. In such cases, we use a local formulation, in which $\mathbf {d}_{m+1}$ is used directly as input of the reverse path of layer m. Eq. (8) then becomes:

$$\begin{aligned} { \mathbf {d}^{'}_{m}=M_{m}^{T}(\mathbf {d}_{m+1}-\mathbf {b}_{m}) , m=1,2,\ldots ,n,} \end{aligned}$$

(10)

which effectively employs $\mathbf {d}_{m+1}$ for jointly constraining all intermediate activations in the reverse pass. Moreover, it is consistent with Eq. (3).
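The localized variant of Eq. (10) can be sketched in the same simplified setting (linear layers with bias, no activation functions or batch normalization; all variable names are illustrative assumptions). Each layer is reversed from the forward activation $\mathbf{d}_{m+1}$, and the resulting per-layer loss terms coincide with the terms of Eq. (3) for arbitrary weights.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(5, 4))]   # M_1, M_2
biases = [rng.normal(size=4), rng.normal(size=5)]              # b_1, b_2

# Forward pass without activations: d_{m+1} = M_m d_m + b_m.
acts = [rng.normal(size=3)]
for M, b in zip(weights, biases):
    acts.append(M @ acts[-1] + b)

# Localized reverse pass, Eq. (10): layer m is reversed from the forward
# activation d_{m+1}, not from the reconstructed d'_{m+1}.
local_rev = [M.T @ (d_next - b)
             for M, b, d_next in zip(weights, biases, acts[1:])]

# Per-layer terms of Eq. (7) and the corresponding terms of Eq. (3).
local_terms = [np.sum((d - dr) ** 2) for d, dr in zip(acts[:-1], local_rev)]
eq3_terms = [np.sum(((M.T @ M - np.eye(M.shape[1])) @ d) ** 2)
             for M, d in zip(weights, acts[:-1])]
```

Because each term depends only on one layer and its own input, this variant needs no unique path from output to input.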

In summary, Eq. (7) will drive the network toward a state that is as-invertible-as-possible for the given input dataset. Comparing the full network pretraining and localized pretraining, the full network pretraining establishes a stronger relationship among the loss terms of different layers, and allows earlier layers to decrease the accumulated loss of later layers. Localized pretraining, on the other hand, is even valid for cases where the reverse path from output to input is not unique.

Up to now, the discussion focused on simplified neural networks with convolutional operations, which are crucial for feature extraction, but without AFs or extensions such as BN, which are applied to increase model nonlinearity. While we leave a more detailed theoretical analysis of these extensions for future work, we apply these nonlinear extensions for all of our tests in Sects. 4 and 5. Thus, our experiments demonstrate that our method works in conjunction with BN and AFs. They consistently show that the inherent properties of our pretraining remain valid: even in the nonlinear setting our approach successfully extracts dominant structures and yields improved generalization.

In the Appendix, we give details on how to ensure that the latent space content for forward and reverse pass is aligned such that differences can be minimized, and we give practical examples of full and localized pretraining architectures.

To summarize, we realize the loss formulation of Eq. (7) to minimize $\sum _{m=1}^{n}\left\| (M_{m}^{T} M_{m} - I) \mathbf {d}_{m}\right\| _2^2$ without explicitly having to construct $M_{m}^{T} M_{m}$. Following the notation above, we will refer to networks trained with the added reverse structure and the additional loss terms as RR variants. We consider two variants for the reverse pass: a local pretraining Eq. (10) using the datum $\mathbf {d}_{m+1}$ of a given layer, and a full version via Eq. (8) which uses $\mathbf {d}^{'}_{m+1}$ incoming from the next layer during the reverse pass. It is worth pointing out that our constraint is only used during the pretraining stage, and pretrained models are used as a starting point for the fine-tuning stage, where the search space for parameters is the same as in standard training, i.e., training is not constrained by our approach.

3.2 Embedding singular values

In the following, we evaluate networks trained with different methodologies. We distinguish our pretraining approach $\text {RR}$ (in green), regular autoencoder pretraining $\text {Pre}$ (in gray), and orthogonality constraints $\text {Ort}$ (in blue). In addition, $\text {Std}$ denotes a regular training run (also shown in the graphs below), i.e., models trained without autoencoder pretraining, orthogonality regularization, or our proposed method. Besides, a subscript will denote the task variant the model was trained for, such as $\text {Std}_{\text {T}}$ for task T. While we typically use all layers of a network in the constraints, a reduced variant that we compare to below only applies the constraint for the input data, i.e., $m=1$. A network trained with this variant, denoted by $\text {RR}_\text {A}^{1}$, is effectively trained to only reconstruct the input. It contains no constraints for the inner activations and layers of the network. For the $\text {Ort}$ models, we use the Spectral Restricted Isometry Property algorithm [3].

We verify that the column vectors of $V_{m}$ of models from RR training contain the dominant features of the input with the help of a classification test, employing a single fully connected layer, i.e., $\mathbf {d}_{2}=M_{1} \mathbf {d}_{1}$, with BN and activation. To quantify the similarity between these vectors and the inputs, we compute a Learned Perceptual Image Patch Similarity (LPIPS) [86] between $\mathbf {v}_{m}^{i}$ and the training data (lower values being better).

We employ a training dataset constructed from two dominant classes (a peak in the top-left and bottom-right quadrant, respectively), augmented with noise in the form of random scribbles, as shown in Fig. 3. Based on the analysis above, we expect the RR training to extract the two dominant peaks during training. The LPIPS measurements confirm our SVD argumentation above, with average scores of $0.217 \pm 0.022$ for $\text {RR}$, $0.319 \pm 0.114$ for $\text {Pre}$, $0.495 \pm 0.006$ for $\text {Ort}$, and $0.500 \pm 0.002$ for $\text {Std}$, i.e., the $\text {RR}$ model fares significantly better than the others. At the same time, the peaks are clearly visible for RR models, while the other models fail to extract structures that resemble the input. Thus, by training with the full network and the original training objective, our pretraining yields structures that are interpretable and can be inspected by humans.

The results above experimentally confirm our formulation of the RR loss and its ability to extract dominant and generalizing structures from the training data. In addition, they give the first indication that this still holds when nonlinear components such as AFs are present. Next, we will focus on quantified metrics and turn to measurements in terms of mutual information to illustrate the behavior of our pretraining for deeper networks.

4 Evaluation in terms of mutual information

Mutual information (MI) measures the dependence of two random variables, i.e., higher MI means that more information is shared between the two variables. As our approach hinges on the introduction of the reverse pass, we will show that it succeeds in establishing MI between the input and the constrained intermediates inside a network. More formally, the MI I(X; Y) of random variables X and Y measures how different the joint distribution of X and Y is w.r.t. the product of their marginal distributions, i.e., the Kullback–Leibler divergence $I(X;Y) = D_{KL}[P_{(X,Y)}||P_X P_Y]$. The MI plane was proposed in [69] to analyze trained models. It shows the MI between the input X and the activations of a layer $\mathcal {D}_{m}$, i.e., $I(X;\mathcal {D}_{m})$, together with the MI of layer $\mathcal {D}_{m}$ with the output Y, i.e., $I(\mathcal {D}_{m};Y)$. These two quantities indicate how much information about the input and output distributions is retained at each layer, and we use them to show to what extent our pretraining succeeds at incorporating information about the inputs throughout training.
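For discrete variables, the KL-divergence definition above can be evaluated directly from a joint probability table. The following is a minimal sketch; the function name `mutual_information` and the two example tables are illustrative assumptions, and MI is reported in bits.

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) in bits for a discrete joint distribution given as a 2D table."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()            # normalize to a probability table
    px = joint.sum(axis=1, keepdims=True)  # marginal P_X as a column vector
    py = joint.sum(axis=0, keepdims=True)  # marginal P_Y as a row vector
    product = px * py                      # product of marginals P_X P_Y
    mask = joint > 0                       # terms with zero probability contribute 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / product[mask])))

# Two independent fair bits share no information.
mi_independent = mutual_information([[0.25, 0.25], [0.25, 0.25]])
# Two identical fair bits share exactly one bit.
mi_identical = mutual_information([[0.5, 0.0], [0.0, 0.5]])
```

The two examples mark the extremes for a pair of fair binary variables: zero MI when the joint distribution equals the product of the marginals, and one bit when one variable fully determines the other.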

The following tests employ networks with six fully connected layers and nonlinear AFs, with the objective to learn the mapping from 12 binary inputs to 2 binary output digits [64], with results accumulated over five runs. Experimental details are given in the Appendix. We compare the versions $\text {Std}_{\text {A}}$, $\text {Pre}_{\text {A}}$, $\text {Ort}_{\text {A}}$, and $\text {RR}_{\text {A}}$. We visualize model comparisons with MI planes, the content of which is visually summarized in Fig. 4a. The horizontal/vertical axis of the MI plane denotes $I(X; \mathcal {D}_{m})/I(Y; \mathcal {D}_{m})$, which measures the amount of information shared between the $m^{th}$ layer $\mathcal {D}_{m}$ and X/Y after training. This depicts how much information about the input and output distributions is retained at each layer, as well as how these relationships change throughout the network. For regular training, the information bottleneck principle [69] states that early layers contain more information about the input, i.e., show high values for $I(X;\mathcal {D}_{m})$ and $I(\mathcal {D}_{m};Y)$. As a result, these layers are frequently visible in the top-right corner of MI plane visualizations. After training, later layers typically share a large amount of information with the output, i.e., show high $I(\mathcal {D}_{m};Y)$ values, and correlate less with the input (low $I(X;\mathcal {D}_{m})$). As a result, they typically appear in the top-left corner of MI plane graphs.
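The coordinates of one layer in an MI plane can be sketched as follows. This is an illustration under stated assumptions rather than the estimator used for Fig. 4: it discretizes activations into equal-width bins (a common choice in the information bottleneck literature), and the bin count, the random `tanh` layer, and the parity-based target `y` are hypothetical stand-ins for a trained network and the actual task.

```python
import numpy as np

def entropy_bits(rows):
    """Empirical entropy in bits of the rows of a 2D array of discrete symbols."""
    counts = np.unique(rows, axis=0, return_counts=True)[1]
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def discrete_mi(a, b):
    """I(A;B) = H(A) + H(B) - H(A,B) from aligned samples of discrete symbols."""
    a = np.asarray(a).reshape(len(a), -1)
    b = np.asarray(b).reshape(len(b), -1)
    return entropy_bits(a) + entropy_bits(b) - entropy_bits(np.hstack([a, b]))

def mi_plane_point(x, activations, y, n_bins=30):
    """Return (I(X;D_m), I(D_m;Y)) for one layer, using binned activations."""
    edges = np.linspace(activations.min(), activations.max(), n_bins + 1)
    binned = np.digitize(activations, edges[1:-1])  # bin index per activation
    return discrete_mi(x, binned), discrete_mi(binned, y)

# All 4096 inputs with 12 binary digits.
x = (np.arange(2 ** 12)[:, None] >> np.arange(12)) & 1
# Hypothetical target with 2 binary digits (parities of the two input halves).
y = np.stack([x[:, :6].sum(axis=1) % 2, x[:, 6:].sum(axis=1) % 2], axis=1)
# Stand-in for the activations of one layer of a trained network.
rng = np.random.default_rng(0)
layer = np.tanh(x @ rng.normal(size=(12, 8)))

i_xd, i_dy = mi_plane_point(x, layer, y)
```

With uniformly distributed inputs, $I(X;\mathcal {D}_{m})$ is bounded by 12 bits and $I(\mathcal {D}_{m};Y)$ by 2 bits in this toy setup, so plotted values are often normalized by these bounds. Binned estimates depend noticeably on the bin count, which is one reason MI is considered hard to estimate.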

The graph in Fig. 4b highlights that training with the RR loss ($\text {RR}_{\text {A}}$) correlates input and output distributions across all layers: the cluster of green points in the center of the graph indicates that the activations of every layer share a balanced amount of MI with both the input and the output. $\text {Std}_{\text {A}}$ and $\text {Ort}_{\text {A}}$ almost exclusively focus on the output, with $I(\mathcal {D}_{m};Y)$ being close to one, and information about the input is dropped layer by layer, which leads to a low $I(X; \mathcal {D}_{7})$ value. $\text {Pre}_{\text {A}}$ instead only focuses on reconstructing inputs. Thus, the early layers cluster in the upper-right corner, while the last layer fails to align with the outputs in terms of $I(\mathcal {D}_{7};Y)$. Once we continue to fine-tune these models without regularization, the MI naturally shifts toward the output, as illustrated in Fig. 4c. Here, $\text {RR}_{\text {AA}}$ outperforms the other models in terms of the final performance. Furthermore, we design a transfer task B with switched output digits, which means that in task B the original two binary output digits, e.g., (1, 0), are switched to (0, 1). This change of the dataset results in significantly different mapping relationships between inputs and outputs compared with the original task A. Likewise, $\text {RR}_{\text {AB}}$ performs best for transfer task B, as shown in Fig. 4d. The final performance for both tasks across all runs is summarized in Table 1. The graphs demonstrate that the proposed pretraining succeeds in robustly establishing mutual information between inputs and targets across a full network while extracting reusable features. The nonlinearity of the underlying network architectures does not impede the performance of the $\text {RR}$ models.

It is worth pointing out that $\text {Std}$ and $\text {Ort}$ exhibit high performance variance in transfer task B, but not in base task A, because $\text {Std}_{\text {A}}$ and $\text {Ort}_{\text {A}}$ were trained solely to improve task A performance. The features extracted in this process are not guaranteed to be useful for task B. As a result, performance in task B is not consistent across training runs. On the other hand, $\text {RR}_{\text {A}}$ focuses on extracting dominant features from the dataset rather than features for a specific task, which significantly improves the stability of training across different runs for tasks A and B.

Comparing Fig. 4b and d, we can see that after pretraining via our approach, the activations of each layer share a balanced amount of MI with both the input and the output, indicating that our model extracted balanced features from both. After transfer learning for task B, all layers are located in the top part of the graph with high $I(\mathcal {D}_{m};Y)$ values, indicating that the model now aims to improve the performance for a specific task.

We also compare the mutual information of three variants of our pretraining: the local variant $\text {lRR}_\text {A}$, the full version $\text {RR}_{\text {A}}$, and a variant of the latter, $\text {RR}_\text {A}^{1}$, i.e., a version where only the input $\mathbf {d}_{1}$ is constrained to be reconstructed. Figure 5 shows the MI planes for these three models. Only one layer is constrained with our formulation in $\text {RR}_\text {A}^{1}$, but we can see that the last two layers of the model are already located in the middle part of the MI plane (Fig. 5a), and the influence is in line with our full version $\text {RR}_{\text {A}}$. Despite the local nature of $\text {lRR}_\text {A}$, it manages to establish MI for the majority of the layers, as indicated by the cluster of layers in the center of the MI plane. Only the first layer moves toward the upper-right corner, and the second layer is affected slightly. In other words, these layers exhibit a stronger relationship with the distribution of the outputs. Despite this, the overall performance when fine-tuning or for the task transfer remains largely unaffected, e.g., $\text {lRR}_\text {AA/AB}$ still clearly outperforms $\text {RR}^{1}_{\text {AA/AB}}$. This confirms our choice to use the full pretraining when network connectivity permits, and to employ the local version in all other cases. Accuracy comparisons among the different models are given in Table 1. $\text {RR}_{\text {AA/AB}}$ yields the highest performance, while $\text {lRR}_\text {AA/AB}$ performs similarly to $\text {RR}_{\text {AA/AB}}$.

In summary, from the MI tests we can conclude that training with our formulation ($\text {RR}_{\text {A}}$ and $\text {lRR}_\text {A}$) is useful for correlating input and output distributions across all layers. Furthermore, this correlation is strengthened when more layers are constrained with our formulation, as the comparison of $\text {RR}_{\text {A}}$ with $\text {RR}^{1}_{\text {A}}$ shows. In addition, models pretrained with our formulation, e.g., $\text {RR}_{\text {A}}$ and $\text {lRR}_\text {A}$, achieve the highest values of $I(\mathcal {D}_{7};Y)$ and the highest performance for source task A and transfer task B after fine-tuning.

MI has received attention recently as a learning objective, e.g., in the form of the InfoGAN approach [9] for learning disentangled and interpretable latent representations. While MI is typically challenging to assess and estimate [74], the results above show that our approach provides a straightforward and robust way of including it as a learning objective. In this way, we can, for example, easily reproduce the disentangling results from [9] without explicitly calculating mutual information, as shown in Fig. 1c. A generative model with our pretraining extracts intuitive latent dimensions for the different digits, line thickness, and orientation without any additional modifications to the loss function. The joint training of the full network with the proposed reverse structure, including nonlinearities and normalization, yields a natural and intuitive decomposition.

Table 1 Performance of MI source and transfer tasks in Figs. 4 and 5
