1 Introduction

Learning representations that capture general high-level information from abundant unlabeled sensory data remains a challenge for unsupervised representation learning. Research in neuroscience suggests that a major difference between state-of-the-art deep learning architectures and the human brain is that cells in the brain do not react to single stimuli, but instead extract invariant features from sequences of fast-changing sensory input signals (Bengio & Bergstra, 2009). Evidence found in the hierarchical organization of simple and complex vision cells shows that time-invariance is the principle by which the cortex extracts the underlying generative factors of these sequences and that these factors usually change more slowly than the observed signal (Wiskott & Sejnowski, 2002; Berkes & Wiskott, 2005; Bengio & Bergstra, 2009).

Computational neuroscientists have named this paradigm the slowness principle: individual measurements of a signal may vary quickly, but the underlying generative features vary slowly. For example, individual pixel values in a video change rapidly during short periods of time, but the scene itself usually changes on a slower time scale. This principle has found application in Slow Feature Analysis (SFA) as proposed in Wiskott and Sejnowski (2002). SFA has shown promising results in computational neuroscience, but little research has explored the possible applications of the underlying slowness principle to state-of-the-art unsupervised representation learning methods used in deep learning. Previous research has employed temporal contrastive L1 and L2 losses in end-to-end tasks (Mobahi et al., 2009; Sermanet et al., 2018; Zou et al., 2012) such as classification and view-point invariant robot imitation learning. Only recently has an unsupervised representation learning method, the SlowVAE (Klindt et al., 2020), leveraged the observed statistics of natural transitions in observation space to extend the VAE objective with a sparse temporal prior. The SlowVAE slowness prior has been evaluated on a variety of disentanglement metrics, but not with respect to downstream task data efficiency.

In this paper, we put existing methods for slow representation learning using Variational Autoencoders (VAEs) (Kingma & Welling, 2013) into a shared context and compare them from a theoretical and empirical point of view. We show that different priors used to enforce slowness can be included as a general slowness regularization term to the evidence lower bound (ELBO) of the VAE objective. Additionally, we propose a new slowness regularization term based on a Brownian motion prior for latent space evolution which is used in our method, the S-VAE. We empirically compare the \(\beta\)-VAE, the SlowVAE, S-VAE and L1/L2 slowness-based VAEs with respect to their performance and data efficiency on downstream regression tasks such as odometry estimation and behavioral cloning.

Furthermore, we investigate quantitative measures for the quality of latent representations and find that the Fréchet Inception Distance proposed in Heusel et al. (2017) correlates with the downstream task performance. Being able to predict the downstream task performance without the need for ground truth labels greatly accelerates the hyperparameter search during the unsupervised pre-training of VAE models and helps identify good models without the need to perform the downstream task.

2 Related work

2.1 Slowness principle

The slowness principle is based on the assumption that the true generative factors of a signal vary on slower time scales than raw sensory signals. Research in computational neuroscience suggests that cell structures in the visual cortex have emerged based on the underlying principle of extracting slowly varying features from the environment (Berkes & Wiskott, 2005). Leveraging this principle, we can extract higher-level invariant scene information, which usually changes more slowly than, for example, the individual pixel values of a video.

The most well-known application of the slowness principle is the slow feature analysis method (SFA) introduced in Wiskott and Sejnowski (2002). SFA is an unsupervised learning algorithm designed to extract linearly decorrelated features by expanding and transforming the input signal such that it can be optimized for finding the most slowly varying features from an input signal (Wiskott & Sejnowski, 2002). Extending the SFA method to nonlinear features has shown that the learned features share many characteristics with those of complex cells in the V1 cortex (Berkes & Wiskott, 2005). Further applications of the slowness principle include transformation invariant object detection (Franzius et al., 2011), pre-training of neural networks for improved performance on the MNIST dataset (Bengio & Bergstra, 2009) and the self-organization of grid cells, structures in the rodent brain used for navigation (Franzius et al., 2007a, b).

2.2 Contrastive learning and the slowness principle

The objective of contrastive learning is to learn to embed data using a metric score to express (dis-)similarity of data points. Contrastive learning has been successfully applied in reinforcement learning (Laskin et al., 2020) and recently for object classification in SimCLR (Chen et al., 2020a, b). These methods use a contrastive loss on augmented versions of the same observation, effectively learning transformation invariant features from images, and show that these representations benefit reinforcement learning and image classification tasks.

When using time as the contrastive metric we talk about time-contrastive learning. Time-contrastive learning has been applied successfully to learning view-point-invariant representations for learning from demonstration with a robot (Sermanet et al., 2018). Similar to our work, Mobahi et al. used the coherence in video material to train a Convolutional Neural Network (CNN) for a variety of specific tasks (Mobahi et al., 2009). While training two CNNs in parallel with shared parameters, in alternating fashion a labeled pair of images was used to perform a gradient update minimizing training loss followed by selecting two unlabeled images from a large video dataset to minimize a time-contrastive loss based on the L1 norm of the representations at each layer. The experiments showed that supervised tasks can benefit from the additional pseudo-supervisory signal and that features invariant to pose, illumination or clutter can be learned. Compared to the above methods where a specific task is learned end-to-end with temporal similarity as an additional supervisory signal, we use the slowness principle as a bias on model and data to learn task-agnostic representations that facilitate data efficient downstream task learning.

In contrast, the GP-VAE proposed by Fortuin et al. (2020) is a model for learning temporal dynamics for problems such as reconstructing missing input features, especially in a medical context. The core idea of the GP-VAE is to learn a latent embedding of high dimensional sequential data and to model latent dynamics using a Gaussian Process prior. The authors claim this prior facilitates representations that are smoother and allow the reconstruction of missing features in the input space. Compared to the GP-VAE, the methods presented in this paper address the problem of learning good representations for downstream tasks, instead of learning temporal dynamics of the signal. Another noteworthy method for learning temporal dynamics is the temporal difference variational autoencoder (TD-VAE) (Gregor et al., 2019) which learns representations that encode an uncertain belief state from which multiple possible future scenarios can be rolled out.

2.3 Disentanglement

Another concept related to unsupervised representation learning, especially when talking about the \(\beta\)-VAE, is disentanglement. Although there is no clear definition of disentanglement yet, most works (Bengio et al., 2013; Locatello et al., 2019a; Klindt et al., 2020) agree on the common notion that disentangled representations should approximate the ground truth generative factors of the observed data while the dimensions should be largely independent of each other. Ideally, each disentangled factor represents one ground truth factor that led to the generation of the observation data. Disentanglement in the context of the \(\beta\)-VAE has been discussed more in-depth by Burgess et al. (2018). In the \(\beta\)-VAE pressure on the latent bottleneck of the autoencoder limits how much information can be transmitted per sample while at the same time trying to maximize the data log likelihood. This is done by enforcing a unit Gaussian prior on the latent distributions which results in the embedding of data points on a set of representational axes where nearby points on the axes are also close in data space. This regularization results in these axes being the main contributors to improvements in the data log-likelihood and therefore often coincide with the ground truth generative factors.

Some of the claimed benefits of disentangled representations are better downstream task data efficiency and interpretability (Schölkopf et al., 2012; Bengio et al., 2013; Peters et al., 2017). However, Locatello et al. (2019a, b) show that various disentanglement methods are not able to generate disentangled representations without implicit biases on model and data and that more disentanglement does not necessarily lead to better downstream task data efficiency. In a later work by Locatello et al. (2020), the aforementioned challenges were addressed and the authors showed that with weak supervision it is possible to learn fair and generalizable representations.

2.4 Fréchet inception distance

The Fréchet Inception Distance (FID), introduced by Heusel et al. (2017), is a measure of the generative capabilities of deep generative models. It measures how similar the images generated by GANs are to images from the real data distribution. The FID improves on the Inception Score (IS) introduced by Salimans et al. (2016), which only evaluates the distribution of the generated images without comparing it to the true data distribution and has been shown to fail when comparing models (Barratt & Sharma, 2018). The FID is computed by comparing the activation distributions of an Inception-v3 neural network pre-trained on the ImageNet dataset for the generated and true data. Both activation distributions are modeled as Gaussians, and the FID is the Wasserstein-2 distance between the real data distribution \((\mu , C)\), with mean \(\mu\) and covariance C, and the generated data distribution \((\mu _\text {pred}, C_\text {pred})\).
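For reference, when both activation distributions are modeled as Gaussians, the squared Wasserstein-2 (Fréchet) distance has a well-known closed form. The expression below is the standard one and uses the symbols defined above; \(\operatorname {Tr}\) denotes the matrix trace and \((\cdot )^{1/2}\) the matrix square root.

$$\begin{aligned} \text {FID} = \Vert \mu - \mu _\text {pred}\Vert _2^2 + \operatorname {Tr}\left( C + C_\text {pred} - 2\,(C\, C_\text {pred})^{1/2}\right) \end{aligned}$$

The first term penalizes a shift in the mean activations, and the second term penalizes a mismatch in their covariance structure. The distance is zero when both Gaussians coincide.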

3 Autoencoding slow representations

Variational Autoencoders (VAEs) (Kingma & Welling, 2013) are a popular tool for dimensionality reduction and representation learning. Let \(q_\phi\) be the variational approximate posterior distribution obtained from a VAE’s encoder network with parameters \(\phi\) and \({\varvec{z}}\) be a latent vector such that \({\varvec{z}} \sim q_\phi ({\varvec{z}} \mid {\varvec{o}})\) where \({\varvec{o}}\) is an observation. The decoder network denoted by \(p_\theta ({\varvec{o}}\mid {\varvec{z}})\) is parameterized by \(\theta\). Since directly maximizing the log probability of the data is computationally intractable, a lower bound \({\mathcal {L}}\) is used for optimization:

$$\begin{aligned} \max _{\phi ,\theta } \log p_\theta ({\varvec{o}}) \ge {\mathcal {L}} = {\mathcal {L}}_\text {rec} - {\mathcal {L}}_\text {b} \end{aligned}$$
(1)

where \({\mathcal {L}}_\text {rec} = {\mathbb {E}}_{q_\phi ({\varvec{z}}\mid {\varvec{o}})}[\log p_\theta ({\varvec{o}}\mid {\varvec{z}})]\) is the reconstruction quality, measured by comparing the true observation and the decoded observation, and \({\mathcal {L}}_\text {b} = D_{KL}(q_\phi ({\varvec{z}}\mid {\varvec{o}})\Vert p({\varvec{z}}))\) imposes a unit Gaussian prior \(p({\varvec{z}})\) on the representations in the bottleneck of the VAE. To make the sampling process differentiable (and thus trainable using gradient-based optimization), the variational distribution is usually reparameterized as a Gaussian \(q_\phi = {\mathcal {N}}({\varvec{\mu }}, {\varvec{\sigma }})\).
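The two ingredients described above can be illustrated with a minimal sketch: the closed-form KL divergence between a diagonal Gaussian posterior and the unit Gaussian prior, and a reparameterized sample. The sketch assumes the posterior is given as plain Python lists of per-dimension means and standard deviations; the function names are hypothetical and are not taken from the original implementation.

```python
import math
import random


def kl_to_unit_gaussian(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(
        s * s + m * m - 1.0 - math.log(s * s) for m, s in zip(mu, sigma)
    )


def reparameterized_sample(mu, sigma, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1).

    The randomness lives only in eps, so z is a deterministic,
    differentiable function of mu and sigma.
    """
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
```

The KL term vanishes exactly when the posterior equals the prior (zero mean, unit variance) and grows as the mean moves away from zero or the variance departs from one.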

Higgins et al. (2017) proposed the \(\beta\)-VAE, which adds a parameter \(\beta\) to scale the weight of the pressure on the bottleneck to allow a trade-off between disentanglement of the latent factors and reconstruction quality.

Unlike in the \(\beta\)-VAE, where training data is assumed to be i.i.d., slow representation learning methods assume sequential data. The core idea of slow representation learning is to use this property as a weak supervision signal to extract better high-level representations. The slowness constraint is incorporated in the VAE optimization target as

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_\text {rec} \text { subject to }&{\mathcal {L}}_\text {b}< \epsilon _1, \\&{\mathcal {L}}_\text {slow} < \epsilon _2 \end{aligned} \end{aligned}$$
(2)

Rewriting Eq. (2) under the Karush–Kuhn–Tucker conditions (Kuhn & Tucker, 1951; Karush, 1939) results in

$$\begin{aligned} {\mathcal {F}} = {\mathcal {L}}_\text {rec} - \beta ({\mathcal {L}}_\text {b}-\epsilon _1) - \gamma ({\mathcal {L}}_\text {slow} - \epsilon _2) \end{aligned}$$
(3)

With the simplification of \(\beta , \epsilon _1, \gamma , \epsilon _2 \ge 0\), we can derive the lower bound

$$\begin{aligned} {\mathcal {F}} \ge {\mathcal {L}} = {\mathcal {L}}_\text {rec} - \beta {\mathcal {L}}_\text {b} - \gamma {\mathcal {L}}_\text {slow}, \end{aligned}$$
(4)

with \(\beta\) being the \(\beta\)-VAE parameter and \(\gamma\) being a parameter to control the weight of the general slowness regularization term \({\mathcal {L}}_\text {slow}\). In the following, we present various slowness regularization terms used in existing works and a new slowness regularization term used in our method, the S-VAE.
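The step from Eq. (3) to Eq. (4) can be made explicit by expanding the constraint terms. Because \(\beta , \epsilon _1, \gamma , \epsilon _2 \ge 0\), the constant \(\beta \epsilon _1 + \gamma \epsilon _2\) is non-negative and can be dropped to obtain a lower bound:

$$\begin{aligned} {\mathcal {F}}&= {\mathcal {L}}_\text {rec} - \beta {\mathcal {L}}_\text {b} - \gamma {\mathcal {L}}_\text {slow} + \beta \epsilon _1 + \gamma \epsilon _2 \\&\ge {\mathcal {L}}_\text {rec} - \beta {\mathcal {L}}_\text {b} - \gamma {\mathcal {L}}_\text {slow} = {\mathcal {L}}. \end{aligned}$$

Since the dropped constant does not depend on \(\phi\) or \(\theta\), maximizing \({\mathcal {L}}\) and maximizing \({\mathcal {F}}\) yield the same parameters for fixed \(\beta\) and \(\gamma\).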

3.1 Lp-norm slowness

A straightforward way to describe \({\mathcal {L}}_\text {slow}\) is to compute the \(L_p\)-norm of the difference between the means \({\varvec{\mu }}_i\) and \({\varvec{\mu }}_j\) of two encoded latent distributions from two distinct yet sequential observations \({\mathbf {o}}_i,{\mathbf {o}}_j \in {\mathbf {D}} \mid j>i\) as

$$\begin{aligned} {\mathcal {L}}_{L_p}({\varvec{o}}_j, {\varvec{o}}_i) = ({\varvec{\mu }}_j-{\varvec{\mu }}_i)^p \end{aligned}$$
(5)

where \(j>i\). Following Eq. (4), the full formulation of an \(L_p\)-slow VAE ELBO is therefore

$$\begin{aligned}&{\mathcal {L}}(\phi , \theta , \beta , \gamma , {\varvec{o}}_j, {\varvec{o}}_i) \nonumber \\&\quad ={\mathbb {E}}_{q_\phi ({\varvec{z}}_j,{\varvec{z}}_i\mid {\varvec{o}}_j, {\varvec{o}}_i)}[\log p_\theta ({\varvec{o}}_j, {\varvec{o}}_i\mid {\varvec{z}}_j, {\varvec{z}}_i)] \nonumber \\&\qquad -\, \beta D_{KL}(q_\phi ({\varvec{z}}_i\mid {\varvec{o}}_i)\Vert p({\varvec{z}}_i))\nonumber \\&\qquad -\, \gamma ({\varvec{\mu }}_j-{\varvec{\mu }}_i)^p, \end{aligned}$$
(6)

where \(\phi\) and \(\theta\) parameterize encoder and decoder. The hyperparameters \(\beta\) and \(\gamma\) are weights for the strength of the disentanglement and slowness regularization.
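A minimal sketch of the \(L_p\) slowness term in Eq. (5) is given below. It assumes that the power is applied element-wise to the absolute difference of the means and then summed over latent dimensions, which is one common reading of the notation; the function name `lp_slowness` is hypothetical.

```python
def lp_slowness(mu_i, mu_j, p=2):
    """Sum over latent dimensions of |mu_j - mu_i|^p.

    Assumption: the absolute value is taken so that odd p (for example
    the L1 case) penalizes deviations in both directions.
    """
    return sum(abs(b - a) ** p for a, b in zip(mu_i, mu_j))


# Usage: the same pair of latent means under the L1 and L2 variants.
l1_penalty = lp_slowness([0.0, 0.0], [3.0, -4.0], p=1)
l2_penalty = lp_slowness([0.0, 0.0], [3.0, -4.0], p=2)
```

Note that this term only uses the means of the two latent distributions; the predicted variances do not enter the penalty.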

3.2 SlowVAE

The SlowVAE by Klindt et al. (2020) is based on a study of the statistics of natural transitions of object and object mask properties in two established video-object segmentation datasets. The authors base the SlowVAE prior on experimental evidence that the transitions of ground truth factors in such datasets can be approximated by generalized Laplace distributions, indicating that the temporal transitions are sparse. In other words, only a few of the ground truth generative factors of a signal change in one transition. This assumption is expressed in a Laplacian prior on latent transitions, encouraging axis alignment. The SlowVAE slowness loss term is defined as

$$\begin{aligned} {\mathcal {L}}_\text {SlowVAE}({\varvec{o}}_{i+1}, {\varvec{o}}_{i}) = {\mathbb {E}}_{q_\phi ({\varvec{z}}_{i}\mid {\varvec{o}}_{i})}\left[ D_{KL}(q_\phi ({\varvec{z}}_{i+1}\mid {\varvec{o}}_{i+1}) \Vert p({\varvec{z}}_{i+1} \mid {\varvec{z}}_{i}))\right] , \end{aligned}$$
(7)

where \(p({\varvec{z}}_{i+1} \mid {\varvec{z}}_{i})\) is the Laplacian prior on the transition.

Consequently, the SlowVAE ELBO is defined as

$$\begin{aligned} {\mathcal {L}}(\phi , \theta , \beta , \gamma , {\varvec{o}}_{i+1}, {\varvec{o}}_{i})&= {\mathbb {E}}_{q_\phi ({\varvec{z}}_{i+1},{\varvec{z}}_{i} \mid {\varvec{o}}_{i+1}, {\varvec{o}}_{i})}\left[ \log p_\theta ({\varvec{o}}_{i+1}, {\varvec{o}}_{i}\mid {\varvec{z}}_{i+1}, {\varvec{z}}_{i})\right] \nonumber \\&\quad -\, \beta D_{KL}(q_\phi ({\varvec{z}}_{i}\mid {\varvec{o}}_{i})\Vert p({\varvec{z}}_{i}))\nonumber \\&\quad - \,\gamma {\mathbb {E}}_{q_\phi ({\varvec{z}}_{i}\mid {\varvec{o}}_{i})}\left[ D_{KL}(q_\phi ({\varvec{z}}_{i+1}\mid {\varvec{o}}_{i+1}) \Vert p({\varvec{z}}_{i+1} \mid {\varvec{z}}_{i}))\right] . \end{aligned}$$
(8)
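To make the slowness term in Eqs. (7) and (8) more concrete, the following sketch evaluates the KL divergence between a one-dimensional Gaussian and a Laplace distribution in closed form, and uses it for a single-sample estimate of the outer expectation. It is an illustration under stated assumptions, not the SlowVAE reference implementation: the prior is an ordinary Laplace distribution (shape \(\alpha = 1\)) centered at the previous latent sample, the posterior is diagonal, and the `scale` parameter and both function names are hypothetical.

```python
import math


def kl_gaussian_laplace(mu, sigma, loc, scale):
    """KL( N(mu, sigma^2) || Laplace(loc, scale) ) for one latent dimension."""
    d = mu - loc
    # E|x - loc| under the Gaussian (mean of a folded normal distribution).
    mean_abs = (
        sigma * math.sqrt(2.0 / math.pi) * math.exp(-d * d / (2.0 * sigma * sigma))
        + d * math.erf(d / (sigma * math.sqrt(2.0)))
    )
    # Negative differential entropy of the Gaussian.
    neg_entropy = -0.5 * math.log(2.0 * math.pi * math.e * sigma * sigma)
    return neg_entropy + math.log(2.0 * scale) + mean_abs / scale


def slowvae_term(mu_next, sigma_next, z_prev, scale=1.0):
    """Single-sample estimate of the slowness term.

    z_prev is assumed to be one sample from q(z_i | o_i); averaging over
    several samples would approximate the outer expectation more closely.
    """
    return sum(
        kl_gaussian_laplace(m, s, z, scale)
        for m, s, z in zip(mu_next, sigma_next, z_prev)
    )
```

Because the Laplace log-density is linear in \(|z_{i+1} - z_i|\), the penalty grows roughly linearly with the size of a transition in each dimension, which favors transitions that change only a few latent dimensions.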

3.3 S-VAE

We propose the S-VAE as an alternative point of view on slow representation learning. The key idea is to directly incorporate the slowness principle, namely that “underlying generative factors change on a slower time scale” (Berkes & Wiskott, 2005). Thus the S-VAE enforces that observations close in time have similar latent representations and reduces the strength of this assumption with growing temporal separation. The temporal similarity in the S-VAE is expressed as an additive stochastic process with stationary uncertainty, which is known to approach Brownian motion in the limit. Considering two distinct yet sequential observations \({\mathbf {o}}_i,{\mathbf {o}}_j \in D \mid j>i\), the difference of the corresponding latent representations is given by the approximate difference distribution,

$$\begin{aligned} q_\phi ({\varvec{z}}_j-{\varvec{z}}_i\mid {\varvec{o}}_j,{\varvec{o}}_i) = {\mathcal {N}}({\varvec{\mu }}_j-{\varvec{\mu }}_i, {\varvec{\Sigma }}_j + {\varvec{\Sigma }}_i) \equiv q_\phi (\Delta {\varvec{z}}\mid {\varvec{o}}_j,{\varvec{o}}_i). \end{aligned}$$
(9)

We impose a prior \(p(\Delta {\varvec{z}})\) on the approximate difference distribution in Eq. (9). For two states \({\varvec{z}}_j, {\varvec{z}}_i\) of a Brownian motion, the increment between them defines the prior distribution

$$\begin{aligned} {\varvec{z}}_{j} - {\varvec{z}}_{i} = \sqrt{\Delta t} \cdot N \sim {\mathcal {N}}(0, \Delta t \lambda {\varvec{I}}) \equiv p(\Delta {\varvec{z}}), \end{aligned}$$
(10)

where \(N \sim {\mathcal {N}}(0, \lambda {\varvec{I}})\), \(\Delta t = j-i\), and \(\lambda\) is a parameter that scales the variance of the prior distribution.

The S-VAE loss term is computed as the Kullback–Leibler (KL) divergence between the approximate difference distribution in Eq. (9) and the prior distribution in Eq. (10) as

$$\begin{aligned} {\mathcal {L}}_\text {slow}({\varvec{o}}_j, {\varvec{o}}_i) = D_{KL}(q_\phi (\Delta {\varvec{z}}\mid {\varvec{o}}_j,{\varvec{o}}_i)\Vert p(\Delta {\varvec{z}})), \end{aligned}$$
(11)

resulting in the S-VAE ELBO

$$\begin{aligned}&{\mathcal {L}}(\phi , \theta , \beta , \gamma , {\varvec{o}}_j, {\varvec{o}}_i) \nonumber \\&\quad = {\mathbb {E}}_{q_\phi ({\varvec{z}}_j,{\varvec{z}}_i\mid {\varvec{o}}_j, {\varvec{o}}_i)}[\log p_\theta ({\varvec{o}}_j, {\varvec{o}}_i\mid {\varvec{z}}_j, {\varvec{z}}_i)] \nonumber \\&\qquad - \,\beta D_{KL}(q_\phi ({\varvec{z}}_i \mid {\varvec{o}}_i)\Vert p({\varvec{z}}_i))\nonumber \\&\qquad - \,\gamma D_{KL}(q_\phi (\Delta {\varvec{z}}\mid {\varvec{o}}_j,{\varvec{o}}_i)\Vert p(\Delta {\varvec{z}})). \end{aligned}$$
(12)

Following from Eq. (10), the Brownian motion prior \(p(\Delta {\varvec{z}})\) in the S-VAE ELBO explicitly takes the temporal separation of consecutive observations into account and relaxes the prior when temporal separation grows. A more detailed derivation can be found in Appendix A.
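The slowness term of Eq. (11) has a closed form because both distributions are Gaussian. The sketch below evaluates it under the assumption of diagonal covariances, passed as per-dimension variances; the function name `svae_slowness` and the default value of `lam` (standing in for \(\lambda\)) are hypothetical and not taken from the original implementation.

```python
import math


def svae_slowness(mu_i, var_i, mu_j, var_j, dt, lam=1.0):
    """KL( N(mu_j - mu_i, var_i + var_j) || N(0, dt * lam * I) ).

    Assumes diagonal covariances, so the KL divergence decomposes into a
    sum over latent dimensions.
    """
    prior_var = dt * lam  # variance of the Brownian motion prior, Eq. (10)
    kl = 0.0
    for m_i, v_i, m_j, v_j in zip(mu_i, var_i, mu_j, var_j):
        diff_var = v_i + v_j  # variance of the difference distribution, Eq. (9)
        diff_mu = m_j - m_i   # mean of the difference distribution, Eq. (9)
        kl += 0.5 * (
            diff_var / prior_var
            + diff_mu ** 2 / prior_var
            - 1.0
            + math.log(prior_var / diff_var)
        )
    return kl
```

For a fixed difference in the means, a larger `dt` widens the prior and reduces the contribution of the squared mean difference, which is how the temporal separation relaxes the constraint.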

3.4 Discussion

Using the \(L_p\)-norm to enforce temporal similarity has been explored by Mobahi et al. (2009), where an L1 norm was used to enforce similarity between two halves of a siamese CNN architecture to leverage temporal similarity of observations during training. Other applications of an \(L_p\)-norm to enforce temporal similarity can be found in Zou et al. (2012) and Sermanet et al. (2018), where an L1 norm and a triplet loss (based on the L2 norm), respectively, were used to achieve viewpoint-invariance by enforcing similarity of representations from different angles according to temporal similarity. Cadieu and Olshausen (2012) claim that there are no significant differences between the L1 and L2 norm for enforcing temporal similarity in latent space. The main difference between this existing research and the representation learning methods presented in this work is that the existing methods do not operate in a variational setting and are instead learned end-to-end.

Fig. 1 Conceptual difference between the S-VAE (ours) and the SlowVAE. The S-VAE expresses slowness through a prior \(p(\Delta z)\) on the similarity of \(z_j\) and \(z_i\), which is relaxed when \(\Delta t = j-i\) grows. The SlowVAE imposes a sparsity prior on \(z_{t+1}\), assuming that latent transitions are sparse.

In contrast, the S-VAE and SlowVAE both use the variance of the predicted latent distributions, albeit in different ways depending on the applied prior. Figure 1 shows the conceptual differences between the S-VAE and SlowVAE for a pair of observations. The SlowVAE Laplace prior expresses the assumption that transitions in latent space are sparse, or in other words, are axis aligned with the ground truth generative axes. This bears similarity to the definition of disentanglement and can be understood as disentangling latent transitions. The S-VAE does not make such an assumption on the nature of the transitions. Instead, it is based on the assumption that observations close in time must also be similar in latent space. Increasing temporal distance in observation space is taken into account by the increasing uncertainty of the Brownian motion prior. This allows the S-VAE to benefit from the closed-form solution of the Brownian motion and to handle pairs of observations from any point in the temporal sequence as opposed to the SlowVAE, which has been evaluated on consecutive observations.

In Klindt et al. (2020), the authors investigated the performance of the SlowVAE when varying the temporal distance between observations up to 1 second when training. Analysis of the natural transitions showed that with increasing temporal distance between frames, the estimated kurtosis parameter \(\alpha\) of the fitted Laplace distribution increased, effectively moving closer to Gaussianity (\(\alpha =2\)). This is in line with the central limit theorem stating that the sum of i.i.d. random variables approaches the normal distribution when the number of terms increases.

Based on these differences, we aim to study the following hypotheses experimentally. Existing literature suggests that, compared to the \(\beta\)-VAE, \(L_1\) and \(L_2\) slowness regularization is beneficial for a variety of downstream tasks. The S-VAE and SlowVAE are expected to perform better than the \(L_p\)-norm slowness regularization as they also consider the variance of the latent distributions. Furthermore, we hypothesize that the S-VAE outperforms the SlowVAE with a kurtosis of \(\alpha = 1\) when the temporal separation between observations grows.

4 Empirical comparison

In this section, we compare the previously introduced slow representation learning methods to the baseline \(\beta\)-VAE with respect to their data efficiency when learning downstream regression tasks. We analyze the influence of the slowness hyperparameter \(\gamma\) on the latent representations by visualizing the latent spaces.

Although the use case is different, we compared the TD-VAE (Gregor et al., 2019) to the slow methods and the \(\beta\)-VAE and found that it is outperformed by all methods. We did not include those results as we were not confident in their thoroughness. The problem is that, to our knowledge, there is no official code repository for the TD-VAE, and we could not reproduce the original results on the more complex DeepMind Lab experiment with the given implementation instructions.

4.1 Experimental setting

Three experiments have been conducted in which the goal was to learn downstream tasks in a semi-supervised way from video data. The semi-supervised process consists of two steps.

First, a VAE model is trained on abundant unlabeled video data in an unsupervised way. The VAE models use the encoder component to encode two observations \(o_i\) and \(o_j\) from the input video sequence such that the slowness loss terms can be computed for training. Following the ablation study in Klindt et al. (2020), which indicates that there is a sweet spot for the temporal separation of consecutive observations, we also vary \(\Delta t\) during training to take into account a wider variety of temporal separation.
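Sampling observation pairs with varying temporal separation could look like the sketch below. It assumes that \(\Delta t\) is drawn uniformly from 1 to a maximum value, which is an illustrative choice; the sampling distribution, the parameter `max_dt`, and the function name are hypothetical and not specified in the text above.

```python
import random


def sample_pair_indices(sequence_length, max_dt, rng=random):
    """Return (i, j, dt) with 0 <= i < j < sequence_length and j - i = dt."""
    if sequence_length < 2:
        raise ValueError("need at least two observations")
    if max_dt < 1:
        raise ValueError("max_dt must be at least 1")
    # Assumption: dt is uniform on {1, ..., max_dt}, capped by the sequence length.
    dt = rng.randint(1, min(max_dt, sequence_length - 1))
    i = rng.randint(0, sequence_length - 1 - dt)
    return i, i + dt, dt
```

The returned `dt` is the value that enters the Brownian motion prior of the S-VAE, while the other slowness terms only use the two indexed observations.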

Second, a downstream task is learned using embeddings from the pre-trained VAE model while keeping the encoder network weights frozen. The two encoded latent representations \(z_i\) and \(z_j\) are concatenated and used to learn a downstream task end-to-end. The downstream tasks involve temporal tasks like velocity and odometry estimation or behavioral cloning.

Figure 2 shows random observations from the training dataset of all three experiment domains and their reconstructions generated using the S-VAE. The domains are a synthetic dataset of a ball bouncing in a 2D world, two reinforcement learning agents playing the game of Pong and a human randomly moving an agent in a 3D world in the DeepMind Lab environment (Beattie et al., 2016). The experiment setup and neural network architecture are described in more detail in Appendix C.

Fig. 2 Random images from the training data (top) and their reconstructions (bottom) obtained from the S-VAE.

4.2 Evaluation of downstream task performance

After the VAE training step, the encoder network is frozen and the downstream tasks are learned. For each downstream task, multiple models with varying amounts of labeled data available (from sparse to abundant) are trained. The downstream task performance is measured by computing the loss on a previously unseen labeled test set.

Fig. 3 Results of the downstream data-efficiency experiment. MSE/BCE loss between true and predicted label in the downstream task vs. the amount of labeled data used to train the model.

Figure 3 shows a plot of the average downstream task performance for each method (baseline \(\beta\)-VAE, \(L_1\), \(L_2\), S-VAE and SlowVAE) against the amount of available labeled data. Mean and standard deviation are computed over multiple runs with different seeds and hyperparameter configurations (\(\gamma\) and \(\beta\)).

Figures 3a and 3c show the Ball and Pong experiments in the case where labeled data is sparse. In those cases, the S-VAE and SlowVAE outperform the \(L_1\) and \(L_2\) slowness regularization terms. The \(\beta\)-VAE without temporal regularization is outperformed by all methods in the Ball and Pong experiments. The S-VAE and SlowVAE match the performance of the baseline \(\beta\)-VAE with up to an order of magnitude less labeled data, and they reach significantly better performance. In the DeepMind Lab experiment, the difference is less pronounced in the sparse data case, due to the complexity of estimating 6-DOF odometry from a 3D world with fewer than 1500 labeled examples. However, Fig. 3d shows that the S-VAE outperforms the other methods when more labeled data is available. Furthermore, in the DeepMind Lab experiment, the \(L_1\) and \(L_2\) methods yield no significant improvement over the \(\beta\)-VAE. We theorize that in this more complex task, taking into account the covariances in the slowness regularization term is important to learn good representations.

To summarize, slow methods generally yield better downstream task performance. Taking into account the variance of the latent distributions when applying slowness (S-VAE and SlowVAE) improves performance further.

Fig. 4 Visualization of the 4D latent space of the Ball experiment for 3 hyperparameter configurations. Each dimension is shown with an individual scatter plot in which the x- and y-axis are the ground truth position of the ball in the environment. The color is the value of the latent representation at the given position. Slowness regularization increases from \(\gamma = 0\) (\(\beta\)-VAE) in a to \(\gamma =5.0\) in c.

4.3 Slow latent spaces

Next, we investigate how the slowness hyperparameter \(\gamma\) influences the latent representations by visualizing the latent spaces in the Ball experiment. Figure 4 shows a scatter plot where the x- and y-axis describe the ground truth position of the ball in the arena and the color represents the latent value. Figure 4a shows four plots, one for each latent dimension of a \(\beta\)-VAE. Figure 4b, c show the same for variations of the S-VAE with the same parameter \(\beta\), but increasingly higher slowness regularization \(\gamma\). We can see that increasing the strength of the slowness regularization increases the continuity of the latent space. For the \(\beta\)-VAE we observe multiple discontinuities that separate regions with similar latent values, whereas the S-VAE model with increasing slowness regularization exhibits visibly smooth representations.

5 Predicting downstream task performance

While a qualitative analysis showed that increasing slowness regularization makes the latent space smoother, this analysis is not generally suited to measure the quality of a learned embedding space or to predict downstream task performance. In the Ball experiment, we can visualize the latent space using the ground truth position of the ball, which is difficult for high-dimensional latent spaces or complex downstream tasks with higher-dimensional generative factors or observations.

In this section, we want to investigate further how one can predict downstream task performance without the need for labeled data or human intervention. Let us consider the predictive capabilities of each of the three components of the general formulation in Eq. (4): disentanglement, slowness and reconstruction performance.

The extent to which disentanglement can predict downstream task performance has been investigated by Locatello et al. (2019a). In their work, the authors questioned the benefits disentangled representations have for learning downstream tasks and criticized that currently, all disentanglement metrics require ground truth labels. Furthermore, disentanglement measures are supervised methods relying on abundant labeled data and are usually tailored for specific tasks. Thus we do not consider existing disentanglement metrics to predict downstream task performance.

The second option is to measure whether a latent space exhibits the slowness property and to correlate this measure with downstream task performance. To this end, we experimented with label-agnostic metrics that measure slowness by how smooth the latent space is. We measured the length of a trajectory in latent space defined by N encoded latent representations \(z_1, \ldots , z_N\) and compared it to the Euclidean distance between \(z_1\) and \(z_N\). According to the slowness principle and the qualitative analysis in Fig. 4, we hypothesize that a model with higher slowness regularization and a qualitatively smoother latent space has fewer jumps, so that the trajectory in latent space is shorter and more similar to the Euclidean distance. We observed that, as expected, higher slowness regularization leads to less fragmented and shorter latent trajectories. However, this metric does not correlate with downstream task performance.
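One way to express the trajectory comparison described above is as a ratio between the summed step lengths and the end-to-end Euclidean distance. The sketch below is an assumption about how such a metric could be computed; the text does not specify how the two quantities were compared, and the function name `path_length_ratio` is hypothetical.

```python
import math


def path_length_ratio(latents):
    """Trajectory length divided by the distance between first and last point.

    A value of 1.0 corresponds to a straight path; larger values indicate a
    more fragmented trajectory with more jumps.
    """
    steps = sum(math.dist(a, b) for a, b in zip(latents, latents[1:]))
    direct = math.dist(latents[0], latents[-1])
    # A closed or degenerate trajectory has no meaningful ratio.
    return steps / direct if direct > 0 else float("inf")
```

The metric needs no labels, only a sequence of encoded observations in their temporal order.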

As the remaining option, we show that the generative capabilities of a model allow us to predict downstream task performance. The core idea is that a model capable of decoding “realistic” images from random samples drawn from the latent distributions encodes more useful information in the latent space. To measure this, we use the FID, which is commonly used to evaluate the generative capabilities of Generative Adversarial Networks (GANs). In practice, we used the python-fid package with standard parameters. The code can be found on GitHub. In this implementation, the FID is computed using the 2048-dimensional features extracted from the pool3 layer of the pre-trained Inception network.

Fig. 5 Scatter plot visualization of FID plotted against downstream task performance. Each point represents one model colored by method and with different hyperparameters \(\gamma\) and \(\beta\)

Figure 5 shows a scatter plot of the FID for all combinations of \(\gamma\) and \(\beta\) plotted against downstream task performance.

We observe a correlation between low FID and low downstream task loss for the S-VAE and SlowVAE in all three experiments (see Footnote 1). In the Pong experiment, one can clearly identify outliers by their high FID. Upon further inspection, the SlowVAE failed to reconstruct the observations in those hyperparameter configurations, which led to weak downstream task performance. These models usually had high values for \(\gamma\) and \(\beta\), indicating that the applied regularization was too strong. In the Ball experiment, the \(L_1\) and \(L_2\) methods did not exhibit a correlation. We hypothesize that the hyperparameter search in this experiment could have been even broader, exploring stronger regularization to find better models. This is further supported by the fact that the \(L_1\) and \(L_2\) models in the Ball experiment could reconstruct the ball with minimal reconstruction errors.

In conclusion, the generative capabilities of VAE models can be used to predict their downstream task performance. Using the FID as an indicator for downstream task performance allows a more targeted exploration of hyperparameter ranges. We think that the FID should be further explored as a tool to express the experiment-specific and general properties of generative models when learning downstream tasks.

6 Empirical comparison of slowness regularization

As discussed in Sect. 3.4, the priors in the SlowVAE and S-VAE interpret slowness differently. The SlowVAE looks at slowness from a transition perspective, assuming that transitions are sparse. While transitions with small \(\Delta t\) have been shown to be sparse, results in Klindt et al. (2020) indicate that the kurtosis of the Laplacian fit to the transitions increases with temporal separation. The S-VAE, on the other hand, does not put a prior on the transition but instead enforces similarity based on temporal distance. Furthermore, the S-VAE explicitly incorporates \(\Delta t\) in the training process through the Brownian motion prior [see Eq. (10)]. We therefore hypothesize that training an embedding with the Brownian motion slowness prior yields better performance in downstream tasks where observations are further apart in time.

To investigate this hypothesis, we used the best-performing models for each method presented in Fig. 3 and trained a latent dynamics model to predict future latent representations in a sequence. The first two observations of a sequence are encoded, and a random observation \(\Delta t\) steps ahead in the sequence is drawn for the dynamics model to predict. The performance of the dynamics model is expressed as the mean squared error between the predicted and true future latent representation. Figure 6 shows the average latent dynamics error for predictions of varying \(\Delta t\). In all environments, the S-VAE generally has a lower error than the SlowVAE and the other methods. Interestingly, the \(\beta\)-VAE performs on par with the \(L_2\) slowness regularization. Both methods perform as well as or better than the SlowVAE for short-term predictions in the Ball and DeepMind Lab latent dynamics experiments. This is in line with the findings by Klindt et al. (2020), showing that the SlowVAE has a sweet spot for the temporal separation, roughly when \(\Delta t > 0.4\) seconds. The Brownian motion prior in the S-VAE yields better performance across all values of \(\Delta t\) explored in this experiment.
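The error measure described above can be sketched as follows. This is a minimal illustration and not the evaluation code used in the experiments: the function names are hypothetical, and the normalization (per-dimension standardization using statistics of the true latents) is an assumption, since the text only states that latent vectors are normalized to account for different scales across methods.

```python
import statistics


def standardize(vectors, reference):
    """Standardize each latent dimension with mean and std of `reference`.

    Assumption: per-dimension standardization; the exact normalization
    scheme is not specified in the text.
    """
    dims = list(zip(*reference))
    means = [statistics.fmean(d) for d in dims]
    # Fall back to 1.0 for constant dimensions to avoid division by zero.
    stds = [statistics.pstdev(d) or 1.0 for d in dims]
    return [[(x - m) / s for x, m, s in zip(v, means, stds)] for v in vectors]


def latent_dynamics_error(predicted, true):
    """Mean squared error between normalized predicted and true latents."""
    true_n = standardize(true, true)
    pred_n = standardize(predicted, true)
    squared = [
        (p - t) ** 2
        for pred_vec, true_vec in zip(pred_n, true_n)
        for p, t in zip(pred_vec, true_vec)
    ]
    return statistics.fmean(squared)


# Toy latents (illustrative values only): two vectors with two dimensions.
true_latents = [[0.0, 0.0], [2.0, 4.0]]
predicted_latents = [[1.0, 0.0], [2.0, 4.0]]
error = latent_dynamics_error(predicted_latents, true_latents)
```

Normalizing with the statistics of the true latents makes the error invariant to a global rescaling of a method's latent space, which is what allows errors to be compared across methods.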

Fig. 6 Latent dynamics error vs. sequence length for the best performing model of each method. The latent dynamics error is computed as the mean squared error between true and predicted latent vector. The latent vectors are normalized to account for the different scales of latent representations across the methods

7 Conclusion

In this paper, we discuss the application of the slowness principle as an extension of the state-of-the-art \(\beta\)-VAE. We compare existing methods of slowness regularization, namely the \(L_1\) and \(L_2\) losses and the SlowVAE, a variation of the \(\beta\)-VAE that imposes a Laplacian prior on the latent transitions. We also propose a new slowness regularization term based on a Brownian motion prior. We find that slow methods outperform the baseline \(\beta\)-VAE with respect to downstream task data efficiency. Furthermore, the results indicate that the S-VAE and SlowVAE perform similarly to each other and better than the \(\beta\)-VAE and the \(L_p\)-norm-based slowness regularization terms with respect to data efficiency in downstream tasks. When learning a latent dynamics model to predict latent representations multiple steps ahead in time, the S-VAE exhibits superior performance due to its ability to adapt its Brownian motion prior to the temporal separation of observations. Lastly, we find that the Fréchet Inception Distance is a helpful measure to predict downstream task performance.