1 Introduction

Data often contain sequential structure, providing a rich signal for learning models of the world. Such models are useful for representing sequences (Li and Mandt 2018; Ha and Schmidhuber 2018) and planning actions (Hafner et al. 2019; Chua et al. 2018). Recent advances in deep learning have facilitated learning sequential probabilistic models directly from high-dimensional data (Graves 2013), like audio and video. A variety of techniques have emerged for learning deep sequential models, including memory units (Hochreiter and Schmidhuber 1997) and stochastic latent variables (Chung et al. 2015; Bayer and Osendorfer 2014). These techniques have enabled sequential models to capture increasingly complex dynamics. In this paper, we explore the complementary direction, asking: can we simplify the dynamics of the data to meet the capacity of the model? To do so, we aim to learn a frame of reference to assist in modeling the data.

Frames of reference are an important consideration in sequence modeling, as they can simplify dynamics by removing redundancy. For instance, in a physical system, the frame of reference that moves with the system’s center of mass removes the redundancy in displacement. Frames of reference are also more widely applicable to arbitrary sequences. Indeed, video compression schemes use predictions as a frame of reference to remove temporal redundancy (Oliver 1952; Agustsson et al. 2020; Yang et al. 2021). By learning and applying a similar type of temporal normalization for sequence modeling, the model can focus on aspects that are not predicted by the low-level frame of reference, thereby simplifying dynamics modeling.

We formalize this notion of temporal normalization through the framework of autoregressive normalizing flows (Kingma et al. 2016; Papamakarios et al. 2017). In the context of sequences, these flows form predictions across time, attempting to remove temporal dependencies (Srinivasan et al. 1982). Thus, autoregressive flows can act as a pre-processing technique to simplify dynamics. We preview this approach in Fig. 1, where an autoregressive flow modeling the data (top) creates a transformed space for modeling dynamics (bottom). The transformed space is largely invariant to absolute pixel value, focusing instead on capturing deviations and motion.

We empirically demonstrate this modeling technique, both with standalone autoregressive normalizing flows and within sequential latent variable models. While normalizing flows have been applied in sequential contexts previously, our main contributions are (1) showing how these models can act as a general pre-processing technique to improve dynamics modeling, and (2) empirically demonstrating log-likelihood and generalization improvements on three benchmark video datasets and on time series data from the UCI Machine Learning Repository. This technique also connects to previous work in dynamics modeling, probabilistic models, and sequence compression, suggesting directions for further investigation.

Fig. 1

Sequence modeling with autoregressive flows. Top: Pixel values (solid) for a particular pixel location in a video sequence. An autoregressive flow models the pixel sequence using an affine shift (dashed) and scale (shaded), acting as a frame of reference. Middle: Frames of the data sequence (top) and the resulting “noise” (bottom) from applying the shift and scale. The redundant, static background has been largely removed. Bottom: The noise values (solid) are modeled using a base distribution (dashed and shaded) provided by a higher-level model. By removing temporal redundancy from the data sequence, the autoregressive flow simplifies dynamics modeling

2 Background

2.1 Autoregressive models

Consider modeling discrete sequences of observations, \({\mathbf {x}}_{1:T} \sim p_{{\text {data}}} ({\mathbf {x}}_{1:T})\), using a probabilistic model, \(p_\theta ({\mathbf {x}}_{1:T})\), with parameters \(\theta \). Autoregressive models (Frey et al. 1996; Bengio and Bengio 2000) use the chain rule of probability to express the joint distribution over time steps as the product of T conditional distributions. These models are often formulated in forward temporal order:

$$\begin{aligned} p_\theta ({\mathbf {x}}_{1:T}) = \prod _{t=1}^T p_\theta ({\mathbf {x}}_t | {\mathbf {x}}_{<t}). \end{aligned}$$
(1)

Each conditional distribution, \(p_\theta ({\mathbf {x}}_t | {\mathbf {x}}_{<t})\), models the dependence between time steps. For continuous variables, it is often assumed that each distribution takes a simple form, such as a diagonal Gaussian: \(p_\theta ({\mathbf {x}}_t | {\mathbf {x}}_{<t}) = {\mathcal {N}} ({\mathbf {x}}_t ; \varvec{\mu }_\theta ({\mathbf {x}}_{<t}), {{\,\mathrm{diag}\,}}(\varvec{\sigma }^2_\theta ({\mathbf {x}}_{<t}))),\) where \(\varvec{\mu }_\theta (\cdot )\) and \(\varvec{\sigma }_\theta (\cdot )\) are functions denoting the mean and standard deviation. These functions may take past observations as input through a recurrent network or a convolutional window (van den Oord et al. 2016a). When applied to spatial data (van den Oord et al. 2016b), autoregressive models excel at capturing local dependencies. However, due to their restrictive forms, such models often struggle to capture more complex structure.
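
As a minimal sketch of evaluating Eq. 1 (assuming a diagonal Gaussian conditional, with hypothetical copy-previous-observation placeholders standing in for the learned mean and scale networks):

```python
import numpy as np

def gaussian_log_prob(x, mu, sigma):
    """Diagonal Gaussian log density, summed over dimensions."""
    return np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2)

def ar_log_likelihood(x_seq, mu_fn, sigma_fn):
    """log p(x_{1:T}) = sum_t log N(x_t; mu(x_{<t}), diag(sigma^2(x_{<t}))) (Eq. 1)."""
    return sum(gaussian_log_prob(x_seq[t], mu_fn(x_seq[:t]), sigma_fn(x_seq[:t]))
               for t in range(len(x_seq)))

# Hypothetical placeholder predictors: copy the previous observation, fixed scale.
mu_fn = lambda past: past[-1] if len(past) else np.zeros(3)
sigma_fn = lambda past: 0.5 * np.ones(3)

x_seq = np.random.randn(10, 3)  # T = 10 steps, 3 dimensions
print(ar_log_likelihood(x_seq, mu_fn, sigma_fn))
```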

2.2 Autoregressive (sequential) latent variable models

Autoregressive models can be improved by incorporating latent variables (Murphy 2012), often represented as a corresponding sequence, \({\mathbf {z}}_{1:T}\). The joint distribution, \(p_\theta ({\mathbf {x}}_{1:T} , {\mathbf {z}}_{1:T})\), has the form:

$$\begin{aligned} p_\theta ({\mathbf {x}}_{1:T} , {\mathbf {z}}_{1:T}) = \prod _{t=1}^T p_\theta ({\mathbf {x}}_t | {\mathbf {x}}_{<t}, {\mathbf {z}}_{\le t}) p_\theta ({\mathbf {z}}_t | {\mathbf {x}}_{<t} , {\mathbf {z}}_{<t}). \end{aligned}$$
(2)

Unlike the Gaussian form, evaluating \(p_\theta ({\mathbf {x}}_t | {\mathbf {x}}_{<t})\) now requires integrating over the latent variables,

$$\begin{aligned} p_\theta ({\mathbf {x}}_t | {\mathbf {x}}_{<t}) = \int p_\theta ({\mathbf {x}}_t | {\mathbf {x}}_{<t}, {\mathbf {z}}_{\le t}) p_\theta ({\mathbf {z}}_{\le t} | {\mathbf {x}}_{<t}) d {\mathbf {z}}_{\le t}, \end{aligned}$$
(3)

yielding a more flexible distribution. However, performing this integration in practice is typically intractable, requiring approximate inference techniques, like variational inference (Jordan et al. 1998), or invertible models (Kumar et al. 2020). Recent works have parameterized these models with deep neural networks, e.g. (Chung et al. 2015; Gan et al. 2015; Fraccaro et al. 2016; Karl et al. 2017), using amortized variational inference (Kingma and Welling 2014; Rezende et al. 2014). Typically, the conditional likelihood, \(p_\theta ({\mathbf {x}}_t | {\mathbf {x}}_{<t}, {\mathbf {z}}_{\le t})\), and the prior, \(p_\theta ({\mathbf {z}}_t | {\mathbf {x}}_{<t} , {\mathbf {z}}_{<t})\), are Gaussian densities, with temporal conditioning handled through recurrent networks. Such models have demonstrated success in audio (Chung et al. 2015; Fraccaro et al. 2016) and video modeling (Xue et al. 2016; Gemici et al. 2017; Denton and Fergus 2018; He et al. 2018; Li and Mandt 2018). However, as noted by Kumar et al. (2020), such models can be difficult to train with standard log-likelihood objectives, often struggling to capture dynamics.
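
To make the generative process in Eq. 2 concrete, the following sketch performs ancestral sampling, alternately drawing \({\mathbf {z}}_t\) from the prior and \({\mathbf {x}}_t\) from the conditional likelihood; the linear-Gaussian parameter functions here are hypothetical placeholders for the recurrent networks used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
T, x_dim, z_dim = 10, 3, 2

def prior_params(x_past, z_past):      # parameters of p(z_t | x_{<t}, z_{<t})
    mu = 0.9 * z_past[-1] if len(z_past) else np.zeros(z_dim)
    return mu, 0.1 * np.ones(z_dim)

def likelihood_params(x_past, z_all):  # parameters of p(x_t | x_{<t}, z_{<=t})
    prev = x_past[-1] if len(x_past) else np.zeros(x_dim)
    return prev + z_all[-1].sum(), 0.05 * np.ones(x_dim)

# Ancestral sampling from the joint distribution in Eq. 2.
x_seq, z_seq = [], []
for t in range(T):
    mu_z, sig_z = prior_params(x_seq, z_seq)
    z_seq.append(mu_z + sig_z * rng.standard_normal(z_dim))
    mu_x, sig_x = likelihood_params(x_seq, z_seq)
    x_seq.append(mu_x + sig_x * rng.standard_normal(x_dim))
x_seq, z_seq = np.stack(x_seq), np.stack(z_seq)
```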

Fig. 2

Affine autoregressive transform. Computational diagram for an affine autoregressive transform (Papamakarios et al. 2017). Each \({\mathbf {y}}_t\) is an affine transform of \({\mathbf {x}}_t\), with the affine parameters potentially non-linear functions of \({\mathbf {x}}_{<t}\). The inverse transform, shown here, is capable of converting a correlated input, \({\mathbf {x}}_{1:T}\), into an uncorrelated output, \({\mathbf {y}}_{1:T}\)

2.3 Autoregressive flows

Our approach is based on affine autoregressive normalizing flows (Kingma et al. 2016; Papamakarios et al. 2017). Here, we continue with the perspective of temporal sequences; however, these flows were initially developed and demonstrated in static settings. Kingma et al. (2016) noted that sampling from an autoregressive Gaussian model is an invertible transform, resulting in a normalizing flow (Rippel and Adams 2013; Dinh et al. 2015, 2017; Rezende and Mohamed 2015). Flow-based models transform simple base probability distributions into more complex ones while maintaining exact likelihood evaluation. To see their connection to autoregressive models, we can express sampling a Gaussian random variable using the reparameterization trick (Kingma and Welling 2014; Rezende et al. 2014):

$$\begin{aligned} {\mathbf {x}}_t = \varvec{\mu }_\theta ({\mathbf {x}}_{<t}) + \varvec{\sigma }_\theta ({\mathbf {x}}_{<t}) \odot {\mathbf {y}}_t, \end{aligned}$$
(4)

where \({\mathbf {y}}_t \sim {\mathcal {N}} ({\mathbf {y}}_t; {\mathbf {0}}, {\mathbf {I}})\) is an auxiliary random variable and \(\odot \) denotes element-wise multiplication. Thus, \({\mathbf {x}}_t\) is an invertible transform of \({\mathbf {y}}_t\), with the inverse given as

$$\begin{aligned} {\mathbf {y}}_t = \frac{{\mathbf {x}}_t - \varvec{\mu }_\theta ({\mathbf {x}}_{<t})}{\varvec{\sigma }_\theta ({\mathbf {x}}_{<t})}, \end{aligned}$$
(5)

where division is element-wise. The inverse transform in Eq. 5, shown in Fig. 2, normalizes (hence, normalizing flow) \({\mathbf {x}}_{1:T}\), removing statistical dependencies. Given the functional mapping between \({\mathbf {y}}_t\) and \({\mathbf {x}}_t\) in Eq. 4, the change of variables formula converts between probabilities in each space:

$$\begin{aligned} \log p_\theta ({\mathbf {x}}_{1:T}) = \log p_\theta ({\mathbf {y}}_{1:T}) - \log \left| \det \left( \frac{\partial {\mathbf {x}}_{1:T}}{\partial {\mathbf {y}}_{1:T}} \right) \right| . \end{aligned}$$
(6)

By the construction of Eqs. 4 and 5, the Jacobian in Eq. 6 is triangular, enabling efficient evaluation as the product of diagonal terms:

$$\begin{aligned} \log \left| \det \left( \frac{\partial {\mathbf {x}}_{1:T}}{\partial {\mathbf {y}}_{1:T}} \right) \right| = \sum _{t=1}^T \sum _i \log \sigma _{\theta , i} ({\mathbf {x}}_{<t}), \end{aligned}$$
(7)

where i denotes the observation dimension, e.g. pixel. For a Gaussian autoregressive model, the base distribution is \(p_\theta ({\mathbf {y}}_{1:T}) = {\mathcal {N}} ({\mathbf {y}}_{1:T}; {\mathbf {0}}, {\mathbf {I}})\). We can improve upon this simple set-up by chaining transforms together, i.e. parameterizing \(p_\theta ({\mathbf {y}}_{1:T})\) as a flow, resulting in hierarchical models.
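
The following sketch implements Eqs. 4–7 for a single affine autoregressive flow with a standard Gaussian base distribution; the shift and scale functions are hypothetical copy-previous-step placeholders rather than the learned networks described later.

```python
import numpy as np

def inverse_flow(x_seq, mu_fn, sigma_fn):
    """Normalize x_{1:T} into y_{1:T} (Eq. 5), accumulating log|det(dx/dy)| (Eq. 7)."""
    y_seq, log_det = [], 0.0
    for t in range(len(x_seq)):
        mu, sigma = mu_fn(x_seq[:t]), sigma_fn(x_seq[:t])
        y_seq.append((x_seq[t] - mu) / sigma)
        log_det += np.sum(np.log(sigma))
    return np.stack(y_seq), log_det

def forward_flow(y_seq, mu_fn, sigma_fn):
    """Generate x_{1:T} from noise y_{1:T} (Eq. 4); generation is necessarily sequential."""
    x_seq = []
    for t in range(len(y_seq)):
        past = np.stack(x_seq) if x_seq else np.zeros((0,) + y_seq.shape[1:])
        x_seq.append(mu_fn(past) + sigma_fn(past) * y_seq[t])
    return np.stack(x_seq)

def log_prob(x_seq, mu_fn, sigma_fn):
    """log p(x_{1:T}) under a standard Gaussian base distribution on y (Eq. 6)."""
    y_seq, log_det = inverse_flow(x_seq, mu_fn, sigma_fn)
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * y_seq ** 2) - log_det

# Hypothetical placeholders for the learned shift and scale functions.
mu_fn = lambda past: past[-1] if len(past) else np.zeros(3)
sigma_fn = lambda past: np.full(3, 0.5)

y0 = np.random.randn(10, 3)
x = forward_flow(y0, mu_fn, sigma_fn)     # sample a sequence from the flow
y1, _ = inverse_flow(x, mu_fn, sigma_fn)  # invert to recover the noise
assert np.allclose(y0, y1)
print(log_prob(x, mu_fn, sigma_fn))
```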

2.4 Related work

Autoregressive flows were initially considered in the contexts of variational inference (Kingma et al. 2016) and generative modeling (Papamakarios et al. 2017). These approaches are generalizations of previous approaches with affine transforms (Dinh et al. 2015, 2017). While autoregressive flows are well-suited for sequential data, these approaches, as well as many recent approaches (Huang et al. 2018; Oliva et al. 2018; Kingma and Dhariwal 2018), were initially applied to static data, such as images.

Recent works have started applying flow-based models to sequential data. van den Oord et al. (2018) and Ping et al. (2019) distill autoregressive speech models into flow-based models. Prenger et al. (2019) and Kim et al. (2019) instead train these models directly. Kumar et al. (2020) use a flow to model individual video frames, with an autoregressive prior modeling dynamics across time steps. Rhinehart et al. (2018, 2019) use autoregressive flows for modeling vehicle motion, and Henter et al. (2019) use flows for motion synthesis with motion-capture data. Ziegler and Rush (2019) model discrete observations (e.g., text) by using flows to model dynamics of continuous latent variables. Like these recent works, we apply flow-based models to sequences. However, we demonstrate that autoregressive flows can serve as a general-purpose technique for improving dynamics models. To the best of our knowledge, our work is the first to use flows to pre-process sequences to improve sequential latent variable models.

We utilize affine flows (Eq. 4), a family that includes methods like NICE (Dinh et al. 2015), RealNVP (Dinh et al. 2017), IAF (Kingma et al. 2016), MAF (Papamakarios et al. 2017), and Glow (Kingma and Dhariwal 2018). However, there has been recent work in non-affine flows (Huang et al. 2018; Jaini et al. 2019; Durkan et al. 2019), which offer further flexibility. We chose to investigate affine flows because they are commonly employed and relatively simple; however, non-affine flows could result in additional improvements.

Autoregressive dynamics models are also prominent in other related areas. Within the statistics and econometrics literature, autoregressive integrated moving average (ARIMA) is a standard technique (Box et al. 2015; Hamilton 2020), calculating differences with an autoregressive prediction to remove non-stationary components of a temporal signal. Such methods simplify downstream modeling, e.g., by removing seasonal effects. Low-level autoregressive models are also found in audio (Atal and Schroeder 1979) and video compression codecs (Wiegand et al. 2003; Agustsson et al. 2020; Yang et al. 2021), using predictive coding (Oliver 1952) to remove temporal redundancy, thereby improving downstream compression rates. Intuitively, if sequential inputs are highly predictable, it is far more efficient to compress the prediction error rather than each input (e.g., video frame) separately. Finally, we note that autoregressive models are a generic dynamics modeling approach and can, in principle, be parameterized by other techniques, such as LSTMs (Hochreiter and Schmidhuber 1997), or combined with other models, such as hidden Markov models (HMMs) (Murphy 2012).

3 Method

We now describe our approach for improving sequence modeling. First, we motivate using autoregressive flows to reduce temporal dependencies, thereby simplifying dynamics. We then show how this simple technique can be incorporated within sequential latent variable models.

Fig. 3

Redundancy reduction. (a) Conditional densities for \(p (x_2 | x_1)\). (b) The marginal, \(p (x_2)\), differs from the conditional densities; thus, \({\mathcal {I}} (x_1 ; x_2) > 0\). (c) In the normalized space of y, the corresponding densities \(p (y_2 | y_1)\) are identical. (d) The marginal, \(p(y_2)\), is identical to the conditionals, so \({\mathcal {I}} (y_1; y_2) = 0.\) Thus, in this case, a conditional affine transform removed the dependencies

3.1 Motivation: temporal redundancy reduction

Normalizing flows, while often utilized for density estimation, originated from data pre-processing techniques (Friedman 1987; Hyvärinen and Oja 2000; Chen and Gopinath 2001), which remove dependencies between dimensions, i.e., redundancy reduction (Barlow 1961). Removing dependencies simplifies the resulting probability distribution by restricting variation to individual dimensions, generally simplifying downstream tasks (Laparra et al. 2011). Normalizing flows improve upon these procedures using flexible, non-linear functions (Deco and Brauer 1995; Dinh et al. 2015). While flows have been used for spatial decorrelation (Agrawal and Dukkipati 2016; Winkler et al. 2019) and with other models (Huang et al. 2017), this capability remains under-explored.

Our main contribution is showing how to utilize autoregressive flows for temporal pre-processing to improve dynamics modeling. Data sequences contain dependencies in time, for example, in the redundancy of video pixels (Fig. 1), which are often highly predictable. These dependencies define the dynamics of the data, with the degree of dependence quantified by the multi-information,

$$\begin{aligned} {\mathcal {I}} ({\mathbf {x}}_{1:T}) = \sum _t {\mathcal {H}} ({\mathbf {x}}_t) - {\mathcal {H}} ({\mathbf {x}}_{1:T}), \end{aligned}$$
(8)

where \({\mathcal {H}}\) denotes entropy. Normalizing flows are capable of reducing this redundancy, yielding a new sequence, \({\mathbf {y}}_{1:T}\), with \({\mathcal {I}} ({\mathbf {y}}_{1:T}) \le {\mathcal {I}} ({\mathbf {x}}_{1:T})\), i.e., with reduced temporal dependencies. Thus, rather than fit the data distribution directly, we can first simplify the dynamics by pre-processing sequences with a normalizing flow and then fit the resulting sequence. Through training, the flow will attempt to remove redundancies to meet the modeling capacity of the higher-level dynamics model, \(p_\theta ({\mathbf {y}}_{1:T})\).

Example To visualize this procedure for an affine autoregressive flow, consider a one-dimensional input over two time steps, \(x_1\) and \(x_2\). For each value of \(x_1\), there is a conditional density, \(p (x_2 | x_1)\). Assume that these densities take one of two forms, which are identical but shifted and scaled, shown in Fig. 3. Transforming these densities through their conditional means, \(\mu _2 = {\mathbb {E}} \left[ x_2 | x_1 \right] \), and standard deviations, \(\sigma _2 = {\mathbb {E}} \left[ (x_2 - \mu _2)^2 | x_1 \right] ^{1/2}\), creates a normalized space, \(y_2 = (x_2 - \mu _2) / \sigma _2\), where the conditional densities are identical. In this space, the multi-information is

$$\begin{aligned} {\mathcal {I}} (y_1;y_2) = {\mathbb {E}}_{p (y_1 , y_2)} \left[ \log p (y_2 | y_1) - \log p(y_2) \right] = 0, \end{aligned}$$

whereas \({\mathcal {I}} (x_1; x_2) > 0.\) Indeed, if \(p (x_t | x_{<t})\) is linear-Gaussian, inverting an affine autoregressive flow exactly corresponds to Cholesky whitening (Pourahmadi 2011; Kingma et al. 2016), removing all linear dependencies.

In the example above, \(\mu _2\) and \(\sigma _2\) act as a frame of reference for estimating \(x_2\). More generally, in the special case where \(\varvec{\mu }_\theta ({\mathbf {x}}_{<t}) = {\mathbf {x}}_{t-1}\) and \(\varvec{\sigma }_\theta ({\mathbf {x}}_{<t}) = {\mathbf {1}}\), we recover \({\mathbf {y}}_t = {\mathbf {x}}_t - {\mathbf {x}}_{t-1} = \varDelta {\mathbf {x}}_t\). Modeling finite differences (or generalized coordinates (Friston 2008)) is a well-established technique (see, e.g., Chua et al. 2018; Kumar et al. 2020), which is generalized by affine autoregressive flows.
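
To make this concrete, the sketch below whitens a linear-Gaussian AR(1) sequence using its true conditional mean and standard deviation and compares against the hard-coded finite-difference special case; the process parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, a, sigma = 100_000, 0.9, 0.5

# Linear-Gaussian AR(1) process: x_t = a * x_{t-1} + sigma * eps_t.
x = np.zeros(T)
for t in range(1, T):
    x[t] = a * x[t - 1] + sigma * rng.standard_normal()

def lag1_corr(s):
    """Empirical correlation between successive values."""
    return np.corrcoef(s[:-1], s[1:])[0, 1]

y = (x[1:] - a * x[:-1]) / sigma   # conditional affine normalization (Eq. 5)
dx = x[1:] - x[:-1]                # finite differences: mu = x_{t-1}, sigma = 1

print(f"corr_x  = {lag1_corr(x):+.3f}")   # ~ +0.90: strong temporal dependence
print(f"corr_y  = {lag1_corr(y):+.3f}")   # ~  0.00: linear dependence removed
print(f"corr_dx = {lag1_corr(dx):+.3f}")  # ~ -0.05: reduced, but not exactly zero
```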

Fig. 4

Model diagrams. a An autoregressive flow pre-processes a data sequence, \({\mathbf {x}}_{1:T}\), to produce a new sequence, \({\mathbf {y}}_{1:T}\), with reduced temporal dependencies. This simplifies dynamics modeling for a higher-level sequential latent variable model, \(p_\theta ({\mathbf {y}}_{1:T}, {\mathbf {z}}_{1:T})\). Empty diamond nodes represent deterministic dependencies, not recurrent states. b Diagram of the autoregressive flow architecture. Blank white rectangles represent convolutional layers (see Appendix). The three stacks of convolutional layers within the blue region are shared. cat denotes channel-wise concatenation

3.2 Modeling dynamics with autoregressive flows

We now discuss utilizing autoregressive flows to improve sequence modeling, highlighting use cases for modeling dynamics in the data and latent spaces.

3.2.1 Data dynamics

The form of an affine autoregressive flow across sequences is given in Eqs. 4 and 5, which is, again, equivalent to a Gaussian autoregressive model. We can stack hierarchical chains of flows to improve the model capacity. Denoting the shift and scale functions at the \(m^\text {th}\) transform as \(\varvec{\mu }_\theta ^m (\cdot )\) and \(\varvec{\sigma }_\theta ^m (\cdot )\) respectively, we then calculate \({\mathbf {y}}^m_t\) using the inverse transform:

$$\begin{aligned} {\mathbf {y}}^m_t = \frac{{\mathbf {y}}^{m-1}_t - \varvec{\mu }^m_\theta ({\mathbf {y}}^{m-1}_{<t})}{\varvec{\sigma }^m_\theta ({\mathbf {y}}^{m-1}_{<t})}. \end{aligned}$$
(9)

After the final (\(M^{{\text {th}}}\)) transform, we can choose the form of the base distribution, \(p_\theta ({\mathbf {y}}^M_{1:T})\), e.g. Gaussian. While we could attempt to model \({\mathbf {x}}_{1:T}\) completely using stacked autoregressive flows, these models are limited to affine element-wise transforms that maintain the data dimensionality. Due to this limited capacity, purely flow-based models often require many transforms to be effective (Kingma and Dhariwal 2018).
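
As a sketch, chaining transforms amounts to repeatedly applying the inverse transform of Eq. 9 and summing the per-transform log-determinant terms; the shift and scale functions below are hypothetical placeholders.

```python
import numpy as np

def inverse_transform(seq, mu_fn, sigma_fn):
    """One affine autoregressive inverse transform (Eq. 9) and its log|det| contribution."""
    out, log_det = [], 0.0
    for t in range(len(seq)):
        mu, sigma = mu_fn(seq[:t]), sigma_fn(seq[:t])
        out.append((seq[t] - mu) / sigma)
        log_det += np.sum(np.log(sigma))
    return np.stack(out), log_det

def stacked_inverse(x_seq, transforms):
    """Chain M transforms, x -> y^1 -> ... -> y^M, accumulating the total log|det dx/dy^M|."""
    y, total_log_det = x_seq, 0.0
    for mu_fn, sigma_fn in transforms:  # m = 1, ..., M
        y, log_det = inverse_transform(y, mu_fn, sigma_fn)
        total_log_det += log_det
    return y, total_log_det

# Two hypothetical transforms, each predicting the previous element with a fixed scale.
copy_last = lambda past: past[-1] if len(past) else np.zeros(3)
fixed_scale = lambda past: 0.5 * np.ones(3)
y_M, log_det = stacked_inverse(np.random.randn(10, 3), [(copy_last, fixed_scale)] * 2)
```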

Instead, we can model the base distribution using an expressive sequential latent variable model (SLVM), or, equivalently, we can augment the conditional likelihood of a SLVM using autoregressive flows (Fig. 4a). Following the motivation from Sect. 3.1, the flow can remove temporal dependencies, simplifying the modeling task for the SLVM. With a single flow, the joint probability is

$$\begin{aligned} p_\theta ({\mathbf {x}}_{1:T}, {\mathbf {z}}_{1:T}) = p_\theta ({\mathbf {y}}_{1:T}, {\mathbf {z}}_{1:T}) \left| \det \left( \frac{\partial {\mathbf {x}}_{1:T}}{\partial {\mathbf {y}}_{1:T}} \right) \right| ^{-1}, \end{aligned}$$
(10)

where the SLVM distribution is given by

$$\begin{aligned} p_\theta ({\mathbf {y}}_{1:T} , {\mathbf {z}}_{1:T}) = \prod _{t=1}^T p_\theta ({\mathbf {y}}_t | {\mathbf {y}}_{<t}, {\mathbf {z}}_{\le t}) p_\theta ({\mathbf {z}}_t | {\mathbf {y}}_{<t} , {\mathbf {z}}_{<t}). \end{aligned}$$
(11)

If the SLVM is itself a flow-based model, we can use maximum log-likelihood training. If not, we can resort to variational inference (Chung et al. 2015; Fraccaro et al. 2016; Marino et al. 2018). We derive and discuss this procedure in the Appendix.
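
For reference, one form of the resulting objective (a sketch, assuming an amortized approximate posterior, \(q_\phi ({\mathbf {z}}_{1:T} | {\mathbf {y}}_{1:T})\); the full derivation appears in the Appendix) combines a standard evidence lower bound on \(\log p_\theta ({\mathbf {y}}_{1:T})\) with the change of variables in Eq. 10:

$$\begin{aligned} \log p_\theta ({\mathbf {x}}_{1:T}) \ge {\mathbb {E}}_{q_\phi ({\mathbf {z}}_{1:T} | {\mathbf {y}}_{1:T})} \left[ \log p_\theta ({\mathbf {y}}_{1:T} , {\mathbf {z}}_{1:T}) - \log q_\phi ({\mathbf {z}}_{1:T} | {\mathbf {y}}_{1:T}) \right] - \log \left| \det \left( \frac{\partial {\mathbf {x}}_{1:T}}{\partial {\mathbf {y}}_{1:T}} \right) \right| . \end{aligned}$$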

3.2.2 Latent dynamics

We can also consider simplifying latent dynamics modeling using autoregressive flows. This is relevant in hierarchical SLVMs, such as VideoFlow (Kumar et al. 2020), where each latent variable is modeled as a function of past and higher-level latent variables. Using \({\mathbf {z}}_t^{(\ell )}\) to denote the latent variable at the \(\ell ^{\text {th}}\) level at time t, we can parameterize the prior as

$$\begin{aligned} p_\theta ({\mathbf {z}}^{(\ell )}_t | {\mathbf {z}}^{(\ell )}_{<t}, {\mathbf {z}}^{(>\ell )}_t) = p_\theta ({\mathbf {u}}^{(\ell )}_t | {\mathbf {u}}^{(\ell )}_{<t}, {\mathbf {z}}^{(>\ell )}_t) \left| \det \left( \frac{\partial {\mathbf {z}}^{(\ell )}_t}{\partial {\mathbf {u}}^{(\ell )}_t} \right) \right| ^{-1}, \end{aligned}$$
(12)

converting \({\mathbf {z}}^{(\ell )}_t\) into \({\mathbf {u}}^{(\ell )}_t\) using the inverse transform \({\mathbf {u}}^{(\ell )}_t = ({\mathbf {z}}^{(\ell )}_t - \varvec{\alpha }_\theta ({\mathbf {z}}^{(\ell )}_{<t})) / \varvec{\beta }_\theta ({\mathbf {z}}^{(\ell )}_{<t})\). As noted previously, VideoFlow uses a special case of this procedure, setting \(\varvec{\alpha }_\theta ({\mathbf {z}}^{(\ell )}_{<t}) = {\mathbf {z}}^{(\ell )}_{t-1}\) and \(\varvec{\beta }_\theta ({\mathbf {z}}^{(\ell )}_{<t}) = {\mathbf {1}}\). Generalizing this procedure further simplifies dynamics throughout the model.
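
A minimal sketch of evaluating the prior in Eq. 12 at a single level follows; for clarity, it drops the conditioning on \({\mathbf {u}}^{(\ell )}_{<t}\) and on higher-level latents, and the shift and scale functions are hypothetical placeholders (VideoFlow's special case corresponds to \(\varvec{\alpha }_\theta ({\mathbf {z}}^{(\ell )}_{<t}) = {\mathbf {z}}^{(\ell )}_{t-1}\) and \(\varvec{\beta }_\theta = {\mathbf {1}}\)).

```python
import numpy as np

def latent_prior_log_prob(z_seq, alpha_fn, beta_fn, base_log_prob):
    """Sum over t of log p(z_t | z_{<t}) via Eq. 12:
    u_t = (z_t - alpha(z_{<t})) / beta(z_{<t}),  log p(z_t) = log p(u_t) - sum log beta."""
    total = 0.0
    for t in range(len(z_seq)):
        alpha, beta = alpha_fn(z_seq[:t]), beta_fn(z_seq[:t])
        u_t = (z_seq[t] - alpha) / beta
        total += base_log_prob(u_t) - np.sum(np.log(beta))
    return total

# VideoFlow-style special case with a standard Gaussian base density on u_t.
z_dim = 4
alpha_fn = lambda past: past[-1] if len(past) else np.zeros(z_dim)
beta_fn = lambda past: np.ones(z_dim)
std_normal = lambda u: np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * u ** 2)
print(latent_prior_log_prob(np.random.randn(10, z_dim), alpha_fn, beta_fn, std_normal))
```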

4 Evaluation

We demonstrate and evaluate the proposed technique on three benchmark video datasets: Moving MNIST (Srivastava et al. 2015a), KTH Actions (Schuldt et al. 2004), and BAIR Robot Pushing (Ebert et al. 2017). In addition, we perform experiments on several non-video sequence datasets from the UC Irvine Machine Learning Repository. Specifically, we look at an activity recognition dataset (activity_rec) (Palumbo et al. 2016), an indoor localization dataset (smartphone_sensor) (Barsocchi et al. 2016), and a facial expression recognition dataset (facial_exp) (de Almeida Freitas et al. 2014). Experimental setups are described in Sect. 4.1, followed by a set of analyses in Sect. 4.2. Further details and results can be found in the Appendix.

Fig. 5

Decreased temporal correlation. (a) Affine autoregressive flows result in sequences, \({\mathbf {y}}_{1:T}\), with decreased temporal correlation, \(\text {corr}_{\mathbf {y}}\), as compared with that of the original data, \(\text {corr}_{\mathbf {x}}\). The presence of a more powerful base distribution (SLVM) reduces the need for decorrelation. Additional flow transforms further decrease correlation (note: \(|\text {corr}_{\mathbf {y}}| < 0.01\) for 2-AF). (b) For SLVM + 1-AF, \(\text {corr}_{\mathbf {y}}\) decreases during training on KTH Actions (Color figure online)

4.1 Experimental setup

We empirically evaluate the improvements to downstream dynamics modeling from temporal pre-processing via autoregressive flows. For data space modeling, we compare four model classes: (1) standalone affine autoregressive flows with one (1-AF) and (2) two (2-AF) transforms, (3) a sequential latent variable model (SLVM), and (4) SLVM with flow-based pre-processing (SLVM + 1-AF). As we are not proposing a specific architecture, but rather a general modeling technique, the SLVM architecture is representative of recurrent convolutional video models with a single latent level (Denton and Fergus 2018; Ha and Schmidhuber 2018; Hafner et al. 2019). Flows are implemented with convolutional networks, taking in a fixed window of previous frames (Fig. 4b). These models allow us to evaluate the benefits of temporal pre-processing (SLVM vs. SLVM + 1-AF) and the benefits of more expressive higher-level dynamics models (2-AF vs. SLVM + 1-AF).
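
As an illustration of this setup, the following PyTorch sketch shows one plausible way to parameterize the flow's shift and scale with a small convolutional network over a fixed window of previous frames (cf. Fig. 4b); the layer sizes and window length are illustrative assumptions, not the architecture used in the experiments (see Appendix).

```python
import torch
import torch.nn as nn

class ConvAffineARFlow(nn.Module):
    """Sketch of an affine autoregressive flow over video frames (cf. Fig. 4b)."""
    def __init__(self, channels=3, window=2, hidden=64):
        super().__init__()
        self.window = window
        self.net = nn.Sequential(
            nn.Conv2d(channels * window, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        self.shift = nn.Conv2d(hidden, channels, 3, padding=1)      # mu_theta
        self.log_scale = nn.Conv2d(hidden, channels, 3, padding=1)  # log sigma_theta

    def forward(self, x_seq):
        """x_seq: (T, C, H, W). Returns y_{1:T} and the summed log|det dx/dy|."""
        T, C, H, W = x_seq.shape
        y_seq, log_det = [], x_seq.new_zeros(())
        for t in range(T):
            # Fixed window of previous frames, zero-padded at the start of the sequence.
            past = [x_seq[t - k] if t - k >= 0 else x_seq.new_zeros(C, H, W)
                    for k in range(1, self.window + 1)]
            h = self.net(torch.cat(past, dim=0).unsqueeze(0))
            mu, log_sigma = self.shift(h)[0], self.log_scale(h)[0]
            y_seq.append((x_seq[t] - mu) / log_sigma.exp())
            log_det = log_det + log_sigma.sum()
        return torch.stack(y_seq), log_det

flow = ConvAffineARFlow()
y, log_det = flow(torch.rand(8, 3, 32, 32))  # T = 8 frames of 3 x 32 x 32
```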

To evaluate latent dynamics modeling with flows, we use the tensor2tensor library (Vaswani et al. 2018) to compare (1) VideoFlow and (2) the same model with affine autoregressive flow latent dynamics (VideoFlow + AF). VideoFlow is significantly larger (\(3\times \) more parameters) than the one-level SLVM, allowing us to evaluate whether autoregressive flows are beneficial in this high-capacity regime.

To enable a fairer comparison in our experiments, models with autoregressive flow dynamics have comparable or fewer parameters than baseline counterparts. We note that autoregressive dynamics adds only a constant computational cost per time-step, and this computation can be parallelized for training and evaluation. Full architecture, training, and analysis details can be found in the Appendix. Finally, as noted by Kumar et al. (2020), many previous works do not train SLVMs with proper log-likelihood objectives. Our SLVM results are consistent with previously reported log-likelihood values (Marino et al. 2018) for the Stochastic Video Generation model (Denton and Fergus 2018) trained with a log-likelihood bound objective.

Fig. 6

Flow Visualization for SLVM + 1-AF on Moving MNIST (left) and KTH Actions (right)

Fig. 7

Improved Generated Samples. Random samples generated from (a) VideoFlow and (b) VideoFlow + AF, each conditioned on the first 3 frames. Using AF produces more coherent samples. The robot arm blurs for VideoFlow in samples 1 and 4 (red boxes), but does not blur for VideoFlow + AF (Color figure online)

4.2 Analyses

Visualization In Fig. 1, we visualize the pre-processing procedure for SLVM + 1-AF on BAIR Robot Pushing. The plots show the RGB values for a pixel before (top) and after (bottom) the transform. The noise sequence is nearly zero throughout, despite large changes in the pixel value. We also see that the noise sequence (center, lower) is invariant to the static background, capturing the moving robotic arm. At some time steps (e.g. the fourth frame), the autoregressive flow incorrectly predicts the next frame; however, the higher-level SLVM compensates for this prediction error.

We also visualize each component of the flow. Figure 4b illustrates this for SLVM + 1-AF on an input from BAIR Robot Pushing. We see that \(\varvec{\mu }_\theta \) captures the static background, while \(\varvec{\sigma }_\theta \) highlights regions of uncertainty. In Fig. 6 and the Appendix, we present visualizations on full sequences, where we see that different models remove varying degrees of temporal structure.

Temporal Redundancy Reduction To quantify temporal redundancy reduction, we evaluate the empirical correlation (linear dependence) between frames, denoted as corr, for the data and noise variables. We evaluate \(\text {corr}_{{\mathbf {x}}}\) and \(\text {corr}_{{\mathbf {y}}}\) for 1-AF, 2-AF, and SLVM + 1-AF. The results are shown in Fig. 5a. In Fig. 5b, we plot \(\text {corr}_{{\mathbf {y}}}\) for SLVM + 1-AF during training on KTH Actions. Flows decrease temporal correlation, with additional transforms yielding further decorrelation. Base distributions without temporal structure (1-AF) yield comparatively more decorrelation. Temporal redundancy is progressively removed throughout training. Note that 2-AF almost completely removes temporal correlations (\(|\text {corr}_{\mathbf {y}}| < 0.01\)). However, this metric only quantifies linear dependencies; more complex, non-linear dependencies may still require higher-level dynamics models, as shown through the quantitative comparisons.
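
For reference, such a correlation statistic can be computed along the following lines (one plausible definition, shown here as a sketch; the exact analysis details are in the Appendix): per-dimension lag-1 correlation over time, averaged over dimensions.

```python
import numpy as np

def temporal_corr(seq, eps=1e-8):
    """Average lag-1 Pearson correlation: computed per dimension across time, then
    averaged over dimensions. seq has shape (T, ...) and is flattened to (T, D)."""
    flat = seq.reshape(len(seq), -1)
    x0, x1 = flat[:-1], flat[1:]
    x0 = x0 - x0.mean(axis=0)
    x1 = x1 - x1.mean(axis=0)
    denom = np.sqrt((x0 ** 2).sum(axis=0) * (x1 ** 2).sum(axis=0)) + eps
    return float(np.mean((x0 * x1).sum(axis=0) / denom))

# Example: a strongly correlated random walk vs. a simple normalization (differences).
x = np.cumsum(np.random.randn(50, 64), axis=0)
y = np.diff(x, axis=0)
print(temporal_corr(x), temporal_corr(y))
```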

Performance Comparison Table 1 reports average test negative log-likelihood results on video datasets. Standalone flow-based models perform surprisingly well. Increasing flow depth from 1-AF to 2-AF generally results in improvement. SLVM + 1-AF outperforms the baseline SLVM despite having fewer parameters. As another baseline, we also consider modeling frame differences, \(\varDelta {\mathbf {x}} \equiv {\mathbf {x}}_t - {\mathbf {x}}_{t-1}\), with SLVM, which can be seen as a special case of 1-AF with \(\varvec{\mu }_\theta = {\mathbf {x}}_{t-1}\) and \(\varvec{\sigma }_\theta = {\mathbf {1}}\). On BAIR and KTH Actions, datasets with significant temporal redundancy (Fig. 5a), this technique improves performance over SLVM. However, on Moving MNIST, modeling \(\varDelta {\mathbf {x}}\) actually decreases performance, presumably by creating more complex spatial patterns. In all cases, the learned temporal transform, SLVM + 1-AF, outperforms this hard-coded transform, SLVM + \(\varDelta {\mathbf {x}}\). Finally, incorporating autoregressive flows into VideoFlow results in a modest but noticeable improvement, demonstrating that removing spatial dependencies, through VideoFlow, and temporal dependencies, through autoregressive flows, are complementary techniques.

Table 1 Quantitative comparison.
Fig. 8

Improved Generalization. The low-level reference frame improves generalization to unseen sequences. Train and test negative log-likelihood bound histograms for (a) SLVM and (b) SLVM + 1-AF on KTH Actions. (c) The generalization gap for SLVM + 1-AF remains small for varying amounts of KTH training data, while it becomes worse in the low-data regime for SLVM (Color figure online)

Results on Non-Video Sequence Datasets In Table 2, we report negative log-density results on non-video data in nats per time step. Note that log-densities can be positive or negative. Again, we see that 2-AF consistently outperforms 1-AF, and both are typically on par with or better than SLVM. However, SLVM + 1-AF outperforms all other model classes, achieving the lowest (best) negative log-densities across all datasets. On non-video data, the special case of modeling temporal differences (SLVM + \(\varDelta {\mathbf {x}}\)) actually performs slightly worse than SLVM on all datasets. This, again, highlights the importance of a learned pre-processing transform over hard-coded temporal differences.

Table 2 Non-Video Quantitative Comparison.

Improved Samples The quantitative improvement over VideoFlow is less dramatic, as this is already a high-capacity model. However, qualitatively, we observe that incorporating autoregressive flow dynamics improves sample quality (Fig. 7). In these randomly selected samples, the robot arm occasionally becomes blurry for VideoFlow (red boxes) but remains clear for VideoFlow + AF.

Improved Generalization Our temporal normalization technique also improves generalization to unseen examples, a key benefit of normalization schemes, e.g., batch norm (Ioffe and Szegedy 2015). Intuitively, higher-level dynamics are often preserved across training and test sequences, whereas lower-level appearance is not. This is apparent on KTH Actions, which contains a substantial degree of train-test mismatch due to different identities and activities. NLL histograms on KTH are shown in Fig. 8, with greater overlap for SLVM + 1-AF. We also train SLVM and SLVM + 1-AF on subsets of KTH Actions. In Fig. 8c, we see that autoregressive flows enable generalization in the low-data regime, whereas the generalization gap for SLVM grows.

5 Conclusion

We have presented a technique for improving sequence modeling using autoregressive flows. Learning a frame of reference, parameterized by autoregressive transforms, reduces temporal redundancy in input sequences, simplifying dynamics. Thus, rather than expanding the model, we can simplify the input to meet the capacity of the model. This approach is distinct from previous works with normalizing flows on sequences, yet it connects to classical modeling and compression. We hope these connections lead to further insights and applications. Finally, we have analyzed and empirically shown how autoregressive pre-processing in both the data and latent spaces can improve sequence modeling, sample quality, and generalization.

The underlying assumption behind using autoregressive flows for sequence modeling is that sequences contain smooth or predictable temporal dependencies, along with more complex, higher-level dependencies. In both video and non-video data, we have seen improvements from combining sequential latent variable models with autoregressive flows, suggesting that such assumptions are generally reasonable. Using affine autoregressive flows restricts our approach to sequences of continuous data, but future work could investigate discrete data, such as natural language. Likewise, we assume regularly sampled sequences (i.e., a constant sampling frequency); however, future work could also investigate irregularly sampled event data.