1 Introduction

Movement models are key to spatiotemporal problems. They allow to study player coordination in team sports (Dick and Brefeld 2019) but also generalize to refugee migration patterns (Hübl et al. 2017), collective animal movements (McDermott et al. 2017), and understanding dynamical systems with moving particles (Padberg-Gehle and Schneide 2017). The task of a movement model is to predict all possible movements of a player (or refugee, animal, particle, etc.) in a given situation within a certain amount of time.

Traditional movement (or motion) models focus on simplifications of physical laws (Taki et al. 1996; Taki and Hasegawa 2000; Fujimura and Sugihara 2005) such as the ability to accelerate in every direction equally fast to compute the set of all reachable positions of an agent for a given time horizon. Besides, those models are non-probabilistic and do not account for the fact that positions may be attained with different likelihoods. A simple parametric probabilistic approach, on the other hand, may lead to suboptimal predictive accuracies due to inappropriate choices of the underlying distributions.

Seemingly, Brefeld et al. (2019) solved the problem by proposing nonparametric probabilistic movement models to quantify the desired likelihoods using kernel density estimation (KDE). Although their solution is purely data-driven and renders assumptions on physics obsolete, they need to distinguish initial conditions (e.g., bins of velocities and time horizon intervals). They address this by maintaining many, possibly differently parameterized, models. This turns the advantage of kernel density estimation into a drawback: being nonparametric by design, predictive performance does increase proportionally with data, but every new data point also increases the computation time for the prediction. The same holds true for memory requirements. This is rather impractical.

In this paper, we turn conditional normalizing flows into novel movement models that (i) consist of only a single contextualized probabilistic model, (ii) better adapt to different contexts, such as movement speed, and (iii) allow predictions whose computation time is independent of the amount of training data, allowing for real-time applications. Normalizing flows (Tabak and Vanden-Eijnden 2010; Tabak and Turner 2013; Rippel and Adams 2013; Dinh et al. 2015; Rezende and Mohamed 2015) provide a state-of-the-art framework for learning densities using invertible deep neural networks. A normalizing flow transforms complex data distributions into simpler ones by an invertible chain of transformations. This chain consists of parametrized bijective functions that transform the data into a representation that follows a known base distribution, usually a Gaussian.

The class of conditional normalizing flows (Winkler et al. 2019; Lu and Huang 2020) additionally offers to model a conditional distribution. We extend their approach in the remainder to incorporate context into flow-based movement models. Hence, our flow-based movement model is actually only a single model which can be conditioned on several kinds of contexts, particularly more complex ones than just bins of velocities and time horizon intervals. Moreover, our contribution is efficient and allows for (near) real-time predictions independently of the amount of data.

The remainder is organized as follows. Section 2 reviews related work and Section 3 introduces preliminaries. Section 4 presents our contextual movement models based on normalizing flows. Section 5 reports on empirical results, and Section 6 provides a discussion of the findings. Section 7 concludes.

2 Related work

Spatiotemporal analyses are often fundamental when processing data from wearables like smartphones or dedicated GPS-based tracking devices (Zheng 2015; Mazimpaka and Timpf 2016). While there are straightforward descriptive spatiotemporal tasks like the identification of road defects (Byrne et al. 2013; Mohan et al. 2008) or the discrimination of driving styles (Paefgen et al. 2011), many problems ground on accurate predictions of whereabouts of agents in the near future.

A great deal of these approaches has been published in the context of sports analytics and athlete motion. Often, the focus lies on identifying motion patterns over time for groups of agents (e.g., teams) (Laube et al. 2005; Sprado and Gottfried 2009; Gottfried 2008, 2011). Unfortunately, most of these contributions impose qualitative measures and lack generality. Computational approaches have been proposed by Knauf et al. (2016) who study spatiotemporal convolution kernels or Janetzko et al. (2014) who group attacking patterns on the example of soccer. A frequent pattern mining approach for trajectory data has been proposed in Haase and Brefeld (2014). Recently, neural networks have been applied to athlete trajectories to remedy the need for sufficient statistics and (possibly hand-crafted) feature representations describing the situation on the track or pitch (Zheng et al. 2016; Le et al. 2017).

One of the first movement models has been proposed by Taki and Hasegawa (2000). Their approach, however, simplifies physical laws and allows for unbounded velocities due to constant acceleration. Fujimura and Sugihara (2005) counterbalance this limitation by adding a resistive force to prevent unbounded velocities. Nevertheless, this is insufficient to achieve a realistic model physical model. Recently, Brefeld et al. (2019) circumvent the difficulties in deriving a realistic model from physics by proposing a purely data-driven approach, although their probabilistic model suffers from expensive kernel density estimations in practice.

Naturally, sports analytics is not the only domain where movement models are prominently deployed. Besides the already mentioned areas, other applications include vehicle trajectory analysis (Besse et al. 2018), pedestrian trajectory estimation from videos (Zhong et al. 2020), and real-time robot trajectory prediction (Gomez-Gonzalez et al. 2020).

Normalizing flows recently emerged as an attractive approach to learning densities. A significant advantage over other methods for learning densities using neural networks is that they allow the direct maximization of the log-likelihood. Alternatives such as generative adversarial networks (GANs) (Goodfellow et al. 2014) and variational autoencoders (VAEs) (Kingma and Welling 2014; Rezende et al. 2014) use instead surrogate learning objectives for this task, such as the GAN adversarial loss and the VAE evidence lower bound (ELBO). While this does not hamper their use as generative models, they are not suitable for estimating the likelihood of a given data point.

In the last few years, a number of different flow-based models were proposed. A particular family of models, derived from NICE (Dinh et al. 2015), is appealing due to their computationally efficient nature while being simple to implement. Improvements such as RealNVP (Dinh et al. 2017) and Glow (Kingma and Dhariwal 2018) were proposed over the years, achieving state-of-the-art performance to model complex, high-dimensional distributions, while keeping the attractive properties of NICE. Similarly, autoregressive flow models such as NAF (Huang et al. 2018) and B-NAF (De Cao et al. 2019) have been proposed. Particularly interesting are conditional normalizing flows (Lu and Huang 2020), a simple improvement over Glow which allows them to model distributions conditioned on continuous variables. For an in-depth analysis of these and their relation to other flow-based models, we refer the reader to Papamakarios et al. (2021).

3 Preliminaries

Let \(\mathscr {T}= (\mathbf {x}^{(t)})_{t \in \mathbb {R}^+}\) be the trajectory of positions \(\mathbf {x}^{(t)} \in \mathbb {R}^d\) at time t of an object. In the remainder, we deal with two-dimensional movements, i.e., \(\mathbf {x}^{(t)} = (x_1^{(t)},x_2^{(t)})^\top\), but the following definitions also hold for higher-dimensional movements. The positional data might come with additional information, the so-called context \(\mathbf {c}^{(t)}\) at time t. An example for context is the velocity vector \(\mathbf {v}^{(t)} \in \mathbb {R}^d\), or its magnitude, which is known as speed \(v^{(t)} = \Vert \mathbf {v}^{(t)}\Vert _2\), respectively. Our goal is to model the distribution of the position \({\mathbf{x}}^{{(t + t_{\Delta } )}}\), which is \({t}_{\Delta } {\text{ > 0}}\) seconds in the future, given the current position \(\mathbf {x}^{(t)}\) and context \(\mathbf {c}^{(t)}\).

Fig. 1
figure 1

Illustration of a movement model (purple contour lines). The solid black line depicts a trajectory. The last, current, and next position of an player are marked as A, B, and C, respectively. The movement model shown in purple models the distribution of the next position C (red dot) using the information of the past and present (A and B). The dashed lines denote a local coordinate system centered at the player and aligned in the current direction of movement

To make the movement model location-invariant, we consider a local coordinate system centered at the object’s current position and along the last movement direction. Figure 1 shows an example. The object under consideration is currently at position B and moves along the solid trajectory. The dashed lines show the local coordinate system which is aligned using the last direction, estimated by a position A, which is in the past. In \({t}_{\Delta } {\text{ = 1}}\) seconds, the position C is reached. The distribution of the next position, which is what we want to model, is depicted as a red contour plot.

3.1 Movement models based on kernel density estimates

Our movement model is based on concepts of Brefeld et al. (2019). They consider movement models for trajectories in soccer and employ a kernel density estimate for their model. The idea is as follows. Given a trajectory (or a set of many trajectories) one first extracts triplets as shown in Figure 1. Let \((\mathbf {x}_A,\mathbf {x}_B,\mathbf {x}_C)\) be such a triplet of positions with timestamps \(t_A\), \(t_B\), and \(t_C\) such that \(t_A< t_B < t_C\). We denote the time difference of \(t_A\) and \(t_B\) as \(t_{\delta } =t_B-t_A\) and the difference of \(t_B\) and \(t_C\) as \(t_\varDelta =t_C-t_B\). Here, \(\mathbf {x}_B\) describes the current position, \(\mathbf {x}_A\) is used to estimate the current direction in which the object is moving and \(\mathbf {x}_C\) denotes the position in the future. Hence, the next position \(\mathbf {x}_C\) describes the ability to move within a given time horizon \(t_\varDelta\). Collecting multiple next positions \(\mathbf {x}_C\), represented in the local coordinate system as \(\mathbf {x}_C^{\text {local}}\), allows for creating a movement model. Such a model is then able to quantify the likelihood of a possible next position relative to its current position.

To obtain the next position \(\mathbf {x}_C^{\text {local}}\), the triplet is transformed into the local coordinate system. The transformation realizing this first subtracts the current position \(\mathbf {x}_B\) from all points of the triplet. This way, the triplet is centered at the current position. Then, the triplet is rotated such that the last position, i.e., \(\mathbf {x}_A\), is aligned with the x-axis of the local coordinate system. Let \(\mathbf {g}(\mathbf {x}_A,\mathbf {x}_B,\mathbf {x}_C) = \mathbf {x}_C^{\text {local}}\) be this transformation. Mathematically, this transformation can be expressed as

$$\begin{aligned} \mathbf {x}_C^{\text {local}} = \begin{bmatrix} r \cdot \cos (\theta ) \\ r \cdot \sin (\theta ) \end{bmatrix}, \end{aligned}$$

where r is a distance given by \(r = \Vert \overrightarrow{\mathbf {x}_B\mathbf {x}_C} \Vert _2\) and \(\theta\) is a signed angle given by \(\theta = \measuredangle (\overrightarrow{\mathbf {x}_A\mathbf {x}_B}, \overrightarrow{\mathbf {x}_B\mathbf {x}_C})\). After processing multiple triplets along a trajectory for fixed time differences \(t_{\delta }\) and \(t_\varDelta\) and storing the corresponding next positions in the local coordinate system in a set \(\mathbb {S}_{t_\varDelta }\), a kernel density estimate is employed on \(\mathbb {S}_{t_\varDelta }\).

Brefeld et al. (2019) construct several movement models for various time horizons \(t_\varDelta\). As a context, they use the current speed of a player. To add this contextual information, they filter the triplets by current speed. This is done by first defining a binning of the range of possible speeds. Specifically, they use the following binning: [0, 1) (standing), [1, 7) (walking), [7, 14) (jogging), [14, 20) (running), and [20, 40] (sprinting), where all values are in kilometers per hour. Then, movement models are constructed for multiple time horizons per speed range S, yielding \(\mathbb {S}_{t_\varDelta }^{S}\). An example is depicted in Fig. 2. The left shows the set \(\mathbb {S}_{t_\varDelta }^{S}\) holding the next positions within the local coordinate system for a time horizon of \(t_\varDelta =1s\). We apply the context of a speed range \(S=[14,20)\), where the values are again in kilometers per hour. The corresponding kernel density estimate is shown on the right. Following Brefeld et al. (2019), we deploy a Gaussian kernel and select the bandwidth using Scott’s rule (Scott 2015).

Fig. 2
figure 2

An example of the set \(\mathbb {S}_{t_\varDelta }^{S}\) with a time horizon of \(t_\varDelta =1s\) and a speed range of 14-20 km/h and the corresponding kernel density estimate

The time complexity of this approach is its biggest drawback. Computing the likelihood of a single d-dimensional data point with a kernel density estimate built on n triplets \((\mathbf {x}_A, \mathbf {x}_B, \mathbf {x}_C)\) takes \(\mathscr {O}(dn)\) time. Analogously, the approach has also a large memory footprint, as every known data point must be accessible during prediction. The space complexity also scales in \(\mathscr {O}(dn)\).

3.2 Normalizing flows

Let \(\mathscr {X} = \{\mathbf {x}_1, \ldots , \mathbf {x}_n \} \subset \mathbb {R}^d\) be instances drawn from an unknown distribution \(p_x(\mathbf {x})\). The goal is to estimate an accurate model of \(p_x(\mathbf {x})\). This is done by expressing \(p_x(\mathbf {x})\) in terms of a simpler, known distribution \(p_z(\mathbf {z})\), and learning a bijective map between them. Formally, let \(\mathbf {f}: \mathbb {R}^d \rightarrow \mathbb {R}^d\) be a bijective and differentiable function. Using the change of variable theorem and \(\mathbf {z}= \mathbf {f}(\mathbf {x})\), we can express \(p_x(\mathbf {x})\) as

$$\begin{aligned} p_x(\mathbf {x}) = p_z(\mathbf {f}(\mathbf {x})) |\det J_\mathbf {f}(\mathbf {x})|, \end{aligned}$$

where \(J_\mathbf {f}(\mathbf {x})\) is the Jacobian matrix of the diffeomorphism \(\mathbf {f}\). The distribution \(p_z(\mathbf {z})\) is often referred to as a base distribution (Papamakarios et al. 2021). We henceforth drop the subscript of the distribution p whenever it is clear from the context.

Figure 3 depicts a one-dimensional example. The left-hand side shows the data distribution, which is usually unknown. The goal is to map from this distribution to a known, base distribution, depicted on the right side. The usual choice is a standard Gaussian.

Fig. 3
figure 3

A simple normalizing flow on one-dimensional data. The plot to the left depicts the usually unknown data distribution, while the plot to the right shows the base distribution (standard Gaussian)

In practice, \(\mathbf {f}\) is not just a fixed transformation, but rather a chain of parameterized transformations. Hence, we deal with \(\mathbf {f}^{(\theta )}\), which is represented by a chain of bijective transformations \(\mathbf {f}^{(\theta )} = \mathbf {f}_L^{(\theta )} \circ \mathbf {f}_{L-1}^{(\theta )} \circ \cdots \circ \mathbf {f}_1^{(\theta )}\), where L is the total number of transformations. This composition defines a normalizing flow between \(\mathbf {x}= \mathbf {z}_0\) and \(\mathbf {z}= \mathbf {z}_L\) and is precisely what enables flows to be computationally and analytically tractable. Let \(\mathbf {z}_i = f_i(\mathbf {z}_{i-1})\) be the intermediate variable, where \(i=1,\ldots ,L\). Then, \(\log p(\mathbf {x})\) can be expressed as the log-likelihood of the base distribution and the log-determinant of the Jacobians of each bijection

$$\begin{aligned} \log p(\mathbf {x}) = \log p(\mathbf {z}) + \sum _{i=1}^L \log \left| \det J_{\mathbf {f}^{(\theta )}_i}(\mathbf {z}_{i-1}) \right| . \end{aligned}$$

Thus, we can learn the transformation from the data \(\mathscr {X}\) to its base representation by optimizing the parameters \(\theta\) that minimize the negative log-likelihood

$$\begin{aligned} \min _\theta \ - \sum _{\mathbf {x}\in \mathscr {X}} \log p(\mathbf {x}). \end{aligned}$$

In the remainder, we further simplify the notation by dropping the superscript \(\theta\) from \(\mathbf {f}\). Note that the example in Fig. 3 described above shows that a normalizing flow defines a generative model. After learning the transformation \(\mathbf {f}\) on data \(\mathscr {X}\), we can sample from the base distribution and obtain new samples, distributed according to \(p_x\), by using the inverse transformation \(\mathbf {f}^{-1}\).

4 Learning movements with flow-based models

The choice of transformations for building the flow \(\mathbf {f}\) involves trade-offs between computational efficiency during learning or data generation and its ability to model \(p_x\). In the last few years, several different approaches emerged. To turn a normalizing flow into a movement model, we focus on only a few characteristic properties. As movement data tends to be low-dimensional, it can often be sampled at a high frequency, resulting in millions of observations.

4.1 Flow-based models without context

As an architecture for our flow model, we consider Glow (Kingma and Dhariwal 2018), which is itself based on RealNVP (Dinh et al. 2017). Glow is built from three main transformations: activation normalization (actnorm), \(1 \times 1\) invertible convolutions, and affine coupling. Those transformations are employed in a multi-scale architecture, reshaping the image tensors to have fewer pixels with more channels, referred to as squeezing. The channels are then split and further operations are only performed on half of them. This squeezing and splitting scheme is performed several times for scalability. Squeezing, however, is not suitable for vectorial data including d-dimensional positions. Furthermore, the dimensionality of our setting curtails the computational performance gains from splitting, allowing us to use the entire vector in every transformation of the flow.

An overview of the non-contextualized architecture we use is depicted in Fig. 4 (non-grayed area). We detail the actnorm and affine coupling transformations next. For both of them, let \(\mathbf {z},\mathbf {y}\in \mathbb {R}^d\) be d-dimensional vectors, where \(\mathbf {z}\) is the input of the current transformation and \(\mathbf {y}\) is its output, being directly fed into the next transformation in the chain.

Fig. 4
figure 4

Overview of the flow model: the composition of actnorm, permutation and coupling is repeated multiple times to build the flow. The contextual part is shaded in gray. \(\mathbf {x}\) is the input to the flow, in our setting the local coordinates (\(\mathbf {x}_C^{\text {local}}\)); \(\mathbf {z}\) is the output of the transformation into the base representation, which follows a predefined (base) distribution; in the conditional variant, \(\mathbf {c}\) is the context vector, which can be used to characterize the movement being modeled, such as speed. \({\mathrm{CN}_{\mathrm{ca}}}\) (conditional actnorm) and \({\mathrm{CN}_{\mathrm{cac}}}\) (conditional affine coupling) augment the respective transformations with contextual information

Actnorm. Let \(\mathbf {a} \in \mathbb {R}^d\) be a scaling vector and \(\mathbf {b} \in \mathbb {R}^d\) an offset vector. The actnorm transformation Kingma and Dhariwal (2018), its inverse, and log-determinant of its Jacobian are given by

$$\begin{aligned} \mathbf {y}&= \mathbf {f}(\mathbf {z}) = \mathbf {a} \odot \mathbf {z}+ \mathbf {b}, \\ \mathbf {z}&= \mathbf {f}^{-1}(\mathbf {y}) = (\mathbf {y}- \mathbf {b})/\mathbf {a}, \\ \log | \det J_{\mathbf {f}}(\mathbf {z}) |&= \sum _{i=1}^d \log |a_i|, \end{aligned}$$

respectively, where \(\odot\) denotes the Hadamard product. Particularly, \(\mathbf {a}\) and \(\mathbf {b}\) are learned as part of the transformation and are initialized with the first batch of data. This initialization is such that the mean and standard deviation of \(\mathbf {y}\) are zero and one, respectively.

Affine coupling. This transformation (Dinh et al. 2017) is slightly more involved. The input \(\mathbf {z}\in \mathbb {R}^d\) is initially split into two parts \((\mathbf {z}_1,\mathbf {z}_2)\), each of which is \(d'\)-dimensional, where \(d'=d/2\). Then, the second part is simply copied, \(\mathbf {y}_2=\mathbf {z}_2\), while \(\mathbf {z}_1\) is transformed based on information from \(\mathbf {z}_2\) as follows:

$$\begin{aligned} \left( \log \mathbf {a}, \mathbf {b}\right)&= {\mathrm{NN}_{\mathrm{ac}}}(\mathbf {z}_2), \nonumber \\ \mathbf {y}_1&= \exp (\log \mathbf {a}) \odot \mathbf {z}_1 + \mathbf {b}, \end{aligned}$$

where \({\mathrm{NN}_{\mathrm{ac}}}: \mathbb {R}^{d'} \rightarrow \mathbb {R}^{d}\) is a nonlinear mapping, generally realized by a neural network. The output is then \(\mathbf {y}= \mathrm{concat}(\mathbf {y}_1,\mathbf {y}_2)\). This inverse transformation can be computed by first splitting \(\mathbf {y}\) into \((\mathbf {y}_1,\mathbf {y}_2)\), copying the second part as before, i.e., \(\mathbf {z}_2 =\mathbf {y}_2\), computing the log-scale and offset parameters using \({\mathrm{NN}_{\mathrm{ac}}}(\cdot )\) as in Eq. (1), and computing the inverse affine linear transformation via

$$\begin{aligned} \mathbf {z}_1&= \left( \mathbf {y}_1 - \mathbf {b}\right) /\exp (\log \mathbf {a}). \end{aligned}$$

Finally, the two parts are merged with \(\mathbf {z}= \mathrm{concat}(\mathbf {z}_1,\mathbf {z}_2)\). The log-determinant of the Jacobian of this transformation is

$$\begin{aligned} \log | \det J_{\mathbf {f}}(\mathbf {z}) |&= \sum _{i=1}^{d'} \log |a_i|. \end{aligned}$$

Permutation. Not using squeezing and splitting entails a few additional considerations that directly lead us to \(1 \times 1\) convolutions that are particularly designed for these cases. Squeezing increases the number of channels and affect the way affine coupling layers perform. Hence, a permutation operation over the channels is usually performed before an affine coupling transformation. This allows the squeezing to affect a broader range of channels than it is designed to, lifting the flow to perform more sophisticated transformations. As such, having a permutation-like operation over the channels plays a significant role. Recall that this is not the case in our scenario as we cannot apply squeezing at all. Instead, we propose to employ a permutation operation that simply reverses the dimensions of the input. This operation has an inverse transformation and its log-determinant is equal to zero.

A major benefit of this approach is the ability to compute predictions independently of the data used for training. It only requires a number of computations proportional to the data dimensionality d and the number of transformations L employed. In other words, computing the likelihood of a data point takes \(\mathscr {O}(dL)\) time. Additionally, only the model parameters have to be stored in memory, resulting in a significantly smaller memory footprint that the KDE-based movement model when dealing with large data sets.

4.2 Contextualized flow-based movement models

In this setting, we are particularly interested in contextual information associated with movement data, such as speed. Therefore, the transformations employed in the flows should be able to take into account this additional information. A conditional flow-based model that appropriately addresses those requirements is conditional Glow (c-Glow) (Lu and Huang 2020). As the name suggests, c-Glow is based on Glow (Kingma and Dhariwal 2018), which is itself based on RealNVP (Dinh et al. 2017). Nevertheless, the goal is to derive a contextualized model that can be conditioned on arbitrary contexts. The most significant change is the addition of a conditioning network, denoted by \(\mathrm{CN}(\cdot )\), which replaces the parameters of the actnorm and affine coupling layers with surrogates that are predicted from the context \(\mathbf {c}\in \mathbb {R}^{d_c}\). Hence, a conditional probability density can be devised whose log-variant is given by

$$\begin{aligned} \log p(\mathbf {x}|\mathbf {c}) = \log p(\mathbf {z}) + \sum _{i=1}^L \log \left| \det J_{\tilde{\mathbf {f}}^{(\theta (\mathbf {c}))}_i}(\mathbf {x}) \right| . \end{aligned}$$

The parameters \(\theta (\mathbf {c})\) of the transformations \(\mathbf {f}_i\) are now dependent on the context \(\mathbf {c}\). Note that this does not change the computation of the log-determinant of the Jacobian in both cases. The new transformations with the addition of the CN are detailed below.

Conditional actnorm. The scaling and offset vectors \(\mathbf {a}\) and \(\mathbf {b}\) are now computed by \({\mathrm{CN}_\mathrm{ca}} : \mathbb {R}^{d_c} \rightarrow \mathbb {R}^{2d}\) as

$$\begin{aligned} (\mathbf {a}, \mathbf {b})&= {\mathrm{CN}_{\mathrm{ca}}}(\mathbf {c}), \end{aligned}$$

and the transformation is then carried out as before. The original initialization procedure for \(\mathbf {a}, \mathbf {b}\) is no longer needed.

Conditional affine coupling. This transformation already computes its scaling \(\mathbf {a}\) and offset \(\mathbf {b}\) parameters from \({\mathrm{NN}_{\mathrm{cac}}} : \mathbb {R}^{d'+d} \rightarrow \mathbb {R}^{d}\). The change introduced by conditioning on \(\mathbf {c}\) is an additional input to \({\mathrm{NN}_{\mathrm{cac}}}(\cdot )\), computed by \({\mathrm{CN}_{\mathrm{cac}}} : \mathbb {R}^{d_c} \rightarrow \mathbb {R}^{d}\), as

$$\begin{aligned} (\log \mathbf {a}, \mathbf {b})&= {\mathrm{NN}_{\mathrm{cac}}}(\mathbf {z}_2, {\mathrm{CN}_{\mathrm{cac}}}(\mathbf {c})), \end{aligned}$$

followed by the same operations as before.

Figure 4 provides an overview of the complete normalizing flow model with and without the conditioning on \(\mathbf {c}\). As shown in Fig. 5, learning contextual flow-based probabilistic movement models proceeds as follows: Given positional data, triplets (ABC) are extracted from the trajectories. The future position C in a time horizon \(t_\varDelta\) of interest is then represented in a local coordinate system for spatial invariance. The positions which can be reached in time \(t_\varDelta\) are then transformed, using our normalizing flow model, into a representational space that follows a standard Gaussian.

Fig. 5
figure 5

A summary of our approach. Left: Triplets are extracted from a trajectory using some current and previous position (black) as well as a future position (red) reached within the time horizon \(t_\varDelta\). Center: The future position is remapped using \(\mathbf {g}\) into a local coordinate system. The coordinates are centered at the current position and aligned with the previous position. Right: Points reached in time \(t_\varDelta\) are transformed using a conditional normalizing flow \(\mathbf {f}\). The flow \(\mathbf {f}\) is optimized to ensure that points under this transformation are distributed according to a Gaussian base distribution

The additional context taken into account by the model, while crucial to its flexibility in dealing with a variety of movements observed in the data, incurs little additional computational cost. As in the non-contextual case, the time to compute the likelihood of a data point is \(\mathscr {O}(dL)\) and the amount of memory required is proportional to the number of parameters of the model.

5 Experiments

We now evaluate our flow-based probabilistic movement models. First, we conduct an in-depth analysis of our proposed approach and compare its performance with several baselines. Second, we investigate whether using the positions of the remaining players as additional contextual information improves our movement model. For all experiments, we use data from professional soccer.

5.1 Data

The tracking data from soccer consist of coordinates of each player and the ball, recorded with camera-based systems at 25Hz for five professional games from the German Bundesliga. Each game is encoded as a sequence of triplets \((x_1^{(t)},x_2^{(t)},v^{(t)})\) describing the x and y coordinates on the pitch and the current speed \(v^{(t)}\) in km/h at time t for every player. Thus, every game consists of about \(25\cdot 60\cdot 90=135,000\) positions per player. The x and y positions are relative to the origin. Hence, the coordinates are within \([-52.5, 52.5]\times [-34.0, 34.0]\subset \mathbb {R}^2\), since the dimensions of a standard soccer field are \(105\times 68\) meters. We use the first four games for training and the last game for testing.

5.2 Baselines

We compare our model against the following baselines. The first baseline, denoted as KDE, follows the idea of Brefeld et al. (2019) and deploys kernel density estimates for the movement models. We use a Gaussian kernel and select the bandwidth using Scott’s rule (Scott 2015). A second, straightforward baseline simply estimates a two-dimensional Gaussian (mean and covariance) on the point cloud \(\mathbb {S}\) as introduced in Sect. 3, see for instance Figure 2. We refer to this baseline as Gaussian. Third, we test a two-dimensional histogram, again on the point cloud mentioned above. The histogram uses an equally spaced grid over \([-10,30]\times [-20,20]\) with 1600 cells of size \(1\times 1\) meter. Note that this range covers a larger space than depicted in Figure 2 and is suitable for all velocity ranges. We refer to this baseline as Histogram.

Depending on the experiment we use different configurations. We experiment with time horizons \(t_\varDelta \in \{1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0\}\), where every value is in seconds. The baselines use the same speed discretization as in Brefeld et al. (2019) which is \(\{[0,1),[1,7),[7,14),[14,20),[20,40]\}\). Those values are in kilometers per hour. Consequently, we train per baseline \(7 \cdot 5 = 35\) different configurations, one per time horizon and speed range. As for the \(t_{\delta }\), denoting the time difference for estimating the direction in which a player is moving, we again follow Brefeld et al. (2019) and use \(t_{\delta } =0.2\) seconds. We show the influence of \(t_{\delta }\) in "Appendix 1". The code for all baselines is written in Python using NumPy (Harris et al. 2020) and SciPy (Virtanen et al. 2020). All experiments run on a machine with an Intel Xeon CPU, 256GB of RAM, and an NVIDIA V100 GPU.

5.3 Evaluation metrics

When evaluating and comparing movement models, it is important to quantify how well a model explains the observed movements. In other words, a movement model should be capable of showing where the agent or object of interest will be in the near future. Consider a model that predicts a large area of future positions. Although this model explains all future positions by its sheer broadness, it is not concise. Hence, a good movement model needs to be concise and accurate and, to fulfill both, the model has to find an optimal trade-off between area and accuracy to estimate future positions in the smallest area possible.

A natural measure for this trade-off is the (log-)likelihood. Since densities are normalized by definition, larger areas possess lower point-wise likelihoods while compact densities capture only trivial movements and do not generalize well. As a consequence, a model achieving a good trade-off will also achieve higher likelihoods.

We additionally aim to measure the complexity of the different approaches. Section 3 shows that the complexity of predicting a likelihood either scales in the size of training data (KDE) or in the complexity of the neural networks (flow-based models). We thus measure the average evaluation time and report its quantity in seconds. Note that the prediction speed remains constant for both the Histogram and the Gaussian baseline.

A similar argument holds for memory footprints of the different approaches. The predictive performance of KDE is not only expected to deteriorate for data at large scales, it is also expected to require an excessive amount of memory. Hence, we also analyze the memory requirements of all evaluated methods and provide the number of variables (i.e., floats or double precision) that need to be stored.

5.4 Setup of the flow-based movement models

For a thorough comparison of the unconditional and the conditional flow-based movement models, we experiment with several configurations of the proposed models. The unconditional Flow uses the same time horizons and speed discretizations as the baselines above. Hence, we consider as many different models as the baselines for obtaining a fair comparison. With Flow, we intend to show that movement models based on normalizing flows achieve the same or better predictive performance as the state of the art. We introduce contextual information, given by \(\mathbf {c}^{(t)}=(v^{(t)}, t_\varDelta )\), directly as a part of a single model in CFlow. Here, we use the current speed \(v^{(t)}\) and the time horizon \(t_\varDelta\) as context, just as for the baselines. Note that the baselines are implicitly conditioned on the time horizon as well due to maintaining different models for different intervals. Hence, the purpose of CFlow is to have a unified model instead of many different ones. We further evaluate a model called CFlow-extended which leverages the positions of other players as additional contextual information. This model uses as context \(\mathbf {c}^{(t)}=(v^{(t)}, t_\varDelta , \bar{x}_\text {rel}^{(t)})\), where \(\bar{x}_\text {rel}^{(t)} \in \mathbb {R}^{\frac{1}{2} d_\text {hidden}}\) is a permutation invariant representation of the relative positions of the remaining players. We provide details of how we compute this vector in "Appendix 2".

All models follow the architecture shown in Figure 4, repeating the three depicted transformations (actnorm, permutation, and coupling) \(L = 8\) times in sequence. The same applies to their contextual alternatives. For unconditional and conditional flow-based models, the architectures of the neural networks are outlined in Table 1. All networks use SELU (Klambauer et al. 2017) activations, d denotes the dimensionality of the input \(\mathbf {x}\) and \(d_c\) is the dimensionality of the context which is for CFlow and CFlow-extended, \(d_c = 2\) and \(d_c = 2+\frac{1}{2}d_\text {hidden}\), respectively. The size of the hidden dimensionality is set to \(d_\text {hidden}=16\). Just as the authors of Glow (Kingma and Dhariwal 2018), we initialize the weights and biases of the last layer of each neural network in Table 1 to zeros for stability. This implies the actnorm and affine coupling layers are initialized to identity transformations before training.

The code is written in Python using JAX (Bradbury et al. 2018). We employ the Adam optimizer (Kingma and Ba 2015) with a learning rate of \(10^{-3}\). For stability, we clip gradients (Pascanu et al. 2013) with norms larger than 20. The batch size is set to 1, 024 for all models except the conditional models which condition on time horizon, in which a larger batch size of \(1,024 \cdot N_{t_{\varDelta }}\) is used to accommodate for different time horizons, where \(N_{t_{\varDelta }}\) denotes the number of different time horizons. All models are trained for 100 epochs with no early stopping.

Table 1 Neural network architectures detailed as a sequence of fully connected layers, where the constants before and after arrows indicate the dimensionalities of input and outputs for that layer, respectively, while SELU denotes the activation function Klambauer et al. (2017). The networks are part of the affine coupling \({\mathrm{NN}_{\mathrm{ac}}}(\mathbf {z}_2)\) that operates on the second part of the intermediate representation \(\mathbf {z}_2 \in \mathbb {R}^{d/2}\), the conditional actnorm \({\mathrm{CN}_{\mathrm{ca}}}(\mathbf {c})\) which uses the context \(\mathbf {c}\in \mathbb {R}^{d_c}\), and the conditional affine coupling \({\mathrm{NN}_{\mathrm{cac}}}(\mathbf {z}_2,\tilde{\mathbf {c}})\), where \(\tilde{\mathbf {c}}={\mathrm{CN}_{\mathrm{cac}}}(\mathbf {c}) \in \mathbb {R}^d\)

5.5 Results

We first evaluate the predictive performance of the baseline movement models and compare them to the unconditional Flow. To have a fair comparison, we deal with 35 different configurations per model. We proceed as follows. From the test game, we randomly sample a trajectory over three minutes for every player, leaving us with 22 such trajectories. Then, for several time horizons ranging from 1s to 4s, we compare the average log-likelihood per trajectory and model as well as the corresponding computation time for the prediction.

Unsurprisingly, the average log-likelihoods decrease for larger time horizons \(t_\varDelta\), as the models become progressively more uncertain with increasing \(t_\varDelta\). Figure 6 shows the relative improvement of each model compared to the KDE. It can be seen that the Flow model (green) performs on par with the KDE (dashed red) while the Gaussian approximation (brown) is almost consistently the worst. In addition, the performance gap of the Histogram (dotted purple) closes for larger time horizons, however, it underperforms compared to the KDE.

Fig. 6
figure 6

Average log-likelihood values (22 players) per model relative to the KDE baseline for various time horizons \(t_\varDelta\). The specific values of the KDE are stated in the upper x-axis

The relative improvement of CFlow (orange) and CFlow-extended (blue), each of which are single contextual models, shows that they clearly dominate all baselines by far. We credit this observation to conditioning. While the other models use a predefined velocity discretization, both CFlow and CFlow-extended adapt specifically to any velocity without the need for binning intervals with identical model output. If we additionally include the relative positions of the other players into the context and condition on it, the performance increases further.

The flexibility afforded by conditioning can also be seen in Fig. 7, which visualizes a trajectory of a soccer player drawn from the data. The bottom-right part of the figure depicts the current velocities of the player. In the remaining parts, the trajectory is colored according to the log-likelihood values using the Gaussian, Flow, and CFlow models for a time horizon of \(t_\varDelta =1s\). We also evaluate the average log-likelihood of this trajectory showing that CFlow has the highest log-likelihood, meaning it predicts the movement best. Once again, the Gaussian model yields the worst log-likelihood. We credit this due to its implicit symmetry. While being symmetric for moving left or right might be acceptable in this use case, the symmetry in the orthogonal direction is clearly not present, as seen in Fig. 2. Visual inspection shows the effect of binning the velocities on the performance of the Gaussian and Flow; the coloring exhibits a block structure due to velocity transitions across bins. By contrast, the conditional CFlow distinguishes by a continuous coloring, indicating a much better adaptation to the trajectory at hand. CFlow and CFlow-extended can smoothly adjust to the current velocity at each point in time.

Fig. 7
figure 7

A sample trajectory from soccer data. The bottom right plot shows speed along the trajectory. Other plots depict log-likelihood values along the trajectory for the Gaussian, Flow, and CFlow models for a time horizon of \(t_\varDelta =1s\). Darker colors in these other plots indicate the model estimates that position as less likely. We also provide avg. log-likelihood values for each model in this trajectory. The red star denotes the start of the trajectory. Black lines indicate model transitions for the Gaussian and Flow models, which rely on speed binning. Black triangle markers indicate an example where models behave distinctively: at those points in the trajectory the player starts accelerating and then slows down and CFlow maintains a consistent quality of predictions, while the other two models do not

Table 2 compares the predictive performance of the competitors in terms of computation time (in seconds). The numbers are again averaged over 22 individual trajectories and are shown with their standard errors. The baselines Gaussian and Histogram take almost no time to compute the predicted movements. KDE is consistently the slowest model by orders of magnitude. The result shows the limitations of the KDE for (near) real-time applications and/or large data sets. Our proposed flow-based models are only marginally slower than the straightforward competitors and, as expected, enable predictions that can be computed very efficiently.

Table 2 Evaluation time averaged over 22 players (incl. standard errors) in seconds

The KDE baseline is not only the slowest approach when comparing evaluation time, it has also the highest memory demand, as Table 3 shows. This is due to the nonparametric nature of the approach. The more training data is observed and integrated, the better the model. Trivially, integrating more data into a KDE also means to store exactly these additional data. In comparison, the Histogram baseline has a fixed grid and velocity binning, and, hence, a fixed memory footprint. The Gaussian exploits strong assumptions (e.g., unimodality, symmetry) and needs the least amount of memory, but is quite limited in expressiveness. Our proposed flow-based models have moderate memory requirements which neither grow with training data as the KDE, nor with a higher precision or finer-grained bins as the Histogram.

Table 3 Memory footprint of the evaluated approaches. The memory demand is stated in terms of floating point numbers which need to be saved

6 Discussion

Binning velocities are common trick in histogram-based movement models since continuous movements can be treated as discrete events that are instances of one or another bin. A major issue that is usually not addressed properly is how to find the optimal size of bins for a problem at hand. Often, optimal sizes are not equidistant but grow with the values of the variable of interest, such as speed. The bins also induce hard thresholds that may lead to treating similar values very differently, if they end up in two neighboring bins. The presented contextual models do not have these limitations as they work directly on the continuous signal.

In the case of movement models, binning leads to maintaining several models and renders the space and time complexity of the models highly inefficient. By contrast, the contextual models do not rely on binning and simply condition on all variables of interest such as current velocity, time horizon, and even the position of the other players. We thus end up with only a single model which is efficient and allows for (near) real-time processing. Our experiments show that both contextual models are moreover achieving superior performance in terms of log-likelihood and clearly outperform all other approaches.

As another (straightforward) baseline, we incorporated a Gaussian in the experiments. Naturally, a Gaussian is trivial to estimate and turns out efficient in time and space. However, these advantages imply assumptions on the densities being unimodal and symmetric. While some scenarios may support these claims, others may suffer from this oversimplification, as shown in Fig. 8. It depicts the movements of a goal keeper and the point cloud has clearly two modes. Estimating the density of the points with a Gaussian introduces unnecessary errors by assigning a great deal of probability mass to unlikely events. Again, the proposed contextual models do not make any assumption on the nature of the densities and adapt to any multi-modal distribution of points.

Fig. 8
figure 8

An example of the set \(\mathbb {S}_{t_\varDelta }^{S}\) (left) and corresponding kernel density estimate (right) for a goal keeper

A practical application for movement models is to compute so-called zones of control that describe the areas which are controlled by a player. The underlying idea is that the payer who controls the zone is expected to arrive first at every position within this area (Taki and Hasegawa 2000; Gudmundsson and Wolle 2014; Horton et al. 2015; Spearman et al. 2017; Brefeld et al. 2019). Figure 9 shows an example. Using any of the CFlow models to compute the zones of control removes the necessity of binning velocities and time horizons; computation time and memory requirements are very low, even when training on massive amounts of data, compared to KDE-based approach in Brefeld et al. (2019). Our models thus allow for (near) real-time processing of live data and may be used in live analysis and broadcasts to visualize key situations.

Fig. 9
figure 9

Soccer pitch showing players and the controlled spaces per player derived from the individual movement models

Finally, while movement models, by definition, compute the area of possible (near-)future positions, they do not allow for predicting positions. Our contextual movement models, however, may overcome this issue by explicitly conditioning the models also on positions of other players (CFlow-extended) and the ball, ball possession indicators, etc. In principle, any CFlow should adapt to the current positioning of players and ball on the pitch, hence, shifting the probability mass toward areas that better represent motions of professional athletes given the current situation. Although these assumptions need to be confirmed in additional (empirical) studies and are clearly out of scope for the present paper, the sheer possibility indicates the potential impact and importance of this line of work.

7 Conclusion

In this paper, we studied the problem of learning movement models in a purely data-driven fashion, which we evaluated on data from professional soccer matches. We circumvented the limitations of previous approaches and cast the problem as a conditional density estimation task. We exploited characteristics of the low-dimensional problem formulation to devise conditional normalizing flows for modeling movements. In contrast to the state of the art, our contextual model consists of only a single model. Having a conditional model proved important for the integration of contextual information, such as velocities of movements. In principle, any relevant information can be used as context. Moreover, they outperformed all competitors by far when it came to predictive performance and turned out very efficient. Their computation times were comparable to trivial baselines and orders of magnitude faster than the state of the art.