1 Introduction

Trajectory data of simultaneously moving objects is key to analysing animal migration (Lee et al. 2008), transportation (Baud et al. 2007; Giannotti et al. 2007), tactics in team sports (Hirano and Tsumoto 2005; Kempe et al. 2014; Lucey et al. 2013; Wei et al. 2013), players and avatars in (serious) computer games (Kang et al. 2013; Pao et al. 2010), customer behaviour (Larson et al. 2005) as well as spread patterns of fires (Trunfio et al. 2011). A characteristic trait of many such applications is that trajectories of several objects are more informative than the trajectory of a single object. For instance, a single trajectory of a bird is not indicative of bird migration, as individuals may join or leave the flock (Lee et al. 2008), and a single trajectory of a soccer player does not reveal insights into the actual situation on the pitch (Grunz et al. 2012; Wei et al. 2013).

Therefore, trajectories of multiple objects need to be processed together. Although this insight sounds trivial, processing multiple trajectories simultaneously challenges the standard model of computation as trajectories may interdepend in time and space in multiple ways. To exploit these dependencies, it is necessary to establish a notion of similarity for spatio-temporal paths of multiple objects to identify frequent patterns. By definition, frequent patterns are formed by an a priori unknown subset of objects at unknown locations in time and space. Analysing multi-trajectory data is therefore inherently a combinatorial problem that involves processing data at large scales.

A second problem arises from existing methods for analysing spatio-temporal data. Traditional approaches often cannot deal with continuous spatial domains but rely on an appropriate discretisation of the data at hand (Kang and Yong 2008; Mamoulis et al. 2004; Mehta et al. 2005). However, finding an optimal discretisation a priori is difficult in many domains where only the final result allows conclusions on whether an initial set of atomic events is plausible or not (Kang and Yong 2008). Furthermore, many approaches cannot deal with permutations of the objects and differences in speed, while still being sensitive to differences in the direction of the motion (Hirano and Tsumoto 2005; Junejo et al. 2004; Wei et al. 2013).

We devise a novel class of convolution kernels for multi-trajectory data. It is specially tailored to multi-object scenarios, i.e. trajectories of multiple simultaneously moving objects. The kernel properties as well as the modular nature of the proposed class of kernels render it highly adaptive to different applications. Since it is a kernel, it can also naturally be deployed with any kernel machine. These three characteristics, multi-object scenario, modularity and kernel property, distinguish our approach from existing methods. Due to its distinct characteristics, our approach is more suitable for a large variety of applications, it is flexible with respect to the notion of similarity, and it is theoretically better grounded than most existing methods. Since the complexity of a kernel evaluation is quadratic in the number and lengths of the involved trajectories, we also propose an efficient percental approximation. Empirically, the method is evaluated on artificial datasets and on real-world tracking data from ten Bundesliga soccer matches. We generally observe that our convolution kernels lead to better clusterings than the baseline methods.

The remainder of this article is structured as follows. Section 2 reviews existing work. Section 3 introduces our spatio-temporal convolution kernel methods and Sect. 4 reports on empirical results. Section 5 concludes.

2 Related work

2.1 Trajectory clustering

Trajectory clustering, or clustering of spatio-temporal data, has been an active field of research in the past years. Existing approaches mainly focus on the application of video surveillance with the goal to detect anomalies in the data stream (Basharat et al. 2008; Fu et al. 2005; Hu et al. 2006; Jeong et al. 2011; Junejo et al. 2004; Saleemi et al. 2009). Other applications include automatic sports analysis, weather evolution modelling, animal migration and traffic analysis. Existing approaches rely mostly on processing single trajectories. Recent contributions in this area can be roughly grouped into similarity-based approaches (Buzan et al. 2004; Jinyang et al. 2011; Fu et al. 2005; Hirano and Tsumoto 2005; Hu et al. 2006; Junejo et al. 2004; Piciarelli et al. 2005) and motion-based approaches (Basharat et al. 2008; Wang et al. 2008; Jeong et al. 2011; Li and Chellappa 2010; Lin et al. 2009; Saleemi et al. 2009).

Similarity-based approaches define pairwise similarities between trajectories which are then processed by some clustering algorithm. Junejo et al. (2004) represent trajectories as a set of two-dimensional coordinates together with the Hausdorff distance. Subsequently, graph-cuts are deployed to recursively partition the trajectories. Hausdorff distances are also used to cluster trajectories by Jinyang et al. (2011) where not only the position but also the direction of the trajectories is taken into account by using 4-tuples \((x,y,\mathrm {d}x,\mathrm {d}y)\) instead of coordinates only. Fu et al. (2005) first resample trajectories to obtain constant between-point distances. Then the corresponding points of two trajectories are compared using an RBF kernel where the longer trajectory is cut to the length of the shorter one. Spectral clustering is then used together with a symmetric normalised Laplacian.

Buzan et al. (2004) extend the longest common subsequence algorithm to three-dimensional coordinates and use a modified version of agglomerative hierarchical clustering. Hirano and Tsumoto (2005) deploy multi-scale matchings to compare trajectories. The basic idea is to generate trajectories at different scales as convolutions of the trajectory and Gaussian kernels with different standard deviations. Their similarity measure is then based on the hierarchical structure of the trajectory segments at different scales. Subsequently, a rough clustering is employed. Piciarelli et al. (2005) define a trajectory-to-cluster similarity by the average Euclidean distance of trajectory coordinates to the nearest cluster coordinate where offsets in time induce negative weights.

Our approach also belongs to these similarity-based methods. In general, similarity-based approaches suffer from two major drawbacks. First, their computational complexity is at least quadratic in the number of trajectories. Second, they rely on clustering full trajectories and are hence sensitive to tracking errors and sub-trajectories. While the first drawback is inherent to all similarity-based methods, our distribution-based approach with gradual weighting mitigates the effects of noise and tracking errors and is able to identify partial matchings between trajectories.

In contrast to similarity-based approaches, motion-based approaches focus on local movements of objects to derive models for the overall (group) motion in a scene. Wang et al. (2008) and Jeong et al. (2011) represent a trajectory by bags of positions as well as directions, based on the bag-of-words representation of documents in natural language processing. To this end, the spatial domain is discretised and the number of occurrences of each position in a trajectory is counted. Grimson et al. also take temporal information into account by counting the occurrences of each (discretised) direction in a trajectory. The topic model Dual-HDP (Wang et al. 2008) is used to find semantic regions, which are combined to form the different trajectories. Jeong et al. use latent Dirichlet allocation (Blei et al. 2003) to obtain semantic regions. To incorporate temporal information, a hidden Markov model is trained for each topic based on the sequences which are close to the topic. Saleemi et al. (2009) propose kernel density estimation to learn a five-dimensional distribution of transitions from \((x_1, y_1)\) to \((x_2, y_2)\) in time t. Markov chain Monte Carlo (Andrieu et al. 2003) is then deployed to sample the most likely paths given the learned transition probabilities.

Basharat et al. (2008) also learn a model for transition probabilities. Instead of kernel density estimation, a Gaussian mixture model is fitted to the observed transitions. Lin et al. (2009) exploit the Lie algebraic structure of affine transformations to learn a flow model consisting of overlapping two-dimensional Gaussian distributions, each of which corresponds to an affine transform dominant in this spatial area. The approach is applied to pedestrians in a train station and optical flows obtained from satellite images. Li and Chellappa (2010) use a similar Lie algebraic representation called the spatial hybrid driving force model, which, as opposed to Lin et al. (2009), evolves over time. This model is used to solve the so-called group motion segmentation problem, i.e. to answer the question of which objects take part in an organised group motion and which do not.

Motion-based approaches also have inherent limitations. First, they often neglect temporal information, at least of second order (curvature). Second, they do not provide a mapping of the input trajectories to groups of similar trajectories but rather describe the combined motion of all objects in all trajectories over time. Our approach differs methodologically from the summarised techniques in several ways. First, it provides a general framework that covers many applications and properties, as opposed to being a very specific similarity measure tailored to a single application domain. Second, our approach is specialised on multiple simultaneously moving objects instead of focusing only on trajectories of single objects. Third, being a kernel, the similarity measure is straightforwardly applicable to a broad range of algorithms and is theoretically well grounded, in contrast to heuristic approaches.

2.2 Sports analytics

Current approaches in the area of sports game trajectory analysis either aim to define objective performance measures for players (Kang et al. 2006), classify (Bialkowski et al. 2013; Hervieu and Bouthemy 2010; Intille and Bobick 2001; Grunz et al. 2012; Perše et al. 2009; Siddiquie et al. 2009) or cluster (Hirano and Tsumoto 2005; Wei et al. 2013) plays/trajectories, or learn a motion model for team behaviour (Bialkowski et al. 2014; Direkoglu and O’Connor 2012; Kim et al. 2010; Li et al. 2009; Li and Chellappa 2010; Lucey et al. 2013; Zhu et al. 2007).

Kang et al. (2006) define performance metrics for soccer players based on the definition of owned and competitive regions of the field, which are derived from ball and player trajectories. Siddiquie et al. (2009) represent videos of football plays by a bag-of-features from histograms of optical flows as well as histograms of oriented gradients. Spatio-temporal pyramid matching (Lazebnik et al. 2006) is used to generate a kernel for each visual word. Football plays are then classified into seven categories using multiple kernel learning. Hervieu and Bouthemy (2010) use a hierarchical parallel semi-Markov model to classify different activity states in squash and handball, such as rallies, free throws and defence. The first layer describes the activity states, while the second layer consists of a parallel hidden Markov model for each feature representing the trajectories.

Perše et al. (2009) represent team activity in basketball using team centroids to hierarchically classify situations with Gaussian mixture models. Thereafter, each situation is converted into a string, which is compared to templates for classification. Bialkowski et al. (2013) use team centroids and occupancy maps to classify game situations in field hockey (corners, goals), emphasising the robustness of this representation to tracking noise. Grunz et al. (2012) employ self-organising maps to identify long and short game initiations in soccer, and Hirano and Tsumoto (2005) use multi-scale comparison and rough clustering to cluster ball trajectories that lead to goals.

Direkoglu and O’Connor (2012) solve a special Poisson equation, in which the player positions determine the location of source terms. The derived distribution and its development over time defines a so-called region of interest used to describe the team behaviour. Wei et al. (2013) use role models (Lucey et al. 2013) and a Bilinear spatio-temporal basis model to represent team movement to cluster goal scoring opportunities in soccer. Bialkowski et al. (2014) also use role models to automatically detect and compare the formations of soccer teams. Li and Chellappa (2010) learn a spatio-temporal driving force model to identify offence and defence players in football. Kim et al. (2010) interpolate a dense motion field from player trajectories using thin-plate splines. This motion field is further investigated for points of convergence to predict where the game will evolve in short term.

From an application point of view, our approach is most comparable to Wei et al. (2013) and Grunz et al. (2012). While Wei et al. focus on scoring opportunities and Grunz et al. study game initiations, we consider both situations in this study. Similar to Bialkowski et al. (2013), our method proves robust to tracking noise.

3 Spatio-temporal convolution kernels

3.1 Representation

Multi-object trajectory analysis is concerned with a possibly varying number of moving objects \({\mathcal {O}}_t\) in a set X, e.g. \(X = {\mathbb {R}}^2\), over a finite period of time \({\mathcal {T}} \subset {\mathbb {N}}\). A multi-object trajectory is composed of snapshots of the object positions at different times. Depending on the context and application at hand, one of the following two formalisations of a snapshot is more appropriate.

Definition 1

(Object-oriented Snapshot) Assume the number of objects to be constant over time, i.e. \({\mathcal {O}}_t = {\mathcal {O}} = \{o_1,\ldots , o_N\}\) for \(N \in {\mathbb {N}}\). Then the object-oriented snapshot of all objects at time \(t \in {\mathcal {T}}\) is denoted by \( x_{t} \in X^N =: {\mathcal {X}}.\) We call \({\mathcal {X}}\) the snapshot space. The position of a particular object \(o \in {\mathcal {O}}\) is denoted by \( x_{t}(o) \in X.\)

Definition 2

(Group-oriented snapshot) Assume there is a constant number of groups \(K \in {\mathbb {N}}\). Moreover, at every point in time each object can be associated with exactly one of the groups \({\mathcal {G}} = \{g_1,\ldots ,g_K\}\). Then the group-oriented snapshot of all objects at time \(t \in {\mathcal {T}}\) is denoted by

$$\begin{aligned} x_{t} \in {\mathcal {P}}(X)^K =: {\mathcal {X}}. \end{aligned}$$

We call \({\mathcal {X}}\) the snapshot space. The positions of all objects of a particular group \(g \in {\mathcal {G}}\) are denoted by \(x_{t}(g) \in {\mathcal {P}}(X).\)

The group members of group g in snapshot \(x_t\) are denoted by \(O_{x_t}(g) \subset {\mathcal {O}}_t.\)

The implications of the two definitions are as follows. First, the object-oriented snapshot representation only allows a fixed number of objects, whereas the group-oriented representation is not limited in that respect. Second, in the group-oriented snapshot, objects inside a group are indistinguishable. On the one hand, this property allows for permutations of objects; on the other hand, it naturally also entails ambiguities.

Instead of an ordered sequence of positions or snapshots, we use a set of time/position-pairs to represent trajectories. Thereby, time and order are explicitly represented, as opposed to the more implicit sequence representation.

Definition 3

(Trajectory) A trajectory is defined as a finite subset

$$\begin{aligned} \tilde{P} = \{(\tilde{t}_1, x_{\tilde{t}_1}),\ldots ,(\tilde{t}_n, x_{\tilde{t}_n})\} \subset {\mathcal {T}}\times {\mathcal {X}}, \end{aligned}$$

such that \(\tilde{t}_i \ne \tilde{t}_j\) for \(i\ne j\), i.e. the trajectory set \(\tilde{P}\) contains only one snapshot per point in time.

The set \(\pi _{{\mathcal {T}}}(\tilde{P}) = \{t \in {\mathcal {T}}: \exists (\tilde{s}, x_{\tilde{s}}) \in \tilde{P} \text { s. t. } t= \tilde{s}\}\) contains all timestamps of the trajectory and is usually of the form \(\{K, K+1,\ldots , K+L\}\) for some natural numbers K and L. When comparing trajectories it is insignificant at what absolute time the trajectories start. This gives rise to the following definition.

Definition 4

(Time-normalised trajectory) The time-normalised trajectory \(P \subset [0,1]\times {\mathcal {X}}\) corresponding to trajectory \(\tilde{P}\) is defined by normalising its time-scale to [0, 1]. This corresponds to the trajectory P, given by

$$\begin{aligned} P = \{(t, x_t) : \exists (\tilde{t}, x_{\tilde{t}}) \in \tilde{P} \text { s.t. } \tilde{t} = \mu + t(\max (\pi _{{\mathcal {T}}}(\tilde{P})) - \mu ) \wedge x_{\tilde{t}} = x_t\}, \end{aligned}$$

where \(\mu =\min (\pi _{{\mathcal {T}}}(\tilde{P}))\).

In the remaining part of this study we refer to time-normalised trajectories simply by trajectories.
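For concreteness, the following minimal sketch implements Definition 4 in Python; the list-of-pairs representation and the convention for single-snapshot trajectories are illustrative choices, not part of the formal definition.

```python
def time_normalise(trajectory):
    """Map the timestamps of a trajectory, given as a list of
    (timestamp, snapshot) pairs, onto the interval [0, 1] (Definition 4)."""
    times = [t for t, _ in trajectory]
    mu, mx = min(times), max(times)
    span = mx - mu
    # A trajectory with a single snapshot is mapped to time 0 by convention.
    return [((t - mu) / span if span > 0 else 0.0, x) for t, x in trajectory]
```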

3.2 Problem setting

One of the main advantages of kernel methods is the separation of algorithm and data. Following this paradigm, we focus on defining a kernel on the set of multi-object trajectories. Once a kernel has been defined, off-the-shelf kernel machines can be applied to generate models, such as support vector machines (Vapnik 1995), kernelised k-medoids (Kaufman and Rousseeuw 1987), or spectral clustering (Ng et al. 2001). The formal problem setting of this article is defined as follows. On the set of multi-object trajectories \(M \subset {\mathcal {P}}([0,1]\times {\mathcal {X}})\) we aim to develop a similarity measure \(k:{\mathcal {P}}([0,1]\times {\mathcal {X}}) \times {\mathcal {P}}([0,1]\times {\mathcal {X}}) \rightarrow {\mathbb {R}}\), such that

(I) the absolute position as well as the shape of the trajectories is incorporated,

(II) the measure is invariant to permutations of certain objects, i.e. for two trajectories \(P_1\), \(P_2\) with

$$\begin{aligned} P_2 = \{(t, x_t) : \exists \text { permutation } \sigma \ \forall (s, y_s) \in P_1 \text { s. t. } t=s \wedge x_t = \sigma (y_s)\} \end{aligned}$$

it holds that \(k(P_1, P_2) = 1\). In case of the group-oriented snapshot this already holds by definition if the permuted objects are members of the same group,

(III) the measure is invariant with respect to the speed of the movement. Since all trajectories have already been normalised to the same time scale, differences in speed are mainly reflected in the cardinality of the trajectory sets. So, for example, given two trajectories \({P}_1\) and \({P}_2\) with \(|{P}_1| = 2|{P}_2|\) and

$$\begin{aligned} {P}_2 = \{({t}, x_{{t}}) : \exists ({{s}}, y_{{s}}) \in {P}_1 \text { s. t. } {t}={2s} \wedge x_{{t}} = y_{{2s}}\} \end{aligned}$$

it holds that \(k(P_1, P_2) \approx k(P_1, P_1)\),

(IV) similar movements of two sets of objects are recognised as such in the presence of deviations of single objects and outliers.

Moreover, the measure should have the following properties:

(V) kernel property, i.e.

$$\begin{aligned} k(P_1, P_2) = \langle \phi (P_1), \phi (P_2)\rangle _{\mathcal {F}} \end{aligned}$$

for some, usually unknown, feature map \(\phi \) and inner product space \(\mathcal {F}\),

(VI) broad applicability, i.e. few application-specific parameters and no restrictions on the space X,

(VII) computational efficiency.

Note that properties (I) to (IV) formalise an intuitive notion of similarity. (I) says that shape and position matter. (II) requires that it does not matter if two similar objects swap roles. (III) formalises that we do not care much about differences in speed.

Property (IV) demands robustness with respect to outlier trajectories. Further note that, for example, dynamic time warping (Bellman and Kalaba 1959) meets condition (III) very well, but does not comply with (V), (VI) and (VII): it is not a kernel, it is only applicable if the underlying set is a metric space, and it is computationally expensive. The Hausdorff distance (Hausdorff 1962) satisfies (I), (III) and (VII), but not (II), (V) and (VI), since it is sensitive to permutations, is only applicable to metric spaces and is not a kernel. A Gaussian RBF kernel on the full vector of positions meets conditions (I) (restricted), (V) and (VII), but is not applicable to sequences of different lengths; it also violates (II), (III) and (VI), since it is highly sensitive to variations in speed and permutations and is restricted to metric spaces.

3.3 Spatio-temporal convolution kernels for multi-trajectories

In this section we develop a kernel on the space of (time-normalised) multi-trajectories \({\mathcal {P}}({[0,1]\times {\mathcal {X}}})\). Each of those trajectories consists of a set of snapshots associated with a relative time. The general idea is to perform a pairwise comparison of the snapshots in the two sets. Therefore, we first need a way to compare snapshots and, second, we need to know which snapshots of the two trajectory sets to compare with each other. For the latter, dynamic time warping (DTW) (Bellman and Kalaba 1959) seems to be a good choice, since it aligns the snapshots optimally in terms of similarity. Unfortunately, the obtained kernel is not positive definite, i.e. it does not correspond to an inner product in some Hilbert space. Although there is anecdotal evidence that learning with indefinite kernels can lead to good results in some applications (e.g. Ong et al. 2004), theory only supports the use of positive definite kernels. For many kernel machines there are error bounds and convergence criteria that can be straightforwardly applied to positive definite kernels but that do not hold for indefinite kernels (Blanchard et al. 2008; Lin 2001; Steinwart 2002).

Therefore, we propose a weighted comparison between every snapshot of the first trajectory and every snapshot of the second one where the weights depend on the offset in relative time. Formally, this is done using an R-convolution kernel (Haussler 1999) on the two sets representing the trajectories. Convolution kernels are a general class of kernels on structured objects \(x,y \in X\). The idea is to compare instances x and y by comparing their parts \((x_1,\ldots ,x_D),(y_1,\ldots ,y_D) \in X_1\times \cdots \times X_D\). Thus, a relation function R is needed to express that something is a part of some structure.

Definition 5

(Relation) Let \(X,X_1,\ldots ,X_D\) be arbitrary sets. Then a relation

$$\begin{aligned} R: X\times X_1\times \cdots \times X_D \rightarrow \{0,1\} \end{aligned}$$

is an arbitrary boolean function that returns 1 if and only if \((x_1,\ldots ,x_D)\) are parts of x. The set of parts of \(x\in X\) under relation R is denoted by

$$\begin{aligned} R^{-1}(x) = \{(x_1,\ldots ,x_D) \in X_1\times \cdots \times X_D: R(x,x_1,\ldots ,x_D)=1\} \end{aligned}$$

and R is called finite if \(R^{-1}(x)\) is finite for every \(x\in X\).

Definition 6

(R-convolution kernel) Let \(X,X_1,\ldots ,X_D\) be arbitrary sets. Let \(x,y \in X\) and \(R: X\times X_1\times \cdots \times X_D \rightarrow \{0,1\}\) be a finite relation. Moreover, let \(k_1,\ldots ,k_D\) be kernels on \(X_1,\ldots ,X_D\). Then the R-convolution kernel on X is defined by

$$\begin{aligned} k(x, y) = \sum \limits _{\begin{array}{c} (x_1,\ldots , x_D) \in R^{-1}(x), \\ (y_1,\ldots ,y_D) \in R^{-1}(y) \end{array}} \prod \limits _{d=1}^D k_d(x_d,y_d) \end{aligned}$$

The following theorem shows that an R-convolution kernel is indeed a (positive-definite) kernel.

Theorem 1

Let \(X,X_1,\ldots ,X_D\) be arbitrary sets. Let R be a finite relation and let \(k_1,\ldots ,k_D\) be kernels on \(X_1,\ldots ,X_D\). Then the R-convolution kernel k given by Definition 6 is a kernel.

Proof

For the proof we refer to Haussler (1999) Theorem 1 and Lemma 1, which are essentially more involved applications of closure properties of kernels. \(\square \)

In our case the structure is a multi-object trajectory and its parts are the snapshots, the times of the snapshots and the length of the trajectory. The relation between the structure and its elementary components is given by

$$\begin{aligned} R: \underbrace{{\mathbb {N}}}_{\text {Length of Traj.}} \times \underbrace{[0,1]}_{\text {Time of Snapshot}} \times \underbrace{{\mathcal {X}}}_{\text {Snapshot}} \times \underbrace{{\mathcal {P}}([0,1]\times {\mathcal {X}})}_{\text {Trajectory}} \rightarrow \{0,1\}. \end{aligned}$$

To extract the snapshots with their associated times and the length of the trajectory from the trajectory, R needs to be defined as follows:

$$\begin{aligned}&R(n, t, x, P) = {\left\{ \begin{array}{ll} 1 &{}\text{ if } |P| = n \wedge \exists (s, y_s) \in P : (t,x) = (s,y_s) \\ 0 &{} \text{ otherwise. } \end{array}\right. } \end{aligned}$$
(1)

The first condition in Eq. 1 ensures that the given trajectory has the given length, while the second condition guarantees the occurrence of the given snapshot at the given time in the given trajectory. With this relation and the kernels to compare the parts the resulting convolution kernel is given by

$$\begin{aligned} k(P, Q) = \sum \limits _{\begin{array}{c} (n,t, x) \in R^{-1}(P), \\ (m,s, y) \in R^{-1}(Q) \end{array}} k_{{\mathbb {N}}}(n,m) \cdot k_{[0,1]}(t, s) \cdot k_{{\mathcal {X}}}(x, y) \end{aligned}$$
(2)

with \(R^{-1}(P) = \{(n,t,x): R(n,t,x,P) =1 \}\). The term \(k_{{\mathbb {N}}}\) accounts for differences in the lengths of the trajectories by normalisation, i.e. \(k_{{\mathbb {N}}}(n,m) = \frac{1}{nm}\). Note that \(k_{{\mathbb {N}}}\) is a kernel on \({\mathbb {N}}\), since it corresponds to the standard inner product under the feature map \(\phi : {\mathbb {N}} \rightarrow {\mathbb {R}}\) given by \(\phi : n \mapsto \frac{1}{n}\). Finally, the R-convolution kernel simplifies to

$$\begin{aligned} k(P, Q) = \frac{1}{|P||Q|}\sum \limits _{(t, x_t) \in P, (s, y_s) \in Q} k_{[0,1]}(t, s) \cdot k_{{\mathcal {X}}}(x_t, y_s). \end{aligned}$$
(3)

Theorem 2

The spatio-temporal convolution kernel (Eq. 3) is a kernel if the temporal kernel \(k_{[0,1]}\) and the spatial kernel \(k_{{\mathcal {X}}}\) are themselves kernels.

Proof

By Theorem 1 we need to show that R is finite and the component kernels \(k_{\mathbb {N}}\) and \(k_{[0,1]}\), \(k_{{\mathcal {X}}}\) are indeed kernels. First, for all \(P \in {\mathcal {P}}([0,1]\times {\mathcal {X}})\) it holds that \(|R^{-1}(P)| = |P| < \infty \) by Definition 3, so R is finite. Second, \(k_{[0,1]}\) and \(k_{{\mathcal {X}}}\) are kernels by assumption and we have just shown that \(k_{\mathbb {N}}\) is also a kernel. Hence, the spatio-temporal convolution kernel is a kernel.\(\square \)

As indicated, Eq. 3 can be interpreted such that all snapshots of the two trajectories are compared with each other, but weighted by their offset in time. Thereby snapshots which occur at different relative times have a low contribution to the overall similarity, while snapshots at the same relative time have a high contribution. This is why \(k_{[0,1]}\) is sometimes also referred to as weight function. The definition of k in Eq. 3 leaves two degrees of freedom:

  • Spatial kernel \(k_{{\mathcal {X}}}\): The choice of the snapshot kernel determines which snapshots are similar.

  • Temporal kernel \(k_{[0,1]}\): The choice of the temporal kernel determines the way in which the snapshots of two sequences are combined, and thus the importance of ordering and speed.

In the following subsections we develop and compare different spatial and temporal kernels.
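To make the modular structure explicit, Eq. 3 can be sketched in a few lines of Python; the representation of trajectories as lists of (time, snapshot) pairs and the kernel callables are illustrative assumptions.

```python
def stck(P, Q, k_temporal, k_spatial):
    """Spatio-temporal convolution kernel of Eq. 3 for two time-normalised
    trajectories P and Q, given as lists of (t, snapshot) pairs."""
    total = 0.0
    for t, x in P:          # every snapshot of the first trajectory ...
        for s, y in Q:      # ... is compared to every snapshot of the second
            total += k_temporal(t, s) * k_spatial(x, y)
    return total / (len(P) * len(Q))
```

Any choice of \(k_{[0,1]}\) and \(k_{{\mathcal {X}}}\) developed below can be plugged in without changing this outer loop.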

3.4 Spatial kernels

A spatial kernel compares two snapshots in X. Corresponding to the two definitions of the snapshot in Definition 1 (object-oriented) and Definition 2 (group-oriented), two types of kernels are introduced here as well.

Object-Wise Comparison These kernels correspond to the object-oriented snapshot by simply comparing the positions of the objects \({\mathcal {O}}\) and summing up their similarity:

$$\begin{aligned} k_{{\mathcal {X}}}(x_t, y_t) = \frac{1}{|{\mathcal {O}}|}\sum \limits _{o \in {\mathcal {O}}} k_{X}(x_t(o), y_t(o)). \end{aligned}$$
(4)

Note that Eq. 4 is a kernel, since kernels are closed under direct sum and multiplication by a positive constant (see Shawe-Taylor and Cristianini (2004) Proposition 3.22). Since kernels are also closed under direct product, technically a product could have been used instead of the sum in Eq. 4. However, a product of kernels leads to vanishing similarities if only one object is dissimilar to its counterpart. This is counterintuitive as two snapshots are the more similar the more objects match well. It is thus an inherently additive relationship. For the same reason a product of kernels is more vulnerable to noise and outliers and generally less robust than a sum of kernels. The elementary kernel \(k_{X}\) can be any kernel comparing two positions in X. Table 1 lists common kernels for finite as well as continuous snapshot spaces.

Table 1 Elementary spatial kernels
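As an illustration, a sketch of Eq. 4 with a Gaussian RBF elementary kernel; the width \(\sigma = 0.2\) mirrors the value used in Sect. 4, and the array representation is an assumption.

```python
import numpy as np

def k_objectwise(x, y, sigma=0.2):
    """Object-wise spatial kernel (Eq. 4) with a Gaussian RBF elementary
    kernel; x and y are (N x 2) arrays of object positions in matched order."""
    sq_dists = np.sum((x - y) ** 2, axis=1)       # per-object squared distances
    return float(np.mean(np.exp(-sq_dists / (2 * sigma ** 2))))
```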

Kernels as in Eq. 4 have two major shortcomings. First, they penalise permutations of objects. For example, two snapshots of two objects with swapped positions will have low similarity, although in terms of the group motion they are alike. One way to address permutations is to explicitly maximise the similarity of the two snapshots with respect to all possible permutations of objects. Due to the high computational costs of considering all possible permutations, this is infeasible. In addition, the bandwidth parameter \(\sigma \) that controls the interval of high sensitivity, i.e. large values of \(\left| {\mathrm {d}k}/{\mathrm {d}(\Vert x-y\Vert )}\right| \), of the kernel has to be known beforehand. This is critical since one usually does not know on which scale significant deviations appear. Both issues are addressed by the group-wise comparison of objects.

Group-Wise Comparison In case of the group-oriented snapshot there is a partition of \({\mathcal {O}}_t\) into K sets \(O_{x_t}(g_1), O_{x_t}(g_2), \ldots , O_{x_t}(g_K)\). Instead of comparing N objects, K groups of objects are compared, which leads to the following definition of a spatial kernel

$$\begin{aligned} k_{{\mathcal {X}}}(x_t, y_t) = \frac{1}{K}\sum \limits _{g \in {\mathcal {G}}} k_{G}(x_t(g), y_t(g)) \end{aligned}$$
(5)

with kernel \(k_G\) applied on sets of positions. For the same reasons as in the previous paragraph, a sum of kernels is preferable to a product of kernels. Definition 2 allows for zero group members in a snapshot, i.e. \(x_t(g) = \emptyset \). For all kernels defined below this special case is dealt with as follows:

$$\begin{aligned}&k_{G}(x, y) = {\left\{ \begin{array}{ll} 1 &{} \text{ if } x = y = \emptyset \\ 0 &{} \text{ otherwise. } \end{array}\right. } \end{aligned}$$

The definition of \(k_G\) very much depends on the underlying set of positions X. There are two main classes of sets X: first, X may be a metric space, i.e. a distance measure between all positions exists; second, X may be a finite set in which no meaningful distance between positions can be defined. In the case of a metric space we will subsequently only consider \(X={\mathbb {R}}^2\) with the standard Euclidean distance, as it is the most common in applications and can be directly extended to \(X = {\mathbb {R}}^N\).

Euclidean Space: X = \({\mathbb {R}}^{2}\) In the case of \(X = {\mathbb {R}}^2\) a straightforward approach is to use one of the kernels from Table 1 to compare the group centroids

$$\begin{aligned} \mu _{x_t}(g) = \frac{1}{|O_{x_t}(g)|} \sum \limits _{o \in O_{x_t}(g)} x_t(o) \quad \text {and}\quad \mu _{y_t}(g) = \frac{1}{|O_{y_t}(g)|} \sum \limits _{o \in O_{y_t}(g)} y_t(o), \end{aligned}$$

which leads to

$$\begin{aligned} k_{G}(x_t(g), y_t(g)) = k_X(\mu _{x_t}(g), \mu _{y_t}(g)). \end{aligned}$$

Besides requiring the width parameter \(\sigma \), this approach disregards the distribution of the objects around their centroids. To remedy both deficiencies, two Gaussian distributions are fitted to \(x_t(g)\) and \(y_t(g)\), respectively, which we denote by \(f_{x_t}(g)\) and \(f_{y_t}(g)\). These distributions are defined by their means \(\mu _{x_t}(g)\) and \(\mu _{y_t}(g)\) as defined above and their covariance matrices given by

$$\begin{aligned}&\varSigma _{x_t}(g) = \frac{1}{|O_{x_t}(g)|-1} \sum \limits _{o \in O_{x_t}(g)} (x_t(o)-\mu _{x_t}(g))(x_t(o)-\mu _{x_t}(g))^T \\ \text {and }&\varSigma _{y_t}(g) = \frac{1}{|O_{y_t}(g)|-1} \sum \limits _{o \in O_{y_t}(g)} (y_t(o)-\mu _{y_t}(g))(y_t(o)-\mu _{y_t}(g))^T. \end{aligned}$$

If the covariance matrix is ill-conditioned or singular, a simple shrinking scheme with shrinkage parameter \(\alpha \) can be applied to achieve non-singularity:

$$\begin{aligned} \varSigma = (1- \alpha ) \cdot \varSigma + \alpha \cdot \frac{{{\mathrm{Tr}}}(\varSigma )}{2} {\mathbb {I}}_2, \end{aligned}$$

where \({\mathbb {I}}_2\) is the two-dimensional identity matrix. There exist different strategies (Ledoit and Wolf 2004; Chen et al. 2010) to choose an optimal value for \(\alpha \), but for our purposes it suffices to deploy a constant 0.1. In the case of \({{\mathrm{Tr}}}(\varSigma ) = 0\) the following scheme is used

$$\begin{aligned} \varSigma = (1- \alpha ) \cdot \varSigma + \alpha \cdot \sigma _{\text {MIN}}^2 \cdot {\mathbb {I}}_2 = \alpha \cdot \sigma _{\text {MIN}}^2 \cdot {\mathbb {I}}_2 , \end{aligned}$$

where \(\sigma _{\text {MIN}}^2\) is an application-specific parameter. Strategies for choosing \(\sigma _{\text {MIN}}^2\) are discussed in Sect. 4. Reasons for the matrix to be singular are \(|O_{x_t}(g)| \le 2\) or strong collinearity of the samples. The two Gaussian distributions \(f_{x_t}(g)\) and \(f_{y_t}(g)\) are then compared using a probability product kernel (Jebara et al. 2004), i.e.

$$\begin{aligned} k_{G}(x_t(g), y_t(g)) = k^{\text {prod}}(f_{x_t}(g), f_{y_t}(g)). \end{aligned}$$

A probability product kernel compares two probability distributions, i.e. their density functions p and \(p'\), defined on a common probability space \((\varOmega , \mathcal {A}, \mu )\). It is defined as follows.

Definition 7

(Probability Product Kernel) Let p and \(p'\) be probability density functions defined on the same probability space \((\varOmega , \mathcal {A}, \mu )\). Assume that \(p^\rho ,p'^\rho \in L^2(\varOmega )\) for \(\rho \in {\mathbb {R}}^+\). Then the probability product kernel (PPK) \(k^{\text {prod}}: L^2(\varOmega ) \times L^2(\varOmega ) \rightarrow {\mathbb {R}}\) is given by

$$\begin{aligned}&k^{\text {prod}}(p,p') = \int \limits _{\varOmega } p(\omega )^\rho p'(\omega )^\rho \,\mathrm {d\mu }(\omega ). \end{aligned}$$

Lemma 1

\(k^{\text {prod}}\) is a (positive-definite) kernel.

Proof

Consider the feature map \(\phi : L^2(\varOmega ) \rightarrow L^2(\varOmega )\) given by

$$\begin{aligned}&\phi (p) = p^\rho , \end{aligned}$$

which is well-defined because of above \(L^2\) property. Then

$$\begin{aligned} k^{\text {prod}}(p, p') = \langle \phi (p), \phi (p')\rangle _{L^2(\varOmega )} \end{aligned}$$

and \(k^{\text {prod}}\) is a kernel corresponding to the scalar product in \(L^2(\varOmega )\)

$$\begin{aligned} \int \limits _{\varOmega } f(\omega )\cdot g(\omega )\,\mathrm {d\mu }(\omega ). \end{aligned}$$

\(\square \)

In this study we will use \(\rho = 1/2\), which has the important property that \(\text {Im}\left( k^{\text {prod}}\right) = [0,1]\) and \(k^{\text {prod}}(p,p) = 1\). For two two-dimensional Gaussian distributions and \(\rho = 1/2\) the probability product kernel permits the following closed form:

$$\begin{aligned} \begin{aligned} k_{G}(x_t, y_t) =&\int \limits _{{\mathbb {R}}^2}\left( f_{x_t}(z) f_{y_t}(z)\right) ^{1/2}\,\mathrm {d}z = 2 \cdot |\varSigma ^*|^{\frac{1}{2}}|\varSigma _{x_t}|^{-\frac{1}{4}} |\varSigma _{y_t}|^{-\frac{1}{4}} \\&\cdot \exp \left( -\frac{1}{4}\left( \mu _{x_t}^T\varSigma _{x_t}^{-1}\mu _{x_t} + \mu _{y_t}^T\varSigma _{y_t}^{-1}\mu _{y_t}- \mu ^{*T} \varSigma ^* \mu ^* \right) \right) \end{aligned} \end{aligned}$$
(6)

with \(\varSigma ^* = (\varSigma _{x_t}^{-1}+\varSigma _{y_t}^{-1})^{-1}\) and \(\mu ^{*} = \varSigma _{x_t}^{-1}\mu _{x_t} + \varSigma _{y_t}^{-1}\mu _{y_t}\). Besides its sensitivity to the distribution of the objects inside the group, \(k^{\text {prod}}\) has the advantage of not requiring a width parameter. The kernel essentially adapts to the scale of the group.
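A sketch of this group-wise comparison is given below. The shrinkage constant follows the text (\(\alpha = 0.1\)); \(\sigma _{\text {MIN}}^2\) is passed in by the caller. Instead of transcribing Eq. 6 literally, the kernel is computed via the Bhattacharyya coefficient, the standard equivalent form of the \(\rho = 1/2\) probability product kernel for Gaussians.

```python
import numpy as np

def fit_gaussian(points, alpha=0.1, sigma_min_sq=1.0):
    """Fit mean and (shrunk) covariance to a group's positions (n x 2 array)."""
    mu = points.mean(axis=0)
    cov = np.cov(points, rowvar=False, ddof=1) if len(points) > 1 else np.zeros((2, 2))
    tr = np.trace(cov)
    if tr > 0:
        cov = (1 - alpha) * cov + alpha * (tr / 2) * np.eye(2)   # shrinkage scheme
    else:
        cov = alpha * sigma_min_sq * np.eye(2)                   # degenerate case
    return mu, cov

def ppk_gaussian(mu1, cov1, mu2, cov2):
    """rho = 1/2 probability product kernel (Bhattacharyya coefficient) of two
    two-dimensional Gaussians, equivalent to the closed form of Eq. 6."""
    cov_bar = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    d_b = (0.125 * diff @ np.linalg.solve(cov_bar, diff)
           + 0.5 * np.log(np.linalg.det(cov_bar)
                          / np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2))))
    return float(np.exp(-d_b))
```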

Finite Case: \(|X|=L< \infty \) For finite \(X = \{z_1,\ldots ,z_L\}\), we can generalise to the group-oriented snapshot by comparing the number of objects of each group at each possible location in X:

$$\begin{aligned} k_{G}(x_t(g), y_t(g)) = \frac{1}{L}\sum \limits _{z \in X} {\mathbb {I}}_{\{n_{z}(x_t(g))\}}(n_{z}(y_t(g))) \end{aligned}$$
(7)

with the indicator function \({\mathbb {I}}_A(x) = 1\) if \(x \in A\) and 0 otherwise, where \(n_z(x_t(g)) = |\{o \in O_{x_t}(g) : x_t(o) = z\}|\) denotes the number of objects of group \(g \in {\mathcal {G}}\) at position \(z \in X\) at time \(t \in [0,1]\).

Lemma 2

\(k_{G}\) as defined in Eq. 7 is a kernel.

Proof

Let \(l^2(0,1)\) be the Hilbert space of 0/1-sequences equipped with the standard scalar product. Consider \(\phi : X \rightarrow l^2(0,1)^{L}\) with

$$\begin{aligned}&\phi (x_t(g))_{m,l} = {\left\{ \begin{array}{ll} 1 &{}\text{ if } m \text{ objects are in position } l \\ 0 &{} \text{ otherwise } \end{array}\right. } \end{aligned}$$

for \(l = 1,\ldots ,L\) and \(m = 0,1,2,\ldots \) Using this feature mapping, \(k_{G}\) can be rewritten to

$$\begin{aligned} k_{G}(x_t(g), y_t(g)) = \langle \phi (x_t(g)), \phi (y_t(g)) \rangle , \end{aligned}$$

showing that it is indeed a kernel. Note that the inner product of a product space is defined by the sum of the inner products of its components. \(\square \)
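A direct sketch of Eq. 7, assuming groups are given as lists of positions drawn from the finite location set:

```python
def k_finite_group(xg, yg, locations):
    """Group-wise kernel for finite X (Eq. 7): the fraction of locations at
    which both groups place the same number of objects."""
    matches = sum(xg.count(z) == yg.count(z) for z in locations)
    return matches / len(locations)
```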

If the number of possible locations is much higher than the number of objects, the above kernel (Eq. 7) is inappropriate, as it will return similarities close to one for every pair of snapshots because in both snapshots there are no objects at almost all locations. Alternatively, the idea of a probability product kernel can also be applied to the finite setting by fitting a multinomial distribution to the positions of the group and comparing the two resulting distributions \(p_{x_t}(g)\) and \(p_{y_t}(g)\), which are identified by their outcome probabilities \((p_{x_t,1}(g),\ldots ,p_{x_t,L}(g))\). For \(\rho = 1/2\) the probability product kernel on the multinomial distributions is derived as follows,

$$\begin{aligned} k^{\text {prod}}(x_t, y_t)&= \int \limits _{{\mathbb {N}}^{L}} \sqrt{p_{x_t}(n_1,\ldots ,n_{L})p_{y_t}(n_1,\ldots ,n_{L})}\,\mathrm {d}(n_1,\ldots ,n_{L}) \\&= \sum \limits _{n_1 \in {\mathbb {N}}}\ldots \sum \limits _{n_{L} \in {\mathbb {N}}} \sqrt{p_{x_t}(n_1,\ldots ,n_{L})p_{y_t}(n_1,\ldots ,n_{L})} \\&= \sum \limits _{n_1 \in {\mathbb {N}}}\ldots \sum \limits _{n_{L} \in {\mathbb {N}}} {n_1+\cdots +n_L \atopwithdelims ()n_1,\ldots ,n_{L}} \left( (p_{x_t, 1}\cdot p_{y_t, 1})^{n_1} \cdots (p_{x_t, L} \cdot p_{y_t, L})^{n_{L}} \right) ^\frac{1}{2} \end{aligned}$$

with the maximum-likelihood estimation for the outcome probabilities of the multinomial distributions given by

$$\begin{aligned} p_{x_t,l}(g) = \frac{n_{z_l}(x_t(g))}{|O_{x_t}(g)|}. \end{aligned}$$
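When both groups contain the same number n of objects, the \(\rho = 1/2\) probability product kernel of two multinomials reduces to the Bhattacharyya coefficient of the outcome probabilities raised to the power n, which the following sketch exploits; the equal-count assumption is ours.

```python
def ppk_multinomial(p, q, n):
    """rho = 1/2 probability product kernel of two multinomial distributions
    over L locations with the same number of trials n."""
    bc = sum((pl * ql) ** 0.5 for pl, ql in zip(p, q))   # Bhattacharyya coefficient
    return bc ** n
```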

3.5 Temporal kernels

The temporal kernel \(k_{[0,1]}\) is the simpler component of the spatio-temporal convolution kernel, because the underlying set is fixed to the one-dimensional interval [0, 1]. We briefly discuss possible options for the temporal kernel and their implications.

The simplest temporal kernel is a constant kernel \(k_{[0,1]}(t,s) = 1\) with the consequence that the spatio-temporal convolution kernel (Eq. 3) collapses to a set kernel on the two sets of snapshots, thus ignoring order. A uniform or interval kernel given by

$$\begin{aligned} k_{[0,1]}(t,s) = {\mathbb {I}}_{\{u \in {\mathbb {R}}:|t-u|<w \}}(s) \end{aligned}$$

could be used to compare every snapshot of the first trajectory to just those snapshots of the second trajectory which are close in (relative) time. The choice of w determines how close snapshots have to be to be taken into consideration. Finally, Gaussian kernels of the form

$$\begin{aligned} k_{[0,1]}(t,s) = \exp \left( -\frac{1}{2\sigma ^2}\Vert t-s\Vert _2^{2}\right) \end{aligned}$$

compare every snapshot of the first trajectory to every snapshot of the second trajectory and give more importance to closer events (in relative time). Depending on the application at hand, there may be many more options to define temporal kernels.
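The three options can be summarised in a few lines; the widths are illustrative (Sect. 4 uses \(\sigma = 0.5\) for the temporal Gaussian kernel).

```python
import math

def k_constant(t, s):
    return 1.0                                 # ignores order: set kernel

def k_interval(t, s, w=0.1):
    return 1.0 if abs(t - s) < w else 0.0      # uniform/interval kernel

def k_gaussian(t, s, sigma=0.5):
    return math.exp(-((t - s) ** 2) / (2 * sigma ** 2))
```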

3.6 Approximation techniques

To compute the Gram matrix of a dataset of N trajectories, \(O(N^{2} L^{2})\) spatial as well as temporal kernels need to be evaluated with L being the maximal length of a sequence. Naturally, the evaluation of the spatial kernels will dominate the evaluation of the temporal kernel mainly for reasons of dimensionality as well as complexity of the kernel itself.

Generally, there exist two ways to speed up the algorithm. First, the number of similarities which have to be computed is reduced to shrink the factor \(O(N^2)\). Afterwards the missing entries in the Gram matrix are reconstructed. Second, the computation of each similarity is accelerated to reduce the term \(O(L^2)\). Regarding the first approach, there exist methods to reconstruct incomplete Gram matrices, most notably the Nyström method (see Fowlkes et al. 2004, Williams and Seeger 2001). These methods have the disadvantage of usually needing a constant fraction of all entries to perform well (Wauthier et al. 2012). It follows that they do not improve the asymptotic complexity of the computation. We focus on the second approach, since it is specific to spatio-temporal convolution kernels and potentially improves the asymptotic complexity.

Algorithm 1 Approximated spatio-temporal convolution kernel (ASTCK)

As stated before, the computation of the temporal kernel is significantly faster than that of the spatial kernel. Thus, a natural way to approximate these kernels is to first evaluate all temporal kernels and then to only evaluate those spatial kernels which have the highest contribution. Depending on the lengths of the two trajectories at hand, the number of overall kernel evaluations varies between pairs of trajectories. Therefore, when approximating the kernel one has to take care that all entries of the Gram matrix are approximated at a similar speed to avoid distortions. To this end, we propose a percental approximation algorithm (Algorithm 1, Approximated STCK (ASTCK)), which in addition to the two trajectories takes the maximal length of all trajectories L as an input. The algorithm first evaluates all temporal kernels. Subsequently, in each iteration a fraction of \(\frac{1}{L}\) of the spatial kernels is computed, starting with the most relevant based on the evaluation of the temporal kernels. The order of evaluation is illustrated in Table 2. This algorithm leads to a complexity of

$$\begin{aligned} {\mathcal {O}}(N^2\cdot L^2\cdot C_T + N^2\cdot L \cdot I \cdot C_S) , \end{aligned}$$

where \(C_T\) is the complexity of evaluating the temporal kernel, \(C_S\) is the complexity of evaluating the spatial kernel and I is the number of iterations of the approximation algorithm. The parameter \(1\le I \le L\) needs to be chosen by the user. In practice, small values of I often lead to highly accurate results.
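A sketch of Algorithm 1 under our reading of the text, where "a fraction of \(\frac{1}{L}\) of the spatial kernels per iteration" is implemented as \(\lceil |P||Q|/L\rceil \) spatial evaluations per iteration; variable names are illustrative.

```python
import numpy as np

def astck(P, Q, L, I, k_temporal, k_spatial):
    """Approximated STCK: evaluate all temporal kernels first, then spend I
    iterations on the spatial kernels with the largest temporal weights."""
    pairs = [(i, j) for i in range(len(P)) for j in range(len(Q))]
    weights = np.array([k_temporal(P[i][0], Q[j][0]) for i, j in pairs])
    order = np.argsort(-weights)                      # most relevant pairs first
    per_step = max(1, int(np.ceil(len(pairs) / L)))   # ~1/L of the evaluations
    total = 0.0
    for idx in order[: I * per_step]:
        i, j = pairs[idx]
        total += weights[idx] * k_spatial(P[i][1], Q[j][1])
    return total / (len(P) * len(Q))
```

For \(I = L\) all spatial kernels are evaluated and the exact value of Eq. 3 is recovered.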

Table 2 Evaluation order of ASTCK with \(L=12\) and \(|P|=|Q|=6\)

3.7 Online application

STCKs are particularly well-suited for online analyses such as real-time computations where the objects of interest are still in motion. When a new measurement is added to a trajectory, Eq. 3 can be efficiently updated, since all previous spatial kernel evaluations remain constant in the sum. That is, only the values of the temporal kernel need to be computed, which is usually inexpensive.
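A sketch of this caching scheme; the class layout is illustrative and the re-normalisation of the timestamps of the growing trajectory is left to the caller.

```python
class OnlineSTCK:
    """Incrementally compares a growing trajectory P against a fixed trajectory
    Q: spatial kernel values are cached, so each update only requires the
    inexpensive temporal evaluations."""
    def __init__(self, Q, k_temporal, k_spatial):
        self.Q, self.kt, self.ks = Q, k_temporal, k_spatial
        self.P, self.cache = [], []          # one row of cached k_X per snapshot

    def add_snapshot(self, t, x):
        self.P.append((t, x))
        self.cache.append([self.ks(x, y) for _, y in self.Q])

    def value(self):
        if not self.P:
            return 0.0
        total = sum(self.kt(t, s) * kx
                    for (t, _), row in zip(self.P, self.cache)
                    for (s, _), kx in zip(self.Q, row))
        return total / (len(self.P) * len(self.Q))
```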

4 Empirical evaluation

In this section we empirically compare our spatio-temporal convolution kernels to baseline approaches using artificial and real data sets. We focus on clustering tasks and k-medoids (Kaufman and Rousseeuw 1987) as the underlying learning algorithm. The temporal kernel is always a Gaussian kernel that we combine with three different spatial kernels: an object-wise Gaussian RBF kernel as spatial kernel (\(\text {STCK}_{\text {1-1}}\)), a Gaussian RBF kernel on the group means (\(\text {STCK}_{\text {Mean}}\)), and a probability product kernel on the fitted Gaussian distributions (\(\text {STCK}_{\text {Dist}}\)).

For the latter, the group- and application-specific parameter \(\sigma _{\text {MIN}}^2\), which is sometimes needed to restore non-singularity of the covariance matrix (see Sect. 3.4), is set to the average distance between two objects of the group. If the group consists of only one object, the value of the most similar group is used instead. The width parameter \(\sigma _T\) of the temporal Gaussian kernel is set to 0.5 to balance invariance to speed differences and sensitivity to direction. The width parameter \(\sigma _S\) of the spatial Gaussian RBF kernel is set to 0.2.

We deploy three baselines. First, Junejo et al. (2004) is straightforwardly extended to the multi-object scenario, i.e. the Hausdorff distance is applied to the sets of positions of the trajectories. Instead of the hierarchical clustering employed in Fu et al. (2005), kernelised k-medoids is used. The second baseline is inspired by Wang et al. (2008). We use a bag-of-positions as well as a bag-of-directions representation for the trajectories of each group. To keep the setup simple, we use a multinomial mixture model (MNMM) and expectation maximisation for clustering instead of a semantic topic model like Dual-HDP (Wang et al. 2008). Third, we also compare our method to dynamic time warping with a probability product kernel (\(\text {DTW}_{\text {dist}}\)) serving as local distance measure. This method applies the probability product kernel to the fitted Gaussian distributions. The number of clusters is determined using the silhouette measure (Rousseeuw 1987), the Hartigan index (Hartigan 1975) as well as nearest-neighbour consistency for all methods.

In the next section, we measure the alignment between predicted and ground-truth clusterings using artificially generated data. The alignment between two groupings is captured by the Rand Index (Rand 1971); however, as two random clusterings may return a non-zero Rand Index by chance, we resort to the Adjusted Rand Index (Hubert and Arabie 1985), given by

$$\begin{aligned} AR(S, T) = \frac{R(S, T) - \mathbb {E}[R(S, T)]}{1 - \mathbb {E}[R(S, T)]}, \end{aligned}$$

where R denotes the Rand Index and S and T are clusterings to compare.
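In practice the adjusted Rand index is readily available, e.g. in scikit-learn; the label vectors below are purely illustrative.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical cluster assignments; the score is 1.0 for identical clusterings
# (up to relabelling) and close to 0.0 for random ones.
predicted = [0, 0, 1, 1, 2]
reference = [0, 0, 1, 2, 2]
print(adjusted_rand_score(reference, predicted))
```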

4.1 Artificial data

Recall that our spatio-temporal convolution kernels possess basic properties that are not shared with the baseline methods. Most notably these are invariance to permutations of the objects and to differences in speed as well as sensitivity to the spatial distribution of the objects and the direction of the movement.

In this section these properties are experimentally confirmed using two toy data sets consisting of artificially generated multi-object trajectories. The data is generated to cover a broad range of trajectories present in real-world applications. The first setting covers linear movements and each trajectory consists of five objects performing a linear movement as shown in Fig. 1 (top). To find the correct clusters, the respective kernels need to distinguish trajectories based on the direction of the movement or based on the distribution of the objects, while the objects’ centroid is the same in both trajectories.

Fig. 1

Exemplary artificial multi-trajectories. For both settings ten multi-trajectories per cluster are shown. Colours indicate cluster membership. Moreover, two levels of noise are depicted. For instance, linear 0.025 shows linear movements with added noise in the interval \([-0.025, 0.025]\)

The second setting deals with circular movements and each trajectory consists of four objects moving in circles of different radii, see Fig. 1 (bottom). The construction of the second set enables us to evaluate the ability of the kernels to identify the correct clusters when these only differ in the direction of the movement but a spatial separation between the clusters is not possible. Compared to linear movements, the circular task is more difficult, as the directions of only some objects differ between the clusters. The data generation is described in detail in “Appendix” section.

To evaluate the sensitivity of the methods to permutations, the objects' ordering inside each multi-object trajectory is permuted randomly in both data sets. To assess the sensitivity to changes in speed, we flip a coin for each trajectory in both test sets: with 50 % probability only every second position is retained in the trajectory. The resulting trajectory corresponds to a movement with twice the speed of the original trajectory. Finally, we add zero-mean, uniformly distributed noise in the range \([-\epsilon , \epsilon ]\) for \(\epsilon \in \{0.005, 0.025, 0.05, 0.1, 0.25\}\) to every position in each multi-object trajectory in both test sets.

For both data sets, we generate 200 multi-trajectories (50 per cluster) for all five levels of noise. It is important to assess the methods’ sensitivity to noise for two reasons. First, trajectories are generally highly variable and exact matches are very rare. Second, due to frequent tracking errors, real-life trajectories are also subject to a significant level of noise.

Fig. 2

Cluster performance of the STCK compared to three baselines

In this experiment, we focus on \(\text {STCK}_{\text {Dist}}\) as it leads to more accurate clusters than \(\text {STCK}_{\text {1-1}}\) and \(\text {STCK}_{\text {Mean}}\) throughout all ranges of noise. Figure 2 compares \(\text {STCK}_{\text {Dist}}\) to the three baselines. Our method outperforms dynamic time warping on both test sets and gives the most accurate clusterings. The multinomial mixture model performs reasonably well on linear movements but leads to inappropriate clusterings for circular ones. The poor results on the second data set are due to the absence of ordering information in the bag-of-directions representation when the objects move in a full circle. The Hausdorff distance-based baseline of Junejo et al. leads to generally inaccurate clusterings. We credit this finding to its sensitivity to permutations as well as its negligence of ordering information.

We now evaluate the proposed approximation technique on both artificial data sets. The algorithm performs up to eleven iterations, as the longest trajectory consists of eleven snapshots. Note that, at the end of the last iteration, the exact Gram matrix is obtained. The percental approximation proves accurate and recovers the correct clustering after only two iterations. Figure 3 shows that ASTCK also achieves higher consistencies with the correct clustering compared to all baseline methods from the second iteration onwards. Moreover, from the second iteration onwards it is always at least as good as the exact method. The depicted results correspond to a noise level of 0.005, but equivalent results hold for all other noise ratios.

Fig. 3

Comparison of ASTCK II to the baseline methods

The runtime of multi-object trajectory clustering is governed by the number of trajectories, the length of the trajectories and the number of objects. Depending on the application, usually one or two dimensions are dominating. It is thus important to evaluate the runtime for each of the three dimensions. The theoretical runtime is \(O(N^{2}\cdot L^{2})\) spatial as well as temporal kernel evaluations, where N is the number of trajectories and L is the maximum length of the trajectories. To confirm these findings experimentally, we generate random trajectory sets in \([0,1]^2\) and vary the number of trajectories per set, the length of the trajectories and the number of objects per trajectory. The results are depicted in Fig. 4.

The figure on top shows the results for varying numbers of trajectories. Except for multinomial mixture models, which scale linearly in the number of instances, all methods exhibit a quadratic complexity, which is simply due to the fact that all pairwise similarities are computed, i.e. \({N(N+1)}/{2}\). Varying the length of the trajectories in Fig. 4 (bottom, left) shows that all STCKs as well as DTW exhibit quadratic complexities, which is in line with the theoretical runtime. However, the percental approximation algorithm (two iterations) significantly improves the runtime of STCKs and is comparable to that of Junejo et al. and multinomial mixture models.

Fig. 4

Runtime: number of trajectories (top), length of trajectories (bottom, left), and number of objects (bottom, right)

Finally, Fig. 4 (bottom, right) varies the number of objects. The results are virtually constant time complexities for all methods except \(\text {STCK}_{\text {1-1}}\) and MNMM, which exhibit linear complexities. The observation is in line with our expectations, as the Gaussian RBF kernel compares each object with its counterpart. The deviations in the case of \(\text {STCK}_{\text {Dist}}\) for a small number of objects (two or fewer) are due to the additional time needed by the shrinking schemes to restore non-singularity of the covariance matrices.

4.2 Real-world data

We now evaluate spatio-temporal convolution kernels on real world data using positional data streams of ten soccer games of the German Bundesliga. The goal is to identify movement patterns by analysing the tracking data.

The tracking data is captured by the VIS.TRACK (Impire 2014) tracking system during five games of Bundesliga Team A and five games of Bundesliga Team B from the 2011/12 Bundesliga season. VIS.TRACK is a video-based tracking system consisting of several cameras in the soccer stadium. It determines the positions of the players, ball and referees at 25 frames per second, which amounts to roughly 135,000 positions per object and match and a total of 31,000,000 positions. Additionally, a marker indicates the status of the ball (in-play, stoppage) and the possession of the ball. The range of the x-coordinate (parallel to the sidelines) and the y-coordinate (parallel to the endlines) is \([-1,1]\); absolute values greater than one occur if the ball is out of bounds. The data stream is preprocessed so that positions of the second half are mirrored to account for the changeover at half time, and the frame numbers of the second half are changed to succeed those of the first half. Additionally, the playing direction is determined and normalised, so that the team of interest always plays from left to right. Subsequently, we extract two types of sequences: game initiations and scoring opportunities.

Game initiations (GI) begin with the goal keeper passing the ball and end with the team losing possession, a stoppage, the ball entering the attacking third of the field, or the start of the next game initiation as defined above. Sequences shorter than length 12 are excluded. Scoring opportunities (SO) begin at the time of the last stoppage or win of the ball and terminate when the ball is carried into a predefined zone of danger, here the attacking quarter of the field \([0.5, 1]\times [-1,1]\). Again, sequences with a length below 12 are discarded.

For every sequence there are 23 possible trajectories (ball, 22 players, no referees) to include in the analysis. Since the opposing team changes from game to game, we restrict the analysis to twelve objects, namely the ball and the players of Team A and of Team B, respectively. In the following experiments we consider the ball as one group in the sense of Definition 2. We further include the back four (game initiations) and the four most offensive players (scoring opportunities) as a second group (in the sense of Definition 2) in the analysis. The clusterings in the remainder are thus based on the trajectories of five objects (ball, four players).

In general, there is no ground-truth for real-world clustering problems. Figure 5 therefore depicts the adjusted Rand indices between the five methods for pair-wise comparison. Generally, the adjusted Rand index is low in the majority of the cases indicating low consistency between the methods. On the other hand, we have a higher consistency between \(\text {STCK}_{\text {dist}}\) and \(\text {DTW}_{\text {dist}}\) demonstrating that our method is capable of dealing with speed and length differences in a similar way as dynamic time warping. This is also partially due to the use of the same kernel in both methods, once as spatial kernel and once as local distance measure.

Fig. 5

Pairwise adjusted Rand indices

Figure 6 (left) shows average Silhouette measures of the four similarity-based methods over the four datasets. Our \(\text {STCK}_{\text {dist}}\) achieves the highest score, indicating higher cluster separation and/or compactness. The object-wise STCK provides the poorest results on average, indicating that permutations are relevant in this application and that the distribution-based representation of the probability product kernel is more successful in capturing the relevant player movements. Dynamic time warping performs second best on average, in line with the previous results on the artificial datasets. Also note the high consistency with our method in Fig. 5. In terms of 5-nearest-neighbour consistency, the two distribution-based methods, \(\text {STCK}_{\text {dist}}\) and \(\text {DTW}_{\text {dist}}\), perform best, as depicted in Fig. 6 (right). Cluster quality is generally higher for game initiations compared to scoring opportunities, which also results in more interpretable clusters for these settings, as we will show in the remainder.

Fig. 6

Cluster consistencies and compactness

Figures 7 and 8 illustrate the resulting clusters for the five methods. The medoids of the clusters for both teams are depicted along with the distribution of the trajectory length for each cluster. For all settings cluster membership correlates with sequence length. Generally, the cluster medoids of \(\text {STCK}_{\text {dist}}\) and \(\text {DTW}_{\text {dist}}\) often coincide and are easily interpretable, while the cluster medoids of \(\text {STCK}_{\text {1-1}}\) and the cluster representatives of the multinomial mixture model are difficult to make sense of.

Fig. 7

Game initiation clusters of Team A (top; four clusters) and Team B (bottom; six clusters). The above plots show cluster medoids indicated by colours. Only the trajectory of the ball is shown, while the clusters are also based on the trajectories of four players. Where possible, similar clusters are coloured equally. Below, the distribution of the length of the trajectories in each cluster is depicted. For multinomial mixture models, the multi-object trajectory with the highest probability is depicted as representative

Fig. 8

Scoring opportunity clusters of Team A (top; six clusters) and Team B (bottom; ten clusters). The plots show the cluster medoids. Colours indicate clusters. Only the trajectory of the ball is shown, while the clusters are also based on the trajectories of four players. Where possible, similar clusters are coloured equally. Below, the distribution of the length of the trajectories in each cluster is depicted. In the case of the multinomial mixture model, the multi-object trajectory with the highest probability is depicted as representative

In the Bundesliga 2011/2012 season, the strategy of Team A generally consisted of transporting the ball with few, but rehearsed, short game initiations to the opposing half. For this purpose, many ball contacts were allowed and different players were integrated. On the contrary, Team B showed a rather chaotic game organisation with rather random actions and increasingly long, straight balls. Figure 7 shows the outcomes of the k-medoids clustering for the game initiations. Compared to all other methods, \(\text {STCK}_{\text {dist}}\) captures the characteristic traits of the teams well. Team A clearly acted with many short moves (long trajectories in cluster 1) and integrated many players in the playmaking (cluttered medoids). By contrast, Team B acted with many long moves (short trajectories in all clusters) and preferred linear actions.

Near the opposing goal, Team A aimed at quickly achieving a goal in the opposing half during the 2011/2012 season. They operated with only a few ball contacts and aimed to quickly transport the ball into the predefined zone of danger. Again in contrast to this, Team B had many ball contacts, took their time waiting for a mistake of the opponent, and only then played into the zone of danger to achieve a goal. Figure 8 shows that all methods are capable of retracing the different offensive strategies of both teams. Again, Team B has more solution categories (more than 33 %) than Team A, which is shown by the versatile and multifaceted running patterns. However, solely the results with \(\text {STCK}_{\text {dist}}\) show that Team A rapidly tries to enter the zone of danger with very few ball contacts (short sequences in all clusters compared to Team B). To sum up, the results with \(\text {STCK}_{\text {dist}}\) best reflect the game philosophies of Teams A and B from a sport-scientific perspective and are the easiest to interpret.

5 Conclusion

We presented spatio-temporal convolution kernels for multi-object scenarios. Our kernels consist of a temporal and a spatial component that can be chosen according to the characteristic traits of the problem at hand. The computation time is quadratic in the number and lengths of the trajectories. We proposed an efficient percental approximation algorithm that reduces the complexity to a superlinear runtime. Empirical results on artificial clustering tasks showed that our spatio-temporal convolution kernels effectively identify the target concepts. Results on large-scale real-world data from soccer games showed that our kernels lead to easily interpretable clusters that may be used in further analyses by coaches.