Abstract
Trajectory data of simultaneously moving objects is being recorded in many different domains and applications. However, existing techniques that utilise such data often fail to capture characteristic traits or lack theoretical guarantees. We propose a novel class of spatiotemporal convolution kernels to capture similarities in multiobject scenarios. The abstract kernel is a composition of a temporal and a spatial kernel and its actual instantiations depend on the application at hand. Empirically, we compare our kernels and efficient approximations thereof to baseline techniques for clustering tasks using artificial and real world data from team sports.
1 Introduction
Trajectory data of simultaneously moving objects is the key to analyse animal migration (Lee et al. 2008), transportation (Baud et al. 2007; Giannotti et al. 2007), tactics in team sports (Hirano and Tsumoto 2005; Kempe et al. 2014; Lucey et al. 2013; Wei et al. 2013), players and avatars in (serious) computer games (Kang et al. 2013; Pao et al. 2010), customer behaviour (Larson et al. 2005) as well as spread patterns of fires (Trunfio et al. 2011). A characteristic trait of many such applications is that trajectories of several objects are more informative than the trajectory of a single object. For instance, a single trajectory of a bird is not indicative for bird migration as individuals may join or leave the flock (Lee et al. 2008) and a single trajectory of a soccer player does not reveal insights on the actual situation on the pitch (Grunz et al. 2012; Wei et al. 2013).
Therefore, trajectories of multiple objects need to be processed together. Although this insight sounds trivial, processing multiple trajectories simultaneously challenges the standard model of computation as trajectories may interdepend in time and space in multiple ways. To exploit these dependencies, it is necessary to establish a notion of similarity for spatiotemporal paths of multiple objects to identify frequent patterns. By definition, frequent patterns are formed by an a priori unknown subset of objects at unknown locations in time and space. Analysing multitrajectory data is therefore inherently a combinatorial problem that involves processing data at large scales.
A second problem arises from existing methods for analysing spatiotemporal data. Traditional approaches often cannot deal with continuous spatial domains but rely on an appropriate discretisation of the data at hand (Kang and Yong 2008; Mamoulis et al. 2004; Mehta et al. 2005). However, finding an a priori optimal discretisation is often difficult in many domains where only the final result allows conclusions on whether an initial set of atomic events is plausible or not (Kang and Yong 2008). Furthermore, many approaches cannot deal with permutations of the objects and differences in speed, while still being sensitive to differences in the direction of the motion (Hirano and Tsumoto 2005; Junejo et al. 2004; Wei et al. 2013).
We devise a novel class of convolution kernels for multi-trajectory data. It is specially tailored to multi-object scenarios, i.e. trajectories of multiple simultaneously moving objects. The kernel properties as well as the modular nature of the proposed class of kernels render it highly adaptive to different applications. Since it is a kernel, it can also naturally be deployed with any kernel machine. The three characteristics, multi-object scenario, modularity and kernel property, distinguish our approach from existing methods. Due to its distinct characteristics, our approach is more suitable for a large variety of applications, it is flexible with respect to the notion of similarity, and it is theoretically better grounded than most of the existing methods. Since the complexity of a kernel evaluation is quadratic in the number and the lengths of the involved objects, we also propose efficient approximations. Empirically, the method is evaluated on artificial datasets and real-world tracking data from ten Bundesliga soccer matches. We generally observe that our convolution kernels lead to better clusterings compared to baseline methods.
The remainder of this article is structured as follows. Section 2 reviews existing work. Section 3 introduces our spatiotemporal convolution kernel methods and Sect. 4 reports on empirical results. Section 5 concludes.
2 Related work
2.1 Trajectory clustering
Trajectory clustering, or clustering of spatiotemporal data respectively, has been an active field of research in recent years. Existing approaches mainly focus on the application of video surveillance with the goal to detect anomalies in the data stream (Basharat et al. 2008; Fu et al. 2005; Hu et al. 2006; Jeong et al. 2011; Junejo et al. 2004; Saleemi et al. 2009). Other applications include automatic sports analysis, weather evolution modelling, animal migration and traffic analysis. Existing approaches mostly rely on processing single trajectories. Recent contributions in this area can be roughly grouped into similarity-based approaches (Buzan et al. 2004; Jinyang et al. 2011; Fu et al. 2005; Hirano and Tsumoto 2005; Hu et al. 2006; Junejo et al. 2004; Piciarelli et al. 2005) and motion-based approaches (Basharat et al. 2008; Wang et al. 2008; Jeong et al. 2011; Li and Chellappa 2010; Lin et al. 2009; Saleemi et al. 2009).
Similarity-based approaches define pairwise similarities between trajectories which are then processed by some clustering algorithm. Junejo et al. (2004) represent trajectories as a set of two-dimensional coordinates together with the Hausdorff distance. Subsequently, graph cuts are deployed to recursively partition the trajectories. Hausdorff distances are also used to cluster trajectories by Jinyang et al. (2011), where not only the position but also the direction of the trajectories is taken into account by using 4-tuples \((x,y,\mathrm {d}x,\mathrm {d}y)\) instead of coordinates only. Fu et al. (2005) first resample trajectories to obtain constant between-point distances. Then the corresponding points of two trajectories are compared using an RBF kernel where the longer trajectory is cut to the length of the shorter one. Spectral clustering is then used together with a symmetric normalised Laplacian.
Buzan et al. (2004) extend the longest common subsequence algorithm to three-dimensional coordinates and use a modified version of agglomerative hierarchical clustering. Hirano and Tsumoto (2005) deploy multiscale matching to compare trajectories. The basic idea is to generate trajectories at different scales as convolutions of the trajectory and Gaussian kernels with different standard deviations. Their similarity measure is then based on the hierarchical structure of the trajectory segments at different scales. Subsequently, a rough clustering is employed. Piciarelli et al. (2005) define a trajectory-to-cluster similarity by the average Euclidean distance of trajectory coordinates to the nearest cluster coordinate where offsets in time induce negative weights.
Our approach also belongs to these similarity-based methods. In general, similarity-based approaches suffer from two major drawbacks. First, their computational complexity is at least quadratic in the number of trajectories. Second, they rely on clustering full trajectories and are hence sensitive to tracking errors and subtrajectories. While the first drawback is inherent to all similarity-based methods, our distribution-based approach and gradual weighting mitigate the effects of noise and tracking errors and allow us to identify partial matchings between trajectories.
In contrast to similarity-based approaches, motion-based approaches focus on local movements of objects to derive models for the overall (group) motion in a scene. Wang et al. (2008) and Jeong et al. (2011) represent a trajectory by bags of positions as well as directions based on the bag-of-words representation of documents in natural language processing. To this end, the spatial domain is discretised and the number of occurrences of each position in a trajectory is counted. Wang et al. (2008) also take into account temporal information by counting the occurrences of each (discretised) direction in a trajectory. The topic model DualHDP (Wang et al. 2008) is used to find semantic regions, which are combined to form the different trajectories. Jeong et al. use latent Dirichlet allocation (Blei et al. 2003) to obtain semantic regions. To incorporate temporal information, a hidden Markov model is trained for each topic based on the sequences which are close to the topic. Saleemi et al. (2009) propose kernel density estimation to learn a five-dimensional distribution of transitions from \((x_1, y_1)\) to \((x_2, y_2)\) in time t. Markov chain Monte Carlo (Andrieu et al. 2003) is then deployed to sample the most likely paths given the learned transition probabilities.
Basharat et al. (2008) also learn a model for transition probabilities. Instead of kernel density estimation, a Gaussian mixture model is fitted to the observed transitions. Lin et al. (2009) exploit the Lie algebraic structure of affine transformations to learn a flow model consisting of overlapping two-dimensional Gaussian distributions, each of which corresponds to an affine transform dominant in this spatial area. The approach is applied to pedestrians in a train station and optical flows obtained from satellite images. Li and Chellappa (2010) also use a similar Lie algebraic representation called the spatial hybrid driving force model, which, as opposed to Lin et al. (2009), evolves over time. This model is used to solve the so-called group motion segmentation problem, i.e. to answer the question of which objects take part in an organised group motion and which do not.
Motion-based approaches also have inherent limitations. First, they often neglect temporal information, at least of second order (curvature). Second, they do not provide a mapping of the input trajectories to groups of similar trajectories but rather describe the combined motion of all objects in all trajectories over time. Our approach differs methodologically from the summarised techniques in several ways: First, it provides a general framework that covers many applications and properties as opposed to being a very specific similarity measure tailored to a single application domain. Second, our approach is specialised in multiple simultaneously moving objects instead of focussing only on trajectories of single objects. Third, being a kernel, the similarity measure is straightforwardly applicable to a broad range of algorithms and is theoretically well grounded, in contrast to heuristic approaches.
2.2 Sports analytics
Current approaches in the area of sports game trajectory analysis either aim to define objective performance measures for players (Kang et al. 2006), classify (Bialkowski et al. 2013; Hervieu and Bouthemy 2010; Intille and Bobick 2001; Grunz et al. 2012; Perše et al. 2009; Siddiquie et al. 2009) or cluster (Hirano and Tsumoto 2005; Wei et al. 2013) plays/trajectories, or learn a motion model for team behaviour (Bialkowski et al. 2014; Direkoglu and O’Connor 2012; Kim et al. 2010; Li et al. 2009; Li and Chellappa 2010; Lucey et al. 2013; Zhu et al. 2007).
Kang et al. (2006) define performance metrics for soccer players based on the definition of owned and competitive regions of the field, which are derived from ball and player trajectories. Siddiquie et al. (2009) represent videos of football plays by a bag-of-features from histograms of optical flows as well as histograms of oriented gradients. Spatiotemporal pyramid matching (Lazebnik et al. 2006) is used to generate a kernel for each visual word. Football plays are then classified into seven categories using multiple kernel learning. Hervieu and Bouthemy (2010) use a hierarchical parallel semi-Markov model to classify different activity states in squash and handball, such as rallies, free throws and defence. The first layer describes the activity states, while the second layer consists of a parallel hidden Markov model for each feature representing the trajectories.
Perše et al. (2009) represent team activity in basketball using team centroids to hierarchically classify situations with Gaussian mixture models. Thereafter, each situation is converted into a string, which is compared to templates for classification. Bialkowski et al. (2013) use team centroids and occupancy maps to classify game situations in field hockey (corners, goals), emphasising the robustness of this representation to tracking noise. Grunz et al. (2012) employ self-organising maps to identify long and short game initiations in soccer, and Hirano and Tsumoto (2005) use multiscale comparison and rough clustering to cluster ball trajectories that lead to goals.
Direkoglu and O’Connor (2012) solve a special Poisson equation, in which the player positions determine the location of source terms. The derived distribution and its development over time defines a so-called region of interest used to describe the team behaviour. Wei et al. (2013) use role models (Lucey et al. 2013) and a bilinear spatiotemporal basis model to represent team movement and to cluster goal scoring opportunities in soccer. Bialkowski et al. (2014) also use role models to automatically detect and compare the formations of soccer teams. Li and Chellappa (2010) learn a spatiotemporal driving force model to identify offence and defence players in football. Kim et al. (2010) interpolate a dense motion field from player trajectories using thin-plate splines. This motion field is further investigated for points of convergence to predict where the game will evolve in the short term.
From an application point of view, our approach is most comparable to Wei et al. (2013) and Grunz et al. (2012). While Wei et al. focus on scoring opportunities and Grunz et al. study game initiations, we consider both situations in this study. Similar to Bialkowski et al. (2013), our method proves robust to tracking noise.
3 Spatiotemporal convolution kernels
3.1 Representation
Multi-object trajectory analysis is concerned with a possibly varying number of moving objects \({\mathcal {O}}_t\) in a set X, e.g. \(X = {\mathbb {R}}^2\), over a finite period of time \({\mathcal {T}} \subset {\mathbb {N}}\). A multi-object trajectory is composed of snapshots of the object positions at different times. Depending on the context and application at hand, one of the following two formalisations of a snapshot is more appropriate.
Definition 1
(Object-oriented snapshot) Assume the number of objects to be constant over time, i.e. \({\mathcal {O}}_t = {\mathcal {O}} = \{o_1,\ldots , o_N\}\) for \(N \in {\mathbb {N}}\). Then the object-oriented snapshot of all objects at time \(t \in {\mathcal {T}}\) is denoted by \( x_{t} \in X^N =: {\mathcal {X}}.\) We call \({\mathcal {X}}\) the snapshot space. The position of a particular object \(o \in {\mathcal {O}}\) is denoted by \( x_{t}(o) \in X.\)
Definition 2
(Group-oriented snapshot) Assume there is a constant number of groups \(K \in {\mathbb {N}}\). Moreover, at every point in time each object can be associated with exactly one of the groups \({\mathcal {G}} = \{g_1,\ldots ,g_K\}\).^{Footnote 1} Then the group-oriented snapshot of all objects at time \(t \in {\mathcal {T}}\) is denoted by \( x_{t} \in {\mathcal {P}}(X)^K =: {\mathcal {X}}.\)
We call \({\mathcal {X}}\) the snapshot space. The positions of all objects of a particular group \(g \in {\mathcal {G}}\) are denoted by \(x_{t}(g) \in {\mathcal {P}}(X).\)
The group members of group g in snapshot \(x_t\) are denoted by \(O_{x_t}(g) \subset {\mathcal {O}}_t.\)
The implications of the two definitions are as follows. First, the object-oriented snapshot representation only allows a fixed number of objects, whereas the group-oriented representation is not limited in that respect. Second, in the group-oriented snapshot, objects inside a group are indistinguishable. On the one hand, this property allows for permutations of objects; on the other hand, it naturally also entails ambiguities.
Instead of an ordered sequence of positions or snapshots, we use a set of time/position pairs to represent trajectories. Thereby, time and order are explicitly represented, as opposed to the more implicit sequence representation.
Definition 3
(Trajectory) A trajectory is defined as a finite subset
$$\begin{aligned} \tilde{P} = \left\{ (\tilde{t}_1, x_{\tilde{t}_1}), \ldots , (\tilde{t}_n, x_{\tilde{t}_n})\right\} \subset {\mathcal {T}} \times {\mathcal {X}} \end{aligned}$$
such that \(\tilde{t}_i \ne \tilde{t}_j\) for \(i\ne j\), i.e. the trajectory set \(\tilde{P}\) contains only one snapshot per point in time.
The set \(\pi _{{\mathcal {T}}}(\tilde{P}) = \{t \in {\mathcal {T}}: \exists (\tilde{s}, x_{\tilde{s}}) \in \tilde{P} \text { s. t. } t= \tilde{s}\}\) contains all timestamps of the trajectory and is usually of the form \(\{K, K+1,\ldots , K+L\}\) for some natural numbers K and L. When comparing trajectories, it is insignificant at what absolute time the trajectories start. This gives rise to the following definition.
Definition 4
(Time-normalised trajectory) The time-normalised trajectory \(P \subset [0,1]\times {\mathcal {X}}\) corresponding to trajectory \(\tilde{P}\) is defined by normalising its timescale to [0, 1]. This corresponds to the trajectory P, given by
$$\begin{aligned} P = \left\{ \left( \frac{\tilde{t} - \mu }{\max (\pi _{{\mathcal {T}}}(\tilde{P})) - \mu }, \; x_{\tilde{t}} \right) : (\tilde{t}, x_{\tilde{t}}) \in \tilde{P} \right\} \end{aligned}$$
where \(\mu =\min (\pi _{{\mathcal {T}}}(\tilde{P}))\).
In the remaining part of this study we refer to time-normalised trajectories simply as trajectories.
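As an illustration, the time normalisation of Definition 4 can be sketched in a few lines; the function name and the list-of-pairs representation are our own choices, not prescribed by the paper:

```python
def normalise_time(trajectory):
    """Map the timestamps of a trajectory, given as (timestamp, snapshot)
    pairs, onto the interval [0, 1] (cf. Definition 4)."""
    times = [t for t, _ in trajectory]
    mu, mx = min(times), max(times)
    if mx == mu:  # degenerate single-timestamp trajectory: place it at 0
        return [(0.0, x) for _, x in trajectory]
    return [((t - mu) / (mx - mu), x) for t, x in trajectory]

# A trajectory recorded at frames 3, 5, 7 is mapped to relative times 0, 0.5, 1.
print(normalise_time([(3, 'a'), (5, 'b'), (7, 'c')]))
```

After this step, two recordings of different absolute start times become directly comparable on the common scale [0, 1].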
3.2 Problem setting
One of the main advantages of kernel methods is the separation of algorithm and data. Following this paradigm, we focus on defining a kernel on the set of multi-object trajectories. Once a kernel has been defined, off-the-shelf kernel machines can be applied to generate models, such as support vector machines (Vapnik 1995), kernelised k-medoids (Kaufman and Rousseeuw 1987), or spectral clustering (Ng et al. 2001). The formal problem setting of this article is defined as follows. On the set of multi-object trajectories \(M \subset {\mathcal {P}}([0,1]\times {\mathcal {X}})\) we aim to develop a similarity measure \(k:{\mathcal {P}}([0,1]\times {\mathcal {X}}) \times {\mathcal {P}}([0,1]\times {\mathcal {X}}) \rightarrow {\mathbb {R}}\), such that

(I)
the absolute position as well as the shape of the trajectories is incorporated,

(II)
the measure is invariant to permutations of certain objects, i.e. for two trajectories \(P_1\), \(P_2\) with
$$\begin{aligned} P_2 = \{(t, x_t) : \exists \text { permutation } \sigma \ \forall (s, y_s) \in P_1 \text { s. t. } t=s \wedge x_t = \sigma (y_s)\} \end{aligned}$$it holds that \(k(P_1, P_2) = 1\). In case of the grouporiented snapshot this already holds by definition if the permuted objects are members of the same group,

(III)
the measure is invariant with respect to the speed of the movement. Since all trajectories have already been normalised to the same time scale, differences in speed are mainly reflected in the cardinality of the trajectory sets. So, for example, given two trajectories \({P}_1\) and \({P}_2\) with \(|{P}_1| = 2|{P}_2|\) and
$$\begin{aligned} {P}_2 = \{({t}, x_{{t}}) : \exists ({{s}}, y_{{s}}) \in {P}_1 \text { s. t. } {t}={2s} \wedge x_{{t}} = y_{{2s}}\} \end{aligned}$$it holds that \(k(P_1, P_2) \approx k(P_1, P_1)\),^{Footnote 2}

(IV)
similar movements of two sets of objects are recognised as such in the presence of deviations of single objects and outliers.
Moreover, the measure should have the following properties:

(V)
Kernel Property, i.e.
$$\begin{aligned} k(P_1, P_2) = \langle \phi (P_1), \phi (P_2)\rangle _{\mathcal {F}} \end{aligned}$$for some, usually unknown, feature map \(\phi \) and inner product space \(\mathcal {F}\)

(VI)
Broad applicability, i.e. few application-specific parameters and no restrictions on the space X

(VII)
Computational efficiency
Note that properties (I) to (IV) formalise an intuitive notion of similarity. (I) says that shape and position matter. (II) requires that if two similar objects swap roles, it does not matter. (III) formalises that we do not care about differences in speed that much.^{Footnote 3}
Property (IV) demands robustness with respect to outlier trajectories. Further note that, for example, dynamic time warping (Bellman and Kalaba 1959) meets condition (III) very well, but does not comply with (V), (VI) and (VII), since it is not a kernel and only applicable if the underlying set is a metric space. Moreover, it is computationally expensive. On the other hand, the Hausdorff distance (Hausdorff 1962) satisfies (I), (III) and (VII), but it does not satisfy (II), (V) and (VI), since it is only applicable to metric spaces and sensitive to permutations. In addition, it is not a kernel. A Gaussian RBF kernel on the full vector of positions meets conditions (I) (restricted), (V) and (VII), but is not applicable to sequences of different lengths. Also, it does not comply with (II), (III) and (VI), since it is highly sensitive to variations in speed and permutations and is restricted to metric spaces.
3.3 Spatiotemporal convolution kernels for multitrajectories
In this section we develop a kernel on the space of (time-normalised) multi-trajectories \({\mathcal {P}}({[0,1]\times {\mathcal {X}}})\). Each of those trajectories consists of a set of snapshots associated with a relative time. The general idea is to perform a pairwise comparison of the snapshots in the two sets. Therefore, we first need a way to compare snapshots and, second, we need to know which snapshots of the two trajectory sets to compare with each other. For the latter, dynamic time warping (DTW) (Bellman and Kalaba 1959) seems to be a good choice, since it aligns the snapshots optimally in terms of similarity. Unfortunately, the obtained kernel is not positive definite, i.e. it does not correspond to an inner product in some Hilbert space. Although there is anecdotal evidence that learning with indefinite kernels can lead to good results in some applications (e.g. Ong et al. 2004), theory only supports the use of positive definite kernels. For many kernel machines there are error bounds and convergence criteria that can be straightforwardly applied to positive definite kernels but that do not hold for indefinite kernels (Blanchard et al. 2008; Lin 2001; Steinwart 2002).
Therefore, we propose a weighted comparison between every snapshot of the first trajectory and every snapshot of the second one, where the weights depend on the offset in relative time. Formally, this is done using an R-convolution kernel (Haussler 1999) on the two sets representing the trajectories. Convolution kernels are a general class of kernels on structured objects \(x,y \in X\).^{Footnote 4} The idea is to compare instances x and y by comparing their parts \((x_1,\ldots ,x_D),(y_1,\ldots ,y_D) \in X_1\times \cdots \times X_D\). Thus, a relation function R is needed to express that something is a part of some structure.
Definition 5
(Relation) Let \(X,X_1,\ldots ,X_D\) be arbitrary sets. Then a relation
$$\begin{aligned} R: X\times X_1\times \cdots \times X_D \rightarrow \{0,1\} \end{aligned}$$
is an arbitrary boolean function that returns 1 if and only if \((x_1,\ldots ,x_D)\) are parts of x. The set of parts of \(x\in X\) under relation R is denoted by
$$\begin{aligned} R^{-1}(x) = \left\{ (x_1,\ldots ,x_D) : R(x, x_1,\ldots ,x_D) = 1\right\} \end{aligned}$$
and R is called finite if \(R^{-1}(x)\) is finite for every \(x\in X\).
Definition 6
(R-convolution kernel) Let \(X,X_1,\ldots ,X_D\) be arbitrary sets. Let \(x,y \in X\) and \(R: X\times X_1\times \cdots \times X_D \rightarrow \{0,1\}\) be a finite relation. Moreover, let \(k_1,\ldots ,k_D\) be kernels on \(X_1,\ldots ,X_D\). Then the R-convolution kernel on X is defined by
$$\begin{aligned} k(x, y) = \sum _{(x_1,\ldots ,x_D) \in R^{-1}(x)} \; \sum _{(y_1,\ldots ,y_D) \in R^{-1}(y)} \; \prod _{d=1}^{D} k_d(x_d, y_d). \end{aligned}$$
The following theorem shows that an R-convolution kernel is indeed a (positive-definite) kernel.
Theorem 1
Let \(X,X_1,\ldots ,X_D\) be arbitrary sets. Let R be a finite relation and let \(k_1,\ldots ,k_D\) be kernels on \(X_1,\ldots ,X_D\). Then the R-convolution kernel k given by Definition 6 is a kernel.
Proof
For the proof we refer to Haussler (1999) Theorem 1 and Lemma 1, which are essentially more involved applications of closure properties of kernels. \(\square \)
In our case the structure is a multi-object trajectory and its parts are the snapshots, the times of the snapshots and the length of the trajectory. The relation between the structure and its elementary components is given by
$$\begin{aligned} R: {\mathbb {N}} \times [0,1] \times {\mathcal {X}} \times {\mathcal {P}}([0,1]\times {\mathcal {X}}) \rightarrow \{0,1\}. \end{aligned}$$
To extract the snapshots with their associated times and the length of the trajectory from the trajectory, R needs to be defined as follows:
$$\begin{aligned} R(n, t, x, P) = 1 \iff |P| = n \;\wedge \; (t, x) \in P. \end{aligned}$$
(1)
The first condition in Eq. 1 ensures that the given trajectory has the given length, while the second condition guarantees the occurrence of the given snapshot at the given time in the given trajectory. With this relation and the kernels to compare the parts the resulting convolution kernel is given by
$$\begin{aligned} k(P, Q) = \sum _{(n,t,x) \in R^{-1}(P)} \; \sum _{(m,s,y) \in R^{-1}(Q)} k_{{\mathbb {N}}}(n,m)\, k_{[0,1]}(t,s)\, k_{{\mathcal {X}}}(x,y) \end{aligned}$$
(2)
with \(R^{-1}(P) = \{(n,t,x): R(n,t,x,P) =1 \}\). The term \(k_{{\mathbb {N}}}\) accounts for differences in length of the trajectories by normalisation, i.e. \(k_{{\mathbb {N}}}(n,m) = \frac{1}{nm}\). Note that \(k_{{\mathbb {N}}}\) is a kernel on \({\mathbb {N}}\), since it corresponds to the standard inner product under the feature map \(\phi : {\mathbb {N}} \rightarrow {\mathbb {R}}\) given by \(\phi : n \mapsto \frac{1}{n}\). Finally, the R-convolution kernel simplifies to
$$\begin{aligned} k(P, Q) = \frac{1}{|P|\,|Q|} \sum _{(t,x) \in P} \; \sum _{(s,y) \in Q} k_{[0,1]}(t,s)\, k_{{\mathcal {X}}}(x,y). \end{aligned}$$
(3)
Theorem 2
The spatiotemporal convolution kernel (Eq. 3) is a kernel if the temporal kernel \(k_{[0,1]}\) and the spatial kernel \(k_{{\mathcal {X}}}\) are kernels themselves.
Proof
By Theorem 1 we need to show that R is finite and that the component kernels \(k_{\mathbb {N}}\), \(k_{[0,1]}\) and \(k_{{\mathcal {X}}}\) are indeed kernels. First, for all \(P \in {\mathcal {P}}([0,1]\times {\mathcal {X}})\) it holds that \(|R^{-1}(P)| = |P| < \infty \) by Definition 3, so R is finite. Second, \(k_{[0,1]}\) and \(k_{{\mathcal {X}}}\) are kernels by assumption and we have just shown that \(k_{\mathbb {N}}\) is also a kernel. Hence, the spatiotemporal convolution kernel is a kernel.\(\square \)
As indicated, Eq. 3 can be interpreted such that all snapshots of the two trajectories are compared with each other, but weighted by their offset in time. Thereby, snapshots which occur at different relative times have a low contribution to the overall similarity, while snapshots at the same relative time have a high contribution. This is why \(k_{[0,1]}\) is sometimes also referred to as a weight function. The definition of k in Eq. 3 leaves two degrees of freedom:

Spatial kernel \(k_{{\mathcal {X}}}\): The choice of the snapshot kernel determines which snapshots are similar.

Temporal kernel \(k_{[0,1]}\): The choice of the temporal kernel determines the way in which the snapshots of two sequences are combined, and thus the importance of ordering and speed.
In the following subsections we develop and compare different spatial and temporal kernels.
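To make the modular structure concrete, the following sketch evaluates Eq. 3 for object-oriented snapshots, instantiating both degrees of freedom with Gaussian kernels; the bandwidths tau and sigma are illustrative choices of ours, not values from the paper:

```python
import math

def temporal_kernel(t, s, tau=0.1):
    # Weight on the offset in relative time; a Gaussian RBF is one possible choice.
    return math.exp(-((t - s) ** 2) / (2 * tau ** 2))

def spatial_kernel(x, y, sigma=1.0):
    # Object-wise comparison: average RBF similarity of corresponding 2-D objects.
    return sum(
        math.exp(-((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) / (2 * sigma ** 2))
        for a, b in zip(x, y)
    ) / len(x)

def st_convolution_kernel(P, Q, k_t=temporal_kernel, k_x=spatial_kernel):
    """Eq. 3: all-pairs comparison of snapshots, weighted by the temporal
    kernel and normalised by 1/(|P| |Q|)."""
    return sum(k_t(t, s) * k_x(x, y) for t, x in P for s, y in Q) / (len(P) * len(Q))
```

Swapping in a different `temporal_kernel` or `spatial_kernel` changes the notion of similarity without touching the surrounding kernel machine, which is exactly the modularity argued for above.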
3.4 Spatial kernels
A spatial kernel compares two snapshots in \({\mathcal {X}}\). Corresponding to the two definitions of the snapshot in Definition 1 (object-oriented) and Definition 2 (group-oriented), two types of kernels are introduced here as well.
Object-wise comparison These kernels correspond to the object-oriented snapshot by simply comparing the positions of the objects \({\mathcal {O}}\) and summing up their similarity:
$$\begin{aligned} k_{{\mathcal {X}}}(x_t, y_s) = \frac{1}{N} \sum _{o \in {\mathcal {O}}} k_{X}\big (x_t(o), y_s(o)\big ) \end{aligned}$$
(4)
Note that Eq. 4 is a kernel, since kernels are closed under direct sum and multiplication by a positive constant (see Shawe-Taylor and Cristianini (2004) Proposition 3.22).^{Footnote 5} Since kernels are also closed under direct product, technically a product could have been used instead of the sum in Eq. 4. However, a product of kernels leads to vanishing similarities if only one object is dissimilar to its counterpart. This is counterintuitive, as two snapshots are the more similar the more objects match well. It is thus an inherently additive relationship. For the same reason a product of kernels is more vulnerable to noise and outliers and generally less robust than a sum of kernels. The elementary kernel \(k_{X}\) can be any kernel comparing two positions in X. Table 1 lists common kernels for finite as well as continuous snapshot spaces.
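The robustness argument for the sum over the product can be checked numerically. In the sketch below (our own illustration, with an RBF base kernel) a single outlier object reduces the sum kernel by at most 1/N, while it drives the product variant towards zero:

```python
import math

def rbf(p, q, sigma=1.0):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / (2 * sigma ** 2))

def sum_kernel(x, y, sigma=1.0):
    # Normalised sum of per-object similarities (the additive combination).
    return sum(rbf(p, q, sigma) for p, q in zip(x, y)) / len(x)

def product_kernel(x, y, sigma=1.0):
    # Hypothetical product variant, shown only to illustrate its fragility.
    result = 1.0
    for p, q in zip(x, y):
        result *= rbf(p, q, sigma)
    return result

x = [(0.0, 0.0), (1.0, 1.0)]
y = [(0.0, 0.0), (50.0, 50.0)]  # second object is a far outlier
print(sum_kernel(x, y), product_kernel(x, y))
```

With one of two objects matching perfectly, the sum kernel stays near 0.5 whereas the product collapses to practically zero, so a single tracking error would dominate the similarity.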
Kernels as in Eq. 4 have two major shortcomings. First, they penalise permutations of objects. For example, two snapshots of two objects with swapped positions will have low similarity, although in terms of the group motion they are alike. One way to address permutations is to explicitly maximise the similarity of the two snapshots with respect to all possible permutations of objects. Due to the high computational costs of considering all possible permutations^{Footnote 6}, this is infeasible. In addition, the bandwidth parameter \(\sigma \) that controls the interval of high sensitivity, i.e. large values of \(\left| {\mathrm {d}k}/{\mathrm {d}(\Vert x-y\Vert )}\right| \), of the kernel has to be known beforehand. This is critical since one usually does not know on which scale significant deviations appear. Both issues are addressed by the group-wise comparison of objects.
Group-wise comparison In case of the group-oriented snapshot there is a partition of \({\mathcal {O}}_t\) into K sets \(O_{x_t}(g_1), O_{x_t}(g_2), \ldots , O_{x_t}(g_K)\). Instead of comparing N objects, K groups of objects are compared, which leads to the following definition of a spatial kernel
$$\begin{aligned} k_{{\mathcal {X}}}(x_t, y_s) = \frac{1}{K} \sum _{g \in {\mathcal {G}}} k_{G}\big (x_t(g), y_s(g)\big ) \end{aligned}$$
with kernel \(k_G\) applied on sets of positions. For the same reasons as in the previous paragraph, a sum of kernels is preferable to a product of kernels. Definition 2 allows for zero group members in a snapshot, i.e. \(x_t(g) = \emptyset \). For all kernels defined below this special case is dealt with as follows:
The definition of \(k_G\) depends heavily on the underlying set of positions X, for which there are two main cases: either X is a metric space, i.e. there exists a distance measure between all positions, or X is a finite set in which no meaningful distance between positions can be defined.^{Footnote 7} In the case of a metric space we will subsequently only consider \(X={\mathbb {R}}^2\) with the standard Euclidean distance, as it is the most common in applications and can be directly extended to \(X = {\mathbb {R}}^N\).
Euclidean space: \(X = {\mathbb {R}}^{2}\) In the case of \(X = {\mathbb {R}}^2\) a straightforward approach is to use one of the kernels from Table 1 to compare the group centroids
$$\begin{aligned} \mu _{x_t}(g) = \frac{1}{|O_{x_t}(g)|} \sum _{o \in O_{x_t}(g)} x_t(o), \end{aligned}$$
which leads to
$$\begin{aligned} k_{G}\big (x_t(g), y_t(g)\big ) = k_{X}\big (\mu _{x_t}(g), \mu _{y_t}(g)\big ). \end{aligned}$$
Besides the need of defining the width parameter \(\sigma \), disregarding the distribution of the objects around their centroids constitutes a disadvantage. To remedy both deficiencies, two Gaussian distributions are fitted to \(x_t(g)\) and \(y_t(g)\), respectively, which we denote as \(f_{x_t}(g)\) and \(f_{y_t}(g)\). These distributions are defined by their means \(\mu _{x_t}(g)\) and \(\mu _{y_t}(g)\) as defined above and their covariance matrices defined by
$$\begin{aligned} \varSigma _{x_t}(g) = \frac{1}{|O_{x_t}(g)|} \sum _{o \in O_{x_t}(g)} \big (x_t(o) - \mu _{x_t}(g)\big )\big (x_t(o) - \mu _{x_t}(g)\big )^{\top }. \end{aligned}$$
If the covariance matrix is ill-conditioned or singular^{Footnote 8}, a simple shrinking scheme with shrinkage parameter \(\alpha \) can be applied to achieve non-singularity:
$$\begin{aligned} \tilde{\varSigma } = (1-\alpha )\, \varSigma + \alpha \, \frac{{{\mathrm{Tr}}}(\varSigma )}{2}\, {\mathbb {I}}_2, \end{aligned}$$
where \({\mathbb {I}}_2\) is the two-dimensional identity matrix. There exist different strategies (Ledoit and Wolf 2004; Chen et al. 2010) to choose an optimal value for \(\alpha \), but for our purposes it suffices to deploy a constant 0.1. In the case of \({{\mathrm{Tr}}}(\varSigma ) = 0\) the following scheme is used
$$\begin{aligned} \tilde{\varSigma } = \sigma _{\text {MIN}}^2 \, {\mathbb {I}}_2, \end{aligned}$$
where \(\sigma _{\text {MIN}}^2\) is an application-specific parameter. Strategies for choosing \(\sigma _{\text {MIN}}^2\) are discussed in Sect. 4. Reasons for the matrix to be singular are \(|O_{x_t}(g)| \le 2\) or strong collinearity of the samples. The two Gaussian distributions \(f_{x_t}(g)\) and \(f_{y_t}(g)\) are then compared using a probability product kernel (Jebara et al. 2004), i.e.
$$\begin{aligned} k_{G}\big (x_t(g), y_t(g)\big ) = k^{\text {prod}}\big (f_{x_t}(g), f_{y_t}(g)\big ). \end{aligned}$$
A probability product kernel compares two probability distributions, i.e. their density functions p and \(p'\), defined on a common probability space \((\varOmega , \mathcal {A}, \mu )\). It is defined as follows.
Definition 7
(Probability product kernel) Let p and \(p'\) be probability density functions defined on the same probability space \((\varOmega , \mathcal {A}, \mu )\). Assume that \(p^\rho ,p'^\rho \in L^2(\varOmega )\) for \(\rho \in {\mathbb {R}}^+\). Then the probability product kernel (PPK) \(k^{\text {prod}}: L^2(\varOmega ) \times L^2(\varOmega ) \rightarrow {\mathbb {R}}\) is given by
$$\begin{aligned} k^{\text {prod}}(p, p') = \int _{\varOmega } p(\omega )^{\rho }\, p'(\omega )^{\rho } \, \mathrm {d}\mu (\omega ). \end{aligned}$$
Lemma 1
\(k^{\text {prod}}\) is a (positive-definite) kernel.
Proof
Consider the feature map \(\phi : L^2(\varOmega ) \rightarrow L^2(\varOmega )\) given by \(\phi (p) = p^{\rho }\),
which is well-defined because of the above \(L^2\) property. Then \(k^{\text {prod}}(p,p') = \langle \phi (p), \phi (p')\rangle _{L^2(\varOmega )}\),
and \(k^{\text {prod}}\) is a kernel corresponding to the scalar product in \(L^2(\varOmega )\).
\(\square \)
In this study we will use \(\rho = 1/2\), which has the important property that \(\text {Im}\left( k^{\text {prod}}\right) = [0,1]\) and \(k^{\text {prod}}(p,p) = 1\). For two two-dimensional Gaussian distributions and \(\rho = 1/2\) the probability product kernel permits the following closed form:
with \(\varSigma ^* = (\varSigma _{x_t}^{-1}+\varSigma _{y_t}^{-1})^{-1}\) and \(\mu ^{*} = \varSigma _{x_t}^{-1}\mu _{x_t} + \varSigma _{y_t}^{-1}\mu _{y_t}\).^{Footnote 9} Besides its sensitivity to the distribution of the objects inside the group, \(k^{\text {prod}}\) has the advantage that no width parameter needs to be chosen. The kernel essentially adapts to the scale of the group.
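For \(\rho = 1/2\) this closed form coincides with the Bhattacharyya coefficient of the two Gaussians, which yields a compact sketch; the coefficient form \(\exp (-D_B)\) is used here instead of the \(\varSigma ^*, \mu ^*\) parametrisation above, and the covariance matrices are assumed to be already regularised and non-singular.

```python
import math

def det2(S):
    return S[0][0] * S[1][1] - S[0][1] * S[1][0]

def ppk_gauss(mu1, S1, mu2, S2):
    """PPK with rho = 1/2 between two 2D Gaussians, computed as the
    Bhattacharyya coefficient exp(-D_B) with
    D_B = (1/8) d^T Sbar^{-1} d + (1/2) ln(|Sbar| / sqrt(|S1| |S2|)),
    Sbar = (S1 + S2) / 2 and d = mu1 - mu2."""
    Sbar = [[(S1[i][j] + S2[i][j]) / 2.0 for j in range(2)]
            for i in range(2)]
    db = det2(Sbar)
    Sbar_inv = [[Sbar[1][1] / db, -Sbar[0][1] / db],
                [-Sbar[1][0] / db, Sbar[0][0] / db]]
    d = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
    quad = sum(d[i] * Sbar_inv[i][j] * d[j]
               for i in range(2) for j in range(2))
    dist = quad / 8.0 + 0.5 * math.log(db / math.sqrt(det2(S1) * det2(S2)))
    return math.exp(-dist)
```

As stated above, the kernel attains the value one exactly when both distributions coincide.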
Finite Case: \(|X| = L< \infty \) For finite \(X = \{z_1,\ldots ,z_L\}\), we can generalise to the group-oriented snapshot kernel by comparing the number of objects of each group at each possible location in X:
where \({\mathbb {I}}_A(x) = 1\) if \(x \in A\) and 0 otherwise denotes the indicator function and \(n_z(x_t(g)) = \left| \{o \in O_{x_t}(g) : x_t(o) = z\}\right| \) denotes the number of objects of group \(g \in {\mathcal {G}}\) at position \(z \in X\) at time \(t \in [0,1]\).
Lemma 2
\(k_{G}\) as defined in Eq. 7 is a kernel.
Proof
Let \(l^2(0,1)\) be the Hilbert space of 0/1-sequences equipped with the standard scalar product. Consider \(\phi : X \rightarrow l^2(0,1)^{L}\) with
for \(l = 1,\ldots ,L\) and \(m = 0,1,2,\ldots \) Using this feature mapping, \(k_{G}\) can be rewritten as
showing that it is indeed a kernel. Note that the inner product of a product space is defined by the sum of the inner products of its components. \(\square \)
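Reading Eq. 7 through the feature map of the proof, \(k_G\) counts the locations at which both snapshots place the same number of group objects. A minimal sketch, assuming per-snapshot count dictionaries and a normalisation by L (the normalisation is an assumption, suggested by the remark below that almost-empty snapshots yield similarities close to one):

```python
def k_group_finite(counts_x, counts_y, locations):
    """Sketch of k_G over a finite location set: the fraction of
    locations at which both snapshots place the same number of
    objects of the group (counts are dicts z -> object count)."""
    agree = sum(1 for z in locations
                if counts_x.get(z, 0) == counts_y.get(z, 0))
    return agree / len(locations)
```
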
If the number of possible locations is much higher than the number of objects, the above kernel (Eq. 7) is inappropriate, as it will return similarities close to one for every pair of snapshots because in both snapshots there are no objects at almost all locations. Alternatively, the idea of a probability product kernel can also be applied to the finite setting by fitting a multinomial distribution to the positions of the group and comparing the two resulting distributions \(p_{x_t}(g)\) and \(p_{y_t}(g)\), which are identified by their outcome probabilities \((p_{x_t,1}(g),\ldots ,p_{x_t,L}(g))\). For \(\rho = 1/2\) the probability product kernel on the multinomial distributions is derived as follows^{Footnote 10},
with the maximum-likelihood estimation for the outcome probabilities of the multinomial distributions given by
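A sketch of the multinomial variant, assuming the closed form for \(\rho = 1/2\) reduces to the Bhattacharyya coefficient \(\sum _l \sqrt{p_l\, p'_l}\) of the fitted outcome probabilities, with the maximum-likelihood estimates being relative frequencies (the derivation behind the equation above is not reproduced here):

```python
import math
from collections import Counter

def ppk_multinomial(positions_x, positions_y):
    """PPK with rho = 1/2 on multinomial distributions fitted by
    maximum likelihood (relative frequencies), compared via
    sum_z sqrt(p_z * p'_z) over the shared locations."""
    nx, ny = Counter(positions_x), Counter(positions_y)
    Nx, Ny = len(positions_x), len(positions_y)
    return sum(math.sqrt((nx[z] / Nx) * (ny[z] / Ny))
               for z in nx.keys() & ny.keys())
```
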
3.5 Temporal kernels
The temporal kernel \(k_{[0,1]}\) is the simpler component of the spatiotemporal convolution kernel because the underlying set is fixed to the one-dimensional interval [0, 1]. We briefly discuss possible options for the temporal kernel and their implications.
The simplest temporal kernel is a constant kernel \(k_{[0,1]}(t,s) = 1\) with the consequence that the spatiotemporal convolution kernel (Eq. 3) collapses to a set kernel on the two sets of snapshots, thus ignoring order. A uniform or interval kernel given by
could be used to compare every snapshot of the first trajectory only to those snapshots of the second trajectory that are close in (relative) time. The choice of w determines how close in time two snapshots have to be to be taken into consideration. Finally, Gaussian kernels of the form
compare every snapshot of the first trajectory to every snapshot of the second trajectory and give more importance to closer events (in relative time). Depending on the application at hand, there may be many more options to define temporal kernels.
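The three temporal kernels can be sketched directly. The width w of the interval kernel and its boundary convention are placeholders; \(\sigma _T = 0.5\) is the value used in Sect. 4.

```python
import math

def k_constant(t, s):
    """Constant kernel: collapses the STCK to a set kernel on the
    snapshots, ignoring temporal order."""
    return 1.0

def k_interval(t, s, w=0.1):
    """Uniform/interval kernel: 1 if the relative times are within
    distance w (width and boundary convention are assumptions)."""
    return 1.0 if abs(t - s) <= w else 0.0

def k_gauss(t, s, sigma_t=0.5):
    """Gaussian temporal kernel on relative times in [0, 1]."""
    return math.exp(-(t - s) ** 2 / (2.0 * sigma_t ** 2))
```
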
3.6 Approximation techniques
To compute the Gram matrix of a dataset of N trajectories, \(O(N^{2} L^{2})\) spatial as well as temporal kernels need to be evaluated, where L is the maximal length of a sequence. Naturally, the evaluation of the spatial kernels dominates that of the temporal kernels, mainly due to the higher dimensionality of their inputs as well as the complexity of the kernels themselves.
Generally, there exist two ways to speed up the algorithm. First, the number of similarities that have to be computed is reduced to shrink the factor \(O(N^2)\); afterwards, the missing entries in the Gram matrix are reconstructed. Second, the computation of each similarity is accelerated to reduce the term \(O(L^2)\). Regarding the first approach, there exist methods to reconstruct incomplete Gram matrices, most notably the Nyström method (see Fowlkes et al. 2004; Williams and Seeger 2001). These methods have the disadvantage of usually needing a constant fraction of all entries to perform well (Wauthier et al. 2012). It follows that they do not improve the asymptotic complexity of the computation. We focus on the second approach, since it is specific to spatiotemporal convolution kernels and potentially improves the asymptotic complexity.
As stated before, the computation of the temporal kernel is significantly faster than that of the spatial kernel. Thus, a natural way to approximate these kernels is to first evaluate all temporal kernels and then to evaluate only those spatial kernels with the highest contribution. Depending on the lengths of the two trajectories at hand, the number of overall kernel evaluations varies between pairs of trajectories. Therefore, when approximating the kernel, one has to take care that all entries of the Gram matrix are approximated at a similar speed to avoid distortions. To this end, we propose a percental approximation algorithm (Algorithm 1, Approximated STCK (ASTCK)), which in addition to the two trajectories takes the maximal length of all trajectories L as an input. The algorithm first evaluates all temporal kernels. Subsequently, in each iteration a fraction \(\frac{1}{L}\) of the spatial kernels is computed, starting with the most relevant ones according to the evaluation of the temporal kernels. The order of evaluation is illustrated in Table 2. This algorithm leads to a complexity of
\(O\big (N^{2}\,(L^{2}\, C_T + I\, L\, C_S)\big )\), where \(C_T\) is the complexity of evaluating the temporal kernel, \(C_S\) is the complexity of evaluating the spatial kernel, and I is the number of iterations of the approximation algorithm. The parameter \(1\le I \le L\) needs to be chosen by the user; in practice, small values of I often lead to highly accurate results.
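A sketch of the percental approximation, under the assumption that Eq. 3 is a double sum over snapshot pairs weighted by the temporal kernel; the exact evaluation schedule of Algorithm 1 (illustrated in Table 2) is replaced here by a simple global ordering, and pairs that are never reached are dropped.

```python
import math

def rel_times(n):
    """Relative times of n snapshots, spread over [0, 1]."""
    return [i / max(n - 1, 1) for i in range(n)]

def astck(X, Y, k_T, k_S, L, iterations):
    """Approximated STCK: evaluate all temporal kernels first, order
    the snapshot pairs by temporal relevance, and evaluate a fraction
    1/L of the spatial kernels per iteration (an assumption; the
    paper's Algorithm 1 fixes the exact schedule)."""
    tx, ty = rel_times(len(X)), rel_times(len(Y))
    pairs = [(k_T(tx[i], ty[j]), i, j)
             for i in range(len(X)) for j in range(len(Y))]
    pairs.sort(reverse=True)  # most relevant pairs first
    budget = min(math.ceil(len(pairs) / L) * iterations, len(pairs))
    return sum(w * k_S(X[i], Y[j]) for w, i, j in pairs[:budget])
```

With \(I = L\) iterations every spatial kernel is evaluated and the exact kernel value is recovered.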
3.7 Online application
STCKs are particularly well-suited for online analyses such as real-time computations where the objects of interest are still in motion. When a new measurement is added to a trajectory, Eq. 3 can be updated efficiently: all previous spatial kernel evaluations remain unchanged in the sum, so only the spatial kernels involving the new snapshot and the temporal kernel values, which are usually inexpensive, need to be computed.
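The online update can be sketched as follows, again assuming Eq. 3 is a double sum over snapshot pairs: spatial kernel values are cached when a snapshot arrives, and only the inexpensive temporal kernels are re-evaluated (they depend on the relative times, which change with the trajectory length).

```python
class OnlineSTCK:
    """Incremental STCK between a growing trajectory X and a fixed
    trajectory Y: spatial kernel rows are cached per snapshot, the
    cheap temporal kernels are recomputed on demand."""

    def __init__(self, Y, k_T, k_S):
        self.Y, self.k_T, self.k_S = Y, k_T, k_S
        self.X = []
        self.spatial = []  # cached rows of k_S(x_t, y_s)

    def add_snapshot(self, x):
        """Append a new measurement and cache its spatial row."""
        self.X.append(x)
        self.spatial.append([self.k_S(x, y) for y in self.Y])

    def value(self):
        """Current kernel value: temporal-weighted double sum."""
        nX, nY = len(self.X), len(self.Y)
        return sum(self.k_T(t / max(nX - 1, 1), s / max(nY - 1, 1))
                   * self.spatial[t][s]
                   for t in range(nX) for s in range(nY))
```
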
4 Empirical evaluation
In this section we empirically compare our spatiotemporal convolution kernels to baseline approaches using artificial and real data sets. We focus on clustering tasks and k-medoids (Kaufman and Rousseeuw 1987) as the underlying learning algorithm. The temporal kernel is always a Gaussian kernel, which we combine with three different spatial kernels: an object-wise Gaussian RBF kernel (\(\text {STCK}_{\text {11}}\)), a Gaussian RBF kernel on the group means (\(\text {STCK}_{\text {Mean}}\)), and a probability product kernel on the fitted Gaussian distributions (\(\text {STCK}_{\text {Dist}}\)).
For the latter, the group- and application-specific parameter \(\sigma _{\text {MIN}}^2\), which is sometimes needed to restore non-singularity of the covariance matrix (see Sect. 3.4), is set to the average distance between two objects of the group. If the group consists of only one object, it is set to the average distance between two objects of the most similar group. The width parameter \(\sigma _T\) of the temporal Gaussian kernel is set to 0.5 to balance invariance to speed differences and sensitivity to direction. The width parameter \(\sigma _S\) of the spatial Gaussian RBF kernel is set to 0.2.
We deploy three baselines. First, Junejo et al. (2004) is straightforwardly extended to the multiobject scenario, i.e. the Hausdorff distance is applied to the sets of positions of the trajectories. Instead of the hierarchical clustering employed in Fu et al. (2005), kernelised k-medoids is used. The second baseline is inspired by Wang et al. (2008): we use a bag-of-positions as well as a bag-of-directions representation for the trajectories of each group. To keep the setup simple, we use a multinomial mixture model (MNMM) and expectation maximisation for clustering instead of a semantic topic model like dual-HDP (Wang et al. 2008). Third, we also compare our method to dynamic time warping with a probability product kernel (\(\text {DTW}_{\text {dist}}\)) serving as local distance measure; this method applies the probability product kernel to the fitted Gaussian distributions. The number of clusters is determined using the silhouette measure (Rousseeuw 1987), the Hartigan index (Hartigan 1975) as well as next-neighbour consistency for all methods.
In the next section, we measure the alignment between predicted and ground-truth clusterings using artificially generated data. The alignment between two groupings is captured by the Rand Index (Rand 1971); however, as two random clusterings may attain a non-zero Rand Index by chance, we resort to the Adjusted Rand Index (Hubert and Arabie 1985), given by
\(\text {AR}(S,T) = \frac{R(S,T)-{\mathbb {E}}\left[ R\right] }{\max (R)-{\mathbb {E}}\left[ R\right] }\), where R denotes the Rand Index, \({\mathbb {E}}\left[ R\right] \) its expected value for random clusterings, and S and T are the clusterings to compare.
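The Adjusted Rand Index can be computed by standard pair counting over the contingency table of the two label vectors; a sketch (`math.comb` requires Python 3.8+):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(S, T):
    """Adjusted Rand Index between two clusterings, given as equal-
    length lists of cluster labels (Hubert and Arabie 1985)."""
    n = len(S)
    pairs = comb(n, 2)
    contingency = Counter(zip(S, T))
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(S).values())
    sum_b = sum(comb(c, 2) for c in Counter(T).values())
    expected = sum_a * sum_b / pairs          # E[R] under chance
    max_index = (sum_a + sum_b) / 2.0         # max(R)
    return (sum_ij - expected) / (max_index - expected)
```

The index is one for identical partitions (up to label permutation) and close to zero for independent random clusterings.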
4.1 Artificial data
Recall that our spatiotemporal convolution kernels possess basic properties^{Footnote 11} that are not shared by the baseline methods. Most notably, these are invariance to permutations of the objects and to differences in speed, as well as sensitivity to the spatial distribution of the objects and to the direction of the movement.
In this section these properties are experimentally confirmed using two toy data sets consisting of artificially generated multiobject trajectories. The data is generated to cover a broad range of trajectories present in realworld applications. The first setting covers linear movements and each trajectory consists of five objects performing a linear movement as shown in Fig. 1 (top). To find the correct clusters, the respective kernels need to distinguish trajectories based on the direction of the movement or based on the distribution of the objects, while the objects’ centroid is the same in both trajectories.
The second setting deals with circular movements and each trajectory consists of four objects moving in circles of different radii, see Fig. 1 (bottom). The construction of the second set enables us to evaluate the ability of the kernels to identify the correct clusters when these only differ in the direction of the movement but a spatial separation between the clusters is not possible. Compared to linear movements, the circular task is more difficult, as the directions of only some objects differ between the clusters. The data generation is described in detail in “Appendix” section.
To evaluate the sensitivity of the methods to permutations, the objects' ordering inside each multiobject trajectory is permuted randomly in both data sets. To assess the sensitivity to changes in speed, we flip a coin for each trajectory in both test sets: with 50 % probability, only every second position is retained in the trajectory. The resulting trajectory corresponds to a movement with twice the speed of the original trajectory. Finally, we add zero-mean, uniformly distributed noise from the range \([-\epsilon , \epsilon ]\) for \(\epsilon \in \{0.005, 0.025, 0.05, 0.1, 0.25\}\) to every position in each multiobject trajectory in both test sets.
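The two perturbations above can be sketched as plain list operations; the coin flip is passed in explicitly so that the subsampling step itself stays deterministic:

```python
import random

def maybe_double_speed(trajectory, coin):
    """If the coin flip succeeds, retain only every second position:
    the same path traversed at twice the speed."""
    return trajectory[::2] if coin else list(trajectory)

def add_noise(trajectory, eps, rng=random):
    """Add uniformly distributed noise from [-eps, eps] to every
    coordinate of every position."""
    return [[c + rng.uniform(-eps, eps) for c in pos]
            for pos in trajectory]
```
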
For both data sets, we generate 200 multitrajectories (50 per cluster) for all five levels of noise. It is important to assess the methods’ sensitivity to noise for two reasons. First, trajectories are generally highly variable and exact matches are very rare. Second, due to frequent tracking errors, reallife trajectories are also subject to a significant level of noise.
In this experiment, we focus on \(\text {STCK}_{\text {Dist}}\) as it leads to more accurate clusters than \(\text {STCK}_{\text {11}}\) and \(\text {STCK}_{\text {Mean}}\) across all noise levels. Figure 2 compares \(\text {STCK}_{\text {Dist}}\) to the three baselines. Our method outperforms dynamic time warping on both test sets and gives the most accurate clusterings. The multinomial mixture model performs reasonably well on linear movements but leads to inappropriate clusterings for circular ones; the poor results on the second data set are due to the absence of ordering information in the bag-of-directions representation when the objects move in a full circle. The Hausdorff-distance-based baseline of Junejo et al. (2004) leads to generally inaccurate clusterings. We attribute this finding to its sensitivity to permutations as well as its negligence of ordering information.
We now evaluate the proposed approximation technique on both artificial data sets. The algorithm performs up to eleven iterations, as the longest trajectory consists of eleven snapshots; at the end of the last iteration, the exact Gram matrix is obtained. The percental approximation proves accurate and already leads to the correct clustering after two iterations. Figure 3 shows that, from the second iteration onwards, ASTCK also achieves higher consistencies with the correct clustering than all baseline methods and is always at least as good as the exact method. The depicted results correspond to a noise level of 0.005, but equivalent results hold for all other noise ratios.
The runtime of multiobject trajectory clustering is governed by the number of trajectories, the length of the trajectories and the number of objects. Depending on the application, usually one or two dimensions are dominating. It is thus important to evaluate the runtime for each of the three dimensions. The theoretical runtime is \(O(N^{2}\cdot L^{2})\) spatial as well as temporal kernel evaluations, where N is the number of trajectories and L is the maximum length of the trajectories. To confirm these findings experimentally, we generate random trajectory sets in \([0,1]^2\) and vary the number of trajectories per set, the length of the trajectories and the number of objects per trajectory. The results are depicted in Fig. 4.
The figure on top shows the results for varying numbers of trajectories. Except for the multinomial mixture model, which scales linearly in the number of instances, all methods exhibit a quadratic complexity, simply because all \({N(N+1)}/{2}\) pairwise similarities are computed. Varying the length of the trajectories in Fig. 4 (bottom, left) shows that all STCKs as well as DTW exhibit quadratic complexities, in line with the theoretical runtime. However, the percental approximation algorithm (two iterations) significantly improves the runtime of STCKs, making it comparable to that of Junejo et al. and the multinomial mixture model.
Finally, Fig. 4 (bottom, right) varies the number of objects. The time complexities are virtually constant for all methods except \(\text {STCK}_{\text {11}}\) and MNMM, which exhibit linear complexities. This observation is in line with our expectations, as the Gaussian RBF kernel compares each object with its counterpart. The deviations in the case of \(\text {STCK}_{\text {Dist}}\) for small numbers of objects (two or fewer) are due to the additional time needed by the shrinking schemes to restore non-singularity of the covariance matrices.
4.2 Realworld data
We now evaluate spatiotemporal convolution kernels on real world data using positional data streams of ten soccer games of the German Bundesliga. The goal is to identify movement patterns by analysing the tracking data.
The tracking data is captured by the VIS.TRACK Impire (2014) tracking system during five games of Bundesliga Team A and five games of Bundesliga Team B from the 2011/12 Bundesliga season.^{Footnote 12} VIS.TRACK is a video-based tracking system consisting of several cameras in the soccer stadium. It determines the positions of the players, ball and referees at 25 frames per second, which amounts to roughly 135,000 positions per object and match and a total of 31,000,000 positions. Additionally, a marker indicates the status of the ball (in-play, stoppage) and the possession of the ball. The range of the x- (parallel to the sidelines) and the y-coordinate (parallel to the end lines) is \([-1,1]\); values with an absolute value greater than one occur if the ball is out of bounds. The data stream is preprocessed so that positions of the second half are mirrored to account for the changeover at half time, and the frame numbers of the second half are changed to succeed those of the first half. Additionally, the playing direction is determined and normalised so that the team of interest always plays from left to right. Subsequently, we extract two types of sequences: game initiations and scoring opportunities.
Game initiations (GI) begin with the goalkeeper passing the ball and end with the team losing possession, a stoppage, the ball being in the attacking third of the field, or the start of the next game initiation as defined above. Sequences shorter than length 12 are excluded. Scoring opportunities (SO) begin at the time of the last stoppage or win of the ball and last until the ball is carried into a predefined zone of danger, the attacking quarter of the field \([0.5, 1]\times [-1,1]\). Again, sequences with a length below 12 are discarded.
For every sequence there are 23 possible trajectories (ball, 22 players, no referees) to include into the analysis. Since the opposing team changes from game to game, we will restrict the analysis to twelve objects, namely the ball and the players of Team A, and Team B respectively. In the following experiments we consider the ball as one group in the sense of Definition 2. We further include the back four (game initiations) and the four most offensive players (scoring opportunities) as a second group (in the sense of Definition 2) into the analysis. The clusterings in the remainder are thus based on the trajectories of five objects (ball, four players).
In general, there is no ground-truth for real-world clustering problems. Figure 5 therefore depicts the adjusted Rand indices between the five methods for pairwise comparison. The adjusted Rand index is low in the majority of the cases, indicating low consistency between the methods. On the other hand, the higher consistency between \(\text {STCK}_{\text {dist}}\) and \(\text {DTW}_{\text {dist}}\) demonstrates that our method is capable of dealing with speed and length differences in a similar way as dynamic time warping. This is also partially due to the use of the same kernel in both methods, once as spatial kernel and once as local distance measure.
Figure 6 (left) shows the average Silhouette measures of the four similarity-based methods over the four datasets. Our \(\text {STCK}_{\text {dist}}\) achieves the highest score, indicating higher cluster separation and/or compactness. The object-wise STCK provides the poorest results on average, indicating that permutations are relevant in this application and that the distribution-based representation of the probability product kernel is more successful in capturing the relevant player movements. Dynamic time warping performs second best on average, in line with the previous results on the artificial datasets; also note its high consistency with our method in Fig. 5. In terms of 5-nearest-neighbour consistency, the two distribution-based methods, \({STCK}_{{dist}}\) and \({DTW}_{{dist}}\), perform best, as depicted in Fig. 6 (right). Cluster quality is generally higher for game initiations than for scoring opportunities, which also results in more interpretable clusters for these settings, as we will show in the remainder.
Figures 7 and 8 illustrate the resulting clusters for the five methods. The medoids of the clusters for both teams are depicted along with the distribution of the trajectory length for each cluster. For all settings cluster membership correlates with sequence length. Generally, the cluster medoids of \(\text {STCK}_{\text {dist}}\) and \(\text {DTW}_{\text {dist}}\) often coincide and are easily interpretable, while the cluster medoids of \(\text {STCK}_{\text {11}}\) and the cluster representatives of the multinomial mixture model are difficult to make sense of.
In the Bundesliga 2011/2012 season, the strategy of Team A generally consisted of transporting the ball to the opposing half with few, but rehearsed, short game initiations. For this purpose, many ball contacts were allowed and different players were integrated. On the contrary, Team B showed a rather chaotic game organisation with rather random actions and increasingly long, straight balls. Figure 7 shows the outcomes of the k-medoids clustering for the game initiations. Compared to all other methods, \(\text {STCK}_{\text {dist}}\) captures the characteristic traits of the teams well. Team A clearly acted with many short moves (long trajectories in cluster 1) and integrated many players in the playmaking (cluttered medians). By contrast, Team B acted with many long moves (short trajectories in all clusters) and preferred linear actions.
Near the opposing goal, Team A aimed at quickly achieving a goal in the opposing half during the 2011/2012 season. They operated with only a few ball contacts and aimed to transport the ball quickly into the predefined zone of danger. Again in contrast to this, Team B had many ball contacts, took their time waiting for a mistake of the opponent, and only then played into the zone of danger to achieve a goal. Figure 8 shows that all methods are capable of retracing the different offensive strategies of both teams. Again, Team B has more solution categories (more than 33 %) than Team A, which is reflected in the versatile and multifaceted running patterns. However, only the results of \(\text {STCK}_{\text {dist}}\) show that Team A rapidly tries to enter the zone of danger with very few ball contacts (short sequences in all clusters compared to Team B). To sum up, the results of \(\text {STCK}_{\text {dist}}\) best reflect the game philosophies of Teams A and B from a sport-scientific perspective and are the easiest to interpret.
5 Conclusion
We presented spatiotemporal convolution kernels for multiobject scenarios. Our kernels consist of a temporal and a spatial component that can be chosen according to the characteristic traits of the problem at hand. The computation time is quadratic in the number and lengths of the trajectories. We proposed an efficient percental approximation algorithm that reduces the complexity to a super-linear runtime. Empirical results on artificial clustering tasks showed that our spatiotemporal convolution kernels effectively identify the target concepts. Results on large-scale real-world data from soccer games showed that our kernels lead to easily interpretable clusters that may be used in further analyses by coaches.
Notes
Note that the group membership of an object can change over time.
Note that the definition of \(P_2\) is such that it corresponds to an object moving with twice the speed of the first object.
Note that the property can be easily adapted to explicitly include speed by adding an extra coordinate. The position of each object is replaced by a positionspeedpair \(\left( x, \frac{\mathrm {d}x}{\mathrm {d}t}\right) _t\), see Jinyang et al. (2011) for details.
Originally, convolution kernels are defined on arbitrary sets X.
However, if the number of objects was not constant, \(k_{{\mathcal {X}}}\) would not be a kernel.
For example, 10 objects lead to about \(3.6 \cdot 10^6\) permutations.
The case of an infinite set X without a metric can be reduced to the finite case by only considering the positions that are attained by some objects at some time. These are only finitely many.
In this study a threshold on the covariance matrix’s condition number of 30 is used.
In order to simplify the notation, group index g has been omitted.
The group index g has again been omitted for reasons of readability.
The following properties refer to the \({STCK}_{Dist}\).
The team names must not be disclosed.
References
Andrieu, C., de Freitas, N., Doucet, A., & Jordan, M. I. (2003). An introduction to MCMC for machine learning. Machine Learning, 50(1–2), 5–43.
Basharat, A., Gritai, A., & Shah, M. (2008). Learning object motion patterns for anomaly detection and improved object detection. In IEEE conference on computer vision and pattern recognition.
Baud, O., El-Bied, Y., Honore, N., & Taupin, O. (2007). Trajectory comparison for civil aircraft. In Aerospace conference, 2007 IEEE.
Bellman, R., & Kalaba, R. (1959). On adaptive control processes. IRE Transactions on Automatic Control, 4(2), 1–9.
Bialkowski, A., Lucey, P., Carr, P., Denman, S., Matthews, I., & Sridharan, S. (2013). Recognising team activities from noisy data. In IEEE conference on computer vision and pattern recognition workshops.
Bialkowski, A., Lucey, P., Carr, P., Yue, Y., & Matthews, I. (2014). “Win at home and draw away”: Automatic formation analysis highlighting the differences in home and away team behaviors. In MIT sloan sports analytics conference.
Blanchard, G., Bousquet, O., & Massart, P. (2008). Statistical performance of support vector machines. The Annals of Statistics, 36(2), 489–531.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal for Machine Learning Research, 3, 993–1022.
Buzan, D., Sclaroff, S., & Kollios, G. (2004). Extraction and clustering of motion trajectories in video. In International conference on pattern recognition, (Vol. 2).
Chen, Y., Wiesel, A., Eldar, Y., & Hero, A. (2010). Shrinkage algorithms for MMSE covariance estimation. IEEE Transactions on Signal Processing, 58(10), 5016–5029.
Direkoglu, C., & O’Connor, N. (2012). Team behavior analysis in sports using the poisson equation. In IEEE international conference on image processing.
Fowlkes, C., Belongie, S., Chung, F., & Malik, J. (2004). Spectral grouping using the Nyström method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2), 214–225.
Fu, Z., Hu, W., & Tan, T. (2005). Similarity based vehicle trajectory clustering and anomaly detection. In International conference on image processing, (Vol. 2).
Giannotti, F., Nanni, M., Pinelli, F., & Pedreschi, D. (2007). Trajectory pattern mining. In Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’07. ACM, New York, NY, USA. doi:10.1145/1281192.1281230.
Grunz, A., Memmert, D., & Perl, J. (2012). Tactical pattern recognition in soccer games by means of special self-organizing maps. Human Movement Science, 31(2), 334–343.
Hartigan, J. A. (1975). Clustering algorithms (99th ed.). New York, NY: Wiley.
Hausdorff, F. (1962). Set theory. Chelsea: Chelsea Pub. Co.
Haussler, D. (1999). Convolution kernels on discrete structures. In Technical report UCSCRL9910, University of California at Santa Cruz.
Hervieu, A., & Bouthemy, P. (2010). Understanding sports video using players trajectories. In Intelligent video event analysis and understanding, studies in computational intelligence, (Vol. 332). Berlin, Heidelberg: Springer.
Hirano, S., & Tsumoto, S. (2005). Grouping of soccer game records by multiscale comparison technique and rough clustering. In International conference on hybrid intelligent systems.
Hu, W., Xiao, X., Fu, Z., Xie, D., Tan, T., & Maybank, S. (2006). A system for learning statistical motion patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9), 1450–1464.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.
Impire AG. (2014). VIS.TRACK Produktinformation. http://www.bundesligadatenbank.de/fileadmin/impire/products/Produktinformationen/VIS.TRACK__Produktionformation.pdf. Accessed June 10, 2014.
Intille, S. S., & Bobick, A. F. (2001). Recognizing planned multiperson action. Computer Vision and Image Understanding, 81(3), 414–445.
Jebara, T., Kondor, R., & Howard, A. (2004). Probability product kernels. Journal for Machine Learning Research, 5, 819–844.
Jeong, H., Chang, H. J., & Choi, J. Y. (2011). Modeling of moving object trajectory by spatiotemporal learning for abnormal behavior detection. In IEEE international conference on advanced video and signalbased surveillance.
Jinyang, D., Rangding, W., Liangxu, L., & Jiatao, S. (2011). Clustering of trajectories based on Hausdorff distance. In International conference on electronics, communications and control.
Junejo, I., Javed, O., & Shah, M. (2004). Multi feature path modeling for video surveillance. In International conference on pattern recognition, (Vol. 2).
Kang, C.-H., Hwang, J.-R., & Li, K.-J. (2006). Trajectory analysis for soccer players. In IEEE international conference on data mining workshops.
Kang, J., & Yong, H. S. (2008). Spatiotemporal discretization for sequential pattern mining. In Proceedings of the 2nd international conference on ubiquitous information management and communication. ACM, New York, NY, USA.
Kang, S. J., Kim, Y., Park, T., & Kim, C. H. (2013). Automatic player behavior analysis system using trajectory data in a massive multiplayer online game. Multimedia Tools and Applications. doi:10.1007/s110420121052x.
Kaufman, L., & Rousseeuw, P. J. (1987). Clustering by means of medoids. In Y. Dodge (Ed.), Statistical Data Analysis Based on the \(L_{1}\) Norm and Related Methods (pp. 405–416). NorthHolland.
Kempe, M., Grunz, A., & Memmert, D. (2014). Detecting tactical patterns in basketball: Comparison of merge self-organising maps and dynamic controlled neural networks. European Journal of Sport Science, 15(4), 249–255.
Kim, K., Grundmann, M., Shamir, A., Matthews, I., Hodgins, J., & Essa, I. (2010). Motion fields to predict play evolution in dynamic sport scenes. In IEEE conference on computer vision and pattern recognition.
Larson, J.S., Bradlow, E.T., & Fader, P.S. (2005). An exploratory look at supermarket shopping paths. International Journal of Research in Marketing, 22. doi:10.1016/j.ijresmar.2005.09.005.
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE computer society conference on computer vision and pattern recognition, (Vol. 2).
Ledoit, O., & Wolf, M. (2004). A wellconditioned estimator for largedimensional covariance matrices. Journal of Multivariate Analysis, 88(2), 365–411.
Lee, J.-G., Han, J., & Li, X. (2008). Trajectory outlier detection: A partition-and-detect framework. In Proceedings of the 2008 IEEE 24th international conference on data engineering. IEEE Computer Society: Washington, DC, USA. doi:10.1109/ICDE.2008.4497422.
Li, R., & Chellappa, R. (2010). Group motion segmentation using a spatiotemporal driving force model. In IEEE conference on computer vision and pattern recognition.
Li, R., Chellappa, R., & Zhou, S. (2009). Learning multimodal densities on discriminative temporal interaction manifold for group activity recognition. In IEEE conference on computer vision and pattern recognition.
Lin, C. J. (2001). On the convergence of the decomposition method for support vector machines. IEEE Transactions on Neural Networks, 12(6), 1288–1298.
Lin, D., Grimson, E., & Fisher, J. (2009). Learning visual flows: A lie algebraic approach. In IEEE conference on computer vision and pattern recognition.
Lucey, P. J., Bialkowski, A., Carr, P., Morgan, S., Matthews, I., & Sheikh, Y. (2013). Representing and discovering adversarial team behaviors using player roles. In IEEE conference on computer vision and pattern recognition.
Mamoulis, N., Cao, H., Kollios, G., Hadjieleftheriou, M., Tao, Y., & Cheung, D. W. (2004). Mining, indexing, and querying historical spatiotemporal data. In Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining. ACM: New York, NY, USA.
Mehta, S., Parthasarathy, S., & Yang, H. (2005). Toward unsupervised correlation preserving discretization. IEEE Transactions on Knowledge and Data Engineering, 17(9), 1174–1185.
Ng, A. Y., Jordan, M. I., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems (pp. 849–856).
Ong, C. S., Canu, S., & Smola, A. J. (2004). Learning with non-positive kernels. In International conference on machine learning.
Pao, H. K., Chen, K. T., & Chang, H. C. (2010). Game bot detection via avatar trajectory analysis. IEEE Transactions on Computational Intelligence and AI in Games, 2(3), 162–175.
Perše, M., Kristan, M., Kovačič, S., Vučkovič, G., & Perš, J. (2009). A trajectory-based analysis of coordinated team activity in a basketball game. Computer Vision and Image Understanding, 113(5), 612–621.
Piciarelli, C., Foresti, G., & Snidaro, L. (2005). Trajectory clustering and its applications for video surveillance. In IEEE conference on advanced video and signal based surveillance.
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336), 846–850.
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.
Saleemi, I., Shafique, K., & Shah, M. (2009). Probabilistic modeling of scene dynamics for applications in visual surveillance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(8).
Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge: Cambridge University Press.
Siddiquie, B., Yacoob, Y., & Davis, L. (2009). Recognizing plays in American football videos.
Steinwart, I. (2002). Support vector machines are universally consistent. Journal of Complexity, 18(3), 768–791.
Trunfio, G. A., D’Ambrosio, D., Rongo, R., Spataro, W., & Di Gregorio, S. (2011). A new algorithm for simulating wildfire spread through cellular automata. ACM Transactions on Modeling and Computer Simulation.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer.
Wang, X., Ma, K. T., Ng, G. W., & Grimson, W. (2008). Trajectory analysis and semantic region modeling using a nonparametric Bayesian model. In IEEE conference on computer vision and pattern recognition.
Wauthier, F. L., Jojic, N., & Jordan, M. I. (2012). Active spectral clustering via iterative uncertainty reduction. In ACM SIGKDD international conference on knowledge discovery and data mining. ACM.
Wei, X., Sha, L., Lucey, P., Morgan, S., & Sridharan, S. (2013). Large-scale analysis of formations in soccer. In International conference on digital image computing: Techniques and applications.
Williams, C., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Advances in neural information processing systems 13 (pp. 682–688). MIT Press.
Zhu, G., Huang, Q., Xu, C., Rui, Y., Jiang, S., Gao, W., & Yao, H. (2007). Trajectory based event tactics analysis in broadcast sports video. In International conference on multimedia. ACM.
Editor: Tijl De Bie.
Appendix: Generation of artificial data
The four clusters of the linear movement data consist of trajectories, each of which comprises five objects performing a linear movement as depicted in Fig. 1. The clusters differ with regard to the direction of the movement as follows:

Cluster 1: Objects move downwards in parallel. In particular, the position (without noise) of object \(o_i\) for \(i=1,\ldots ,5\) at time \(t=0,0.1,\ldots ,1\) is given by
$$\begin{aligned} \left( 0.2 + \frac{i-1}{10}, 1-t\right) . \end{aligned}$$ 
Cluster 2: Objects move upwards in parallel. In particular, the position of object \(o_i\) for \(i=1,\ldots ,5\) at time \(t=0,0.1,\ldots ,1\) is given by
$$\begin{aligned} \left( 0.2 + \frac{i-1}{10}, t\right) . \end{aligned}$$ 
Cluster 3: Objects move downwards while drifting apart. In particular, the position of object \(o_i\) for \(i=1,2\) at time \(t=0,0.1,\ldots ,1\) is
$$\begin{aligned} (0.6 + (i-1)\cdot 0.1 + 0.4\cdot t, 1-t), \end{aligned}$$the position of object \(o_3\) is
$$\begin{aligned} (0, 1-t) \end{aligned}$$at time \(t=0,0.1,\ldots ,1\) and the position of object \(o_i\) for \(i=4,5\) at time \(t=0,0.1,\ldots ,1\) is
$$\begin{aligned} (0.5 + (i-4)\cdot 0.1 - 0.4 \cdot t, 1-t). \end{aligned}$$ 
Cluster 4: Objects move upwards while drifting apart. In particular, the position of object \(o_i\) for \(i=1,2\) at time \(t=0,0.1,\ldots ,1\) is
$$\begin{aligned} (0.6 + (i-1)\cdot 0.1 + 0.4\cdot (1-t), t), \end{aligned}$$the position of object \(o_3\) is
$$\begin{aligned} (0, t) \end{aligned}$$at time \(t=0,0.1,\ldots ,1\) and the position of object \(o_i\) for \(i=4,5\) at time \(t=0,0.1,\ldots ,1\) is
$$\begin{aligned} (0.5 + (i-4)\cdot 0.1 - 0.4 \cdot (1-t), t). \end{aligned}$$
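For concreteness, the four linear-movement clusters can be generated with a short Python sketch following the formulas above. This is a noise-free sketch (the data set adds noise to these positions); the function name and the (object, time, coordinate) array layout are our own choices:

```python
import numpy as np

T = np.arange(0, 11) / 10.0  # sampling times t = 0, 0.1, ..., 1


def linear_cluster(c):
    """Noise-free positions for cluster c in {1,2,3,4}.

    Returns an array of shape (5 objects, 11 time steps, 2 coordinates).
    """
    out = np.empty((5, len(T), 2))
    for idx, i in enumerate(range(1, 6)):       # objects o_1, ..., o_5
        for k, t in enumerate(T):
            if c == 1:                          # parallel, downwards
                x, y = 0.2 + (i - 1) / 10, 1 - t
            elif c == 2:                        # parallel, upwards
                x, y = 0.2 + (i - 1) / 10, t
            elif c == 3:                        # drifting apart, downwards
                if i <= 2:
                    x, y = 0.6 + (i - 1) * 0.1 + 0.4 * t, 1 - t
                elif i == 3:
                    x, y = 0.0, 1 - t
                else:
                    x, y = 0.5 + (i - 4) * 0.1 - 0.4 * t, 1 - t
            else:                               # c == 4: drifting apart, upwards
                if i <= 2:
                    x, y = 0.6 + (i - 1) * 0.1 + 0.4 * (1 - t), t
                elif i == 3:
                    x, y = 0.0, t
                else:
                    x, y = 0.5 + (i - 4) * 0.1 - 0.4 * (1 - t), t
            out[idx, k] = (x, y)
    return out
```

A full sample of the data set would stack several such trajectories per cluster, each perturbed with independent noise.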
The second artificial data set tests circular movements and contains trajectories of four objects moving in circles of different radii. The clusters differ with regard to the orientation of the movement as follows:

Cluster 1: All objects move clockwise. In particular, the position of object \(o_i\) for \(i=1,\ldots ,4\) at time \(t=0,0.1,\ldots ,1\) is given by
$$\begin{aligned} (0.25i \cos (2\pi t), 0.25i \sin (2\pi t)). \end{aligned}$$ 
Cluster 2: All objects move counterclockwise. In particular, the position of object \(o_i\) for \(i=1,\ldots ,4\) at time \(t=0,0.1,\ldots ,1\) is given by
$$\begin{aligned} (0.25i \cos (2\pi t), -0.25i \sin (2\pi t)). \end{aligned}$$ 
Cluster 3: The inner two objects move clockwise, the outer two counterclockwise. In particular, the position of object \(o_i\) for \(i=1,\ldots ,4\) at time \(t=0,0.1,\ldots ,1\) is given by (with the convention \({{\mathrm{sgn}}}(0)=1\))
$$\begin{aligned} (0.25i \cos (2\pi t), {{\mathrm{sgn}}}(2-i)\, 0.25i \sin (2\pi t)). \end{aligned}$$ 
Cluster 4: The inner two objects move counterclockwise, the outer two clockwise. In particular, the position of object \(o_i\) for \(i=1,\ldots ,4\) at time \(t=0,0.1,\ldots ,1\) is given by
$$\begin{aligned} (0.25i \cos (2\pi t), {{\mathrm{sgn}}}(i-3)\, 0.25i \sin (2\pi t)). \end{aligned}$$
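The circular clusters can be sketched analogously. The sketch below encodes the orientation of object \(o_i\) as a sign \(s \in \{-1,+1\}\) on the \(\sin\) component, with \({{\mathrm{sgn}}}(0)=1\); again, the function names and array layout are our own choices:

```python
import numpy as np

T = np.arange(0, 11) / 10.0  # sampling times t = 0, 0.1, ..., 1


def sgn(x):
    """Sign function with the convention sgn(0) = 1."""
    return 1.0 if x >= 0 else -1.0


def circular_cluster(c):
    """Positions for cluster c in {1,2,3,4}.

    Returns an array of shape (4 objects, 11 time steps, 2 coordinates);
    object o_i moves on a circle of radius 0.25 * i.
    """
    out = np.empty((4, len(T), 2))
    for idx, i in enumerate(range(1, 5)):      # objects o_1, ..., o_4
        if c == 1:                             # all clockwise
            s = 1.0
        elif c == 2:                           # all counterclockwise
            s = -1.0
        elif c == 3:                           # inner clockwise, outer counter
            s = sgn(2 - i)
        else:                                  # c == 4: inner counter, outer clockwise
            s = sgn(i - 3)
        for k, t in enumerate(T):
            out[idx, k] = (0.25 * i * np.cos(2 * np.pi * t),
                           s * 0.25 * i * np.sin(2 * np.pi * t))
    return out
```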
Cite this article
Knauf, K., Memmert, D. & Brefeld, U. Spatio-temporal convolution kernels. Mach Learn 102, 247–273 (2016). https://doi.org/10.1007/s10994-015-5520-1