Spatiotemporal convolution kernels
Abstract
Trajectory data of simultaneously moving objects is being recorded in many different domains and applications. However, existing techniques that utilise such data often fail to capture characteristic traits or lack theoretical guarantees. We propose a novel class of spatiotemporal convolution kernels to capture similarities in multiobject scenarios. The abstract kernel is a composition of a temporal and a spatial kernel and its actual instantiations depend on the application at hand. Empirically, we compare our kernels and efficient approximations thereof to baseline techniques for clustering tasks using artificial and real world data from team sports.
Keywords
Convolution kernel, Spatiotemporal, Trajectory, Soccer
1 Introduction
Trajectory data of simultaneously moving objects is the key to analyse animal migration (Lee et al. 2008), transportation (Baud et al. 2007; Giannotti et al. 2007), tactics in team sports (Hirano and Tsumoto 2005; Kempe et al. 2014; Lucey et al. 2013; Wei et al. 2013), players and avatars in (serious) computer games (Kang et al. 2013; Pao et al. 2010), customer behaviour (Larson et al. 2005) as well as spread patterns of fires (Trunfio et al. 2011). A characteristic trait of many such applications is that trajectories of several objects are more informative than the trajectory of a single object. For instance, a single trajectory of a bird is not indicative for bird migration as individuals may join or leave the flock (Lee et al. 2008) and a single trajectory of a soccer player does not reveal insights on the actual situation on the pitch (Grunz et al. 2012; Wei et al. 2013).
Therefore, trajectories of multiple objects need to be processed together. Although this insight sounds trivial, processing multiple trajectories simultaneously challenges the standard model of computation as trajectories may interdepend in time and space in multiple ways. To exploit these dependencies, it is necessary to establish a notion of similarity for spatiotemporal paths of multiple objects to identify frequent patterns. By definition, frequent patterns are formed by an a priori unknown subset of objects at unknown locations in time and space. Analysing multitrajectory data is therefore inherently a combinatorial problem that involves processing data at large scales.
A second problem arises from existing methods for analysing spatiotemporal data. Traditional approaches often cannot deal with continuous spatial domains but rely on an appropriate discretisation of the data at hand (Kang and Yong 2008; Mamoulis et al. 2004; Mehta et al. 2005). However, finding an a priori optimal discretisation is difficult in many domains where only the final result allows conclusions on whether an initial set of atomic events is plausible or not (Kang and Yong 2008). Furthermore, many approaches cannot deal with permutations of the objects and differences in speed, while still being sensitive to differences in the direction of the motion (Hirano and Tsumoto 2005; Junejo et al. 2004; Wei et al. 2013).
We devise a novel class of convolution kernels for multitrajectory data. It is specially tailored to multiobject scenarios, i.e. trajectories of multiple simultaneously moving objects. The kernel properties as well as the modular nature of the proposed class of kernels render it highly adaptive to different applications. Since it is a kernel, it can also naturally be deployed with any kernel machine. The three characteristics, multiobject scenario, modularity and kernel property, distinguish our approach from existing methods. Due to these characteristics, our approach is suitable for a large variety of applications, it is flexible with respect to the notion of similarity, and it is theoretically better grounded than most of the existing methods. Since the complexity of a kernel evaluation is quadratic in the number and the lengths of the involved objects, we also propose an efficient percental approximation. Empirically, the method is evaluated on artificial datasets and real-world tracking data from ten Bundesliga soccer matches. We generally observe that our convolution kernels lead to better clusterings compared to baseline methods.
The remainder of this article is structured as follows. Section 2 reviews existing work. Section 3 introduces our spatiotemporal convolution kernel methods and Sect. 4 reports on empirical results. Section 5 concludes.
2 Related work
2.1 Trajectory clustering
Trajectory clustering, that is, the clustering of spatiotemporal data, has been an active field of research in recent years. Existing approaches mainly focus on video surveillance with the goal of detecting anomalies in the data stream (Basharat et al. 2008; Fu et al. 2005; Hu et al. 2006; Jeong et al. 2011; Junejo et al. 2004; Saleemi et al. 2009). Other applications include automatic sports analysis, weather evolution modelling, animal migration and traffic analysis. Existing approaches rely mostly on processing single trajectories. Recent contributions in this area can be roughly grouped into similarity-based approaches (Buzan et al. 2004; Jinyang et al. 2011; Fu et al. 2005; Hirano and Tsumoto 2005; Hu et al. 2006; Junejo et al. 2004; Piciarelli et al. 2005) and motion-based approaches (Basharat et al. 2008; Wang et al. 2008; Jeong et al. 2011; Li and Chellappa 2010; Lin et al. 2009; Saleemi et al. 2009).
Similarity-based approaches define pairwise similarities between trajectories which are then processed by some clustering algorithm. Junejo et al. (2004) represent trajectories as a set of two-dimensional coordinates together with the Hausdorff distance. Subsequently, graph cuts are deployed to recursively partition the trajectories. Hausdorff distances are also used to cluster trajectories by Jinyang et al. (2011) where not only the position but also the direction of the trajectories is taken into account by using 4-tuples \((x,y,\mathrm {d}x,\mathrm {d}y)\) instead of coordinates only. Fu et al. (2005) first resample trajectories to obtain constant between-point distances. Then the corresponding points of two trajectories are compared using an RBF kernel where the longer trajectory is cut to the length of the shorter one. Spectral clustering is then used together with a symmetric normalised Laplacian.
Buzan et al. (2004) extend the longest common subsequence algorithm to three-dimensional coordinates and use a modified version of agglomerative hierarchical clustering. Hirano and Tsumoto (2005) deploy multiscale matchings to compare trajectories. The basic idea is to generate trajectories at different scales as convolutions of the trajectory and Gaussian kernels with different standard deviations. Their similarity measure is then based on the hierarchical structure of the trajectory segments at different scales. Subsequently, a rough clustering is employed. Piciarelli et al. (2005) define a trajectory-to-cluster similarity by the average Euclidean distance of trajectory coordinates to the nearest cluster coordinate where offsets in time induce negative weights.
Our approach also belongs to these similarity-based methods. In general, similarity-based approaches suffer from two major drawbacks. First, their computational complexity is at least quadratic in the number of trajectories. Second, they rely on clustering full trajectories and are hence sensitive to tracking errors and subtrajectories. While the first drawback is inherent to all similarity-based methods, our distribution-based approach and gradual weighting mitigate the effects of noise and tracking errors and are able to identify partial matchings between trajectories.
In contrast to similarity-based approaches, motion-based approaches focus on local movements of objects to derive models for the overall (group) motion in a scene. Wang et al. (2008) and Jeong et al. (2011) represent a trajectory by bags of positions as well as directions based on the bag-of-words representation of documents in natural language processing. To this end, the spatial domain is discretised and the number of occurrences of each position in a trajectory is counted. Grimson et al. also take into account temporal information by counting the occurrences of each (discretised) direction in a trajectory. The topic model Dual-HDP (Wang et al. 2008) is used to find semantic regions, which are combined to form the different trajectories. Jeong et al. use latent Dirichlet allocation (Blei et al. 2003) to obtain semantic regions. To incorporate temporal information, a hidden Markov model is trained for each topic based on the sequences which are close to the topic. Saleemi et al. (2009) propose kernel density estimation to learn a five-dimensional distribution of transitions from \((x_1, y_1)\) to \((x_2, y_2)\) in time t. Markov chain Monte Carlo (Andrieu et al. 2003) is then deployed to sample the most likely paths given the learned transition probabilities.
Basharat et al. (2008) also learn a model for transition probabilities. Instead of kernel density estimation, a Gaussian mixture model is fitted to the observed transitions. Lin et al. (2009) exploit the Lie algebraic structure of affine transformations to learn a flow model consisting of overlapping two-dimensional Gaussian distributions, each of which corresponds to an affine transform dominant in this spatial area. The approach is applied to pedestrians in a train station and optical flows obtained from satellite images. Li and Chellappa (2010) use a similar Lie algebraic representation called spatial hybrid driving force model, which, as opposed to Lin et al. (2009), evolves over time. This model is used to solve the so-called group motion segmentation problem, i.e. to answer the question of which objects take part in an organised group motion and which do not.
Motion-based approaches also have inherent limitations. First, they often neglect temporal information of at least second order (curvature). Second, they do not provide a mapping of the input trajectories to groups of similar trajectories but rather describe the combined motion of all objects in all trajectories over time. Our approach differs methodologically from the summarised techniques in several ways: First, it provides a general framework that covers many applications and properties as opposed to being a very specific similarity measure tailored to a single application domain. Second, our approach is specialised on multiple simultaneously moving objects instead of focussing only on trajectories of single objects. Third, being a kernel, the similarity measure is straightforwardly applicable to a broad range of algorithms and is theoretically well grounded in contrast to heuristic approaches.
2.2 Sports analytics
Current approaches in the area of sports game trajectory analysis either aim to define objective performance measures for players (Kang et al. 2006), classify (Bialkowski et al. 2013; Hervieu and Bouthemy 2010; Intille and Bobick 2001; Grunz et al. 2012; Perše et al. 2009; Siddiquie et al. 2009) or cluster (Hirano and Tsumoto 2005; Wei et al. 2013) plays/trajectories, or learn a motion model for team behaviour (Bialkowski et al. 2014; Direkoglu and O’Connor 2012; Kim et al. 2010; Li et al. 2009; Li and Chellappa 2010; Lucey et al. 2013; Zhu et al. 2007).
Kang et al. (2006) define performance metrics for soccer players based on the definition of owned and competitive regions of the field, which are derived from ball and player trajectories. Siddiquie et al. (2009) represent videos of football plays by a bag-of-features from histograms of optical flows as well as histograms of oriented gradients. Spatiotemporal pyramid matching (Lazebnik et al. 2006) is used to generate a kernel for each visual word. Football plays are then classified into seven categories using multiple kernel learning. Hervieu and Bouthemy (2010) use a hierarchical parallel semi-Markov model to classify different activity states in squash and handball, such as rallies, free throws and defence. The first layer describes the activity states, while the second layer consists of a parallel hidden Markov model for each feature representing the trajectories.
Perše et al. (2009) represent team activity in basketball using team centroids to hierarchically classify situations with Gaussian mixture models. Thereafter, each situation is converted into a string, which is compared to templates for classification. Bialkowski et al. (2013) use team centroids and occupancy maps to classify game situations in field hockey (corners, goals), emphasising the robustness of this representation to tracking noise. Grunz et al. (2012) employ self-organising maps to identify long and short game initiations in soccer and Hirano and Tsumoto (2005) use multiscale comparison and rough clustering to cluster ball trajectories that lead to goals.
Direkoglu and O’Connor (2012) solve a special Poisson equation, in which the player positions determine the location of source terms. The derived distribution and its development over time define a so-called region of interest used to describe the team behaviour. Wei et al. (2013) use role models (Lucey et al. 2013) and a bilinear spatiotemporal basis model to represent team movement to cluster goal scoring opportunities in soccer. Bialkowski et al. (2014) also use role models to automatically detect and compare the formations of soccer teams. Li and Chellappa (2010) learn a spatiotemporal driving force model to identify offence and defence players in football. Kim et al. (2010) interpolate a dense motion field from player trajectories using thin-plate splines. This motion field is further investigated for points of convergence to predict where the game will evolve in the short term.
From an application point of view, our approach is most comparable to Wei et al. (2013) and Grunz et al. (2012). While Wei et al. focus on scoring opportunities and Grunz et al. study game initiations, we consider both situations in this study. Similar to Bialkowski et al. (2013), our method proves robust to tracking noise.
3 Spatiotemporal convolution kernels
3.1 Representation
Multiobject trajectory analysis is concerned with a possibly varying number of moving objects \({\mathcal {O}}_t\) in a set X, e.g. \(X = {\mathbb {R}}^2\), over a finite period of time \({\mathcal {T}} \subset {\mathbb {N}}\). A multiobject trajectory is composed of snapshots of the object positions at different times. Depending on the context and application at hand, one of the following two formalisations of a snapshot is more appropriate.
Definition 1
(Object-oriented Snapshot) Assume the number of objects to be constant over time, i.e. \({\mathcal {O}}_t = {\mathcal {O}} = \{o_1,\ldots , o_N\}\) for \(N \in {\mathbb {N}}\). Then the object-oriented snapshot of all objects at time \(t \in {\mathcal {T}}\) is denoted by \( x_{t} \in X^N =: {\mathcal {X}}.\) We call \({\mathcal {X}}\) the snapshot space. The position of a particular object \(o \in {\mathcal {O}}\) is denoted by \( x_{t}(o) \in X.\)
Definition 2
The group members of group g in snapshot \(x_t\) are denoted by \(O_{x_t}(g) \subset {\mathcal {O}}_t.\)
The implications of the two definitions are as follows. First, the object-oriented snapshot representation only allows a fixed number of objects, whereas the group-oriented representation is not limited in that respect. Second, in the group-oriented snapshot, objects inside a group are indistinguishable. On the one hand, this property allows for permutations of objects, but on the other hand it naturally also entails ambiguities.
Instead of an ordered sequence of positions or snapshots we use a set of time/position pairs to represent trajectories. Thereby, time and order are explicitly represented as opposed to the more implicit sequence representation.
Definition 3
The set \(\pi _{{\mathcal {T}}}(\tilde{P}) = \{t \in {\mathcal {T}}: \exists (\tilde{s}, x_{\tilde{s}}) \in \tilde{P} \text { s. t. } t= \tilde{s}\}\) contains all timestamps of the trajectory and is usually of the form \(\{K, K+1,\ldots , K+L\}\) for some natural numbers K and L. When comparing trajectories it is insignificant at what absolute time the trajectories start. This gives rise to the following definition.
Definition 4
In the remaining part of this study we refer to time-normalised trajectories simply as trajectories.
3.2 Problem setting
 (I) the absolute position as well as the shape of the trajectories is incorporated,
 (II) the measure is invariant to permutations of certain objects, i.e. for two trajectories \(P_1\), \(P_2\) with
$$\begin{aligned} P_2 = \{(t, x_t) : \exists \text { permutation } \sigma \ \forall (s, y_s) \in P_1 \text { s. t. } t=s \wedge x_t = \sigma (y_s)\} \end{aligned}$$
 it holds that \(k(P_1, P_2) = 1\). In case of the group-oriented snapshot this already holds by definition if the permuted objects are members of the same group,
 (III) the measure is invariant with respect to the speed of the movement. Since all trajectories have already been normalised to the same time scale, differences in speed are mainly reflected in the cardinality of the trajectory sets. So, for example, given two trajectories \({P}_1\) and \({P}_2\) with \(|{P}_1| = 2|{P}_2|\) and
$$\begin{aligned} {P}_2 = \{({t}, x_{{t}}) : \exists ({{s}}, y_{{s}}) \in {P}_1 \text { s. t. } {t}={2s} \wedge x_{{t}} = y_{{2s}}\} \end{aligned}$$
 it holds that \(k(P_1, P_2) \approx k(P_1, P_1)\),^{2}
 (IV) the similar movements of two sets of objects are recognised as such in the presence of deviations of single objects and outliers,
 (V) Kernel property, i.e.
$$\begin{aligned} k(P_1, P_2) = \langle \phi (P_1), \phi (P_2)\rangle _{\mathcal {F}} \end{aligned}$$
 for some, usually unknown, feature map \(\phi \) and inner product space \(\mathcal {F}\),
 (VI) Broad applicability, i.e. few application-specific parameters and no restrictions on the space X,
 (VII) Computational efficiency.
Property (IV) demands robustness with respect to outlier trajectories. Further note that, for example, Dynamic Time Warping (Bellman and Kalaba 1959) meets condition (III) very well, but does not comply with (V), (VI) and (VII), since it is not a kernel and only applicable if the underlying set is a metric space. Moreover, it is computationally expensive. On the other hand, the Hausdorff distance (Hausdorff 1962) satisfies (I), (III) and (VII), but it does not satisfy (II), (V) and (VI), since it is only applicable to metric spaces and sensitive to permutations. In addition, it is not a kernel. A Gaussian RBF kernel on the full vector of positions meets conditions (I) (restricted), (V) and (VII), but is not applicable to sequences of different lengths. Also, it does not comply with (II),(III) and (VI), since it is highly sensitive to variations in speed and permutations and is restricted to metric spaces.
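To make the remark about the Hausdorff distance concrete, the following minimal sketch computes the symmetric Hausdorff distance between two point sets in plain Python. The function names are illustrative only and not part of the original study. Because the distance is defined on sets, a trajectory and its time-reversed copy have distance zero, which illustrates why ordering and direction are not captured.

```python
import math

def directed_hausdorff(A, B):
    """Largest distance from a point in A to its nearest point in B."""
    return max(min(math.dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite point sets."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

# A path and the same path traversed backwards are identical as sets.
path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
reversed_path = path[::-1]
same_set_distance = hausdorff(path, reversed_path)
```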
3.3 Spatiotemporal convolution kernels for multitrajectories
In this section we develop a kernel on the space of (time-normalised) multitrajectories \({\mathcal {P}}({[0,1]\times {\mathcal {X}}})\). Each of those trajectories consists of a set of snapshots associated with a relative time. The general idea is to perform a pairwise comparison of the snapshots in the two sets. Therefore, we first need a way to compare snapshots and, second, we need to know which snapshots of the two trajectory sets to compare with each other. For the latter, dynamic time warping (DTW) (Bellman and Kalaba 1959) seems to be a good choice, since it aligns the snapshots optimally in terms of similarity. Unfortunately, the obtained kernel is not positive definite, i.e. it does not correspond to an inner product in some Hilbert space. Although there is anecdotal evidence that learning with indefinite kernels can lead to good results in some applications (e.g. Ong et al. 2004), theory only supports the use of positive definite kernels. For many kernel machines there are error bounds and convergence criteria that can be straightforwardly applied to positive definite kernels but that do not hold for indefinite kernels (Blanchard et al. 2008; Lin 2001; Steinwart 2002).
Therefore, we propose a weighted comparison between every snapshot of the first trajectory and every snapshot of the second one where the weights depend on the offset in relative time. Formally, this is done using an R-convolution kernel (Haussler 1999) on the two sets representing the trajectories. Convolution kernels are a general class of kernels on structured objects \(x,y \in X\).^{4} The idea is to compare instances x and y by comparing their parts \((x_1,\ldots ,x_D),(y_1,\ldots ,y_D) \in X_1\times \cdots \times X_D\). Thus, a relation function R is needed to express that something is a part of some structure.
Definition 5
Definition 6
The following theorem shows that an R-convolution kernel is indeed a (positive-definite) kernel.
Theorem 1
Let \(X,X_1,\ldots ,X_D\) be arbitrary sets. Let R be a finite relation and let \(k_1,\ldots ,k_D\) be kernels on \(X_1,\ldots ,X_D\). Then the R-convolution kernel k given by Definition 6 is a kernel.
Proof
For the proof we refer to Haussler (1999) Theorem 1 and Lemma 1, which are essentially more involved applications of closure properties of kernels. \(\square \)
Theorem 2
The spatiotemporal convolution kernel (Eq. 3) is a kernel if the temporal kernel \(k_{[0,1]}\) and the spatial kernel \(k_{{\mathcal {X}}}\) are kernels in that sense.
Proof
By Theorem 1 we need to show that R is finite and the component kernels \(k_{\mathbb {N}}\), \(k_{[0,1]}\) and \(k_{{\mathcal {X}}}\) are indeed kernels. First, for all \(P \in {\mathcal {P}}([0,1]\times {\mathcal {X}})\) it holds that \(|R^{-1}(P)| = |P| < \infty \) by Definition 3, so R is finite. Second, \(k_{[0,1]}\) and \(k_{{\mathcal {X}}}\) are kernels by assumption and we have just shown that \(k_{\mathbb {N}}\) is also a kernel. Hence, the spatiotemporal convolution kernel is a kernel. \(\square \)

Spatial kernel \(k_{{\mathcal {X}}}\): The choice of the snapshot kernel determines which snapshots are similar.

Temporal kernel \(k_{[0,1]}\): The choice of the temporal kernel determines the way in which the snapshots of two sequences are combined, and thus the importance of ordering and speed.
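The interplay of the two components can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the kernel sums the product of temporal and spatial kernel values over all pairs of snapshots (the weighted comparison described above), omits any normalisation, and assumes that relative times are obtained by mapping snapshot indices linearly to [0, 1]. The Gaussian choices and the default widths 0.5 and 0.2 follow the settings reported in Sect. 4. All function names are hypothetical.

```python
import math

def gaussian(sq_dist, sigma):
    """Gaussian kernel value for a squared distance."""
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def spatial_kernel(x, y, sigma_s=0.2):
    """Object-wise Gaussian RBF on two snapshots (tuples of object positions)."""
    sq = sum((a - b) ** 2 for p, q in zip(x, y) for a, b in zip(p, q))
    return gaussian(sq, sigma_s)

def temporal_kernel(s, t, sigma_t=0.5):
    """Gaussian kernel on relative times in [0, 1]."""
    return gaussian((s - t) ** 2, sigma_t)

def normalise_time(snapshots):
    """Assumed normalisation: map snapshot indices linearly onto [0, 1]."""
    n = len(snapshots)
    return [(i / (n - 1) if n > 1 else 0.0, x) for i, x in enumerate(snapshots)]

def stck(P, Q, k_time=temporal_kernel, k_space=spatial_kernel):
    """Sum of temporally weighted spatial similarities over all snapshot pairs."""
    return sum(k_time(s, t) * k_space(x, y) for s, x in P for t, y in Q)

# Two single-object trajectories along the same line, sampled at different rates.
P = normalise_time([((0.0, 0.0),), ((0.1, 0.0),), ((0.2, 0.0),)])
Q = normalise_time([((0.0, 0.0),), ((0.2, 0.0),)])
similarity = stck(P, Q)
```

Swapping `k_space` for a group-wise kernel or `k_time` for a narrower temporal kernel changes the notion of similarity without touching the double sum, which is the modularity the two bullet points describe.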
3.4 Spatial kernels
A spatial kernel compares two snapshots in X. Corresponding to the two definitions of the snapshot in Definition 1 (object-oriented) and Definition 2 (group-oriented), two types of kernels are introduced here as well.
Elementary spatial kernels
| Name | X | Kernel | Para. |
|---|---|---|---|
| Uniform | \({\mathbb {R}}^n\) | \(k_X(x, y) = {\mathbb {I}}_{\{z \in {\mathbb {R}}^n:\Vert z-x\Vert _2 < w\}}(y)\) | w |
| Triangular | \({\mathbb {R}}^n\) | \(k_X(x,y) = \left( 1 - \frac{\Vert x-y\Vert _2}{w}\right) {\mathbb {I}}_{\{z \in {\mathbb {R}}^n:\Vert z-x\Vert _2 < w\}}(y)\) | w |
| Polynomial | \({\mathbb {R}}^n\) | \(k_X(x,y) = \left( \langle x,y \rangle + R \right) ^d\) | R, d |
| Gaussian | \({\mathbb {R}}^n\) | \(k_X(x,y) = \exp \left( -\frac{1}{2\sigma ^2}\Vert x-y\Vert _2^{2}\right) \) | \(\sigma ^{2}\) |
| Matching kernel | Finite X | \(k_X(x, y) = {\mathbb {I}}_{\{x\}}(y)\) | – |
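The elementary kernels in the table translate directly into code. The sketch below is illustrative and assumes the standard forms of these kernels, i.e. the norms are taken of the difference \(x-y\) and the Gaussian exponent is negative; function and parameter names mirror the table but are otherwise hypothetical.

```python
import math

def _dist(x, y):
    """Euclidean distance between two points given as sequences of floats."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def uniform_kernel(x, y, w):
    """1 if y lies strictly within distance w of x, else 0."""
    return 1.0 if _dist(x, y) < w else 0.0

def triangular_kernel(x, y, w):
    """Decays linearly from 1 at distance 0 to 0 at distance w."""
    d = _dist(x, y)
    return 1.0 - d / w if d < w else 0.0

def polynomial_kernel(x, y, R, d):
    """(<x, y> + R)^d with offset R and degree d."""
    return (sum(a * b for a, b in zip(x, y)) + R) ** d

def gaussian_kernel(x, y, sigma2):
    """exp(-||x - y||^2 / (2 sigma^2)), parameterised by the variance sigma2."""
    return math.exp(-_dist(x, y) ** 2 / (2.0 * sigma2))

def matching_kernel(x, y):
    """Indicator of equality, intended for finite sets X."""
    return 1.0 if x == y else 0.0
```

For example, `triangular_kernel((0.0, 0.0), (0.3, 0.4), 1.0)` evaluates two points at distance 0.5 and returns approximately 0.5.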
Kernels as in Eq. 4 have two major shortcomings. First, they penalise permutations of objects. For example, two snapshots of two objects with swapped positions will have low similarity, although in terms of the group motion they are alike. One way to address permutations is to explicitly maximise the similarity of the two snapshots with respect to all possible permutations of objects. Due to the high computational costs of considering all possible permutations^{6}, this is infeasible. Second, the bandwidth parameter \(\sigma \) that controls the interval of high sensitivity of the kernel, i.e. large values of \(\left| {\mathrm {d}k}/{\mathrm {d}(\Vert x-y\Vert )}\right| \), has to be known beforehand. This is critical since one usually does not know on which scale significant deviations appear. Both issues are addressed by the group-wise comparison of objects.
Definition 7
Lemma 1
\(k^{\text {prod}}\) is a (positive-definite) kernel.
Proof
Lemma 2
\(k_{G}\) as defined in Eq. 7 is a kernel.
Proof
3.5 Temporal kernels
The temporal kernel \(k_{[0,1]}\) is the simpler component of the spatiotemporal convolution kernel because the underlying set is fixed to the one-dimensional interval [0, 1]. We briefly discuss possible options for the temporal kernel and their implications.
3.6 Approximation techniques
To compute the Gram matrix of a dataset of N trajectories, \(O(N^{2} L^{2})\) spatial as well as temporal kernels need to be evaluated with L being the maximal length of a sequence. Naturally, the evaluation of the spatial kernels will dominate the evaluation of the temporal kernel mainly for reasons of dimensionality as well as complexity of the kernel itself.
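As a worked illustration of this count, the following sketch tallies the number of snapshot pairs that have to be compared when filling the Gram matrix. The helper name and the `use_symmetry` flag are hypothetical; exploiting the symmetry of the Gram matrix roughly halves the work but leaves the quadratic order unchanged.

```python
def spatial_evaluations(lengths, use_symmetry=False):
    """Number of snapshot pairs compared when filling the Gram matrix.

    lengths[i] is the number of snapshots in trajectory i. With
    use_symmetry=True only entries with i <= j are computed.
    """
    n = len(lengths)
    total = 0
    for i in range(n):
        start = i if use_symmetry else 0
        for j in range(start, n):
            total += lengths[i] * lengths[j]
    return total

# Four trajectories of length 12: N^2 * L^2 = 16 * 144 pairs in total.
full = spatial_evaluations([12] * 4)
upper = spatial_evaluations([12] * 4, use_symmetry=True)
```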
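Evaluation order of ASTCK with \(L=12\) and \(|P|=|Q|=6\) (rows: timesteps of Q, columns: timesteps of P, entries: evaluation step)
| Q \ P | 0 | .2 | .4 | .6 | .8 | 1 |
|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 6 | 9 | 11 | 12 |
| .2 | 4 | 1 | 3 | 6 | 9 | 11 |
| .4 | 7 | 5 | 1 | 3 | 7 | 9 |
| .6 | 10 | 8 | 5 | 2 | 4 | 7 |
| .8 | 11 | 10 | 8 | 5 | 2 | 4 |
| 1 | 12 | 12 | 10 | 8 | 6 | 2 |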
3.7 Online application
STCKs are particularly well suited for online analyses such as real-time computations where the objects of interest are still in motion. When a new measurement is added to a trajectory, Eq. 3 can be efficiently updated, since all previous spatial kernel evaluations remain constant in the sum. That is, only the value of the temporal kernel needs to be recomputed, which is usually inexpensive.
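The update can be sketched as follows, assuming the double-sum form of the kernel and assuming that relative times are obtained by mapping snapshot indices linearly to [0, 1]. The class `OnlineSTCK` and its methods are hypothetical names for illustration. Spatial kernel values are cached; each new measurement adds one row of spatial evaluations against the reference trajectory, and only the temporal weights are recomputed when the kernel value is requested.

```python
import math

class OnlineSTCK:
    """Caches spatial kernel values between a growing and a fixed trajectory."""

    def __init__(self, reference, k_space, k_time):
        self.reference = list(reference)  # snapshots of the fixed trajectory
        self.k_space = k_space
        self.k_time = k_time
        self.rows = []  # rows[i][j] = k_space(new snapshot i, reference snapshot j)

    def add_snapshot(self, x):
        # Earlier rows stay valid; only the new snapshot is compared.
        self.rows.append([self.k_space(x, y) for y in self.reference])

    @staticmethod
    def _relative(i, length):
        # Assumed time normalisation: indices mapped linearly onto [0, 1].
        return i / (length - 1) if length > 1 else 0.0

    def value(self):
        # Relative times shift as the trajectory grows, so the temporal
        # weights are recomputed while the cached spatial values are reused.
        n, m = len(self.rows), len(self.reference)
        return sum(
            self.k_time(self._relative(i, n), self._relative(j, m)) * self.rows[i][j]
            for i in range(n)
            for j in range(m)
        )
```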
4 Empirical evaluation
In this section we empirically compare our spatiotemporal convolution kernels to baseline approaches using artificial and real data sets. We focus on clustering tasks and k-medoids (Kaufman and Rousseeuw 1987) as the underlying learning algorithm. The temporal kernel is always a Gaussian kernel that we combine with three different spatial kernels: an object-wise Gaussian RBF kernel as spatial kernel (\(\text {STCK}_{\text {11}}\)), a Gaussian RBF kernel on the group means (\(\text {STCK}_{\text {Mean}}\)), and a probability product kernel on the fitted Gaussian distributions (\(\text {STCK}_{\text {Dist}}\)).
For the latter, the group- and application-specific parameter \(\sigma _{\text {MIN}}^2\), which is sometimes needed to restore non-singularity of the covariance matrix (see Sect. 3.4), is set to the average distance between two objects of the group. If the group consists of only one object, it is set to the average distance between two objects of the most similar group. The width parameter \(\sigma _T\) of the temporal Gaussian kernel is set to 0.5 to balance invariance to speed differences and sensitivity to direction. The width parameter \(\sigma _S\) of the spatial Gaussian RBF kernel is set to 0.2.
We deploy three baselines. First, Junejo et al. (2004) is straightforwardly extended to the multiobject scenario, i.e. the use of the Hausdorff distance on the set of positions of the trajectories. Instead of the hierarchical clustering employed in Fu et al. (2005), kernelised k-medoids is used. The second baseline is inspired by Wang et al. (2008). We use a bag-of-positions as well as a bag-of-directions representation for the trajectories of each group. To keep the setup simple, we use a multinomial mixture model (MNMM) and expectation maximisation for clustering instead of a semantic topic model like Dual-HDP (Wang et al. 2008). Third, we also compare our method to dynamic time warping with a probability product kernel (\(\text {DTW}_{\text {dist}}\)) serving as local distance measure. This method applies the probability product kernel to the fitted Gaussian distributions. The number of clusters is determined using the silhouette measure (Rousseeuw 1987), the Hartigan index (Hartigan 1975) as well as next-neighbour consistency for all methods.
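A kernelised k-medoids only needs the Gram matrix, because every positive-definite kernel induces the distance \(d(P,Q)^2 = k(P,P) + k(Q,Q) - 2k(P,Q)\) in feature space. The sketch below is a simple alternating variant under that identity; it is not claimed to be the exact procedure used in the experiments, and the function names and the initialisation by given medoid indices are assumptions.

```python
import math

def kernel_distance(K, i, j):
    """Feature-space distance induced by the Gram matrix K."""
    return math.sqrt(max(K[i][i] + K[j][j] - 2.0 * K[i][j], 0.0))

def assign(K, medoids):
    """Label every item with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: kernel_distance(K, i, medoids[c]))
        for i in range(len(K))
    ]

def k_medoids(K, initial_medoids, max_iterations=20):
    """Alternate between assignment and medoid update until nothing changes."""
    medoids = list(initial_medoids)
    for _ in range(max_iterations):
        labels = assign(K, medoids)
        updated = []
        for c, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                updated.append(old)  # keep the medoid of an empty cluster
                continue
            # New medoid: the member with the smallest total distance to the rest.
            updated.append(
                min(members, key=lambda m: sum(kernel_distance(K, m, i) for i in members))
            )
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(K, medoids)
```

Any Gram matrix can be passed in, for instance one filled with spatiotemporal convolution kernel values for all pairs of trajectories.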
4.1 Artificial data
Recall that our spatiotemporal convolution kernels possess basic properties^{11} that are not shared with the baseline methods. Most notably these are invariance to permutations of the objects and to differences in speed as well as sensitivity to the spatial distribution of the objects and the direction of the movement.
The second setting deals with circular movements and each trajectory consists of four objects moving in circles of different radii, see Fig. 1 (bottom). The construction of the second set enables us to evaluate the ability of the kernels to identify the correct clusters when these only differ in the direction of the movement but a spatial separation between the clusters is not possible. Compared to linear movements, the circular task is more difficult, as the directions of only some objects differ between the clusters. The data generation is described in detail in “Appendix” section.
To evaluate the sensitivity of the methods to permutations, the objects’ ordering inside each multiobject trajectory is permuted randomly in both data sets. To assess the sensitivity to changes in speed, we flip a coin for each trajectory in both test sets. With 50 % probability only every second position is retained in the trajectory. The resulting trajectory corresponds to a movement with twice the speed of the original trajectory. Finally, we add uniformly distributed noise in the range \([-\epsilon , \epsilon ]\) for \(\epsilon \in \{0.005, 0.025, 0.05, 0.1, 0.25\}\) with zero mean to every position in each multiobject trajectory in both test sets.
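The three perturbations can be sketched as follows. The representation (a trajectory as a list of snapshots, each a tuple of object positions) and the function names are assumptions for illustration; in particular, the sketch applies one random reordering of the objects per trajectory, which is one possible reading of the permutation step.

```python
import random

def permute_objects(trajectory, rng):
    """Apply one random reordering of the objects to every snapshot."""
    order = list(range(len(trajectory[0])))
    rng.shuffle(order)
    return [tuple(snapshot[i] for i in order) for snapshot in trajectory]

def maybe_double_speed(trajectory, rng):
    """With probability 0.5 keep only every second snapshot."""
    return list(trajectory[::2]) if rng.random() < 0.5 else list(trajectory)

def add_uniform_noise(trajectory, eps, rng):
    """Add zero-mean uniform noise from [-eps, eps] to every coordinate."""
    return [
        tuple(tuple(c + rng.uniform(-eps, eps) for c in pos) for pos in snapshot)
        for snapshot in trajectory
    ]

def perturb(trajectory, eps, rng):
    """Permutation, optional speed-up and noise, applied in that order."""
    return add_uniform_noise(maybe_double_speed(permute_objects(trajectory, rng), rng), eps, rng)
```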
In this experiment, we focus on \(\text {STCK}_{\text {Dist}}\) as it leads to more accurate clusters than \(\text {STCK}_{\text {11}}\) and \(\text {STCK}_{\text {Mean}}\) throughout all ranges of noise. Figure 2 compares \(\text {STCK}_{\text {Dist}}\) to the three baselines. Our method outperforms dynamic time warping on both test sets and gives the most accurate clusterings. The multinomial mixture model performs reasonably well on linear movements but leads to inappropriate clusterings for circular ones. The poor results on the second data set are due to the absence of ordering information in the bag-of-directions representation when the objects move in a full circle. The Hausdorff distance-based approach of Junejo et al. leads to generally inaccurate clusterings. We attribute this finding to its sensitivity to permutations as well as its neglect of ordering information.
The runtime of multiobject trajectory clustering is governed by the number of trajectories, the length of the trajectories and the number of objects. Depending on the application, usually one or two dimensions are dominating. It is thus important to evaluate the runtime for each of the three dimensions. The theoretical runtime is \(O(N^{2}\cdot L^{2})\) spatial as well as temporal kernel evaluations, where N is the number of trajectories and L is the maximum length of the trajectories. To confirm these findings experimentally, we generate random trajectory sets in \([0,1]^2\) and vary the number of trajectories per set, the length of the trajectories and the number of objects per trajectory. The results are depicted in Fig. 4.
Finally, Fig. 4 (bottom, right) varies the number of objects. The runtime is virtually constant for all methods except \(\text {STCK}_{\text {11}}\) and MNMM, which scale linearly. The observation is in line with our expectations as the Gaussian RBF kernel compares each object with its counterpart. The deviations in the case of \(\text {STCK}_{\text {Dist}}\) for a small number of objects (two or fewer) are due to the additional time needed by the shrinking schemes to restore non-singularity of the covariance matrices.
4.2 Real-world data
We now evaluate spatiotemporal convolution kernels on real-world data using positional data streams of ten soccer games of the German Bundesliga. The goal is to identify movement patterns by analysing the tracking data.
The tracking data is captured by the VIS.TRACK (Impire 2014) tracking system during five games of Bundesliga Team A and five games of Bundesliga Team B from the 2011/12 Bundesliga season.^{12} VIS.TRACK is a video-based tracking system consisting of several cameras in the soccer stadium. It determines the positions of the players, ball and referees at 25 frames per second, which amounts to roughly 135,000 positions per object and match and a total of 31,000,000 positions. Additionally, a marker indicates the status of the ball (in play, stoppage) and the possession of the ball. The x-coordinate (parallel to the sidelines) and the y-coordinate (parallel to the end lines) range over \([-1,1]\); absolute values greater than one occur if the ball is out of bounds. The data stream is preprocessed so that positions of the second half are mirrored to account for the changeover at half time and the frame numbers of the second half are changed to succeed those of the first half. Additionally, the playing direction is determined and normalised, so that the team of interest always plays from left to right. Subsequently, we extract two types of sequences: game initiations and scoring opportunities.
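The merging of the two halves can be sketched as follows. The data layout (a list of `(frame, positions)` pairs with a dictionary from object id to coordinates) and the function name are hypothetical. The sketch also assumes that mirroring means a point reflection about the pitch centre, i.e. negating both coordinates; the tracking system's actual conventions may differ. The first lines check the reported data volume under the assumption of 90 minutes of play without stoppage time.

```python
FPS = 25
FRAMES_PER_MATCH = FPS * 60 * 90  # 135,000 positions per object, assuming 90 minutes

def merge_halves(first_half, second_half):
    """Concatenate two halves into one stream with consecutive frame numbers.

    Each half is a list of (frame, positions) pairs, where positions maps an
    object id to an (x, y) tuple. Second-half positions are reflected about
    the pitch centre (assumed meaning of "mirrored").
    """
    if not first_half or not second_half:
        return list(first_half) + list(second_half)
    offset = first_half[-1][0] + 1 - second_half[0][0]
    merged = list(first_half)
    for frame, positions in second_half:
        mirrored = {obj: (-x, -y) for obj, (x, y) in positions.items()}
        merged.append((frame + offset, mirrored))
    return merged

first = [(0, {"ball": (0.5, 0.2)}), (1, {"ball": (0.6, 0.2)})]
second = [(0, {"ball": (-0.5, 0.1)}), (1, {"ball": (-0.4, 0.1)})]
stream = merge_halves(first, second)
```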
Game initiations (GI) begin with the goalkeeper passing the ball and end when the team loses possession, play is stopped, the ball reaches the attacking third of the field, or the next game initiation begins. Sequences shorter than length 12 are excluded. Scoring opportunities (SO) begin at the time of the last stoppage or win of the ball and end when the ball is carried into a predefined zone of danger, here the attacking quarter of the field \([0.5, 1]\times [-1,1]\). Again, sequences with a length below 12 are discarded.
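To make the definition of scoring opportunities concrete, the following sketch extracts them from per-frame signals. It assumes that sequence length is measured in frames, that the zone of danger starts at x = 0.5 for a team playing from left to right, and that ball status and possession are available as boolean sequences. The function name and its parameters are hypothetical.

```python
def extract_scoring_opportunities(ball_x, in_play, own_possession,
                                  zone_x=0.5, min_length=12):
    """Return (start, end) frame pairs of scoring opportunities.

    Assumptions: one entry per frame, the team of interest plays left to
    right, and the length threshold is counted in frames.
    """
    sequences = []
    start = None      # first frame of the current possession phase
    consumed = False  # True once this phase has already reached the zone
    for t in range(len(ball_x)):
        if not (in_play[t] and own_possession[t]):
            # Stoppage or loss of the ball: the next phase starts afresh.
            start, consumed = None, False
            continue
        if start is None:
            start = t
        if not consumed and ball_x[t] >= zone_x:
            # The ball enters the zone of danger: close the sequence.
            if t - start + 1 >= min_length:
                sequences.append((start, t))
            consumed = True
    return sequences
```

Game initiations could be extracted analogously by starting a sequence at a goalkeeper pass and closing it at the first of the listed termination events.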
For every sequence there are 23 possible trajectories (ball, 22 players, no referees) to include in the analysis. Since the opposing team changes from game to game, we restrict the analysis to twelve objects, namely the ball and the players of Team A or Team B, respectively. In the following experiments we consider the ball as one group in the sense of Definition 2. We further include the back four (game initiations) and the four most offensive players (scoring opportunities) as a second group (in the sense of Definition 2) in the analysis. The clusterings in the remainder are thus based on the trajectories of five objects (ball, four players).
In the Bundesliga 2011/2012 season, the strategy of Team A generally consisted of transporting the ball to the opposing half with few but rehearsed short game initiations. For this purpose, many ball contacts were allowed and different players were integrated. By contrast, Team B showed a rather chaotic game organisation with rather random actions and increasingly long, straight balls. Figure 7 shows the outcomes of the k-medoids clustering for the game initiations. Compared to all other methods, \(\text {STCK}_{\text {Dist}}\) captures the characteristic traits of the teams well. Team A clearly acted with many short moves (long trajectories in cluster 1) and integrated many players in the playmaking (cluttered medians). By contrast, Team B acted with many long moves (short trajectories in all clusters) and preferred linear actions.
Near the opposing goal, Team A aimed at quickly scoring a goal during the 2011/2012 season. They operated with only a few ball contacts and aimed to quickly transport the ball into the predefined zone of danger. By contrast, Team B had many ball contacts, took their time waiting for a mistake of the opponent, and only then played into the zone of danger to score a goal. Figure 8 shows that all methods are capable of retracing the different offensive strategies of both teams. Again, Team B has more solution categories (more than 33 %) than Team A, which is shown by the versatile and multifaceted running patterns. However, only the results with \(\text {STCK}_{\text {Dist}}\) show that Team A rapidly tries to enter the zone of danger with very few ball contacts (short sequences in all clusters compared to Team B). To sum up, the results with \(\text {STCK}_{\text {Dist}}\) best reflect the game philosophies of Team A and B from a sport-scientific perspective and are the easiest to interpret.
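The k-medoids clusterings discussed above operate on pairwise similarities between sequences. A minimal sketch of how a precomputed kernel matrix K can be clustered is given below. It uses the standard kernel-induced distance \(d(x,y)^2 = k(x,x) - 2k(x,y) + k(y,y)\) and a simple alternating k-medoids scheme. Both function names are hypothetical, and whether the experiments use exactly this variant of k-medoids is an assumption.

```python
import numpy as np

def kernel_distances(K):
    """Distances induced by a kernel matrix: d^2 = k(x,x) - 2 k(x,y) + k(y,y)."""
    diag = np.diag(K)
    squared = diag[:, None] - 2.0 * K + diag[None, :]
    # Clip tiny negative values caused by rounding before taking the root.
    return np.sqrt(np.maximum(squared, 0.0))

def k_medoids(D, k, n_iter=100, seed=0):
    """Alternating k-medoids on a distance matrix D (illustrative variant)."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(D.shape[0], size=k, replace=False)
    for _ in range(n_iter):
        # Assignment step: each sequence joins its closest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue
            # Update step: the member with minimal total distance becomes medoid.
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

# Toy stand-in for a kernel matrix between sequences (not an STCK).
points = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
K = np.exp(-0.5 * (points[:, None] - points[None, :]) ** 2)
medoids, labels = k_medoids(kernel_distances(K), k=2)
```

Because the medoids are actual sequences from the data, each cluster can be inspected by plotting its medoid, which supports the kind of interpretation given for Figs. 7 and 8.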
5 Conclusion
We presented spatiotemporal convolution kernels for multiobject scenarios. Our kernels consist of a temporal and a spatial component that can be chosen according to the characteristic traits of the problem at hand. The computation time is quadratic in the number and lengths of trajectories. We proposed an efficient percental approximation algorithm that significantly reduces the complexity to superlinear runtime. Empirical results on artificial clustering tasks showed that our spatiotemporal convolution kernels effectively identify the target concepts. Results on large-scale real-world data from soccer games showed that our kernels lead to easily interpretable clusters that may be used in further analysis by coaches.
Footnotes
1. Note that the group membership of an object can change over time.
2. Note that the definition of \(P_2\) is such that it corresponds to an object moving with twice the speed of the first object.
3. Note that the property can be easily adapted to explicitly include speed by adding an extra coordinate. The position of each object is replaced by a position-speed pair \(\left( x, \frac{\mathrm {d}x}{\mathrm {d}t}\right) _t\), see Jinyang et al. (2011) for details.
4. Originally, convolution kernels are defined on arbitrary sets X.
5. However, if the number of objects was not constant, \(k_{{\mathcal {X}}}\) would not be a kernel.
6. For example, 10 objects lead to about \(3.6 \cdot 10^6\) permutations.
7. The case of an infinite set X without a metric can be reduced to the finite case by only considering the positions that are attained by some objects at some time. These are only finitely many.
8. In this study a threshold on the covariance matrix’s condition number of 30 is used.
9. In order to simplify the notation, group index g has been omitted.
10. The group index g has again been omitted for reasons of readability.
11. The following properties refer to the \(\text {STCK}_{\text {Dist}}\).
12. The team names must not be disclosed.
References
Andrieu, C., de Freitas, N., Doucet, A., & Jordan, M. I. (2003). An introduction to MCMC for machine learning. Machine Learning, 50(1–2), 5–43.
Basharat, A., Gritai, A., & Shah, M. (2008). Learning object motion patterns for anomaly detection and improved object detection. In IEEE conference on computer vision and pattern recognition.
Baud, O., ElBied, Y., Honore, N., & Taupin, O. (2007). Trajectory comparison for civil aircraft. In Aerospace conference, 2007 IEEE.
Bellman, R., & Kalaba, R. (1959). On adaptive control processes. IRE Transactions on Automatic Control, 4(2), 1–9.
Bialkowski, A., Lucey, P., Carr, P., Denman, S., Matthews, I., & Sridharan, S. (2013). Recognising team activities from noisy data. In IEEE conference on computer vision and pattern recognition workshops.
Bialkowski, A., Lucey, P., Carr, P., Yue, Y., & Matthews, I. (2014). “Win at home and draw away”: Automatic formation analysis highlighting the differences in home and away team behaviors. In MIT Sloan sports analytics conference.
Blanchard, G., Bousquet, O., & Massart, P. (2008). Statistical performance of support vector machines. The Annals of Statistics, 36(2), 489–531.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Buzan, D., Sclaroff, S., & Kollios, G. (2004). Extraction and clustering of motion trajectories in video. In International conference on pattern recognition (Vol. 2).
Chen, Y., Wiesel, A., Eldar, Y., & Hero, A. (2010). Shrinkage algorithms for MMSE covariance estimation. IEEE Transactions on Signal Processing, 58(10), 5016–5029.
Direkoglu, C., & O’Connor, N. (2012). Team behavior analysis in sports using the Poisson equation. In IEEE international conference on image processing.
Fowlkes, C., Belongie, S., Chung, F., & Malik, J. (2004). Spectral grouping using the Nyström method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2), 214–225.
Fu, Z., Hu, W., & Tan, T. (2005). Similarity based vehicle trajectory clustering and anomaly detection. In International conference on image processing (Vol. 2).
Giannotti, F., Nanni, M., Pinelli, F., & Pedreschi, D. (2007). Trajectory pattern mining. In Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’07. ACM, New York, NY, USA. doi: 10.1145/1281192.1281230.
Grunz, A., Memmert, D., & Perl, J. (2012). Tactical pattern recognition in soccer games by means of special self-organizing maps. Human Movement Science, 31(2), 334–343.
Hartigan, J. A. (1975). Clustering algorithms (99th ed.). New York, NY: Wiley.
Hausdorff, F. (1962). Set theory. Chelsea: Chelsea Pub. Co.
Haussler, D. (1999). Convolution kernels on discrete structures. Technical report UCSC-CRL-99-10, University of California at Santa Cruz.
Hervieu, A., & Bouthemy, P. (2010). Understanding sports video using players trajectories. In Intelligent video event analysis and understanding, studies in computational intelligence (Vol. 332). Berlin, Heidelberg: Springer.
Hirano, S., & Tsumoto, S. (2005). Grouping of soccer game records by multiscale comparison technique and rough clustering. In International conference on hybrid intelligent systems.
Hu, W., Xiao, X., Fu, Z., Xie, D., Tan, T., & Maybank, S. (2006). A system for learning statistical motion patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9), 1450–1464.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.
Impire AG (2014). VIS.TRACK Produktinformation. http://www.bundesligadatenbank.de/fileadmin/impire/products/Produktinformationen/VIS.TRACK__Produktionformation.pdf. Accessed June 10, 2014.
Intille, S. S., & Bobick, A. F. (2001). Recognizing planned multiperson action. Computer Vision and Image Understanding, 81(3), 414–445.
Jebara, T., Kondor, R., & Howard, A. (2004). Probability product kernels. Journal of Machine Learning Research, 5, 819–844.
Jeong, H., Chang, H. J., & Choi, J. Y. (2011). Modeling of moving object trajectory by spatiotemporal learning for abnormal behavior detection. In IEEE international conference on advanced video and signal-based surveillance.
Jinyang, D., Rangding, W., Liangxu, L., & Jiatao, S. (2011). Clustering of trajectories based on Hausdorff distance. In International conference on electronics, communications and control.
Junejo, I., Javed, O., & Shah, M. (2004). Multi feature path modeling for video surveillance. In International conference on pattern recognition (Vol. 2).
Kang, C.H., Hwang, J.R., & Li, K.J. (2006). Trajectory analysis for soccer players. In IEEE international conference on data mining workshops.
Kang, J., & Yong, H. S. (2008). Spatiotemporal discretization for sequential pattern mining. In Proceedings of the 2nd international conference on ubiquitous information management and communication. ACM, New York, NY, USA.
Kang, S. J., Kim, Y., Park, T., & Kim, C. H. (2013). Automatic player behavior analysis system using trajectory data in a massive multiplayer online game. Multimedia Tools and Applications. doi: 10.1007/s110420121052x.
Kaufman, L., & Rousseeuw, P. J. (1987). Clustering by means of medoids. In Y. Dodge (Ed.), Statistical Data Analysis Based on the \(L_{1}\) Norm and Related Methods (pp. 405–416). North-Holland.
Kempe, M., Grunz, A., & Memmert, D. (2014). Detecting tactical patterns in basketball: Comparison of merge self-organising maps and dynamic controlled neural networks. European Journal of Sport Science, 15(4), 249–255.
Kim, K., Grundmann, M., Shamir, A., Matthews, I., Hodgins, J., & Essa, I. (2010). Motion fields to predict play evolution in dynamic sport scenes. In IEEE conference on computer vision and pattern recognition.
Larson, J.S., Bradlow, E.T., & Fader, P.S. (2005). An exploratory look at supermarket shopping paths. International Journal of Research in Marketing, 22. doi: 10.1016/j.ijresmar.2005.09.005.
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE computer society conference on computer vision and pattern recognition (Vol. 2).
Ledoit, O., & Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2), 365–411.
Lee, J.G., Han, J., & Li, X. (2008). Trajectory outlier detection: A partition-and-detect framework. In Proceedings of the 2008 IEEE 24th international conference on data engineering. IEEE Computer Society: Washington, DC, USA. doi: 10.1109/ICDE.2008.4497422.
Li, R., & Chellappa, R. (2010). Group motion segmentation using a spatiotemporal driving force model. In IEEE conference on computer vision and pattern recognition.
Li, R., Chellappa, R., & Zhou, S. (2009). Learning multimodal densities on discriminative temporal interaction manifold for group activity recognition. In IEEE conference on computer vision and pattern recognition.
Lin, C. J. (2001). On the convergence of the decomposition method for support vector machines. IEEE Transactions on Neural Networks, 12(6), 1288–1298.
Lin, D., Grimson, E., & Fisher, J. (2009). Learning visual flows: A Lie algebraic approach. In IEEE conference on computer vision and pattern recognition.
Lucey, P. J., Bialkowski, A., Carr, P., Morgan, S., Matthews, I., & Sheikh, Y. (2013). Representing and discovering adversarial team behaviors using player roles. In IEEE conference on computer vision and pattern recognition.
Mamoulis, N., Cao, H., Kollios, G., Hadjieleftheriou, M., Tao, Y., & Cheung, D.W. (2004). Mining, indexing, and querying historical spatiotemporal data. In Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining. ACM: New York, NY, USA.
Mehta, S., Parthasarathy, S., & Yang, H. (2005). Toward unsupervised correlation preserving discretization. IEEE Transactions on Knowledge and Data Engineering, 17(9), 1174–1185.
Ng, A. Y., Jordan, M. I., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems (pp. 849–856).
Ong, C. S., Canu, S., & Smola, A. J. (2004). Learning with non-positive kernels. In International conference on machine learning.
Pao, H. K., Chen, K. T., & Chang, H. C. (2010). Game bot detection via avatar trajectory analysis. IEEE Transactions on Computational Intelligence and AI in Games, 2(3), 162–175.
Perše, M., Kristan, M., Kovačič, S., Vučkovič, G., & Perš, J. (2009). A trajectory-based analysis of coordinated team activity in a basketball game. Computer Vision and Image Understanding, 113(5), 612–621.
Piciarelli, C., Foresti, G., & Snidaro, L. (2005). Trajectory clustering and its applications for video surveillance. In IEEE conference on advanced video and signal based surveillance.
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336), 846–850.
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20.
Saleemi, I., Shafique, K., & Shah, M. (2009). Probabilistic modeling of scene dynamics for applications in visual surveillance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(8).
Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge: Cambridge University Press.
Siddiquie, B., Yacoob, Y., & Davis, L. (2009). Recognizing plays in American football videos.
Steinwart, I. (2002). Support vector machines are universally consistent. Journal of Complexity, 18(3), 768–791.
Trunfio, G. A., D’Ambrosio, D., Rongo, R., Spataro, W., & Di Gregorio, S. (2011). A new algorithm for simulating wildfire spread through cellular automata. In ACM transactions on modeling and computer simulation.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer.
Wang, X., Ma, K. T., Ng, G. W., & Grimson, W. (2008). Trajectory analysis and semantic region modeling using a nonparametric Bayesian model. In IEEE conference on computer vision and pattern recognition.
Wauthier, F. L., Jojic, N., & Jordan, M. I. (2012). Active spectral clustering via iterative uncertainty reduction. In ACM SIGKDD international conference on knowledge discovery and data mining. ACM.
Wei, X., Sha, L., Lucey, P., Morgan, S., & Sridharan, S. (2013). Large-scale analysis of formations in soccer. In International conference on digital image computing: Techniques and applications.
Williams, C., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Advances in neural information processing systems 13 (pp. 682–688). MIT Press.
Zhu, G., Huang, Q., Xu, C., Rui, Y., Jiang, S., Gao, W., & Yao, H. (2007). Trajectory based event tactics analysis in broadcast sports video. In International conference on multimedia. ACM.