Discovering recurring activity in temporal networks

Kostakis, Orestis; Tatti, Nikolaj; Gionis, Aristides

doi:10.1007/s10618-017-0515-0

Published: 17 June 2017

Volume 31, pages 1840–1871, (2017)
Data Mining and Knowledge Discovery

Orestis Kostakis¹,
Nikolaj Tatti² &
Aristides Gionis²

Abstract

Recent advances in data-acquisition technologies have equipped team coaches and sports analysts with the capability of collecting and analyzing detailed data of team activity in the field. It is now possible to monitor a sports event and record information regarding the position of the players in the field, passing the ball, coordinated moves, and so on. In this paper we propose a new method to analyze such team activity data. Our goal is to segment the overall activity stream into a sequence of potentially recurrent modes, which reflect different strategies adopted by a team, and thus, help to analyze and understand team tactics. We model team activity data as a temporal network, that is, a sequence of time-stamped edges that capture interactions between players. We then formulate the problem of identifying a small number of team modes and segmenting the overall timespan so that each segment can be mapped to one of the team modes; hence the set of modes summarizes the overall team activity. We prove that the resulting optimization problem is $\mathrm {NP}$-hard, and we discuss its properties. We then present a number of different algorithms for solving the problem, including an approximation algorithm that is practical only for one mode, as well as heuristic methods based on iterative and greedy approaches. We benchmark the performance of our algorithms on real and synthetic datasets. Of all methods, the iterative algorithm provides the best combination of performance and running time. We demonstrate practical examples of the insights provided by our algorithms when mining real sports-activity data. In addition, we show the applicability of our algorithms on other types of data, such as social networks.

Notes

https://doi.org/10.5281/zenodo.290629
https://doi.org/10.5281/zenodo.160509
This hashtag refers to the first semi-final of the 2014 World Cup held in Brazil. Germany beat home-team Brazil by 7–1.

Author information

Authors and Affiliations

Microsoft Corporation, Redmond, WA, USA
Orestis Kostakis
HIIT, Aalto University, Espoo, Finland
Nikolaj Tatti & Aristides Gionis

Corresponding author

Correspondence to Orestis Kostakis.

Additional information

The work was carried out while the author was at Aalto University, Espoo, Finland.

Appendix: Proof of NP-hardness

To prove NP-hardness, we use the following problem.

Problem 4

(Satisfy) Assume that we are given q formulas over $\ell $ variables $\left\{ v_i\right\} $, each of the form $\lnot z = x \wedge y$, where x, y, and z are boolean variables or their negations. Decide whether these formulas can be simultaneously satisfied with $v_1$ set to true.

Proposition 10

Satisfy is NP-complete.

Proof

We will prove the hardness by reduction from 3SAT. Assume an instance of 3SAT with n variables and m clauses.

If the ith clause has two literals, $x \vee y$, add the formula $\lnot c_i = \lnot x \wedge \lnot y$.

If the ith clause has three literals, $x \vee y \vee z$, add the two formulas $h_i = \lnot x \wedge \lnot y$ and $\lnot c_i = h_i \wedge \lnot z$.

If the ith clause contains one literal x, then refer to x as $c_i$.

Add $m - 1$ variables $v_1, \ldots , v_{m - 1}$, and formulas $v_i = v_{i + 1} \wedge c_i$, for $i = 1, \ldots , m - 2$, and $v_{m - 1} = c_{m - 1} \wedge c_m$.

It follows that the ith clause can be satisfied if and only if $c_i$ can be set to true. All $c_i$ can be set to true if and only if $v_1$ can be set to true. $\square $
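The reduction above can be sketched in Python as follows. This is only an illustration under stated assumptions: literals are encoded as nonzero integers (`+i` for a variable, `-i` for its negation), a formula $\lnot z = x \wedge y$ is stored as the triple `(z, x, y)`, and the function names `reduce_3sat_to_satisfy` and `satisfy_brute_force` are hypothetical. The sketch assumes at least two clauses, since the chain $v_1, \ldots , v_{m - 1}$ is empty otherwise, and the brute-force check is exponential and meant only for tiny instances.

```python
from itertools import product

def reduce_3sat_to_satisfy(n, clauses):
    """Map a 3SAT instance (variables 1..n) to (num_vars, formulas, target).

    A formula (z, x, y) encodes  not z = x and y;  `target` plays the role of v_1.
    """
    if len(clauses) < 2:
        raise ValueError("this sketch assumes at least two clauses")
    next_var = n

    def fresh():
        nonlocal next_var
        next_var += 1
        return next_var

    formulas, c = [], []          # c[i] is the literal standing for clause i
    for clause in clauses:
        if len(clause) == 1:      # one literal: refer to it as c_i
            c.append(clause[0])
        elif len(clause) == 2:    # not c_i = not x and not y
            x, y = clause
            ci = fresh()
            formulas.append((ci, -x, -y))
            c.append(ci)
        elif len(clause) == 3:    # h_i = not x and not y;  not c_i = h_i and not z
            x, y, z = clause
            hi, ci = fresh(), fresh()
            formulas.append((-hi, -x, -y))
            formulas.append((ci, hi, -z))
            c.append(ci)
        else:
            raise ValueError("clauses must have 1 to 3 literals")

    m = len(clauses)
    v = [fresh() for _ in range(m - 1)]
    for i in range(m - 2):        # v_i = v_{i+1} and c_i
        formulas.append((-v[i], v[i + 1], c[i]))
    formulas.append((-v[m - 2], c[m - 2], c[m - 1]))  # v_{m-1} = c_{m-1} and c_m
    return next_var, formulas, v[0]

def value(lit, assignment):
    """Truth value of a literal under a tuple of booleans."""
    val = assignment[abs(lit) - 1]
    return val if lit > 0 else not val

def satisfy_brute_force(num_vars, formulas, target):
    """Decide Satisfy by trying every assignment with `target` set to true."""
    for a in product([False, True], repeat=num_vars):
        if value(target, a) and all(
            value(-z, a) == (value(x, a) and value(y, a)) for z, x, y in formulas
        ):
            return True
    return False
```

For example, `satisfy_brute_force(*reduce_3sat_to_satisfy(3, [[1, 2, 3], [-1], [-2, -3]]))` checks the Satisfy instance produced for $(x_1 \vee x_2 \vee x_3) \wedge \lnot x_1 \wedge (\lnot x_2 \vee \lnot x_3)$.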

Proposition 11

(k, 1)-segmentation is NP-hard.

Proof

We will prove the hardness by reduction from Satisfy. Assume that we are given an instance of Satisfy with q formulas and $\ell $ variables.

We begin by specifying the vertices. The total number of vertices is $1 + 3 + 2\ell + r$, where $r = (20q + 12\ell + 2)(3 + 2\ell )$.

The first vertex is $\alpha $, and every edge will be adjacent to $\alpha $. The next three vertices are $t_1$, $t_2$, and $t_3$. Our construction will make sure that $(\alpha , t_i) \in E(G)$.

The next $2\ell $ vertices correspond to the variables and their negations; we will denote them by $v_i$ and $u_i$, for $i = 1, \ldots , \ell $. We will denote by X the set of possible edges between $\alpha $ and these vertices. Define $X' = X {\setminus } \left\{ (\alpha , u_1), (\alpha , v_1)\right\} $.

Finally, the last r vertices are auxiliary vertices that will allow us to force segmentation borders. We will denote the set of possible edges between these vertices and $\alpha $ by B.

Our interaction network consists of 3 parts, which in turn consist of sections. All these sections and parts are combined consecutively.

The first part, say $P_1$, consists of $2\ell $ sections, each containing 5 time points. The first $\ell $ sections are defined as

$$\begin{array}{rlllll} (\alpha , v_i): & 1 & 1 & & & \\ (\alpha , u_i): & & & & 1 & 1 \\ (\alpha , t_1): & 1 & 1 & & 1 & 1 \\ (\alpha , t_2): & 1 & 1 & & 1 & 1 \\ (\alpha , t_3): & & 1 & 1 & 1 & \\ \text {for every } e \in B: & 1 & & 1 & & 1 \\ \end{array}$$

The last $\ell $ sections are copies of the first $\ell $ sections, except that they also contain the remaining edges from X at the 1st, 3rd, and 5th time points.

The second part, say $P_2$, consists of 2q sections, each containing 7 time points. Let $c_i = (\lnot z = x \wedge y)$ be the ith formula. Using the same letters to represent the corresponding vertices, and taking negations into account, we define the ith section, where $i = 1, \ldots , q$, as

$$\begin{array}{rlllllll} (\alpha , x): & 1 & 1 & & & 1 & & \\ (\alpha , y): & 1 & & & 1 & 1 & & \\ (\alpha , z): & & 1 & 1 & 1 & & & 1 \\ (\alpha , t_1): & 1 & 1 & & 1 & 1 & 1 & 1 \\ (\alpha , t_2), (\alpha , t_3): & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ \text {for every } e \in B: & 1 & & 1 & & 1 & 1 & 1 \\ \end{array}$$

The $(q + i)$th section is a copy of the ith section, except that it also contains the remaining edges from X at the 1st, 3rd, and 5th–7th time points.

The last part, say $P_3$, consists of $10q + 6\ell + 2$ sections, each consisting of a single time point. Each section contains B, $(\alpha , t_i)$, and $(\alpha , u_1)$. Moreover, every even section contains the edges in $X'$.
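The three parts can be assembled programmatically. The following Python sketch is one possible encoding, not the authors' code: every edge is named by its endpoint other than $\alpha $, each time point is a set of such names, and a formula is given directly as a triple `(z, x, y)` of vertex names (assumed distinct). The helper names `expand`, `p1_section`, `p2_section`, and `build_network` are hypothetical.

```python
def expand(rows, length):
    """Turn {edge: bit-string} rows into a list of edge sets, one per time point."""
    return [{e for e, bits in rows.items() if bits[t] == "1"} for t in range(length)]

def p1_section(i, b_edges, rest=()):
    """Five time points of a P_1 section; `rest` is non-empty only in the copies."""
    rows = {f"v{i}": "11000", f"u{i}": "00011",
            "t1": "11011", "t2": "11011", "t3": "01110"}
    rows.update({e: "10101" for e in (*b_edges, *rest)})
    return expand(rows, 5)

def p2_section(z, x, y, b_edges, rest=()):
    """Seven time points of a P_2 section for the formula  not z = x and y."""
    rows = {x: "1100100", y: "1001100", z: "0111001",
            "t1": "1101111", "t2": "1111111", "t3": "1111111"}
    rows.update({e: "1010111" for e in (*b_edges, *rest)})
    return expand(rows, 7)

def build_network(num_vars, formulas):
    """Concatenate P_1, P_2, and P_3 into one list of edge sets."""
    q = len(formulas)
    r = (20 * q + 12 * num_vars + 2) * (3 + 2 * num_vars)  # auxiliary vertices
    b = [f"b{j}" for j in range(r)]
    x_all = {f"{s}{i}" for i in range(1, num_vars + 1) for s in "vu"}
    net = []
    for copy in (False, True):            # P_1: originals, then copies
        for i in range(1, num_vars + 1):
            rest = x_all - {f"v{i}", f"u{i}"} if copy else set()
            net += p1_section(i, b, rest)
    for copy in (False, True):            # P_2: originals, then copies
        for z, x, y in formulas:
            rest = x_all - {x, y, z} if copy else set()
            net += p2_section(z, x, y, b, rest)
    x_prime = x_all - {"u1", "v1"}
    for s in range(1, 10 * q + 6 * num_vars + 3):   # P_3: single time points
        snap = set(b) | {"t1", "t2", "t3", "u1"}
        if s % 2 == 0:                    # every even section also contains X'
            snap |= x_prime
        net.append(snap)
    return net
```

With one formula over three variables, for instance `build_network(3, [("u3", "v1", "v2")])`, the network has 74 time points, and the edges of B are present at 58 of them, which equals $20q + 12\ell + 2$ for $q = 1$ and $\ell = 3$.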

We set $k = 20q + 12\ell + 2$. We claim that Satisfy is true if and only if the optimal segmentation has a score of

$$\sigma = {\left| P_1\right| }/2 \times (3(2\ell - 2) + 2) + {\left| P_2\right| }/2 \times (5(2\ell - 3) + 12) + {\left| P_3\right| }/2 \times (2\ell - 2).$$
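As a sanity check on the sizes involved, the quantities above can be evaluated for a concrete instance. The sketch below assumes that ${\left| P_i\right| }$ denotes the number of sections in part $P_i$ (so that ${\left| P_i\right| }/2$ counts pairs of sections, matching the pairwise bounds in the steps of the proof); the function name `instance_sizes` is hypothetical.

```python
def instance_sizes(q, ell):
    """Sizes of the constructed instance for q formulas over ell variables."""
    k = 20 * q + 12 * ell + 2
    r = k * (3 + 2 * ell)                    # auxiliary vertices
    vertices = 1 + 3 + 2 * ell + r
    p1, p2, p3 = 2 * ell, 2 * q, 10 * q + 6 * ell + 2   # sections per part
    # 3 segments per P_1 section, 5 per P_2 section, 1 per P_3 section
    assert 3 * p1 + 5 * p2 + p3 == k
    sigma = (p1 // 2 * (3 * (2 * ell - 2) + 2)
             + p2 // 2 * (5 * (2 * ell - 3) + 12)
             + p3 // 2 * (2 * ell - 2))
    return {"k": k, "r": r, "vertices": vertices, "sigma": sigma}
```

For $q = 1$ and $\ell = 3$ this gives $k = 58$, $r = 522$, 532 vertices, and $\sigma = 42 + 27 + 60 = 129$.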

We will prove this in several steps.

Step (i): Every $e \in B$ is contained in every segment exactly once. First, note that such a segmentation is possible since B occurs at k different time points. The optimal cost of any such segmentation is bounded by r / 2, the number of possible edges outside B, $3 + 2\ell $, times half the number of segments, k / 2. Note that all edges $e \in B$ occur at exactly the same time points. Thus there is an optimal solution in which the edges of B are either all present in the core or all absent from it. Assume that there is a segment that disagrees with the core on B. Then the cost is at least r. Consequently, every segment must contain every $e \in B$. Since B occurs at k different time points, each segment can contain only one instance of each $e \in B$.

Step (ii): It follows immediately that the borders of the sections are included in the borders of the optimal segmentation. Moreover, each section in $P_1$ is divided into 3 segments, each section in $P_2$ is divided into 5 segments, and each section in $P_3$ corresponds to exactly 1 segment.

Step (iii): $(\alpha , t_i) \in E(G)$, $(\alpha , u_1) \in E(G)$, and $(\alpha , v_1) \notin E(G)$. This follows immediately from the fact that each section in $P_3$ corresponds to one segment, and ${\left| P_3\right| } > k / 2$, that is, $P_3$ contains the majority of the segments.

Step (iv): The cost of the ith and $(i + 1)$th section in $P_1$ is at least $3(2\ell - 2) + 2$. This bound is reached if and only if G contains either $u_i$ or $v_i$, but not both. First note that the middle segment in both sections contains the 3rd time point. This means that the remaining edges in X will occur exactly 3 times in 6 segments. Thus, they induce a cost of $3(2\ell - 2)$. A brute-force enumeration now implies that the involved edges induce a cost of at least 1, and this is possible if and only if G contains either $u_i$ or $v_i$, but not both.

Step (v): The cost of the ith and $(i + 1)$th section in $P_2$ is at least $5(2\ell - 3) + 12$. This bound is reached if and only if $(\alpha , z) \notin E(G) \Leftrightarrow (\alpha , x) \in E(G) $ and $(\alpha , y) \in E(G)$. First note that the 2nd segment in both sections contains the 3rd time point, and the 4th and 5th segments consist of exactly one time point. This implies that the remaining edges in X will occur exactly 5 times in 10 segments. Thus, they induce a cost of $5(2\ell - 3)$. A brute-force enumeration now implies that the involved edges induce a cost of at least 6, and this is possible if and only if $(\alpha , z) \notin E(G) \Leftrightarrow (\alpha , x) \in E(G) $ and $(\alpha , y) \in E(G)$.

Step (vi): The cost of an odd and even section in $P_3$ is equal to $2\ell - 2$. This follows from the fact that the edges in $X'$ occur exactly once in these two sections.

Step (vii): Steps (iv)–(vi) imply that $\sigma $ is a lower bound for the optimal score. This bound is reached if and only if the lower bound of each section is reached. This can happen if and only if each formula in Satisfy can be satisfied. $\square $
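The counting arguments in the steps above can be replayed on small examples with a few lines of Python. The cost used here is an assumption inferred from how the steps count disagreements, since the score is not defined in this appendix: each segment is summarized by the set of edges occurring in it, and its cost is the number of edges on which it disagrees with the core graph G. Under that assumption, the best core for a fixed segmentation keeps exactly the edges present in a strict majority of segments, which is the argument used in step (iii). The names `segments`, `cost`, and `majority_core` are hypothetical.

```python
from collections import Counter

def segments(snapshots, borders):
    """Edge set of each segment; `borders` are the start indices of segments 2..k."""
    cuts = [0, *borders, len(snapshots)]
    return [set().union(*snapshots[a:b]) for a, b in zip(cuts, cuts[1:])]

def cost(segs, core):
    """Total number of (segment, edge) disagreements with the core (assumed score)."""
    return sum(len(s ^ core) for s in segs)

def majority_core(segs):
    """Edges present in a strict majority of segments (ties are left out of the core)."""
    counts = Counter(e for s in segs for e in s)
    return {e for e, c in counts.items() if 2 * c > len(segs)}
```

For example, with the four time points `[{"a", "b"}, {"a"}, {"b", "c"}, {"a"}]` and a single border at index 2, the two segments are `{"a", "b"}` and `{"a", "b", "c"}`, the majority core is `{"a", "b"}`, and the cost is 1.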

About this article

Cite this article

Kostakis, O., Tatti, N. & Gionis, A. Discovering recurring activity in temporal networks. Data Min Knowl Disc 31, 1840–1871 (2017). https://doi.org/10.1007/s10618-017-0515-0

Received: 29 February 2016
Accepted: 22 May 2017
Published: 17 June 2017
Issue Date: November 2017
DOI: https://doi.org/10.1007/s10618-017-0515-0
