Abstract
In this chapter we survey new theoretical developments and applications of scan statistics for data represented as graphs, trajectories, and text. These types of data are becoming common in today's world of massive digital data. Large social networks are represented by complex graphs. The paths of moving objects are recorded, for example when people log their travel routes and generate GPS trajectories. Large quantities of text are continuously generated by news wire services and social networks. There is considerable interest in developing algorithms with a strong statistical basis for detecting anomalies in these types of data. We review the use of scan statistics in these settings. Additionally, we identify three main opportunities and challenges that the big data era poses for scan statistics: we need to deal with new stochastic data structures; we need much higher computational efficiency than is currently available; and we need models that can deal with the variability that appears in the large samples now collected.
© 2020 Springer Science+Business Media, LLC, part of Springer Nature
About this entry
Cite this entry
Assunção, R.M., Souza, R.C.S.N.P., Prates, M.O. (2020). New Frontiers for Scan Statistics: Network, Trajectory, and Text Data. In: Glaz, J., Koutras, M. (eds) Handbook of Scan Statistics. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8414-1_47-1
DOI: https://doi.org/10.1007/978-1-4614-8414-1_47-1
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8414-1
Online ISBN: 978-1-4614-8414-1
eBook Packages: Springer Reference Mathematics; Reference Module Computer Science and Engineering