New Frontiers for Scan Statistics: Network, Trajectory, and Text Data

  • Living reference work entry
Handbook of Scan Statistics

Abstract

In this chapter we survey recent theoretical developments in scan statistics and their use on data represented as graphs, trajectories, and text. These types of data are becoming common in today's world of massive digital data. Large social networks are represented by complex graphs. The paths of moving objects are recorded, for example when people log their travel routes and generate GPS trajectories. Large quantities of text are continuously generated by news wire services and social networks. There is great interest in developing algorithms with a strong statistical basis for detecting anomalies in these types of data, and we review the use of scan statistics in these situations. Additionally, we identify three main opportunities and challenges that the big data era poses for scan statistics: we need to deal with new stochastic data structures; we need much higher computational efficiency than is currently available; and we need models that can deal with the variability that appears in the large samples now collected.

Author information

Corresponding author

Correspondence to Renato M. Assunção.

Copyright information

© 2020 Springer Science+Business Media, LLC, part of Springer Nature

About this entry

Cite this entry

Assunção, R.M., Souza, R.C.S.N.P., Prates, M.O. (2020). New Frontiers for Scan Statistics: Network, Trajectory, and Text Data. In: Glaz, J., Koutras, M. (eds) Handbook of Scan Statistics. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8414-1_47-1

  • DOI: https://doi.org/10.1007/978-1-4614-8414-1_47-1

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-8414-1

  • Online ISBN: 978-1-4614-8414-1

  • eBook Packages: Springer Reference Mathematics; Reference Module Computer Science and Engineering
