Knowledge and Information Systems

Volume 36, Issue 3, pp 749–788

Multi-domain anomaly detection in spatial datasets

  • Vandana P. Janeja
  • Revathi Palanisamy
Regular Paper

Abstract

A spatial anomaly captures a phenomenon in a region whose behavior deviates sharply from that of the other, normal observations. In reality, however, such an anomaly may affect other phenomena in the region across multiple domains. For example, crime is often linked to sociopolitical factors such as poverty and education. Similarly, accidents in a region may be linked to environmental factors such as weather and surface condition. Finding anomalies across multiple domains is therefore important in various applications. In this paper, we propose an approach for finding such a tangible anomalous window across multiple domains, where a window is a set of contiguous points in space; because the window is multi-domain, there are several overlapping windows in the same space across domains. Our approach for finding anomalous windows across domains comprises the following steps: (1) single-domain anomaly detection: discovering the anomalous window in each domain; (2) association rule mining: discovering relationships between the anomalous windows across domains; and (3) validation: validating the results using (a) Monte Carlo simulation, (b) correlation using lift and (c) ground truth evaluation. In addition, we provide a probabilistic framework to evaluate the relationships between the spatial nodes as a postprocessing step. Finally, we provide a visualization technique for viewing the multi-domain anomalous window and the probabilistic relationships between the nodes. We provide detailed experimental results and comparisons with other approaches using real-world health ranking [51] and transportation datasets [50] with known ground truth windows. The results show that our approach is effective in finding anomalies in multiple domains as compared to other approaches.
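
To make steps (2) and (3) concrete, the following is a minimal sketch of how support, confidence, and lift could be computed for a cross-domain rule, with a simple permutation-style Monte Carlo check on the lift. It is not the authors' implementation. The node labels, the domain names, the toy data, and the functions `rule_metrics` and `monte_carlo_p_value` are all hypothetical, and the shuffling scheme is only one plausible way to simulate a null hypothesis of no association.

```python
import random

# Hypothetical input: for each spatial node, the set of domains in which
# the node falls inside a single-domain anomalous window (step 1 output).
nodes = {
    "n1": {"accidents", "weather"},
    "n2": {"accidents", "weather"},
    "n3": {"accidents"},
    "n4": {"weather"},
    "n5": set(),
    "n6": set(),
    "n7": {"accidents", "weather"},
    "n8": set(),
}

def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent in t)
    n_c = sum(1 for t in transactions if consequent in t)
    n_both = sum(1 for t in transactions if antecedent in t and consequent in t)
    support = n_both / n
    confidence = n_both / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift

def monte_carlo_p_value(transactions, antecedent, consequent, trials=999, seed=0):
    """Share of random relabelings whose lift is at least the observed lift.

    Assumption: the null hypothesis is simulated by shuffling the consequent
    flags across nodes, which keeps both marginal counts fixed.
    """
    rng = random.Random(seed)
    observed = rule_metrics(transactions, antecedent, consequent)[2]
    has_a = [antecedent in t for t in transactions]
    has_c = [consequent in t for t in transactions]
    hits = 0
    for _ in range(trials):
        shuffled = has_c[:]
        rng.shuffle(shuffled)
        simulated = [
            ({antecedent} if a else set()) | ({consequent} if c else set())
            for a, c in zip(has_a, shuffled)
        ]
        if rule_metrics(simulated, antecedent, consequent)[2] >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)

transactions = list(nodes.values())
support, confidence, lift = rule_metrics(transactions, "accidents", "weather")
p_value = monte_carlo_p_value(transactions, "accidents", "weather")
print(support, confidence, lift, p_value)
```

On this toy data the rule accidents -> weather has support 0.375, confidence 0.75, and lift 1.5. A lift above 1 indicates that the two anomalous windows co-occur more often than independence would predict, and the Monte Carlo step estimates how often a lift that large arises by chance.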

Keywords

Multi-domain mining, Spatial anomaly detection, Association rule mining, Co-occurrence

References

  1. Agarwal D, McGregor A, Phillips JM, Venkatasubramanian S, Zhu Z (2006) Spatial scan statistics: approximations and performance study. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining (Philadelphia, PA, USA, August 20–23, 2006), KDD ’06. ACM, New York, NY, pp 24–33. doi: 10.1145/1150402.1150410
  2. Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: SIGMOD conference, pp 207–216
  3. Barnett V, Lewis T (1994) Outliers in statistical data. Wiley, New York
  4. Bonnie DR, Sorensen J, Guest Column (2011) Where you live matters to your health. http://www.news-journalonline.com/opinion/editorials/guest-columns/2010/07/12/where-you-live-matters-to-your-health.html. Last accessed March 2011
  5. Breiger RL (1974) The duality of persons and groups. University of North Carolina Press, Social Forces, Chapel Hill
  6. Chawla S, Sun P (2006) SLOM: a new measure for local spatial outliers. Knowl Inf Syst 9(4):412–429
  7. Computer science-advanced web and network technologies, and applications. Lecture Notes in Computer Science, 2008, vol 4977/2008, pp 99–109. doi: 10.1007/978-3-540-89376-9
  8. Das K, Schneider J, Neill DB (2008) Anomaly pattern detection in categorical datasets. In: Proceedings of 14th ACM SIGKDD 2008. ACM, New York, pp 169–176
  9. de Vries T, Chawla S, Houle ME (2011) Density-preserving projections for large-scale local anomaly detection. Knowl Inf Syst. doi: 10.1007/s10115-011-0430-4. Last accessed 9 Dec
  10. Dillard B, Shmueli G (2004) Simultaneous analysis of multiple time series using two-dimensional wavelets. Manuscript 1:1
  11. Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases. In: Proceedings of the 2nd international conference on knowledge discovery and data mining, pp 44–49, USA. AAAI Press, Menlo Park
  12. Everett MG, Borgatti SP (1998) Analyzing clique overlap. Connections 21(1):49–61
  13. Han D, Rogerson PA, Nie J, Bonner MR, Vena JE, Vito D, Muti P, Trevisan M, Edge SB, Freudenheim JL (2004) Geographic clustering of residence in early life and subsequent risk of breast cancer (United States). Cancer Causes Control 15(9):921–929
  14. Harel D, Koren Y (2001) Clustering spatial data using random walks. In: Proceedings of the seventh international conference on knowledge discovery and data mining, pp 281–286, ACM Press, New York
  15. Health Statistics, Obesity (most recent) by country. http://www.nationmaster.com/graph/hea_obe-health-obesity. Last accessed March 2011
  16. Hido S, Tsuboi Y, Kashima H, Sugiyama M, Kanamori T (2011) Statistical outlier detection using direct density ratio estimation. Knowl Inf Syst 26(2):309–336
  17. Howe HL, Wingo PA, Thun MJ, Ries LA, Rosenberg HM, Feigal EG, Edwards BK (2001) Annual report to the nation on the status of cancer (1973 through 1998), featuring cancers with recent increasing trends. J Natl Cancer Inst 93(11):824–842
  18. Cancer outlier detection based on likelihood ratio test. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2732272/
  19. Hu J Cancer outlier detection based on likelihood ratio test. http://bioinformatics.oxfordjournals.org/content/24/19/2193.short
  20. Janeja VP, Adam N, Atluri V, Vaidya JS (March 2010) Spatial neighborhood based anomaly detection in sensor datasets. In: Special issue on outlier detection data mining and knowledge discovery, vol 20(2). Springer, Berlin, pp 221–258
  21. Janeja VP, Adam NR, Atluri V, Vaidya J (2010) Spatial neighborhood based anomaly detection in sensor datasets. Data Min Knowl Discov 2:221–258. doi: 10.1007/s10618-009-0147-0
  22. Janeja VP, Atluri V (2008) Random walks to identify anomalous free-form spatial scan windows. In: IEEE TKDE 20(10):1378–1392
  23. Janeja VP, Atluri V, Vaidya JS, Adam N (2005) Collusion set detection through outlier discovery. In: IEEE intelligence and security informatics (IEEE ISI). Atlanta, Georgia
  24. Janet G (ed) (2008) State of the evidence the connection between breast cancer and the environment, 5th edn. Ph.D., published by the Breast Cancer Fund
  25. JGraph (2011) Java graph component for the visualization and layout of graphs. http://www.jgraph.com/. Last accessed 9 Dec 2011
  26. Jiawei H, Micheline K (2006) Data mining: concepts and techniques, 2nd edn. Morgan Kaufmann
  27. Jung I, Kulldorff M, Klassen A (2007) A spatial scan statistic for ordinal data. Stat Med 26:1594–1607
  28. Knorr EM, Ng RT (2000) Distance-based outliers: algorithms and applications. VLDB J 8(3–4):237–253
  29. Kulldorff M (1997) A spatial scan statistic. Commun Stat Theory Methods 26:1481–1496
  30. Kulldorff M (1999) Spatial scan statistics: models, calculations and applications. In: Glaz J, Balakrishnan N (eds) Scan statistics and applications, statistics for industry and technology, pp 303–322
  31. Kulldorff M, Athas W, Feuer E, Miller B, Key C (1998) Evaluating cluster alarms: a space-time scan statistic and brain cancer in Los Alamos. Am J Public Health 88(9):1377–1380
  32. Kulldorff M, Nagarwalla N (1995) Spatial disease clusters: detection and inference. Stat Med 14:799–810
  33. Lu C, Chen D, Kou Y (2003) Detecting spatial outliers with multiple attribute. In: Proceedings of ICTAI’03, Proceedings of 15th IEEE international conference on tools with artificial intelligence, p 122
  34. Multivariate scan statistics for disease surveillance. http://www.dbmi.pitt.edu/panda/papers/Kulldorff/k-M2005.pdf
  35. Naus J (1965) The distribution of the size of the maximum cluster of points on the line. J Am Stat Assoc 60:532–538
  36. Neill DB, Cooper GF, Das K, Jiang X, Schneider J (2009) Bayesian network scan statistics for multivariate pattern detection. In: Scan statistics: statistics for industry and technology, pp 221–249
  37. Neill DB, Moore AW, Cooper GF A multivariate Bayesian scan statistic
  38. New Jersey accident data for state routes. http://www.state.nj.us/transportation/refdata/accident/ (1999)
  39. Newman MEJ (2008) The mathematics of networks, The New Palgrave encyclopedia of economics
  40. NodeXL (2011) An excel 2007/2010 template for viewing network graphs. http://nodexl.codeplex.com/. Last accessed 9 Dec 2011
  41. Park Y, Priebe CE, Marchette DJ, Youssef A (2009) Anomaly detection using scan statistics on time series hypergraphs, workshop on link analysis, SDM 2009
  42. Patcha A, Park J-M (2007) An overview of anomaly detection techniques: existing solutions and latest technological trends. Comput Netw 51(12):3448–3470
  43. Rivers RW (2006) Evidence in traffic crash investigation and reconstruction: identification, interpretation and analysis of evidence, and the traffic crash investigation and reconstruction process
  44. Sabyasachi B, Martin M (2007) Automatic outlier detection for time series: an application to sensor data. Knowl Inf Syst 11(2):137–154
  45. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 33:107–117
  46. Shi L, Janeja VP (2009) Anomalous window discovery through scan statistics for linear intersecting paths (SSLIP). In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining (Paris, France, June 28–July 01, 2009), KDD ’09. ACM, New York, NY, pp 767–776. doi: 10.1145/1557019.1557104
  47. Shmueli G, Fienberg SE (2004) Current and potential statistical methods for monitoring multiple data streams for bio-surveillance. In: Wilson A, Olwell D (eds) Statistical methods in counter-terrorism
  48. Snyder D (2001) Online intrusion detection using sequences of system calls. M.S. thesis, Department of Computer Science, Florida State University
  49. SSLIP: code, datasets and known window reports. http://userpages.umbc.edu/~leishi1/sslip/sslip.htm (2009)
  50. State of New Jersey, Department of Transportation, Crash records. http://www.state.nj.us/transportation/refdata/accident/. Last accessed March 2011
  51. The County Health Rankings, a key component of the mobilizing action toward community health (MATCH) project. http://www.countyhealthrankings.org/. Last accessed March 2011
  52. Wasserman S, Faust K (1994) Social network analysis. Cambridge University Press, Cambridge
  53. WEKA Weka 3: data mining software in Java. http://www.cs.waikato.ac.nz/ml/weka/. Last accessed March 2011
  54. World Road Association, PIARC Road accident investigation guidelines for road engineers. http://www.who.int/roadsafety/news/piarc_manual.pdf

Copyright information

© Springer-Verlag London Limited 2012

Authors and Affiliations

  1. University of Maryland, Baltimore County, Baltimore, USA
