Abstract
This paper addresses the anomaly detection problem in large-scale data mining applications using residual subspace analysis. We are specifically concerned with situations where the full data cannot be practically obtained due to physical limitations such as low bandwidth, limited memory, storage, or computing power. Motivated by the recent compressed sensing (CS) theory, we suggest a framework wherein random projection can be used to obtained compressed data, addressing the scalability challenge. Our theoretical contribution shows that the spectral property of the CS data is approximately preserved under a such a projection and thus the performance of spectral-based methods for anomaly detection is almost equivalent to the case in which the raw data is completely available. Our second contribution is the construction of the framework to use this result and detect anomalies in the compressed data directly, thus circumventing the problems of data acquisition in large sensor networks. We have conducted extensive experiments to detect anomalies in network and surveillance applications on large datasets, including the benchmark PETS 2007 and 83 GB of real footage from three public train stations. Our results show that our proposed method is scalable, and importantly, its performance is comparable to conventional methods for anomaly detection when the complete data is available.
Similar content being viewed by others
Notes
Though we only report these amounts of data in this paper, we note that the proposed method forms a core of a more complex commercial system that has been successfully tested over thousand hours of video, equivalent to hundred of Tetrabyes. For detail see http://www.icetana.com.au.
References
Achlioptas D (2001) Database-friendly random projections. In: Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, New York, pp 274–281. http://doi.acm.org/10.1145/375551.375608
Adam A, Rivlin E, Shimshoni I, Reinitz D (2008) Robust real-time unusual event detection using multiple fixed-location monitors. IEEE Trans Pattern Anal Mach Intell 30:555–560
Aggarwal C (2005) On abnormality detection in spuriously populated data streams. In: Proceedings of the IEEE international conference on data mining (ICDM), Houston
Barnett V, Lewis T (1984) Outliers in statistical data. Chichester, New York
Bingham E, Mannila H (2001) Random projection in dimensionality reduction: applications to image and text data. In: Proceedings of the KDD. ACM, New York, pp 245–250
Brand M, Oliver N, Pentland A (1997) Coupled hidden Markov models for complex action recognition. In: IEEE CVPR, San Juan, pp 994–999
Budhaditya S, Pham D, Lazarescu M, Venkatesh S (2009) Effective anomaly detection in sensor networks data streams. In: Proceedings of the IEEE international conference on data mining (ICDM), Miami, pp 722–727
Candes E, Tao T (2006) Near optimal signal recovery from random projections: universal encoding strategies. IEEE Trans Inf Theory 52:5406–5425
Candes E, Romberg J, Tao T (2006) Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans Inf Theory 52(2):489–509
Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41:Article 15
Chatzigiannakis V, Papavassiliou S, Grammatikou M, Maglaris B (2006) Hierarchical anomaly detection in distributed large-scale sensor networks. In: Proceedings of the 11th IEEE symposium on computers and communications (ISCC), Washington, DC, pp 761–767
Donoho D (2006) Compressed sensing. IEEE Trans Inf Theory 52:1289–1306
Drineas P, Frieze A, Kannan R, Vempala S, Vinay V (2004) Clustering large graphs via the singular value decomposition. Mach Learn 56(1):9–33
Drineas P, Kannan R, Mahoney M (2006) Fast Monte Carlo algorithms for matrices II: computing a low-rank approximation to a matrix. SIAM J Comput 36(1):158
Elad M (2007) Optimized projections for compressed sensing. IEEE Trans Signal Proc 55:5695–5702
Fowler J (2009) Compressive-projection principal component analysis and the first eigenvector. In: Data compression conference, 2009, DCC’09, Snowbird. IEEE, Washington, DC, pp 223–232
Fujimaki R (2008) Anomaly detection support vector machine and its application to fault diagnosis. In: Proceedings of the IEEE international conference on data mining (ICDM), Washington, DC, pp 797–802
Geman S (1980) A limit theorem for the norm of random matrices. Ann Probab 8:252–261
Giatrakos N, Kotidis Y, Deligiannakis A, Vassalos V, Theodoridis Y (2010) Taco: tunable approximate computation of outliers in wireless sensor networks. In: Proceedings of the 2010 international conference on Management of data. ACM, New York, pp 279–290
Golub Loan V (1996) Matrix computations, 3rd edn. Johns Hopkins University Press, Baltimore
Huang L, Nguyen X, Garofalakis M, Jordan M, Joseph A, Taft N (2007) In-network PCA and anomaly detection. In: Proceedings of NIPS, Vancouver, pp 617–624
Jackson J (1959) Quality control methods for several related variables. Technometrics 1:359–377
Jackson J (1980) Principal components and factor analysis. I—principal components. J Qual Technol 12:201–213
Jackson E, Mudholkar G (1979) Control procedures for residuals associated with principal component analysis. Technometrics 21(3):341–349
Janakiram D, Reddy V, Kumar A (2006) Outlier detection in wireless sensor networks using Bayesian belief networks. In: Proceedings of the first international conference on communication system software and middleware, New Delhi
Jiang X, Cooper G (2010) A real-time temporal bayesian architecture for event surveillance and its application to patient-specific multiple disease outbreak detection. Data Min Knowl Discov 20(3):328–360
Koufakou A, Georgiopoulos M (2010) A fast outlier detection strategy for distributed high-dimensional data sets with mixed attributes. Data Min Knowl Discov 20(2):259–289
Lakhina A, Crovella M, Diot C (2004) Diagonising network-wide traffic anomalies. In: Proceedings of ACM SIGCOMM, Portland
Li W, Yue H, Valle-Cervantes S, Qin S (2000) Recursive PCA for adaptive process monitoring. J Process Control 10(5):471–486
Liu K, Kargupta H, Ryan J (2006) Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE Trans Knowl Data Eng 18(1):92–106
Lucas B, Kanade T (1981) An iterative image registration technique with an application to stereo vision. Proc IJCAI 81:674–679
Medioni G, Cohen I, Brémond F, Hongeng S, Nevatia R (2001) Event detection and analysis from video streams. IEEE Trans Pattern Anal Mach Intell 23:873–889
Niebles J, Wang H, Fei-Fei L (2008) Unsupervised learning of human action categories using spatial-temporal words. Int J Comput Vis 79(3):299–318
Noto K, Brodley C, Slonim D (2011) Frac: a feature-modeling approach for semi-supervised and unsupervised anomaly detection. Data Min Knowl Discov 25:109–133
Phung D, Duong T, Venkatesh S, Bui H (2005) Topic transition detection using hierarchical hidden Markov and semi-Markov models. In: Proceedings of ACM-MM, New York, pp 11–20
Rabbat M, Haupt J, Singh A, Nowak R (2006) Decentralized compression and predistribution via randomized gossiping. In: Proceedings of IPSN, New York, pp 51–59
Strohmer T, Heath R (2003) Grassmannian frames with applications to coding and communication. Appl Comput Harmon Anal 14:257–275
Thottan M, Ji C (2003) Anomaly detection in IP networks. IEEE Trans Signal Process 51(8):2191–2204
Vempala S (2004) The random projection method. American Mathematical Society (AMS)
Yan J, Zhang B, Liu N, Yan S, Cheng Q, Fan W, Yang Q, Xi W, Chen Z (2006) Effective and efficient dimensionality reduction for large-scale and streaming data preprocessing. IEEE Trans Knowl Data Eng 18:320–333
Zhu C, Kitagawa H, Faloutsos C (2005) Example-based robust outlier detection in high dimensional datasets. In: Proceedings of ICDM, Houston
Author information
Authors and Affiliations
Corresponding author
Additional information
Communicated by Eamonn Keogh.
Rights and permissions
About this article
Cite this article
Pham, DS., Venkatesh, S., Lazarescu, M. et al. Anomaly detection in large-scale data stream networks. Data Min Knowl Disc 28, 145–189 (2014). https://doi.org/10.1007/s10618-012-0297-3
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-012-0297-3