Abstract
This work proposes a method for detecting distance-based outliers in data streams under the sliding window model. The novel notion of a one-time outlier query is introduced in order to detect anomalies in the current window at arbitrary points in time. Three algorithms are presented. The first answers outlier queries exactly, but has larger space requirements than the other two. The second is derived from the exact one; it reduces memory requirements and returns an approximate answer based on estimations with a statistical guarantee. The third is a specialization of the approximate algorithm that works with strictly fixed memory requirements. The accuracy properties and memory consumption of the algorithms have been assessed theoretically. Moreover, experimental results have confirmed the effectiveness of the proposed approach and the good quality of the solutions.
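To make the notion of a one-time outlier query over a sliding window concrete, the following is a minimal sketch of a naive baseline. It is not any of the three algorithms summarized above. It assumes the standard distance-based definition (an object is an outlier if fewer than k other objects of the window lie within distance R of it), which the abstract does not spell out. The class name `SlidingWindowOutliers` and the parameters `window_size`, `radius`, and `k` are hypothetical names chosen for illustration.

```python
from collections import deque
from math import dist


class SlidingWindowOutliers:
    """Naive exact baseline (illustrative only): stores every object of the window."""

    def __init__(self, window_size, radius, k):
        # A bounded deque drops the oldest object once the window is full.
        self.window = deque(maxlen=window_size)
        self.radius = radius  # assumed distance threshold R
        self.k = k            # assumed minimum number of neighbors

    def insert(self, point):
        """Add a newly arrived stream object to the current window."""
        self.window.append(tuple(point))

    def query(self):
        """One-time query: return the outliers of the current window."""
        outliers = []
        for i, p in enumerate(self.window):
            neighbors = 0
            for j, q in enumerate(self.window):
                if i != j and dist(p, q) <= self.radius:
                    neighbors += 1
                    if neighbors >= self.k:
                        break  # p already has enough neighbors
            if neighbors < self.k:
                outliers.append(p)
        return outliers


detector = SlidingWindowOutliers(window_size=4, radius=0.5, k=2)
for x in [(0.0,), (0.1,), (0.2,), (5.0,), (0.15,)]:
    detector.insert(x)
print(detector.query())  # [(5.0,)]
```

This baseline stores the whole window and spends quadratic time per query. It is meant only to illustrate the task, in which a query can be issued at any point in time against the objects currently in the window.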
Additional information
Responsible editor: Sanjay Chawla.
A reduced version of this paper was published in the Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM 2007) under the title “Detecting Distance-Based Outliers in Streams of Data”. The present version has been substantially revised and extended, and it is not currently under review by another publication.
About this article
Cite this article
Angiulli, F., Fassetti, F. Distance-based outlier queries in data streams: the novel task and algorithms. Data Min Knowl Disc 20, 290–324 (2010). https://doi.org/10.1007/s10618-009-0159-9
DOI: https://doi.org/10.1007/s10618-009-0159-9