Distance-based outlier queries in data streams: the novel task and algorithms

Angiulli, Fabrizio; Fassetti, Fabio

doi:10.1007/s10618-009-0159-9

Published: 26 January 2010

Data Mining and Knowledge Discovery, Volume 20, pages 290–324 (2010)

Abstract

This work proposes a method for detecting distance-based outliers in data streams under the sliding window model. It introduces the novel notion of a one-time outlier query, which detects anomalies in the current window at arbitrary points in time. Three algorithms are presented. The first answers outlier queries exactly but requires more space than the other two. The second is derived from the exact one; it reduces memory requirements and returns an approximate answer based on estimations with a statistical guarantee. The third is a specialization of the approximate algorithm that works with strictly fixed memory requirements. The accuracy properties and memory consumption of the algorithms are assessed theoretically. Moreover, experimental results confirm the effectiveness of the proposed approach and the good quality of the solutions.
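
To make the task concrete, the following is a minimal sketch of a one-time outlier query over a sliding window. It is not any of the three algorithms from the paper: it is a brute-force baseline that assumes the common distance-based definition, under which an object is an outlier if fewer than k other objects in the current window lie within distance R of it. The class name `SlidingWindowOutlierQuery` and the parameters `window_size`, `radius`, and `k` are hypothetical names chosen for illustration.

```python
from collections import deque
from math import dist


class SlidingWindowOutlierQuery:
    """Brute-force baseline (an assumption, not the paper's algorithms)."""

    def __init__(self, window_size, radius, k):
        # A bounded deque drops the oldest point once the window is full.
        self.window = deque(maxlen=window_size)
        self.radius = radius  # neighborhood radius R (assumed definition)
        self.k = k            # minimum number of neighbors for an inlier

    def insert(self, point):
        """Add a newly arrived stream object to the window."""
        self.window.append(tuple(point))

    def query(self):
        """One-time query: outliers among the objects currently in the window."""
        points = list(self.window)
        outliers = []
        for i, p in enumerate(points):
            # Count the other window objects within distance R of p.
            neighbors = sum(
                1 for j, q in enumerate(points)
                if i != j and dist(p, q) <= self.radius
            )
            if neighbors < self.k:
                outliers.append(p)
        return outliers


stream = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (0.1, 0.1)]
detector = SlidingWindowOutlierQuery(window_size=4, radius=0.5, k=2)
for point in stream:
    detector.insert(point)
result = detector.query()  # the query can be issued at any point in time
```

This baseline stores every object in the window and compares all pairs at query time, so its cost grows quadratically with the window size; the abstract's exact and approximate algorithms are aimed at avoiding that kind of time and memory cost.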

Author information

Authors and Affiliations

DEIS, Università della Calabria, Via P. Bucci, 41C, 87036, Rende, CS, Italy
Fabrizio Angiulli & Fabio Fassetti

Corresponding author

Correspondence to Fabrizio Angiulli.

Additional information

Responsible editor: Sanjay Chawla.

A reduced version of this paper was published in the Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM 2007) under the title “Detecting Distance-Based Outliers in Streams of Data”. The present paper is a substantial revision and extension of that version and is not currently under review by another publication.

About this article

Cite this article

Angiulli, F., Fassetti, F. Distance-based outlier queries in data streams: the novel task and algorithms. Data Min Knowl Disc 20, 290–324 (2010). https://doi.org/10.1007/s10618-009-0159-9

Received: 17 July 2008
Accepted: 21 November 2009
Published: 26 January 2010
Issue Date: March 2010
DOI: https://doi.org/10.1007/s10618-009-0159-9
