Distance-based outlier queries in data streams: the novel task and algorithms

Data Mining and Knowledge Discovery

Abstract

This work proposes a method for detecting distance-based outliers in data streams under the sliding window model. It introduces the novel notion of a one-time outlier query, which detects anomalies in the current window at arbitrary points in time. Three algorithms are presented. The first answers outlier queries exactly, but has larger space requirements than the other two. The second is derived from the exact one; it reduces memory requirements and returns an approximate answer based on estimations with a statistical guarantee. The third is a specialization of the approximate algorithm that works with strictly fixed memory requirements. The accuracy properties and memory consumption of the algorithms have been assessed theoretically. Moreover, experimental results confirm the effectiveness of the proposed approach and the good quality of its solutions.
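To make the task concrete, the following is a minimal Python sketch of a one-time outlier query over a sliding window. It is a brute-force baseline, not any of the three algorithms described in the abstract. It assumes the common distance-based definition in which a point is an outlier if fewer than `k` other points in the window lie within distance `radius` of it; the class name `SlidingWindowOutliers` and the parameters `window_size`, `radius`, and `k` are hypothetical names chosen for illustration.

```python
from collections import deque
from math import dist


class SlidingWindowOutliers:
    """Naive baseline: store the last `window_size` points and answer
    each query by comparing every pair of points in the window."""

    def __init__(self, window_size, radius, k):
        # A bounded deque evicts the oldest point automatically,
        # which models a count-based sliding window.
        self.window = deque(maxlen=window_size)
        self.radius = radius  # assumed neighborhood radius
        self.k = k            # assumed minimum number of neighbors for an inlier

    def insert(self, point):
        """Add a new stream point to the window."""
        self.window.append(tuple(point))

    def query(self):
        """One-time query: return the outliers in the current window."""
        points = list(self.window)
        outliers = []
        for i, p in enumerate(points):
            # Count the other points within `radius` of p.
            neighbors = sum(
                1
                for j, q in enumerate(points)
                if i != j and dist(p, q) <= self.radius
            )
            if neighbors < self.k:
                outliers.append(p)
        return outliers


# Usage: a one-dimensional stream with one isolated value.
detector = SlidingWindowOutliers(window_size=5, radius=1.0, k=2)
for x in [0.0, 0.1, 0.2, 10.0, 0.3, 0.4]:
    detector.insert((x,))
print(detector.query())  # [(10.0,)]
```

Each query in this sketch costs time quadratic in the window size and stores every point in the window, which is the kind of cost that motivates the exact, approximate, and fixed-memory algorithms summarized above.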

Author information

Corresponding author

Correspondence to Fabrizio Angiulli.

Additional information

Responsible editor: Sanjay Chawla.

A reduced version of this paper was published in the Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM 2007) under the title “Detecting Distance-Based Outliers in Streams of Data”. The present paper is a substantial revision and extension of that version and is not currently under review by another publication.

About this article

Cite this article

Angiulli, F., Fassetti, F. Distance-based outlier queries in data streams: the novel task and algorithms. Data Min Knowl Disc 20, 290–324 (2010). https://doi.org/10.1007/s10618-009-0159-9

  • DOI: https://doi.org/10.1007/s10618-009-0159-9
