Data Mining and Knowledge Discovery, Volume 20, Issue 2, pp 290–324

Distance-based outlier queries in data streams: the novel task and algorithms

  • Fabrizio Angiulli
  • Fabio Fassetti
Article

Abstract

This work proposes a method for detecting distance-based outliers in data streams under the sliding window model. The novel notion of a one-time outlier query is introduced in order to detect anomalies in the current window at arbitrary points in time. Three algorithms are presented. The first algorithm answers outlier queries exactly, but has larger space requirements than the other two. The second algorithm is derived from the exact one; it reduces memory requirements and returns an approximate answer based on estimations with a statistical guarantee. The third algorithm is a specialization of the approximate algorithm that works with strictly fixed memory requirements. The accuracy properties and memory consumption of the algorithms have been assessed theoretically. Moreover, experimental results have confirmed the effectiveness of the proposed approach and the good quality of the solutions.
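To make the setting concrete, the following is a minimal sketch of a naive exact baseline for a one-time outlier query over a sliding window. It is not one of the three algorithms of the paper. It assumes the common distance-based definition in which a point is an outlier if fewer than `k` other points of the current window lie within distance `radius` of it. The class name `SlidingWindowOutliers` and the parameters `window_size`, `radius` and `k` are hypothetical names chosen for illustration.

```python
from collections import deque
from math import dist


class SlidingWindowOutliers:
    """Naive exact baseline: stores the whole window and scans it on each query.

    Assumption (not taken from the abstract): a point is an outlier if fewer
    than k other points in the current window lie within distance radius.
    """

    def __init__(self, window_size, radius, k):
        # A bounded deque drops the oldest point when a new one arrives.
        self.window = deque(maxlen=window_size)
        self.radius = radius
        self.k = k

    def insert(self, point):
        """Add a new stream point, sliding the window if it is full."""
        self.window.append(tuple(point))

    def query(self):
        """One-time query: return the outliers of the current window."""
        points = list(self.window)
        outliers = []
        for i, p in enumerate(points):
            # Count neighbors of p, excluding p itself (quadratic scan).
            neighbors = sum(
                1
                for j, q in enumerate(points)
                if i != j and dist(p, q) <= self.radius
            )
            if neighbors < self.k:
                outliers.append(p)
        return outliers


detector = SlidingWindowOutliers(window_size=5, radius=1.0, k=2)
for point in [(0, 0), (0.5, 0), (0, 0.5), (10, 10), (0.2, 0.2)]:
    detector.insert(point)
first_answer = detector.query()

# Two more arrivals evict the two oldest points and change the answer.
for point in [(10.1, 10), (10, 10.1)]:
    detector.insert(point)
second_answer = detector.query()
```

In this sketch the answer depends on when the query is issued: the point `(10, 10)` is isolated in the first window, but gains two neighbors after the window slides, while two of the older points lose theirs. The quadratic scan and the storage of the full window are what a space-efficient or approximate method would aim to avoid.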

Keywords

Data streams; Anomaly detection; Distance-based outliers

Copyright information

© The Author(s) 2010

Authors and Affiliations

  1. DEIS, Università della Calabria, Rende, Italy
