AnyOut: Anytime Outlier Detection on Streaming Data

  • Ira Assent
  • Philipp Kranen
  • Corinna Baldauf
  • Thomas Seidl
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7238)

Abstract

With the growth of sensor and monitoring applications, data mining on streaming data is receiving increasing research attention. Because data is generated continuously, mining algorithms must be able to analyze it in a single pass. In many applications, the rate at which data objects arrive varies greatly. This has led to anytime mining algorithms for classification and clustering, which successfully mine the data until they are interrupted, at a point not known in advance, by the arrival of the next object in the stream.

In this work we investigate anytime outlier detection, the problem of determining within any given period of time whether an object in a data stream is anomalous. The more time is available, the more reliable the decision should be. We introduce AnyOut, an algorithm capable of solving anytime outlier detection, and investigate different approaches to building the underlying data structure. We propose a confidence measure for AnyOut that makes it possible to improve performance on constant data streams. We evaluate our method in thorough experiments and demonstrate its performance in comparison with established algorithms for outlier detection.
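
The anytime property described above can be illustrated with a minimal Python sketch. It is not the AnyOut algorithm as published: the `Node` class, the hierarchy of cluster centers, the distance-to-center score, and the `max_steps` budget standing in for the interruption are all assumptions made for illustration. The sketch only shows the general pattern in which a coarse score is available immediately and is refined for as long as time permits.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster summary: a center plus optional finer-grained children."""
    center: tuple
    children: list = field(default_factory=list)


def anytime_scores(root, obj, max_steps):
    """Yield one outlier score per refinement step (assumed score: distance to center).

    A score is yielded before any descent, so a result exists even when the
    caller is interrupted immediately. Each further step moves to the child
    whose center is closest to obj and yields the score at that finer level.
    """
    node = root
    yield math.dist(obj, node.center)
    for _ in range(max_steps):
        if not node.children:
            break  # finest level reached, nothing left to refine
        node = min(node.children, key=lambda c: math.dist(obj, c.center))
        yield math.dist(obj, node.center)


# Illustrative two-level hierarchy (made-up data).
tree = Node((0.0, 0.0), [
    Node((-5.0, 0.0), [Node((-5.0, 1.0)), Node((-5.0, -1.0))]),
    Node((5.0, 0.0), [Node((5.0, 1.0)), Node((5.0, -1.0))]),
])

# A small budget stops early; a larger one descends to the leaves.
coarse = list(anytime_scores(tree, (5.0, 1.0), max_steps=1))
fine = list(anytime_scores(tree, (5.0, 1.0), max_steps=10))
```

Using a generator keeps the most recent score available at every step, so the caller can stop iterating when the next stream object arrives and use the last value it received.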

Keywords

Data Stream; Probable Outlier; Outlier Detection; Cluster Feature; Streaming Data
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Ira Assent (1)
  • Philipp Kranen (2)
  • Corinna Baldauf (2)
  • Thomas Seidl (2)

  1. Dept. of Computer Science, Aarhus University, Denmark
  2. Data Management and Data Exploration Group, RWTH Aachen University, Germany
