Data Intensive vs Sliding Window Outlier Detection in the Stream Data — An Experimental Approach

  • Mateusz Kalisch
  • Marcin Michalak
  • Marek Sikora
  • Łukasz Wróbel
  • Piotr Przystałka
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9693)

Abstract

This paper addresses the problem of outlier detection in stream data. The authors propose a new approach that applies well-known outlier detection algorithms to stream data. The method is based on a sliding window, defined as the sequence of past stream observations that are closest to the newly arriving object. As might be expected, the outlier detection accuracy of this model is lower than that of a model that uses all historical data, but the difference is not statistically significant. Several well-known outlier detection methods are used as the basis of the model.
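The contrast between a sliding-window model and a model that uses all historical data can be illustrated with a minimal sketch. The abstract does not name the base detectors, so a plain z-score rule stands in here purely as an assumption. The names `zscore_is_outlier` and `detect_stream`, and the parameters `window_size` and `threshold`, are hypothetical and do not come from the paper.

```python
from collections import deque
from statistics import mean, pstdev


def zscore_is_outlier(history, x, threshold=3.0):
    """Flag x when it lies more than `threshold` standard deviations
    from the mean of `history` (stand-in base detector, an assumption)."""
    if len(history) < 2:
        return False  # too little reference data to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return x != mu  # constant history: any different value is unusual
    return abs(x - mu) / sigma > threshold


def detect_stream(stream, window_size=None, threshold=3.0):
    """Return one outlier flag per observation in `stream`.

    window_size=None  -> reference data is the full history seen so far
    window_size=w     -> reference data is only the last w observations
    """
    # deque(maxlen=None) is unbounded; with maxlen=w it drops the oldest item
    history = deque(maxlen=window_size)
    flags = []
    for x in stream:
        flags.append(zscore_is_outlier(history, x, threshold))
        history.append(x)  # the new object joins the reference data afterwards
    return flags


if __name__ == "__main__":
    stream = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.3, 50.0, 10.1, 9.9]
    print("full history:", detect_stream(stream))
    print("window of 5: ", detect_stream(stream, window_size=5))
```

In this sketch the two variants differ only in how much past data is kept. The windowed variant bounds memory and per-object cost, at the price of a smaller reference sample, which is the trade-off the abstract describes.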

Keywords

Outlier detection, Data analysis, Classification, Time series

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Mateusz Kalisch (1)
  • Marcin Michalak (2)
  • Marek Sikora (2, 3)
  • Łukasz Wróbel (3)
  • Piotr Przystałka (1)

  1. Institute of Fundamentals of Machinery Design, Silesian University of Technology, Gliwice, Poland
  2. Institute of Informatics, Silesian University of Technology, Gliwice, Poland
  3. Institute of Innovative Technologies EMAG, Katowice, Poland
