Data Intensive vs Sliding Window Outlier Detection in the Stream Data — An Experimental Approach

Kalisch, Mateusz; Michalak, Marcin; Sikora, Marek; Wróbel, Łukasz; Przystałka, Piotr

doi:10.1007/978-3-319-39384-1_7

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9693)

Included in the following conference series:

International Conference on Artificial Intelligence and Soft Computing

Abstract

This paper addresses the problem of outlier detection in stream data. The authors propose a new approach that applies well-known outlier detection algorithms to streams. The method is based on a sliding window, defined as the sequence of past stream observations that are closest to the newly arriving object. As might be expected, the outlier detection accuracy of this model is lower than that of a model that uses all historical data, but the difference is not statistically significant. Several well-known outlier detection methods serve as the basis of the model.
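The contrast between the two settings can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not name the detectors used, so a plain z-score rule stands in for them; "closest" is read here as "most recent"; and the names `zscore_is_outlier`, `detect`, `window_size`, and `threshold` are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


def zscore_is_outlier(history, x, threshold=3.0):
    """Flag x if it lies more than `threshold` standard deviations
    from the mean of `history` (a stand-in detector, assumed here)."""
    if len(history) < 2:
        return False  # too little reference data to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return x != mu
    return abs(x - mu) / sigma > threshold


def detect(stream, window_size=None, threshold=3.0):
    """Score each new observation against past observations.

    window_size=None keeps all historical data (data-intensive variant);
    an integer keeps only the most recent `window_size` observations
    (sliding-window variant).
    """
    history = deque(maxlen=window_size)  # maxlen=None means unbounded
    flags = []
    for x in stream:
        flags.append(zscore_is_outlier(history, x, threshold))
        history.append(x)  # the reference set is updated after scoring
    return flags


if __name__ == "__main__":
    stream = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 50.0, 10.1, 9.9]
    full = detect(stream)                  # all historical data
    windowed = detect(stream, window_size=4)  # last 4 observations only
    agreement = sum(a == b for a, b in zip(full, windowed)) / len(stream)
    print("full history  :", full)
    print("sliding window:", windowed)
    print("agreement     :", agreement)
```

Comparing the two flag sequences on labelled data is one simple way to quantify how much accuracy, if any, the bounded window gives up relative to the full history. Note that this sketch appends every observation to the reference set, including flagged ones; whether flagged points should be excluded is a design choice the abstract does not settle.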

Acknowledgements

This work was partially supported by the Polish National Centre for Research and Development (NCBiR), grant PBS2/B9/20/2013, within the Applied Research Programmes. The infrastructure was supported by the “PL-LAB2020” project, contract POIG.02.03.01-00-104/13-00.

Author information

Authors and Affiliations

Institute of Fundamentals of Machinery Design, Silesian University of Technology, ul. Konarskiego 18a, 44-100, Gliwice, Poland
Mateusz Kalisch & Piotr Przystałka
Institute of Informatics, Silesian University of Technology, ul. Akademicka 16, 44-100, Gliwice, Poland
Marcin Michalak & Marek Sikora
Institute of Innovative Technologies EMAG, ul. Leopolda 31, 40-186, Katowice, Poland
Marek Sikora & Łukasz Wróbel

Corresponding author

Correspondence to Marcin Michalak.

Editor information

Editors and Affiliations

Częstochowa University of Technology, Czestochowa, Poland
Leszek Rutkowski
Częstochowa University of Technology, Czestochowa, Poland
Marcin Korytkowski
Częstochowa University of Technology, Czestochowa, Poland
Rafał Scherer
AGH University of Science and Technology, Krakow, Poland
Ryszard Tadeusiewicz
University of California, Berkeley, California, USA
Lotfi A. Zadeh
University of Louisville, Louisville, Kentucky, USA
Jacek M. Zurada

About this paper

Cite this paper

Kalisch, M., Michalak, M., Sikora, M., Wróbel, Ł., Przystałka, P. (2016). Data Intensive vs Sliding Window Outlier Detection in the Stream Data — An Experimental Approach. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L., Zurada, J. (eds) Artificial Intelligence and Soft Computing. ICAISC 2016. Lecture Notes in Computer Science, vol 9693. Springer, Cham. https://doi.org/10.1007/978-3-319-39384-1_7

DOI: https://doi.org/10.1007/978-3-319-39384-1_7
Published: 29 May 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-39383-4
Online ISBN: 978-3-319-39384-1
eBook Packages: Computer Science, Computer Science (R0)
