Abstract
Outlier detection has become an important research area in stream data mining because of its wide range of applications. Many methods have been proposed in the literature, but they work well only for simple, positive regions of outliers and give little attention to boundary regions. Moreover, an algorithm that processes stream data must be effective and able to process potentially infinite data in one pass or a limited number of passes. These problems motivated us to propose an outlier detection approach for large-scale data streams. The proposed algorithm employs the concept of relative cardinality, the entropy outlier factor from the theory of information-based systems, and a size-variant sliding window over the stream data. In addition, we propose a new methodology for concept drift adaptation on evolving data streams. The proposed method is evaluated on nine benchmark datasets and compared with six existing methods: EXPoSE, iForest, OC-SVM, LOF, KDE, and FastAbod. Experimental results show that the proposed method outperforms these six methods in terms of the receiver operating characteristic curve, precision-recall, and computational time, for positive regions as well as for boundary regions.
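To make the idea of scoring records by relative cardinality inside a sliding window more concrete, the following is a minimal illustrative sketch and not the algorithm proposed in the article. It assumes categorical records, treats records that share an attribute value as one equivalence class, and scores a record by how small its classes are relative to the window. The names `rarity_scores` and `stream_outliers`, the parameters `window_size` and `threshold`, and the fixed-size window (the article uses a size-variant one) are all assumptions made for illustration; the entropy outlier factor and concept drift adaptation are not modelled here.

```python
from collections import Counter, deque


def rarity_scores(window):
    """Score each record by how small its equivalence classes are.

    For every attribute, records sharing a value form one equivalence class.
    A record's score is the mean, over attributes, of
    1 - |class containing the record| / |window|.
    (Hypothetical scoring rule, chosen only for illustration.)
    """
    n = len(window)
    if n == 0:
        return []
    num_attrs = len(window[0])
    # Per attribute, count how many records in the window share each value.
    counts = [Counter(rec[a] for rec in window) for a in range(num_attrs)]
    return [
        sum(1.0 - counts[a][rec[a]] / n for a in range(num_attrs)) / num_attrs
        for rec in window
    ]


def stream_outliers(stream, window_size=4, threshold=0.7):
    """Yield (record, score) for arriving records whose score exceeds threshold.

    Assumption: a fixed-size window; each record is scored once, on arrival.
    """
    window = deque(maxlen=window_size)  # oldest record is dropped automatically
    for rec in stream:
        window.append(rec)
        if len(window) < window_size:
            continue  # wait until the window is full
        score = rarity_scores(list(window))[-1]  # score of the newest record
        if score > threshold:
            yield rec, score


if __name__ == "__main__":
    stream = [("a", "x"), ("a", "x"), ("a", "x"), ("b", "y"), ("a", "x")]
    for record, score in stream_outliers(stream):
        print(record, score)
```

Each record is read once and only the current window is kept in memory, which mirrors the one-pass constraint mentioned in the abstract. A real detector would also need a principled threshold and a rule for resizing the window.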
Ethics declarations
Conflict of interest
We hereby declare that we have no conflict of interest.
Human and animal rights
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Singh, M., Pamula, R. An outlier detection approach in large-scale data stream using rough set. Neural Comput & Applic 32, 9113–9127 (2020). https://doi.org/10.1007/s00521-019-04421-4
DOI: https://doi.org/10.1007/s00521-019-04421-4