An outlier detection approach in large-scale data stream using rough set

Singh, Manmohan; Pamula, Rajendra

doi:10.1007/s00521-019-04421-4

An outlier detection approach in large-scale data stream using rough set

Original Article
Published: 16 August 2019

Volume 32, pages 9113–9127, (2020)
Cite this article

Neural Computing and Applications Aims and scope Submit manuscript

Manmohan Singh¹ &
Rajendra Pamula¹

694 Accesses
6 Citations
Explore all metrics

Abstract

Outlier detection has become an important research area in the field of stream data mining due to its vast applications. In the literature, many methods have been proposed, but they work well for simple and positive regions of outliers, where boundary regions are not given much importance. Moreover, an algorithm which processes stream data must be effective and able to compute infinite data in one pass or limited number of passes. These problems have motivated us to propose an outlier detection approach for large-scale data stream. The proposed algorithm employs the concept of relative cardinality, entropy outlier factor theory of information-based system, and size-variant sliding window in stream data. In addition, we propose a new methodology for concept drift adaptation on evolving data streams. The proposed method is executed on nine benchmark datasets and compared with six existing methods that are EXPoSE, iForest, OC-SVM, LOF, KDE, and FastAbod. Experimental results show that the proposed method outperforms six existing methods in terms of receiver operating characteristic curve, precision recall, and computational time for positive regions as well as for boundary regions.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A Comprehensive Survey of Anomaly Detection Algorithms

Article 26 November 2021

Uncertainty in big data analytics: survey, opportunities, and challenges

Article Open access 04 June 2019

Detection of Outliers Using Interquartile Range Technique from Intrusion Dataset

References

Ghosh S, Biswas S, Sarkar D, Sarkar PP (2014) A novel neuro-fuzzy classification technique for data mining. Egypt Inform J 15(3):129–147
Google Scholar
Zhang P, Zhou C, Wang P, Gao BJ, Zhu X, Guo L (2015) E-tree: an efficient indexing structure for ensemble models on data streams. IEEE Trans Knowl Data Eng 27(2):461–474
Google Scholar
Ghosh D, Vogt A (2012) Outliers: an evaluation of methodologies. In: Joint statistical meetings. American Statistical Association San Diego, CA, pp 3455–3460
Barnett V, Lewis T (1994) Outliers in statistical data. Wiley, New York
MATH Google Scholar
Zhang B, Sconyers C, Byington C, Patrick R, Orchard ME, Vachtsevanos G (2011) A probabilistic fault detection approach: application to bearing fault detection. IEEE Trans Ind Electron 58(5):2011–2018
Google Scholar
Xiong L, Poczos B, Schneider J, Connolly A, VanderPlas J (2011) Hierarchical probabilistic models for group anomaly detection. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp 789–797
Han D-H, Zhang X, Wang G-R (2015) Classifying uncertain and evolving data streams with distributed extreme learning machine. J Comput Sci Technol 30(4):874–887
MathSciNet Google Scholar
Shojafar M, Cordeschi N, Baccarelli E (2016) Energy-efficient adaptive resource management for real-time vehicular cloud services. IEEE Trans Cloud Comput 7(1):196–209
Google Scholar
Beaubouef T, Petry FE, Arora G (1998) Information-theoretic measures of uncertainty for rough sets and rough relational databases. Inf Sci 109(1–4):185–195
Google Scholar
Liang J, Shi Z (2004) The information entropy, rough entropy and knowledge granulation in rough set theory. Int J Uncert Fuzziness Knowl Based Syst 12(01):37–46
MathSciNet MATH Google Scholar
Duntsch I, Gediga G (1998) Uncertainty measures of rough set prediction. Artif Intell 106(1):109–137
MathSciNet MATH Google Scholar
Xie N, Liu M, Li Z, Zhang G (2019) New measures of uncertainty for an interval-valued information system. Inf Sci 470:156–174
MathSciNet Google Scholar
Thangavel K, Pethalakshmi A (2009) Dimensionality reduction based on rough set theory: a review. Appl Soft Comput 9(1):1–12
Google Scholar
Gupta M, Gao J, Aggarwal CC, Han J (2014) Outlier detection for temporal data: a survey. IEEE Trans Knowl Data Eng 26(9):2250–2267
MATH Google Scholar
Knorr EM, Ng RT, Tucakov V (2000) Distance-based outliers: algorithms and applications. VLDB J 8(3–4):237–253
Google Scholar
Jiang F, Sui Y, Cao C (2009) Some issues about outlier detection in rough set theory. Expert Syst Appl 36(3):4680–4687
Google Scholar
Shoval P, Gudes E, Goldstein M (1988) Gisd: a graphical interactive system for conceptual database design. Inf Syst 13(1):81–95
Google Scholar
Breunig MM, Kriegel H-P, Ng RT, Sander J (2000) Lof: identifying density-based local outliers. In: ACM sigmod record. ACM, vol 29, pp 93–104
Yao H, Xiuwen F, Yang Y, Postolache O (2018) An incremental local outlier detection method in the data stream. Appl Sci 8(8):1248
Google Scholar
Kriegel H-P, Zimek A et al (2008) Angle-based outlier detection in high-dimensional data. In: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 444–452
Aggarwal CC (2015) Outlier analysis: advanced concepts. In: Data mining. Springer, pp 265–283
Tan SC, Ting KM, Liu TF (2011) Fast anomaly detection for streaming data. In: IJCAI proceedings-international joint conference on artificial intelligence, vol 22, p 1511
Liu FT, Ting KM, Zhou Z-H (2012) Isolation-based anomaly detection. ACM Trans Knowl Discov Data 6(1):3
Google Scholar
Mahadevan S, Shah SL (2009) Fault detection and diagnosis in process data using one-class support vector machines. J Process Control 19(10):1627–1639
Google Scholar
Barddal JP, Gomes HM, Enembreck F, Barthes J-P (2016) Sncstream+: extending a high quality true anytime data stream clustering algorithm. Inf Syst 62:60–73
Google Scholar
Schneider M, Ertel W, Ramos F (2016) Expected similarity estimation for large-scale batch and streaming anomaly detection. Mach Learn 105(3):305–333
MathSciNet MATH Google Scholar
Zhang J, Li T, Ruan D, Gao Z, Zhao C (2012) A parallel method for computing rough set approximations. Inf Sci 194:209–223
Google Scholar
Hu X (1995) Knowledge discovery in databases: an attribute-oriented rough set approach. PhD thesis, University of Regina
Liang J, Zongben X (2002) The algorithm on knowledge reduction in incomplete information systems. Int J Uncert Fuzziness Knowl Based Syst 10(01):95–103
MathSciNet MATH Google Scholar
Qian Y, Liang J, Wang F (2009) A new method for measuring the uncertainty in incomplete information systems. Int J Uncert Fuzziness Knowl Based Syst 17(06):855–880
MathSciNet MATH Google Scholar
Wang X, Yang J, Teng X, Xia W, Jensen R (2007) Feature selection based on rough sets and particle swarm optimization. Pattern Recognit Lett 28(4):459–471
Google Scholar
Park I-K, Choi G-S (2015) A variable-precision information-entropy rough set approach for job searching. Inf Syst 48:279–288
Google Scholar
Parra L, Deco G, Miesbach S (1996) Statistical independence and novelty detection with information preserving nonlinear maps. Neural Comput 8(2):260–269
Google Scholar
Shu W, Wang S (2013) Information-theoretic outlier detection for large-scale categorical data. IEEE Trans Knowl Data Eng 25(3):589–602
Google Scholar
Taha A, Hadi AS (2019) Anomaly detection methods for categorical data: a review. ACM Comput Surv 52(2):38
Google Scholar
Park I-K, Choi G-S (2015) Rough set approach for clustering categorical data using information-theoretic dependency measure. Inf Syst 48:289–295
Google Scholar
D’eer L, Cornelis C (2018) A comprehensive study of fuzzy covering-based rough set models: definitions, properties and interrelationships. Fuzzy Sets Syst 336:1–26
MathSciNet MATH Google Scholar
Gomes JB, Gaber MM, Sousa PAC, Menasalvas E (2014) Mining recurring concepts in a dynamic feature space. IEEE Trans Neural Netw Learn Syst 25(1):95–110
Google Scholar
Le Q, Sarlos T, Smola A (2013) Fastfood-approximating kernel expansions in loglinear time. In: Proceedings of the international conference on machine learning, vol 85
Yu H, Yang J, Han J (2003) Classifying large data sets using svms with hierarchical clusters. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 306–315
Aggarwal CC, Yu PS (2001) Outlier detection for high dimensional data. ACM Sigmod Rec 30(2):37–46
Google Scholar
Goix N (2016) How to evaluate the quality of unsupervised anomaly detection algorithms? arXiv preprint arXiv:1607.01152
Amer M, Goldstein M, Abdennadher S (2013) Enhancing one-class support vector machines for unsupervised anomaly detection. In:: Proceedings of the ACM SIGKDD workshop on outlier detection and description. ACM, pp 8–15
Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLoS ONE 11(4):e0152173
Google Scholar
Tang B, He H (2017) A local density-based approach for outlier detection. Neurocomputing 241:171–180
Google Scholar
Campos GO, Zimek A, Sander J, Campello RJGB, Micenkova B, Schubert E, Assent I, Houle ME (2016) On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Min Knowl Disc 30(4):891–927
MathSciNet Google Scholar
Sugiyama M, Borgwardt K (2013) Rapid distance-based outlier detection via sampling. In: Advances in neural information processing systems, pp 467–475

Download references

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Indian Institute of Technology (Indian School of Mines), Dhanbad, Jharkhand, 826004, India
Manmohan Singh & Rajendra Pamula

Authors

Manmohan Singh
View author publications
You can also search for this author in PubMed Google Scholar
Rajendra Pamula
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Manmohan Singh.

Ethics declarations

Conflict of interest

We hereby declare that we have no conflict of interest

Human and animal rights

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Singh, M., Pamula, R. An outlier detection approach in large-scale data stream using rough set. Neural Comput & Applic 32, 9113–9127 (2020). https://doi.org/10.1007/s00521-019-04421-4

Download citation

Received: 21 September 2018
Accepted: 07 August 2019
Published: 16 August 2019
Issue Date: July 2020
DOI: https://doi.org/10.1007/s00521-019-04421-4

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

An outlier detection approach in large-scale data stream using rough set

Abstract

Access this article

Similar content being viewed by others

A Comprehensive Survey of Anomaly Detection Algorithms

Uncertainty in big data analytics: survey, opportunities, and challenges

Detection of Outliers Using Interquartile Range Technique from Intrusion Dataset

References

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Human and animal rights

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

An outlier detection approach in large-scale data stream using rough set

Abstract

Access this article

Similar content being viewed by others

A Comprehensive Survey of Anomaly Detection Algorithms

Uncertainty in big data analytics: survey, opportunities, and challenges

Detection of Outliers Using Interquartile Range Technique from Intrusion Dataset

References

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Human and animal rights

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation