An outlier detection approach in large-scale data stream using rough set

  • Original Article
Neural Computing and Applications

Abstract

Outlier detection has become an important research area in stream data mining because of its wide range of applications. Many methods have been proposed in the literature, but they work well only for simple, positive regions of outliers and give little attention to boundary regions. Moreover, an algorithm that processes stream data must be effective and able to process unbounded data in one pass or a limited number of passes. These problems motivated us to propose an outlier detection approach for large-scale data streams. The proposed algorithm employs the concept of relative cardinality, an entropy outlier factor based on the theory of information systems, and a size-variant sliding window over the stream. In addition, we propose a new methodology for concept drift adaptation on evolving data streams. The proposed method is evaluated on nine benchmark datasets and compared with six existing methods: EXPoSE, iForest, OC-SVM, LOF, KDE, and FastAbod. Experimental results show that the proposed method outperforms all six existing methods in terms of receiver operating characteristic curve, precision-recall, and computational time, for positive regions as well as for boundary regions.
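The abstract does not give the definitions of the entropy outlier factor or the window-resizing rule, so the following is only an illustrative sketch of the general idea of entropy-based scoring over a sliding window in a single pass. It is not the authors' algorithm. The function names (`shannon_entropy`, `entropy_drop_scores`, `score_stream`), the "entropy drop on removal" score, and the fixed window size are all assumptions made for illustration. The paper's window is size-variant, which this sketch does not attempt.

```python
from collections import Counter, deque
from math import log2


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    if n == 0:
        return 0.0
    counts = Counter(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def entropy_drop_scores(window):
    """Score each record by how much summed attribute entropy falls when it is removed.

    `window` is a list of equal-length tuples of categorical values.
    A larger score means the record adds more disorder to the window,
    which this sketch (an assumption, not the paper's definition)
    treats as more outlier-like.
    """
    n_attrs = len(window[0])
    columns = [[row[j] for row in window] for j in range(n_attrs)]
    base = sum(shannon_entropy(col) for col in columns)
    scores = []
    for i in range(len(window)):
        reduced = sum(shannon_entropy(col[:i] + col[i + 1:]) for col in columns)
        scores.append(base - reduced)
    return scores


def score_stream(stream, window_size=5):
    """Single pass over `stream`: score each new record against a fixed-size window.

    The fixed `window_size` is a simplification; the paper uses a
    size-variant window whose rule is not stated in the abstract.
    """
    window = deque(maxlen=window_size)
    for record in stream:
        window.append(record)
        if len(window) == window_size:
            # The newest record is the last element of the window.
            yield record, entropy_drop_scores(list(window))[-1]


if __name__ == "__main__":
    data = [("a", "x")] * 4 + [("b", "y"), ("a", "x")]
    for record, score in score_stream(data, window_size=5):
        print(record, round(score, 4))
```

In this toy stream the record `("b", "y")` differs from its neighbours in both attributes, so removing it lowers the window's entropy and it receives a positive score, while the common records receive negative scores. Recomputing the entropy for every removal costs time proportional to the square of the window size per step; an incremental count update would be the natural refinement.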

Author information

Corresponding author

Correspondence to Manmohan Singh.

Ethics declarations

Conflict of interest

We hereby declare that we have no conflict of interest.

Human and animal rights

This article does not contain any studies with human participants or animals performed by any of the authors.

About this article

Cite this article

Singh, M., Pamula, R. An outlier detection approach in large-scale data stream using rough set. Neural Comput & Applic 32, 9113–9127 (2020). https://doi.org/10.1007/s00521-019-04421-4

  • DOI: https://doi.org/10.1007/s00521-019-04421-4
