Abstract
Grid-based approaches provide an efficient framework for clustering data that are incomplete, inexplicit, or uncertain. This paper proposes an entropy-based grid approach (EGO) for outlier detection in clustered data. Given the hard clusters produced by a hard clustering algorithm, EGO applies entropy to the dataset as a whole or to an individual cluster to detect outliers. EGO works in two steps: explicit outlier detection and implicit outlier detection. Explicit outlier detection targets data points that are isolated in their grid cells. Such a point either lies far from the dense region or is a nearby but isolated data point, and is therefore declared an explicit outlier. Implicit outlier detection targets outliers that deviate from the normal pattern in less obvious ways. These outliers are identified from the change in entropy of the dataset, or of a specific cluster, for each deviation. An elbow based on the trade-off between entropy and object geometries optimizes the outlier detection process. Experimental results on the CHAMELEON datasets and other similar datasets suggest that the proposed approach detects outliers more precisely and extends outlier detection capability by an additional 4.5% to 8.6%. Moreover, the resulting clusters become more precise and compact when the entropy-based gridding approach is applied on top of hard clustering algorithms. The performance of the proposed approach is compared with well-known outlier detection algorithms, including DBSCAN, HDBSCAN, RE3WC, LOF, LoOP, ABOD, CBLOF, and HBOS. Finally, a case study on detecting outliers in environmental data is carried out using the proposed approach, with results generated on our synthetically prepared datasets. The results indicate that the proposed approach may be an industry-oriented solution to outlier detection in environmental monitoring data.
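The two steps described above can be illustrated with a minimal Python sketch. It is not the EGO implementation: the abstract does not give the exact formulas, so the sketch assumes a uniform grid with a hypothetical `cell_size` parameter, treats a point as "isolated" when it is the only occupant of its cell, and measures entropy as the Shannon entropy of the distribution of points over occupied cells. The function names (`cell_of`, `grid_entropy`, `explicit_outliers`, `entropy_change`) are illustrative only, and the elbow-based selection step is left out.

```python
import math
from collections import Counter


def cell_of(point, cell_size):
    """Map a point to the integer index of the grid cell that contains it."""
    return tuple(math.floor(x / cell_size) for x in point)


def grid_entropy(counts):
    """Shannon entropy (in bits) of the point distribution over occupied cells."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def explicit_outliers(points, cell_size):
    """Indices of points that are the sole occupant of their grid cell.

    Assumption: "isolated in the grid cells" is read as "alone in its cell".
    """
    cells = [cell_of(p, cell_size) for p in points]
    counts = Counter(cells)
    return [i for i, c in enumerate(cells) if counts[c] == 1]


def entropy_change(points, cell_size):
    """Change in grid entropy caused by removing each point in turn.

    Assumption: a large absolute change marks a candidate implicit outlier.
    """
    cells = [cell_of(p, cell_size) for p in points]
    counts = Counter(cells)
    base = grid_entropy(counts)
    deltas = []
    for c in cells:
        reduced = counts.copy()
        reduced[c] -= 1
        if reduced[c] == 0:
            del reduced[c]  # an emptied cell no longer contributes to entropy
        deltas.append(base - grid_entropy(reduced))
    return deltas


# Four points in one dense cell plus one distant point.
points = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2), (0.3, 0.4), (5.5, 5.5)]
isolated = explicit_outliers(points, cell_size=1.0)
deltas = entropy_change(points, cell_size=1.0)
```

In this toy example the distant point is flagged by the isolation rule, and removing it also produces the largest absolute entropy change. In practice the grid resolution and the cut-off on the entropy change would have to be tuned, which is the role the abstract assigns to the elbow.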
Availability of data and materials
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.
Funding
This work is supported by the National Natural Science Foundation of China, Grant No. 62250410365.
Author information
Contributions
Conceptualization: Anwar Shah and Bahar Ali. Methodology: Anwar Shah, Bahar Ali, and Kassian T.T. Amesho. Software: Anwar Shah, Bahar Ali, Fazal Wahab, and Inam Ullah. Validation: Anwar Shah, Bahar Ali, and Muhammad Shafiq. Formal analysis: Anwar Shah. Investigation: Anwar Shah, Bahar Ali, Inam Ullah, and Muhammad Shafiq. Resources: Anwar Shah and Inam Ullah. Data curation: Anwar Shah, Fazal Wahab, and Inam Ullah. Writing—original draft preparation: Anwar Shah. Writing—review and editing: Anwar Shah and Bahar Ali. Visualization: Anwar Shah and Bahar Ali. Supervision: Anwar Shah, Bahar Ali, and Muhammad Shafiq. Project administration: Anwar Shah, Inam Ullah, Kassian T.T. Amesho, and Muhammad Shafiq. Funding acquisition: Kassian T.T. Amesho, Muhammad Shafiq, Shahid Anwar, and Ahyoung Choi.
Ethics declarations
Ethics statement
The studies involving human participants were reviewed and approved by the National Natural Science Foundation of China, Grant No. 62250410365. The ethics committee waived the requirement of written informed consent for participation. Written informed consent was obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article.
Consent to participate
Informed consent was obtained from all individual participants included in the study.
Consent for publication
The authors of the article “Entropy Based Grid Approach for Handling Outliers: A Case Study to Environmental Monitoring Data” give consent for the publication of all the identifiable details, including text, images, and materials, to be published in the journal Environmental Science and Pollution Research.
Conflict of interest
The authors declare no competing interests.
Additional information
Responsible editor: Marcus Schulz.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Shah, A., Ali, B., Wahab, F. et al. Entropy-based grid approach for handling outliers: a case study to environmental monitoring data. Environ Sci Pollut Res 30, 125138–125157 (2023). https://doi.org/10.1007/s11356-023-26780-1
DOI: https://doi.org/10.1007/s11356-023-26780-1