Entropy-based grid approach for handling outliers: a case study to environmental monitoring data

  • Applications of Emerging Green Technologies for Efficient Valorization of Agro-Industrial Waste: A Roadmap Towards Sustainable Environment and Circular Economy
Environmental Science and Pollution Research

Abstract

Grid-based approaches provide an efficient framework for data clustering in the presence of incomplete, inexplicit, and uncertain data. This paper proposes an entropy-based grid approach (EGO) for outlier detection in clustered data. Given hard clusters obtained from a hard clustering algorithm, EGO uses entropy, computed either on the dataset as a whole or on an individual cluster, to detect outliers. EGO works in two steps: explicit outlier detection and implicit outlier detection. Explicit outlier detection is concerned with data points that are isolated in their grid cells. Such a point is either far from the dense region or an isolated point near it, and is therefore declared an explicit outlier. Implicit outlier detection targets outliers that deviate from the normal pattern in less obvious ways. These outliers are identified from the change in entropy of the dataset, or of a specific cluster, for each deviation. An elbow point, based on the trade-off between entropy and object geometries, optimizes the outlier detection process. Experimental results on CHAMELEON datasets and other similar datasets suggest that the proposed approach detects outliers more precisely and extends outlier detection capability by an additional 4.5% to 8.6%. Moreover, the resulting clusters become more precise and compact when the entropy-based gridding approach is applied on top of hard clustering algorithms. The performance of the proposed algorithms is compared with well-known outlier detection algorithms, including DBSCAN, HDBSCAN, RE3WC, LOF, LoOP, ABOD, CBLOF, and HBOS. Finally, a case study on detecting outliers in environmental data has been carried out using the proposed approach, with results generated on our synthetically prepared datasets. The results indicate that the proposed approach may be an industry-oriented solution to outlier detection in environmental monitoring data.
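The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' EGO implementation, whose details are not given on this page. It assumes 2-D points, a fixed square cell size, a hypothetical isolation rule (a point is alone in its cell and all eight neighbouring cells are empty), and Shannon entropy over cell occupancy. The function names `explicit_outliers`, `grid_entropy`, and `entropy_change` are illustrative only.

```python
import math
from collections import Counter


def grid_cells(points, cell_size):
    """Map each 2-D point to the integer index of the grid cell containing it."""
    return [(math.floor(x / cell_size), math.floor(y / cell_size))
            for x, y in points]


def explicit_outliers(points, cell_size):
    """Return indices of points that are isolated in the grid.

    Assumed rule (not from the paper): the point is the only one in its
    cell and every one of the eight surrounding cells is empty.
    """
    cells = grid_cells(points, cell_size)
    counts = Counter(cells)
    isolated = []
    for i, (cx, cy) in enumerate(cells):
        if counts[(cx, cy)] != 1:
            continue
        neighbours = sum(
            counts.get((cx + dx, cy + dy), 0)
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)
        )
        if neighbours == 0:
            isolated.append(i)
    return isolated


def grid_entropy(points, cell_size):
    """Shannon entropy (in bits) of how the points spread over occupied cells."""
    counts = Counter(grid_cells(points, cell_size))
    n = len(points)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_change(points, index, cell_size):
    """Entropy of the full set minus entropy after removing one point."""
    rest = points[:index] + points[index + 1:]
    return grid_entropy(points, cell_size) - grid_entropy(rest, cell_size)


# Four points in one dense cell and one far-away point.
points = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2), (0.3, 0.4), (5.5, 5.5)]
print(explicit_outliers(points, cell_size=1.0))
print(entropy_change(points, 4, cell_size=1.0))
print(entropy_change(points, 0, cell_size=1.0))
```

In this toy setup, removing the far-away point lowers the entropy more than removing a point from the dense cell does, which is the kind of signal an entropy-change criterion could rank. The abstract's elbow-based threshold selection is not modelled here.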

Availability of data and materials

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Funding

This work is supported by the National Natural Science Foundation of China, Grant No. 62250410365.

Author information

Contributions

Conceptualization: Anwar Shah and Bahar Ali. Methodology: Anwar Shah, Bahar Ali, and Kassian T.T. Amesho. Software: Anwar Shah, Bahar Ali, Fazal Wahab, and Inam Ullah. Validation: Anwar Shah, Bahar Ali, and Muhammad Shafiq. Formal analysis: Anwar Shah. Investigation: Anwar Shah, Bahar Ali, Inam Ullah, and Muhammad Shafiq. Resources: Anwar Shah and Inam Ullah. Data curation: Anwar Shah, Fazal Wahab, and Inam Ullah. Writing—original draft preparation: Anwar Shah. Writing—review and editing: Anwar Shah and Bahar Ali. Visualization: Anwar Shah and Bahar Ali. Supervision: Anwar Shah, Bahar Ali, and Muhammad Shafiq. Project administration: Anwar Shah, Inam Ullah, Kassian T.T. Amesho, and Muhammad Shafiq. Funding acquisition: Kassian T.T. Amesho, Muhammad Shafiq, Shahid Anwar, and Ahyoung Choi.

Corresponding author

Correspondence to Muhammad Shafiq.

Ethics declarations

Ethics statement

The studies involving human participants were reviewed and approved by the National Natural Science Foundation of China, Grant No. 62250410365. The ethics committee waived the requirement of written informed consent for participation. Written informed consent was obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article.

Consent to participate

Informed consent was obtained from all individual participants included in the study.

Consent for publication

The authors of the article “Entropy Based Grid Approach for Handling Outliers: A Case Study to Environmental Monitoring Data” give consent for the publication of all the identifiable details, including text, images, and materials, to be published in the journal Environmental Science and Pollution Research.

Conflict of interest

The authors declare no competing interests.

Additional information

Responsible editor: Marcus Schulz.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Shah, A., Ali, B., Wahab, F. et al. Entropy-based grid approach for handling outliers: a case study to environmental monitoring data. Environ Sci Pollut Res 30, 125138–125157 (2023). https://doi.org/10.1007/s11356-023-26780-1

  • DOI: https://doi.org/10.1007/s11356-023-26780-1
