COID: A cluster–outlier iterative detection approach to multi-dimensional data analysis

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Most data mining algorithms focus on clustering alone, and many separate approaches are designed for outlier detection. We observe that, in many situations, clusters and outliers are concepts whose meanings are inseparable from each other, especially for data sets with noise. It is therefore necessary to treat clusters and outliers as concepts of equal importance in data analysis. In this paper, we present our continuing work on the cluster–outlier iterative detection algorithm (Shi 2008b; Shi and Zhang 2005), which detects the clusters and outliers in noisy data sets from another perspective. In this algorithm, clusters are detected and adjusted according to the intra-relationship within clusters and the inter-relationship between clusters and outliers, and vice versa. The adjustment and modification of the clusters and outliers are performed iteratively until a certain termination condition is reached. This data processing algorithm can be applied in many fields, such as pattern recognition, data clustering, and signal processing. Experimental results demonstrate the advantages of our approach.
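
As a rough illustration of the iterative idea described above, the following Python sketch alternates between flagging outliers and re-estimating clusters until the labels stop changing. It is not the COID algorithm from the paper: the nearest-centroid assignment, the `factor` threshold rule, the iteration cap, and the function name `iterate_clusters_outliers` are all assumptions made purely for illustration.

```python
import math


def iterate_clusters_outliers(points, centroids, factor=2.0, max_iter=20):
    """Toy alternating refinement of clusters and outliers.

    Illustrative only; this is NOT the published COID procedure.
    Returns (labels, centroids), where label -1 marks an outlier.
    """
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Find the nearest centroid (and its distance) for every point.
        nearest = []
        for p in points:
            dists = [math.dist(p, c) for c in centroids]
            j = min(range(len(centroids)), key=dists.__getitem__)
            nearest.append((j, dists[j]))

        # Assumed outlier rule: a point is an outlier when it lies farther
        # from its nearest centroid than factor * the mean such distance.
        mean_d = sum(d for _, d in nearest) / len(nearest)
        new_labels = [-1 if d > factor * mean_d else j for j, d in nearest]

        # Assumed termination condition: labels no longer change.
        if new_labels == labels:
            break
        labels = new_labels

        # Re-estimate each centroid from its non-outlier members only,
        # so the outlier set influences the clusters in the next pass.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return labels, centroids
```

Called on two tight groups of points plus one distant point, with one initial centroid inside each group, the sketch labels the distant point `-1` and computes the centroids from the remaining members only. A point flagged in one pass can be reabsorbed in a later pass if the centroids move toward it, which is the sense in which clusters and outliers adjust each other here.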

References

  1. Achtert E, Kriegel H, Zimek A (2008) ELKI: a software system for evaluation of subspace clustering algorithms. In: Ludascher B, Mamoulis N (eds) Proceedings of the 20th international conference on scientific and statistical database management (SSDBM), Hong Kong, pp 580–585

  2. Aggarwal C, Yu P (2001) Outlier detection for high dimensional data. In: Aref W (ed) Proceedings of the 2001 ACM SIGMOD international conference on management of data. ACM Press, Santa Barbara, pp 37–46

  3. Agrawal R, Gehrke J, Gunopulos D et al (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Haas L, Tiwary A (eds) Proceedings of the ACM SIGMOD international conference on management of data. ACM Press, Seattle, pp 94–105

  4. Aggarwal C, Hinneburg A, Keim D (2001) On the surprising behavior of distance metrics in high dimensional space. In: Bussche J, Vianu V (eds) Proceedings of the 8th international conference on database theory. Springer, London, pp 420–434

  5. Aggarwal C, Procopiuc C, Wolf J et al (1999) Fast algorithms for projected clustering. In: Delis A, Faloutsos C, Ghandeharizadeh S (eds) Proceedings of the ACM SIGMOD conference on management of data. ACM Press, Philadelphia, pp 61–72

  6. Ankerst M, Breunig M, Kriegel H et al (1999) OPTICS: ordering points to identify the clustering structure. In: Delis A, Faloutsos C, Ghandeharizadeh S (eds) Proceedings of the ACM SIGMOD conference on management of data. ACM Press, Philadelphia, pp 49–60

  7. Bay S (1999) The UCI KDD Archive [http://kdd.ics.uci.edu]. Department of Information and Computer Science, University of California, Irvine

  8. Beyer K, Goldstein J, Ramakrishnan R et al (1999) When is “nearest neighbor” meaningful? In: Beeri C, Buneman P (eds) Proceedings of international conference on database theory. Springer, Jerusalem, pp 217–235

  9. Bradley P, Fayyad U (1998) Refining initial points for K-Means clustering. In: Proceedings of 15th international conference on machine learning. Morgan Kaufmann, San Francisco, pp 91–99

  10. Breunig M, Kriegel H, Ng R et al (2000) LOF: identifying density-based local outliers. In: Chen W, Naughton J, Bernstein P (eds) Proceedings of the ACM SIGMOD conference on management of data. ACM, Dallas, pp 93–104

  11. Chen Y, Tu L (2007) Density-based clustering for real-time stream data. In: Berkhin P, Caruana R, Wu X (eds) Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, San Jose, pp 133–142

  12. Chen C, Lee J (2001) The validity measurement of fuzzy C-means classifier for remotely sensed images. In: Proceedings of 22nd Asian conference on remote sensing. Singapore

  13. Ester M, Kriegel H, Sander J et al (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis E, Han J, Fayyad U (eds) Proceedings of 2nd international conference on knowledge discovery and data mining. AAAI Press, Portland, pp 226–231

  14. Fayyad U, Piatetsky-Shapiro G, Smyth P et al (1996) Advances in knowledge discovery and data mining. AAAI Press, Menlo Park

  15. Fayyad U, Reina C, Bradley P (1998) Initialization of iterative refinement clustering algorithms. In: Agrawal R, Stolorz P, Piatetsky-Shapiro G (eds) Proceedings of the fourth international conference on knowledge discovery and data mining. AAAI Press, New York, pp 194–198

  16. Gonzalez T (1985) Clustering to minimize the maximum intercluster distance. Theor Comput Sci 38: 311–322

  17. Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. In: Haas L, Tiwary A (eds) Proceedings of the ACM SIGMOD international conference on management of data. ACM Press, Seattle, pp 73–84

  18. Guha S, Rastogi R, Shim K (1999) ROCK: a robust clustering algorithm for categorical attributes. In: Proceedings of the IEEE conference on data engineering. IEEE Computer Society Press, Sydney, pp 512–521

  19. Hinneburg A, Keim D (1998) An efficient approach to clustering in large multimedia databases with noise. In: Agrawal R, Stolorz P, Piatetsky-Shapiro G (eds) Proceedings of the fourth international conference on knowledge discovery and data mining. AAAI Press, New York, pp 58–65

  20. Halkidi M, Vazirgiannis M (2001) A data set oriented approach for clustering algorithm selection. In: Raedt L, Siebes A (eds) Proceedings of the 5th European conference on principles of data mining and knowledge discovery. Springer, Freiburg, pp 165–179

  21. Hinneburg A, Aggarwal C, Keim D (2000) What is the nearest neighbor in high dimensional spaces? In: Abbadi A, Brodie M, Chakravarthy S (eds) Proceedings of 26th international conference on very large data bases. Morgan Kaufmann, Cairo, pp 506–515

  22. Jain A, Murty M, Flynn P (1999) Data clustering: a review. ACM Comput Surv 31(3): 264–323

  23. Karypis G, Han E, Kumar V (1999) Chameleon: hierarchical clustering using dynamic modeling. Computer 32: 68–75

  24. Kaufman L, Rousseeuw P (1990) Finding groups in data: an introduction to cluster analysis. Wiley, Hoboken

  25. Knorr E, Ng R (1998) Algorithms for mining distance-based outliers in large datasets. In: Gupta A, Shmueli O, Widom J (eds) Proceedings of 24th international conference on very large data bases. Morgan Kaufmann, New York, pp 392–403

  26. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. University of California Press, Berkeley, 1:281–297

  27. Ng R, Han J (1994) Efficient and effective clustering methods for spatial data mining. In: Bocca J, Jarke M, Zaniolo C (eds) Proceedings of the 20th international conference on very large data bases. Morgan Kaufmann, Santiago de Chile, pp 144–155

  28. Nguyen M, Mark L, Omiecinski E (2008) Unusual pattern detection in high dimensions. In: Advances in knowledge discovery and data mining, 12th Pacific-Asia conference. Springer, Osaka, pp 247–259

  29. Peterson G, McBride B (2008) The importance of generalizability for anomaly detection. Knowl Inf Syst 14(3): 377–392

  30. Ramaswamy S, Rastogi R, Shim K (2000) Efficient algorithms for mining outliers from large data sets. In: Chen W, Naughton J, Bernstein P (eds) Proceedings of the ACM SIGMOD conference on management of data. ACM, Dallas, pp 427–438

  31. Rothman M (1963) The laws of physics. Basic Books, New York

  32. Sheikholeslami G, Chatterjee S, Zhang A (1998) WaveCluster: a multi-resolution clustering approach for very large spatial databases. In: Gupta A, Shmueli O, Widom J (eds) Proceedings of 24th international conference on very large data bases. Morgan Kaufmann, New York, pp 428–439

  33. Shi Y (2008a) Detecting clusters and outliers for multi-dimensional data. In: Proceedings of the 2008 international conference on multimedia and ubiquitous engineering. SERSC, Busan, pp 429–432

  34. Shi Y (2008b) SubCOID: exploring cluster-outlier iterative detection approach to multi-dimensional data analysis in subspace. In: ACMSE 2008: the 46th ACM southeast conference. ACM, Auburn, pp 132–135

  35. Shi Y, Zhang A (2005) Towards exploring interactive relationship between clusters and outliers in multi-dimensional data analysis. In: Proceedings of the 21st international conference on data engineering. IEEE Computer Society, Tokyo, pp 518–519

  36. Shi Y, Song Y, Zhang A (2003) A shrinking-based approach for multi-dimensional data analysis. In: Freytag J, Lockemann P, Abiteboul S et al (eds) Proceedings of 29th international conference on very large data bases. ACM, Berlin, pp 440–451

  37. Tao Y, Xiao X, Zhou S (2006) Mining distance-based outliers from large databases in any metric space. In: Eliassi-Rad T, Ungar L, Craven M et al (eds) Proceedings of the twelfth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, Philadelphia, pp 394–403

  38. Wang J, Chiang J (2008) A cluster validity measure with outlier detection for support vector clustering. IEEE Trans Syst, Man, Cybernet, B 38(1): 78–89

  39. Wang W, Yang J, Muntz R (1997) STING: a statistical information grid approach to spatial data mining. In: Jarke M, Carey M, Dittrich K et al (eds) Proceedings of 23rd international conference on very large data bases. Morgan Kaufmann, Athens, pp 186–195

  40. Wu M, Jermaine C (2006) Outlier detection by sampling with accuracy guarantees. In: Eliassi-Rad T, Ungar L, Craven M et al (eds) Proceedings of the twelfth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, Philadelphia, pp 767–772

  41. Wu X, Kumar V, Quinlan J et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1): 1–37

  42. Xiong H, Steinbach M, Ruslim A et al (2008) Characterizing pattern preserving clustering. Knowl Inf Syst 19(3): 311–336

  43. Xiong H, Wu J, Chen J (2006) K-means clustering versus validation measures: a data distribution perspective. In: Eliassi-Rad T, Ungar L, Craven M et al (eds) Proceedings of the twelfth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, Philadelphia, pp 779–784

  44. Yang J, Zhong N, Yao Y et al (2008) Local peculiarity factor and its application in outlier detection. In: Li Y, Liu B, Sarawagi S (eds) Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, Las Vegas, pp 776–784

  45. Yu D, Sheikholeslami G, Zhang A (2000) FindOut: finding outliers in very large datasets. Knowl Inf Syst 4(4): 387–412

  46. Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: Jagadish H, Mumick I (eds) Proceedings of the 1996 ACM SIGMOD international conference on management of data. ACM, Montreal, pp 103–114

Author information

Corresponding author

Correspondence to Yong Shi.

About this article

Cite this article

Shi, Y., Zhang, L. COID: A cluster–outlier iterative detection approach to multi-dimensional data analysis. Knowl Inf Syst 28, 709–733 (2011). https://doi.org/10.1007/s10115-010-0323-y

  • DOI: https://doi.org/10.1007/s10115-010-0323-y
