Knowledge and Information Systems, Volume 28, Issue 3, pp 709–733

COID: A cluster–outlier iterative detection approach to multi-dimensional data analysis

Regular Paper

Abstract

Most data mining algorithms today focus on clustering alone, while many other approaches are designed specifically for outlier detection. We observe that, in many situations, clusters and outliers are concepts whose meanings are inseparable from each other, especially for data sets with noise. Thus, it is necessary to treat clusters and outliers as concepts of equal importance in data analysis. In this paper, we present our continuing work on the cluster–outlier iterative detection algorithm (Shi in SubCOID: exploring cluster-outlier iterative detection approach to multi-dimensional data analysis in subspace. Auburn, pp. 132–135, 2008; Shi and Zhang in Towards exploring interactive relationship between clusters and outliers in multi-dimensional data analysis. IEEE Computer Society. Tokyo, pp. 518–519, 2005), which detects clusters and outliers in noisy data sets from another perspective. In this algorithm, clusters are detected and adjusted according to the intra-relationship within clusters and the inter-relationship between clusters and outliers, and vice versa. The clusters and outliers are adjusted iteratively until a termination condition is reached. The algorithm can be applied in many fields, such as pattern recognition, data clustering, and signal processing. Experimental results demonstrate the advantages of our approach.
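The abstract does not give the actual COID measures, so the following is only a minimal sketch of the general idea of alternating between cluster adjustment and outlier adjustment. It swaps in plain centroid assignment and a distance threshold; the names `nearest`, `iterative_cluster_outlier`, `factor`, and `max_iter` are hypothetical and do not come from the paper.

```python
import math


def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    dists = [math.dist(point, c) for c in centroids]
    idx = min(range(len(dists)), key=dists.__getitem__)
    return idx, dists[idx]


def iterative_cluster_outlier(points, centroids, factor=3.0, max_iter=20):
    """Alternate between adjusting clusters and adjusting outliers.

    Assumption (not from the paper): a point is an outlier when its
    distance to the nearest centroid exceeds `factor` times the mean
    member-to-centroid distance of that cluster. Label -1 marks outliers.
    """
    centroids = [tuple(c) for c in centroids]
    labels = None  # no outlier information before the first pass
    for _ in range(max_iter):
        assigned = [nearest(p, centroids) for p in points]

        # Cluster compactness, ignoring points flagged as outliers last pass.
        spread = []
        for k in range(len(centroids)):
            d = [
                dist
                for i, (idx, dist) in enumerate(assigned)
                if idx == k and not (labels is not None and labels[i] == -1)
            ]
            # An empty cluster gets spread 0, so only exact hits join it.
            spread.append(sum(d) / len(d) if d else 0.0)

        # Outlier adjustment: flag points that are too far from their cluster.
        new_labels = [
            idx if dist <= factor * spread[idx] else -1
            for idx, dist in assigned
        ]

        # Cluster adjustment: recompute centroids from non-outlier members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, new_labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(x) / len(members) for x in zip(*members))

        # Termination condition of this sketch: labels stopped changing.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids
```

Calling `iterative_cluster_outlier(points, initial_centroids)` returns one label per point and the adjusted centroids. The threshold rule and the "labels unchanged" stopping test are placeholders for whatever relationship measures and termination condition the full paper defines.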

Keywords

Clustering, Outlier detection, Multi-dimensional data, Cluster and outlier diversity

References

  1. Achtert E, Kriegel H, Zimek A (2008) ELKI: a software system for evaluation of subspace clustering algorithms. In: Ludascher B, Mamoulis N (eds) Proceedings of the 20th international conference on scientific and statistical database management (SSDBM), Hong Kong, pp 580–585
  2. Aggarwal C, Yu P (2001) Outlier detection for high dimensional data. In: Aref W (ed) Proceedings of the 2001 ACM SIGMOD international conference on management of data. ACM Press, Santa Barbara, pp 37–46
  3. Agrawal R, Gehrke J, Gunopulos D et al (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Haas L, Tiwary A (eds) Proceedings of the ACM SIGMOD international conference on management of data. ACM Press, Seattle, pp 94–105
  4. Aggarwal C, Hinneburg A, Keim D (2001) On the surprising behavior of distance metrics in high dimensional space. In: Bussche J, Vianu V (eds) Proceedings of the 8th international conference on database theory. Springer, London, pp 420–434
  5. Aggarwal C, Procopiuc C, Wolf J et al (1999) Fast algorithms for projected clustering. In: Delis A, Faloutsos C, Ghandeharizadeh S (eds) Proceedings of the ACM SIGMOD conference on management of data. ACM Press, Philadelphia, pp 61–72
  6. Ankerst M, Breunig M, Kriegel H et al (1999) OPTICS: ordering points to identify the clustering structure. In: Delis A, Faloutsos C, Ghandeharizadeh S (eds) Proceedings of the ACM SIGMOD conference on management of data. ACM Press, Philadelphia, pp 49–60
  7. Bay S (1999) The UCI KDD Archive [http://kdd.ics.uci.edu]. Department of Information and Computer Science, University of California, Irvine
  8. Beyer K, Goldstein J, Ramakrishnan R et al (1999) When is “nearest neighbor” meaningful? In: Beeri C, Buneman P (eds) Proceedings of international conference on database theory. Springer, Jerusalem, pp 217–235
  9. Bradley P, Fayyad U (1998) Refining initial points for K-Means clustering. In: Proceedings of 15th international conference on machine learning. Morgan Kaufmann, San Francisco, pp 91–99
  10. Breunig M, Kriegel H, Ng R et al (2000) LOF: identifying density-based local outliers. In: Chen W, Naughton J, Bernstein P (eds) Proceedings of the ACM SIGMOD conference on management of data. ACM, Dallas, pp 93–104
  11. Chen Y, Tu L (2007) Density-based clustering for real-time stream data. In: Berkhin P, Caruana R, Wu X (eds) Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, San Jose, pp 133–142
  12. Chen C, Lee J (2001) The validity measurement of fuzzy C-means classifier for remotely sensed images. In: Proceedings of 22nd Asian conference on remote sensing. Singapore
  13. Ester M, Kriegel H, Sander J et al (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis E, Han J, Fayyad U (eds) Proceedings of 2nd international conference on knowledge discovery and data mining. AAAI Press, Portland, pp 226–231
  14. Fayyad U, Piatetsky-Shapiro G, Smyth P et al (1996) Advances in knowledge discovery and data mining. AAAI Press, Menlo Park
  15. Fayyad U, Reina C, Bradley P (1998) Initialization of iterative refinement clustering algorithms. In: Agrawal R, Stolorz P, Piatetsky-Shapiro G (eds) Proceedings of the fourth international conference on knowledge discovery and data mining. AAAI Press, New York, pp 194–198
  16. Gonzalez T (1985) Clustering to minimize the maximum intercluster distance. Theor Comput Sci 38: 311–322
  17. Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. In: Haas L, Tiwary A (eds) Proceedings of the ACM SIGMOD international conference on management of data. ACM Press, Seattle, pp 73–84
  18. Guha S, Rastogi R, Shim K (1999) ROCK: a robust clustering algorithm for categorical attributes. In: Proceedings of the IEEE conference on data engineering. IEEE Computer Society Press, Sydney, pp 512–521
  19. Hinneburg A, Keim D (1998) An efficient approach to clustering in large multimedia databases with noise. In: Agrawal R, Stolorz P, Piatetsky-Shapiro G (eds) Proceedings of the fourth international conference on knowledge discovery and data mining. AAAI Press, New York, pp 58–65
  20. Halkidi M, Vazirgiannis M (2001) A data set oriented approach for clustering algorithm selection. In: Raedt L, Siebes A (eds) Proceedings of the 5th European conference on principles of data mining and knowledge discovery. Springer, Freiburg, pp 165–179
  21. Hinneburg A, Aggarwal C, Keim D (2000) What is the nearest neighbor in high dimensional spaces? In: Abbadi A, Brodie M, Chakravarthy S (eds) Proceedings of 26th international conference on very large data bases. Morgan Kaufmann, Cairo, pp 506–515
  22. Jain A, Murty M, Flynn P (1999) Data clustering: a review. ACM Comput Surv 31(3): 264–323
  23. Karypis G, Han E, Kumar V (1999) Chameleon: hierarchical clustering using dynamic modeling. Computer 32: 68–75
  24. Kaufman L, Rousseeuw P (1990) Finding groups in data: an introduction to cluster analysis. Wiley, Hoboken
  25. Knorr E, Ng R (1998) Algorithms for mining distance-based outliers in large datasets. In: Gupta A, Shmueli O, Widom J (eds) Proceedings of 24th international conference on very large data bases. Morgan Kaufmann, New York, pp 392–403
  26. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. University of California Press, Berkeley, 1:281–297
  27. Ng R, Han J (1994) Efficient and effective clustering methods for spatial data mining. In: Bocca J, Jarke M, Zaniolo C (eds) Proceedings of the 20th international conference on very large data bases. Morgan Kaufmann, Santiago de Chile, pp 144–155
  28. Nguyen M, Mark L, Omiecinski E (2008) Unusual pattern detection in high dimensions. Advances in knowledge discovery and data mining, 12th Pacific-Asia conference. Springer, Osaka, pp 247–259
  29. Peterson G, McBride B (2008) The importance of generalizability for anomaly detection. Knowl Inf Syst 14(3): 377–392
  30. Ramaswamy S, Rastogi R, Shim K (2000) Efficient algorithms for mining outliers from large data sets. In: Chen W, Naughton J, Bernstein P (eds) Proceedings of the ACM SIGMOD conference on management of data. ACM, Dallas, pp 427–438
  31. Rothman M (1963) The laws of physics. Basic Books, New York
  32. Sheikholeslami G, Chatterjee S, Zhang A (1998) WaveCluster: a multi-resolution clustering approach for very large spatial databases. In: Gupta A, Shmueli O, Widom J (eds) Proceedings of 24th international conference on very large data bases. Morgan Kaufmann, New York, pp 428–439
  33. Shi Y (2008a) Detecting clusters and outliers for multi-dimensional data. In: Proceedings of the 2008 international conference on multimedia and ubiquitous engineering. SERSC, Busan, pp 429–432
  34. Shi Y (2008b) SubCOID: exploring cluster-outlier iterative detection approach to multi-dimensional data analysis in subspace. In: ACMSE 2008: the 46th ACM southeast conference. ACM, Auburn, pp 132–135
  35. Shi Y, Zhang A (2005) Towards exploring interactive relationship between clusters and outliers in multi-dimensional data analysis. In: Proceedings of the 21st international conference on data engineering. IEEE Computer Society, Tokyo, pp 518–519
  36. Shi Y, Song Y, Zhang A (2003) A shrinking-based approach for multi-dimensional data analysis. In: Freytag J, Lockemann P, Abiteboul S et al (eds) Proceedings of 29th international conference on very large data bases. ACM, Berlin, pp 440–451
  37. Tao Y, Xiao X, Zhou S (2006) Mining distance-based outliers from large databases in any metric space. In: Eliassi-Rad T, Ungar L, Craven M et al (eds) Proceedings of the twelfth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, Philadelphia, pp 394–403
  38. Wang J, Chiang J (2008) A cluster validity measure with outlier detection for support vector clustering. IEEE Trans Syst, Man, Cybernet, B 38(1): 78–89
  39. Wang W, Yang J, Muntz R (1997) STING: a statistical information grid approach to spatial data mining. In: Jarke M, Carey M, Dittrich K et al (eds) Proceedings of 23rd international conference on very large data bases. Morgan Kaufmann, Athens, pp 186–195
  40. Wu M, Jermaine C (2006) Outlier detection by sampling with accuracy guarantees. In: Eliassi-Rad T, Ungar L, Craven M et al (eds) Proceedings of the twelfth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, Philadelphia, pp 767–772
  41. Wu X, Kumar V, Ross Q et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1): 1–37
  42. Xiong H, Steinbach M, Ruslim A et al (2008) Characterizing pattern preserving clustering. Knowl Inf Syst 19(3): 311–336
  43. Xiong H, Wu J, Chen J (2006) K-means clustering versus validation measures: a data distribution perspective. In: Eliassi-Rad T, Ungar L, Craven M et al (eds) Proceedings of the twelfth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, Philadelphia, pp 779–784
  44. Yang J, Zhong N, Yao Y et al (2008) Local peculiarity factor and its application in outlier detection. In: Li Y, Liu B, Sarawagi S (eds) Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, Las Vegas, pp 776–784
  45. Yu D, Sheikholeslami G, Zhang A (2000) FindOut: finding outliers in very large datasets. Knowl Inf Syst 4(4): 387–412
  46. Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: Jagadish H, Mumick I (eds) Proceedings of the 1996 ACM SIGMOD international conference on management of data. ACM, Montreal, pp 103–114

Copyright information

© Springer-Verlag London Limited 2010

Authors and Affiliations

  1. Department of Computer Science and Information Systems, Kennesaw State University, Kennesaw, USA
  2. Department of Computer Science, Eastern Michigan University, Ypsilanti, USA
