
A Clustering Framework for Unbalanced Partitioning and Outlier Filtering on High Dimensional Datasets

  • Turgay Tugay Bilgin
  • A. Yilmaz Camurcu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4690)

Abstract

In this study, we propose an improved relationship-based clustering framework for unbalanced clustering and outlier filtering on high-dimensional datasets. The original relationship-based clustering framework is built on METIS, a weighted graph partitioning system. However, it has two major drawbacks: it provides no outlier filtering, and it forces clusters to be balanced. Our proposed framework uses Graclus, an unbalanced kernel k-means based partitioning system. We make two major improvements over the original framework. First, we introduce a new space consisting of tiny unbalanced partitions created with Graclus, which we call the micro-partition space, and we filter out singletons and micro-partitions that have fewer members than a threshold value. Second, we agglomerate the filtered micro-partition space and apply Graclus again for clustering. The results are visualized with CLUSION. Our experiments show that the proposed framework produces promising results on high-dimensional datasets.
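
The filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `filter_micro_partitions`, the parameter `min_size`, and the sample labels are hypothetical, and the micro-partition assignments are assumed to be already available (in the paper they would come from Graclus, which is not called here). The sketch only shows the idea of dropping singletons and micro-partitions whose size falls below a threshold.

```python
from collections import Counter


def filter_micro_partitions(labels, min_size):
    """Split point indices into kept points and filtered outliers.

    labels[i] is the micro-partition id of point i (hypothetical input;
    the abstract obtains these assignments from Graclus).
    min_size is the threshold: partitions with fewer members are dropped.
    """
    sizes = Counter(labels)  # number of members in each micro-partition
    kept = [i for i, lab in enumerate(labels) if sizes[lab] >= min_size]
    dropped = [i for i, lab in enumerate(labels) if sizes[lab] < min_size]
    return kept, dropped


# Illustrative labels: partitions 1 and 3 are singletons.
labels = [0, 0, 0, 1, 2, 2, 3, 0, 2]
kept, dropped = filter_micro_partitions(labels, min_size=3)
print(kept)     # indices of points passed on to the second clustering stage
print(dropped)  # indices of points treated as outliers
```

In this sketch the surviving points would then be agglomerated and clustered again, as the abstract describes for the second stage; that stage is left out because it depends on the partitioning system itself.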

Keywords

Data Mining, Dimensionality, Clustering, Outlier filtering



Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Turgay Tugay Bilgin (1)
  • A. Yilmaz Camurcu (2)
  1. Department of Computer Engineering, Maltepe University, Maltepe, Istanbul, Turkey
  2. Department of Electronics and Computer Education, Marmara University, Kadikoy, Istanbul, Turkey
