Clustering Mixed Type Attributes in Large Dataset

  • Jian Yin
  • Zhifang Tan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3758)

Abstract

Clustering is a widely used technique in data mining. Many clustering algorithms exist, but most either handle only a single attribute type, or handle mixed attribute types but are inefficient on large data sets. Few algorithms do both well. In this paper, we propose a clustering algorithm, CFIKP, that can handle large datasets with mixed-type attributes. We first use a CF*-tree to pre-cluster the dataset, so that the dense regions are stored in its leaf nodes. We then treat each dense region as a single point and use an improved k-prototype algorithm to cluster these dense regions. Experiments show that the CFIKP algorithm is very efficient in clustering large datasets with mixed-type attributes.
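The second stage described above can be illustrated with a minimal sketch. It is not the CFIKP implementation: the abstract does not say how a dense region is summarized or how the k-prototype step is improved. The sketch assumes a region is collapsed to its numeric mean and per-attribute categorical mode, and it uses the standard k-prototypes dissimilarity (squared Euclidean distance on numeric attributes plus a weight `gamma` times the number of categorical mismatches). All function names and the `gamma` parameter are hypothetical.

```python
from collections import Counter


def summarize_region(numeric_rows, categorical_rows):
    """Collapse one dense region into a single representative point.

    Assumption (not from the paper): numeric attributes are averaged and
    each categorical attribute takes its most frequent value.
    """
    n = len(numeric_rows)
    mean = [sum(col) / n for col in zip(*numeric_rows)]
    mode = [Counter(col).most_common(1)[0][0] for col in zip(*categorical_rows)]
    return mean, mode, n


def mixed_dissimilarity(num_a, cat_a, num_b, cat_b, gamma=1.0):
    """Standard k-prototypes-style dissimilarity for mixed attributes."""
    numeric_part = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    categorical_part = sum(1 for x, y in zip(cat_a, cat_b) if x != y)
    return numeric_part + gamma * categorical_part


def assign(points, prototypes, gamma=1.0):
    """Return, for each summarized region, the index of its nearest prototype."""
    return [
        min(
            range(len(prototypes)),
            key=lambda j: mixed_dissimilarity(
                p[0], p[1], prototypes[j][0], prototypes[j][1], gamma
            ),
        )
        for p in points
    ]
```

Because each region is reduced to one point, the assignment step runs over the number of dense regions rather than the number of original records, which is the source of the speed-up the abstract claims for large datasets.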

Keywords

Data Mining, Cluster Algorithm, Large Dataset, Leaf Node, Mixed Type
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Jian Yin (1)
  • Zhifang Tan (1)
  1. Department of Computer Science, Zhongshan University, Guangzhou, China
