Integrative Parameter-Free Clustering of Data with Mixed Type Attributes

  • Christian Böhm
  • Sebastian Goebl
  • Annahita Oswald
  • Claudia Plant
  • Michael Plavinski
  • Bianca Wackersreuther
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6118)

Abstract

Integrative mining of heterogeneous data is one of the major challenges for data mining in the next decade. We address the problem of integrative clustering of data with mixed type attributes. Most existing solutions suffer from one or both of the following drawbacks: they require input parameters that are difficult to estimate, or they do not adequately support mixed type attributes. Our technique, INTEGRATE, is a novel clustering approach that truly integrates the information provided by heterogeneous numerical and categorical attributes. The Minimum Description Length (MDL) principle, which originates from information theory, allows a unified view of numerical and categorical information and thus naturally balances the influence of both sources of information in clustering. Moreover, the MDL principle supports parameter-free clustering, which enhances the usability of INTEGRATE on real-world data. Extensive experiments demonstrate the effectiveness of INTEGRATE in exploiting numerical and categorical information for clustering. As an efficient iterative algorithm, INTEGRATE is scalable to large data sets.
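
To illustrate the idea of measuring numerical and categorical information in the same unit, the following is a minimal sketch of a generic MDL-style coding cost in bits. It is not the cost function of INTEGRATE itself, which the abstract does not spell out. The function names, the Gaussian model for the numerical attribute, the `RESOLUTION` constant, and the parameter-cost term are all assumptions made for illustration.

```python
import math
from collections import Counter

RESOLUTION = 0.01  # assumed measurement precision of the numerical attribute


def numeric_cost_bits(values):
    """Bits to encode values under a Gaussian fitted to them (assumed model)."""
    n = len(values)
    mean = sum(values) / n
    # Floor the variance so that a constant attribute does not yield log(0).
    var = max(sum((v - mean) ** 2 for v in values) / n, RESOLUTION ** 2)
    bits = 0.0
    for v in values:
        # -log2(density(v) * RESOLUTION), written out analytically.
        bits += 0.5 * math.log2(2 * math.pi * var)
        bits += (v - mean) ** 2 / (2 * var) * math.log2(math.e)
        bits -= math.log2(RESOLUTION)
    return bits


def categorical_cost_bits(labels):
    """Bits to encode labels with a code built from their relative frequencies."""
    n = len(labels)
    return -sum(c * math.log2(c / n) for c in Counter(labels).values())


def clustering_cost_bits(clusters, num_categories):
    """Total cost of a clustering: data cost, cluster-ID cost, parameter cost.

    Each cluster is a list of (numeric_value, category) pairs.
    """
    total = sum(len(c) for c in clusters)
    bits = 0.0
    for cluster in clusters:
        size = len(cluster)
        values = [x for x, _ in cluster]
        labels = [y for _, y in cluster]
        # Both attribute types contribute in the same unit (bits).
        bits += numeric_cost_bits(values) + categorical_cost_bits(labels)
        # Cost of telling the receiver which cluster each point belongs to.
        bits += size * -math.log2(size / total)
        # Assumed parameter cost: mean, variance, and category probabilities.
        n_params = 2 + (num_categories - 1)
        bits += 0.5 * n_params * math.log2(size)
    return bits


# Toy data: two well-separated groups that also differ in their category.
group_a = [(-0.2, "a"), (-0.1, "a"), (0.0, "a"), (0.1, "a"), (0.2, "a")]
group_b = [(9.8, "b"), (9.9, "b"), (10.0, "b"), (10.1, "b"), (10.2, "b")]

cost_one_cluster = clustering_cost_bits([group_a + group_b], num_categories=2)
cost_two_clusters = clustering_cost_bits([group_a, group_b], num_categories=2)
```

On this toy data the two-cluster description is cheaper than the single-cluster one, so a search that minimises the total cost would select the number of clusters without a user-supplied parameter. Because every term is expressed in bits, neither attribute type needs a hand-tuned weight.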

Keywords

  • Numerical Attribute
  • Categorical Attribute
  • Cost Curve
  • Minimum Description Length
  • Categorical Information

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Christian Böhm (1)
  • Sebastian Goebl (1)
  • Annahita Oswald (1)
  • Claudia Plant (2)
  • Michael Plavinski (1)
  • Bianca Wackersreuther (1)

  1. University of Munich
  2. Technische Universität München
