Clustering and Labeling of Multi-dimensional Mixed Structured Data

  • Marco Brambilla
  • Massimiliano Zanoni
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7538)

Abstract

Cluster analysis aggregates the data items of a given set into subsets based on similarity properties. Clustering techniques have been applied in many fields that typically involve large amounts of complex data. This study focuses on what we call multi-domain clustering and labeling, i.e. a set of techniques for clustering multi-dimensional, structured, mixed data. The work studies which mix of clustering techniques best addresses the problem in the multi-domain setting. The data types considered are numerical, categorical and textual, and all of them can appear together within the same clustering scenario. We focus on k-means and agglomerative hierarchical clustering methods based on a new distance function that we define for this specific setting. The proposed approach has been validated on real and realistic data sets drawn from the college, automobile and leisure domains. The experimental results allowed us to evaluate the effectiveness of the different solutions, both for clustering and for labeling.
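
To make the idea of a distance over mixed records concrete, the following is a minimal sketch of one way to combine numerical, categorical and textual attributes into a single value. It is not the distance function defined in the paper, which the abstract does not specify. The per-type measures (range-scaled absolute difference, simple mismatch, and cosine distance over bag-of-words counts), the weighted average, and all names such as `mixed_distance`, `schema` and `ranges` are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def numeric_distance(a, b, value_range):
    """Absolute difference scaled to [0, 1] by the attribute's value range."""
    if value_range == 0:
        return 0.0
    return min(abs(a - b) / value_range, 1.0)


def categorical_distance(a, b):
    """Simple mismatch: 0 if the categories are equal, 1 otherwise."""
    return 0.0 if a == b else 1.0


def text_distance(a, b):
    """One minus the cosine similarity of whitespace-tokenized term counts."""
    ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
    if not ta and not tb:
        return 0.0  # two empty texts are treated as identical
    norm = math.sqrt(sum(c * c for c in ta.values())) * math.sqrt(
        sum(c * c for c in tb.values())
    )
    if norm == 0:
        return 1.0  # exactly one text is empty
    dot = sum(count * tb[term] for term, count in ta.items())
    # Clamp to guard against tiny negative values from floating-point rounding.
    return max(0.0, 1.0 - dot / norm)


def mixed_distance(x, y, schema, ranges, weights=None):
    """Weighted average of per-attribute distances (illustrative assumption).

    schema  -- list of "num", "cat" or "text", one entry per attribute
    ranges  -- dict mapping the index of each numerical attribute to its range
    weights -- optional list of non-negative weights, defaults to all ones
    """
    if weights is None:
        weights = [1.0] * len(schema)
    total = 0.0
    for i, kind in enumerate(schema):
        if kind == "num":
            d = numeric_distance(x[i], y[i], ranges[i])
        elif kind == "cat":
            d = categorical_distance(x[i], y[i])
        elif kind == "text":
            d = text_distance(x[i], y[i])
        else:
            raise ValueError(f"unknown attribute type: {kind}")
        total += weights[i] * d
    return total / sum(weights)


# Hypothetical automobile-style records: (price in thousands, colour, description)
schema = ["num", "cat", "text"]
ranges = {0: 40.0}
car_a = (10.0, "red", "fast sports car")
car_b = (20.0, "blue", "slow family van")
print(mixed_distance(car_a, car_b, schema, ranges))
```

Because each per-type distance is scaled to the interval [0, 1], no single attribute type dominates the combined value by default. A function of this shape can be used to fill a pairwise distance matrix for agglomerative hierarchical clustering. A k-means-style method would additionally need a notion of cluster centre for categorical and textual attributes, which this sketch does not address.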

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Marco Brambilla (1)
  • Massimiliano Zanoni (1)
  1. Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milano, Italy
