Clustering and Labeling of Multi-dimensional Mixed Structured Data

Brambilla, Marco; Zanoni, Massimiliano

doi:10.1007/978-3-642-34213-4_8

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7538)

Abstract

Cluster analysis is the aggregation of the data items of a given set into subsets based on similarity properties. Clustering techniques have been applied in many fields that typically involve large amounts of complex data. This study focuses on what we call multi-domain clustering and labeling, i.e. a set of techniques for clustering multi-dimensional, structured, mixed data. The work studies which mix of clustering techniques best addresses the problem in the multi-domain setting. The data types considered are numerical, categorical and textual, and all of them can appear together within the same clustering scenario. We focus on k-means and agglomerative hierarchical clustering methods based on a new distance function that we define for this specific setting. The proposed approach has been validated on real and realistic data sets from the college, automobile and leisure domains. The experimental data allowed us to evaluate the effectiveness of the different solutions, for both clustering and labeling.
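The abstract does not give the distance function itself, so the following Python sketch is only an illustration of the general idea of a distance over records that mix numerical, categorical and textual attributes. It is not the function defined in the chapter: the per-attribute measures (range-scaled absolute difference, simple mismatch, Jaccard distance over word sets), the unweighted average, and all names (`mixed_distance`, `kinds`, `ranges`, the `cars` records) are assumptions made for this example.

```python
from typing import Dict, Union

Value = Union[float, str]
Record = Dict[str, Value]


def mixed_distance(a: Record, b: Record,
                   kinds: Dict[str, str],
                   ranges: Dict[str, float]) -> float:
    """Unweighted mean of per-attribute distances, each scaled to [0, 1].

    Illustrative only: the chapter's own distance function is not
    reproduced here.
    """
    total = 0.0
    for name, kind in kinds.items():
        x, y = a[name], b[name]
        if kind == "numerical":
            # Absolute difference scaled by the attribute's observed range.
            span = ranges[name]
            total += abs(x - y) / span if span else 0.0
        elif kind == "categorical":
            # Simple mismatch: 0 if equal, 1 otherwise.
            total += 0.0 if x == y else 1.0
        elif kind == "textual":
            # Jaccard distance between the sets of lower-cased words.
            tx, ty = set(x.lower().split()), set(y.lower().split())
            union = tx | ty
            total += 1.0 - len(tx & ty) / len(union) if union else 0.0
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
    return total / len(kinds)


# Hypothetical records, loosely in the spirit of the automobile domain.
cars = [
    {"price": 10000.0, "fuel": "petrol", "desc": "compact city car"},
    {"price": 12000.0, "fuel": "petrol", "desc": "small city car"},
    {"price": 40000.0, "fuel": "diesel", "desc": "large family suv"},
]
kinds = {"price": "numerical", "fuel": "categorical", "desc": "textual"}
prices = [c["price"] for c in cars]
ranges = {"price": max(prices) - min(prices)}

# Pairwise distance matrix, usable as input to a distance-based clusterer.
matrix = [[mixed_distance(p, q, kinds, ranges) for q in cars] for p in cars]
```

A matrix like this can be passed to any clustering routine that accepts precomputed distances, such as an agglomerative hierarchical method. A k-means-style method additionally needs a rule for computing a cluster representative over mixed attributes, which this sketch does not cover.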

Author information

Authors and Affiliations

Dipartimento di Elettronica e Informazione, Politecnico di Milano, 20133, Milano, Italy
Marco Brambilla & Massimiliano Zanoni

Editor information

Editors and Affiliations

Dipartimento di Elettronica e Informazione, Politecnico di Milano, Via Ponzio, 34/5, 20133, Milan, Italy
Stefano Ceri
Dipartimento di Elettronica e Informazione, Politecnico di Milano, 20133, Milan, Italy
Marco Brambilla

About this chapter

Cite this chapter

Brambilla, M., Zanoni, M. (2012). Clustering and Labeling of Multi-dimensional Mixed Structured Data. In: Ceri, S., Brambilla, M. (eds) Search Computing. Lecture Notes in Computer Science, vol 7538. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34213-4_8

DOI: https://doi.org/10.1007/978-3-642-34213-4_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34212-7
Online ISBN: 978-3-642-34213-4
eBook Packages: Computer Science, Computer Science (R0)
