Clustering and Labeling of Multi-dimensional Mixed Structured Data

Search Computing

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 7538))

Abstract

Cluster analysis groups the data items of a given set into subsets according to similarity properties. Clustering techniques have been applied in many fields that typically involve large amounts of complex data. This study focuses on what we call multi-domain clustering and labeling, i.e., a set of techniques for clustering multi-dimensional, structured, mixed data. We study which mix of clustering techniques best addresses the problem in the multi-domain setting. The data types considered are numerical, categorical and textual, and all of them can appear together within the same clustering scenario. We focus on k-means and agglomerative hierarchical clustering methods based on a new distance function that we define for this specific setting. The proposed approach has been validated on real and realistic data sets drawn from the college, automobile and leisure domains. The experimental results allowed us to evaluate the effectiveness of the different solutions for both clustering and labeling.
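
The abstract does not specify the distance function, so the following is only an illustrative sketch of how a distance over mixed numerical, categorical and textual attributes might be assembled; it is not the authors' definition. It assumes a Gower-style unweighted average of per-attribute distances: range-scaled absolute difference for numerical values, simple mismatch for categorical values, and Jaccard distance over token sets for text. All names (`mixed_distance`, `schema`, `ranges`, the car records) are hypothetical.

```python
from typing import Dict, Mapping, Union

Value = Union[float, str]


def numeric_distance(a: float, b: float, value_range: float) -> float:
    """Absolute difference scaled to [0, 1] by the attribute's observed range."""
    if value_range == 0:
        return 0.0
    return abs(a - b) / value_range


def categorical_distance(a: str, b: str) -> float:
    """Simple mismatch: 0 if the categories are equal, 1 otherwise."""
    return 0.0 if a == b else 1.0


def textual_distance(a: str, b: str) -> float:
    """Jaccard distance between lower-cased, whitespace-split token sets."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(union)


def mixed_distance(
    x: Mapping[str, Value],
    y: Mapping[str, Value],
    schema: Mapping[str, str],
    ranges: Mapping[str, float],
) -> float:
    """Unweighted mean of per-attribute distances (an assumed combination rule)."""
    total = 0.0
    for name, kind in schema.items():
        if kind == "numerical":
            total += numeric_distance(x[name], y[name], ranges[name])
        elif kind == "categorical":
            total += categorical_distance(x[name], y[name])
        elif kind == "textual":
            total += textual_distance(x[name], y[name])
        else:
            raise ValueError(f"unknown attribute type: {kind}")
    return total / len(schema)


# Hypothetical automobile records, for illustration only.
schema: Dict[str, str] = {
    "price": "numerical",
    "fuel": "categorical",
    "description": "textual",
}
ranges: Dict[str, float] = {"price": 20000.0}
car_a = {"price": 15000.0, "fuel": "diesel", "description": "compact city car"}
car_b = {"price": 25000.0, "fuel": "diesel", "description": "compact family car"}

print(mixed_distance(car_a, car_b, schema, ranges))
```

A pairwise matrix built from such a function could feed an agglomerative hierarchical method directly. A k-means-style method would additionally need a notion of cluster representative for categorical and textual attributes, which this sketch does not cover.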

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Brambilla, M., Zanoni, M. (2012). Clustering and Labeling of Multi-dimensional Mixed Structured Data. In: Ceri, S., Brambilla, M. (eds) Search Computing. Lecture Notes in Computer Science, vol 7538. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34213-4_8

  • DOI: https://doi.org/10.1007/978-3-642-34213-4_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-34212-7

  • Online ISBN: 978-3-642-34213-4

  • eBook Packages: Computer Science, Computer Science (R0)