Abstract
Very few clustering algorithms can cope with a large number of dimensions; many problems arise once the dimensionality grows to the order of hundreds. Dimensionality reduction and feature selection are simple remedies, but in addition to being somewhat intellectually disappointing approaches, both also lead to loss of information.
Furthermore, many elaborate clustering algorithms are unintuitive for the user, who is required to set the values of several hyperparameters without a clear understanding of their meaning and effect.
We develop PCA-based hierarchical clustering algorithms that are particularly geared toward high-dimensional data. The technical novelty is to describe data vectors iteratively by their angles with the eigenvectors that form an orthogonal basis of the input space. The new algorithms avoid the major curse-of-dimensionality problems that affect cluster analysis.
We aim at expressive algorithms that are easy to apply. This entails that the user only needs to set a few intuitive hyperparameters. Moreover, exploring the effect of tuning a parameter is simple, since each parameter is directly (or inversely) proportional to the clustering resolution. Also, the clustering hierarchy has a comprehensible interpretation, and moving between nodes in the hierarchy tree therefore has an intuitive meaning.
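As a rough, hypothetical sketch of the angle-based description mentioned above (not the algorithm presented in the paper itself): each data vector can be represented by its angles with the principal-component eigenvectors, and clustering can then operate on these angle features instead of the raw coordinates. The function name pca_angles and all parameter choices below are illustrative assumptions.

```python
import numpy as np

def pca_angles(X, n_components=None):
    """Angles between centered data vectors and PCA eigenvectors.

    Illustrative sketch only; the paper's algorithms apply the idea
    iteratively to build a clustering hierarchy.
    """
    Xc = X - X.mean(axis=0)                          # center the data
    # Eigenvectors of the covariance matrix are the rows of Vt (via SVD).
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt if n_components is None else Vt[:n_components]
    norms = np.linalg.norm(Xc, axis=1, keepdims=True)
    cosines = (Xc @ V.T) / np.clip(norms, 1e-12, None)
    return np.arccos(np.clip(cosines, -1.0, 1.0))    # shape (n_points, n_components)

# Usage: describe 100 points in 500 dimensions by their angles with the
# 10 leading eigenvectors, then cluster on these angle features.
X = np.random.rand(100, 500)
angles = pca_angles(X, n_components=10)
```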
Cite this paper
Kampman, I., Elomaa, T. (2018). Hierarchical Clustering of High-Dimensional Data Without Global Dimensionality Reduction. In: Ceci, M., Japkowicz, N., Liu, J., Papadopoulos, G., Raś, Z. (eds) Foundations of Intelligent Systems. ISMIS 2018. Lecture Notes in Computer Science, vol. 11177. Springer, Cham. https://doi.org/10.1007/978-3-030-01851-1_23