Abstract
Very few clustering algorithms can cope with a large number of dimensions; many problems arise once the dimensionality grows to the order of hundreds. Dimensionality reduction and feature selection are simple remedies, but in addition to being somewhat intellectually disappointing approaches, both also lead to loss of information.
Furthermore, many elaborate clustering algorithms are unintuitive for the user, who is required to set the values of several hyperparameters without a clear understanding of their meaning and effect.
We develop PCA-based hierarchical clustering algorithms that are particularly geared toward high-dimensional data. The technical novelty is to describe data vectors iteratively by their angles with the eigenvectors that form an orthogonal basis of the input space. The new algorithms avoid the major curse-of-dimensionality problems that affect cluster analysis.
We aim at expressive algorithms that are easy to apply. This entails that the user only needs to set a few intuitive hyperparameters. Moreover, exploring the effect of tuning a parameter is simple, since each parameter is directly (or inversely) proportional to the clustering resolution. Also, the clustering hierarchy has a comprehensible interpretation, and moving between nodes in the hierarchy tree therefore has an intuitive meaning.
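As a rough, hypothetical sketch of the angle-based description mentioned above (not the algorithm presented in the paper itself): each data vector can be represented by its angles with the principal-component eigenvectors, and clustering can then operate on these angle features instead of the raw coordinates. The function name pca_angles and all parameter choices below are illustrative assumptions.

```python
import numpy as np

def pca_angles(X, n_components=None):
    """Angles between centered data vectors and PCA eigenvectors.

    Illustrative sketch only; the paper's algorithms apply the idea
    iteratively to build a clustering hierarchy.
    """
    Xc = X - X.mean(axis=0)                          # center the data
    # Eigenvectors of the covariance matrix are the rows of Vt (via SVD).
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt if n_components is None else Vt[:n_components]
    norms = np.linalg.norm(Xc, axis=1, keepdims=True)
    cosines = (Xc @ V.T) / np.clip(norms, 1e-12, None)
    return np.arccos(np.clip(cosines, -1.0, 1.0))    # shape (n_points, n_components)

# Usage: describe 100 points in 500 dimensions by their angles with the
# 10 leading eigenvectors, then cluster on these angle features.
X = np.random.rand(100, 500)
angles = pca_angles(X, n_components=10)
```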
Cite this paper
Kampman, I., Elomaa, T. (2018). Hierarchical Clustering of High-Dimensional Data Without Global Dimensionality Reduction. In: Ceci, M., Japkowicz, N., Liu, J., Papadopoulos, G., Raś, Z. (eds) Foundations of Intelligent Systems. ISMIS 2018. Lecture Notes in Computer Science, vol. 11177. Springer, Cham. https://doi.org/10.1007/978-3-030-01851-1_23