Hierarchical Clustering of High-Dimensional Data Without Global Dimensionality Reduction

  • Conference paper

Foundations of Intelligent Systems (ISMIS 2018)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11177)

Abstract

Very few clustering algorithms cope well with a high number of dimensions; many problems arise once the dimensionality grows to the order of hundreds. Dimensionality reduction and feature selection are simple remedies to these problems, but in addition to being somewhat intellectually disappointing approaches, both also lead to loss of information.

Furthermore, many elaborate clustering algorithms become unintuitive for the user, who is required to set the values of several hyperparameters without a clear understanding of their meaning and effect.

We develop PCA-based hierarchical clustering algorithms that are particularly geared for high-dimensional data. The technical novelty is to describe data vectors iteratively by their angles with the eigenvectors in the orthogonal basis of the input space. The new algorithms avoid the major curse-of-dimensionality problems that affect cluster analysis.

We aim at expressive algorithms that are easy to apply. This entails that the user only needs to set a few intuitive hyperparameters. Moreover, exploring the effects of tuning these hyperparameters is simple, since they are directly (or inversely) proportional to the clustering resolution. Also, the clustering hierarchy has a comprehensible interpretation and, therefore, moving between nodes in the hierarchy tree has an intuitive meaning.
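The angle-based description sketched in the abstract can be illustrated as follows. This is a minimal NumPy sketch under our own assumptions (PCA computed via SVD of the centered data; all variable names are ours, not the paper's), not the authors' implementation:

```python
import numpy as np

# Hypothetical sketch: describe each data vector by its angles with the
# eigenvectors (principal axes) forming an orthogonal basis of the input space.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))          # 200 points in 50 dimensions

Xc = X - X.mean(axis=0)                 # center the data
# Right singular vectors of the centered data = eigenvectors of the covariance.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# Cosine of the angle between each centered vector and each eigenvector;
# the eigenvectors returned by SVD already have unit norm.
norms = np.linalg.norm(Xc, axis=1, keepdims=True)
cosines = (Xc @ Vt.T) / norms
angles = np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))

# Each point now has an angle profile; points with similar angles to the
# leading eigenvectors could be grouped and refined recursively, which is
# the flavor of the iterative description the abstract refers to.
print(angles.shape)
```

Because only angles (directions) are compared rather than raw high-dimensional distances, a description of this kind sidesteps the distance-concentration effects that plague cluster analysis in hundreds of dimensions.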


Notes

  1. https://www.openml.org/d/458


Author information


Correspondence to Tapio Elomaa.


Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Kampman, I., Elomaa, T. (2018). Hierarchical Clustering of High-Dimensional Data Without Global Dimensionality Reduction. In: Ceci, M., Japkowicz, N., Liu, J., Papadopoulos, G., Raś, Z. (eds) Foundations of Intelligent Systems. ISMIS 2018. Lecture Notes in Computer Science, vol 11177. Springer, Cham. https://doi.org/10.1007/978-3-030-01851-1_23


  • DOI: https://doi.org/10.1007/978-3-030-01851-1_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-01850-4

  • Online ISBN: 978-3-030-01851-1

  • eBook Packages: Computer Science (R0)
