Subspace Clustering Techniques
Cluster analysis aims at finding a set of subsets (i.e., a clustering) of objects in a data set. A meaningful clustering reflects a natural grouping of the data. In high-dimensional data, irrelevant attributes and correlated attributes make any natural grouping hard to detect. Specialized techniques aim at finding clusters in subspaces of a high-dimensional data space.
Attributes have been weighted differently ever since clusters were derived by hand, but the problem of finding a cluster based on a subset of attributes, together with a specialized solution, was first described in 1972 by Hartigan. Research did not focus on the problem until 1998, however, when it was triggered by modern capabilities for the massive acquisition of high-dimensional data in many scientific and economic domains and by the first general approaches to the problem [2, 3, 4]. The more special topic of pattern-based clustering is covered in . Broad overviews are provided in several surveys [6, 7].
Different Challenges: The “Curse of Dimensionality”
High-dimensional data confronts cluster analysis with several problems. A bundle of these problems is commonly referred to as the “curse of dimensionality.” The aspects of this “curse” most relevant to the clustering problem are: (i) In general, any optimization problem becomes increasingly difficult with an increasing number of variables (attributes). (ii) The relative contrast between the farthest point and the nearest point converges to 0 with increasing data dimensionality, i.e., the discrimination between the nearest and the farthest neighbor becomes rather poor in high-dimensional data spaces. In clustered data, this effect can be expected to be less pronounced, but it might remain a problem in combination with the other aspects. (iii) Capabilities of automated data acquisition in many application domains lead to the collection of as many features as possible in the expectation that many of these features may provide useful insights. Thus, for the task at hand, a data set typically contains many irrelevant attributes. Since groups of data are defined by some of the attributes only, the remaining irrelevant attributes (“noise”) may heavily interfere with the efforts to find these groups. (iv) Similarly, in a data set containing many attributes, some attributes will most probably exhibit correlations among each other (of varying complexity).
Many approaches try to alleviate the “curse of dimensionality” by applying feature reduction methods prior to cluster analysis. However, the second main challenge for cluster analysis of high-dimensional data is the possibility, and even high probability, that different subsets or combinations of attributes are relevant for different clusters. Thus, a global feature selection or dimensionality reduction method cannot be applied. Rather, finding the relevant subspaces, and finding clusters in these relevant subspaces, becomes an intrinsic problem of the clustering approach. Furthermore, although correlation among attributes is often the basis for a dimension reduction, in many application domains the question of which correlations exist among which attributes for which subset of objects is itself a main part of the interesting information. As a consequence of this second challenge, the first challenge (i.e., the “curse of dimensionality”) generally cannot be alleviated for clustering high-dimensional data.
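Aspect (ii) can be observed empirically. The following minimal sketch assumes uniformly distributed, unclustered data and Euclidean distance, and measures the relative contrast as (d_max - d_min) / d_min for one random query point; the function name, sample sizes, and seed are illustrative choices, not part of any cited method.

```python
import math
import random

def relative_contrast(n_points, dim, rng):
    """Return (d_max - d_min) / d_min for the Euclidean distances between
    one random query point and n_points uniform random points in [0, 1]^dim."""
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min

rng = random.Random(0)  # fixed seed, only for reproducibility of the sketch
contrasts = {dim: relative_contrast(500, dim, rng) for dim in (2, 10, 100, 1000)}
for dim, value in contrasts.items():
    print(f"d={dim:5d}  relative contrast={value:.3f}")
```

Under these assumptions the printed contrast shrinks markedly as the dimensionality grows, which is the loss of discrimination between nearest and farthest neighbor described above.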
Different Solutions: Categories of Subspace Clustering Techniques
Subspace clustering techniques can be divided into three main families. In view of the challenges sketched above, any arbitrarily oriented subspace may be interesting for a subspace clustering approach. The most general techniques (“(arbitrarily) oriented clustering” or “correlation clustering”; note that the name “correlation clustering” refers to a different problem within the machine learning community) tackle this infinite search space. Yet most of the research in this field assumes the search space to be restricted to axis-parallel subspaces. Since the search space of all possible axis-parallel subspaces of a d-dimensional data space is still in O(2^d), different search strategies and heuristics are implemented. Axis-parallel approaches mainly split into “subspace clustering” and “projected clustering.” In between these two main fields, a group of approaches is known as “pattern-based clustering” (also “bi-clustering” or “co-clustering”). For these approaches, the search space is not necessarily restricted to axis-parallel subspaces, but on the other hand it does not contain all arbitrarily oriented subspaces. The restrictions on the search space differ substantially between different approaches in this group.
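The exponential size of the axis-parallel search space is easy to verify by enumeration. The short sketch below lists every non-empty subset of d attribute indices; the helper name is hypothetical and serves only to illustrate the count.

```python
from itertools import combinations

def axis_parallel_subspaces(d):
    """Yield every non-empty subset of the attribute indices 0..d-1."""
    for k in range(1, d + 1):
        yield from combinations(range(d), k)

# Number of candidate subspaces for a few dimensionalities.
counts = {d: sum(1 for _ in axis_parallel_subspaces(d)) for d in (3, 10, 16)}
print(counts)
```

Each count equals 2^d - 1 (7, 1023, and 65535 here), so exhaustive enumeration quickly becomes infeasible and search heuristics are needed.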
To navigate through the search space of all possible axis-parallel subspaces and to find clusters in subspaces, mainly two strategies are implemented: the top-down approach and the bottom-up approach.
Following the top-down approach, an algorithm derives a cluster approximately based on the full-dimensional space and refines the cluster by adapting the corresponding subspace based on the current selection of points. This means that a lower-dimensional projection is sought in which the (iteratively refined) set of points clusters best. Thus, algorithms pursuing this approach are called “projected clustering algorithms” and usually assign each point to at most one subspace cluster. The first approach of this category is proposed in .
Bottom-up approaches start with single dimensions and search primarily for all interesting subspaces (i.e., subspaces containing clusters) as combinations of lower-dimensional interesting subspaces (often this combination is translated to the frequent item set problem and is thus based on the Apriori property). Most of these approaches are therefore “subspace clustering algorithms” and can usually assign one point to different clusters simultaneously (i.e., subspace clusters may overlap). Their aim is to find all clusters in all subspaces. There are also “hybrid algorithms” that follow the projected clustering approach but allow points to belong to multiple clusters simultaneously or, on the other hand, follow the subspace clustering approach but do not compute all clusters in all subspaces. The first approach in this category is proposed in .
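The top-down refinement loop can be sketched as follows. This is a simplified illustration and not the algorithm of any cited paper: it assumes that initial centers are given, that every cluster keeps a fixed number of attributes, and that the attributes with the smallest variance among the current members are the relevant ones. All names, seeds, and data are hypothetical.

```python
import random
from statistics import mean, pvariance

def select_dimensions(members, n_dims):
    """Indices of the n_dims attributes with the smallest variance among members."""
    d = len(members[0])
    variances = [pvariance([p[j] for p in members]) for j in range(d)]
    best = sorted(range(d), key=lambda j: variances[j])[:n_dims]
    return tuple(sorted(best))

def projected_distance(point, center, dims):
    """Squared Euclidean distance restricted to the attributes in dims."""
    return sum((point[j] - center[j]) ** 2 for j in dims)

def projected_clustering(points, seeds, n_dims, n_iter=5):
    d = len(points[0])
    centers = [list(s) for s in seeds]
    dims = [tuple(range(d)) for _ in seeds]  # start in the full-dimensional space
    labels = []
    for _ in range(n_iter):
        # Assign each point to exactly one cluster, measured in that cluster's subspace.
        labels = [min(range(len(centers)),
                      key=lambda c: projected_distance(p, centers[c], dims[c]))
                  for p in points]
        # Refine the center and the subspace of each cluster from its current members.
        for c in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == c]
            if len(members) < 2:
                continue  # too few points to estimate variances; keep the old state
            centers[c] = [mean(column) for column in zip(*members)]
            dims[c] = select_dimensions(members, n_dims)
    return labels, dims

# Hypothetical data: cluster A is compact in attributes 0 and 1,
# cluster B is compact in attributes 2 and 3; the rest is uniform noise.
rng = random.Random(0)
def jitter(value):
    return value + rng.uniform(-0.02, 0.02)

cluster_a = [(jitter(0.2), jitter(0.2), rng.random(), rng.random()) for _ in range(100)]
cluster_b = [(rng.random(), rng.random(), jitter(0.8), jitter(0.8)) for _ in range(100)]
points = cluster_a + cluster_b
seeds = [(0.2, 0.2, 0.5, 0.5), (0.5, 0.5, 0.8, 0.8)]  # assumed to be given

labels, dims = projected_clustering(points, seeds, n_dims=2)
print(dims)
```

Because every point receives exactly one label, the result is a partition of the data in which each part carries its own subspace, which is the characteristic behavior of projected clustering described above.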
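The bottom-up search can be sketched in the same spirit. The example below assumes a simple grid-based notion of interestingness (a subspace is interesting if at least one grid cell in it holds min_pts or more points) and data scaled to [0, 1]; this criterion, the function names, and the toy data are assumptions for illustration. Because a dense cell remains dense in every projection, candidates are only built from lower-dimensional interesting subspaces, in the manner of the Apriori property.

```python
from collections import Counter
from itertools import combinations

def has_dense_cell(points, subspace, n_bins, min_pts):
    """True if some grid cell of the given subspace holds at least min_pts points."""
    cells = Counter(
        tuple(min(int(p[j] * n_bins), n_bins - 1) for j in subspace)
        for p in points
    )
    return max(cells.values()) >= min_pts

def interesting_subspaces(points, n_bins, min_pts):
    d = len(points[0])
    level = [(j,) for j in range(d) if has_dense_cell(points, (j,), n_bins, min_pts)]
    found = list(level)
    while level:
        known = set(level)
        size = len(level[0])
        candidates = set()
        for a, b in combinations(level, 2):
            merged = tuple(sorted(set(a) | set(b)))
            # Apriori-style pruning: every projection of a candidate must be interesting.
            if len(merged) == size + 1 and all(
                sub in known for sub in combinations(merged, size)
            ):
                candidates.add(merged)
        level = sorted(c for c in candidates if has_dense_cell(points, c, n_bins, min_pts))
        found.extend(level)
    return found

# Hypothetical data: 30 points concentrated in attributes 0 and 1 but spread
# over attribute 2, plus 30 evenly spread background points.
cluster = [(0.25, 0.75, i / 30) for i in range(30)]
background = [((i * 7 % 30) / 30, (i * 11 % 30) / 30, (i * 13 % 30) / 30) for i in range(30)]
found = interesting_subspaces(cluster + background, n_bins=5, min_pts=20)
print(found)
```

On this toy data only the subspaces (0,), (1,), and (0, 1) are reported; attribute 2 is pruned at the first level, so no subspace containing it is ever examined.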
In summary, approaches to axis-parallel subspace clustering handle the problem of irrelevant attributes (aspect (iii) of the “curse of dimensionality”). Bottom-up approaches additionally tackle, for the most part, the problem of poor discrimination between nearest and farthest neighbor (aspect (ii)).
Pattern-based clustering algorithms seek subsets of objects exhibiting a certain pattern on a subset of attributes. In the most widespread algorithms, this pattern is an additive model of the cluster, meaning that each attribute value within a cluster and within the relevant subset of attributes is given by the sum of a cluster mean value, an adjustment value for the current object, and an adjustment value for the current attribute. In general, covering a cluster with such an additive model is possible if the contributing attributes exhibit a simple positive linear correlation among each other. This excludes negative or complex correlations, thus restricting the general search space. Cluster objects reside sparsely on hyperplanes parallel to the irrelevant axes. Projected onto the relevant subspace, the clusters appear as increasing one-dimensional lines. In comparison to axis-parallel approaches, the generalization consists mainly in allowing the axis-parallel hyperplane to be sparse. The cluster in the projection subspace may also remain sparse. The unifying property of all cluster members is the common pattern of correlation between attributes.
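The additive model can be checked numerically. The sketch below assumes a residue-style score: each value is compared with row mean + column mean - overall mean of the selected submatrix, and the squared deviations are averaged. The function name and the two toy submatrices are illustrative assumptions.

```python
def mean_squared_residue(submatrix):
    """Average squared deviation of a submatrix (list of rows) from the additive
    model: value = row mean + column mean - overall mean."""
    n_rows, n_cols = len(submatrix), len(submatrix[0])
    row_means = [sum(row) / n_cols for row in submatrix]
    col_means = [sum(row[j] for row in submatrix) / n_rows for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i, row in enumerate(submatrix):
        for j, value in enumerate(row):
            residue = value - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)

# Objects that follow the same pattern, shifted by an object-specific offset.
additive = [[1, 2, 4],
            [3, 4, 6],
            [6, 7, 9]]
# Two objects whose attributes are negatively correlated.
negative = [[1, 2],
            [2, 1]]

score_additive = mean_squared_residue(additive)
score_negative = mean_squared_residue(negative)
print(score_additive, score_negative)
```

The first submatrix is covered by the additive model, so its score is zero up to rounding error, whereas the negatively correlated submatrix scores 0.25, which illustrates why such correlations fall outside the search space of this model.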
Allowing sparseness in the spatial patterns is an interesting feature of this family of approaches since this also alleviates aspects (ii) and (iii) of the “curse of dimensionality.” Aspect (iv) is addressed partially.
Correlation clustering approaches follow the most general model: Points forming a cluster can be located on an arbitrarily oriented hyperplane (i.e., subspace). These patterns occur if some attributes follow linear but complex correlations among each other (i.e., one attribute may be a linear combination of several other attributes). The main point addressed by these approaches is therefore aspect (iv) of the “curse of dimensionality.” The most widespread technique is the application of principal component analysis (PCA) to locally selected sets of points. Other techniques are based on applying the Hough transform to the data set. Since the Hough transform does not rely on spatial closeness of points, this technique also tackles aspect (ii).
In many scientific and economic fields (such as astronomy, physics, medicine, biology, archaeology, geology, geography, psychology, and marketing), vast amounts of high-dimensional data are collected. Subspace clustering techniques are useful in all these domains for exploiting the full potential of the gathered information. Pattern-based approaches are especially popular in microarray data analysis.
The different groups of subspace clustering techniques (subspace clustering, projected clustering, pattern-based clustering, correlation clustering) tackle different subproblems of the “curse of dimensionality.” Challenges remain for each of these problems. As a next generation of approaches, however, algorithms that tackle more and more aspects simultaneously can be expected. An open problem is the redundancy of results when similar clusters are identified in different subspaces. With respect to this problem, there are strong connections to other clustering techniques such as ensemble clustering, alternative clustering, and constrained clustering. Other challenges arise when tackling the subspace clustering problem on more complex data such as dynamic data.
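The idea behind PCA on a locally selected set of points can be illustrated in two dimensions, where the eigenvalues of the covariance matrix have a closed form. The sketch below assumes the local point set is already given and that the two attributes are actually correlated (nonzero covariance); the function name and the toy data, which lie close to the line y = 1 - 2x, are hypothetical.

```python
import math

def local_pca_2d(points):
    """Eigenvalues of the 2x2 covariance matrix of the points and the slope of
    the principal axis. Assumes a nonzero covariance between the attributes."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / n
    var_y = sum((y - mean_y) ** 2 for _, y in points) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    half_trace = (var_x + var_y) / 2
    radius = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
    strong, weak = half_trace + radius, half_trace - radius
    # An eigenvector of the strong eigenvalue is (cov_xy, strong - var_x).
    slope = (strong - var_x) / cov_xy
    return strong, weak, slope

# Points near an arbitrarily oriented line with negative correlation.
points = [(i / 20, 1 - 2 * (i / 20) + 0.01 * (-1) ** i) for i in range(21)]
strong, weak, slope = local_pca_2d(points)
explained = strong / (strong + weak)
print(f"explained variance of the strong axis: {explained:.4f}, slope: {slope:.3f}")
```

Almost all variance falls on a single eigenvector with a slope close to -2, so the points form a one-dimensional cluster in a subspace that is not axis-parallel and that a positive-correlation additive model would not cover.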
- 2. Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), Seattle; 1998. p. 94–105.
- 3. Aggarwal CC, Procopiuc CM, Wolf JL, Yu PS, Park JS. Fast algorithms for projected clustering. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), Philadelphia; 1999. p. 61–72.
- 4. Aggarwal CC, Yu PS. Finding generalized projected clusters in high dimensional space. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), Dallas; 2000. p. 70–81.
- 6. Kriegel HP, Kröger P, Zimek A. Clustering high dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans Knowl Discov Data (TKDD). 2009;3(1):1–58.
- 9. Beyer K, Goldstein J, Ramakrishnan R, Shaft U. When is “Nearest Neighbor” meaningful? In: Proceedings of the 7th International Conference on Database Theory (ICDT), Jerusalem; 1999. p. 217–35.
- 10. Houle ME, Kriegel HP, Kröger P, Schubert E, Zimek A. Can shared-neighbor distances defeat the curse of dimensionality? In: Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg; 2010. p. 482–500.
- 12. Achtert E, Böhm C, Kriegel HP, Kröger P, Zimek A. Deriving quantitative models for correlation clusters. In: Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Philadelphia; 2006. p. 4–13.
- 15. Achtert E, Kriegel HP, Schubert E, Zimek A. Interactive data mining with 3D-parallel-coordinate-trees. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), New York; 2013. p. 1009–12.