Abstract
Clustering requires the user to define a distance metric, select a clustering algorithm, and set the hyperparameters of that algorithm. Getting these right, so that a clustering is obtained that meets the user's subjective criteria, can be difficult and tedious. Semi-supervised clustering methods make this easier by letting the user provide must-link or cannot-link constraints. These are then used to automatically tune the similarity measure and/or the optimization criterion. In this paper, we investigate a complementary way of using the constraints: they are used to select an unsupervised clustering method and tune its hyperparameters. It turns out that this very simple approach outperforms all existing semi-supervised methods. This implies that choosing the right algorithm and hyperparameter values is more important than modifying an individual algorithm to take constraints into account. In addition, the proposed approach allows for active constraint selection in a more effective manner than other methods.
Introduction
One of the core tasks in data mining is clustering: finding structure in data by identifying groups of instances that are highly similar (Jain 2010). We consider partitional clustering, in which every instance is assigned to exactly one cluster. To cluster data, a practitioner typically has to (1) define a similarity metric, (2) select a clustering algorithm, and (3) set the hyperparameters of that algorithm. A different choice in one of these steps usually leads to a different clustering. This variability is desirable, as it allows users with different interests to find different clusterings of the same data. In search of a clustering that matches their interests, a typical practitioner might for example tweak the distance metric, or try different clustering algorithms, each with several hyperparameter settings. Though common, this strategy can also be quite tedious. Manually tweaking the distance metric is hard, and so is selecting and tuning a clustering algorithm. As a result, a user may go through several iterations of this clustering pipeline before arriving at a satisfactory result.
Semi-supervised clustering methods (Wagstaff et al. 2001; Xing et al. 2003; Bilenko et al. 2004) avoid the need for such iterations by explicitly incorporating user feedback into the clustering process. Instead of interacting with the clustering system by manually tweaking parts of the pipeline, the user provides feedback in a much simpler and well-defined way. Often, feedback is given in the form of pairwise constraints: the user indicates for several pairs whether they should be in the same cluster (called a must-link constraint), or not (a cannot-link constraint). Specifying such constraints is much easier than constructing a good distance metric, or selecting and tuning a clustering algorithm. The former relies on simple expressions of domain knowledge, whereas the latter requires extensive expertise on metrics, algorithm biases, and the influence of hyperparameters. It is thus assumed that the user already has some understanding of the data, or has some knowledge about it that is not directly captured by the features, and wants to find a clustering that respects this understanding. There are also other ways in which constraints can help in clustering: for example, one can use them to find clusterings that score better on a particular unsupervised optimization objective (e.g. they can help to obtain a lower within-cluster sum of squares (Ashtiani et al. 2016)). This, however, is only useful if the user knows a priori which objective is most suited for the task at hand.
Traditional approaches to semi-supervised (or constraint-based) clustering use constraints in one of the following three ways. First, one can modify an existing clustering algorithm to take them into account (Wagstaff et al. 2001; Ruiz et al. 2007; Rangapuram and Hein 2012; Wang et al. 2014). Second, one can learn a new distance metric based on the constraints, after which the metric is used within a traditional clustering algorithm (Xing et al. 2003; Bar-Hillel et al. 2003; Davis et al. 2007). Third, one can combine these two approaches and develop so-called hybrid methods (Bilenko et al. 2004; Basu et al. 2004). All of these methods aim to exploit the given background knowledge within the boundaries of a single clustering algorithm, and as such they ignore the algorithm and hyperparameter selection steps in the pipeline outlined above.
In contrast, we propose to use constraints for exactly this: to select and tune an unsupervised clustering algorithm. Our approach is motivated by the fact that no single algorithm works best on all clustering problems (Estivill-Castro 2002): each algorithm comes with its own bias, which may match a particular problem to a greater or lesser degree. Further, it exploits the ability of algorithms to produce very different clusterings depending on their hyperparameter settings.
Our proposed approach is simple: to find an appropriate clustering, we first generate a set of clusterings using several unsupervised algorithms, with different hyperparameter settings, and afterwards select from this set the clustering that satisfies the largest number of constraints. Our experiments show that, surprisingly, this simple constraint-based selection method often yields better clusterings than existing semi-supervised methods. This leads to the key insight that it is more important to use an algorithm whose inherent bias matches a particular problem than to modify the optimization criterion of any individual algorithm to take the constraints into account.
Furthermore, we present a method for selecting the most informative constraints first, which further increases the usefulness of our approach. This selection strategy allows us to obtain good clusterings with fewer queries. Reducing the number of queries is important in many applications, as they are often answered by a user who has a limited time budget.
The remainder of this paper is structured as follows. In Sect. 2 we give some background on semi-supervised clustering, and algorithm and hyperparameter selection for clustering. Section 3 presents our approach to using pairwise constraints in clustering, which we call COBS (for Constraint-Based Selection). In Sect. 4 we describe how COBS can be extended to actively select informative constraints. We conclude in Sect. 5.
Background
Semi-supervised clustering algorithms allow the user to incorporate a limited amount of supervision into the clustering procedure. Several kinds of supervision have been proposed, one of the most popular being pairwise constraints. Must-link (ML) constraints indicate that two instances should be in the same cluster, cannot-link (CL) constraints that they should be in different clusters. Most existing semi-supervised approaches use such constraints within the scope of an individual clustering algorithm. COP-KMeans (Wagstaff et al. 2001), for example, modifies the cluster assignment step of K-means: instances are assigned to the closest cluster for which the assignment does not violate any constraints. Similarly, the clustering procedures of DBSCAN (Ruiz et al. 2007; Lelis and Sander 2009; Campello et al. 2013), EM (Shental et al. 2004) and spectral clustering (Rangapuram and Hein 2012; Wang et al. 2014) have been extended to incorporate pairwise constraints. Another approach to semi-supervised clustering is to learn a distance metric based on the constraints (Xing et al. 2003; Bar-Hillel et al. 2003; Davis et al. 2007). Xing et al. (2003), for example, propose to learn a Mahalanobis distance by solving a convex optimization problem in which the distance between instances with a must-link constraint between them is minimized, while instances connected by a cannot-link constraint are simultaneously kept apart. Hybrid algorithms, such as MPCK-Means (Bilenko et al. 2004), combine metric learning with an adapted clustering procedure.
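The constrained assignment step of COP-KMeans can be sketched as follows. This is a minimal illustration, not the reference implementation: the function names and the encoding of constraints as index pairs are our own.

```python
import numpy as np

def violates(i, c, labels, ml, cl):
    """Check whether assigning instance i to cluster c violates any
    must-link (ml) or cannot-link (cl) constraint, given the partial
    assignment in labels (-1 means 'not yet assigned')."""
    for a, b in ml:
        j = b if a == i else (a if b == i else None)
        if j is not None and labels[j] != -1 and labels[j] != c:
            return True
    for a, b in cl:
        j = b if a == i else (a if b == i else None)
        if j is not None and labels[j] == c:
            return True
    return False

def copkmeans_assign(X, centers, ml, cl):
    """COP-KMeans assignment step: each instance goes to the nearest
    center that violates no constraint; returns None if some instance
    has no feasible cluster."""
    labels = np.full(len(X), -1)
    for i in range(len(X)):
        for c in np.argsort(np.linalg.norm(centers - X[i], axis=1)):
            if not violates(i, c, labels, ml, cl):
                labels[i] = c
                break
        else:
            return None
    return labels
```

In the full algorithm this step alternates with the usual center re-estimation until convergence.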
Meta-learning and algorithm selection have been studied extensively for supervised learning (Brazdil et al. 2003; Thornton et al. 2013), but much less for clustering. There is some work on building meta-learning systems that recommend clustering algorithms (Souto et al. 2008; Ferrari and de Castro 2015). However, these systems take neither hyperparameter selection nor any form of supervision into account. More closely related to ours is the work of Caruana et al. (2006). They generate a large number of clusterings using K-means and spectral clustering, and cluster these clusterings. This meta-clustering is presented to the user as a dendrogram. Here, we also generate a set of clusterings, but afterwards we select from that set the most suitable clustering based on pairwise constraints. The only other work, to our knowledge, that has explored the use of pairwise constraints for algorithm selection is that by Adam and Blockeel (2015). They define a meta-feature based on constraints, and use this feature to predict whether EM or spectral clustering will perform better for a dataset. While their meta-feature attempts to capture one specific property of the desired clusters, i.e. whether they overlap, our approach is more general and allows selection between any clustering algorithms.
Whereas algorithm selection has received little attention in clustering, several methods have been proposed for hyperparameter selection. One strategy is to run the algorithm with several parameter settings, and select the clustering that scores highest on an internal quality measure (Arbelaitz et al. 2013; Vendramin et al. 2010). Such measures try to capture the idea of a “good” clustering. A first limitation is that they are not able to deal with the inherent subjectivity of clustering, as they do not take any external information into account. Furthermore, internal measures are only applicable within the scope of individual clustering algorithms, as each of them comes with its own bias (von Luxburg et al. 2014). For example, the vast majority of them have a preference for spherical clusters, making them suitable for K-means, but not for e.g. spectral clustering or DBSCAN.
Another strategy for parameter selection in clustering is based on stability analysis (Ben-Hur et al. 2002; von Luxburg 2010; Ben-David et al. 2006). A parameter setting is considered stable if similar clusterings are produced with that setting when it is applied to several datasets from the same underlying model. These datasets can for example be obtained by taking subsamples of the original dataset (Ben-Hur et al. 2002; Lange et al. 2004). In contrast to internal quality measures, stability analysis does not require an explicit definition of what it means for a clustering to be good. Most studies on stability focus on selecting parameter settings within the scope of individual algorithms (in particular, often the number of clusters). As such, it is unclear to what extent stability can be used to choose between clusterings from very different clustering algorithms.
Additionally, one can avoid the need for explicit parameter selection altogether. In self-tuning spectral clustering (Zelnik-Manor and Perona 2004), for example, the affinity matrix is constructed based on local statistics and the number of clusters is estimated using the structure of the eigenvectors of the Laplacian.
A key distinction from COBS is that none of the above methods takes the subjective preferences of the user into account. We will compare our constraint-based selection strategy to some of them in the next section.
Pourrajabi et al. (2014) have introduced CVCP, a framework for using constraints for hyperparameter selection within the scope of individual semi-supervised algorithms. A major difference is that COBS uses all constraints for selection (and none within the algorithm) and selects both an unsupervised algorithm and its hyperparameters (as opposed to only the hyperparameters of a semi-supervised algorithm). We compare COBS to CVCP in the experiments in Sect. 3.
Constraint-based clustering selection
Algorithm and hyperparameter selection are difficult in an entirely unsupervised setting. This is mainly due to the lack of a well-defined way to estimate the quality of clustering results (Estivill-Castro 2002). We propose to use constraints for this purpose, and estimate the quality of a clustering as the number of constraints that it satisfies. This quality estimate allows us to do a search over unsupervised algorithms and their parameter settings, as described in Algorithm 1.
Essentially, the algorithm simply generates a large set of clusterings, using various clustering algorithms and parameter settings, and then selects from this set the clustering that performs best according to the constraint-based criterion. In the algorithmic description, c[i] indicates the cluster of element i and \(\mathbb {I}\) is the indicator function. For varying the clustering algorithm and hyperparameter settings, a simple grid search is used in our experiments. We have also experimented with SMAC (Hutter et al. 2011), which is based on sequential model-based optimization, but found the results highly similar to those obtained with grid search, both in terms of cluster quality and runtime. For completeness, we do include the SMAC results in the comparison to the semi-supervised competitors in Sect. 3.5 (Fig. 2). The objective that SMAC aims to minimize in these experiments is the number of unsatisfied constraints.
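In code, the selection step of Algorithm 1 amounts to little more than the following sketch, where a clustering is an array mapping each instance to a cluster and constraints are lists of index pairs (this representation is ours, chosen for brevity):

```python
def count_satisfied(c, ml, cl):
    """Number of constraints a clustering c satisfies: a must-link (ml)
    pair must share a cluster, a cannot-link (cl) pair must not."""
    return (sum(c[i] == c[j] for i, j in ml)
            + sum(c[i] != c[j] for i, j in cl))

def cobs_select(clusterings, ml, cl):
    """Return the generated clustering that satisfies the most constraints."""
    return max(clusterings, key=lambda c: count_satisfied(c, ml, cl))
```

The summation inside `count_satisfied` is exactly the \(\mathbb {I}\)-based criterion of Algorithm 1.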
We now reiterate and clarify the motivations for COBS, which were briefly presented in the introduction. First, each clustering algorithm comes with a particular bias, and no single one performs best on all clustering problems (Estivill-Castro 2002). Existing semi-supervised approaches can change the bias of an unsupervised algorithm, but only to a certain extent. For instance, using constraints to learn a Mahalanobis distance allows K-means to find ellipsoidal clusters, rather than spherical ones, but still does not make it possible to find non-convex clusters. In contrast, by using constraints to choose between clusterings generated by very different algorithms, COBS aims to select the most suitable one from a diverse range of biases.
Second, it is widely known that, within a single clustering algorithm, the choice of hyperparameters can strongly influence the clustering result. Consequently, choosing a good parameter setting is crucial. Currently, a user can either do this manually, or use one of the selection strategies discussed in Sect. 2. Both options come with significant drawbacks. Doing parameter tuning manually is time-consuming, given the often large number of combinations one might try. Existing automated selection strategies avoid this manual labor, but can easily fail to select a good setting as they do not take the user’s preferences into account. For COBS, parameters are an asset rather than a burden. They allow generating a large and diverse set of clusterings, from which we can select the most suitable solution with a limited number of pairwise constraints.
Research questions
In the remainder of this section, we aim to answer the following questions:
Q1: How does COBS’ hyperparameter selection compare to unsupervised hyperparameter selection?
Q2: How does unsupervised clustering with COBS’ hyperparameter selection compare to semi-supervised clustering algorithms?
Q3: How does COBS, with both algorithm and hyperparameter selection, compare to existing semi-supervised algorithms?
Q4: Can we combine the best of both worlds, that is: use COBS with semi-supervised clustering algorithms rather than unsupervised ones?
Q5: How do the clusterings that COBS selects score on internal quality criteria?
Q6: What is the computational cost of COBS, compared to the alternatives?
Although our selection strategy is also related to meta-clustering (Caruana et al. 2006), an experimental comparison would be difficult as meta-clustering produces a dendrogram of clusterings for the user to explore. The user can traverse this dendrogram to obtain a single clustering, but the outcome of this process is highly subjective. COBS works with pairwise constraints; we therefore compare to other methods that do the same.
Experimental methodology
To answer our research questions we perform experiments with artificial datasets as well as UCI classification datasets. The classes are assumed to represent the clusters of interest. We evaluate how well the returned clusters coincide with them by computing the Adjusted Rand Index (ARI) (Hubert and Arabie 1985), a commonly used measure for this purpose. The ARI is an extension of the regular Rand index (RI), which is the ratio of the number of pairs of instances on which two clusterings agree (i.e. pairs that belong to the same cluster in both clusterings, or to a different cluster in both clusterings) to the total number of pairs. The ARI corrects the RI for chance; whereas the expected value for random clusterings is not constant for the regular RI, it is 0 for the ARI. A clustering that coincides perfectly with the one indicated by the class labels has an ARI of 1, whereas a randomly generated clustering should have an ARI close to 0. In computing the ARI for clusterings generated by DBSCAN and FOSC we consider each point identified as noise to be in its own separate cluster.^{Footnote 1} We use five-fold cross-validation, and in each iteration perform the following steps:

1. Generate c pairwise constraints (c is a parameter) by repeatedly selecting two random instances from the training set, and adding a must-link constraint if they belong to the same class, and a cannot-link otherwise.
2. Apply COBS to the full dataset to obtain a clustering.
3. Evaluate the clustering by calculating the ARI on all objects from the validation fold.
In all plots and tables, we report the average ARI over the 5 validation folds. Measuring how well clusterings agree with given class labels is the most common way of evaluating semi-supervised clustering algorithms, despite its limitations (Färber et al. 2010; von Luxburg et al. 2014). One such limitation is that class labels do not always correspond to what one would commonly identify as cluster structure. We partly get around these limitations by also using artificial datasets (for which we know that the labels indicate actual clustering structure), and by investigating the performance of COBS on two internal quality measures (research question Q5).
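The constraint-generation step of this protocol can be sketched as follows (the function name and interface are our own; the selected clustering would then be scored with e.g. `sklearn.metrics.adjusted_rand_score` on the validation fold):

```python
import numpy as np

def generate_constraints(labels, train_idx, c, rng):
    """Sample c random pairs of instances from the training fold; a pair
    with equal class labels yields a must-link constraint, an unequal
    pair a cannot-link constraint."""
    ml, cl = [], []
    for _ in range(c):
        i, j = rng.choice(train_idx, size=2, replace=False)
        (ml if labels[i] == labels[j] else cl).append((i, j))
    return ml, cl
```

Note that the split between must-links and cannot-links is not controlled: it follows the class distribution of the sampled pairs, which matters for imbalanced datasets (see the discussion of hepatitis in Sect. 3.5).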
Datasets
An overview of all the datasets can be found in Table 1. First, we perform experiments with the 5 artificial datasets^{Footnote 2} that are shown in Fig. 1. The advantage of these artificial datasets is that they have unambiguous ground truth clusterings, and that they can easily be visualized (some illustrations of clusterings can be found in Sect. 3.5). Further, we perform experiments with 14 classification datasets that have also been used in several other studies on semi-supervised clustering. The optdigits389 dataset is a subset of the UCI handwritten digits dataset, containing only digits 3, 8 and 9 (Bilenko et al. 2004; Mallapragada et al. 2008). The faces dataset contains 624 images of 20 persons in different poses, with different expressions, and with and without sunglasses. Hence, this dataset has 4 target clusterings: identity, pose, expression and eyes (whether or not the person is wearing sunglasses). We extract a 2048-value feature vector for each image by running it through the pre-trained Inception-v3 network (Szegedy et al. 2015) and storing the output of the second-to-last layer. We only show results for the identity and eyes target clusterings, as the clusterings generated for pose and expression had an ARI close to zero for all methods. We experiment with two clustering tasks on the 20 newsgroups text data: in the first task the data consists of all documents from three related topics (comp.graphics, comp.os.ms-windows and comp.windows.x, as in Basu and Mooney (2004); Mallapragada et al. (2008)), in the second task the data consists of all documents from three very different topics (alt.atheism, rec.sport.baseball and sci.space, as in Basu and Mooney (2004); Mallapragada et al. (2008)). We first extract tf-idf features, and then use latent semantic indexing (as in Mallapragada et al. (2008)) to reduce the dimensionality to 10.
Duplicate instances were removed for all these datasets, and all data was normalized between 0 and 1 (except for the artificial data, and the image and text data).
Unsupervised algorithms used in COBS
We use K-means, DBSCAN and spectral clustering to generate clusterings in step one of Algorithm 1, as they are common representatives of different types of algorithms (we use implementations from scikit-learn (Pedregosa et al. 2011)). The hyperparameters are varied in the ranges specified in Table 2. In particular, for each dataset we generate 180 clusterings using K-means (for each number of clusters we store the clusterings obtained with 20 random initializations), 351 using spectral clustering and 400 using DBSCAN, yielding a total of 931 clusterings. For discrete parameters, clusterings are generated for the complete range. For continuous parameters, clusterings are generated using 20 evenly spaced values in the specified intervals. For the \(\epsilon \) parameter of DBSCAN, the lower and upper bounds are the minimum and maximum pairwise distances between instances (referred to as \(\min (d)\) and \(\max (d)\) in Table 2). We use the Euclidean distance for all unsupervised algorithms.
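A scaled-down sketch of this generation step, using much smaller parameter grids than Table 2 for brevity (the ranges below are illustrative, not the ones used in the experiments):

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans, SpectralClustering
from sklearn.metrics import pairwise_distances

def generate_clusterings(X):
    """Generate a pool of clusterings by varying the algorithm and its
    hyperparameters over a small illustrative grid."""
    pool = []
    for k in range(2, 5):
        for seed in range(3):  # the paper keeps 20 random initializations per K
            pool.append(KMeans(n_clusters=k, n_init=1,
                               random_state=seed).fit_predict(X))
        pool.append(SpectralClustering(n_clusters=k,
                                       random_state=0).fit_predict(X))
    d = pairwise_distances(X)
    # eps is varied between the minimum and maximum pairwise distance
    for eps in np.linspace(d[d > 0].min(), d.max(), 5):
        for min_pts in (2, 4):
            pool.append(DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X))
    return pool
```

In the experiments this grid is simply enlarged to the ranges of Table 2; the pool itself is computed once per dataset and reused for all constraint budgets.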
Question Q1: COBS versus unsupervised hyperparameter tuning
To evaluate hyperparameter selection for individual algorithms, we use Algorithm 1 with C a set of clusterings generated using one particular algorithm (K-means, DBSCAN or spectral). We compare COBS to state-of-the-art unsupervised selection strategies. As there is no single method that can be used for all three algorithms, we use three different approaches, which are briefly described next.
Existing unsupervised hyperparameter selection methods
K-means has one hyperparameter: the number of clusters K. A popular method to select K is to use internal clustering quality measures (Vendramin et al. 2010; Arbelaitz et al. 2013). K-means is run for different values of K (and in this case also for different random seeds), and afterwards the clustering that scores highest on such an internal measure is chosen. In our setup, we generate 20 clusterings for each K by using different random seeds. We select the clustering that scores highest on the silhouette index (Rousseeuw 1987), which was identified as one of the best internal criteria by Arbelaitz et al. (2013).
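This unsupervised baseline can be sketched as follows (a minimal version; the experiments use 20 seeds per K and a wider K range):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def select_kmeans_silhouette(X, k_range, n_seeds):
    """Run K-means over all (K, seed) combinations and keep the
    clustering with the highest silhouette index."""
    best, best_score = None, -1.0
    for k in k_range:
        for seed in range(n_seeds):
            labels = KMeans(n_clusters=k, n_init=1,
                            random_state=seed).fit_predict(X)
            score = silhouette_score(X, labels)
            if score > best_score:
                best, best_score = labels, score
    return best
```

The silhouette index ranges over [-1, 1], so initializing `best_score` to -1.0 guarantees some clustering is always selected.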
DBSCAN has two parameters: \(\epsilon \), which specifies how close points should be to be in the same neighborhood, and minPts, which specifies how many points are required in a point's neighborhood for it to be a core point. Most internal criteria are not suited for DBSCAN, as they assume spherical clusters, whereas one of the key characteristics of DBSCAN is that it can find clusters of arbitrary shape. One exception is the Density-Based Cluster Validation (DBCV) score (Moulavi et al. 2014), which we use in our experiments.
Spectral clustering requires the construction of a similarity graph, which can be done in several ways (von Luxburg 2007). If a k-nearest neighbor graph is used, k has to be set. For graphs based on a Gaussian similarity function, \(\sigma \) has to be set to specify the width of the neighborhoods. The number of clusters K must also be specified. Self-tuning spectral clustering (Zelnik-Manor and Perona 2004) avoids having to specify any of these parameters, by relying on local statistics to compute different \(\sigma \) values for each instance, and by exploiting structure in the eigenvectors to determine the number of clusters. This approach differs from the ones used for K-means and DBSCAN: here we do not generate a set of clusterings first, but instead estimate the hyperparameters directly from the data.
Results and conclusion
The columns of Table 3 marked with Q1 compare the ARIs obtained with the unsupervised approaches to those obtained with COBS. The best of the two is underlined for each algorithm and dataset combination. Most of the time the constraint-based selection strategy performs better, and often by a large margin. Note for example the large difference for ionosphere: DBSCAN is able to produce a good clustering, but only the constraint-based approach recognizes it as the best one. When the unsupervised selection method performs better, the difference is usually small. We conclude that the internal measures often do not match the clusterings indicated by the class labels. Constraints provide useful information that can help select a good parameter setting.
Question Q2: COBS versus semi-supervised algorithms
It is not too surprising that COBS outperforms unsupervised hyperparameter selection, since it has access to more information. We now compare to semi-supervised algorithms, which have access to the same information.
Existing semi-supervised algorithms
We compare to the following algorithms, as they are semi-supervised variants of the unsupervised algorithms used in our experiments:

MPCK-Means (Bilenko et al. 2004) is a hybrid semi-supervised extension of K-means. It minimizes an objective that combines the within-cluster sum of squares with the cost of violating constraints. This objective is greedily minimized using a procedure based on K-means. Besides a modified cluster assignment step and the usual cluster center re-estimation step, this procedure also adapts an individual metric associated with each cluster in each iteration. We use the implementation available in the WekaUT package.^{Footnote 3}

FOSC-OpticsDend (Campello et al. 2013) is a semi-supervised extension of OPTICS, which is in turn based on ideas similar to those of DBSCAN. The first step of this algorithm is to run the unsupervised OPTICS algorithm, and to construct a dendrogram using its output. The FOSC framework is then used to extract a flat clustering from this dendrogram that is optimal w.r.t. the given constraints.

COSC (Rangapuram and Hein 2012) is based on spectral clustering, but optimizes an objective that combines the normalized cut with a penalty for constraint violation. We use the implementation available on the authors’ web page.^{Footnote 4}
In our experiments, the only kind of supervision given to the algorithms is in the form of pairwise constraints. In particular, the number of clusters K is assumed to be unknown. In COBS, K is treated as any other hyperparameter. MPCK-Means and COSC, however, require specifying the number of clusters. The following strategies are used to select K based on the constraints:

NumSat We run the algorithms for multiple values of K, and select the clustering that violates the smallest number of constraints. In case of a tie, we choose the solution with the lowest number of clusters.

CVCP Cross-Validation for finding Clustering Parameters (Pourrajabi et al. 2014) is a cross-validation procedure for semi-supervised clustering. The set of constraints is divided into n independent folds. To evaluate a parameter setting, the algorithm is repeatedly run on the entire dataset given the constraints in \(n-1\) folds, keeping aside the nth fold as a test set. The clustering that is produced given the constraints in the \(n-1\) folds is then considered as a classifier that distinguishes between must-link and cannot-link constraints in the nth fold. The F-measure is used to evaluate the score of this classifier. The performance of the parameter setting is then estimated as the average F-measure over all test folds. This process is repeated for all parameter settings, and the one resulting in the highest average F-measure is retained. The algorithm is then run with this parameter setting using all constraints to produce the final clustering. We use five-fold cross-validation.
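The CVCP evaluation of a single parameter setting can be sketched as follows. This is our own reading of the procedure: the `cluster_fn` interface (a function that clusters X given a list of constraints) and the encoding of constraints as `(i, j, is_must_link)` triples are assumptions, not part of the original framework.

```python
import numpy as np
from sklearn.metrics import f1_score

def cvcp_score(cluster_fn, X, constraints, n_folds=5, seed=0):
    """Estimate the quality of one parameter setting, CVCP-style:
    hold out one fold of constraints, cluster with the remaining folds,
    and score the clustering as a must-link vs. cannot-link classifier
    on the held-out fold, using the F-measure."""
    rng = np.random.default_rng(seed)
    constraints = list(constraints)
    rng.shuffle(constraints)
    folds = np.array_split(np.arange(len(constraints)), n_folds)
    scores = []
    for test_idx in folds:
        held_out = set(test_idx.tolist())
        train = [c for i, c in enumerate(constraints) if i not in held_out]
        test = [constraints[i] for i in test_idx]
        labels = cluster_fn(X, train)  # run the algorithm with n-1 folds
        y_true = [is_ml for _, _, is_ml in test]
        y_pred = [labels[i] == labels[j] for i, j, _ in test]
        # zero_division=1: a fold of correctly predicted cannot-links
        # (no positives at all) counts as perfect rather than zero
        scores.append(f1_score(y_true, y_pred, zero_division=1))
    return float(np.mean(scores))
```

The setting with the highest `cvcp_score` would then be re-run with all constraints to produce the final clustering.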
We also compare to unsupervised hyperparameter selection strategies for the semi-supervised algorithms. In particular, we use the silhouette index for MPCK-Means, and the eigengap heuristic for COSC (von Luxburg 2007). The affinity matrix for COSC is constructed using local scaling as in Rangapuram and Hein (2012).
Results and conclusion
The columns in Table 3 marked with Q2 show the ARIs obtained with the semi-supervised algorithms. The best result for each type of algorithm (unsupervised or semi-supervised) is indicated in bold. The table shows that in several cases it is more advantageous to use the constraints to optimize the hyperparameters of the unsupervised algorithm (as COBS does). In other cases, it is better to use the constraints within the algorithm itself, to perform a more informed search (as the semi-supervised variants do). Within the scope of a single clustering algorithm, neither strategy consistently outperforms the other. For example, if we use spectral clustering on the dermatology data, it is better to use the constraints for tuning the hyperparameters of unsupervised spectral clustering (also varying k and \(\sigma \) for constructing the similarity matrix) than within COSC, its semi-supervised variant (which uses local scaling for this). In contrast, if we use density-based clustering on the same data, it is better to use constraints in FOSC-OpticsDend (which does not have an \(\epsilon \) parameter, and for which minPts is set to 4, a value commonly used in the literature (Ester et al. 1996; Campello et al. 2013)) than to use them to tune the hyperparameters of DBSCAN (varying both \(\epsilon \) and minPts).
Question Q3: COBS with multiple unsupervised algorithms
In the previous two subsections, we showed that constraints can be useful for tuning the hyperparameters of individual algorithms. Table 3 also shows, however, that no single algorithm (unsupervised or semi-supervised) performs well on all datasets. This motivates the use of COBS to select not only the hyperparameters, but also the clustering algorithm. In this subsection we again use Algorithm 1, but the set C generated in step 1 now includes clusterings produced by any of the three unsupervised algorithms.
Results
We compare COBS with existing semi-supervised algorithms in Fig. 2. For the majority of datasets, COBS produces clusterings that are on par with, or better than, those produced by the best competitor. While some other approaches also do well on some of the datasets, none of them does so consistently. Compared to each competitor individually, COBS is clearly superior. For example, COSC-NumSat outperforms COBS on the breast-cancer-wisconsin dataset, but performs much worse on several others. The only datasets for which COBS performs significantly worse than its competitors are column_2C and hepatitis.
Table 4 allows us to assess the quality of the clusterings selected by COBS, relative to the quality of the best clustering in the set of generated clusterings. Column 2 shows the highest ARI of all generated clusterings for each dataset. Note that we can only compute this value in an experimental setting, in which we have labels for all elements. In a real clustering application, we cannot simply select the result with the highest ARI. Column 3, then, shows the ARI of the clustering that is actually selected by COBS when it is given 50 constraints. It shows that there is still room for improvement, i.e. a more advanced strategy might get closer to the maxima. Nevertheless, even our simple strategy gets close enough to outperform most other semi-supervised methods. The last column of Table 4 shows how often COBS chose a clustering by K-means ('K'), DBSCAN ('D') and spectral clustering ('S'). It illustrates that the selected algorithm strongly depends on the dataset. For example, for ionosphere COBS selects clusterings generated by DBSCAN, as it is the only algorithm able to produce good clusterings of this dataset.
The clusterings that COBS selects for hepatitis are also generated by DBSCAN. This might seem strange as the ARIs of these DBSCAN clusterings are low (and significantly lower than those of the K-means clusterings, as can be seen in Table 3). In other words, the DBSCAN clusterings are good at satisfying the constraints (as they are selected), but at the same time do not recover the class structure that the constraints are derived from. To understand this, note that most of the constraints for hepatitis are must-links (it consists of 2 classes, one with 19 instances and the other with 93). DBSCAN generates some clusterings in which (nearly) all instances are in one cluster. These clusterings are good at satisfying the randomly generated constraints (most of them are must-links, and DBSCAN is right about them as there is only one cluster), and are chosen by COBS. The selected clusterings, however, score badly on the ARI, as this measure is adjusted for chance and takes class imbalance into account. In other words, this behaviour is due to a discrepancy between the selection criterion of COBS and the ARI. This observation indicates that COBS might benefit from the development of more complex selection criteria.
Results on artificial datasets
Figure 3 shows the clusterings that are produced for the artificial datasets, given 50 constraints. COBS performs on par with the best competitor for all of them. An interesting observation can be made for flame (shown in the last row). For this dataset, COBS selects a solution consisting of two clusters and 11 additional noise points (which are considered as separate clusters in computing the ARI). This clustering is produced by DBSCAN, which identifies the points shown as green crosses as noise. In this case, no constraints were defined on these points. The clustering that is shown satisfies all given constraints, and was selected randomly from all clusterings that did so. Giving the correct number of clusters (as was done for COSC and MPCKMeans in Fig. 3) and not allowing noise points would result in COBS selecting clusterings that are highly similar to those generated by COSC and FOSC.
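Treating DBSCAN noise points as singleton clusters before computing the ARI can be done with a small preprocessing step. The sketch below (our own helper, assuming the scikit-learn convention that noise is labelled -1) shows one way to do this:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score  # external measure used in the paper

def relabel_noise_as_singletons(labels):
    """Give each DBSCAN noise point (label -1) its own cluster id,
    so that evaluation measures treat noise points as singleton clusters."""
    labels = np.asarray(labels).copy()
    next_id = labels.max() + 1
    for idx in np.where(labels == -1)[0]:
        labels[idx] = next_id
        next_id += 1
    return labels
```

One would then call e.g. `adjusted_rand_score(y_true, relabel_noise_as_singletons(y_pred))`. The alternative mentioned in note 1, grouping all noise points into a single "noise cluster", simply skips the relabeling.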
Besides COBS, COSC also attains high ARIs for all five artificial datasets, but only if it is given the correct number of clusters. Without the right number of clusters, COSC produces much worse clusterings on cure-t2-4k, with ARIs of 0.43 (for the eigengap selection method), 0.42 (for NumSat) and 0.51 (for CVCP), as listed in Table 3.
Another interesting observation can be made for MPCKMeans, for example on the flame dataset (shown in the last row). It shows that using constraints does not allow MPCKMeans to overcome its inherent spherical bias. Points that are connected by a must-link constraint are placed in the same cluster, but the overall cluster shape cannot be identified correctly. This is visible in the plot, as for instance some red points occur in the inner cluster.
Conclusion
If any of the unsupervised algorithms is able to produce good clusterings, COBS can select them using a limited number of constraints. If not, COBS performs poorly, but in our experiments none of the algorithms did well in this case. We conclude that it is generally better to use constraints to select and tune an unsupervised algorithm than to use them within a randomly chosen semi-supervised algorithm.
Question Q4: using COBS with semi-supervised algorithms
In the previous section we have shown that constraints can be used for algorithm and hyperparameter selection with unsupervised algorithms. On the other hand, constraints can also be useful within an adapted clustering procedure, as traditional semi-supervised algorithms use them. This raises the question: can we combine both approaches? In this section, we use the constraints to select and tune a semi-supervised clustering algorithm. In particular, we vary the hyperparameters of the semi-supervised algorithms to generate the set of clusterings from which we select. The varied hyperparameters are the same as those for their unsupervised variants, except for two. First, \(\epsilon \) is not varied for FOSC-OpticsDend, as it is not a hyperparameter of that algorithm. Second, in this section we only use k-nearest neighbors graphs for (semi-supervised) spectral clustering, as full similarity graphs lead to long execution times for COSC.
Results and conclusions
Column 3 of Table 5 shows that this strategy does not produce better results. This is caused by using the same constraints twice: once within the semi-supervised algorithms, and once to evaluate the algorithms and select the best-performing one. Obviously, algorithms that overfit the given constraints will get selected in this manner.
The problem could be alleviated by using separate constraints inside the algorithm and for evaluation, but this decreases the number of constraints that can effectively be used for either purpose. Column 4 of Table 5 shows the average ARIs that are obtained if we use half of the constraints within the semi-supervised algorithms, and the other half to select one of the generated clusterings afterwards. This works better, but often still not as well as COBS with unsupervised algorithms.
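The half-and-half split described above amounts to a simple random partition of the constraint set. A minimal sketch (our own helper, not from the paper):

```python
import random

def split_constraints(constraints, frac=0.5, seed=0):
    """Randomly split constraints into a set used inside the
    semi-supervised algorithm and a held-out set used only
    afterwards, to select among the generated clusterings."""
    rng = random.Random(seed)
    shuffled = constraints[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * frac)
    return shuffled[:cut], shuffled[cut:]
```

The split mirrors a train/validation split in supervised learning: the held-out constraints give an unbiased estimate of how well each clustering matches the user's target, which is exactly what reusing the same constraints destroys.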
We conclude that using semi-supervised algorithms within COBS can only be beneficial if the semi-supervised algorithms use constraints different from those used for selection. Even then, when only a limited number of constraints is available, using all of them for selection is often the best choice.
Question Q5: evaluating the clusterings on internal criteria
In the previous research questions we investigated how well the clusterings produced by COBS score according to the ARI, an external evaluation measure. This is motivated by the fact that our main goal is to find clusterings that are aligned with the user's interests, which are assumed to be captured by the class labels. In this section we investigate how well these clusterings score on internal measures, which are computed using only the data. Such internal measures capture characteristics that one might expect of a good clustering, such as high intra-cluster similarity and low inter-cluster similarity. In particular, we want to know to what extent we compromise on quality according to internal criteria by focusing on satisfying constraints.
Ideally, we would use an internal measure that is not biased towards any particular clustering algorithm. However, no such measure exists (Estivill-Castro 2002): each internal quality measure comes with its own bias, which may match the bias of a clustering algorithm to a greater or lesser degree. As a result, choosing a suitable internal quality criterion is often as difficult as choosing the right clustering algorithm. For example, the large majority of internal measures have a strong spherical bias (Vendramin et al. 2010; Arbelaitz et al. 2013), making them suitable for use with k-means, but not with spectral clustering or DBSCAN.
In this section, we investigate the trade-off between the ARI and two internal measures: the silhouette index (SI) (Rousseeuw 1987) and the density-based cluster validation (DBCV) score (Moulavi et al. 2014), both of which were also used in answering the previous research questions. The SI was chosen as it is well-known, and the extensive studies by Arbelaitz et al. (2013) and Vendramin et al. (2010) identify it as one of the best-performing measures. The DBCV score was chosen as it is one of the few internal measures without a spherical bias; instead, it is based on the within- and between-cluster density connectedness of clusters. Although it has no spherical bias, the DBCV score comes with its own limitations; for example, it is strongly influenced by noise, and biased towards imbalanced clusterings (Van Craenendonck and Blockeel 2015). Both measures range in \([-1,1]\), with higher values being better.
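The contrast between external and internal evaluation is easy to demonstrate with scikit-learn, which provides both the ARI and the silhouette index (DBCV has no scikit-learn implementation, so it is omitted here). The snippet below is only an illustration of how the two kinds of measures are computed, on synthetic data of our own choosing:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Toy data with known class labels y.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# External measure: compares the clustering to the class labels.
ari = adjusted_rand_score(y, pred)
# Internal measure: uses only the data and the clustering itself.
sil = silhouette_score(X, pred)
```

On well-separated spherical blobs both scores are high; the parkinsons example in this section shows that they can also diverge sharply, since a clustering can match the labels without matching the spherical or density structure the internal measures reward.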
Figure 4 shows how well the semi-supervised methods score on the internal measures for six datasets. In most cases, COBS performs comparably to its competitors. A notable exception is the parkinsons dataset, for which FOSC-OpticsDend produces clusterings that score significantly higher on the DBCV score. Interestingly, the ARI of these clusterings is near zero. For parkinsons, the clusterings with the highest ARI score low on the internal measures. This does not necessarily imply that such a clustering does not identify any inherent structure (although this can be the case); it only means that it does not identify structure as it is defined by the silhouette score (i.e. spherical structure) or the DBCV score (i.e. density structure).
We conclude that, most of the time, COBS performs comparably to its competitors on the silhouette and DBCV scores. We also note that, while COBS selects a clustering solely based on how well it satisfies a given set of constraints, the clusterings from which it selects are all generated by unsupervised algorithms that did not have access to these constraints, and hence are sensible according to the bias of at least the algorithm that generated them.
Question Q6: the computational cost of COBS
The computational cost of COBS depends on the complexity of the unsupervised algorithms that are used to generate clusterings. Generating all clusterings took the longest, by far, for the faces dataset. For the identity target, it took ca. 5 h, due to the high dimensionality of the data and the many values of K that were tried (it was varied in [2, 30]). For most datasets, generating all clusterings was much faster. As the semi-supervised algorithms can be significantly slower than their unsupervised counterparts, generating all unsupervised clusterings was in several cases faster than doing several runs of a semi-supervised algorithm (which is usually required, as the number of clusters is not known beforehand). This was especially so for COSC, which can be much slower than unsupervised spectral clustering. For example, generating all unsupervised clusterings for engytime took 50 min (using scikit-learn implementations), whereas a single run of COSC alone took 28 min (using the MATLAB implementation available on the authors' web page).
The runtime of COBS can be reduced in several ways. The cluster generation step can easily be parallelized. For larger datasets, one might consider doing the algorithm and hyperparameter selection on a sample of the data, and afterwards cluster the complete dataset only once with the selected configuration.
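Since each (algorithm, hyperparameter) configuration is run independently, the generation step is embarrassingly parallel. The sketch below illustrates this with a small K-means grid and a thread pool; it is our own toy setup (a process pool or a compute cluster would be the natural choice for larger grids), not the paper's actual experimental code.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

def run_config(k, seed):
    # One configuration = one clustering; configurations do not
    # interact, so they can all run concurrently.
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)

configs = list(product(range(2, 6), range(3)))  # (k, random seed) grid
with ThreadPoolExecutor(max_workers=4) as pool:
    clusterings = list(pool.map(lambda cfg: run_config(*cfg), configs))
```

The resulting list of label vectors is exactly the set of clusterings from which COBS then selects; the sampling idea in the text would simply replace `X` by a subsample during this step.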
Finally, note that the added cost of doing algorithm and hyperparameter selection is no different from its comparable, and commonly accepted, cost in supervised learning. The focus is on maximally exploiting the limited amount of supervision: obtaining labels or constraints is often expensive, whereas computation is cheap.
Active COBS
Obtaining constraints can be costly, as they are often specified by human experts. Consequently, several methods have been proposed to actively select the most informative constraints (Basu and Mooney 2004; Mallapragada et al. 2008; Xiong et al. 2014). We first briefly discuss some of these methods, and subsequently present a constraint selection strategy for COBS.
Related work
Basu and Mooney (2004) were the first to propose an active constraint selection method for semi-supervised clustering. Their strategy is based on the construction of neighborhoods: sets of points that are known to belong to the same cluster because must-link constraints are defined between them. These neighborhoods are initialized in the exploration phase: K (the number of clusters) instances with cannot-link constraints between them are sought, by iteratively querying the relation between the existing neighborhoods and the point farthest from these neighborhoods. In the subsequent consolidation phase, these neighborhoods are expanded by iteratively querying a random point against the known neighborhoods until a must-link occurs and the right neighborhood is found. Mallapragada et al. (2008) extend this strategy by selecting the most uncertain points to query in the consolidation phase, instead of random ones. Note that in these approaches all constraints are queried before the actual clustering is performed.
More recently, Xiong et al. (2014) proposed the normalized point-based uncertainty (NPU) framework. Like the approach introduced by Mallapragada et al. (2008), NPU incrementally expands neighborhoods and uses an uncertainty-based principle to determine which pairs to query. In the NPU framework, however, the data is re-clustered several times, and at each iteration the current clustering is used to determine the next set of pairs to query. NPU can be used with any semi-supervised clustering algorithm, and Xiong et al. (2014) use it with MPCKMeans to experimentally demonstrate its superiority to the method of Mallapragada et al. (2008).
Active constraint selection in COBS
Like the approaches in Mallapragada et al. (2008) and Xiong et al. (2014), our constraint selection strategy for COBS is based on uncertainty sampling. Defining this uncertainty is straightforward within COBS, because a set of clusterings is available: a pair is more uncertain if more clusterings disagree on whether its two instances should be in the same cluster. Algorithm 2 presents a selection strategy based on this idea. We associate with each clustering c a weight \(w_c\) that depends on the number of constraints c was right or wrong about. In each iteration we query the pair with the lowest weighted agreement. The agreement of a pair (line 5 of the algorithm) is defined as the absolute value of the difference between the sum of the weights of the clusterings in which the instances of the pair belong to the same cluster, and the sum of the weights of the clusterings in which they belong to different clusters. The weights of clusterings that correctly “predict” the relation between a queried pair are multiplied by an update factor m; the weights of the other clusterings are divided by m. As the total number of pairwise constraints is quite large (\(\binom{n}{2}\), with n the number of instances), we only consider constraints from a small random sample P of all possible pairs.
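The two core operations of this strategy, picking the lowest-agreement pair and updating the weights, can be sketched as follows. This is a simplified rendering of the idea behind Algorithm 2 (function names and signatures are ours), not a line-by-line transcription of it:

```python
def query_most_uncertain(clusterings, weights, candidate_pairs):
    """Return the candidate pair with the lowest weighted agreement:
    |sum of weights of clusterings putting the pair together
     - sum of weights of clusterings separating it|."""
    def agreement(pair):
        i, j = pair
        same = sum(w for c, w in zip(clusterings, weights) if c[i] == c[j])
        diff = sum(w for c, w in zip(clusterings, weights) if c[i] != c[j])
        return abs(same - diff)
    return min(candidate_pairs, key=agreement)

def update_weights(clusterings, weights, pair, is_must_link, m=2.0):
    """Multiply the weight of every clustering that predicted the
    queried relation correctly by m; divide the others by m."""
    i, j = pair
    return [w * m if (c[i] == c[j]) == is_must_link else w / m
            for c, w in zip(clusterings, weights)]
```

Each query thus costs one pass over all clusterings per candidate pair, which matches the \(\mathcal{O}(CP)\) per-query cost reported later in this section.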
Experiments
We first demonstrate the influence of the weight update factor and sample size, and then compare our approach to active constraint selection with NPU (Xiong et al. 2014).
Effect of weight update factor and sample size
Our constraint selection strategy requires specifying a weight update factor m and a sample size s. Figure 5 shows the effect of these two parameters for the wine and dermatology datasets. First, the figure shows that the active strategy can significantly improve performance over random selection when only few constraints are used. For example, given five constraints, the random selection strategy on average chooses a clustering with an ARI of ca. 0.67, whereas the active strategy on average selects a clustering with an ARI of ca. 0.80 (for \(s=200\) and \(m=2\)). A similar boost in ARI is observed for dermatology. Second, the figure shows that smaller sample sizes tend to give better, and more stable, results. This can be explained by the occasional domination of poor-quality clusterings in the selection process: if there are more pairs to choose from, poor clusterings (which may have gotten lucky on the first few queries) have more opportunity to steer the search in their direction. This phenomenon is worse for large update factors, which can be seen by comparing the performance for \(m=1.05\) and \(m=2\) on the dermatology data, for a sample size of \(s=2000\).
In the remainder of this section we use a sample of 200 constraints (i.e. we try to choose the most useful constraints to ask from 200 possible queries), and set the weight update factor to 2.
Comparison to active selection with NPU
NPU (Xiong et al. 2014) can be used in combination with any semi-supervised clustering algorithm; we use the same ones as in the previous section. We do not include CVCP hyperparameter selection in these experiments because of its high computational cost (in these experiments we cannot cluster for several fixed numbers of constraints, as the choice of the next constraint depends on the current clustering). For the same reason, we only include the eigengap parameter selection method for the two largest datasets (optdigits389 and segmentation). The results are shown in Fig. 6. For the first 8 datasets, the conclusions are similar to those for the random setting: COBS consistently performs well. Also in the active setting, none of the approaches produces a clustering with a high ARI for glass. For hepatitis, however, MPCKMeans is able to find good clusterings while COBS is not, albeit only after a relatively large number of constraints (hepatitis contains 112 instances). This implies that, although the labels might not represent a natural grouping, the class structure does match the bias of MPCKMeans, and given many constraints the algorithm finds this structure.
Time complexity
We distinguish between the offline and online stages of COBS. In the offline stage, the set of clusterings is generated. As mentioned before, this took up to ca. 5 h (for the faces dataset). In the online stage, we select the most informative pairs and ask the user about their relation. Execution time is particularly important here, as this stage requires user interaction. In active COBS, selecting the next pair to query is \(\mathcal {O}(CP)\), as we have to loop over all clusterings (C) for each constraint in the sample (P). For the setup used in our experiments (\(C=931\), \(P=200\)), this always took less than 0.02 s. Note that this time does not depend on the size of the dataset (as all clusterings are generated beforehand). In contrast, NPU requires re-clustering the data several times during the constraint selection process, which is usually significantly more expensive. This means that if NPU is used in combination with an expensive algorithm, e.g. COSC, the user has to wait longer between questions.
Conclusion
The COBS approach allows for a straightforward definition of uncertainty: pairs of instances are more uncertain if more clusterings disagree on them. Selecting the most uncertain pairs first can significantly increase performance.
Conclusion
Exploiting constraints has been the subject of substantial research, but all existing methods use them within the clustering process of individual algorithms. In contrast, we propose to use them to choose between clusterings generated by different unsupervised algorithms, run with different hyperparameter settings. We experimentally show that this strategy is superior to all the semi-supervised algorithms it was compared to, which themselves are state of the art and representative of a wide range of algorithms. For the majority of the datasets, it works as well as the best among them, and on average it performs much better. The generated clusterings can also be used to select the most informative constraints first, which further improves performance.
Notes
 1.
One could argue that these points are not really separate clusters, but as all the evaluation criteria and quality indices assume a complete partitioning of the data, they need to be taken into account somehow, either as separate clusters or as part of a single “noise cluster”. The former seems most natural, but we also experimented with the latter. This did not affect any of our conclusions.
 2.
These datasets are obtained from https://github.com/deric/clustering-benchmark
References
Adam, A., & Blockeel, H. (2015). Dealing with overlapping clustering: A constraint-based approach to algorithm selection. In MetaSel workshop at ECML/PKDD, CEUR workshop proceedings (pp. 43–54).
Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Pérez, J. M., & Perona, I. (2013). An extensive comparative study of cluster validity indices. Pattern Recognition, 46(1), 243–256.
Ashtiani, H., Kushagra, S., & Ben-David, S. (2016). Clustering with same-cluster queries. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems (Vol. 29, pp. 3216–3224). Barcelona, Spain. https://papers.nips.cc/book/advances-in-neural-information-processing-systems-29-2016.
Bar-Hillel, A., Hertz, T., Shental, N., & Weinshall, D. (2003). Learning distance functions using equivalence relations. In Proceedings of the twentieth international conference on machine learning (pp. 11–18).
Basu, S., Bilenko, M., & Mooney, R. J. (2004). A probabilistic framework for semi-supervised clustering. In Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 59–68).
Basu, S., & Mooney, R. J. (2004). Active semi-supervision for pairwise constrained clustering. In Proceedings of the SIAM international conference on data mining (pp. 333–344). doi:10.1137/1.9781611972740.31.
Ben-David, S., von Luxburg, U., & Pál, D. (2006). A sober look at clustering stability. In Proceedings of the 19th annual conference on learning theory (pp. 5–19). doi:10.1007/11776420_4.
Ben-Hur, A., Elisseeff, A., & Guyon, I. (2002). A stability based method for discovering structure in clustered data. In Pacific symposium on biocomputing (pp. 6–17).
Bilenko, M., Basu, S., & Mooney, R. J. (2004). Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the 21st international conference on machine learning (pp. 81–88).
Brazdil, P. B., Soares, C., & da Costa, J. P. (2003). Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results. Machine Learning, 50(3), 251–277. doi:10.1023/A:1021713901879.
Campello, R. J. G. B., Moulavi, D., Zimek, A., & Sander, J. (2013). A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies. Data Mining and Knowledge Discovery, 27(3), 344–371. doi:10.1007/s10618-013-0311-4.
Caruana, R., Elhawary, M., & Nguyen, N. (2006). Meta clustering. In Proceedings of the international conference on data mining (pp. 107–118).
Davis, J. V., Kulis, B., Jain, P., Sra, S., & Dhillon, I. S. (2007). Information-theoretic metric learning. In Proceedings of the 24th international conference on machine learning (pp. 209–216). doi:10.1145/1273496.1273523.
de Souto, M. C. P., et al. (2008). Ranking and selecting clustering algorithms using a meta-learning approach. In IEEE international joint conference on neural networks. doi:10.1109/IJCNN.2008.4634333.
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the second international conference on knowledge discovery and data mining (pp. 226–231).
Estivill-Castro, V. (2002). Why so many clustering algorithms: A position paper. ACM SIGKDD Explorations Newsletter, 4, 65–75.
Färber, I., et al. (2010). On using class-labels in evaluation of clusterings. In Proceedings of the 1st international workshop on discovering, summarizing and using multiple clusterings (MultiClust 2010), in conjunction with the 16th ACM SIGKDD conference on knowledge discovery and data mining (KDD 2010).
Ferrari, D. G., & de Castro, L. N. (2015). Clustering algorithm selection by meta-learning systems: A new distance-based problem characterization and ranking combination methods. Information Sciences, 301, 181–194.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193–218. doi:10.1007/BF01908075.
Hutter, F., Hoos, H. H., & Leyton-Brown, K. (2011). Sequential model-based optimization for general algorithm configuration. In 5th international conference on learning and intelligent optimization (pp. 507–523). doi:10.1007/978-3-642-25566-3_40.
Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31, 651–666. doi:10.1016/j.patrec.2009.09.011.
Lange, T., Roth, V., Braun, M. L., & Buhmann, J. M. (2004). Stability-based validation of clustering solutions. Neural Computation, 16(6), 1299–1323. doi:10.1162/089976604773717621.
Lelis, L., & Sander, J. (2009). Semi-supervised density-based clustering. In IEEE international conference on data mining (pp. 842–847). doi:10.1109/ICDM.2009.143.
Mallapragada, P. K., Jin, R., & Jain, A. K. (2008). Active query selection for semi-supervised clustering. In Proceedings of the 19th international conference on pattern recognition. doi:10.1109/ICPR.2008.4761792.
Moulavi, D., Jaskowiak, P. A., Campello, R. J. G. B., Zimek, A., & Sander, J. (2014). Density-based clustering validation. In Proceedings of the 14th SIAM international conference on data mining.
Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Pourrajabi, M., Zimek, A., Moulavi, D., Campello, R. J. G. B., & Goebel, R. (2014). Model selection for semi-supervised clustering. In Proceedings of the 17th international conference on extending database technology.
Rangapuram, S. S., & Hein, M. (2012). Constrained 1-spectral clustering. In Proceedings of the 15th international conference on artificial intelligence and statistics.
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65. doi:10.1016/0377-0427(87)90125-7.
Ruiz, C., et al. (2007). C-DBSCAN: Density-based clustering with constraints. In Proceedings of the international conference on rough sets, fuzzy sets, data mining and granular computing (RSFDGrC 2007), LNCS 4481 (pp. 216–223).
Shental, N., Bar-Hillel, A., Hertz, T., & Weinshall, D. (2004). Computing Gaussian mixture models with EM using equivalence constraints. In Advances in neural information processing systems (Vol. 16).
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2015). Rethinking the inception architecture for computer vision. In Conference on computer vision and pattern recognition.
Thornton, C., Hutter, F., Hoos, H. H., & Leyton-Brown, K. (2013). Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining. doi:10.1145/2487575.2487629.
Van Craenendonck, T., & Blockeel, H. (2015). Using internal validity measures to compare clustering algorithms. In AutoML workshop at ICML 2015 (pp. 1–8). https://lirias.kuleuven.be/handle/123456789/504712.
Vendramin, L., Campello, R. J. G. B., & Hruschka, E. R. (2010). Relative clustering validity criteria: A comparative overview. Statistical Analysis and Data Mining, 3(4), 209–235. doi:10.1002/sam.10080.
von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395–416. doi:10.1007/s11222-007-9033-z.
von Luxburg, U. (2010). Clustering stability: An overview. Foundations and Trends in Machine Learning, 2(3), 235–274. doi:10.1561/2200000008.
von Luxburg, U., Williamson, R. C., & Guyon, I. (2014). Clustering: Science or art? In Workshop on unsupervised learning and transfer learning, JMLR workshop and conference proceedings (Vol. 27).
Wagstaff, K., Cardie, C., Rogers, S., & Schroedl, S. (2001). Constrained K-means clustering with background knowledge. In Proceedings of the eighteenth international conference on machine learning (pp. 577–584).
Wang, X., Qian, B., & Davidson, I. (2014). On constrained spectral clustering and its applications. Data Mining and Knowledge Discovery, 28(1), 1–30. doi:10.1007/s10618-012-0291-9. arXiv:1201.5338.
Xing, E. P., Ng, A. Y., Jordan, M. I., & Russell, S. (2003). Distance metric learning, with application to clustering with side-information. Advances in Neural Information Processing Systems, 15, 505–512.
Xiong, S., Azimi, J., & Fern, X. Z. (2014). Active learning of constraints for semi-supervised clustering. IEEE Transactions on Knowledge and Data Engineering, 26(1), 43–54. doi:10.1109/TKDE.2013.22.
Zelnik-Manor, L., & Perona, P. (2004). Self-tuning spectral clustering. Advances in Neural Information Processing Systems, 17, 1601–1608.
Acknowledgements
We thank the anonymous reviewers for their insightful comments, which helped to improve the quality of the paper. Toon Van Craenendonck is supported by the Agency for Innovation by Science and Technology in Flanders (IWT).
Additional information
Editors: Kurt Driessens, Dragi Kocev, Marko RobnikŠikonja, and Myra Spiliopoulou.
Van Craenendonck, T., & Blockeel, H. (2017). Constraint-based clustering selection. Machine Learning, 106, 1497–1521. https://doi.org/10.1007/s10994-017-5643-7
Keywords
 Constraint-based clustering
 Algorithm and hyperparameter selection
 Active constraint selection