Constraint-based clustering selection
Abstract
Clustering requires the user to define a distance metric, select a clustering algorithm, and set the hyperparameters of that algorithm. Getting these right, so that a clustering is obtained that meets the user's subjective criteria, can be difficult and tedious. Semi-supervised clustering methods make this easier by letting the user provide must-link or cannot-link constraints. These are then used to automatically tune the similarity measure and/or the optimization criterion. In this paper, we investigate a complementary way of using the constraints: they are used to select an unsupervised clustering method and tune its hyperparameters. It turns out that this very simple approach outperforms all existing semi-supervised methods. This implies that choosing the right algorithm and hyperparameter values is more important than modifying an individual algorithm to take constraints into account. In addition, the proposed approach allows for active constraint selection in a more effective manner than other methods.
Keywords
Constraint-based clustering · Algorithm and hyperparameter selection · Active constraint selection

1 Introduction
One of the core tasks in data mining is clustering: finding structure in data by identifying groups of instances that are highly similar (Jain 2010). We consider partitional clustering, in which every instance is assigned to exactly one cluster. To cluster data, a practitioner typically has to (1) define a similarity metric, (2) select a clustering algorithm, and (3) set the hyperparameters of that algorithm. A different choice in one of these steps usually leads to a different clustering. This variability is desirable, as it allows users with different interests to find different clusterings of the same data. In search of a clustering that matches their interests, a typical practitioner might for example tweak the distance metric, or try different clustering algorithms, each with several hyperparameter settings. Though common, this strategy can also be quite tedious. Manually tweaking the distance metric is hard, and so is selecting and tuning a clustering algorithm. As a result, a user may go through several iterations of this clustering pipeline before arriving at a satisfactory result.
Semi-supervised clustering methods (Wagstaff et al. 2001; Xing et al. 2003; Bilenko et al. 2004) avoid the need for such iterations by explicitly incorporating user feedback into the clustering process. Instead of interacting with the clustering system by manually tweaking parts of the pipeline, the user provides feedback in a much simpler and well-defined way. Often, feedback is given in the form of pairwise constraints: the user indicates for several pairs whether they should be in the same cluster (called a must-link constraint), or not (a cannot-link constraint). Specifying such constraints is much easier than constructing a good distance metric, or selecting and tuning a clustering algorithm. The former relies on simple expressions of domain knowledge, whereas the latter requires extensive expertise on metrics, algorithm biases, and the influence of hyperparameters. It is thus assumed that the user already has some understanding of the data, or has some knowledge about it that is not directly captured by the features, and wants to find a clustering that respects this understanding. There are also other ways in which constraints can help in clustering: for example, one can use them to find clusterings that score better on a particular unsupervised optimization objective (e.g. they can help to obtain a lower within-cluster sum of squares (Ashtiani et al. 2016)). This, however, is only useful if the user knows a priori which objective is most suited for the task at hand.
Traditional approaches to semi-supervised (or constraint-based) clustering use constraints in one of the following three ways. First, one can modify an existing clustering algorithm to take them into account (Wagstaff et al. 2001; Ruiz et al. 2007; Rangapuram and Hein 2012; Wang et al. 2014). Second, one can learn a new distance metric based on the constraints, after which the metric is used within a traditional clustering algorithm (Xing et al. 2003; Bar-Hillel et al. 2003; Davis et al. 2007). Third, one can combine these two approaches and develop so-called hybrid methods (Bilenko et al. 2004; Basu et al. 2004). All of these methods aim to exploit the given background knowledge within the boundaries of a single clustering algorithm, and as such they ignore the algorithm and hyperparameter selection steps in the pipeline outlined above.
In contrast, we propose to use constraints for exactly this: to select and tune an unsupervised clustering algorithm. Our approach is motivated by the fact that no single algorithm works best on all clustering problems (Estivill-Castro 2002): each algorithm comes with its own bias, which may match a particular problem to a greater or lesser degree. Further, it exploits the ability of algorithms to produce very different clusterings depending on their hyperparameter settings.
Our proposed approach is simple: to find an appropriate clustering, we first generate a set of clusterings using several unsupervised algorithms, with different hyperparameter settings, and afterwards select from this set the clustering that satisfies the largest number of constraints. Our experiments show that, surprisingly, this simple constraint-based selection method often yields better clusterings than existing semi-supervised methods. This leads to the key insight that it is more important to use an algorithm whose inherent bias matches a particular problem than to modify the optimization criterion of any individual algorithm to take the constraints into account.
Furthermore, we present a method for selecting the most informative constraints first, which further increases the usefulness of our approach. This selection strategy allows us to obtain good clusterings with fewer queries. Reducing the number of queries is important in many applications, as they are often answered by a user who has a limited time budget.
The remainder of this paper is structured as follows. In Sect. 2 we give some background on semi-supervised clustering, and algorithm and hyperparameter selection for clustering. Section 3 presents our approach to using pairwise constraints in clustering, which we call COBS (for Constraint-Based Selection). In Sect. 4 we describe how COBS can be extended to actively select informative constraints. We conclude in Sect. 5.
2 Background
Semi-supervised clustering algorithms allow the user to incorporate a limited amount of supervision into the clustering procedure. Several kinds of supervision have been proposed, one of the most popular being pairwise constraints. Must-link (ML) constraints indicate that two instances should be in the same cluster; cannot-link (CL) constraints indicate that they should be in different clusters. Most existing semi-supervised approaches use such constraints within the scope of an individual clustering algorithm. COP-KMeans (Wagstaff et al. 2001), for example, modifies the cluster assignment step of K-means: instances are assigned to the closest cluster for which the assignment does not violate any constraints. Similarly, the clustering procedures of DBSCAN (Ruiz et al. 2007; Lelis and Sander 2009; Campello et al. 2013), EM (Shental et al. 2004) and spectral clustering (Rangapuram and Hein 2012; Wang et al. 2014) have been extended to incorporate pairwise constraints. Another approach to semi-supervised clustering is to learn a distance metric based on the constraints (Xing et al. 2003; Bar-Hillel et al. 2003; Davis et al. 2007). Xing et al. (2003), for example, propose to learn a Mahalanobis distance by solving a convex optimization problem in which the distance between instances with a must-link constraint between them is minimized, while instances connected by a cannot-link constraint are simultaneously kept apart. Hybrid algorithms, such as MPCKMeans (Bilenko et al. 2004), combine metric learning with an adapted clustering procedure.
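To make the constrained assignment step concrete, it can be sketched as below. This is a minimal illustration of the COP-KMeans-style rule, not the cited implementation; the helper names `violates` and `constrained_assign` are ours.

```python
# Sketch of a COP-KMeans-style assignment rule: an instance goes to the
# nearest centroid whose cluster does not conflict with any known constraint.
# `ml` and `cl` are lists of (i, j) index pairs; `assign` maps already-placed
# instances to cluster ids. All names are illustrative.

def violates(i, cluster, assign, ml, cl):
    """True if putting instance i in `cluster` breaks a constraint."""
    for a, b in ml:
        other = b if a == i else a if b == i else None
        if other is not None and other in assign and assign[other] != cluster:
            return True  # must-link partner sits in a different cluster
    for a, b in cl:
        other = b if a == i else a if b == i else None
        if other is not None and other in assign and assign[other] == cluster:
            return True  # cannot-link partner sits in this cluster
    return False

def constrained_assign(i, centroid_dists, assign, ml, cl):
    """Return the closest feasible cluster for instance i, or None."""
    for cluster in sorted(range(len(centroid_dists)),
                          key=lambda c: centroid_dists[c]):
        if not violates(i, cluster, assign, ml, cl):
            return cluster
    return None  # COP-KMeans fails if no feasible cluster exists
```

If no cluster is feasible, COP-KMeans aborts, which is one of its known weaknesses.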
Meta-learning and algorithm selection have been studied extensively for supervised learning (Brazdil et al. 2003; Thornton et al. 2013), but much less for clustering. There is some work on building meta-learning systems that recommend clustering algorithms (Souto et al. 2008; Ferrari and de Castro 2015). However, these systems do not take hyperparameter selection into account, or any form of supervision. More related to ours is the work of Caruana et al. (2006). They generate a large number of clusterings using K-means and spectral clustering, and cluster these clusterings. This meta-clustering is presented to the user as a dendrogram. Here, we also generate a set of clusterings, but afterwards we select from that set the most suitable clustering based on pairwise constraints. The only other work, to our knowledge, that has explored the use of pairwise constraints for algorithm selection is that by Adam and Blockeel (2015). They define a meta-feature based on constraints, and use this feature to predict whether EM or spectral clustering will perform better for a dataset. While their meta-feature attempts to capture one specific property of the desired clusters, i.e. whether they overlap, our approach is more general and allows selection between any clustering algorithms.
Whereas algorithm selection has received little attention in clustering, several methods have been proposed for hyperparameter selection. One strategy is to run the algorithm with several parameter settings, and select the clustering that scores highest on an internal quality measure (Arbelaitz et al. 2013; Vendramin et al. 2010). Such measures try to capture the idea of a "good" clustering. A first limitation is that they are not able to deal with the inherent subjectivity of clustering, as they do not take any external information into account. Furthermore, internal measures are only applicable within the scope of individual clustering algorithms, as each of them comes with its own bias (von Luxburg et al. 2014). For example, the vast majority of them have a preference for spherical clusters, making them suitable for K-means, but not for e.g. spectral clustering and DBSCAN.
Another strategy for parameter selection in clustering is based on stability analysis (Ben-Hur et al. 2002; von Luxburg 2010; Ben-David et al. 2006). A parameter setting is considered stable if similar clusterings are produced with that setting when it is applied to several datasets from the same underlying model. These datasets can for example be obtained by taking subsamples of the original dataset (Ben-Hur et al. 2002; Lange et al. 2004). In contrast to internal quality measures, stability analysis does not require an explicit definition of what it means for a clustering to be good. Most studies on stability focus on selecting parameter settings within the scope of individual algorithms (in particular, often the number of clusters). As such, it is unclear to what extent stability can be used to choose between clusterings from very different clustering algorithms.
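The subsampling flavour of stability analysis can be sketched as follows: a setting is scored by how well clusterings of overlapping subsamples agree, measured here with the ARI on shared instances. This is a simplified illustration of the general idea, not any one published protocol.

```python
# Stability-based selection sketch: cluster pairs of random subsamples with
# the same setting and measure how much the clusterings agree on the
# instances the subsamples share. Higher mean agreement = more stable setting.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability(X, k, n_pairs=10, frac=0.8, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    scores = []
    for _ in range(n_pairs):
        a = rng.choice(n, int(frac * n), replace=False)
        b = rng.choice(n, int(frac * n), replace=False)
        la = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[a])
        lb = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[b])
        shared = np.intersect1d(a, b)               # instances in both subsamples
        ia = {idx: lab for idx, lab in zip(a, la)}
        ib = {idx: lab for idx, lab in zip(b, lb)}
        scores.append(adjusted_rand_score([ia[s] for s in shared],
                                          [ib[s] for s in shared]))
    return float(np.mean(scores))
```

On data with two well-separated groups, `stability(X, 2)` is typically near 1 while overly large k scores lower.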
Additionally, one can also avoid the need for explicit parameter selection. In self-tuning spectral clustering (Zelnik-Manor and Perona 2004), for example, the affinity matrix is constructed based on local statistics and the number of clusters is estimated using the structure of the eigenvectors of the Laplacian.
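The local-scaling affinity at the heart of self-tuning spectral clustering can be sketched as below: each point gets its own scale \(\sigma_i\) (its distance to the m-th neighbor), and the affinity is \(A_{ij} = \exp(-d_{ij}^2 / (\sigma_i \sigma_j))\). The function name is ours; m = 7 follows Zelnik-Manor and Perona.

```python
# Local-scaling affinity sketch (self-tuning spectral clustering):
# A_ij = exp(-d_ij^2 / (sigma_i * sigma_j)), with sigma_i the distance from
# point i to its m-th nearest neighbor.
import numpy as np

def local_scaling_affinity(X, m=7):
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise dists
    sigma = np.sort(d, axis=1)[:, min(m, len(X) - 1)]           # m-th neighbor
    A = np.exp(-d ** 2 / (np.outer(sigma, sigma) + 1e-12))
    np.fill_diagonal(A, 0.0)                                    # no self-affinity
    return A
```

The resulting matrix is symmetric and can be fed to any spectral clustering routine that accepts a precomputed affinity.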
A key distinction with COBS is that none of the above methods takes the subjective preferences of the user into account. We will compare our constraint-based selection strategy to some of them in the next section.
Pourrajabi et al. (2014) have introduced CVCP, a framework for using constraints for hyperparameter selection within the scope of individual semisupervised algorithms. A major difference is that COBS uses all constraints for selection (and none within the algorithm) and selects both an unsupervised algorithm and its hyperparameters (as opposed to only the hyperparameters of a semisupervised algorithm). We compare COBS to CVCP in the experiments in Sect. 3.
3 Constraint-based clustering selection
Algorithm and hyperparameter selection are difficult in an entirely unsupervised setting. This is mainly due to the lack of a well-defined way to estimate the quality of clustering results (Estivill-Castro 2002). We propose to use constraints for this purpose, and estimate the quality of a clustering as the number of constraints that it satisfies. This quality estimate allows us to do a search over unsupervised algorithms and their parameter settings, as described in Algorithm 1.
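As a rough sketch of this generate-and-select loop (Algorithm 1 itself is not reproduced here), using scikit-learn's unsupervised algorithms; the candidate grid and the constraint format below are illustrative assumptions, not the exact grid of the paper:

```python
# COBS selection sketch: generate clusterings with several unsupervised
# algorithms and settings, then keep the one satisfying the most constraints.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, SpectralClustering

def n_satisfied(labels, ml, cl):
    """Count satisfied must-link and cannot-link constraints."""
    return (sum(labels[i] == labels[j] for i, j in ml)
            + sum(labels[i] != labels[j] for i, j in cl))

def cobs(X, ml, cl):
    candidates = []
    for k in range(2, 11):
        candidates.append(
            KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
        candidates.append(
            SpectralClustering(n_clusters=k, random_state=0).fit_predict(X))
    for eps in np.linspace(0.1, 2.0, 10):
        candidates.append(DBSCAN(eps=eps, min_samples=5).fit_predict(X))
    # Select the candidate that satisfies the largest number of constraints.
    return max(candidates, key=lambda labels: n_satisfied(labels, ml, cl))
```

Ties are broken arbitrarily here; the experiments in Sect. 3 describe how ties are handled in practice.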
We now reiterate and clarify the motivations for COBS, which were briefly presented in the introduction. First, each clustering algorithm comes with a particular bias, and no single one performs best on all clustering problems (Estivill-Castro 2002). Existing semi-supervised approaches can change the bias of an unsupervised algorithm, but only to a certain extent. For instance, using constraints to learn a Mahalanobis distance allows K-means to find ellipsoidal clusters, rather than spherical ones, but still does not make it possible to find non-convex clusters. In contrast, by using constraints to choose between clusterings generated by very different algorithms, COBS aims to select the most suitable one from a diverse range of biases.
Second, it is widely known that, within a single clustering algorithm, the choice of hyperparameters can strongly influence the clustering result. Consequently, choosing a good parameter setting is crucial. Currently, a user can either do this manually, or use one of the selection strategies discussed in Sect. 2. Both options come with significant drawbacks. Tuning parameters manually is time-consuming, given the often large number of combinations one might try. Existing automated selection strategies avoid this manual labor, but can easily fail to select a good setting as they do not take the user's preferences into account. For COBS, parameters are an asset rather than a burden. They allow generating a large and diverse set of clusterings, from which we can select the most suitable solution with a limited number of pairwise constraints.
3.1 Research questions
Q1 How does COBS' hyperparameter selection compare to unsupervised hyperparameter selection?
Q2 How does unsupervised clustering with COBS' hyperparameter selection compare to semi-supervised clustering algorithms?
Q3 How does COBS, with both algorithm and hyperparameter selection, compare to existing semi-supervised algorithms?
Q4 Can we combine the best of both worlds, that is: use COBS with semi-supervised clustering algorithms rather than unsupervised ones?
Q5 How do the clusterings that COBS selects score on internal quality criteria?
Q6 What is the computational cost of COBS, compared to the alternatives?
Although our selection strategy is also related to meta-clustering (Caruana et al. 2006), an experimental comparison would be difficult, as meta-clustering produces a dendrogram of clusterings for the user to explore. The user can traverse this dendrogram to obtain a single clustering, but the outcome of this process is highly subjective. COBS works with pairwise constraints; we therefore compare to other methods that do the same.
3.2 Experimental methodology
1. Generate c pairwise constraints (c is a parameter) by repeatedly selecting two random instances from the training set, and adding a must-link constraint if they belong to the same class and a cannot-link constraint otherwise.
2. Apply COBS to the full dataset to obtain a clustering.
3. Evaluate the clustering by calculating the ARI on all objects from the validation fold.
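Step 1 of this protocol can be sketched as below; fold handling is simplified and the function name is illustrative.

```python
# Sketch of constraint generation from class labels: draw c random pairs from
# the training fold and label each pair must-link (same class) or cannot-link.
import numpy as np

def generate_constraints(y, train_idx, c, seed=0):
    """Return (ml, cl) lists of index pairs derived from labels y."""
    rng = np.random.default_rng(seed)
    ml, cl = [], []
    for _ in range(c):
        i, j = rng.choice(train_idx, size=2, replace=False)
        (ml if y[i] == y[j] else cl).append((int(i), int(j)))
    return ml, cl
```

The selected clustering is then scored against the ground-truth labels with `sklearn.metrics.adjusted_rand_score` on the validation fold.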
3.2.1 Datasets
Datasets used in the experiments

Dataset                  #instances  #features  #classes
2d4cno4                         863          2         4
curet24k                       4200          2         7
engytime                       4096          2         2
jain                            373          2         2
flame                           240          2         2
wine                            178         13         3
dermatology                     358         33         6
iris                            147          4         3
ionosphere                      350         34         2
breastcancerwisconsin           449         32         2
ecoli                           336          7         8
optdigits389                   1151         64         3
segmentation                   2100         19         7
glass                           214         10         7
hepatitis                       112         19         2
parkinsons                      195         13         2
column_2C                       310          6         2
faces identity                  624       2048        20
faces eyes                      624       2048         2
newsgroups sim3                2946         10         3
newsgroups diff3               2780         10         3
3.2.2 Unsupervised algorithms used in COBS
Algorithms used, the hyperparameters that were varied, their corresponding ranges, and the hyperparameter selection methods used in Q1

Algorithm  Param.         Range                    Selection method
K-means    K              [2, 10] or [2, 30]       Silhouette index
DBSCAN     \(\epsilon \)  \([\min (d), \max (d)]\) DBCV index
           minPts         [2, 20]
Spectral   K              [2, 10] or [2, 30]       Self-tuning spectral clustering
           k              [2, 20]
           \(\sigma \)    [0.01, 5.0]
3.3 Question Q1: COBS versus unsupervised hyperparameter tuning
To evaluate hyperparameter selection for individual algorithms, we use Algorithm 1 with C a set of clusterings generated using one particular algorithm (K-means, DBSCAN or spectral). We compare COBS to state-of-the-art unsupervised selection strategies. As there is no single method that can be used for all three algorithms, we use three different approaches, which are briefly described next.
3.3.1 Existing unsupervised hyperparameter selection methods
K-means has one hyperparameter: the number of clusters K. A popular way to select K in K-means is by using internal clustering quality measures (Vendramin et al. 2010; Arbelaitz et al. 2013). K-means is run for different values of K (and in this case also for different random seeds), and afterwards the clustering that scores highest on such an internal measure is chosen. In our setup, we generate 20 clusterings for each K by using different random seeds. We select the clustering that scores highest on the silhouette index (Rousseeuw 1987), which was identified as one of the best internal criteria by Arbelaitz et al. (2013).
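This silhouette-based selection can be sketched as follows; the seed counts and K range below are illustrative, not the exact experimental grid.

```python
# Unsupervised K selection sketch: run K-means for several K and seeds, keep
# the clustering with the highest silhouette index.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def select_kmeans_by_silhouette(X, k_range=range(2, 11), n_seeds=20):
    best, best_score = None, -1.0
    for k in k_range:
        for seed in range(n_seeds):
            labels = KMeans(n_clusters=k, n_init=1,
                            random_state=seed).fit_predict(X)
            score = silhouette_score(X, labels)
            if score > best_score:
                best, best_score = labels, score
    return best, best_score
```

Note that the silhouette index itself has a spherical bias, which is exactly why a separate criterion is needed for DBSCAN below.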
DBSCAN has two parameters: \(\epsilon \), which specifies how close points should be to be in the same neighborhood, and minPts, which specifies the number of points that are required in the neighborhood to be a core point. Most internal criteria are not suited for DBSCAN, as they assume spherical clusters, and one of the key characteristics of DBSCAN is that it can find clusters with arbitrary shape. One exception is the Density-Based Cluster Validation (DBCV) score (Moulavi et al. 2014), which we use in our experiments.
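The candidate grid for DBSCAN can be sketched as below; DBCV is not part of scikit-learn, so an external implementation of it (or the constraint-based criterion) is assumed for ranking the candidates. The grid resolution is illustrative.

```python
# Sketch of the DBSCAN candidate grid: eps is varied over the observed range
# of pairwise distances, minPts over [2, 20]. Ranking the resulting
# clusterings (by DBCV or by constraint satisfaction) is left to the caller.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

def dbscan_candidates(X, n_eps=10, minpts_range=range(2, 21, 6)):
    d = pairwise_distances(X)
    lo, hi = d[d > 0].min(), d.max()        # data-driven eps range
    out = []
    for eps in np.linspace(lo, hi, n_eps):
        for m in minpts_range:
            out.append(((float(eps), m),
                        DBSCAN(eps=eps, min_samples=m).fit_predict(X)))
    return out
```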
Table 3  ARIs obtained with unsupervised versus constraint-based hyperparameter selection (columns marked Q1), and with the semi-supervised algorithms (columns marked Q2)

                 K-means (Q1)   MPCKMeans (Q2)        DBSCAN (Q1)    FOSC (Q2)   Spectral (Q1)   COSC (Q2)
Dataset          SI     COBS    SI     NSat   CVCP    DBCV   COBS                STS    COBS     Eigen  NSat   CVCP
2d4cno4          0.97   0.93    0.82   0.74   0.78    0.38   0.38    0.99        0.99   0.99     0.99   0.69   0.91
curet24k         0.45   0.56    0.53   0.33   0.35    0.00   0.27    0.81        0.75   0.86     0.43   0.42   0.51
engytime         0.56   0.81    0.56   0.49   0.55    0.00   0.00    0.38        0.85   0.81     0.86   0.86   0.76
jain             0.22   0.50    0.46   0.54   0.61    0.87   0.87    0.94        0.23   1.00     1.00   1.00   0.94
flame            0.44   0.32    0.10   0.68   0.43    0.18   0.91    0.98        0.48   0.87     0.95   0.95   0.95
wine             0.86   0.85    0.81   0.72   0.69    0.30   0.32    0.53        0.89   0.93     0.54   0.54   0.75
dermatology      0.56   0.81    0.59   0.39   0.36    0.27   0.34    0.81        0.20   0.92     0.36   0.36   0.58
iris             0.55   0.69    0.67   0.81   0.54    0.54   0.43    0.80        0.55   0.78     0.90   0.39   0.68
ionosphere       0.28   0.23    0.21   0.20   0.18    0.05   0.49   -0.02        0.28   0.19     0.26   0.26   0.20
breastcancer     0.73   0.73    0.73   0.71   0.73    0.29   0.32    0.53        0.81   0.78     0.76   0.76   0.76
ecoli            0.04   0.60    0.70   0.57   0.57    0.06   0.51    0.54        0.04   0.69     0.60   0.41   0.57
optdigits389     0.49   0.79    0.58   0.24   0.54    0.00   0.30    0.53        0.38   0.96     0.54   0.54   0.85
segmentation     0.10   0.59    0.43   0.22   0.33    0.24   0.49    0.61        0.23   0.58     0.19   0.19   0.27
hepatitis        0.25   0.34    0.25   0.23   0.25    0.02   0.08    0.23       -0.08   0.05     0.20   0.20   0.20
glass            0.26   0.25    0.31   0.28   0.32    0.01   0.19    0.32        0.23   0.23     0.11   0.11   0.16
parkinsons      -0.07   0.03   -0.04   0.01  -0.02    0.02  -0.03   -0.10        0.09   0.12     0.16   0.12   0.07
column_2C        0.17   0.13    0.22   0.17   0.22    0.02  -0.02   -0.03        0.17   0.10     0.22   0.22   0.27
faces identity   0.66   0.40    0.71   0.03   0.15    0.00   0.49    0.65        0.23   0.65     0.07   0.07   0.25
faces eyes       0.15   0.59    0.08   0.60   0.55    0.00   0.00    0.00        0.13   0.49     0.01   0.01   0.00
news sim3        0.12   0.18    0.12   0.04   0.08    0.00   0.02    0.02        0.00   0.13     0.09   0.00   0.14
news diff3       0.28   0.56    0.25   0.23   0.34    0.00   0.24    0.48        0.00   0.56     0.15   0.00   0.14
3.3.2 Results and conclusion
The columns of Table 3 marked with Q1 compare the ARIs obtained with the unsupervised approaches to those obtained with COBS. Most of the time the constraint-based selection strategy performs better, and often by a large margin. Note for example the large difference for ionosphere: DBSCAN is able to produce a good clustering, but only the constraint-based approach recognizes it as the best one. When the unsupervised selection method performs better, the difference is usually small. We conclude that the internal measures often do not match the clusterings that are indicated by the class labels. Constraints provide useful information that can help select a good parameter setting.
3.4 Question Q2: COBS versus semi-supervised algorithms
It is not too surprising that COBS outperforms unsupervised hyperparameter selection, since it has access to more information. We now compare to semi-supervised algorithms, which have access to the same information.
3.4.1 Existing semi-supervised algorithms

MPCKMeans (Bilenko et al. 2004) is a hybrid semi-supervised extension of K-means. It minimizes an objective that combines the within-cluster sum of squares with the cost of violating constraints. This objective is greedily minimized using a procedure based on K-means. Besides a modified cluster assignment step and the usual cluster center re-estimation step, this procedure also adapts an individual metric associated with each cluster in each iteration. We use the implementation available in the WekaUT package.^{3}

FOSC-OpticsDend (Campello et al. 2013) is a semi-supervised extension of OPTICS, which is in turn based on ideas similar to DBSCAN. The first step of this algorithm is to run the unsupervised OPTICS algorithm and to construct a dendrogram from its output. The FOSC framework is then used to extract a flat clustering from this dendrogram that is optimal w.r.t. the given constraints.

COSC (Rangapuram and Hein 2012) is based on spectral clustering, but optimizes an objective that combines the normalized cut with a penalty for constraint violation. We use the implementation available on the authors' web page.^{4}

NSat: We run the algorithms for multiple K, and select the clustering that violates the smallest number of constraints. In case of a tie, we choose the solution with the lowest number of clusters.

CVCP: Cross-Validation for finding Clustering Parameters (Pourrajabi et al. 2014) is a cross-validation procedure for semi-supervised clustering. The set of constraints is divided into n independent folds. To evaluate a parameter setting, the algorithm is repeatedly run on the entire dataset given the constraints in \(n-1\) folds, keeping aside the nth fold as a test set. The clustering that is produced given the constraints in the \(n-1\) folds is then considered as a classifier that distinguishes between must-link and cannot-link constraints in the nth fold. The F-measure is used to evaluate the score of this classifier. The performance of the parameter setting is then estimated as the average F-measure over all test folds. This process is repeated for all parameter settings, and the one resulting in the highest average F-measure is retained. The algorithm is then run with this parameter setting using all constraints to produce the final clustering. We use five-fold cross-validation.
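The two selection procedures can be sketched as follows. `run_algo` stands in for running a semi-supervised algorithm with a given constraint subset, and the function names are ours; the CVCP sketch treats the clustering trained on the other folds as a classifier of the held-out pairs and scores it with the F-measure, as described above.

```python
# NSat: pick the candidate clustering that violates the fewest constraints,
# breaking ties in favor of fewer clusters. CVCP: score one parameter setting
# by cross-validation over the constraint set.
import numpy as np
from sklearn.metrics import f1_score

def nsat_select(candidates, ml, cl):
    def key(labels):
        violated = (sum(labels[i] != labels[j] for i, j in ml)
                    + sum(labels[i] == labels[j] for i, j in cl))
        return (violated, len(set(labels)))
    return min(candidates, key=key)

def cvcp_score(run_algo, constraints, n_folds=5):
    """constraints: list of (i, j, is_ml); run_algo(train) -> labels."""
    folds = np.array_split(np.arange(len(constraints)), n_folds)
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train = [c for t, c in enumerate(constraints) if t not in held_out]
        test = [constraints[t] for t in test_idx]
        labels = run_algo(train)
        y_true = [is_ml for _, _, is_ml in test]              # ML = positive
        y_pred = [labels[i] == labels[j] for i, j, _ in test]
        scores.append(f1_score(y_true, y_pred))
    return float(np.mean(scores))
```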
3.4.2 Results and conclusion
The columns in Table 3 marked with Q2 show the ARIs obtained with the semi-supervised algorithms. The table shows that in several cases it is more advantageous to use the constraints to optimize the hyperparameters of the unsupervised algorithm (as COBS does). In other cases, it is better to use the constraints within the algorithm itself, to perform a more informed search (as the semi-supervised variants do). Within the scope of a single clustering algorithm, neither strategy consistently outperforms the other. For example, if we use spectral clustering on the dermatology data, it is better to use the constraints for tuning the hyperparameters of unsupervised spectral clustering (also varying k and \(\sigma \) for constructing the similarity matrix) than within COSC, its semi-supervised variant (which uses local scaling for this). In contrast, if we use density-based clustering on the same data, it is better to use constraints in FOSC-OpticsDend (which does not have an \(\epsilon \) parameter, and for which minPts is set to 4, a value commonly used in the literature (Ester et al. 1996; Campello et al. 2013)) than to use them to tune the hyperparameters of DBSCAN (varying both \(\epsilon \) and minPts).
3.5 Question Q3: COBS with multiple unsupervised algorithms
In the previous two subsections, we showed that constraints can be useful to tune the hyperparameters of individual algorithms. Table 3 also shows, however, that no single algorithm (unsupervised or semi-supervised) performs well on all datasets. This motivates the use of COBS to select not only hyperparameters, but also the clustering algorithm. In this subsection we again use Algorithm 1, but the set C in step 1 now includes clusterings produced by any of the three unsupervised algorithms.
3.5.1 Results
We compare COBS with existing semi-supervised algorithms in Fig. 2. For the majority of datasets, COBS produces clusterings that are on par with, or better than, those produced by the best competitor. While some other approaches also do well on some of the datasets, none of them do so consistently. Compared to each competitor individually, COBS is clearly superior. For example, COSC-NSat outperforms COBS on the breastcancerwisconsin dataset, but performs much worse on several others. The only datasets for which COBS performs significantly worse than its competitors are column_2C and hepatitis.
The ARI of the best clustering generated by any of the unsupervised algorithms, the ARI of the clustering selected after 50 constraints (averaged over five cross-validation folds), and the algorithms that produced the selected clusterings (K: K-means, D: DBSCAN, S: spectral)

Dataset                 Best unsupervised   COBS    Algorithm used
2d4cno4                 1.00                0.99    K:1/D:0/S:4
curet24k                0.94                0.86    K:0/D:0/S:5
engytime                0.82                0.81    K:1/D:0/S:4
jain                    1.00                0.95    K:0/D:4/S:1
flame                   0.93                0.88    K:0/D:2/S:3
wine                    0.93                0.90    K:0/D:0/S:5
dermatology             0.94                0.86    K:2/D:0/S:3
iris                    0.88                0.78    K:0/D:0/S:5
ionosphere              0.56                0.48    K:0/D:5/S:0
breastcancerwisconsin   0.84                0.69    K:0/D:1/S:4
ecoli                   0.75                0.71    K:1/D:0/S:4
optdigits389            0.97                0.95    K:0/D:0/S:5
segmentation            0.61                0.57    K:0/D:0/S:5
hepatitis               0.32                0.08    K:0/D:5/S:0
glass                   0.33                0.23    K:1/D:0/S:4
parkinsons              0.34                0.08    K:0/D:3/S:2
column_2C               0.27                0.05    K:0/D:3/S:2
faces identity          0.80                0.49    K:2/D:0/S:3
faces eyes              0.66                0.59    K:5/D:0/S:0
news sim3               0.28                0.17    K:3/D:0/S:2
news diff3              0.63                0.58    K:2/D:0/S:3
3.5.2 Results on artificial datasets
Figure 3 shows the clusterings that are produced for the artificial datasets given 50 constraints. COBS performs on par with the best competitor for all of these. An interesting observation can be made for flame (shown in the last row). For this dataset, COBS selects a solution consisting of two clusters and 11 additional noise points (which are considered as separate clusters in computing the ARI). This clustering is produced by DBSCAN, which identifies the points shown as green crosses as noise. In this case, no constraints were defined on these points. The clustering that is shown satisfies all given constraints, and was selected randomly from all clusterings that did so. Giving the correct number of clusters (as was done for COSC and MPCKMeans for Fig. 3) and not allowing noise points would result in COBS selecting clusterings that are highly similar to those generated by COSC and FOSC.
Besides COBS, COSC also attains high ARIs for all five artificial datasets, but only if it is given the correct number of clusters. Without the right number of clusters, COSC produces much worse clusterings on curet24k, with ARIs of 0.43 (for the eigengap selection method), 0.42 (for NSat) and 0.51 (for CVCP), as listed in Table 3.
Another interesting observation can be made for MPCKMeans, for example on the flame dataset (shown in the last row). Using constraints does not allow MPCKMeans to escape its inherent spherical bias: points connected by a must-link constraint are placed in the same cluster, but the overall cluster shape cannot be correctly identified. This is visible in the plot, where for instance some red points occur in the inner cluster.
3.5.3 Conclusion
If any of the unsupervised algorithms is able to produce good clusterings, COBS can select them using a limited number of constraints. If not, COBS performs poorly, but in our experiments none of the algorithms did well in this case. We conclude that it is generally better to use constraints to select and tune an unsupervised algorithm, than within a randomly chosen semi-supervised algorithm.
3.6 Question Q4: using COBS with semi-supervised algorithms
In the previous section we have shown that we can use constraints to do algorithm and hyperparameter selection for unsupervised algorithms. On the other hand, constraints can also be useful when used within an adapted clustering procedure, as traditional semi-supervised algorithms do. This raises the question: can we combine both approaches? In this section, we use the constraints to select and tune a semi-supervised clustering algorithm. In particular, we vary the hyperparameters of the semi-supervised algorithms to generate the set of clusterings from which we select. The varied hyperparameters are the same as those for their unsupervised variants, except for two. First, \(\epsilon \) is not varied for FOSC-OpticsDend, as it is not a hyperparameter of that algorithm. Second, in this section we only use k-nearest neighbor graphs for (semi-supervised) spectral clustering, as full similarity graphs lead to long execution times for COSC.
3.6.1 Results and conclusions
Column 3 of Table 5 shows that this strategy does not produce better results. This is caused by using the same constraints twice: once within the semi-supervised algorithms, and once to evaluate the algorithms and select the best-performing one. Algorithms that overfit the given constraints are thus likely to be selected.
The problem could be alleviated by using separate constraints inside the algorithm and for evaluation, but this decreases the number of constraints that can effectively be used for either purpose. Column 4 of Table 5 shows the average ARIs that are obtained if we use half of the constraints within the semi-supervised algorithms, and half to select one of the generated clusterings afterwards. This works better, but still often not as well as COBS with unsupervised algorithms.
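The split protocol above can be sketched as follows; `split_constraints` is an illustrative helper, and the half that is held out would then rank the clusterings produced by the semi-supervised algorithms.

```python
# Sketch of the constraint split used by COBS-SS-split: half of the
# constraints are fed to the semi-supervised algorithms, the other half are
# reserved for selecting among the resulting clusterings.
import numpy as np

def split_constraints(constraints, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(constraints))
    half = len(constraints) // 2
    inside = [constraints[i] for i in idx[:half]]   # used within the algorithm
    select = [constraints[i] for i in idx[half:]]   # used for selection
    return inside, select
```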
Table 5  ARIs obtained with 50 constraints by COBS with unsupervised algorithms (COBS-U) and with semi-supervised algorithms, with and without splitting the constraint set (COBS-SS and COBS-SS-split)

Dataset                 COBS-U   COBS-SS   COBS-SS-split
2d4cno4                 0.98     0.64      1.00
curet24k                0.86     0.39      0.83
engytime                0.80     0.15      0.78
jain                    0.97     0.75      0.97
flame                   0.91     0.45      0.93
wine                    0.86     0.66      0.86
dermatology             0.89     0.74      0.77
iris                    0.78     0.73      0.73
ionosphere              0.53     0.26      0.19
breastcancerwisconsin   0.69     0.59      0.68
ecoli                   0.69     0.46      0.64
optdigits389            0.96     0.58      0.93
segmentation            0.55     0.37      0.64
hepatitis               0.08     0.14      0.27
glass                   0.23     0.20      0.24
parkinsons              0.05     0.04      0.08
column_2C               0.02     0.24      0.03
faces identity          0.45     0.30      0.64
faces eyes              0.60     0.16      0.59
news sim3               0.18     0.09      0.15
news diff3              0.60     0.34      0.42
3.7 Question Q5: evaluating the clusterings on internal criteria
In the previous research questions we investigated how well the clusterings produced by COBS score according to the ARI, an external evaluation measure. This is motivated by the fact that our main goal is to find clusterings that are aligned with the user's interests, which are assumed to be captured by the class labels. In this section we investigate how well these clusterings score on internal measures, which are computed using only the data itself. Such internal measures capture characteristics one might expect of a good clustering, such as high intra-cluster similarity and low inter-cluster similarity. In particular, we want to know to what extent focusing on satisfying constraints makes us compromise on quality according to internal criteria.
Ideally, we would use an internal measure that is not biased towards any particular clustering algorithm. However, no such measure exists (Estivill-Castro 2002): each internal quality measure comes with its own bias, which may match the bias of a clustering algorithm to a greater or lesser degree. As a result, choosing a suitable internal quality criterion is often as difficult as choosing the right clustering algorithm. For example, the large majority of internal measures have a strong spherical bias (Vendramin et al. 2010; Arbelaitz et al. 2013), making them well suited for use with k-means, but not with spectral clustering or DBSCAN.
In this section, we investigate the trade-off between the ARI and two internal measures: the silhouette index (SI) (Rousseeuw 1987) and the density-based cluster validation (DBCV) score (Moulavi et al. 2014), both of which were also used in answering the previous research questions. The SI was chosen because it is well-known, and the extensive studies by Arbelaitz et al. (2013) and Vendramin et al. (2010) identify it as one of the best-performing measures. The DBCV score was chosen because it is one of the few internal measures without a spherical bias; instead, it is based on the within- and between-cluster density connectedness of clusters. Although it avoids a spherical bias, the DBCV score comes with its own limitations; for example, it is strongly influenced by noise and biased towards imbalanced clusterings (Van Craenendonck and Blockeel 2015). Both measures range in \([-1,1]\), with higher values being better.
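To illustrate what such an internal score looks like in practice, the snippet below computes the silhouette index with scikit-learn on toy data; the DBCV score has no scikit-learn implementation and is therefore omitted here. The data and parameter choices are purely illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# two well-separated spherical blobs: a clustering that recovers them
# should get a silhouette index close to 1
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [5, 5]],
                  cluster_std=0.5, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

si = silhouette_score(X, labels)  # lies in [-1, 1], higher is better
```

Because the silhouette index rewards compact, well-separated (i.e. roughly spherical) clusters, it scores high here; on the density-shaped clusters discussed elsewhere in the paper, a clustering with a high ARI could nevertheless score low on it.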
Figure 4 shows how well the semi-supervised methods score on the internal measures for six datasets. In most cases, COBS performs comparably to its competitors. A notable exception is the parkinsons dataset, for which FOSC-OpticsDend produces clusterings that score significantly higher on both internal measures. Interestingly, the ARI of these clusterings is near zero. For parkinsons, the clusterings with the highest ARI score low on the internal measures. This does not necessarily imply that the clustering fails to identify any inherent structure (although this can be the case); it only means that it does not identify structure as defined by the silhouette score (i.e. spherical structure) or the DBCV score (i.e. density structure).
3.8 Question Q6: the computational cost of COBS
The computational cost of COBS depends on the complexity of the unsupervised algorithms used to generate clusterings. Generating all clusterings took by far the longest for the faces dataset. For the identity target, it took ca. 5 h, due to the high dimensionality of the data and the many values of K that were tried (K was varied in [2, 30]). For most datasets, generating all clusterings was much faster. As the semi-supervised algorithms can be significantly slower than their unsupervised counterparts, generating all unsupervised clusterings was in several cases faster than doing several runs of a semi-supervised algorithm (several runs are usually required, as the number of clusters is not known beforehand). This was especially so for COSC, which can be much slower than unsupervised spectral clustering. For example, generating all unsupervised clusterings for engytime took 50 min (using scikit-learn implementations), whereas a single run of COSC alone took 28 min (using the Matlab implementation available on the authors' web page).
The runtime of COBS can be reduced in several ways. The cluster generation step can easily be parallelized. For larger datasets, one might consider doing the algorithm and hyperparameter selection on a sample of the data, and afterwards cluster the complete dataset only once with the selected configuration.
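The offline generation step amounts to enumerating configurations and keeping every resulting clustering. A minimal sketch using scikit-learn, with hyperparameter grids chosen for illustration rather than the grids used in the paper:

```python
from itertools import product
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# enumerate (algorithm, hyperparameter) configurations and keep every
# resulting clustering; constraints are only used afterwards, to select
clusterings = []
for k in range(2, 6):                                   # 4 K-means configs
    clusterings.append(
        KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
for eps, min_pts in product([0.3, 0.5, 1.0], [3, 5]):   # 6 DBSCAN configs
    clusterings.append(
        DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X))
```

Each iteration of these loops is independent of the others, which is why the generation step parallelizes trivially (e.g. one configuration per worker); the sampling idea mentioned above would simply replace X by a subsample during this step.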
Finally, note that the added cost of doing algorithm and parameter selection is no different from its comparable, and commonly accepted, cost in supervised learning. The focus is on maximally exploiting the limited amount of supervision, as obtaining labels or constraints is often expensive, whereas computation is cheap.
4 Active COBS
Obtaining constraints can be costly, as they are often specified by human experts. Consequently, several methods have been proposed to actively select the most informative constraints (Basu and Mooney 2004; Mallapragada et al. 2008; Xiong et al. 2014). We first briefly discuss some of these methods, and subsequently present a constraint selection strategy for COBS.
4.1 Related work
Basu and Mooney (2004) were the first to propose an active constraint selection method for semi-supervised clustering. Their strategy is based on the construction of neighborhoods: sets of points that are known to belong to the same cluster because must-link constraints are defined between them. These neighborhoods are initialized in the exploration phase: K (the number of clusters) instances with cannot-link constraints between them are sought, by iteratively querying the relation between the existing neighborhoods and the point farthest from these neighborhoods. In the subsequent consolidation phase, these neighborhoods are expanded by iteratively querying a random point against the known neighborhoods until a must-link constraint occurs and the right neighborhood is found. Mallapragada et al. (2008) extend this strategy by selecting the most uncertain points to query in the consolidation phase, instead of random ones. Note that in these approaches all constraints are queried before the actual clustering is performed.
More recently, Xiong et al. (2014) proposed the normalized point-based uncertainty (NPU) framework. Like the approach introduced by Mallapragada et al. (2008), NPU incrementally expands neighborhoods and uses an uncertainty-based principle to determine which pairs to query. In the NPU framework, however, the data is re-clustered several times, and at each iteration the current clustering is used to determine the next set of pairs to query. NPU can be used with any semi-supervised clustering algorithm, and Xiong et al. (2014) use it with MPCKMeans to experimentally demonstrate its superiority to the method of Mallapragada et al. (2008).
4.2 Active constraint selection in COBS
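A minimal sketch of an uncertainty-driven selection loop consistent with the quantities discussed in the experiments below (the weight update factor m, the sample of s candidate pairs, and uncertainty as disagreement among the generated clusterings). The function names and the exact multiplicative update rule are assumptions for illustration, not the paper's exact procedure.

```python
def most_uncertain_pair(clusterings, weights, candidate_pairs):
    """Pick the pair the weighted set of clusterings disagrees on most.

    A pair's uncertainty is highest when the weighted vote for "these two
    points belong to the same cluster" is closest to 0.5. In practice,
    candidate_pairs would be a random sample of s pairs.
    """
    total = float(sum(weights))
    best_pair, best_unc = None, -1.0
    for i, j in candidate_pairs:
        together = sum(w for w, labels in zip(weights, clusterings)
                       if labels[i] == labels[j]) / total
        unc = 1.0 - 2.0 * abs(together - 0.5)  # 1 = full disagreement
        if unc > best_unc:
            best_pair, best_unc = (i, j), unc
    return best_pair

def update_weights(clusterings, weights, pair, is_must_link, m=2.0):
    """Multiply the weight of every clustering consistent with the answer by m."""
    i, j = pair
    return [w * m if (labels[i] == labels[j]) == is_must_link else w
            for w, labels in zip(weights, clusterings)]

# toy run: three clusterings of four points, uniform initial weights
clusterings = [[0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 0, 1]]
weights = [1.0, 1.0, 1.0]
pair = most_uncertain_pair(clusterings, weights, [(0, 1), (0, 3)])
weights = update_weights(clusterings, weights, pair, is_must_link=True)
```

In this toy run the clusterings unanimously separate points 0 and 3, so that pair is uninformative; the pair (0, 1) splits the vote and is queried first, after which the clusterings agreeing with the (must-link) answer gain weight.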
4.3 Experiments
We first demonstrate the influence of the weight update factor and sample size, and then compare our approach to active constraint selection with NPU (Xiong et al. 2014).
4.3.1 Effect of weight update factor and sample size
Our constraint selection strategy requires specifying a weight update factor m and a sample size s. Figure 5 shows the effect of these two parameters for the wine and dermatology datasets. First, the figure shows that the active strategy can significantly improve performance over random selection when only few constraints are used. For example, given five constraints, the random selection strategy on average chooses a clustering with an ARI of ca. 0.67, whereas the active strategy on average selects a clustering with an ARI of ca. 0.80 (for \(s=200\) and \(m=2\)). A similar boost in ARI is observed for dermatology. Second, the figure shows that smaller sample sizes tend to give better, and more stable, results. This can be explained by the occasional domination of poor-quality clusterings in the selection process: if there are more pairs to choose from, poor clusterings (which may have gotten lucky on the first few queries) have more opportunity to steer the search in their direction. This phenomenon is worse for large update factors, as can be seen by comparing the performance for \(m=1.05\) and \(m=2\) on the dermatology data, for a sample size of \(s=2000\).
4.3.2 Comparison to active selection with NPU
NPU (Xiong et al. 2014) can be used in combination with any semi-supervised clustering algorithm; we use the same ones as in the previous section. We do not include CVCP hyperparameter selection in these experiments, because of its high computational complexity (in these experiments we cannot cluster for several fixed numbers of constraints, as the choice of the next constraints depends on the current clustering). For the same reason we only include the EigenGap parameter selection method for the two largest datasets (optdigits389 and segmentation) in these experiments. The results are shown in Fig. 6. For the first 8 datasets, the conclusions are similar to those for the random setting: COBS consistently performs well. Also in the active setting, none of the approaches produces a clustering with a high ARI for glass. For hepatitis, however, MPCKMeans is able to find good clusterings while COBS is not, albeit only after a relatively large number of constraints (hepatitis contains 112 instances). This implies that, although the labels might not represent a natural grouping, the class structure does match the bias of MPCKMeans, and given many constraints the algorithm finds this structure.
4.3.3 Time complexity
We distinguish between the offline and online stages of COBS. In the offline stage, the set of clusterings is generated. As mentioned before, this took up to ca. 5 h (for the faces dataset). In the online stage, we select the most informative pairs and ask the user about their relation. Execution time is particularly important here, as this stage requires user interaction. In active COBS, selecting the next pair to query is \(\mathcal {O}(CP)\), as we have to loop through all clusterings (C) for each constraint in the sample (P). For the setup used in our experiments (\(C=931\), \(P=200\)), this was always less than 0.02s. Note that this time does not depend on the size of the dataset (as all clusterings are generated beforehand). In contrast, NPU requires reclustering the data several times during the constraint selection process, which is usually significantly more expensive. This means that if NPU is used in combination with an expensive algorithm, e.g. COSC, the user has to wait longer between questions.
4.3.4 Conclusion
The COBS approach allows for a straightforward definition of uncertainty: pairs of instances are more uncertain if more clusterings disagree on them. Selecting the most uncertain pairs first can significantly increase performance.
5 Conclusion
Exploiting constraints has been the subject of substantial research, but all existing methods use them within the clustering process of individual algorithms. In contrast, we propose to use them to choose between clusterings generated by different unsupervised algorithms, run with different hyperparameter settings. We experimentally show that this strategy is superior to all the semi-supervised algorithms it was compared to, which are themselves state of the art and representative of a wide range of algorithms. For the majority of the datasets, it works as well as the best among them, and on average it performs much better. The generated clusterings can also be used to select the most informative constraints first, which further improves performance.
Footnotes
 1.
One could argue that these points are not really separate clusters, but as all the evaluation criteria and quality indices assume a complete partitioning of the data, they need to be taken into account somehow, either as separate clusters or as part of a single “noise cluster”. The former seems most natural, but we also experimented with the latter. This did not affect any of our conclusions.
 2.
These datasets are obtained from https://github.com/deric/clustering-benchmark.
Notes
Acknowledgements
We thank the anonymous reviewers for their insightful comments, which helped to improve the quality of the paper. Toon Van Craenendonck is supported by the Agency for Innovation by Science and Technology in Flanders (IWT).
References
Adam, A., & Blockeel, H. (2015). Dealing with overlapping clustering: A constraint-based approach to algorithm selection. In MetaSel workshop at ECML/PKDD, CEUR workshop proceedings (pp. 43–54).
Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Pérez, J. M., & Perona, I. (2013). An extensive comparative study of cluster validity indices. Pattern Recognition, 46(1), 243–256.
Ashtiani, H., Kushagra, S., & Ben-David, S. (2016). Clustering with same-cluster queries. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems (Vol. 29, pp. 3216–3224). Barcelona, Spain. https://papers.nips.cc/book/advances-in-neural-information-processing-systems-29-2016.
Bar-Hillel, A., Hertz, T., Shental, N., & Weinshall, D. (2003). Learning distance functions using equivalence relations. In Proceedings of the twentieth international conference on machine learning (pp. 11–18).
Basu, S., Bilenko, M., & Mooney, R. J. (2004). A probabilistic framework for semi-supervised clustering. In Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 59–68).
Basu, S., & Mooney, R. J. (2004). Active semi-supervision for pairwise constrained clustering. In Proceedings of the SIAM international conference on data mining (pp. 333–344). doi: 10.1137/1.9781611972740.31.
Ben-David, S., von Luxburg, U., & Pál, D. (2006). A sober look at clustering stability. In Proceedings of the 19th annual conference on learning theory (pp. 5–19). doi: 10.1007/11776420_4.
Ben-Hur, A., Elisseeff, A., & Guyon, I. (2002). A stability based method for discovering structure in clustered data. In Pacific symposium on biocomputing (pp. 6–17).
Bilenko, M., Basu, S., & Mooney, R. J. (2004). Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the 21st international conference on machine learning (pp. 81–88).
Brazdil, P. B., Soares, C., & da Costa, J. P. (2003). Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results. Machine Learning, 50(3), 251–277. doi: 10.1023/A:1021713901879.
Campello, R. J. G. B., Moulavi, D., Zimek, A., & Sander, J. (2013). A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies. Data Mining and Knowledge Discovery, 27(3), 344–371. doi: 10.1007/s10618-013-0311-4.
Caruana, R., Elhawary, M., & Nguyen, N. (2006). Meta clustering. In Proceedings of the international conference on data mining (pp. 107–118).
Davis, J. V., Kulis, B., Jain, P., Sra, S., & Dhillon, I. S. (2007). Information-theoretic metric learning. In Proceedings of the 24th international conference on machine learning (pp. 209–216). doi: 10.1145/1273496.1273523.
de Souto, M. C. P., et al. (2008). Ranking and selecting clustering algorithms using a meta-learning approach. In IEEE international joint conference on neural networks. doi: 10.1109/IJCNN.2008.4634333.
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the second international conference on knowledge discovery and data mining (pp. 226–231).
Estivill-Castro, V. (2002). Why so many clustering algorithms: A position paper. ACM SIGKDD Explorations Newsletter, 4, 65–75.
Färber, I., et al. (2010). On using class labels in evaluation of clusterings. In Proceedings of the 1st international workshop on discovering, summarizing and using multiple clusterings (MultiClust 2010), in conjunction with the 16th ACM SIGKDD conference on knowledge discovery and data mining (KDD 2010).
Ferrari, D. G., & de Castro, L. N. (2015). Clustering algorithm selection by meta-learning systems: A new distance-based problem characterization and ranking combination methods. Information Sciences, 301, 181–194.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193–218. doi: 10.1007/BF01908075.
Hutter, F., Hoos, H. H., & Leyton-Brown, K. (2011). Sequential model-based optimization for general algorithm configuration. In 5th international conference on learning and intelligent optimization (pp. 507–523). doi: 10.1007/978-3-642-25566-3_40.
Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31, 651–666. doi: 10.1016/j.patrec.2009.09.011.
Lange, T., Roth, V., Braun, M. L., & Buhmann, J. M. (2004). Stability-based validation of clustering solutions. Neural Computation, 16(6), 1299–1323. doi: 10.1162/089976604773717621.
Lelis, L., & Sander, J. (2009). Semi-supervised density-based clustering. In IEEE international conference on data mining (pp. 842–847). doi: 10.1109/ICDM.2009.143.
Mallapragada, P. K., Jin, R., & Jain, A. K. (2008). Active query selection for semi-supervised clustering. In Proceedings of the 19th international conference on pattern recognition. doi: 10.1109/ICPR.2008.4761792.
Moulavi, D., Jaskowiak, P. A., Campello, R. J. G. B., Zimek, A., & Sander, J. (2014). Density-based clustering validation. In Proceedings of the 14th SIAM international conference on data mining.
Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Pourrajabi, M., Zimek, A., Moulavi, D., Campello, R. J. G. B., & Goebel, R. (2014). Model selection for semi-supervised clustering. In Proceedings of the 17th international conference on extending database technology.
Rangapuram, S. S., & Hein, M. (2012). Constrained 1-spectral clustering. In Proceedings of the 15th international conference on artificial intelligence and statistics.
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65. doi: 10.1016/0377-0427(87)90125-7.
Ruiz, C., et al. (2007). C-DBSCAN: Density-based clustering with constraints. In RSFDGrC'07: Proceedings of the international conference on rough sets, fuzzy sets, data mining and granular computing, held in JRS'07, LNCS 4481 (pp. 216–223).
Shental, N., Bar-Hillel, A., Hertz, T., & Weinshall, D. (2004). Computing Gaussian mixture models with EM using equivalence constraints. In Advances in neural information processing systems (Vol. 16).
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2015). Rethinking the inception architecture for computer vision. In Conference on computer vision and pattern recognition.
Thornton, C., Hutter, F., Hoos, H. H., & Leyton-Brown, K. (2013). Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining. doi: 10.1145/2487575.2487629.
Van Craenendonck, T., & Blockeel, H. (2015). Using internal validity measures to compare clustering algorithms. In AutoML workshop at ICML 2015 (pp. 1–8). https://lirias.kuleuven.be/handle/123456789/504712.
Vendramin, L., Campello, R. J. G. B., & Hruschka, E. R. (2010). Relative clustering validity criteria: A comparative overview. Statistical Analysis and Data Mining, 3(4), 209–235. doi: 10.1002/sam.10080.
von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395–416. doi: 10.1007/s11222-007-9033-z.
von Luxburg, U. (2010). Clustering stability: An overview. Foundations and Trends in Machine Learning, 2(3), 235–274. doi: 10.1561/2200000008.
von Luxburg, U., Williamson, R. C., & Guyon, I. (2014). Clustering: Science or art? In Workshop on unsupervised learning and transfer learning, JMLR workshop and conference proceedings (Vol. 27).
Wagstaff, K., Cardie, C., Rogers, S., & Schroedl, S. (2001). Constrained K-means clustering with background knowledge. In Proceedings of the eighteenth international conference on machine learning (pp. 577–584).
Wang, X., Qian, B., & Davidson, I. (2014). On constrained spectral clustering and its applications. Data Mining and Knowledge Discovery, 28(1), 1–30. doi: 10.1007/s10618-012-0291-9. arXiv:1201.5338.
Xing, E. P., Ng, A. Y., Jordan, M. I., & Russell, S. (2003). Distance metric learning, with application to clustering with side-information. Advances in Neural Information Processing Systems, 15, 505–512.
Xiong, S., Azimi, J., & Fern, X. Z. (2014). Active learning of constraints for semi-supervised clustering. IEEE Transactions on Knowledge and Data Engineering, 26(1), 43–54. doi: 10.1109/TKDE.2013.22.
Zelnik-Manor, L., & Perona, P. (2004). Self-tuning spectral clustering. Advances in Neural Information Processing Systems, 17, 1601–1608.