Shift of Pairwise Similarities for Data Clustering

Several clustering methods (e.g., Normalized Cut and Ratio Cut) divide the Min Cut cost function by a cluster dependent factor (e.g., the size or the degree of the clusters), in order to yield a more balanced partitioning. We, instead, investigate adding such regularizations to the original cost function. We first consider the case where the regularization term is the sum of the squared size of the clusters, and then generalize it to adaptive regularization of the pairwise similarities. This leads to shifting (adaptively) the pairwise similarities which might make some of them negative. We then study the connection of this method to Correlation Clustering and then propose an efficient local search optimization algorithm with fast theoretical convergence rate to solve the new clustering problem. In the following, we investigate the shift of pairwise similarities on some common clustering methods, and finally, we demonstrate the superior performance of the method by extensive experiments on different datasets.


Introduction
Given a set of objects, clustering is concerned with grouping them in such a way that objects of the same group are more similar to each other (according to a predefined similarity measure) than to those in different groups. This task plays a fundamental role in several data analytics applications. Examples are image segmentation (to detect the items in images), document clustering (for the purpose of document organization, topic identification or efficient information retrieval), data compression, and analysis of (e.g., transportation) networks and graphs. Clustering itself is not a specific method; rather, it is a general machine learning task to be addressed. The task can be solved via several methods that differ significantly in the way they define the notion of clusters and the way they extract them. The concept of clustering originated in anthropology and was then used in psychology [52,1], in particular for trait theory classification in personality psychology [5].
A wide range of clustering methods introduce a cost function whose minimal solution provides a clustering solution. K-means is a common cost function which is defined by the within-cluster sum of squared distances from the means [35]. The data can be represented by a graph, whose nodes represent the objects and whose edge weights are the pairwise similarities between the objects. Then, a wide range of different graph partitioning methods can be applied to produce the clusters. Arguably, the most basic graph-based method is the Min Cut (Minimum K-Cut) cost function [27,56], in which the goal is to partition the graph into exactly K connected components (clusters) such that the sum of the inter-cluster edge weights is minimal. As we will see, the Min Cut cost function often separates singleton clusters, in particular when the clusters have diverse densities. To overcome this problem, several clustering methods normalize the Min Cut clusters to render more balanced clusters. For example, they propose to normalize the Min Cut clusters by the size of the clusters (Ratio Assoc [23] and Ratio Cut [6]) or the degree of the clusters (Normalized Cut [47]).
Hungarian algorithm, which suffers from a high runtime (cubic w.r.t. the number of objects). Another work models this problem as a least squares linear regression with balance constraints and uses the method of augmented Lagrange multipliers to solve it [32]. The work in [33] considers K-means as the main clustering method and the respective cluster variances as the penalty term. Then, [30] yields balanced clustering with convex regularization, which makes the optimization more efficient. Subsequently, [16] studies balanced K-center, K-median, and K-means in high dimensions with theoretical approximation algorithms. Finally, [22] proposes a balanced clustering framework that utilizes both local and global information. However, in this paper, we consider the 'graph-based' balanced clustering variant, where we assume the clustering is applied to a given graph, instead of data features.
While most graph clustering cost functions assume a nonnegative matrix of pairwise similarities as input, Correlation Clustering assumes that the similarities can be negative as well. This cost function was first introduced on graphs with only +1 and −1 edge weights [2], and then it was generalized to graphs with arbitrary positive and negative edge weights [12]. Such graph clustering cost functions are often NP-hard [47,2,12]. However, the respective optimal solution can be approximated in some way. A category of methods works based on eigenvector analysis of the Laplacian matrix. Spectral Clustering [47,38] was the first method to exploit the information from eigenvectors. It forms a low-dimensional embedding by the bottom eigenvectors of the Laplacian of the similarity matrix and then applies K-means to produce the final clusters. A more recent method, called Power Iteration Clustering (PIC) [29], instead of embedding the data into a K-dimensional space, approximates an eigenvalue-weighted linear combination of all the eigenvectors of the normalized similarity matrix via early stopping of the power iteration method. P-Spectral Clustering (PSC) [3] is another spectral approach that proposes a non-linear generalization of the Laplacian and then performs an iterative splitting method based on its second eigenvector.
An alternative graph-based clustering approach has been developed in the context of discrete time dynamical systems and evolutionary game theory, which is based on performing replicator dynamics [41,39,31]. Dominant Set Clustering (DSC) [41] is an iterative method which, at each iteration, peels off a cluster by performing a replicator dynamics until its convergence. The method in [31] proposes an iterative clustering algorithm with two steps, shrink and expansion, which helps to extract many small and dense clusters in large datasets. The method in [4], called InImDyn, instead of replicator dynamics, suggests using a population dynamics motivated by the analogy with infection and immunization processes within a population of players.
In this paper, we investigate adding regularization terms to the Min Cut cost function, in order to avoid the creation of small singleton clusters. We first consider the case where the regularization is the sum of the squared sizes of the clusters, weighted by the parameter α. This regularization leads to a simple shift transformation of the input, i.e., subtracting α from the pairwise similarities, which provides a straightforward quadratic cost function. We further extend the regularization to the pairwise similarities and employ an adaptive shift of the pairwise similarities which does not require fixing a regularization parameter in advance. The size-constrained Min Cut then constitutes a special case of the latter form. Such a shift might render some pairwise similarities negative. We then study the connection to Correlation Clustering, another cost function which operates on both positive and negative similarities, and conclude the equivalence of these two methods given the shifted (regularized) pairwise similarities in a direct and straightforward way (beyond the argument based on algorithmic reduction proposed in [12]). However, our method, called Shifted Min Cut, provides a principled way to deduce such negative edge weights (adaptively). Thereafter, we develop an efficient optimization method based on local search to solve the new optimization problem. We further discuss the fast theoretical convergence rate of this local search algorithm. Subsequently, we study the impact of shifting the pairwise similarities on some common flat and hierarchical clustering methods, which often exhibit an invariant behaviour with respect to the shift of pairwise similarities, unlike the basic Min Cut cost function. Finally, we perform extensive experiments on several real-world datasets to study the performance of Shifted Min Cut compared to the alternatives. This work is an extension of our previous work [21], wherein we additionally i) provide an argument on the theoretical convergence rate of the local search algorithm based on the connection to an optimized variant of the Frank-Wolfe algorithm, ii) discuss the shift of pairwise similarities on several other clustering methods, and iii) elaborate further on the existing experimental results and perform extra studies on real-world datasets.
We have later found out that the work in [11] suggests a similar idea for regularization of Min Cut in order to yield balanced clusters. However, there are several fundamental differences between [11] and our work: i) They study size-constrained Min Cut for bi-partitioning (i.e., for only two clusters), whereas we model it for arbitrary K clusters. To generate more than two clusters, they propose an iterative (sequential) bi-partitioning, which might cause the re-scaling problem. ii) Their method requires fixing critical hyperparameters, often in a heuristic way, whereas our method does not include such hyperparameters. iii) Beyond size-constrained Min Cut, we extend the method to a refined regularization of the pairwise similarities that yields an adaptive regularization (shift) of the cost function. This adaptive regularization not only provides adaptivity with respect to the type of the relations, but also obviates the need for fixing critical hyperparameters. iv) We consider that the regularization renders some of the pairwise similarities negative, and thereby, we study the connection between such a regularized (Shifted) Min Cut method and Correlation Clustering. However, [11] does not study such a connection. v) To optimize the respective cost function, we integrate the regularizations into the shift of the pairwise similarities and develop an efficient local search algorithm that enjoys a convergence rate of O(1/t) (see section 5). [11], instead, develops approximate spectral solutions. vi) We demonstrate the performance of the method on several real-world datasets with respect to different evaluation criteria, whereas [11] only studies the mutual information evaluation criterion on two datasets. In particular, we investigate both the cost function and its optimization separately.
The rest of the paper is organized as follows. In section 2, we introduce the notations and the definitions. Then, in section 3, we describe the regularization and the connection to shifting the pairwise similarities. In this section, we also extend the method to adaptive regularization (shift) of the pairwise similarities. In section 4, we study the connection between Shifted Min Cut and Correlation Clustering, and, in section 5, we develop an efficient local search optimization method for the cost function. In section 6, we study the consequence of shifting pairwise relations in some other (flat and hierarchical) clustering methods. In section 7, we experimentally investigate the different aspects of the method on several real-world datasets, and finally, in section 8, we conclude the paper.

Notations and Definitions
The data is given by a set of $n$ objects $O = \{1, ..., n\}$ and the corresponding matrix of pairwise similarities $X = \{X_{ij}\}$, $\forall i, j \in O$. Thus, the data can be represented by an undirected graph $G(O, X)$, where the objects $O$ constitute the nodes of the graph and $X_{ij}$ represents the weight of the edge between $i$ and $j$. Then, the goal is to partition the objects (the graph) into $K$ coherent groups which are distinguishable from each other. The clustering solution is encoded in $c \in \{1, ..., K\}^n$, i.e., $c_i$ indicates the cluster label of the $i$-th object. The vector $c$ can also be represented via the co-clustering matrix $H \in \{0, 1\}^{n \times n}$, where
\[ H_{ij} = \begin{cases} 1 & \text{if } c_i = c_j, \\ 0 & \text{otherwise.} \end{cases} \]
$\mathcal{C}$ denotes the space of all different clustering solutions.
Moreover, we assume $O_k \subset O$ includes the members of the $k$-th cluster, i.e., $O_k = \{i \in O : c_i = k\}$, and $|O_k|$ refers to the size of the $k$-th cluster.
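The notation above can be made concrete with a minimal sketch in plain Python. The function names `co_clustering_matrix` and `cluster_members` are illustrative and not part of the paper, and labels are 0-based here rather than 1-based.

```python
def co_clustering_matrix(c):
    """H[i][j] = 1 if objects i and j carry the same cluster label, else 0."""
    n = len(c)
    return [[1 if c[i] == c[j] else 0 for j in range(n)] for i in range(n)]


def cluster_members(c, K):
    """Return the member lists O_k for k = 0..K-1 (0-based labels, an assumption)."""
    members = [[] for _ in range(K)]
    for i, k in enumerate(c):
        members[k].append(i)
    return members


c = [0, 0, 1, 2, 1]           # cluster label of each of the n = 5 objects
H = co_clustering_matrix(c)
O = cluster_members(c, K=3)
print(H[0])                   # [1, 1, 0, 0, 0]
print(O)                      # [[0, 1], [2, 4], [3]]
```

Both encodings carry the same information: `H` is convenient for pairwise cost functions, while the member lists give the cluster sizes directly.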

Shift of Pairwise Similarities for Clustering
Different graph-based clustering methods often consider the Min Cut cost function as a base method, which is defined by
\[ R^{MC}(c, X) = \sum_{k=1}^{K} \sum_{k'=k+1}^{K} \sum_{i \in O_k} \sum_{j \in O_{k'}} X_{ij}. \tag{3} \]
This cost function has a tendency to split off small sets of objects, since the cost increases with the number of inter-cluster edges, i.e., the edges connecting the different clusters. Figure 1 illustrates such a situation for two clusters [47].
We assume that the edge weights are inversely proportional to the distances between the objects. It is observed that Min Cut favors splitting off object $i$ or $j$, instead of performing a more balanced split. In fact, any cut that splits off one of the objects on the right half will yield a smaller cost than the cut that partitions the objects into the left and right halves. This issue is particularly problematic when the intra-cluster edge weights are heterogeneous among different clusters.
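The bias described above can be reproduced on a toy example. The sketch below uses a hypothetical 1-D layout (the coordinates are made up for illustration and are not the configuration of Figure 1), with edge weights set to inverse distances, and compares the Min Cut cost of a balanced split with that of splitting off a single outlying object.

```python
def min_cut_cost(X, c):
    """Sum of X[i][j] over unordered pairs (i, j) placed in different clusters."""
    n = len(c)
    return sum(X[i][j] for i in range(n) for j in range(i + 1, n) if c[i] != c[j])


# Hypothetical layout: a dense group on the left, a sparse group on the right.
points = [0.0, 0.1, 0.2, 3.0, 5.0, 7.0]
n = len(points)
# Edge weights inversely proportional to the distances (zero diagonal).
X = [[0.0 if i == j else 1.0 / abs(points[i] - points[j]) for j in range(n)]
     for i in range(n)]

balanced = [0, 0, 0, 1, 1, 1]    # left half vs. right half
singleton = [0, 0, 0, 0, 0, 1]   # only the outermost object is split off

cost_balanced = min_cut_cost(X, balanced)
cost_singleton = min_cut_cost(X, singleton)
print(cost_singleton < cost_balanced)  # True
```

The singleton cut crosses only five edges, most of them weak, whereas the balanced cut crosses nine, so the unregularized cost prefers the singleton.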
Thus, several methods propose to normalize the Min Cut clusters by a cluster-dependent factor, e.g., the size of the clusters (Ratio Assoc [23] and Ratio Cut [6]) or the degree of the clusters (Normalized Cut [47]).
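Figure 1: The Min Cut cost function has a bias to split small (singleton) sets of objects. Any cut that splits one of the objects on the right half will have smaller cost than the cut that splits the objects into the left and right halves. The figure has been adapted from [47].

We investigate an alternative approach to obtain more balanced clusters. Instead of normalizing (dividing) the Min Cut cost function by a cluster-dependent function, we consider adding such a regularization to the original cost function, i.e.,
\[ R(c, X) = R^{MC}(c, X) + \alpha \, r(c, X), \]
where $r(c, X)$ indicates the regularization. Note that this formulation involves the two free choices $\alpha$ and $r(c, X)$; thereby, it yields a richer family of alternative methods. We first focus on the case where $r(c, X)$ is the sum of the squared sizes of the clusters (see footnote 1), i.e.,
\[ R(c, X) = R^{MC}(c, X) + \alpha \sum_{k=1}^{K} |O_k|^2. \tag{5} \]
Footnote 1: [...] is minimal when only the singleton clusters (objects) are separated. Thus, such a choice does not help to avoid the occurrence of singleton clusters; rather, it accelerates it.

Since $\sum_{k} |O_k| = n$ is fixed, for $\alpha > 0$ the regularization term in Eq. 5 is minimal when $|O_k| = n/K$ for all $k$. This leads to equalizing the size of the clusters. We note that the $|O_k|$'s are integers, but $n/K$ is not necessarily an integer. Thus, we may arbitrarily set some of the $|O_k|$'s to $\lfloor n/K \rfloor$ and some others to $\lceil n/K \rceil$ such that $\sum_{k=1}^{K} |O_k| = n$. For symmetric $X$, the cost function in Eq. 5 can be further written as
\[ R(c, X) = \frac{1}{2} \sum_{i,j \in O} X_{ij} - \frac{1}{2} \sum_{k=1}^{K} \sum_{i,j \in O_k} X_{ij} + \alpha \sum_{k=1}^{K} \sum_{i,j \in O_k} 1 = \underbrace{\frac{1}{2} \sum_{i,j \in O} X_{ij}}_{\text{constant}} - \frac{1}{2} \sum_{k=1}^{K} \sum_{i,j \in O_k} (X_{ij} - 2\alpha), \]
where $\sum_{i,j \in O_k}$ runs over all ordered pairs of members of cluster $k$. The first term does not depend on $c$, and neither the positive factor $\frac{1}{2}$ nor a rescaling of $\alpha$ changes the family of minimizers. Therefore, absorbing these constants, we define
\[ R^{SMC}(c, X; \alpha) = - \sum_{k=1}^{K} \sum_{i,j \in O_k} (X_{ij} - \alpha). \tag{6} \]
Thus, we employ a shifted variant of the Min Cut cost function (called Shifted Min Cut), wherein a positive parameter $\alpha$ is subtracted from all pairwise similarities, such that some of the pairwise similarities might become negative. It makes sense that the regularization on the size of the clusters becomes connected to the pairwise similarities, as, in the end, pairwise relations are responsible for creating the clusters. Thus, by tuning them properly, one should be able to obtain the desired balanced clusters. Thereby, the cluster-level regularization is effectively applied to the representation space, where, as will be discussed, it yields modelling and computational advantages.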
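The connection between the size regularization in Eq. 5 and a constant shift of the similarities can be checked numerically. The following sketch assumes a symmetric $X$ and verifies, on random data, that $R^{MC}(c, X) + \alpha \sum_k |O_k|^2$ equals $\frac{1}{2}\sum_{i,j} X_{ij} - \frac{1}{2}\sum_k \sum_{i,j \in O_k}(X_{ij} - 2\alpha)$. The factor 2 in front of $\alpha$ comes from counting ordered pairs and can be absorbed into the shift parameter; the helper names are illustrative.

```python
import random


def min_cut_cost(X, c):
    """Sum of X[i][j] over unordered pairs (i, j) in different clusters."""
    n = len(c)
    return sum(X[i][j] for i in range(n) for j in range(i + 1, n) if c[i] != c[j])


def intra_sum(M, c):
    """Sum of M[i][j] over ordered pairs (i, j), diagonal included, with c[i] == c[j]."""
    n = len(c)
    return sum(M[i][j] for i in range(n) for j in range(n) if c[i] == c[j])


rng = random.Random(0)
n, K, alpha = 8, 3, 0.4
X = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i, n):
        X[i][j] = X[j][i] = rng.random()   # random symmetric similarities
c = [rng.randrange(K) for _ in range(n)]   # random clustering

# Left-hand side: Min Cut plus the squared-size regularization.
sizes = [c.count(k) for k in range(K)]
regularized = min_cut_cost(X, c) + alpha * sum(s * s for s in sizes)

# Right-hand side: constant minus the intra-cluster sum of shifted similarities.
total = sum(sum(row) for row in X)
shifted = [[X[i][j] - 2.0 * alpha for j in range(n)] for i in range(n)]
rewritten = 0.5 * total - 0.5 * intra_sum(shifted, c)

print(abs(regularized - rewritten) < 1e-9)  # True
```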
This formulation provides a rich family of alternative clustering methods where different regularizations are induced by different values of α. However, choosing a very large α can lead to equalizing the size of clusters that are inherently very unbalanced in size. For example, consider the dataset shown in Figure 2. We assume that the edge weights are inversely proportional to the pairwise distances. Then, we subtract a very large number from all pairwise similarities. Therefore, the pairwise similarities become negative numbers of very large magnitude, which makes Shifted Min Cut produce equal-size clusters, even though a correct cut should separate only the object i from the rest. Thus, in practice one needs to examine different values of α and choose the one that yields the best results, or is preferred by the user. However, this procedure might be computationally expensive, and, moreover, the user might not be able to validate the correct solution among many different alternatives, due to a lack of prior knowledge, supervision or side information.
For this reason, we employ a particular shift of pairwise similarities which takes the connectivity of the objects into account and does not need fixing any free parameter.
Adaptive shift of pairwise similarities. Different pairwise similarities might need different shifts, depending on the type and the density of the clusters that the respective objects belong to. Therefore, we relax the constraints of the formulation in Eq. 6 and consider a separate shift parameter $\alpha_{ij}$ for every pairwise similarity $X_{ij}$:
\[ R^{SMC}(c, X; \{\alpha_{ij}\}) = - \sum_{k=1}^{K} \sum_{i,j \in O_k} (X_{ij} - \alpha_{ij}). \tag{7} \]
The formulation in Eq. 7 involves the formulation in Eq. 6 as a special case where all $\alpha_{ij}$'s are fixed to a constant. To determine the $\alpha_{ij}$'s properly, a reasonable approach is to shift the pairwise similarity $X_{ij}$ between $i$ and $j$ adaptively with respect to the similarities between $i$ and all the other objects, as well as the similarities between $j$ and the other objects. For this purpose, we shift $X_{ij}$ such that the sum of the pairwise similarities between $i$ and all the other objects becomes zero, and the same holds for $j$ too. In this way, we have
\[ \alpha_{ij} = \frac{1}{n} \sum_{p=1}^{n} X_{ip} + \frac{1}{n} \sum_{p=1}^{n} X_{pj} - \frac{1}{n^2} \sum_{p=1}^{n} \sum_{q=1}^{n} X_{pq}. \]
Summing up the regularizations for all pairs of objects, we have (we assume $X$ is symmetric):
\[ \sum_{k=1}^{K} \sum_{i,j \in O_k} \alpha_{ij} = \frac{2}{n} \sum_{k=1}^{K} |O_k| \deg(k) - \frac{\beta}{n^2} \sum_{k=1}^{K} |O_k|^2, \]
where $\deg(k)$ is the degree of cluster $k$, i.e., $\deg(k) = \sum_{i \in O_k} \sum_{p=1}^{n} X_{ip}$, and the constant $\beta$ is the sum of the given pairwise similarities, i.e., $\beta = \sum_{p=1}^{n} \sum_{q=1}^{n} X_{pq}$. Therefore, the adaptive regularization yields a tradeoff between the size of the clusters and the degree of the clusters. The former is used in Ratio Assoc and the latter in Normalized Cut, both in the denominator. Here, however, a combination of the two is used, and as additive terms.
Therefore, the new shifted similarity $S_{ij}$ is obtained by
\[ S_{ij} = X_{ij} - \frac{1}{n} \sum_{p=1}^{n} X_{ip} - \frac{1}{n} \sum_{p=1}^{n} X_{pj} + \frac{1}{n^2} \sum_{p=1}^{n} \sum_{q=1}^{n} X_{pq}. \tag{10} \]
It is easy to check that $S$ is symmetric, provided that $X$ is symmetric. It can be shown that the sums of the rows and of the columns of $S$ are equal to zero. For example, for a fixed row $i$ we have
\[ \sum_{j=1}^{n} S_{ij} = \sum_{j=1}^{n} X_{ij} - \sum_{p=1}^{n} X_{ip} - \frac{1}{n} \sum_{j=1}^{n} \sum_{p=1}^{n} X_{pj} + \frac{1}{n} \sum_{p=1}^{n} \sum_{q=1}^{n} X_{pq} = 0. \]
The adaptive shift in Eq. 10 can be written in matrix form as
\[ S = T X T, \]
where the $n \times n$ matrix $T$ is defined by
\[ T = I - \frac{1}{n} U, \]
$I$ is the identity matrix, and $U$ is an $n \times n$ matrix whose elements are all 1. Thus, according to Eqs. 6 and 7, the new cost function is written as
\[ R^{SMC}(c, S) = - \sum_{k=1}^{K} \sum_{i,j \in O_k} S_{ij}. \tag{14} \]
As an alternative to the adaptive shift, a proper shift can be obtained by having a user investigate a few pairwise relations (i.e., a kind of weak supervision). In this setting, the user tells us what the actual pairwise relations should look like for a small subset of them, i.e., whether the objects are in the same cluster (positive shift) or in different clusters (negative shift). Then, given this feedback, we can generalize it to all the pairwise relations. We may train a model, e.g., a neural network, which learns the shift depending on the specifications of the respective edge and objects. Such an approach can even be combined with our method for the adaptive shift of pairwise similarities, where the latter is used as an initial guess for the shifted pairwise relations, which are then fine-tuned further using the user feedback if needed. This formulation also provides a convenient way to encode constraints and prior knowledge such as 'objects x and y must be together' and 'objects p and q must be in different clusters'.
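A minimal sketch of the adaptive shift, written in plain Python under the assumption that the shift is the double-centering described above (subtract the row mean and the column mean, add back the grand mean). The toy matrix and the helper names `adaptive_shift` and `smc_cost` are illustrative; `smc_cost` uses the intra-cluster form of the cost, i.e., minus the sum of shifted similarities over ordered intra-cluster pairs.

```python
def adaptive_shift(X):
    """Shift X so that every row and every column of the result sums to zero."""
    n = len(X)
    row_mean = [sum(X[i]) / n for i in range(n)]
    col_mean = [sum(X[p][j] for p in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [[X[i][j] - row_mean[i] - col_mean[j] + grand_mean for j in range(n)]
            for i in range(n)]


def smc_cost(S, c):
    """Minus the sum of S[i][j] over ordered pairs (i, j) in the same cluster."""
    n = len(c)
    return -sum(S[i][j] for i in range(n) for j in range(n) if c[i] == c[j])


# Toy symmetric similarities: objects 0-2 are tightly connected, object 3 is not.
X = [[0.0, 0.9, 0.8, 0.1],
     [0.9, 0.0, 0.7, 0.2],
     [0.8, 0.7, 0.0, 0.1],
     [0.1, 0.2, 0.1, 0.0]]
S = adaptive_shift(X)

row_sums = [sum(row) for row in S]
print(all(abs(r) < 1e-9 for r in row_sums))            # True: rows sum to zero
print(any(S[i][j] < 0 for i in range(4) for j in range(4)))  # True: negative weights appear
```

Because every row sums to zero, the shifted matrix necessarily mixes positive and negative entries, which is what links the method to Correlation Clustering.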

Relation to Correlation Clustering
Correlation Clustering is a clustering cost function that partitions a graph with positive and negative edge weights. The cost function sums the disagreements, i.e., the sum of the (absolute) negative intra-cluster edge weights plus the sum of the positive inter-cluster edge weights. The respective cost function on general graphs is defined by [12]
\[ R^{CC}(c, X) = \sum_{(i,j) \in E^{<->},\; c_i = c_j} |X_{ij}| \; + \sum_{(i,j) \in E^{<+>},\; c_i \neq c_j} X_{ij}, \tag{16} \]
where $E^{<->}$ and $E^{<+>}$ respectively indicate the sets of edges with negative and with positive weights. The approximation scheme in [12] reduces Min Cut to Correlation Clustering in order to obtain a logarithmic approximation factor for Correlation Clustering. It also develops a reduction from Correlation Clustering to Min Cut to conclude the equivalence of these two cost functions. Here, we show that these two cost functions are identical and represent the same objective (given the shifted pairwise similarities) in a direct and straightforward way, without using the more complicated reduction argument. In addition, [12] assumes that the number of clusters is hidden in the cost function (as defined in Eq. 16). However, we study the equivalence for any arbitrary number of clusters K. As shown in [10,18], optimizing Correlation Clustering without a constraint on the number of clusters can lead to overfitting and non-robust solutions, whereas fixing the number of clusters may avoid these issues. Therefore, we consider the setting where the number of clusters K is explicitly specified in the cost function and the user has the possibility to fix it in advance. Finally, the reduction-based argument in [12] yields the equivalence of the optimal solutions between Min Cut and Correlation Clustering and the respective approximation and hardness results. We, in addition, conclude the equivalence of any local optimal solution for the two cost functions, which is important when using local search algorithms to optimize the cost functions.
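For a fixed K, the Correlation Clustering cost function can be written as [10,18] (for symmetric $X$, counting every undirected edge once)
\[ R^{CC}(c, X) = \underbrace{\frac{1}{4} \sum_{k=1}^{K} \sum_{i,j \in O_k} \big(|X_{ij}| - X_{ij}\big)}_{a} + \underbrace{\frac{1}{2} \sum_{k=1}^{K} \sum_{k'=k+1}^{K} \sum_{i \in O_k} \sum_{j \in O_{k'}} \big(|X_{ij}| + X_{ij}\big)}_{b}. \tag{17} \]
The first term (called $a$) sums the intra-cluster negative edge weights, whereas the second term (called $b$) sums the inter-cluster positive edge weights. We separately expand each term. For any matrix $Y$, the intra-cluster sum over ordered pairs satisfies $\sum_{k} \sum_{i,j \in O_k} Y_{ij} = \sum_{i,j \in O} Y_{ij} - 2 \sum_{k} \sum_{k'>k} \sum_{i \in O_k} \sum_{j \in O_{k'}} Y_{ij}$. Hence,
\[ a = \frac{1}{4} \sum_{i,j \in O} |X_{ij}| - \frac{1}{2} \sum_{k=1}^{K} \sum_{k'=k+1}^{K} \sum_{i \in O_k} \sum_{j \in O_{k'}} |X_{ij}| - \frac{1}{4} \sum_{i,j \in O} X_{ij} + \frac{1}{2} R^{MC}(c, X). \]
Similarly, we expand term $b$:
\[ b = \frac{1}{2} \sum_{k=1}^{K} \sum_{k'=k+1}^{K} \sum_{i \in O_k} \sum_{j \in O_{k'}} |X_{ij}| + \frac{1}{2} R^{MC}(c, X). \]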
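Then, by summing $a$ and $b$ we obtain
\[ R^{CC}(c, X) = \underbrace{\frac{1}{4} \sum_{i,j \in O} \big(|X_{ij}| - X_{ij}\big)}_{\text{constant}} + R^{MC}(c, X). \]
Thus, Correlation Clustering and Min Cut are equivalent cost functions, i.e.,
1. The cost functions share the same optimal solution, i.e., $\arg\min_{c} R^{MC}(c, X) = \arg\min_{c} R^{CC}(c, X)$.
2. The cost differences are the same, i.e., $\forall c, c' \in \mathcal{C}$: $R^{MC}(c, X) - R^{MC}(c', X) = R^{CC}(c, X) - R^{CC}(c', X)$. This is in particular relevant when defining, for example, a Boltzmann distribution over the solution space $\mathcal{C}$.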
Thus, Correlation Clustering, similar to Shifted Min Cut, is an extension of Min Cut which deals with both negative and positive edge weights. However, there are fundamental differences between these two methods:
1. Correlation Clustering assumes that the matrix of pairwise positive and negative similarities is given (which might be nontrivial), whereas Shifted Min Cut proposes a principled way to yield clustering of positive and negative similarities via regularizing the base Min Cut cost function. Thus, Shifted Min Cut provides an explicit and straightforward interpretation of the clustering problem.
2. The form of the Shifted Min Cut cost function expressed in Eq. 14 provides efficient function evaluations (e.g., for optimization) compared to the Correlation Clustering cost function in Eq. 17 or the base Min Cut cost function in Eq. 3. The cost functions in Eqs. 17 and 3 are quadratic with respect to K, the number of clusters, whereas the cost function in Eq. 14 is linear.
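The equivalence discussed in this section can be verified by brute force on a small graph. The sketch below assumes that every undirected edge is counted once in both cost functions and enumerates all labelings of five objects with up to K = 3 labels; the gap between the Correlation Clustering cost and the Min Cut cost is then the same for every labeling, namely the total magnitude of the negative edge weights. The weight matrix is an arbitrary illustrative example.

```python
from itertools import product


def min_cut_cost(X, c):
    """Sum of X[i][j] over unordered pairs (i, j) in different clusters."""
    n = len(c)
    return sum(X[i][j] for i in range(n) for j in range(i + 1, n) if c[i] != c[j])


def correlation_clustering_cost(X, c):
    """Disagreements: negative intra-cluster plus positive inter-cluster weights."""
    n = len(c)
    cost = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if c[i] == c[j] and X[i][j] < 0:
                cost -= X[i][j]          # |X_ij| of a negative edge inside a cluster
            elif c[i] != c[j] and X[i][j] > 0:
                cost += X[i][j]          # positive edge cut between clusters
    return cost


# Arbitrary symmetric weights with mixed signs (e.g., after a shift).
X = [[0.0, 0.6, -0.3, -0.4, 0.1],
     [0.6, 0.0, 0.2, -0.5, -0.3],
     [-0.3, 0.2, 0.0, 0.3, -0.2],
     [-0.4, -0.5, 0.3, 0.0, 0.6],
     [0.1, -0.3, -0.2, 0.6, 0.0]]
n, K = len(X), 3

gaps = [correlation_clustering_cost(X, c) - min_cut_cost(X, c)
        for c in product(range(K), repeat=n)]
negative_mass = sum(-X[i][j] for i in range(n) for j in range(i + 1, n) if X[i][j] < 0)

print(max(gaps) - min(gaps) < 1e-9)            # True: the gap does not depend on c
print(abs(gaps[0] - negative_mass) < 1e-9)     # True: the gap is the negative mass
```

Since the gap is constant, both the minimizers and all pairwise cost differences coincide, which is why any local optimum of one cost is also a local optimum of the other.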

Optimization of the Shifted Min Cut Cost Function
Finding the optimal solution of the standard Min Cut with non-negative edge weights, i.e., when $X_{ij} \geq 0$, $\forall i, j$, is well studied, and there exist several polynomial time algorithms for it, e.g., $O(n^4)$ [19] and $O(n^2 \log^3 n)$ [25]. However, finding the optimal solution of the Shifted Min Cut cost function, wherein some edge weights are negative, is NP-hard [2,12] and even APX-hard [12]. Therefore, we develop a local search method which computes a local minimum of the cost function in Eq. 14. The effectiveness of such a greedy strategy is well studied for different clustering cost functions, e.g., K-means [35], kernel K-means [45] and in particular several graph partitioning methods [14,15]. In this approach, we start with a random clustering solution and then we iteratively assign each object to the cluster that yields a maximal reduction in the cost function. We repeat this procedure until no further change of assignments is achieved during a complete round of investigation of the objects, i.e., until a local optimal solution is attained.
At each iteration of the aforementioned procedure, one needs to evaluate the cost of assigning every object to each of the clusters. The cost function is quadratic, thus a single evaluation might take $O(Kn^2)$ runtime. Thereby, if the local search converges after $t$ iterations, then the total runtime will be $O(tKn^3)$ for $n$ objects, which might be computationally expensive.
However, we do not need to recalculate the cost function for every individual evaluation. Let $R^{SMC}(c_{o \to l}, S)$ denote the cost of the clustering solution $c$ wherein object $o$ is assigned to cluster $l$. At each step of the local search algorithm, we need to evaluate the cost $R^{SMC}(c_{o \to l'}, S)$, $l' \neq l$, given $R^{SMC}(c_{o \to l}, S)$.
For symmetric $S$, the cost $R^{SMC}(c_{o \to l}, S)$ is written as
\[ R^{SMC}(c_{o \to l}, S) = - \sum_{k=1}^{K} \sum_{i,j \in O_k \setminus \{o\}} S_{ij} - 2 \sum_{j \in O_l \setminus \{o\}} S_{oj} - S_{oo}. \]
Similarly, the cost $R^{SMC}(c_{o \to l'}, S)$, $l' \neq l$, is obtained by
\[ R^{SMC}(c_{o \to l'}, S) = R^{SMC}(c_{o \to l}, S) + 2 \sum_{j \in O_l \setminus \{o\}} S_{oj} - 2 \sum_{j \in O_{l'} \setminus \{o\}} S_{oj}. \tag{22} \]
Thus, given $R^{SMC}(c_{o \to l}, S)$, the runtime of a new evaluation of the cost function is $O(n)$. Hence, the total runtime of the local search method will be $O(tn^2)$. Therefore, at the beginning, we compute a random initial solution, wherein each object is assigned randomly to one of the $K$ clusters, and compute the respective cost. At each iteration, we use Eq. 22 to investigate the cost of assigning an object to the clusters other than the current one. Then, we assign the object to the cluster that yields a maximal reduction in the cost. We might repeat the local search algorithm with several random initializations and, at the end, choose a solution with a minimal cost. Note that even the efficient evaluation and optimization of the variants in Eq. 3 and Eq. 17 would yield $O(tKn^2)$ total runtime, i.e., $K$ times slower than the variant expressed in Eq. 14.
We note that this technique can be employed with other optimization or inference methods as well, such as MCMC methods and simulated annealing.
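The local search described in this section can be sketched as follows. This is an illustrative plain-Python version, not the authors' implementation: it assumes a symmetric shifted matrix $S$, uses the intra-cluster form of the cost (minus the sum of $S_{ij}$ over ordered intra-cluster pairs), and updates the cost incrementally, so that evaluating all candidate clusters for one object takes a single $O(n)$ pass. The names `local_search`, `init`, `seed` and `max_rounds` are assumptions of this sketch.

```python
import random


def adaptive_shift(X):
    """Double-centering shift: rows and columns of the result sum to zero."""
    n = len(X)
    row_mean = [sum(X[i]) / n for i in range(n)]
    col_mean = [sum(X[p][j] for p in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [[X[i][j] - row_mean[i] - col_mean[j] + grand_mean for j in range(n)]
            for i in range(n)]


def smc_cost(S, c):
    """Minus the sum of S[i][j] over ordered pairs (i, j) in the same cluster."""
    n = len(c)
    return -sum(S[i][j] for i in range(n) for j in range(n) if c[i] == c[j])


def local_search(S, K, init=None, seed=0, max_rounds=100):
    """Greedy reassignment until a full round changes nothing (a local optimum)."""
    n = len(S)
    rng = random.Random(seed)
    c = list(init) if init is not None else [rng.randrange(K) for _ in range(n)]
    cost = smc_cost(S, c)                      # full evaluation only once
    for _ in range(max_rounds):
        changed = False
        for o in range(n):
            # One O(n) pass: total similarity of o to each cluster, excluding o.
            link = [0.0] * K
            for j in range(n):
                if j != o:
                    link[c[j]] += S[o][j]
            best = max(range(K), key=lambda k: link[k])
            if link[best] > link[c[o]] + 1e-12:
                # Incremental update: cost changes by 2 * (old link - new link).
                cost += 2.0 * (link[c[o]] - link[best])
                c[o] = best
                changed = True
        if not changed:
            break
    return c, cost


# Two groups of three objects: similarity 1.0 within a group, 0.1 across groups.
groups = [0, 0, 0, 1, 1, 1]
X = [[0.0 if i == j else (1.0 if groups[i] == groups[j] else 0.1) for j in range(6)]
     for i in range(6)]
S = adaptive_shift(X)

labels, cost = local_search(S, K=2, init=[0, 1, 0, 1, 0, 1])
print(labels)                                        # [0, 0, 0, 1, 1, 1]
print(abs(cost - smc_cost(S, labels)) < 1e-9)        # True: incremental cost is exact
```

A fixed initial labeling is passed here only to make the example deterministic; with `init=None` the sketch starts from a random assignment, and, as noted above, several restarts may be needed because local search can stop in a poor local optimum.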
On the convergence rate of the local search optimization. With the co-authors, we have shown in [51] that for Correlation Clustering, Frank-Wolfe optimization with line search for the update parameter (to find the optimal learning rate) is equivalent to the local search algorithm. On the other hand, we have established a convergence rate of $O(1/t)$ for Frank-Wolfe optimization applied to Correlation Clustering [51] ($t$ indicates the optimization step). As discussed before, given the shifted pairwise similarities, Shifted Min Cut is equivalent to Correlation Clustering. Thus, the same argument holds for the aforementioned local search algorithm for Shifted Min Cut, i.e., Shifted Min Cut enjoys the convergence rate of $O(1/t)$. This convergence rate should be compared with the convergence rate of $O(1/\sqrt{t})$ for general non-convex (non-concave) functions [42] that applies to many other clustering objectives such as Ratio Assoc, Normalized Cut and Dominant Set Clustering, i.e., optimizing Shifted Min Cut yields a faster theoretical convergence rate compared to many other alternatives.

Shift Analysis of Other Clustering Methods
In this section, we investigate the impact of shifting the pairwise similarities on some common flat and hierarchical clustering methods.
Shift of pairwise similarities for flat clustering. It is obvious that K-means and Gaussian Mixture Models (GMMs) are invariant with respect to the shift of data features. Since these methods operate directly on the data features, shifting refers to adding a constant α to all the features. Under this shift, the centroids (in K-means) and the means (in GMM) are shifted by α as well, but their relative distances stay the same. The other parameters, i.e., the clustering assignments (in K-means), and the assignment probabilities, covariance matrices and weights (in GMM), do not change. In other words, the shift only affects the location of the clusters, without modifying the cluster memberships. A similar argument applies to a density-based clustering method such as DBSCAN [17], wherein shifting the data features does not modify the clustering solution, apart from a consistent shift of the locations of all the clusters together.
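As discussed in [44], when shifting the pairwise similarities by α, the Ratio Assoc and Ratio Cut cost functions stay invariant, i.e., their optimal solutions stay the same. By shifting the pairwise similarities by α, the Ratio Assoc cost function is written as
\[ R^{RA}(c, X - \alpha) = - \sum_{k=1}^{K} \frac{\sum_{i,j \in O_k} (X_{ij} - \alpha)}{|O_k|} = - \sum_{k=1}^{K} \frac{\sum_{i,j \in O_k} X_{ij}}{|O_k|} + \alpha \sum_{k=1}^{K} \frac{|O_k|^2}{|O_k|} = R^{RA}(c, X) + \alpha n. \]
Therefore, the Ratio Assoc cost function is invariant under shifting the pairwise similarities. Similar to Ratio Assoc, the shifted Ratio Cut cost function can be written as
\[ R^{RC}(c, X - \alpha) = \sum_{k=1}^{K} \frac{\sum_{i \in O_k} \sum_{j \notin O_k} (X_{ij} - \alpha)}{|O_k|} = R^{RC}(c, X) - \alpha \sum_{k=1}^{K} (n - |O_k|) = R^{RC}(c, X) - \alpha n (K - 1). \]
Thereby, both the Ratio Assoc and Ratio Cut cost functions are invariant under shifting the pairwise similarities, since the shift only adds a term that does not depend on $c$. One can show that this holds in general for every clustering cost function that normalizes the clusters by the size of the clusters, i.e., size-normalized (divided) clustering cost functions stay invariant with respect to the shift of pairwise similarities.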
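The shift invariance of the size-normalized cost functions can be checked numerically. The sketch below assumes the common definitions of Ratio Assoc (minus the intra-cluster association divided by the cluster size) and Ratio Cut (the cut between a cluster and the rest divided by the cluster size), both summed over clusters. A constant shift then changes the two costs by $\alpha n$ and $-\alpha n (K-1)$ respectively, independently of the clustering. Function names are illustrative.

```python
import random


def ratio_assoc(X, c, K):
    """Minus the sum over clusters of (intra-cluster association / cluster size)."""
    n = len(c)
    cost = 0.0
    for k in range(K):
        members = [i for i in range(n) if c[i] == k]
        cost -= sum(X[i][j] for i in members for j in members) / len(members)
    return cost


def ratio_cut(X, c, K):
    """Sum over clusters of (cut between the cluster and the rest / cluster size)."""
    n = len(c)
    cost = 0.0
    for k in range(K):
        members = [i for i in range(n) if c[i] == k]
        others = [j for j in range(n) if c[j] != k]
        cost += sum(X[i][j] for i in members for j in others) / len(members)
    return cost


rng = random.Random(0)
n, K, alpha = 7, 3, 0.3
X = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        X[i][j] = X[j][i] = rng.random()
shifted = [[X[i][j] - alpha for j in range(n)] for i in range(n)]

for c in ([0, 0, 1, 1, 1, 2, 2], [0, 1, 2, 0, 1, 2, 0]):   # no empty clusters
    ra_gap = ratio_assoc(shifted, c, K) - ratio_assoc(X, c, K)
    rc_gap = ratio_cut(shifted, c, K) - ratio_cut(X, c, K)
    print(abs(ra_gap - alpha * n) < 1e-9, abs(rc_gap + alpha * n * (K - 1)) < 1e-9)
    # True True
```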
On the other hand, the Normalized Cut cost function, when the pairwise similarities are shifted, is written as
\[ R^{NC}(c, X - \alpha) = \sum_{k=1}^{K} \frac{\sum_{i \in O_k} \sum_{j \notin O_k} (X_{ij} - \alpha)}{\sum_{i \in O_k} \sum_{j=1}^{n} (X_{ij} - \alpha)}. \]
It turns out that this cost function is not shift invariant in general, contrary to the two previous alternatives. However, for the special case of almost balanced clusters, i.e., $|O_k| \approx n/K$ for all $k$, and a similar intra-cluster similarity distribution among all clusters, all the row sums of the similarity matrix $X$ tend to be close to each other. The objects then share the same degree, i.e., $\sum_{j=1}^{n} X_{ij} \approx$ constant. In this case, the Normalized Cut cost function becomes equivalent to the Ratio Assoc cost function [44]. This analysis explains the similar performance of such graph partitioning methods in large-scale comparison studies, e.g., for image segmentation, where clusters have balanced and similar structures [50,44].
Ratio Cut, despite normalizing the cut by the size of the clusters, tends to separate small clusters, as demonstrated in [47,20]. For this reason, Normalized Cut proposes to normalize the cut by the degree of the clusters, rather than the size of the clusters. An alternative way to overcome this problem is to apply a stronger constraint on the size of the clusters. Using this idea, P-Spectral Clustering [3] proposes a nonlinear generalization of spectral clustering based on the second eigenvector of the graph p-Laplacian, which is then interpreted as a generalization of graph clustering models such as Ratio Cut. P-Spectral Clustering is an iterative clustering procedure that at each step performs a bi-partitioning of one of the existing clusters, using a nonlinear spectral method, until K clusters are constructed. The underlying cost function for bi-partitioning into two sets $O_a$ and $O_b$ is given by ($p > 1$)
\[ R^{PSC}(O_a, O_b) = \Big( \sum_{i \in O_a} \sum_{j \in O_b} X_{ij} \Big) \left( \frac{1}{|O_a|^{\frac{1}{p-1}}} + \frac{1}{|O_b|^{\frac{1}{p-1}}} \right)^{p-1}. \]
In [20], we have introduced Adaptive Ratio Cut (ARC) as a generalization of this cost function to yield K clusters. For the special case of $p = 2$, Adaptive Ratio Cut is equivalent to the standard Ratio Cut cost function. However, unlike Ratio Cut, it is easy to see that Adaptive Ratio Cut is not shift invariant, as the shift parameter α cannot be factored out from the cost function.
Shifted Dominant Set Clustering. This clustering method computes the clusters via performing replicator dynamics. It has been shown that the solutions of a replicator dynamics correspond to the solutions of the following quadratic program [46,55]:
\[ \text{maximize } f(v) = v^{T} X v \quad \text{subject to } v \geq 0, \; \sum_{i=1}^{n} v_i = 1, \]
where the $n$-dimensional characteristic vector $v$ determines the participation of the objects in the solution.
Thus, to study the impact of the shift on DSC, we consider the shifted variant of the quadratic program. In [7] we have elaborated the impact of such a shift based on the off-diagonal shift argument in [40]. It yields
\[ f(v; \alpha) = v^{T} (X - \alpha\, e e^{T}) v = v^{T} X v - \alpha\, (e^{T} v)^2 = v^{T} X v - \alpha, \]
where $e = (1, 1, ..., 1)^{T}$ is a vector of ones and the last equality uses the constraint $e^{T} v = 1$.
Therefore, Dominant Set Clustering is invariant under shifting the pairwise similarities.
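The invariance argument for the quadratic program can be illustrated numerically. The sketch below assumes that the shift subtracts α from every entry of $X$ (i.e., $X - \alpha e e^T$) and that $v$ lies on the standard simplex; the objective then changes by exactly α for every feasible $v$, so the maximizers are unaffected. The random data and the name `quad_form` are illustrative.

```python
import random


def quad_form(X, v):
    """Compute v^T X v."""
    n = len(v)
    return sum(v[i] * X[i][j] * v[j] for i in range(n) for j in range(n))


rng = random.Random(1)
n, alpha = 5, 0.25
X = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        X[i][j] = X[j][i] = rng.random()
shifted = [[X[i][j] - alpha for j in range(n)] for i in range(n)]

gaps = []
for _ in range(4):
    raw = [rng.random() + 1e-6 for _ in range(n)]
    v = [r / sum(raw) for r in raw]            # a point on the standard simplex
    gaps.append(quad_form(X, v) - quad_form(shifted, v))

print(all(abs(g - alpha) < 1e-9 for g in gaps))  # True: the gap is always alpha
```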
However, it has been proposed in [40] to shift the diagonal entries of the similarity matrix by a negative value, in order to obtain coarser clusters, which yields a hierarchy of clusters. The clusters obtained from the unshifted similarity matrix appear at the lowest level of the hierarchy. The larger the negative shift is, the coarser the clusters are. Performing a negative shift is equivalent to adding the same shift but with a positive sign to the off-diagonal pairwise similarities. Thereby, the shifted matrix is still non-negative and has a null diagonal, i.e., it satisfies the conditions of Dominant Set Clustering.
One can think of performing a negative shift on the off-diagonal pairwise similarities to compute a finer representation of the clusters.However, this type of shift might violate the non-negativity and null diagonal constraints.On the other hand, according to our experiments, a negative shift is effectively equivalent to applying a larger cut-off threshold when peeling off the clusters.In [7] we have proposed such a shift to accelerate the appearance of clusters for DSC.
Shift of pairwise similarities for hierarchical clustering. Hierarchical clustering methods, unlike flat clustering, produce clusters at multiple levels. A main category of such methods first considers each object as a separate cluster and then, at each step, combines the two clusters with the minimal distance according to some criterion, until only one cluster is left at the highest level.
A cluster at an arbitrary level is represented by the set of objects that belong to it, e.g., by u or v. A hierarchical clustering solution can be represented by a dendrogram (tree) T such that i) each node v in T consists of a non-empty subset of the objects that belong to cluster v, and ii) overlapping clusters have a parent-child relation, i.e., one is the (grand) parent of the other.
We use dist(u, v) to refer to the inter-cluster distance between clusters u and v. It can be defined according to different criteria. Three common criteria for hierarchical clustering are single linkage, complete linkage and average linkage. Given the matrix of (inter-object) pairwise dissimilarities $D = \{D_{ij}\}$, $i, j \in O$, the single linkage criterion [48] defines the distance between every two clusters as the distance between their nearest members:

$$\mathrm{dist}(u, v) = \min_{i \in u,\, j \in v} D_{ij}.$$

On the other hand, complete linkage [26] considers the distance between their farthest members:

$$\mathrm{dist}(u, v) = \max_{i \in u,\, j \in v} D_{ij}.$$

Finally, average linkage [49] uses the average of the inter-cluster distances as the distance between the two clusters:

$$\mathrm{dist}(u, v) = \frac{1}{|u|\,|v|} \sum_{i \in u} \sum_{j \in v} D_{ij}.$$

In the following we show that these methods, which operate on pairwise inter-cluster distances, are shift-invariant (Proposition 1).

Proposition 1. Single linkage, complete linkage and average linkage methods are invariant with respect to the shift of the pairwise dissimilarities D by a constant α.

Proof. Let us denote the shifted pairwise dissimilarities by $D^{\alpha}$, i.e., $D^{\alpha}_{ij} = D_{ij} + \alpha$.

• With shifting all the pairwise dissimilarities by α, the dist(u, v) function for single linkage becomes

$$\mathrm{dist}^{\alpha}(u, v) = \min_{i \in u,\, j \in v} (D_{ij} + \alpha) = \min_{i \in u,\, j \in v} D_{ij} + \alpha = \mathrm{dist}(u, v) + \alpha.$$

Thus, if dist(u, v) ≤ dist(u, w) holds with respect to D, then it also holds with respect to $D^{\alpha}$ and vice versa, as both sides of the inequality change by the same constant α. Thus, shifting the pairwise dissimilarities by α does not change the order of merging the intermediate clusters, and hence the final dendrogram remains the same.

• With shifting all the pairwise dissimilarities by α, the dist(u, v) function for complete linkage becomes

$$\mathrm{dist}^{\alpha}(u, v) = \max_{i \in u,\, j \in v} (D_{ij} + \alpha) = \max_{i \in u,\, j \in v} D_{ij} + \alpha = \mathrm{dist}(u, v) + \alpha.$$

Thus, with the same argument as for single linkage, shifting the pairwise dissimilarities by α does not change the final complete linkage dendrogram.

• With shifting all the pairwise dissimilarities by α, the dist(u, v) function for average linkage becomes

$$\mathrm{dist}^{\alpha}(u, v) = \frac{1}{|u|\,|v|} \sum_{i \in u} \sum_{j \in v} (D_{ij} + \alpha) = \mathrm{dist}(u, v) + \alpha.$$

Thus, we use the same argument as for single linkage and complete linkage, and conclude that shifting the pairwise dissimilarities by α does not change the final average linkage dendrogram.
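Proposition 1 can be illustrated with a small sketch. It uses a naive agglomerative procedure written only for this purpose (the function names are hypothetical and it is not an efficient implementation) and exact rational arithmetic, so that ties are broken identically before and after the shift.

```python
from fractions import Fraction
from itertools import combinations
import random

def linkage_distance(D, u, v, method):
    """Inter-cluster distance between clusters u and v (sets of object indices)."""
    values = [D[i][j] for i in u for j in v]
    if method == "single":
        return min(values)
    if method == "complete":
        return max(values)
    if method == "average":
        return sum(values) / len(values)
    raise ValueError(f"unknown linkage: {method}")

def agglomerate(D, method):
    """Naive agglomerative clustering; returns the merged clusters in creation order."""
    clusters = [frozenset([i]) for i in range(len(D))]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest inter-cluster distance.
        u, v = min(combinations(clusters, 2),
                   key=lambda pair: linkage_distance(D, pair[0], pair[1], method))
        clusters = [c for c in clusters if c != u and c != v] + [u | v]
        merges.append(u | v)
    return merges

rng = random.Random(0)
n, alpha = 8, Fraction(7, 2)

# Random symmetric dissimilarities with a zero diagonal (exact rationals).
D = [[Fraction(0)] * n for _ in range(n)]
for i, j in combinations(range(n), 2):
    D[i][j] = D[j][i] = Fraction(rng.randint(1, 100))

# Shift every off-diagonal dissimilarity by alpha.
D_alpha = [[D[i][j] + alpha if i != j else Fraction(0) for j in range(n)]
           for i in range(n)]

invariant = {m: agglomerate(D, m) == agglomerate(D_alpha, m)
             for m in ("single", "complete", "average")}
print(invariant)
```

For each of the three criteria the sequence of merges, and therefore the dendrogram, is the same for D and for the shifted matrix.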
Another category of hierarchical clustering methods, such as centroid linkage and Ward linkage, operates directly on data features instead of pairwise dissimilarities. Centroid linkage computes a representative for each cluster and defines the inter-cluster distances according to those representatives. Similar to the case of K-means, shifting the data features by a constant does not change the pairwise inter-cluster distances. The Ward linkage [54] aims at minimizing the within-cluster variance at each step, i.e., the dist(u, v) is defined as where $g_u$ denotes the centroid vector of cluster u. Therefore, due to the shift invariance of the variance, the Ward linkage is also invariant with respect to the shift of data features. Thereby, we can state Proposition 2 as follows.
Proposition 2. Centroid linkage and Ward linkage are invariant with respect to the shift of data features.
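A brief numerical check of Proposition 2 is sketched below. It assumes the squared Euclidean distance between centroids for centroid linkage and the textbook Ward merge cost |u||v|/(|u|+|v|) · ||g_u − g_v||², which may differ from the exact expression used in this paper by a constant factor. All function names are hypothetical.

```python
import math
import random

def centroid(points):
    """Mean vector of a list of equally sized feature vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid_dist(u, v):
    """Centroid linkage: squared Euclidean distance between the cluster centroids."""
    return squared_distance(centroid(u), centroid(v))

def ward_dist(u, v):
    """Textbook Ward merge cost (assumed form): increase in within-cluster sum of squares."""
    weight = len(u) * len(v) / (len(u) + len(v))
    return weight * squared_distance(centroid(u), centroid(v))

def shift(points, offset):
    """Add the same offset vector to every data point."""
    return [[x + o for x, o in zip(p, offset)] for p in points]

rng = random.Random(1)
u = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(4)]
v = [[rng.gauss(5, 1) for _ in range(3)] for _ in range(7)]
offset = [10.0, -3.0, 0.5]
u_shifted, v_shifted = shift(u, offset), shift(v, offset)

# Both centroids move by the same offset, so their difference is unchanged.
same_centroid = math.isclose(centroid_dist(u, v),
                             centroid_dist(u_shifted, v_shifted), rel_tol=1e-9)
same_ward = math.isclose(ward_dist(u, v),
                         ward_dist(u_shifted, v_shifted), rel_tol=1e-9)
print(same_centroid, same_ward)
```

Both distances agree before and after the shift up to floating-point error.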
Finally, it is notable that some of the improvements proposed for hierarchical clustering still preserve the invariance property with respect to the shift of pairwise distances. For example, in order to improve the robustness of hierarchical clustering, it is suggested in [9] to first apply K-means with many centroids (of the order of n) and then apply the aforementioned hierarchical methods. Since both steps, i.e., K-means clustering and hierarchical clustering, are invariant with respect to the shift, the entire procedure remains invariant as well. The work in [8] studies extracting all mutual linkages at every step of hierarchical clustering, instead of only the smallest one, in order to provide adaptivity to diverse shapes of clusters. Since this contribution is independent of the way the inter-cluster distances are defined, this strategy yields a clustering that is invariant with respect to the shift of pairwise distances for methods such as single linkage, complete linkage and average linkage.

Experiments
We empirically investigate the performance of Shifted Min Cut and compare the results against several alternatives. We perform the experiments under identical computational settings on an Intel Core i7-4600U machine with a 2.7 GHz CPU and 8 GB of internal memory.
Data. We first perform our experiments on several UCI datasets [28], chosen from different domains and contexts with different types of features.
11. Teaching Assistant: consists of evaluations of teaching performance over 5 semesters of 151 teaching assistant assignments. The scores are divided into 3 roughly equal-sized categories ('low', 'medium' and 'high') to form the target variables, which are used as the cluster labels. The attributes are categorical and integer, where we use one-hot encoding for categorical attributes. There are no missing values.
12. User Knowledge Modeling: contains the knowledge status of 403 students on Electrical DC Machines with 5 integer attributes grouped in 4 categories. The labels and the cluster distributions are: 'very Low': 50, 'low': 129, 'middle': 122 and 'high': 130. There are no missing values.
In these datasets, the objects are represented by vectors. Thus, to obtain the pairwise similarity matrix X, we first compute the pairwise squared Euclidean distances between the vectors and obtain the matrix D. Then, as proposed in [7], we convert the pairwise distances D to the similarity matrix X via $X_{ij} = \max(D) - D_{ij} + \min(D)$, where the max(.) and min(.) operations respectively give the maximum and the minimum of the elements in D. An alternative transformation is an exponential function of the form $S_{ij} = \exp(-D_{ij}/\sigma^2)$, which requires fixing the free parameter σ in advance. However, this task is nontrivial in unsupervised learning, and the appropriate values of σ lie in a very narrow range [34]. Another alternative is the cosine similarity, which is better suited to textual and document datasets. On our datasets, we consistently obtain better results with the aforementioned transformation.
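The distance-to-similarity conversion can be sketched as follows. The function names are hypothetical, and the sketch assumes that max(D) and min(D) are taken over the off-diagonal entries of D. If the zero diagonal is included instead, min(D) is 0 and the transformation reduces to max(D) − D_ij.

```python
def squared_euclidean(points):
    """Matrix of pairwise squared Euclidean distances between feature vectors."""
    n = len(points)
    return [[sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
             for j in range(n)] for i in range(n)]

def distances_to_similarities(D):
    """X_ij = max(D) - D_ij + min(D), with max/min over off-diagonal entries (assumption)."""
    n = len(D)
    off_diagonal = [D[i][j] for i in range(n) for j in range(n) if i != j]
    d_max, d_min = max(off_diagonal), min(off_diagonal)
    return [[d_max - D[i][j] + d_min for j in range(n)] for i in range(n)]

points = [[0.0, 0.0], [1.0, 0.0], [4.0, 3.0]]
D = squared_euclidean(points)       # off-diagonal distances: 1, 25 and 18
X = distances_to_similarities(D)

print(X[0][1], X[0][2], X[1][2])    # 25.0 1.0 8.0
```

The transformation reverses the order of the distances: the closest pair receives the largest similarity (the former maximum distance), and the farthest pair receives the smallest one. The diagonal entries of X are not meaningful in this sketch and would typically be ignored or set separately.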
Methods. We compare Shifted Min Cut against several alternative methods developed for clustering. We consider the following methods: i) Dominant Set Clustering (DSC), ii) InImDyn, iii) P-Spectral Clustering (PSC), iv) Gaussian Mixture Model (GMM), v) K-means, vi) Power Iteration Clustering (PIC), and vii) Spectral Clustering (SC).
The chosen baselines belong to different clustering approaches, which cover a wide range of alternative viewpoints for clustering, e.g., those based on a cost function, probabilistic methods, game-theoretic methods and spectral methods. With the GMM method, we obtain the probabilistic assignment of the objects to the clusters and then assign each object to the most probable cluster. The developed clustering perspective can potentially be combined with recent developments proposed in particular for cost-based clustering methods. For example, a category of recent clustering methods aims to combine deep representation learning with clustering [13,57], or to develop approximate and distributed clustering methods. Such contributions are orthogonal to ours and, in principle, can be combined with Shifted Min Cut as well. On the other hand, considering the relation between Shifted Min Cut and Correlation Clustering, we have recently studied, together with co-authors [51], the performance of local search optimization compared to a wide range of approximate methods developed for Correlation Clustering, and have demonstrated both the efficiency and the effectiveness of the local search method.

Evaluation criteria. We have access to the ground truth solutions for the datasets. These labels may play the role of an expert (reference) that tells us the desired clustering solution. Thus, we can use them to evaluate the results of the different methods. We note that we do not employ them to infer the clustering solution; they are only used for evaluation. Therefore, we are still in the unsupervised setting, which assumes that no data label is used to obtain the results. This evaluation procedure is recommended in [37], consistent with several studies, e.g., [14,29,31,51,57]. Thereby, we compare the true (given) clustering labels and the estimated solutions to investigate quantitatively the performance of each method.
For this purpose, we consider three criteria:
1. adjusted Mutual Information [53]: the mutual information between the estimated and the true clustering solutions,
2. adjusted Rand score [24]: the similarity between the solutions, and
3. V-measure [43]: the harmonic mean of homogeneity and completeness.
We compute the adjusted variant of these criteria such that they give zero scores for random solutions.
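As an illustration of what an adjusted criterion computes, the following is a minimal, self-contained sketch of the adjusted Rand score using the standard pair-counting formula. The function name is hypothetical, and this is not necessarily the implementation used in the experiments.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand score from the contingency table of two labelings."""
    n = len(labels_true)
    if n < 2:
        return 1.0
    contingency = Counter(zip(labels_true, labels_pred))
    sizes_true = Counter(labels_true)
    sizes_pred = Counter(labels_pred)

    # Number of object pairs placed together in both / in each labeling.
    together_both = sum(comb(c, 2) for c in contingency.values())
    together_true = sum(comb(c, 2) for c in sizes_true.values())
    together_pred = sum(comb(c, 2) for c in sizes_pred.values())

    expected = together_true * together_pred / comb(n, 2)   # chance level
    maximum = (together_true + together_pred) / 2
    if maximum == expected:          # degenerate case, e.g. a single cluster in both
        return 1.0
    return (together_both - expected) / (maximum - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))   # 1.0 (same partition, renamed labels)
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))   # -0.5 (worse than chance)
```

Subtracting the expected value under random labelings is what makes the score zero on average for random solutions, while identical partitions score 1 regardless of how the cluster labels are named.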
Results. We study the performance of the different methods from two perspectives in order to distinguish between the quality of a method/cost function and its optimization. The former indicates how good a particular method/cost function is (given that it can be optimized properly), while the latter focuses on the optimization aspects of the method/cost function. We run each method 100 times with different random initializations. In the first type of study, we choose the best solution in terms of the cost or likelihood among the 100 different runs for each method. We note that we do not choose the best results in terms of the evaluation criteria. This helps to ensure that the optimization is done properly and that we do not suffer from very poor local optima, so that we can investigate the performance of the method or the cost function regardless of its optimization.

Tables 1, 2 and 3 show the results of the first type of study for the different clustering methods on the UCI datasets with respect to the adjusted Mutual Information criterion, the adjusted Rand score and the adjusted V-measure, respectively. We observe that on most of the datasets, Shifted Min Cut yields the best scores. In the cases where the method is not the best, it is usually among the top choices. DSC and InImDyn perform very similarly, consistent with the results in [4]. PIC works well only when there are few clusters in the dataset. The reason is that it computes a one-dimensional embedding of the data and then applies K-means. However, such an embedding might confuse some clusters when there are many of them in the dataset [7]. PSC is significantly slower than the other methods and also yields suboptimal results, as reported by several previous studies as well. The other methods are efficient and run within a few seconds.
In the second type of study, in order to investigate the optimization itself, we report the average scores and the respective standard deviations over the 100 different runs for each method. We note that DSC, InImDyn and PSC are non-randomized algorithmic procedures that do not show randomness in their performance, and their results are stable among different runs. Therefore, we do not need to report their results here. Tables 4, 5 and 6 show such optimization variability results for the different UCI datasets (i.e., the average results and the respective standard deviations shown in brackets). We observe that the results are consistent among different runs and that the better methods in Tables 1, 2 and 3 perform well on average too, i.e., the results from the first type of study and the second type of study are consistent overall. In particular, Shifted Min Cut yields the most promising results in this type of study as well. The results also confirm the effectiveness of the optimization based on local search, a method that is nowadays used widely in different machine learning paradigms.
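The two types of study can be summarized in a short sketch. Here `run_once` is a hypothetical stand-in for one randomized run of a clustering method (it returns a cost and a labeling for a toy problem) and `rand_index` is a toy evaluation score. Neither corresponds to the actual methods or criteria used in the experiments.

```python
import random
import statistics
from itertools import combinations

N = 6
TRUTH = [0, 0, 0, 1, 1, 1]   # toy reference labels, used only for evaluation
SIM = [[1 if (i < 3) == (j < 3) else 0 for j in range(N)] for i in range(N)]

def run_once(seed):
    """Hypothetical randomized run: a random balanced bipartition and its cut cost."""
    rng = random.Random(seed)
    order = list(range(N))
    rng.shuffle(order)
    labels = [0] * N
    for i in order[N // 2:]:
        labels[i] = 1
    # Cost: total similarity between objects assigned to different clusters.
    cost = sum(SIM[i][j] for i, j in combinations(range(N), 2)
               if labels[i] != labels[j])
    return cost, labels

def rand_index(labels):
    """Toy evaluation score: fraction of object pairs treated as in the reference."""
    pairs = list(combinations(range(N), 2))
    agree = sum((labels[i] == labels[j]) == (TRUTH[i] == TRUTH[j]) for i, j in pairs)
    return agree / len(pairs)

def summarize_runs(run_once, evaluate, n_runs=100):
    runs = [run_once(seed) for seed in range(n_runs)]
    # First type of study: select the run with the best cost, never the best score.
    _, best_labels = min(runs, key=lambda r: r[0])
    # Second type of study: mean and standard deviation of the score over all runs.
    scores = [evaluate(labels) for _, labels in runs]
    return {"best_by_cost": evaluate(best_labels),
            "mean": statistics.mean(scores),
            "std": statistics.stdev(scores)}

summary = summarize_runs(run_once, rand_index, n_runs=100)
print(summary)
```

Selecting by cost keeps the reference labels out of the model selection step, so the reported score reflects the cost function rather than a search over the evaluation criterion.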
Experiments on real-world data.In the following, we investigate the performance of different clustering methods on two real-world datasets:

DS2: In this dataset, we collect articles about 5 different Computer Science subjects: 'artificial intelligence', 'software', 'hardware', 'networks' and 'algorithms'. For each category, we collect 1500 articles; thus, in total there are 7500 articles in this dataset. We compute the tf-idf vectors for each article, so the attributes are numerical. There are no missing values.
Similar to the experiments on the UCI datasets, we first study the performance of the methods when the optimization is performed properly, i.e., when we pick the best results in terms of the cost function or the likelihood over 100 different runs. Tables 7 and 8 show the performance of the different clustering methods with respect to the evaluation criteria on DS1 and DS2. We observe that only Shifted Min Cut yields high scores with respect to all criteria. In most of the cases, Shifted Min Cut results in the best scores. Otherwise, it is still competitive compared to the best choice.
Finally, we study the optimization variability, i.e., the average results and the respective standard deviations among the 100 runs. The results with respect to the different evaluation criteria are shown in Tables 9 and 10, corresponding to DS1 and DS2. Similar to the experiments on the UCI datasets, we observe that the optimization variability results follow the same trend as the results in Tables 7 and 8. This indicates that the average results are consistent with the results obtained based on the best values of the cost function or the likelihood. Moreover, Shifted Min Cut yields the most promising results both on average and when choosing the best solutions in terms of cost/likelihood.

Conclusion
This paper investigates an alternative approach for regularizing the Min Cut cost function in order to avoid the appearance of singleton clusters, where the regularization term is added to the cost function instead of dividing the Min Cut cost by a cluster-dependent factor. We, in particular, studied the case where the regularization term leads to subtracting the regularization factor from the pairwise similarities. Then, we only need to apply the base Min Cut, but on the (adaptively) shifted similarities instead of the original data. We then developed an efficient local search algorithm to optimize (locally) the Shifted Min Cut cost function and studied its fast theoretical convergence rate. Thereafter, we discussed that, unlike Min Cut, many other common clustering cost functions are invariant with respect to the shift of pairwise similarities. Finally, we performed extensive experiments on several UCI and real-world datasets to demonstrate the superior performance of Shifted Min Cut according to different evaluation criteria.

$\sum_{k=1}^{K} |O_k| = n$. The order would not change the minimum.

Figure 2: The impact of the shift parameter α on the results of the Shifted Min Cut cost function. A very large α might split large clusters instead of separating true small clusters.

Table 1: Performance of different methods with respect to the adjusted Mutual Information criterion. Shifted Min Cut yields the best results in most of the cases.

1. Breast Tissue: contains 106 electrical impedance measurements of the breast tissue samples in 6 types (clusters), each with 10 features. The types or clusters are 'car' (carcinoma, 21 measurements), 'fad' (fibro-adenoma, 15 measurements), 'mas' (mastopathy, 18 measurements), 'gla' (glandular, 16 measurements), 'con' (connective, 12 measurements) and 'adi' (adipose, 22 measurements). The features are real-valued with no missing values.
2. Cloud: consists of 2048 vectors, where each vector includes 10 parameters in two types (each of size 1024) representing AVHRR images. The vectors (attributes) are real-valued and there are no missing values. The target clusters are balanced.
3. Ecoli: a biological dataset on the cellular localization sites of 7 types (clusters) of proteins, which includes 336 samples. The samples are represented by 8 real-valued features. The sizes of the clusters are: 143, 77, 3, 7, 35, 20 and 52.
4. Forest Type Mapping: a remote sensing dataset of 523 samples with 27 real-valued attributes collected from forests in Japan and grouped in 4 different forest types (clusters). The clusters are: 's' ('Sugi' forest, 159 samples), 'h' ('Hinoki' forest, 86 samples), 'd' ('Mixed deciduous' forest, 195 samples), 'o' ('Other' non-forest land, 83 samples).
5. Heart: dataset of heart disease that involves 303 instances, each with 75 attributes. The attributes are diverse: categorical, integer and real, where the categorical attributes are treated using one-hot encoding. The missing values are estimated by the median of the respective feature. Cluster distributions are: 164, 55, 36, 35 and 13.
6. Lung Cancer: high-dimensional lung cancer data with 32 instances (with distribution 9 and 23) and 56 integer features. There are few missing values, estimated using the median of the respective feature.
7. Parkinsons: contains 197 biomedical voice measurements from 31 people, each represented by 23 real-valued attributes that correspond to voice recordings. In the dataset, there are 48 healthy samples and 147 other samples that belong to one of 23 people with Parkinson's disease.
8. Pima Indians Diabetes: the data of 768 female patients of Pima Indian heritage with 8 attributes. The attributes include the number of pregnancies of the patient, their BMI, insulin level, age, and so on, and they are either real numbers or integers. 268 samples out of 768 have the outcome 1 and the others (500 samples) have the outcome 0.
9. SPECTF: describes diagnosing of cardiac Single Photon Emission Computed Tomography (SPECT) images with 44 integer attributes (values from 0 to 100) about the heart of 267 patients. The diagnosis is binary with the distribution of 55 and 212 samples.
10. Statlog ACA (Australian Credit Approval): contains information of 690 credit card applications, each described with 14 features (with cluster sizes 383 and 307). The features are categorical and numerical, where for categorical features we use one-hot encoding. The few missing values are estimated using the median of the respective feature.

Table 2: Performance of different methods with respect to the adjusted Rand score. Shifted Min Cut leads to better clustering solutions on most of the datasets.

Table 3: Performance of different methods with respect to the adjusted V-measure. Consistent with the two previous evaluation criteria, the Shifted Min Cut method provides the best clustering results on most of the datasets.

Table 4: Average performance (and the standard deviation shown in brackets) for different methods over 100 runs with respect to adjusted Mutual Information, where Shifted Min Cut often yields the most promising results.

Table 5: Average performance (and the standard deviation) for different methods with respect to adjusted Rand score. Consistent with adjusted Mutual Information, Shifted Min Cut yields the best results on most of the datasets.

Table 6: Average performance (and the standard deviation) for different methods with respect to adjusted V-measure. We observe that the average results are consistent with the results from the first type of study and that Shifted Min Cut performs well compared to the alternatives.

Table 7: Performance of different methods on DS1. On this dataset, Shifted Min Cut yields superior results compared to the alternatives.

DS1: This dataset, collected by a document processing company, contains the vectors of 675 scanned documents, wherein each document is represented in a 4096-dimensional space using different textual, image, structural and other features. The documents are placed within 56 clusters of different sizes, which makes the clustering task challenging. The size of the clusters varies from a few documents to more than 200 documents. The features are real-valued.

Table 8: Performance of different methods on DS2, where Shifted Min Cut leads to the best overall performance.

Table 9: Average performance (and the standard deviation shown in brackets) for different methods over 100 runs with respect to different evaluation criteria on DS1.

Table 10: Average performance (and the standard deviation) for different methods over 100 runs with respect to different evaluation criteria on DS2. On both DS1 and DS2, Shifted Min Cut yields more promising results overall.