1 Introduction

Given a set of data points in a normed vector space and a number of clusters, the clustering problem consists in deciding which data point should be assigned to which cluster. Moreover, a representative point for each cluster needs to be determined. Clustering problems form a highly relevant sub-class of unsupervised learning in machine learning and computational statistics. Their relevance is underlined by many applications, e.g., in functional data analysis [10, 53], image processing [9], bio-informatics [12], economics [32], and social sciences [31]. For a detailed survey of the history of clustering problems we refer to Steinley [58]. Depending on, e.g., how vicinity is measured and on whether the representative is an arbitrary point or one of the data points, different variants of clustering problems arise. In this paper, we consider the minimum sum-of-squares clustering (MSSC) problem. Here, the distance between a data point and its cluster representative is measured using the squared Euclidean norm and any point can be chosen as the representative of a cluster.

Modeling this problem leads to a nonconvex mixed-integer nonlinear optimization problem (MINLP) that is extremely hard to solve for high-dimensional real-world instances. Moreover, the problem is known to be NP-hard even in the case of two dimensions; see, e.g., Aloise et al. [11], Dasgupta [2], and Mahajan et al. [41]. This is why the problem is most frequently solved using heuristics, of which the k-means clustering method is the most prominent one; see, e.g., Lloyd [39] and MacQueen [40]. However, solving such clustering problems only heuristically may come with severe disadvantages. Since the MSSC problem is an unsupervised learning problem, the outcome typically requires the interpretation of experts from the specific field of application such as medicine or the social sciences. This interpretation, however, may be completely wrong if the expert is confronted with a heuristic clustering solution of poor quality, and it is easy to imagine that such a misleading interpretation might have severe, e.g., medical, consequences. Thus, there is a strong need for sophisticated optimization techniques that improve the process of solving clustering problems to global optimality, and this is exactly the contribution of this paper: We take the MINLP solver SCIP and enhance its solution process by developing novel mixed-integer optimization techniques that enable us to solve MSSC instances to global optimality that cannot be solved with the plain version of SCIP.

Of course, we are not the first to try to solve the MSSC problem to global optimality. To the best of our knowledge, the earliest application of branch-and-bound methods is presented by Fukunaga et al. [25], which was later refined by Diehr [15]. A variant of a so-called repetitive branch-and-bound method has been devised by Brusco [7], who concludes that the method is well-suited for a small number of clusters. Another branch-and-bound approach is presented by Sherali and Desai [55]. The authors use reformulation-linearization techniques (RLT) embedded in a branch-and-bound method to solve the problem to global optimality. In their introduction, they also note the “limited number of optimization techniques” as opposed to the rather large number of heuristics used in practice. As an additional technique, the authors present further valid inequalities to tackle the inherent symmetry of the problem. Regarding symmetry breaking for clustering problems, we also refer to Plastria [48], which is, generally speaking, a modeling tutorial paper but also contains a discussion of symmetry breaking constraints for the clustering problem. Aloise and Hansen [4] also consider the MSSC problem and try to reproduce the results of Sherali and Desai [55]. However, the reproduction failed since significantly longer running times were observed. Consequently, the reported computational efficiency of the RLT-based branch-and-bound method should be taken with some care. Another algorithmic technique is column generation, first presented in Merle et al. [18] and later reconsidered and improved by Aloise et al. [5]. Other classic techniques of mixed-integer (non)linear optimization have also been applied, such as generalized Benders decomposition in Floudas et al. [21] and Tan et al. [59]. Alternatively, Peng and Xia [45] consider the MSSC problem as a concave minimization problem and adapt Tuy’s cut method, see Horst and Tuy [34], to solve it. Further, Prasad and Hanasusanto [49] propose improved conic reformulations of the MSSC problem and also study some symmetry breaking techniques. Tîrnăucă et al. [60] follow a more geometric approach based on Voronoi diagrams. Finally, there is a rather large branch of literature on the application of techniques from semidefinite programming (SDP). Peng and Wei [46] and Peng and Xia [44] proved the equivalence between the MSSC problem and a 0-1 SDP reformulation. Based on this 0-1 SDP model, Aloise and Hansen [3] propose a branch-and-cut algorithm and solve instances with up to 202 data points to global optimality. More recently, Piccialli et al. [47] consider the same mixed-integer SDP for the MSSC problem and propose another branch-and-bound algorithm that is capable of solving real-world instances with up to 4000 data points. To the best of our knowledge, this is the most recent state-of-the-art branch-and-bound algorithm for the MSSC problem. SDP-like models have also been used in De Rosa and Khajavirad [13], who encode the clustering via \(Z = X X^{\top } \in [0,1]^{n \times n}\) instead of \(X \in \{0,1\}^{n \times k}\). The authors derive cutting planes and relate them to the cut polytope; see Deza and Laurent [14] for a survey on the latter. The presented numerical experiments show that these novel cutting planes can be strong, but the authors only solve the initial LP relaxation and do not apply a complete branch-and-bound method.
Finally, some recent ideas based on reduced-space techniques seem to be very promising; see Hua et al. [35] and Liberti and Manca [38]. Besides that, Liberti and Manca [38] discuss the MSSC problem with several side constraints. One of their base models is, in particular, the convex MINLP that we present in the next section.

In our contribution, we add to the literature on solving the MSSC problem to global optimality. To this end, we develop novel mixed-integer programming techniques that are mainly motivated by geometric insights and that improve the branch-and-cut solution process of an MINLP solver. To be more precise, we present two MINLP formulations of the problem (Sect. 2), develop cutting planes (Sect. 3), propagation methods (Sect. 4), as well as problem-specific branching rules (Sect. 5) and primal heuristics (Sect. 6). We implement and test all techniques in the open-source MINLP solver SCIP; see Gamrath et al. [26]. By doing so, we also automatically apply state-of-the-art symmetry breaking techniques to the problem; see Sect. 7. Our numerical results are presented and discussed in Sect. 8, where we show that our techniques significantly improve the solution process. We close the paper with some concluding remarks and potential topics for future work in Sect. 9. Our code is publicly available on GitHub. Although our numerical results clearly show that the solution process of an MINLP solver applied to the MSSC problem is significantly improved, we do not beat the current state-of-the-art SDP-based techniques as studied in Piccialli et al. [47]. Nevertheless, we are convinced that it is worthwhile to also push MINLP-based approaches forward so that, in the end, techniques from different approaches can be combined into an even better, possibly hybrid, solution method.

2 MINLP models for the MSSC problem

We now model the minimum-sum-of-squares clustering (MSSC) problem as a mixed-integer nonlinear optimization problem (MINLP). To this end, we are given a set of data points \(p \in P \subseteq \mathbb {R}^d\) and a positive integer \(2 \le k \le |P|\), which is the number of clusters of the problem. The task is then to assign every data point \(p \in P\) to a cluster (indexed by \(j \in [k] {:}{=}\{1, \dotsc , k\}\)) so that the sum of the squared Euclidean distances between the data points and the corresponding centroids \(c^j\) is minimal. This problem is modeled via the following MINLP:

$$\begin{aligned} \min _{x, c} \quad&\sum _{p \in P} \sum _{j \in [k]} \, x_{pj} \Vert p - c^j\Vert ^2 \end{aligned}$$
(1a)
$$\begin{aligned} {{\,\mathrm{s.t.}\,}}\quad&\sum _{j \in [k]} x_{pj} = 1, \quad p \in P, \end{aligned}$$
(1b)
$$\begin{aligned} \quad&x_{pj} \in \{0,1\}^{}, \quad p \in P, \ j \in [k], \end{aligned}$$
(1c)
$$\begin{aligned} \quad&c^j \in B, \quad j \in [k]. \end{aligned}$$
(1d)

The binary variables \(x_{pj}\) are the assignment variables that model whether the data point p is assigned to cluster j \((x_{pj} = 1)\) or not \((x_{pj} = 0)\). Moreover, \(B \subseteq \mathbb {R}^d\) is a set that contains all points in P. This can, e.g., be the bounding box of P. That is, if for each \(i \in [d]\), \(\ell _i = \min \{p_i:p \in P\}\) and \(u_i = \max \{p_i:p \in P\}\), then \(B = \{c \in \mathbb {R}^d: \ell _i \le c_i \le u_i,\; i \in [d]\}\) is a valid choice. Note that (1d) is not necessary for the correctness of Model (1). Nevertheless, we include it in our implementation, because Model (1) is a nonconvex MINLP, for which bounds on variables are usually beneficial. The objective function measures the sum of the squared Euclidean distances between the data points and the centroids of the clusters to which they belong. Finally, Constraint (1b) ensures that every point is assigned to exactly one cluster.

Note that this model is cubic since the objective function multiplies the assignment variables x with squared norms that depend on the centroids c, which are variables of the problem as well. In particular, Model (1) is a nonconvex MINLP. However, it can also be re-written as a convex MINLP in a lifted space by using its epigraph formulation. To this end, we model each term in the objective function using a separate variable and bound this variable in a newly introduced constraint. The resulting problem then reads

$$\begin{aligned} \min _{x,c,\eta } \quad&\sum _{p \in P} \sum _{j \in [k]} \eta _{pj} \end{aligned}$$
(2a)
$$\begin{aligned} {{\,\mathrm{s.t.}\,}}\quad&\eta _{pj} \ge \Vert p - c^j\Vert ^2 - M_p(1-x_{pj}), \quad p \in P,\ j \in [k], \end{aligned}$$
(2b)
$$\begin{aligned}&\sum _{j \in [k]} x_{pj} = 1, \quad p \in P, \end{aligned}$$
(2c)
$$\begin{aligned}&x_{pj} \in \{0,1\}^{}, \quad p \in P,\ j \in [k], \end{aligned}$$
(2d)
$$\begin{aligned}&c^j \in B, \quad j \in [k], \end{aligned}$$
(2e)
$$\begin{aligned}&\eta _{pj} \ge 0, \quad p \in P, \ j \in [k], \end{aligned}$$
(2f)

where \(M_p\) are sufficiently large numbers. The objective function is linear now and we obtain the additional quadratic and convex constraints in (2b).

For every \(p \in P\), \(M_p\) can be chosen to be the maximum squared distance of p to any other point \(\tilde{p} \in P\). An overestimate can easily be computed via

$$\begin{aligned} M_p = M = (u_1 - \ell _1)^2 + \cdots + (u_d - \ell _d)^2, \end{aligned}$$

where \(\ell _i\) and \(u_i\) are the componentwise bounds of the bounding box given above. Moreover, for a given cluster assignment x, an optimal choice for the cluster centroids is immediate, an observation that we will exploit frequently.

Observation 2.1

For a given assignment of x-variables adhering to (1b) or (2c), respectively, the optimal choice for \(c^j\), \(j \in [k]\), is

$$\begin{aligned} \frac{\sum _{p \in P} p \, x_{pj}}{\sum _{p \in P} x_{pj}}, \end{aligned}$$

i.e., the barycenter of all points assigned to cluster j.
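To make Observation 2.1 and the big-M overestimate above concrete, the following minimal Python sketch (the helper names are ours and purely illustrative; the data points are assumed to be given as a NumPy array) computes the barycenters and the MSSC objective value for a given assignment as well as the overestimate M from the bounding box of P.

```python
import numpy as np

def centroids_and_objective(P, assign, k):
    """Barycenters (Observation 2.1) and MSSC objective for a given assignment.
    P: (n, d) array of data points; assign: length-n vector with values in {0, ..., k-1}."""
    c = np.zeros((k, P.shape[1]))
    obj = 0.0
    for j in range(k):
        Q = P[assign == j]
        if len(Q) > 0:
            c[j] = Q.mean(axis=0)                       # barycenter of cluster j
            obj += np.sum(np.linalg.norm(Q - c[j], axis=1) ** 2)
    return c, obj

def big_M(P):
    """Overestimate M = sum_i (u_i - l_i)^2 from the bounding box of P."""
    lo, hi = P.min(axis=0), P.max(axis=0)
    return float(np.sum((hi - lo) ** 2))
```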

3 Cutting planes

Without doubt, cutting planes are among the most powerful techniques to enhance the solution process for mixed-integer problems. Modern MI(N)LP solvers have many general-purpose cutting planes built in. However, it is very often beneficial to derive problem-specific cutting planes. This is particularly important for the MSSC problem, since it is well-known that one of the most challenging issues for developing an efficient branch-and-bound algorithm is the computation of good lower bounds in a reasonable amount of time. In this section, we state two tailored families of cutting planes. The first one is applicable to Models (1) and (2), whereas the second one is only applicable to Model (2).

3.1 Cardinality cuts

We first briefly discuss cardinality cuts, which are already mentioned in Aloise et al. [5] as well as in Sherali and Desai [55]. Consider an optimal solution of Model (1). In this optimal solution, there cannot be an empty cluster, because otherwise the objective value could be decreased by assigning a point that is not a centroid to that empty cluster. At the other extreme, a cluster contains at most \(|P| - k + 1\) data points, which is attained if every other cluster consists of a single data point. Thus, the following cardinality cuts can be added to tighten Model (1):

$$\begin{aligned} 1 \le \sum _{p \in P} x_{pj} \le |P| - k + 1, \quad j \in [k]. \end{aligned}$$

Obviously, the same cuts are also valid for Model (2). Moreover, note that the upper bound is implied by the model’s constraints and the lower bound of the previous inequalities as \(\sum _{p \in P} \sum _{j = 1}^k x_{pj} = |P|\) implies for a fixed \(j \in [k]\) that \(\sum _{p \in P}x_{pj} = |P| - \sum _{p \in P} \sum _{j' \in [k] {\setminus } \{j\}} x_{pj'} \le |P| - (k-1)\).

The idea of cardinality cuts can also be localized, i.e., the cardinality bounds can be adapted to take local variable bounds at a node of the branch-and-bound tree into account. To this end, we introduce, for each \(j \in [k]\), the integer variable \(\kappa _j\) with domain \(\{k-1,\dots ,|P| - 1\}\) and link it to the x-variables via the linear constraint \(\kappa _j + \sum _{p \in P} x_{pj} = |P|\), \(j \in [k]\). That is, \(\kappa _j\) counts the number of data points that are not assigned to cluster j. If a lower bound \(\underline{\kappa }_j\) and an upper bound \(\bar{\kappa }_j\) on \(\kappa _j\) are given, this equation implies the inequalities

$$\begin{aligned} 1 \le |P| - \bar{\kappa }_j \le \sum _{p \in P} x_{pj} \le |P| - \underline{\kappa }_j \le |P| - k + 1, \quad j \in [k]. \end{aligned}$$

That is, they describe localized versions of cardinality cuts that get stronger if x-variables get fixed. Another side effect of the auxiliary variables \(\kappa _j\) is that a solver might decide to branch on these variables. In doing so, it imposes bounds on the size of cluster \(j \in [k]\).

3.2 Outer approximation cuts

We now focus on Model (2). The only nonlinear constraints (2b) in this problem are convex. Hence, their first-order Taylor approximations at any point \((\bar{\eta }, \bar{c}, \bar{x})\) are global underestimators and thus provide valid inequalities that are linear in \((\eta , c, x)\):

$$\begin{aligned} \sum _{i=1}^d \left( 2\bar{c}_i^j c_i^j - 2 p_i c_i^j + (p_i)^2 - (\bar{c}_i^j)^2 \right) - \eta _{pj} - M_p(1-x_{pj}) \le 0, \quad p \in P,\ j \in [k]. \end{aligned}$$
(3)

This allows us to solve Model (2) in an outer approximation or LP/NLP-based branch-and-bound fashion; see Duran and Grossmann [17] and Fletcher and Leyffer [20] or Quesada and Grossmann [50], respectively. We start by relaxing the constraint set (2b). Next, we assume that \((\bar{\eta }, \bar{c}, \bar{x})\) is a solution of this relaxation, i.e., it particularly fulfills the binary conditions (2d). If the relaxation’s solution is feasible for the nonlinear constraints (2b), it is also a solution of Model (2). If not, we can compute a feasible point \((\hat{\eta }, \hat{c}, \bar{x})\) of Model (2). In the original outer approximation method, this is done by fixing the binary variables to \(\bar{x}\) in Model (2) and solving the resulting convex NLP subproblem; see Duran and Grossmann [17]. The benefit in our specific application is that solving the subproblem boils down to a simple computation of the barycenters \(\hat{c}\), see Observation 2.1, followed by an evaluation of the distances \(\hat{\eta }\) according to Constraints (2b) and (2f).

From the theory of outer approximation, it is well-known that when adding the inequalities (3) at the solution \((\hat{\eta }, \hat{c}, \bar{x})\) of the subproblem, it holds

$$\begin{aligned} \sum _{p \in P} \sum _{j \in [k]} \eta _{pj} \ge \sum _{p \in P} \sum _{j \in [k]} \hat{\eta }_{pj} \end{aligned}$$

for all feasible points \((\eta , c, \bar{x})\) of the updated relaxation. In other words, adding the outer-approximation cuts bounds the optimal objective value of the relaxation with fixed binaries \(x=\bar{x}\) from below by \(\sum _{p \in P} \sum _{j \in [k]} \hat{\eta }_{pj}\). Consequently, the updated relaxation yields a solution with a new, previously unseen cluster assignment x or the optimality gap is closed. Thus, iterating this process terminates after a finite number of steps; see Duran and Grossmann [17] or Fletcher and Leyffer [20] for more details. We note that the number of inequalities (3) does not depend on the dimension d, which might be beneficial for problems in higher dimensions.

Instead of implementing an LP/NLP-based branch-and-bound method from scratch, we can use solvers such as SCIP to solve Model (2). In this setting, we can separate and add cuts (3) to tighten the LP relaxations. Since for larger \(|P|\) and k adding all inequalities (3) might be impractical, we may also add only a limited number of cuts. In particular, in our implementation in SCIP, we add only 10 cuts per separation round.
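As an illustration of this separation step, the following Python sketch (our own naming; the tolerance is an assumption, the limit of 10 cuts per round is the one stated above) evaluates the violation of the constraints (2b) at the current LP point and returns the coefficients of the most violated outer-approximation cuts (3), here linearized at the LP values of the centroid variables, which is one of several possible linearization points.

```python
import numpy as np

def separate_oa_cuts(P, c_bar, x_bar, eta_bar, M, max_cuts=10, tol=1e-6):
    """Return the coefficients of the (at most) max_cuts most violated cuts (3)
    at the LP point (eta_bar, c_bar, x_bar).  Each cut reads
    2 (c_bar^j - p)^T c^j + M_p x_pj - eta_pj <= M_p + ||c_bar^j||^2 - ||p||^2."""
    cuts = []
    for p_idx, p in enumerate(P):
        for j, cj in enumerate(c_bar):
            viol = (np.sum((p - cj) ** 2) - M[p_idx] * (1.0 - x_bar[p_idx, j])
                    - eta_bar[p_idx, j])
            if viol > tol:
                coef_c = 2.0 * (cj - p)            # gradient of ||p - c||^2 at c_bar^j
                rhs = M[p_idx] + np.dot(cj, cj) - np.dot(p, p)
                cuts.append((viol, p_idx, j, coef_c, M[p_idx], rhs))
    cuts.sort(key=lambda cut: -cut[0])             # most violated cuts first
    return cuts[:max_cuts]
```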

4 Propagation

Suppose we are at a node of the branch-and-bound tree. Due to branching decisions and further reductions, some variables might have been fixed or their bounds have been tightened in comparison to the original problem formulation. The aim of propagation is to find further variable fixings or bound tightenings that are valid at the current node. That is, one tries to apply further reductions based on local variable bound information. According to Observation 2.1, every assignment of x-variables that satisfies (1b) or (2c) can be extended to a feasible solution of (1) or (2), respectively. Thus, it is crucial to derive propagation mechanisms that exclude assignments of x-variables that cannot be optimal. Moreover, we develop algorithms to strengthen bounds of the c-variables and the objective variables. Before we discuss our propagation algorithms, we fix the following notation and terminology.

For every \(j \in [k]\), we denote by \(P_j \subseteq P\) the set of all data points \(p \in P\) whose corresponding variable \(x_{pj}\) has been fixed to 1 at the current node of the branch-and-bound tree. That is, we have already decided to assign p to cluster j. Moreover, we denote by \(P'_j \subseteq P\) all data points p such that \(x_{pj}\) has not been fixed to 0 yet, i.e., p is already or can still be assigned to cluster j. Note that \(P_j \subseteq P'_j\). For a continuous variable z, i.e., for the c- and \(\eta \)-variables, we denote by \(\underline{z}\) and \(\bar{z}\) the lower and upper bound on z at the current node, respectively.

4.1 Barycenter propagation

Given a non-empty set of data points \(Q \subseteq P\) defining a cluster, the optimal choice for its centroid is the barycenter

$$\begin{aligned} \mathcal {C}(Q) {:}{=}\frac{1}{|Q|}\sum _{p \in Q} p \end{aligned}$$

of all data points in Q. The respective sum of all squared distances thus is \(\mathcal {D}(Q) = \sum _{p \in Q} \Vert p - \mathcal {C}(Q)\Vert ^2\). The idea of the barycenter propagation is to use this observation to find lower bounds on the objective and to strengthen the bounds for the c-variables.

4.1.1 Bound tightening for the objective function values

To find a lower bound on the objective in Model (1), note that for sets \(Q \,{\subseteq }\, Q' \,{\subseteq }\, P\), we have \(\mathcal {D}(Q) \le \mathcal {D}(Q')\). Consequently, a lower bound on the objective is given by \(\sum _{j \in [k]} \mathcal {D}(P_j)\). The barycenter propagator uses this value to possibly tighten the lower bound on the objective at the current node of the branch-and-bound tree. Computing this lower bound for all clusters can be done in \(O(kd|P|)\) time and it has also been used by Brusco [7], see also Guns et al. [30], in a repetitive branch-and-bound framework.
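A minimal sketch of this bound computation (the helper name is ours; P_fixed collects, per cluster, the points whose assignment variable is already fixed to 1 at the current node):

```python
import numpy as np

def barycenter_lower_bound(P_fixed):
    """sum_j D(P_j): a valid lower bound on the objective of Model (1),
    given one (m_j, d) array of already-fixed points per cluster."""
    bound = 0.0
    for Q in P_fixed:
        if len(Q) > 0:
            center = Q.mean(axis=0)                # barycenter C(P_j)
            bound += np.sum(np.linalg.norm(Q - center, axis=1) ** 2)
    return bound
```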

For Model (2), no immediate lower bound on the objective can be enforced because the objective is decoupled via the \(\eta \)-variables. Nevertheless, for each \((p,j) \in P \times [k]\), the following steps can be done. We can prune a node of the branch-and-bound tree if \(p \notin P'_j\) and \(\underline{\eta }_{pj} > 0\), because an optimal solution has \(\eta _{pj} = 0\) as data points not assigned to a cluster do not contribute to \(\mathcal {D}(P_j)\). Otherwise, if \(p \notin P'_j\) and \(\underline{\eta }_{pj} = 0\), we can fix \(\eta _{pj}\) to 0. The first step is thus a pruning operation based on sub-optimal bounds in the subproblem, whereas the second step is a bound tightening operation.

4.1.2 Bound tightening for the centroids

Recall that \([k] = \{1,\dots ,k\}\). Besides strengthening bounds on the objective, barycenter information can also be used to tighten bounds on centroid variables \(c^j_i\) with \((i,j) \in [d] \times [k]\). Suppose \(P'_j {\setminus } P_j = \{p^1,\dots ,p^s\}\) such that \(p^1_i \le p^2_i \le \dots \le p^s_i\). For each \(r \in [s]_0{:}{=}[s] \cup \{0\}\), we compute \(\gamma ^{j,r} = \mathcal {C}(P_j \cup \{p^1,\dots ,p^r\})\), i.e., the barycenter of the data points contained in \(P_j\) and the data points with the r smallest ith coordinates that are not contained in \(P_j\). As we show next, the ith coordinates of these barycenters can be used to compute a lower bound on \(c^j_i\).

Lemma 4.1

A valid lower bound on \(c^j_i\) is given by \(\min _{r \in [s]_0} \gamma ^{j,r}_i\).

Proof

Let \(Q \subseteq P'_j \setminus P_j\) and assume \(|Q| = r\). Then, \(\mathcal {C}(P_j \cup Q)_i \ge \mathcal {C}(P_j \cup \{p^1,\dots ,p^r\})_i\), because the points \(p^1, \dots , p^r\) are points with the r smallest ith coordinates. Consequently, to find a lower bound on the centroids, it is sufficient to consider \(\mathcal {C}(P_j \cup \{p^1,\dots ,p^r\})\) for each \(r \in [s]_0\). \(\square \)

Analogously, an upper bound is given by \(\max _{r \in [s]_0} \mathcal {C}(P_j \cup \{p^{s-r},\dots , p^s\})_i\). Since computing an iterative sequence of barycenters can be done using the formula

$$\begin{aligned} \gamma ^{j,r+1} = \frac{(|P_j| + r)\gamma ^{j,r} + p^{r+1}}{|P_j| + r+1}, \end{aligned}$$

we can compute the minimum and maximum values for all coordinates and clusters in \(O(kd|P|)\) time.
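The following sketch (illustrative naming) computes the lower bound of Lemma 4.1 for one coordinate and one cluster using the iterative barycenter update; the upper bound is obtained analogously by sorting in decreasing order.

```python
import numpy as np

def centroid_coordinate_lower_bound(P_fixed_j, P_free_j, i):
    """Lower bound on c^j_i (Lemma 4.1).  P_fixed_j: (m, d) array of points fixed
    to cluster j; P_free_j: (s, d) array of points in P'_j that are not in P_j."""
    order = np.argsort(P_free_j[:, i])             # p^1, ..., p^s by i-th coordinate
    n = len(P_fixed_j)
    if n > 0:
        gamma = P_fixed_j[:, i].mean()             # i-th coordinate of gamma^{j,0}
        best = gamma
    else:
        gamma, best = 0.0, np.inf                  # gamma^{j,0} undefined for empty P_j
    for r, idx in enumerate(order):
        gamma = ((n + r) * gamma + P_free_j[idx, i]) / (n + r + 1)
        best = min(best, gamma)
    return best
```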

4.2 Convexity and cone propagation

Based on optimality arguments, we can also derive rules to assign data points \(p \in P'_j \setminus P_j\) to cluster \(j \in [k]\). The key idea of the convexity propagator is the following simple observation.

Lemma 4.2

There exists an optimal solution of MSSC with clusters \(P_1, \dots , P_k\) such that, for each \(j \in [k]\), we have \({{\,\textrm{conv}\,}}(P_j) \cap P = P_j\).

Proof

Given an optimal allocation of the k centroids, the Voronoi cells

$$\begin{aligned} C_j = \{x \in \mathbb {R}^d:\Vert x - c^j\Vert \le \Vert x - c^{j'}\Vert ,\; j' \in [k]\} \end{aligned}$$

for \(j \in [k]\) cover the entire \(\mathbb {R}^d\) and only intersect at their boundaries. Since Voronoi cells are full-dimensional polyhedra, we can use the following mechanism to prove the assertion. We start with cluster 1 and observe that \(P_1 \subseteq C_1\) in any optimal solution. If there exist \(p \in P \setminus P_1\) that are contained in \(C_1\), they are necessarily contained in the boundary of \(C_1\). Hence, if we change the assignment of these points to \(P_1\), this does not change the objective of MSSC. The assertion thus holds for \(P_1\), and we can use the same arguments iteratively to conclude the proof. \(\square \)

As a consequence, the convexity propagator computes \({{\,\textrm{conv}\,}}(P_j)\) for each \(j \in [k]\). If there exists \(p \in P \cap {{\,\textrm{conv}\,}}(P_j)\) it performs the following steps: If \(p \notin P'_j\) holds, then we can prune the current node of the branch-and-bound tree, because the local variable bounds cannot lead to an optimal solution adhering to Lemma 4.2. Otherwise, \(x_{pj}\) can be fixed to 1.
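Checking whether a data point lies in \({{\,\textrm{conv}\,}}(P_j)\) can also be done via a small linear program. The following SciPy-based sketch (our own naming; not the routine used in our implementation, which relies on explicit convex hull computations) illustrates this membership test.

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(q, points):
    """Check q in conv(points) by testing feasibility of
    q = sum_r lambda_r p^r with sum_r lambda_r = 1 and lambda >= 0."""
    m = points.shape[0]
    A_eq = np.vstack([points.T, np.ones((1, m))])  # d coordinate rows + convexity row
    b_eq = np.append(q, 1.0)
    res = linprog(c=np.zeros(m), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * m)
    return res.status == 0                         # status 0: a feasible optimum was found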

Besides pruning nodes and fixing variables to 1, Lemma 4.2 has another consequence that allows us to fix some variables to 0, which is illustrated in Fig. 1.

Fig. 1 Illustration of cone-based propagation. If q is not contained in the red cluster, none of the black points can be contained in the red cluster. Assigning the white points to the red cluster is still possible

Lemma 4.3

Let \(P_1 \cup \dots \cup P_k\) be a partition of a finite set \(P \subseteq \mathbb {R}^d\). Suppose \({{\,\textrm{conv}\,}}(P_j) \cap P = P_j\) for each \(j \in [k]\). Then, for every \(q \in P {\setminus } P_j\),

$$\begin{aligned} \left( q + {{\,\textrm{cone}\,}}\{-(p - q):p \in P_j\} \right) \cap P \subseteq P \setminus P_j. \end{aligned}$$

Proof

Note that \(q + {{\,\textrm{cone}\,}}\{p - q:p \in P_j\}\) is the smallest cone with apex q that contains \({{\,\textrm{conv}\,}}(P_j)\), because we shoot rays from q through each of the finitely many points in \(P_j\). Now suppose that the negated cone \(q + {{\,\textrm{cone}\,}}\{-(p - q):p \in P_j\}\) contains a point \(\tilde{p} \in P_j\). Then, \(\tilde{p} - q = \sum _{p \in P_j} \lambda _p (q - p)\) for some \(\lambda \ge 0\), which can be rearranged to \(q = (\tilde{p} + \sum _{p \in P_j} \lambda _p p) / (1 + \sum _{p \in P_j} \lambda _p)\), i.e., \(q \in {{\,\textrm{conv}\,}}(P_j)\). This contradicts \({{\,\textrm{conv}\,}}(P_j) \cap P = P_j\) together with \(q \notin P_j\). Hence, the negated cone cannot contain any point of \(P_j\). \(\square \)

We can use this observation as follows. If there is \(q \in P {\setminus } P'_j\), i.e., \(x_{qj}\) is fixed to 0, then \(x_{yj}\) can be fixed to 0 for every data point \(y \in P\) contained in \(q + {{\,\textrm{cone}\,}}\{-(p - q):p \in P_j\}\), because assigning y to cluster j would place q in the convex hull of the points of cluster j, contradicting Lemma 4.2.
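The corresponding cone test can again be written as a small LP feasibility problem. The sketch below (illustrative naming; SciPy is used only for demonstration) checks whether a data point y lies in \(q + {{\,\textrm{cone}\,}}\{-(p - q):p \in P_j\}\), in which case \(x_{yj}\) can be fixed to 0 as well.

```python
import numpy as np
from scipy.optimize import linprog

def in_negated_cone(y, q, P_j):
    """Check y in q + cone{-(p - q) : p in P_j}, i.e., whether
    y - q = sum_p lambda_p (q - p) for some lambda >= 0."""
    rays = (q - P_j).T                             # one generator q - p per column
    res = linprog(c=np.zeros(rays.shape[1]), A_eq=rays, b_eq=y - q,
                  bounds=[(0, None)] * rays.shape[1])
    return res.status == 0                         # feasible => y can be excluded from cluster j
```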

In arbitrary dimensions, the convexity propagator cannot be implemented efficiently, because \({{\,\textrm{conv}\,}}(P_j)\) might have \(\Omega (2^d)\) many facets. In small dimensions, however, computing convex hulls can be done rather quickly and, as our numerical results will indicate, has a very positive impact on the time needed to solve MSSC instances.

4.3 Distance propagation

The distance propagator provides another set of rules to fix variables \(x_{pj}\), \((p,j) \in P \times [k]\), to 0. To this end, it defines for each \(j \in [k]\) the bounding box \(B_j = \{y \in \mathbb {R}^d: \underline{c}^j_i \le y_i \le \bar{c}^j_i,\; i \in [d]\}\) for the centroid, i.e., the smallest box that contains the centroid of cluster j based on local variable bound information. Afterward, for each \(p \in P\) and \(j \in [k]\), it computes the minimum and maximum distances \(D^{\min }_{j,p}\) and \(D^{\max }_{j,p}\) of p to the bounding box \(B_j\), i.e.,

$$\begin{aligned} D^{\min }_{j,p} = \min \left\{ \Vert p-x\Vert :x \in B_j\right\} , \quad D^{\max }_{j,p} = \max \left\{ \Vert p-x\Vert :x \in B_j\right\} . \end{aligned}$$

Since a data point is assigned to a centroid of minimum distance in an optimal solution, p cannot be assigned to cluster \(j \in [k]\) if there is \(j' \in [k]\) with \(D^{\max }_{j',p} < D^{\min }_{j,p}\). Consequently, \(x_{pj}\) can be fixed to 0 in this case.

Finding \(B_j\) and computing the minimum and maximum distance of a point to a box can be done in O(d) time. Hence, the distance propagator runs in \(O(kd|P|)\) time.
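A sketch of the two distance computations (illustrative naming); both take O(d) time per point and cluster.

```python
import numpy as np

def box_distances(p, c_lo, c_hi):
    """Minimum and maximum Euclidean distance from point p to the box
    {y : c_lo <= y <= c_hi} of possible centroid positions for one cluster."""
    nearest = np.clip(p, c_lo, c_hi)               # componentwise projection onto the box
    farthest = np.where(np.abs(p - c_lo) >= np.abs(p - c_hi), c_lo, c_hi)
    return np.linalg.norm(p - nearest), np.linalg.norm(p - farthest)
```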

5 Branching rules

After a node of the branch-and-bound tree has been processed (adding cutting planes, propagation), branching rules split the current subproblem into further subproblems to, e.g., tighten the problem formulation or enforce integrality of variables. The decision on how to split the current subproblem is typically guided by the solution of the subproblem’s LP relaxation. To enforce integrality of variables (in the notation of the MSSC problem), one typically selects a variable \(x_{pj}\) whose value in the LP solution is non-integral. Then, two subproblems are created that fix \(x_{pj}\) to 0 and 1, respectively. Although many branching rules exist that perform well for generic problems, see Achterberg et al. [1], there also exist branching rules tailored to a specific problem. This is relevant because such rules might allow further reductions to be derived based on the problem structure or by other components of a solver such as propagation mechanisms.

For integer programs, Gilpin and Sandholm [27] proposed four families of branching rules motivated by an information-theoretic perspective. The common ground of all their rules is to interpret the values \(x_{pj}\) as the probability that a point \(p \in P\) is assigned to cluster \(j \in [k]\). Using their rules, they aim at reducing the assignment uncertainty in the current subtree. The first and second family use a look-ahead approach, similar to strong branching. For the MSSC problem with a large number of clusters or points, look-ahead branching rules may become computationally prohibitive when applied to the full problem. Hence, we do not use them. The third family is called entropic look-ahead-free variable selection and the fourth is its extension to a multi-variable branching version. Since we think that the third family might be helpful for the MSSC problem, we describe it in the following. Afterward, three novel branching rules for the MSSC problem will be presented.

5.1 Entropy branching

Suppose that the optimal LP solution of the current subproblem does not satisfy the integrality constraints, which means that the relaxed solution \(\bar{x}\) is non-integral. Let \(\bar{X}\) be the set of all branching candidates that are non-integral in this LP solution, i.e.,

$$\begin{aligned} \bar{X} {:}{=}\left\{ \bar{x}_{pj}:\bar{x}_{pj} \in (0,1), \ p \in P, \ j \in [k]\right\} . \end{aligned}$$

Since each of these \(\bar{x}_{pj}\) is non-integral, the cluster assignment of point p is not fixed. For the assignment to be fixed, \(\bar{x}_{pj}\) has to be one for exactly one \(j \in [k]\). Due to Constraints (1b) or (2c), \(\bar{x}_{pj}\) can be seen as a kind of posterior probability of point p belonging to cluster j [61]. A good strategy for branching would be to select a point p for which the probabilities of all cluster assignments are almost the same.

The most unclear situation is the one in which \(\bar{x}_{pj}={1}/{k}\) holds for all \(j\in [k]\). Here, each cluster assignment is equally probable for point p. This can be seen as a homogeneous information setting. The level of homogeneity can be measured via the Shannon entropy of point p [54]. More precisely, for each point p with a fractional variable \(\bar{x}_{pj}\in \bar{X}\), the entropy of p with probabilities \(\bar{x}_{pj}\), \(j\in [k]\), is

$$\begin{aligned} H_p = - \sum _{j\in [k]} \bar{x}_{pj} \log _2 (\bar{x}_{pj}). \end{aligned}$$

The maximum entropy (\(H_p=\log _2 k\)) occurs in the above mentioned extreme case. That is, the current LP solution does not provide any information on the best (or most probable) cluster assignment of point p. The minimum entropy is obtained when there is a clear cluster assignment. In this situation, let the point p be already assigned to a cluster, e.g., to cluster \(j = 1\) and hence \(\bar{x}_{p1} = 1\). Due to (1b) (or (2c)), \(\bar{x}_{pj'} = 0\) for \(j'\in [k]\setminus \{1\}\). The entropy of p is then

$$\begin{aligned} H_p = - 1 \log _2 1 - 0\log _2 0 - \cdots - 0 \log _2 0 = 0, \end{aligned}$$

where \(0\log _2 0\) is taken to be zero.

We are interested in finding the point corresponding to a fractional variable \(\bar{x}_{p^*j^*}\in \bar{X}\) such that the entropy of point \(p^*\) is maximal over all points with fractional variables, i.e., we search for the most uncertain assignment. The point is formally given by

$$\begin{aligned} {p^*} \in \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{\{p\in P:\bar{x}_{pj}\in \bar{X}, \, j\in [k]\}} H_p. \end{aligned}$$

For this point \(p^*\), we select the cluster index j to branch on arbitrarily, i.e., we create two subproblems by adding either \(x_{p^*j} = 1\) or \(x_{p^*j} = 0\).
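A minimal sketch of this selection step (illustrative naming; x_bar is the LP solution stored as a \(|P| \times k\) matrix):

```python
import numpy as np

def entropy_branching_point(x_bar, eps=1e-6):
    """Return the index of the data point with maximal entropy H_p over its
    fractional assignment values, or None if the LP solution is integral."""
    best_p, best_H = None, -1.0
    for p_idx, probs in enumerate(x_bar):
        if not np.any((probs > eps) & (probs < 1.0 - eps)):
            continue                               # no fractional variable for this point
        pos = probs[probs > eps]
        H = float(-np.sum(pos * np.log2(pos)))     # 0 * log2(0) is taken to be 0
        if H > best_H:
            best_p, best_H = p_idx, H
    return best_p
```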

5.2 Distance branching

Next, we describe three branching rules with a geometric motivation. The first one, called distance branching, is based on the intuition that clusters should be rather compact (as opposed to being spread out). Given the current LP solution with its suggestion for the centroids \(c^j\), \(j \in [k]\), the variable \(\bar{x}_{pj}\in \bar{X}\) selected for branching is the one corresponding to the data point p and cluster j that are farthest apart from each other, i.e., we find

$$\begin{aligned} (p^*, j^*) \in \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{\{(p, j) \in P \times [k]:\bar{x}_{pj}\in \bar{X}\}} \Vert p -c^j\Vert . \end{aligned}$$

Then, we branch on the fractional variable \(\bar{x}_{p^*j^*}\), creating two subproblems by adding either \(x_{p^*j^*} = 1\) or \(x_{p^*j^*} = 0\). If an optimal cluster is indeed compact, then the 0-subproblem contains an optimal solution. Otherwise, in the 1-subproblem, the convexity propagator has the potential to also fix additional variables corresponding to data points that lie between \(p^*\) and the remaining points of cluster \(j^*\).
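The selection step of distance branching can be sketched analogously to the entropy rule (illustrative naming; c_bar are the centroid values of the current LP solution):

```python
import numpy as np

def distance_branching_candidate(P, c_bar, x_bar, eps=1e-6):
    """Return (p*, j*) with fractional x_bar[p, j] maximizing ||p - c^j||."""
    best, best_dist = None, -1.0
    for p_idx, point in enumerate(P):
        for j, cj in enumerate(c_bar):
            if eps < x_bar[p_idx, j] < 1.0 - eps:  # fractional branching candidate
                dist = np.linalg.norm(point - cj)
                if dist > best_dist:
                    best, best_dist = (p_idx, j), dist
    return best
```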

5.3 Centrality branching

Since the distance branching rule is tailored towards the extremes of compact vs. far spread-out clusters, the centrality branching rule takes a more balanced approach by selecting a point whose distance to a cluster is not too big. Given the current LP solution with its suggestion for the centroids \(c^j\), \(j \in [k]\), we would like to branch on the non-integral variable \(x_{pj}\) corresponding to the data point p and cluster j that is lying in the center of the cloud of unassigned data points. To obtain a cheap evaluation, we take the point \(p^*\) that is in the center of all centroids, i.e.,

$$\begin{aligned} p^* \in \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\{p\in P:\bar{x}_{pj}\in \bar{X}, \, j\in [k]\}} \ \sum _{j\in [k]} \Vert p -c^j\Vert . \end{aligned}$$

For this point \(p^*\), we select an arbitrary fractional variable \(\bar{x}_{p^*j^*}\), creating two subproblems by adding either \(x_{p^*j^*} = 1\) or \(x_{p^*j^*} = 0\). If the distance of \(p^*\) to cluster \(j^*\) is not too small, then there is a chance that the convexity propagator can fix further data points to be contained in cluster \(j^*\) in the 1-subproblem. In contrast to the distance branching rule, however, the 0-subproblem also becomes relevant for the convexity propagator, as it might fix some variables to 0 based on its cone propagation.

5.4 Pairs-in-the-middle-of-pairs branching

The next branching rule is a variation of the last one, but now we branch on more general linear inequalities. Given the current LP solution with its suggestion for the centroids \(c^j\), \(j \in [k]\), we would like to branch on the sum of a pair of non-integral variables corresponding to the two data points located in the middle of a pair of clusters.

First assume that there are only two clusters with corresponding centroids \(c^1\) and \(c^2\). We want to find the points that are nearest to the point lying halfway between the centroids \(c^1\) and \(c^2\). Any point \(p\in P\) that lies on the line segment between \(c^1\) and \(c^2\) minimizes the sum \(\Vert p -c^1\Vert + \Vert p -c^2\Vert \) by the triangle inequality of the Euclidean distance. By the same reasoning, the smaller this sum is, the closer point p is to the line segment between the two centroids. However, there can be multiple points p with the same value of \(\Vert p -c^1\Vert + \Vert p -c^2\Vert \). We are interested in the ones that are nearest to the middle. Thus, we penalize longer distances by minimizing the sum of squares \(\Vert p -c^1\Vert ^2 + \Vert p -c^2\Vert ^2\) instead.

Now assume that there are more than two clusters whose centroids have pairwise the exact same distance to each other. Then, looking for the two points \(p \in P\) that minimize \(\Vert p -c^j\Vert ^2 + \Vert p -c^{j'}\Vert ^2\) over \(j,j'\in [k]\) with \(j\ne j'\) still gives us the desired points. Let p and q be the selected points and j and \(j'\) be the selected clusters. We then compute the sums in the current LP solution, \(\bar{x}_{pj} + \bar{x}_{qj}\) and \(\bar{x}_{pj'} + \bar{x}_{qj'}\), and select the sum that is most fractional or least fractional; both versions are tested in our numerical experiments. Suppose that the first sum is selected, which means that the selected cluster is j. Then, we branch on

$$\begin{aligned} x_{pj} + x_{qj} \le \lfloor \bar{x}_{pj} + \bar{x}_{qj} \rfloor \end{aligned}$$

and

$$\begin{aligned} x_{pj} + x_{qj} \ge \lceil \bar{x}_{pj} + \bar{x}_{qj} \rceil . \end{aligned}$$

6 Primal heuristics

Primal heuristics try to find feasible solutions of good quality in a short amount of time. Having good feasible solutions at hand early in the solving process is crucial. Feasible solutions help to prune branch-and-bound nodes based on bounding as well as to perform further fixings and reductions. Moreover, a user may already be satisfied with the quality of the heuristic solution, such that the solving process can be stopped at an early stage. In this section we present three primal heuristics for the MSSC problem.

6.1 A root-node heuristic

To obtain a first feasible point, i.e., a point for warm-starting, we use the k-means algorithm, which is the most popular heuristic for finding a feasible solution for the MSSC problem; see, e.g., Lloyd [39] and MacQueen [40]. It consists of two main steps. First, given an initial guess for the location of the centroids, each data point is assigned to the nearest centroid. Afterward, each centroid is updated by calculating the mean of the data points assigned to this centroid. This process is repeated until the centroids no longer change. To obtain an initial guess for the location of the centroids, we use the “furthest point heuristic”, also known as “Maxmin” [28]. The idea is to select the first centroid randomly within the respective bounding box and then obtain new centroids one by one. In each iteration, the next centroid is the point that is the furthest (max) from its nearest (min) existing centroid. Here, we choose the first data point as the first centroid. For a comparison of several initialization heuristics, see, e.g., Fränti and Sieranoja [23].
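A compact sketch of this root-node heuristic (Maxmin initialization followed by Lloyd iterations; the function names are ours and only illustrate the procedure described above):

```python
import numpy as np

def maxmin_init(P, k):
    """Furthest-point ("Maxmin") initialization; the first data point is the first centroid."""
    centroids = [P[0]]
    for _ in range(1, k):
        dists = np.min(np.linalg.norm(P[:, None, :] - np.array(centroids)[None, :, :],
                                      axis=2), axis=1)
        centroids.append(P[np.argmax(dists)])      # furthest from its nearest centroid
    return np.array(centroids)

def k_means(P, k, max_iter=100):
    """Lloyd's algorithm started from the Maxmin initialization."""
    c = maxmin_init(P, k)
    for _ in range(max_iter):
        assign = np.argmin(np.linalg.norm(P[:, None, :] - c[None, :, :], axis=2), axis=1)
        new_c = np.array([P[assign == j].mean(axis=0) if np.any(assign == j) else c[j]
                          for j in range(k)])
        if np.allclose(new_c, c):                  # centroids no longer change
            break
        c = new_c
    return assign, c
```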

6.2 A rounding heuristic

Feasible solutions can be obtained at each node by applying a rounding scheme to the LP solution. We use the rounding heuristic proposed by Sherali and Desai [55]. For completeness, we also describe it here.

Given a non-integral LP solution \((\tilde{x}, \tilde{c})\), or \((\tilde{x}, \tilde{c}, \tilde{\eta })\) for Model (2), at a node of the branch-and-bound tree, we round the non-integral \(\tilde{x}\)-solution to the closest feasible binary solution \(\bar{x}\) while respecting the decisions that have already been made, i.e., if a data point is already assigned to a cluster, it remains in that cluster. First, we ensure that there are no empty clusters by finding, for each \(j\in [k]\) with \(P_j = \emptyset \), a point \(\bar{p} \in {{\,\mathrm{arg\,max}\,}}\left\{ \tilde{x}_{pj} : p\in P {\setminus } \bigcup _{j' \in [k]} P_{j'} \right\} \) and setting \(\bar{x}_{\bar{p} j} = 1\). To break a tie, the point with the smallest index is chosen. Then, to ensure that the point \(\bar{p}\) is only in one cluster, we set \(\bar{x}_{\bar{p} j^{'}} = 0\) for all \(j^{'} \in [k] \setminus \{j\}\).

Furthermore, for each data point \(p \in P\) such that \(\tilde{x}_{pj}\), \(j\in [k]\), is not yet rounded, we find a cluster \(j^*\) such that \(\tilde{x}_{pj^*} = \max \left\{ \tilde{x}_{pj} : j \in [k] \right\} \). Again, we break ties by selecting the cluster with the smallest index. Then, we set \(\bar{x}_{pj^*} = 1\) and \(\bar{x}_{pj} = 0\) for all \(j \in [k] {\setminus } \{j^*\}\). With \(\bar{x}\) at hand, we can then compute the centroids for each cluster, see Observation 2.1, and obtain a feasible solution for the MSSC problem.
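The rounding scheme can be sketched as follows (illustrative naming; x_tilde is the fractional LP solution and fixed_to_one encodes assignments already made by branching):

```python
import numpy as np

def round_lp_solution(x_tilde, fixed_to_one=()):
    """Round a fractional assignment to a feasible 0/1 assignment: respect fixed
    assignments, repair empty clusters first, then round every remaining point
    to its largest fractional value (ties broken by smallest index)."""
    n, k = x_tilde.shape
    x_bar = np.zeros((n, k), dtype=int)
    assigned = np.full(n, -1)
    for (p, j) in fixed_to_one:                    # decisions already made
        x_bar[p, j] = 1
        assigned[p] = j
    for j in range(k):                             # make sure no cluster stays empty
        if not np.any(assigned == j):
            free = np.where(assigned < 0)[0]
            p = free[np.argmax(x_tilde[free, j])]  # argmax breaks ties by smallest index
            x_bar[p, j] = 1
            assigned[p] = j
    for p in np.where(assigned < 0)[0]:            # round the remaining points
        j = int(np.argmax(x_tilde[p]))
        x_bar[p, j] = 1
        assigned[p] = j
    return x_bar
```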

6.3 An improvement heuristic

Given a feasible solution \((\bar{x},\bar{c})\) of Model (1) or \((\bar{x},\bar{c},\bar{\eta })\) of Model (2), we try to improve this solution by evaluating the loss function (i.e., the intra-variance) within each cluster. For that, consider the weighted value of the loss function restricted to cluster \(C_j\) as

$$\begin{aligned} F_{j} = \frac{1}{|C_j|} \sum _{p \in C_j} \Vert p -\bar{c}^j\Vert ^2. \end{aligned}$$

It may happen that some clusters have a large loss function value, while other clusters may have a very small one. Thus, we may find a better solution (regarding the sum of all losses) by splitting a cluster into two smaller clusters and joining two other clusters. This heuristic has been proposed in Burgard et al. [8], where the motivation for its development is explained in more detail.

The procedure is described as follows. For each pair of clusters \((C_{j_1}, C_{j_2})\), we compute their joint centroid and the corresponding total loss via

$$\begin{aligned} c^{j_1j_2} = \frac{ 1 }{ | C_{j_1} | + |C_{j_2}| } \sum _{p \in C_{j_1} \cup C_{j_2}} p \end{aligned}$$

and

$$\begin{aligned} F_{j_{1}j_2} = \frac{ 1 }{ |C_{j_1}| + |C_{j_2}| } \sum _{p \in C_{j_1} \cup C_{j_2}}\Vert p - c^{j_{1}j_2} \Vert ^2. \end{aligned}$$

Now, consider the set

$$\begin{aligned} \Psi {:}{=}\left\{ (C_{j_1}, C_{j_2}, C_{j_3}): F_{j_{1}j_2} < F_{j_3} \right\} , \end{aligned}$$

which is the set of all possible combinations of three clusters such that the total loss within two joined clusters is smaller than the total loss within a third cluster. Note that the set \(\Psi \) can be empty. If so, this means that we cannot obtain a better solution by joining two clusters and splitting another one. On the other hand, i.e., if there exists \((C_{j_1}, C_{j_2}, C_{j_3}) \in \Psi \), then the total loss of the joined clusters \(C_{j_1}\) and \(C_{j_2}\) is smaller than the total loss within cluster \(C_{j_3}\). Thus, we obtain a better solution by joining \(C_{j_1}\) and \(C_{j_2}\) and by splitting cluster \(C_{j_3}\) into two smaller clusters. To this end, we update the centroids in such a way that the clusters \(C_{j_1}\) and \(C_{j_2}\) are now one cluster with centroid \(c^{j_1j_2}\), cluster \(C_{j_3}\) receives two new centroids, and the other centroids remain the same, i.e.,

$$\begin{aligned} \hat{c}^{j_1} \leftarrow c^{j_1j_2}, \quad \hat{c}^{j_2} \leftarrow \tilde{c}, \quad \hat{c}^{j_3} \leftarrow \tilde{c}', \\ \hat{c}^j \leftarrow \bar{c}^j \quad \text {for all} \quad j \notin \{j_1,j_2,j_3\}, \end{aligned}$$

where \(\tilde{c}\) and \(\tilde{c}'\) are obtained as follows. First we find the two furthest points in \(C_{j_3}\) to be the initial guesses for the location of the centroids, i.e.,

$$\begin{aligned} (\tilde{c}, \tilde{c}') \in \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{p, p^{'} \in C_{j_3}} \left\{ \Vert p - p' \Vert ^2 \right\} . \end{aligned}$$

Next, each point in \(C_{j_3}\) is assigned to the closest centroid, either \(\tilde{c}\) or \(\tilde{c}'\). Then, the centroids \(\tilde{c}\) and \(\tilde{c}'\) are updated based on this assignment. Now, the update of the assignments and centroids is repeated until they do not change anymore. This way we obtain the new centroids \(\tilde{c}\) and \(\tilde{c}'\) that give us the desired splitting of cluster \(C_{j_3}\).

Finally, if the set \(\Psi \) has more than one element, then we repeat the process starting with the element \((C_{j_1}, C_{j_2}, C_{j_3})\) that gives the minimum ratio \(F_{j_{1}j_2} / F_{j_3}\). Each time an element \((C_{j_1}, C_{j_2}, C_{j_3})\) is used, we exclude all the elements that contain \(C_{j_1}\), \(C_{j_2}\), or \(C_{j_3}\), because these clusters have already been modified.

With \(\hat{c}\) at hand, we can easily compute \(\hat{x}\) and, thus, obtain a new feasible solution \((\hat{x}, \hat{c})\). If the objective function value of this new solution is better, then we have found an improvement over \((\bar{x}, \bar{c})\).
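The candidate selection of this improvement heuristic can be sketched as follows (illustrative naming); the actual split of cluster \(C_{j_3}\) is then performed by the 2-means step described above, initialized with the two furthest points of \(C_{j_3}\).

```python
import numpy as np

def average_loss(points):
    """F_j: average squared distance of the points to their barycenter."""
    center = points.mean(axis=0)
    return float(np.sum(np.linalg.norm(points - center, axis=1) ** 2) / len(points))

def join_split_candidates(clusters):
    """clusters: list of (n_j, d) arrays.  Return the triples (j1, j2, j3) with
    F_{j1 j2} < F_{j3}, sorted by increasing ratio F_{j1 j2} / F_{j3}."""
    k = len(clusters)
    F = [average_loss(C) for C in clusters]
    psi = []
    for j1 in range(k):
        for j2 in range(j1 + 1, k):
            F12 = average_loss(np.vstack([clusters[j1], clusters[j2]]))
            for j3 in range(k):
                if j3 not in (j1, j2) and F12 < F[j3]:
                    psi.append((F12 / F[j3], j1, j2, j3))
    psi.sort()                                     # smallest ratio first
    return [(j1, j2, j3) for _, j1, j2, j3 in psi]
```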

7 Symmetry breaking

Note that both Models (1) and (2) are symmetric with respect to cluster assignments. That is, once a feasible solution has been found, one can generate equivalent (symmetric) solutions by exchanging the labels of the clusters. Such symmetries are known to deteriorate the performance of search-based approaches like branch-and-bound, because symmetric subproblems are created repeatedly without providing the solver with new information. These cluster symmetries can be handled in both models by imposing additional restrictions on the x-variables. If we interpret x as a binary matrix whose columns are labeled by clusters, then we can handle symmetries by enforcing that the columns of x are sorted in lexicographically non-increasing order. Since each row of the matrix x has exactly one 1-entry due to (1b) and (2c), the lexicographic sorting can be imposed by orbitopal fixing, see Kaibel et al. [37], and by separating the symmetry handling inequalities developed by Kaibel and Pfetsch [36].

8 Numerical experiments

In this section, we report extensive computational results that show the benefits of the techniques proposed in Sects. 3–6. To this end, we have incorporated all these techniques into the state-of-the-art solver SCIP; see Gamrath et al. [26]. As a reference for comparison, we use plain SCIP for both problem formulations (1) and (2). That is, we solve the two formulations without our problem-specific enhancements but with symmetry handling enabled.

To conduct the experiments, we use different test sets from the literature, which contain both real-world as well as synthetic instances. The test sets and the general computational setup are described in Sects. 8.1 and 8.2, respectively. Then, in Sect. 8.3, we start the discussion of the numerical results for the case \(k=2\). We evaluate the benefits of each particular technique and indicate which setting performs best. Next, in Sect. 8.4, we repeat the discussion but for the case \(k=3\). Finally, in Sect. 8.5, we present results on a larger test set in order to draw solid and comprehensive conclusions about the performance of the novel techniques.

8.1 Test sets

We evaluate the impact of the presented algorithmic ideas for solving the MSSC problem using both synthetic and real-world test sets. To be able to draw conclusions on a reliable basis, we have collected all publicly available instances that have been used in the related literature for solving the MSSC problem to global optimality. Thus, to the best of our knowledge, our results are based on the largest publicly available test set for the MSSC problem consisting of realistic instances. Specifically, we use the instances that have been used in Aloise and Hansen [4], Sherali and Desai [55], as well as in Aloise et al. [5]. Since these instances come from different sources, we provide the source for every instance in Table 1. The synthetic test set has been proposed in Fränti and Sieranoja [22]. The authors show that these synthetic instances cover a wide range of classic MSSC instances. In particular, the test set contains instances with different degrees of overlap, density, and sparsity of data points.

Table 1 Information about the test sets

Note that some of the synthetic instances contain data points with very large coordinate values. In preliminary experiments, we have observed that this leads to very large big-M values in Model (2), which in turn causes numerical instabilities. To avoid numerical issues, we therefore re-scale these instances as follows. First, we shift the data points such that, in each coordinate, the minimum and maximum value have the same absolute value, i.e., the data are distributed symmetrically around the origin. Then, we re-scale the data points if they do not fit into \([-10^3, 10^3]^d\). More precisely, for each dimension \(i\in [d]\), we compute the maximum and minimum coordinate value, obtaining \(\bar{v}_i\) and \(\underline{v}_i\), respectively. Then, we take \(u_i = 0.5(\bar{v}_i + \underline{v}_i)\) and shift each data point p, obtaining \(\hat{p}_i = p_i - u_i\) for all \(i\in [d]\). If we do this for all data points, they get centered around the origin. Now, if \(\underline{w}_i = \underline{v}_i - u_i < -10^3\) or \(\bar{w}_i = \bar{v}_i - u_i > 10^3\) holds, we re-scale the data. The desired new bounds then are \(\underline{z}_i = -10^3\) and \(\bar{z}_i = 10^3\). Thus, the re-scaled data point \(\tilde{p}\) is

$$\begin{aligned} \tilde{p}_i = \frac{\hat{p}_i - \underline{w}_i}{\bar{w}_i - \underline{w}_i} \cdot (\bar{z}_i - \underline{z}_i) + \underline{z}_i, \quad i\in [d]. \end{aligned}$$

The corresponding instances that needed to be re-scaled are s1, s2, s3, s4, and unbalance.
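A sketch of this re-scaling (illustrative naming; the bound \(10^3\) is the one used above):

```python
import numpy as np

def rescale_instance(P, bound=1e3):
    """Center each coordinate of the data around the origin and re-scale it to
    [-bound, bound] if it exceeds that range."""
    P = np.asarray(P, dtype=float)
    v_min, v_max = P.min(axis=0), P.max(axis=0)
    u = 0.5 * (v_max + v_min)
    P_hat = P - u                                  # centered data points
    w_min, w_max = v_min - u, v_max - u
    for i in range(P.shape[1]):
        if w_min[i] < -bound or w_max[i] > bound:
            P_hat[:, i] = ((P_hat[:, i] - w_min[i]) / (w_max[i] - w_min[i])
                           * 2.0 * bound - bound)  # map coordinate i to [-bound, bound]
    return P_hat
```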

8.2 Computational setup

To conduct our experiments, we use SCIP 7.0.3 as a branch-and-bound framework. All LP relaxations are solved using CPLEX 12.8. Our novel techniques discussed in Sects. 3–6 are implemented as SCIP plugins written in C/C++ and our code is publicly available on GitHub (git hash 19003a37). To handle symmetries, we use the orbitope constraint handler plugin of SCIP, which implements orbitopal fixing and the symmetry handling inequalities as mentioned in Sect. 7. To compute convex hulls and cones in the convexity propagator proposed in Sect. 4.2, we use the Qhull C++ interface proposed by Barber et al. [6]. We also conducted experiments using the CDD library [24] for computing convex hulls and cones, but due to numerical instabilities therein, we decided to use Qhull. Moreover, since we observed that many of SCIP’s internal heuristics require a lot of running time without generating a feasible solution, we disabled these heuristics. A list of the disabled heuristics can be found in “Appendix A”. All computations were performed on a computer with two Intel Xeon CPU E5-2699 v4 at 2.20 GHz (\(2 \times 44\) threads) and 756 GB RAM. The time limit for all computations is 1 h per instance.

In the following, we discuss the impact of our techniques on solving the MSSC problem for \(k=2\) and \(k=3\) clusters. We only report aggregated results in the discussion and refer the reader to “Appendix B” for results per instance. The tables that we present show, for both the quadratic and the epigraph formulation, the mean number of nodes in the branch-and-bound tree (column #nodes), the mean running time per instance in seconds (time), and the number of solved instances (#solved). Instances that cannot be solved within the time limit contribute 3600 s to the mean time value. Moreover, we report the setting used, where each of the following subsections describes how the settings are encoded in the tables. All mean values of measurements \(t_1,\dots ,t_n\) are provided as shifted geometric means \(\prod _{i = 1}^n (t_i + s)^{\nicefrac {1}{n}} - s\) to reduce the impact of outliers. For times we use a shift of \(s=10\) and for node counts a shift of \(s=100\).
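For reference, the shifted geometric mean used in all tables can be computed as follows (our own helper name):

```python
import numpy as np

def shifted_geometric_mean(values, shift):
    """prod_i (t_i + s)^(1/n) - s; shift s = 10 for times and s = 100 for node counts."""
    values = np.asarray(values, dtype=float)
    return float(np.exp(np.mean(np.log(values + shift))) - shift)
```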

8.3 Discussion of the numerical results for 2 clusters

We start with the discussion of the numerical results for the case when there are 2 clusters. First, we apply plain SCIP to all the 25 instances presented in Table 1. Afterward, we gradually enable our techniques in SCIP and evaluate the benefits of each particular technique as well as the benefits of different combinations of techniques. To allow for a concise encoding, we abbreviate the different techniques as described below. Whether a technique is enabled (resp. disabled) is encoded by 1 (resp. 0) in the corresponding tables.

8.3.1 Primal heuristics

We start by evaluating the impact of primal heuristics. A summary of the obtained results is presented in Table 2, where “round.”, “impr.”, and “init” serve as abbreviations for the rounding, improvement, and root-node heuristic, respectively. Recall that the improvement heuristic is only active for \(k>2\), i.e., it has no effect in the experiments discussed next. The first row shows the results obtained by plain SCIP. It can be directly seen that the MSSC problem is extremely hard to solve. Note that SCIP is able to solve only 1 instance to global optimality, regardless of which model is used. Enabling all primal heuristics still does not allow more instances to be solved. However, we can see that the mean running time decreases in both models, where the impact is larger for the quadratic model. That is, the single instance that can be solved is solved approximately 3.9% faster using heuristics.

Table 2 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using different heuristics for 2 clusters

Let us stress that, based on preliminary experiments, the main difficulty of solving the MSSC problem to global optimality is to obtain good dual bounds in a reasonable amount of time. For this reason, the impact of heuristics on the solving process is expected to be minor in comparison to the impact of techniques that improve the dual bound. However, since we needed to disable many of SCIP’s internal heuristics as described above, we enable all our heuristics in the following experiments as their running time is low and they produce good solutions.

8.3.2 Propagators

The results of our experiments regarding propagators are summarized in Table 3, where “bary.”, “conv.”, “cone”, and “dist.” abbreviate the barycenter, convexity, cone, and distance propagator, respectively. However, we do not include the results using the distance propagator here, since in preliminary numerical experiments we observed that this propagator is not able to derive many reductions if used alone. In later experiments, we will enable it again to investigate whether it is able to improve the solution process if also other components are enabled.

The first row of results in Table 3 corresponds to the setting where only primal heuristics are enabled. It can be directly seen that as more propagators are enabled, more instances are solved. Without propagators only 1 instance is solved to global optimality. Using all our propagators, we are able to solve 8 instances with the quadratic model. Thus, the geometric ideas incorporated into the propagators are an important component to solve the MSSC problem effectively. In particular, plain SCIP is not able to make use of the simple geometric observations on its own. In the following, we discuss the benefits of each particular propagator in more detail.

Table 3 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using different propagators for 2 clusters

8.3.3 Barycenter propagator

By using the barycenter propagator and the quadratic model, many more nodes can be processed compared with the previous setting and, more importantly, in significantly less time. The reason for this is that the barycenter propagator is able to perform many reductions, which in turn simplify the LP relaxations. As a consequence, the dual bounds obtained with the quadratic model improve drastically when the barycenter propagator is used. This can be clearly seen in Fig. 2, where we plot the instances vs. the corresponding gap between the primal and dual bounds.

Fig. 2 Instance ID vs. gap (in percent, log-scale) for 2 clusters. Since the y-axis is in log-scale, instances solved to global optimality by a particular method (gap zero) do not appear in the plot. Gaps equal to or larger than \(10^4\) are capped at the limit of \(10^4\) in the plot

This already demonstrates the great benefit that the barycenter propagator adds to the solution process. As discussed in Sect. 4.1.1, the barycenter propagator is less powerful for the epigraph model than for the quadratic model, which is also reflected in the results. Nevertheless, it allows 1 more instance to be solved to global optimality and it slightly reduces the running times. Looking at Fig. 2 again, we also see that the gaps improve for many instances.

8.3.4 Convexity+Cone propagator

The convexity propagator is based on geometric ideas and is extremely powerful. Using only this propagator and the heuristics, we can already solve 3 more instances if the quadratic model is used, and 2 more instances if the epigraph model is used. Without the convexity propagator, these instances cannot be solved. This technique drastically helps the solution process of the MSSC problem. Besides allowing more instances to be solved, it also requires only half of the time that was needed before. Moreover, the number of nodes that need to be processed to solve the instances also reduces significantly.

Using cone propagation in combination with the convexity propagator, this effect is even more pronounced. It allows 1 more instance to be solved if the quadratic model is used and 2 more instances if the epigraph model is used, and it results in much lower mean running times.

8.3.5 Barycenter+Convexity+Cone propagators

Although the barycenter and convexity-cone propagators alone already improve SCIP’s performance significantly, their combination allows 3 further instances to be solved in the quadratic model. This results in a significant reduction of the running time by approximately 46%. Interestingly, the mean number of nodes in the combined setting is roughly twice as large as when only the convexity-cone propagator is used. This again shows that the reductions found by the propagators drastically simplify the structure of the relaxations, e.g., because fixed x-variables remove non-convex expressions from the quadratic model. These reductions allow SCIP to process more nodes, which in turn allows more instances to be solved. In the epigraph model, the combination of the three propagators does not qualitatively change the results. Although 1 fewer instance can be solved, the gaps improve for many instances; see Fig. 3.
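To see why such fixings simplify the quadratic model, consider—as a sketch, assuming the objective contains bilinear terms coupling a binary assignment variable \(x_{ij} \in \{0,1\}\) with the squared distance from data point \(a_i\) to the center variable \(c_j\)—a single term
\[
  x_{ij}\,\Vert a_i - c_j\Vert^2 \;=\;
  \begin{cases}
    0 & \text{if } x_{ij} \text{ is fixed to } 0,\\
    \Vert a_i - c_j\Vert^2 & \text{if } x_{ij} \text{ is fixed to } 1.
  \end{cases}
\]
In both cases, the non-convex product disappears and only a constant or a convex quadratic in \(c_j\) remains, which is why fixings found by the propagators tighten the relaxations.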

Fig. 3
figure 3

Comparison of gaps using different propagators for 2 clusters

We conclude that, for both the quadratic and the epigraph model, our propagation algorithms are an important component for solving the MSSC problem to global optimality. In particular, using combinations of these propagators creates synergies that allow more instances to be solved than with a single propagator alone, where the effect is more prominent for the quadratic model.

8.3.6 Cutting planes

Next, we evaluate the impact of cutting planes on SCIP. As before, we enable all heuristics and, due to the positive effect of the propagators, also the convexity-cone and barycenter propagators. Preliminary numerical results showed that localizing the cardinality cuts does not, in general, have a positive impact on the solution process. Therefore, we focus only on the outer-approximation (OA) cuts. Since these cuts are only applicable to the epigraph model, we concentrate only on the epigraph model in the following discussion. Table 4 summarizes our results.
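To recall the flavor of these cuts, suppose—as an illustrative assumption, not necessarily the exact form used in our implementation—that the epigraph model contains convex constraints of the type \(\eta_{ij} \ge \Vert a_i - c_j\Vert^2\) with epigraph variable \(\eta_{ij}\) and center variable \(c_j\). Since the right-hand side is convex in \(c_j\), a gradient-based OA cut at a reference point \(\bar{c}_j\) reads
\[
  \eta_{ij} \;\ge\; \Vert a_i - \bar{c}_j\Vert^2 + 2\,(\bar{c}_j - a_i)^\top (c_j - \bar{c}_j),
\]
i.e., the first-order Taylor underestimator of the quadratic at \(\bar{c}_j\), which is valid everywhere by convexity.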

Table 4 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using or not the OA cuts for 2 clusters

At first glance, it seems that OA cuts only have a minor impact on SCIP’s performance, as the number of solved instances does not change. Comparing the gaps with and without OA cuts, however, reveals a clear impact; see Fig. 4. For 8 instances, we observe a change in the gap if cuts are enabled. In 3 cases, the gaps slightly degrade when OA cuts are enabled. For the remaining 5 instances, however, OA cuts reduce the gap, in some cases drastically. Thus, although no clear trend is visible, we may conclude that OA cuts are helpful when solving MSSC problems. The effect of OA cuts is less pronounced than the effect of the propagators, which might be explained by the fact that OA cuts do not exploit the specific problem structure of the MSSC problem. In contrast, our novel propagation techniques are tailored to the MSSC problem and thus allow for stronger reductions.

Fig. 4
figure 4

Comparison of gaps using propagators with the OA cuts enabled or not for 2 clusters

8.3.7 Distance propagator

As reported above, the distance propagator alone is not able to significantly improve SCIP’s performance. For this reason, we test its effect when further components are enabled as well. From Table 5, we can see that using the distance propagator together with our other techniques has a slightly positive effect. Therefore, it is also enabled in the experiments discussed next.

Table 5 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using or not the distance propagator for 2 clusters

8.3.8 Branching rules

The last components to be tested are the branching rules. We have implemented all branching rules described in Sect. 5. Preliminary numerical experiments, however, revealed that only the entropy and distance branching rules may be beneficial for some instances. In contrast, the centrality and pairs-in-the-middle-of-pairs rules harm the solution process, leading to a less well-performing code. Therefore, in the following discussion we focus only on the branching rules that have a positive impact on some instances. We present the summary results in Table 6. By “standard” we refer to SCIP’s default branching rule.

Our experiments show that no branching rule dominates the others. On the one hand, by using the distance branching rule and the quadratic model, 1 additional instance can be solved. On the other hand, the number of nodes and the required time increase. If the epigraph model is used, then the entropy branching rule performs best: The number of solved instances remains the same but the running times are slightly lower. The overall impact of the branching rules, however, seems to heavily depend on the instance to be solved, so we cannot declare a clear winner.

Table 6 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using different branching rules for 2 clusters

8.3.9 Best setting

To conclude the discussion of the numerical results for \(k=2\), we show a comparison of plain SCIP with the best combination of the techniques proposed in this paper. The latter comprises primal heuristics, propagators, OA cuts (for the epigraph model), and the standard branching rules of SCIP, since our branching rules and the standard branching rules perform equally well on average. This comparison is shown in Fig. 5.

Fig. 5
figure 5

Running times and gaps comparison between plain SCIP and SCIP enabled with the best setting for 2 clusters

Regarding the performance of plain SCIP, we emphasize that the dual bounds found by SCIP in the quadratic model are very weak, which leads to very large gaps. In contrast, if the epigraph formulation is used, better dual bounds can be obtained with plain SCIP. The plots show that, despite their simplicity, our novel geometric ideas drastically improve the performance of SCIP, thus adding powerful methods to the toolbox for solving the MSSC problem to global optimality if \(k = 2\).

These methods work particularly well for instances with 2-dimensional data and at most 2048 data points, as almost all such instances from our test set (see Table 1) can be solved to global optimality within the time limit using the quadratic model. The only exception is instance 11, which terminates after one hour with a gap of 19.02%. For more detailed results of the best setting we refer the reader to Table 21 in Appendix B.

8.4 Discussion of the numerical results for 3 clusters

We now turn our attention to the experiments for \(k=3\). The MSSC problem is much harder to solve to global optimality in this setting. We proceed as in the last section.

8.4.1 Primal heuristics

The summary results of plain SCIP and SCIP enabled with our primal heuristics are presented in Table 7.

Table 7 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using different heuristics for 3 clusters

By using plain SCIP and the epigraph model, we can solve only 1 instance to global optimality. Enabling heuristics does not change the number of solved instances. This is in line with the observations made above: the main difficulty in solving the MSSC problem to global optimality is to provide tight dual bounds, which are not provided by primal heuristics. However, we observe that by enabling the heuristics in the epigraph model, the primal-dual gap improves substantially for many instances; see Fig. 6. For this reason, we enable heuristics in the following experiments.

8.4.2 Propagators

Next, we evaluate the impact of our propagation techniques for \(k=3\). In Table 8, we show the summarized results obtained by activating our primal heuristics and the propagators. Taking a general look at the results and comparing the first row (SCIP + heuristics) with the last row, we see that 2 more instances can be solved to global optimality, using either the quadratic or the epigraph model. Thus, although the MSSC problem for \(k=3\) is much harder to solve than for \(k=2\), the propagation techniques are still helpful; see also Fig. 7, where we compare the gaps obtained by using the propagators. In what follows, we discuss the benefits of the individual propagators in turn.

Fig. 6
figure 6

Comparison of gaps using or not the heuristics for 3 clusters

Table 8 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using different propagators for 3 clusters
Fig. 7
figure 7

Comparison of gaps using different propagators for 3 clusters

8.4.3 Barycenter propagator

By enabling only the barycenter propagator, we see that more nodes can be explored in the branch-and-bound tree. In particular, when the quadratic model is used, there is a huge difference in the number of nodes when comparing the first two rows of Table 8. Moreover, the barycenter propagator allows one more instance to be solved that could not be solved before by the quadratic model, resulting in a much lower mean running time. Finally, although the number of solved instances increases only slightly, we can see that the barycenter propagator also has a very positive effect on the other instances: the primal-dual gap improves significantly, in particular for the quadratic model; see Fig. 7.

Besides the improvement in the gaps for the epigraph model, the barycenter propagator also allows more nodes to be explored while reducing the running times considerably. Therefore, also for \(k=3\), the barycenter propagator helps to simplify the relaxations used in branch-and-bound.

8.4.4 Convexity+Cone propagator

If we use only the convexity propagator and compare the results with those of plain SCIP with heuristics enabled, then we see that more nodes can be processed in significantly less time. Additionally enabling the cone propagator allows two more instances to be solved in the epigraph model; see Fig. 7 again. Regarding the quadratic model, using the convexity and cone propagators allows more nodes to be processed in the branch-and-bound tree. This, however, comes at the price that the solvable instance requires more time. We conclude that the convexity and cone propagators enhance the solution process, where the effect is more dominant for the epigraph model.

8.4.5 Barycenter+Convexity+Cone propagators

The quadratic model clearly benefits from using the barycenter propagator together with convexity and cone propagation, as one more instance can be solved and in less time. For the epigraph model, no significant change regarding running times can be observed; however, the combination has a positive effect on some gaps; see Fig. 7.

8.4.6 Cutting planes

Regarding the cutting planes, we again focus only on the OA cuts, since the localized version of the cardinality cuts does not positively affect the solution process in general. In Table 9, we present the aggregated results for the epigraph model with and without OA cuts. Note that by using the OA cuts we can solve the same number of instances while requiring less time and far fewer nodes. The OA cuts help mainly in terms of dual bounds; see also Fig. 8. Therefore, we conclude that the OA cuts are also beneficial for the epigraph model in the harder case of \(k=3\).

Table 9 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using OA cuts for 3 clusters
Fig. 8
figure 8

Comparison of gaps using different propagators and using or not the OA cuts for 3 clusters

8.4.7 Distance propagator

The distance propagator does not impact the solution process if used standalone. The reason is that, for most of the instances, the propagator does not find any reduction, which is to be expected because fixing x-variables to 0 is less powerful than fixing them to 1 (as the convexity propagator does, for instance). However, as more reductions and fixings are made by other components, the distance propagator starts to help slightly; see the summary results shown in Table 10.

Table 10 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using or not the distance propagator for 3 clusters

The distance propagator improves the solution process of both models, since more nodes can be explored in less time while the same number of instances is solved. Therefore, we enable this propagator as well in the following experiments.

8.4.8 Branching rules

Finally, we report on the effect of the branching rules. Again, the centrality and pairs-in-the-middle-of-pairs rules hinder the solution process. Therefore, we focus only on the entropy and distance branching rules, whose results are summarized in Table 11. Overall, the branching rules do not change the results substantially. In Fig. 9, we see that the entropy branching rule has a positive effect on the gaps of four instances for the quadratic model, but harms the solution process for two other instances. Moreover, it requires more time. For the epigraph model, it clearly performs worse. The standard and distance branching rules perform equally well for the quadratic model; see Fig. 9 again. For the epigraph model, the standard branching rule performs best.

Table 11 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using different branching rules for 3 clusters
Fig. 9
figure 9

Comparison of gaps using different branching rules for 3 clusters

8.4.9 Best setting

To finalize the discussion of the numerical results for the case \(k=3\), we again compare plain SCIP with the best setting so far. The best setting comprises the primal heuristics, the four propagators, the OA cuts (for the epigraph model), and the standard branching rules that are part of SCIP, all enabled. The comparison is presented in Fig. 10.

Fig. 10
figure 10

Running times and gaps comparison between plain SCIP and SCIP enabled with the best setting for 3 clusters

As for the case \(k=2\), the dual bounds obtained with the epigraph model are better than the ones obtained with the quadratic model if plain SCIP is used. Although not many instances can be solved using our techniques, the improvement that they bring to the solution process is significant in terms of primal-dual gaps. The instances solved to global optimality within the time limit are the smallest instances in our test set in terms of data points, i.e., instances 2 and 3 can be solved by the quadratic model and the epigraph model additionally solves instance 12; see Table 1 for details about these instances and Table 31 in the Appendix for detailed results per instance.

8.5 Discussion of the numerical results for samples of instances

The results discussed in the last two sections indicate that our techniques are highly beneficial for SCIP when solving the MSSC problem with a number of data points that is not too large. On the tested instances, this roughly means \(n < 1000\) and \(d=2\). To further support this hypothesis, we conducted experiments on a broader test set of smaller instances. We formed 10 new sub-test sets by extracting samples of sizes \(\{100, 200, \ldots , 1000\}\) from the two-dimensional and large instances a1, a2, a3, s1, s2, s3, s4. Each sub-test set comprises 7 instances, which yields 70 additional instances in total. To sample the data, we use the Python routine skopt.sampler.Sobol; for Sobol sequences, see Sobol’ [56]. In Fig. 11, we show two examples of these samples (or sub-instances); a sketch of the sampling step is given after the figure. The blue points represent the original data points, while the red crosses represent the obtained sample. We believe that these samples extract meaningful information from the larger instances, as the samples look similar to the full instances.

Fig. 11
figure 11

Example of samples extracted from the s1 instance. The sample sizes are: 200 (left) and 500 (right). The blue points represent the original data points, while the red crosses represent the sampled data points
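The following minimal Python sketch illustrates how such a sub-instance can be generated. Only the use of skopt.sampler.Sobol is taken from the description above; the helper name sample_subinstance and the nearest-neighbor mapping from Sobol points back to original data points are illustrative assumptions rather than the exact procedure used in our experiments.

import numpy as np
from skopt.sampler import Sobol  # scikit-optimize

def sample_subinstance(points, n_samples, seed=0):
    """Illustrative sketch: draw a Sobol sample in the bounding box of the
    data and map each Sobol point to its nearest original data point."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    box = [(float(l), float(u)) for l, u in zip(lo, hi)]
    sobol_pts = np.asarray(Sobol().generate(box, n_samples, random_state=seed))
    # Nearest original data point for every Sobol point (Euclidean distance).
    dists = np.linalg.norm(points[None, :, :] - sobol_pts[:, None, :], axis=2)
    idx = np.unique(dists.argmin(axis=1))  # duplicate hits are dropped
    return points[idx]

# Example: a 200-point sub-instance from synthetic 2-dimensional data.
rng = np.random.default_rng(42)
data = rng.normal(size=(5000, 2))
sample = sample_subinstance(data, n_samples=200)
print(sample.shape)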

In the following, we investigate how our novel methods scale with increasing problem size. We focus on the case \(k=2\), because the instances for \(k=3\) are still very challenging to solve and drawing reliable conclusions is difficult.

8.5.1 Primal heuristics and propagators

Table 12 shows aggregated results obtained by enabling our primal heuristics and propagators in SCIP. We can immediately see that SCIP’s performance clearly improves as more of our components are enabled: more instances can be solved and the running times decrease drastically. In particular, enabling all our techniques allows us to solve, with the quadratic model, all 7 instances per sub-test set for up to 500 data points. Using the epigraph model, all instances with up to 900 data points can be solved to global optimality within the time limit. Moreover, for both models, we can solve almost all instances with 1000 data points if all our methods are enabled. Without our techniques, SCIP cannot solve a single instance, even if the number of data points is only 100.

Table 12 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using different propagators for the sampled instances and 2 clusters

8.5.2 Cutting planes

We again focus only on the OA cuts and on the epigraph model. The summarized results are presented in Table 13. Interestingly, the OA cuts do more harm than good in this experiment. We thus conclude that, since these cuts are not based on problem-specific clustering ideas, they do not help as much as the propagators do for solving the MSSC problem effectively.

Table 13 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using or not the OA cuts for the epigraph model

8.5.3 Branching rules

Now we evaluate the performance of our branching rules. Again, we focus only on the two branching rules that yield some improvement in the solution process of the MSSC problem. The comparison results are displayed in Table 14.

Table 14 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using different branching rules

The entropy branching rule performs better if the epigraph model is used, whereas both the distance and the standard branching rules of SCIP perform equally well if the quadratic model is used.

8.5.4 Best setting

As in the previous sections, we conclude the analysis by comparing plain SCIP with the best setting; see Fig. 12. One can clearly see the significant improvement achieved by using our techniques. Moreover, it is also visible that the instances become harder as their sizes grow. Therefore, if the MSSC problem at hand is not too large, i.e., the number of data points \(n\) is around 1000, the dimension \(d\) is 2, and the number of clusters \(k\) is 2, then our techniques can be used to efficiently solve the problem to global optimality in just a few seconds. However, it is important to note that this holds only for the instances we consider, because the difficulty of an MSSC problem does not solely depend on the size of the instance but also on the structure of the given data points.

Fig. 12
figure 12

Running times and gaps comparison between plain SCIP and SCIP enabled with the best setting for the sampled instances and for 2 clusters

From this experiment, we also conclude that, in general, the quadratic model performs better regarding running times, whereas the epigraph model performs better regarding the dual bounds, which in turn leads to more instances being solved.

9 Conclusion

Solving the MSSC problem to global optimality is a very challenging task that has already received considerable attention in the literature. Nevertheless, the problem is far from being “practically solved”. In this paper, we propose different techniques (including propagation, cutting planes, branching rules, and primal heuristics) that can be incorporated in a branch-and-bound framework for solving the problem. Our extensive numerical study shows that these novel techniques significantly help to improve the solution process. On the one hand, we can now solve instances that have not been solvable before. On the other hand, the optimality gaps for those instances that remain unsolved are significantly reduced.

Not surprisingly, there are still some ideas left for future research. Let us sketch two of them. First, we have shown that our techniques can be used to globally solve instances of moderate size. Thus, our methods could also be used within solution approaches for the MSSC problem that rely on reducing the dimension or the size of the originally given problem; see, e.g., Hua et al. [35]. Second, there exist variants of the MSSC problem with additional side constraints, as discussed in, e.g., Liberti and Manca [38]. Such side constraints allow for solution techniques that are feasibility-based, whereas all our techniques are optimality-based. Hence, a combination of both could yield an overall branch-and-bound framework that is even more effective for side-constrained MSSC problems.