1 Introduction

Given a set of data points in a normed vector space and a number of clusters, the clustering problem consists in deciding which data point should be assigned to which cluster. Moreover, a representative point for each cluster needs to be determined. Clustering problems form a highly relevant sub-class of unsupervised learning in machine learning and computational statistics. Their relevance is underlined by many applications, e.g., in functional data analysis [10, 53], image processing [9], bio-informatics [12], economics [32], and social sciences [31]. For a detailed survey of the history of clustering problems we refer to Steinley [58]. Depending on, e.g., how vicinity is measured and on whether the representative is an arbitrary point or one of the data points, different variants of clustering problems arise. In this paper, we consider the minimum sum-of-squares clustering (MSSC) problem. Here, the distance between a data point and its cluster representative is measured using the squared Euclidean norm and any point can be chosen as the representative of a cluster.

Modeling this problem leads to a nonconvex mixed-integer nonlinear optimization problem (MINLP) that is extremely hard to solve for high-dimensional real-world instances. Moreover, the problem is known to be NP-hard even in the case of two dimensions; see, e.g., Aloise et al. [11], Dasgupta [2], and Mahajan et al. [41]. This is why the problem is most frequently solved using heuristics, of which the k-means clustering method is the most prominent one; see, e.g., Lloyd [39] and MacQueen [40]. However, solving such clustering problems only heuristically may come with severe disadvantages. Since the MSSC problem is an unsupervised learning problem, the outcome typically requires the interpretation of experts from the specific field of application such as medicine or the social sciences. This interpretation, however, may be completely wrong if the expert is confronted with a heuristic clustering solution of poor quality, and it is easy to imagine that such a misleading interpretation might have severe, e.g., medical, consequences. Thus, there is a strong need for sophisticated optimization techniques that improve the process of solving clustering problems to global optimality, and this is exactly the contribution of this paper: We take the MINLP solver SCIP and enhance its solution process by developing novel mixed-integer optimization techniques that enable us to solve MSSC instances to global optimality that cannot be solved with the plain version of SCIP.

Of course, we are not the first to try to solve the MSSC problem to global optimality. To the best of our knowledge, the earliest application of branch-and-bound methods is presented by Fukunaga et al. [25], which was later refined by Diehr [15]. A variant of a so-called repetitive branch-and-bound method has been devised by Brusco [7], who concludes that the method is well-suited for a small number of clusters. Another branch-and-bound approach is presented by Sherali and Desai [55]. The authors use reformulation-linearization techniques (RLT) embedded in a branch-and-bound method to solve the problem to global optimality. In their introduction, they also note the “limited number of optimization techniques” as opposed to the rather large number of heuristics used in practice. As an additional technique, the authors present further valid inequalities to tackle the inherent symmetry of the problem. Regarding symmetry breaking for clustering problems, we also refer to Plastria [48], which is, generally speaking, a modeling tutorial paper but also contains a discussion of symmetry breaking constraints for the clustering problem. Aloise and Hansen [4] also consider the MSSC problem and try to reproduce the results of Sherali and Desai [55]. However, the reproduction failed since significantly longer running times were observed. Consequently, the reported computational efficiency of the RLT-based branch-and-bound method should be taken with some care. Another algorithmic technique is column generation, first presented in Merle et al. [18] and later reconsidered and improved by Aloise et al. [5]. Other classic techniques of mixed-integer (non)linear optimization have also been applied, such as generalized Benders decomposition in Floudas et al. [21] and Tan et al. [59]. Alternatively, Peng and Xia [45] consider the MSSC problem as a concave minimization problem and adapt Tuy’s cut method, see Horst and Tuy [34], to solve it. Further, Prasad and Hanasusanto [49] propose improved conic reformulations of the MSSC problem and also study some symmetry breaking techniques. Tîrnăucă et al. [60] follow a more geometric approach based on Voronoi diagrams. Finally, there is a rather large branch of literature on the application of techniques from semidefinite programming (SDP). Peng and Wei [46] and Peng and Xia [44] proved the equivalence between the MSSC problem and a 0-1 SDP reformulation. Based on this 0-1 SDP model, Aloise and Hansen [3] propose a branch-and-cut algorithm and solve instances with up to 202 data points to global optimality. More recently, Piccialli et al. [47] consider the same mixed-integer SDP for the MSSC problem and propose another branch-and-bound algorithm that is capable of solving real-world instances with up to 4000 data points. To the best of our knowledge, this is the most recent state-of-the-art branch-and-bound algorithm for the MSSC problem. SDP-like models have also been used in De Rosa and Khajavirad [13], who encode the clustering via \(Z = X X^{\top } \in [0,1]^{n \times n}\) instead of \(X \in \{0,1\}^{n \times k}\). The authors derive cutting planes and relate them to the cut polytope; see Deza and Laurent [14] for a survey on the latter. The presented numerical experiments show that these novel cutting planes can be strong, but the authors only solve the initial LP relaxation and do not apply a complete branch-and-bound method.
Finally, some recent ideas based on reduced-space techniques seem to be very promising; see Hua et al. [35] and Liberti and Manca [38]. Besides that, Liberti and Manca [38] discuss the MSSC problem with several side constraints. One of their base models is, in particular, the convex MINLP that we present in the next section.

In our contribution, we add to the literature on solving the MSSC problem to global optimality. To this end, we develop novel mixed-integer programming techniques that are mainly motivated by geometric insights and that improve the branch-and-cut solution process of an MINLP solver. To be more precise, we present two MINLP formulations of the problem (Sect. 2), develop cutting planes (Sect. 3), propagation methods (Sect. 4), as well as problem-specific branching rules (Sect. 5) and primal heuristics (Sect. 6). We implement and test all techniques in the open-source MINLP solver SCIP; see Gamrath et al. [26]. By doing so, we also automatically apply state-of-the-art symmetry breaking techniques to the problem; see Sect. 7. Our numerical results are presented and discussed in Sect. 8, where we show that our techniques significantly improve the solution process. We close the paper with some concluding remarks and potential topics for future work in Sect. 9. Our code is publicly available on GitHub. Although our numerical results clearly show that the solution process of an MINLP solver applied to the MSSC problem is significantly improved, we do not beat the current state-of-the-art SDP-based techniques as studied in Piccialli et al. [47]. Nevertheless, we are convinced that it is worthwhile to also push MINLP-based approaches forward so that, in the end, techniques from different approaches can be combined into an even better, possibly hybrid, solution method.

2 MINLP models for the MSSC problem

We now model the minimum-sum-of-squares clustering (MSSC) problem as a mixed-integer nonlinear optimization problem (MINLP). To this end, we are given a set of data points \(p \in P \subseteq \mathbb {R}^d\) and a positive integer \(2 \le k \le |P|\), which is the number of clusters of the problem. The task is then to assign every data point \(p \in P\) to a cluster (indexed by \(j \in [k] {:}{=}\{1, \dotsc , k\}\)) so that the sum of the squared Euclidean distances between the data points and the corresponding centroids \(c^j\) is minimal. This problem is modeled via the following MINLP:

$$\begin{aligned} \min _{x, c} \quad&\sum _{p \in P} \sum _{j \in [k]} \, x_{pj} \Vert p - c^j\Vert ^2 \end{aligned}$$
(1a)
$$\begin{aligned} {{\,\mathrm{s.t.}\,}}\quad&\sum _{j \in [k]} x_{pj} = 1, \quad p \in P, \end{aligned}$$
(1b)
$$\begin{aligned} \quad&x_{pj} \in \{0,1\}^{}, \quad p \in P, \ j \in [k], \end{aligned}$$
(1c)
$$\begin{aligned} \quad&c^j \in B, \quad j \in [k]. \end{aligned}$$
(1d)

The binary variables \(x_{pj}\) are the assignment variables that model whether the data point p is assigned to cluster j \((x_{pj} = 1)\) or not \((x_{pj} = 0)\). Moreover, \(B \subseteq \mathbb {R}^d\) is a set that contains all points in P. This can, e.g., be the bounding box of P. That is, if for each \(i \in [d]\), \(\ell _i = \min \{p_i:p \in P\}\) and \(u_i = \max \{p_i:p \in P\}\), then \(B = \{c \in \mathbb {R}^d: \ell _i \le c_i \le u_i,\; i \in [d]\}\) is a valid choice. Note that (1d) is not necessary for the correctness of Model (1). Nevertheless, we include it in our implementation, because Model (1) is a nonconvex MINLP, for which bounds on variables are usually beneficial. The objective function measures the sum of the squared Euclidean distances between the data points and the centroids of the clusters to which they belong. Finally, Constraint (1b) ensures that every point is assigned to exactly one cluster.

Note that this model is cubic since the objective function multiplies the assignment variables x with squared norms that depend on the centroids c, which are variables of the problem as well. In particular, Model (1) is a nonconvex MINLP. However, it can also be re-written as a convex MINLP in a lifted space by using its epigraph formulation. To this end, we model each term in the objective function using a separate variable and bound this variable in a newly introduced constraint. The resulting problem then reads

$$\begin{aligned} \min _{x,c,\eta } \quad&\sum _{p \in P} \sum _{j \in [k]} \eta _{pj} \end{aligned}$$
(2a)
$$\begin{aligned} {{\,\mathrm{s.t.}\,}}\quad&\eta _{pj} \ge \Vert p - c^j\Vert ^2 - M_p(1-x_{pj}), \quad p \in P,\ j \in [k], \end{aligned}$$
(2b)
$$\begin{aligned}&\sum _{j \in [k]} x_{pj} = 1, \quad p \in P, \end{aligned}$$
(2c)
$$\begin{aligned}&x_{pj} \in \{0,1\}^{}, \quad p \in P,\ j \in [k], \end{aligned}$$
(2d)
$$\begin{aligned}&c^j \in B, \quad j \in [k], \end{aligned}$$
(2e)
$$\begin{aligned}&\eta _{pj} \ge 0, \quad p \in P, \ j \in [k], \end{aligned}$$
(2f)

where \(M_p\) are sufficiently large numbers. The objective function is linear now and we obtain the additional quadratic and convex constraints in (2b).

For every \(p \in P\), \(M_p\) can be chosen to be the maximum squared distance of p to any other point \(\tilde{p} \in P\). An overestimate can easily be computed via

$$\begin{aligned} M_p = M = (u_1 - \ell _1)^2 + \cdots + (u_d - \ell _d)^2, \end{aligned}$$

where \(\ell _i\) and \(u_i\) are the componentwise bounds of the bounding box given above. Moreover, for a given cluster assignment x, an optimal choice for the cluster centroids is immediate, an observation that we will exploit frequently.

Observation 2.1

For a given assignment of x-variables adhering to (1b) or (2c), respectively, the optimal choice for \(c^j\), \(j \in [k]\), is

$$\begin{aligned} \frac{\sum _{p \in P} p \, x_{pj}}{\sum _{p \in P} x_{pj}}, \end{aligned}$$

i.e., the barycenter of all points assigned to cluster j.
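To make Observation 2.1 and the big-M overestimate above concrete, the following minimal Python sketch (the helper names are ours and purely illustrative; the data points are assumed to be given as a NumPy array) computes the barycenters and the MSSC objective value for a given assignment as well as the overestimate M from the bounding box of P.

```python
import numpy as np

def centroids_and_objective(P, assign, k):
    """Barycenters (Observation 2.1) and MSSC objective for a given assignment.
    P: (n, d) array of data points; assign: length-n vector with values in {0, ..., k-1}."""
    c = np.zeros((k, P.shape[1]))
    obj = 0.0
    for j in range(k):
        Q = P[assign == j]
        if len(Q) > 0:
            c[j] = Q.mean(axis=0)                       # barycenter of cluster j
            obj += np.sum(np.linalg.norm(Q - c[j], axis=1) ** 2)
    return c, obj

def big_M(P):
    """Overestimate M = sum_i (u_i - l_i)^2 from the bounding box of P."""
    lo, hi = P.min(axis=0), P.max(axis=0)
    return float(np.sum((hi - lo) ** 2))
```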

3 Cutting planes

Without doubt, cutting planes are among the most powerful techniques to enhance the solution process for mixed-integer problems. Modern MI(N)LP solvers have many general-purpose cutting planes built in. However, it is very often beneficial to derive problem-specific cutting planes. This is particularly important for the MSSC problem, since it is well-known that one of the most challenging issues for developing an efficient branch-and-bound algorithm is the computation of good lower bounds in a reasonable amount of time. In this section, we state two tailored families of cutting planes. The first one is applicable to Models (1) and (2), whereas the second one is only applicable to Model (2).

3.1 Cardinality cuts

We first briefly discuss cardinality cuts, which are already mentioned in Aloise et al. [5] as well as in Sherali and Desai [55]. Consider an optimal solution of Model (1). In this optimal solution, there cannot be an empty cluster, because otherwise the objective value could be decreased by assigning a point that is not a centroid to that empty cluster. At the other extreme, a cluster contains at most \(|P| - k + 1\) data points, which is attained if every other cluster consists of a single data point. Thus, the following cardinality cuts can be added to tighten Model (1):

$$\begin{aligned} 1 \le \sum _{p \in P} x_{pj} \le |P| - k + 1, \quad j \in [k]. \end{aligned}$$

Obviously, the same cuts are also valid for Model (2). Moreover, note that the upper bound is implied by the model’s constraints and the lower bound of the previous inequalities as \(\sum _{p \in P} \sum _{j = 1}^k x_{pj} = |P|\) implies for a fixed \(j \in [k]\) that \(\sum _{p \in P}x_{pj} = |P| - \sum _{p \in P} \sum _{j' \in [k] {\setminus } \{j\}} x_{pj'} \le |P| - (k-1)\).

The idea of cardinality cuts can also be localized, i.e., the cardinality bounds can be adapted to take local variable bounds at a node of the branch-and-bound tree into account. To this end, we introduce, for each \(j \in [k]\), the integer variable \(\kappa _j\) with domain \(\{k-1,\dots ,|P| - 1\}\) and link it to the x-variables via the linear constraint \(\kappa _j + \sum _{p \in P} x_{pj} = |P|\), \(j \in [k]\). That is, \(\kappa _j\) counts the number of data points that are not assigned to cluster j. If a lower bound \(\underline{\kappa }_j\) and an upper bound \(\bar{\kappa }_j\) on \(\kappa _j\) are given, this equation implies the inequalities

$$\begin{aligned} 1 \le |P| - \bar{\kappa }_j \le \sum _{p \in P} x_{pj} \le |P| - \underline{\kappa }_j \le |P| - k + 1, \quad j \in [k]. \end{aligned}$$

That is, they describe localized versions of cardinality cuts that get stronger if x-variables get fixed. Another side effect of the auxiliary variables \(\kappa _j\) is that a solver might decide to branch on these variables. In doing so, it imposes bounds on the size of cluster \(j \in [k]\).

3.2 Outer approximation cuts

We now focus on Model (2). The only nonlinear constraints (2b) in this problem are convex. Hence, their first-order Taylor approximations at any point \((\bar{\eta }, \bar{c}, \bar{x})\) are global underestimators and thus provide valid inequalities that are linear in \((\eta , c, x)\):

$$\begin{aligned} \sum _{i=1}^d \left( 2\bar{c}_i^j c_i^j - 2 p_i c_i^j + (p_i)^2 - (\bar{c}_i^j)^2 \right) - \eta _{pj} - M_p(1-x_{pj}) \le 0, \quad p \in P,\ j \in [k]. \end{aligned}$$
(3)

This allows us to solve Model (2) in an outer approximation or LP/NLP-based branch-and-bound fashion; see Duran and Grossmann [17] and Fletcher and Leyffer [20] or Quesada and Grossmann [50], respectively. We start by relaxing the constraint set (2b). Next, we assume that \((\bar{\eta }, \bar{c}, \bar{x})\) is a solution of this relaxation, i.e., it particularly fulfills the binary conditions (2d). If the relaxation’s solution is feasible for the nonlinear constraints (2b), it is also a solution of Model (2). If not, we can compute a feasible point \((\hat{\eta }, \hat{c}, \bar{x})\) of Model (2). In the original outer approximation method, this is done by fixing the binary variables to \(\bar{x}\) in Model (2) and solving the resulting convex NLP subproblem; see Duran and Grossmann [17]. The benefit in our specific application is that solving the subproblem boils down to a simple computation of the barycenters \(\hat{c}\), see Observation 2.1, followed by an evaluation of the distances \(\hat{\eta }\) according to Constraints (2b) and (2f).

From the theory of outer approximation, it is well-known that when adding the inequalities (3) at the solution \((\hat{\eta }, \hat{c}, \bar{x})\) of the subproblem, it holds

$$\begin{aligned} \sum _{p \in P} \sum _{j \in [k]} \eta _{pj} \ge \sum _{p \in P} \sum _{j \in [k]} \hat{\eta }_{pj} \end{aligned}$$

for all feasible points \((\eta , c, \bar{x})\) of the updated relaxation. In other words, adding the outer-approximation cuts bounds the optimal objective value of the relaxation with fixed binaries \(x=\bar{x}\) from below by \(\sum _{p \in P} \sum _{j \in [k]} \hat{\eta }_{pj}\). Consequently, the updated relaxation yields a solution with a new, previously unseen cluster assignment x or the optimality gap is closed. Thus, iterating this process terminates after a finite number of steps; see Duran and Grossmann [17] or Fletcher and Leyffer [20] for more details. We note that the number of inequalities (3) does not depend on the dimension d, which might be beneficial for problems in higher dimensions.

Instead of implementing an LP/NLP-based branch-and-bound method from scratch, we can use solvers such as SCIP to solve Model (2). In this setting, we can separate and add cuts (3) to tighten the LP relaxations. Since for larger \(|P|\) and k adding all inequalities (3) might be impractical, we may also add only a limited number of cuts. In particular, in our implementation in SCIP, we add only 10 cuts per separation round.
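As an illustration of this separation step, the following Python sketch (our own naming; the tolerance is an assumption, the limit of 10 cuts per round is the one stated above) evaluates the violation of the constraints (2b) at the current LP point and returns the coefficients of the most violated outer-approximation cuts (3), here linearized at the LP values of the centroid variables, which is one of several possible linearization points.

```python
import numpy as np

def separate_oa_cuts(P, c_bar, x_bar, eta_bar, M, max_cuts=10, tol=1e-6):
    """Return the coefficients of the (at most) max_cuts most violated cuts (3)
    at the LP point (eta_bar, c_bar, x_bar).  Each cut reads
    2 (c_bar^j - p)^T c^j + M_p x_pj - eta_pj <= M_p + ||c_bar^j||^2 - ||p||^2."""
    cuts = []
    for p_idx, p in enumerate(P):
        for j, cj in enumerate(c_bar):
            viol = (np.sum((p - cj) ** 2) - M[p_idx] * (1.0 - x_bar[p_idx, j])
                    - eta_bar[p_idx, j])
            if viol > tol:
                coef_c = 2.0 * (cj - p)            # gradient of ||p - c||^2 at c_bar^j
                rhs = M[p_idx] + np.dot(cj, cj) - np.dot(p, p)
                cuts.append((viol, p_idx, j, coef_c, M[p_idx], rhs))
    cuts.sort(key=lambda cut: -cut[0])             # most violated cuts first
    return cuts[:max_cuts]
```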

4 Propagation

Suppose we are at a node of the branch-and-bound tree. Due to branching decisions and further reductions, some variables might have been fixed or their bounds have been tightened in comparison to the original problem formulation. The aim of propagation is to find further variable fixings or bound tightenings that are valid at the current node. That is, one tries to apply further reductions based on local variable bound information. According to Observation 2.1, every assignment of x-variables that satisfies (1b) or (2c) can be extended to a feasible solution of (1) or (2), respectively. Thus, it is crucial to derive propagation mechanisms that exclude assignments of x-variables that cannot be optimal. Moreover, we develop algorithms to strengthen bounds of the c-variables and the objective variables. Before we discuss our propagation algorithms, we fix the following notation and terminology.

For every \(j \in [k]\), we denote by \(P_j \subseteq P\) the set of all data points \(p \in P\) whose corresponding variable \(x_{pj}\) has been fixed to 1 at the current node of the branch-and-bound tree. That is, we have already decided to assign p to cluster j. Moreover, we denote by \(P'_j \subseteq P\) all data points p such that \(x_{pj}\) has not been fixed to 0 yet, i.e., p is already or can still be assigned to cluster j. Note that \(P_j \subseteq P'_j\). For a continuous variable z, i.e., for the c- and \(\eta \)-variables, we denote by \(\underline{z}\) and \(\bar{z}\) the lower and upper bound on z at the current node, respectively.

4.1 Barycenter propagation

Given a non-empty set of data points \(Q \subseteq P\) defining a cluster, the optimal choice for its centroid is the barycenter

$$\begin{aligned} \mathcal {C}(Q) {:}{=}\frac{1}{|Q|}\sum _{p \in Q} p \end{aligned}$$

of all data points in Q. The respective sum of all squared distances thus is \(\mathcal {D}(Q) = \sum _{p \in Q} \Vert p - \mathcal {C}(Q)\Vert ^2\). The idea of the barycenter propagation is to use this observation to find lower bounds on the objective and to strengthen the bounds for the c-variables.

4.1.1 Bound tightening for the objective function values

To find a lower bound on the objective in Model (1), note that for sets \(Q \,{\subseteq }\, Q' \,{\subseteq }\, P\), we have \(\mathcal {D}(Q) \le \mathcal {D}(Q')\). Consequently, a lower bound on the objective is given by \(\sum _{j \in [k]} \mathcal {D}(P_j)\). The barycenter propagator uses this value to possibly tighten the lower bound on the objective at the current node of the branch-and-bound tree. Computing this lower bound for all clusters can be done in \(O(kd|P|)\) time and it has also been used by Brusco [7], see also Guns et al. [30], in a repetitive branch-and-bound framework.
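A minimal sketch of this bound computation (the helper name is ours; P_fixed collects, per cluster, the points whose assignment variable is already fixed to 1 at the current node):

```python
import numpy as np

def barycenter_lower_bound(P_fixed):
    """sum_j D(P_j): a valid lower bound on the objective of Model (1),
    given one (m_j, d) array of already-fixed points per cluster."""
    bound = 0.0
    for Q in P_fixed:
        if len(Q) > 0:
            center = Q.mean(axis=0)                # barycenter C(P_j)
            bound += np.sum(np.linalg.norm(Q - center, axis=1) ** 2)
    return bound
```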

For Model (2), no immediate lower bound on the objective can be enforced because the objective is decoupled via the \(\eta \)-variables. Nevertheless, for each \((p,j) \in P \times [k]\), the following steps can be done. We can prune a node of the branch-and-bound tree if \(p \notin P'_j\) and \(\underline{\eta }_{pj} > 0\), because an optimal solution has \(\eta _{pj} = 0\) as data points not assigned to a cluster do not contribute to \(\mathcal {D}(P_j)\). Otherwise, if \(p \notin P'_j\) and \(\underline{\eta }_{pj} = 0\), we can fix \(\eta _{pj}\) to 0. The first step is thus a pruning operation based on sub-optimal bounds in the subproblem, whereas the second step is a bound tightening operation.

4.1.2 Bound tightening for the centroids

Recall that \([k] = \{1,\dots ,k\}\). Besides strengthening bounds on the objective, barycenter information can also be used to tighten bounds on centroid variables \(c^j_i\) with \((i,j) \in [d] \times [k]\). Suppose \(P'_j {\setminus } P_j = \{p^1,\dots ,p^s\}\) such that \(p^1_i \le p^2_i \le \dots \le p^s_i\). For each \(r \in [s]_0{:}{=}[s] \cup \{0\}\), we compute \(\gamma ^{j,r} = \mathcal {C}(P_j \cup \{p^1,\dots ,p^r\})\), i.e., the barycenter of the data points contained in \(P_j\) and the data points with the r smallest ith coordinates that are not contained in \(P_j\). As we show next, the ith coordinates of these barycenters can be used to compute a lower bound on \(c^j_i\).

Lemma 4.1

A valid lower bound on \(c^j_i\) is given by \(\min _{r \in [s]_0} \gamma ^{j,r}_i\).

Proof

Let \(Q \subseteq P'_j \setminus P_j\) and assume \(|Q| = r\). Then, \(\mathcal {C}(P_j \cup Q)_i \ge \mathcal {C}(P_j \cup \{p^1,\dots ,p^r\})_i\), because the points \(p^1, \dots , p^r\) are points with the r smallest ith coordinates. Consequently, to find a lower bound on the centroids, it is sufficient to consider \(\mathcal {C}(P_j \cup \{p^1,\dots ,p^r\})\) for each \(r \in [s]_0\). \(\square \)

Analogously, an upper bound is given by \(\max _{r \in [s]_0} \mathcal {C}(P_j \cup \{p^{s-r},\dots , p^s\})_i\). Since computing an iterative sequence of barycenters can be done using the formula

$$\begin{aligned} \gamma ^{j,r+1} = \frac{(|P_j| + r)\gamma ^{j,r} + p^{r+1}}{|P_j| + r+1}, \end{aligned}$$

we can compute the minimum and maximum values for all coordinates and clusters in \(O(kd|P|)\) time.
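The following sketch (illustrative naming) computes the lower bound of Lemma 4.1 for one coordinate and one cluster using the iterative barycenter update; the upper bound is obtained analogously by sorting in decreasing order.

```python
import numpy as np

def centroid_coordinate_lower_bound(P_fixed_j, P_free_j, i):
    """Lower bound on c^j_i (Lemma 4.1).  P_fixed_j: (m, d) array of points fixed
    to cluster j; P_free_j: (s, d) array of points in P'_j that are not in P_j."""
    order = np.argsort(P_free_j[:, i])             # p^1, ..., p^s by i-th coordinate
    n = len(P_fixed_j)
    if n > 0:
        gamma = P_fixed_j[:, i].mean()             # i-th coordinate of gamma^{j,0}
        best = gamma
    else:
        gamma, best = 0.0, np.inf                  # gamma^{j,0} undefined for empty P_j
    for r, idx in enumerate(order):
        gamma = ((n + r) * gamma + P_free_j[idx, i]) / (n + r + 1)
        best = min(best, gamma)
    return best
```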

4.2 Convexity and cone propagation

Based on optimality arguments, we can also derive rules to assign data points \(p \in P'_j \setminus P_j\) to cluster \(j \in [k]\). The key idea of the convexity propagator is the following simple observation.

Lemma 4.2

There exists an optimal solution of MSSC with clusters \(P_1, \dots , P_k\) such that, for each \(j \in [k]\), we have \({{\,\textrm{conv}\,}}(P_j) \cap P = P_j\).

Proof

Given an optimal allocation of the k centroids, the Voronoi cells

$$\begin{aligned} C_j = \{x \in \mathbb {R}^d:\Vert x - c^j\Vert \le \Vert x - c^{j'}\Vert ,\; j' \in [k]\} \end{aligned}$$

for \(j \in [k]\) cover the entire \(\mathbb {R}^d\) and only intersect at their boundaries. Since Voronoi cells are full-dimensional polyhedra, we can use the following mechanism to prove the assertion. We start with cluster 1 and observe that \(P_1 \subseteq C_1\) in any optimal solution. If there exist \(p \in P \setminus P_1\) that are contained in \(C_1\), they are necessarily contained in the boundary of \(C_1\). Hence, if we change the assignment of these points to \(P_1\), this does not change the objective of MSSC. The assertion thus holds for \(P_1\), and we can use the same arguments iteratively to conclude the proof. \(\square \)

As a consequence, the convexity propagator computes \({{\,\textrm{conv}\,}}(P_j)\) for each \(j \in [k]\). If there exists \(p \in P \cap {{\,\textrm{conv}\,}}(P_j)\) it performs the following steps: If \(p \notin P'_j\) holds, then we can prune the current node of the branch-and-bound tree, because the local variable bounds cannot lead to an optimal solution adhering to Lemma 4.2. Otherwise, \(x_{pj}\) can be fixed to 1.
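Checking whether a data point lies in \({{\,\textrm{conv}\,}}(P_j)\) can also be done via a small linear program. The following SciPy-based sketch (our own naming; not the routine used in our implementation, which relies on explicit convex hull computations) illustrates this membership test.

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(q, points):
    """Check q in conv(points) by testing feasibility of
    q = sum_r lambda_r p^r with sum_r lambda_r = 1 and lambda >= 0."""
    m = points.shape[0]
    A_eq = np.vstack([points.T, np.ones((1, m))])  # d coordinate rows + convexity row
    b_eq = np.append(q, 1.0)
    res = linprog(c=np.zeros(m), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * m)
    return res.status == 0                         # status 0: a feasible optimum was found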

Besides pruning nodes and fixing variables to 1, Lemma 4.2 has another consequence that allows us to fix some variables to 0, which is illustrated in Fig. 1.

Fig. 1 Illustration of cone-based propagation. If q is not contained in the red cluster, none of the black points can be contained in the red cluster. Assigning the white points to the red cluster is still possible

Lemma 4.3

Let \(P_1 \cup \dots \cup P_k\) be a partition of a finite set \(P \subseteq \mathbb {R}^d\). Suppose \({{\,\textrm{conv}\,}}(P_j) \cap P = P_j\) for each \(j \in [k]\). Then, for every \(q \in P {\setminus } P_j\),

$$\begin{aligned} \left( q + {{\,\textrm{cone}\,}}\{-(p - q):p \in P_j\} \right) \cap P \subseteq P \setminus P_j. \end{aligned}$$

Proof

Note that \(q + {{\,\textrm{cone}\,}}\{p - q:p \in P_j\}\) is the smallest cone with apex q that contains \({{\,\textrm{conv}\,}}(P_j)\), because we shoot rays from q through each of the finitely many points in \(P_j\). Now suppose that the negated cone \(q + {{\,\textrm{cone}\,}}\{-(p - q):p \in P_j\}\) contains a point \(\tilde{p} \in P_j\). Then, \(\tilde{p} - q = \sum _{p \in P_j} \lambda _p (q - p)\) for some \(\lambda \ge 0\), which can be rearranged to \(q = (\tilde{p} + \sum _{p \in P_j} \lambda _p p) / (1 + \sum _{p \in P_j} \lambda _p)\), i.e., \(q \in {{\,\textrm{conv}\,}}(P_j)\). This contradicts \({{\,\textrm{conv}\,}}(P_j) \cap P = P_j\) together with \(q \notin P_j\). Hence, the negated cone cannot contain any point of \(P_j\). \(\square \)

We can use this observation as follows. If there is \(q \in P {\setminus } P'_j\), i.e., \(x_{qj}\) is fixed to 0, then \(x_{yj}\) can be fixed to 0 for every data point \(y \in P\) contained in \(q + {{\,\textrm{cone}\,}}\{-(p - q):p \in P_j\}\), because assigning y to cluster j would place q in the convex hull of the points of cluster j, contradicting Lemma 4.2.
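The corresponding cone test can again be written as a small LP feasibility problem. The sketch below (illustrative naming; SciPy is used only for demonstration) checks whether a data point y lies in \(q + {{\,\textrm{cone}\,}}\{-(p - q):p \in P_j\}\), in which case \(x_{yj}\) can be fixed to 0 as well.

```python
import numpy as np
from scipy.optimize import linprog

def in_negated_cone(y, q, P_j):
    """Check y in q + cone{-(p - q) : p in P_j}, i.e., whether
    y - q = sum_p lambda_p (q - p) for some lambda >= 0."""
    rays = (q - P_j).T                             # one generator q - p per column
    res = linprog(c=np.zeros(rays.shape[1]), A_eq=rays, b_eq=y - q,
                  bounds=[(0, None)] * rays.shape[1])
    return res.status == 0                         # feasible => y can be excluded from cluster j
```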

In arbitrary dimensions, the convexity propagator cannot be implemented efficiently, because \({{\,\textrm{conv}\,}}(P_j)\) might have \(\Omega (2^d)\) many facets. In small dimensions, however, computing convex hulls can be done rather quickly and, as our numerical results will indicate, has a very positive impact on the time needed to solve MSSC instances.

4.3 Distance propagation

The distance propagator provides another set of rules to fix variables \(x_{pj}\), \((p,j) \in P \times [k]\), to 0. To this end, it defines for each \(j \in [k]\) the bounding box \(B_j = \{y \in \mathbb {R}^d: \underline{c}^j_i \le y_i \le \bar{c}^j_i,\; i \in [d]\}\) for the centroid, i.e., the smallest box that contains the centroid of cluster j based on local variable bound information. Afterward, for each \(p \in P\) and \(j \in [k]\), it computes the minimum and maximum distances \(D^{\min }_{j,p}\) and \(D^{\max }_{j,p}\) of p to the bounding box \(B_j\), i.e.,

$$\begin{aligned} D^{\min }_{j,p} = \min \left\{ \Vert p-x\Vert :x \in B_j\right\} , \quad D^{\max }_{j,p} = \max \left\{ \Vert p-x\Vert :x \in B_j\right\} . \end{aligned}$$

Since a data point is assigned to a centroid of minimum distance in an optimal solution, p cannot be assigned to cluster \(j \in [k]\) if there is \(j' \in [k]\) with \(D^{\max }_{j',p} < D^{\min }_{j,p}\). Consequently, \(x_{pj}\) can be fixed to 0 in this case.

Finding \(B_j\) and computing the minimum and maximum distance of a point to a box can be done in O(d) time. Hence, the distance propagator runs in \(O(kd|P|)\) time.
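A sketch of the two distance computations (illustrative naming); both take O(d) time per point and cluster.

```python
import numpy as np

def box_distances(p, c_lo, c_hi):
    """Minimum and maximum Euclidean distance from point p to the box
    {y : c_lo <= y <= c_hi} of possible centroid positions for one cluster."""
    nearest = np.clip(p, c_lo, c_hi)               # componentwise projection onto the box
    farthest = np.where(np.abs(p - c_lo) >= np.abs(p - c_hi), c_lo, c_hi)
    return np.linalg.norm(p - nearest), np.linalg.norm(p - farthest)
```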

5 Branching rules

After a node of the branch-and-bound tree has been processed (adding cutting planes, propagation), branching rules split the current subproblem into further subproblems to, e.g., tighten the problem formulation or enforce integrality of variables. The decision on how to split the current subproblem is typically guided by the solution of the subproblem’s LP relaxation. To enforce integrality of variables (in the notation of the MSSC problem), one typically selects a variable \(x_{pj}\) whose value in the LP solution is non-integral. Then, two subproblems are created that fix \(x_{pj}\) to 0 and 1, respectively. Although many branching rules exist that perform well for generic problems, see Achterberg et al. [1], there also exist branching rules tailored to a specific problem. This is relevant because such rules might allow further reductions to be derived based on the problem structure or by other components of a solver such as propagation mechanisms.

For integer programs, Gilpin and Sandholm [27] proposed four families of branching rules motivated by an information-theoretic perspective. The common ground of all their rules is to interpret the values \(x_{pj}\) as the probability that a point \(p \in P\) is assigned to cluster \(j \in [k]\). Using their rules, they aim at reducing the assignment uncertainty in the current subtree. The first and second family use a look-ahead approach, similar to strong branching. For the MSSC problem with a large number of clusters or points, look-ahead branching rules may become computationally prohibitive when applied to the full problem. Hence, we do not use them. The third family is called entropic look-ahead-free variable selection and the fourth is its extension to a multi-variable branching version. Since we think that the third family might be helpful for the MSSC problem, we describe it in the following. Afterward, three novel branching rules for the MSSC problem will be presented.

5.1 Entropy branching

Suppose that the optimal LP solution of the current subproblem does not satisfy the integrality constraints, which means that the relaxed solution \(\bar{x}\) is non-integral. Let \(\bar{X}\) be the set of all branching candidates that are non-integral in this LP solution, i.e.,

$$\begin{aligned} \bar{X} {:}{=}\left\{ \bar{x}_{pj}:\bar{x}_{pj} \in (0,1), \ p \in P, \ j \in [k]\right\} . \end{aligned}$$

Since each of these \(\bar{x}_{pj}\) is non-integral, the cluster assignment of point p is not fixed. For the assignment to be fixed, \(\bar{x}_{pj}\) has to be one for exactly one \(j \in [k]\). Due to Constraints (1b) or (2c), \(\bar{x}_{pj}\) can be seen as a kind of posterior probability of point p belonging to cluster j [61]. A good strategy for branching would be to select a point p for which the probabilities of all cluster assignments are almost the same.

The most unclear situation is the one in which \(\bar{x}_{pj}={1}/{k}\) holds for all \(j\in [k]\). Here, each cluster assignment is equally probable for point p. This can be seen as a homogeneous information setting. The level of homogeneity can be measured via the Shannon entropy of point p [54]. More precisely, for each point p with a fractional variable \(\bar{x}_{pj}\in \bar{X}\), the entropy of p with probabilities \(\bar{x}_{pj}\), \(j\in [k]\), is

$$\begin{aligned} H_p = - \sum _{j\in [k]} \bar{x}_{pj} \log _2 (\bar{x}_{pj}). \end{aligned}$$

The maximum entropy (\(H_p=\log _2 k\)) occurs in the above mentioned extreme case. That is, the current LP solution does not provide any information on the best (or most probable) cluster assignment of point p. The minimum entropy is obtained when there is a clear cluster assignment. In this situation, let the point p be already assigned to a cluster, e.g., to cluster \(j = 1\) and hence \(\bar{x}_{p1} = 1\). Due to (1b) (or (2c)), \(\bar{x}_{pj'} = 0\) for \(j'\in [k]\setminus \{1\}\). The entropy of p is then

$$\begin{aligned} H_p = - 1 \log _2 1 - 0\log _2 0 - \cdots - 0 \log _2 0 = 0, \end{aligned}$$

where \(0\log _2 0\) is taken to be zero.

We are interested in finding the point corresponding to a fractional variable \(\bar{x}_{p^*j^*}\in \bar{X}\) such that the entropy of point \(p^*\) is maximal over all points with fractional variables, i.e., we search for the most uncertain assignment. The point is formally given by

$$\begin{aligned} {p^*} \in \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{\{p\in P:\bar{x}_{pj}\in \bar{X}, \, j\in [k]\}} H_p. \end{aligned}$$

For this point \(p^*\), we select the cluster index j to branch on arbitrarily, i.e., we create two subproblems by adding either \(x_{p^*j} = 1\) or \(x_{p^*j} = 0\).
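A minimal sketch of this selection step (illustrative naming; x_bar is the LP solution stored as a \(|P| \times k\) matrix):

```python
import numpy as np

def entropy_branching_point(x_bar, eps=1e-6):
    """Return the index of the data point with maximal entropy H_p over its
    fractional assignment values, or None if the LP solution is integral."""
    best_p, best_H = None, -1.0
    for p_idx, probs in enumerate(x_bar):
        if not np.any((probs > eps) & (probs < 1.0 - eps)):
            continue                               # no fractional variable for this point
        pos = probs[probs > eps]
        H = float(-np.sum(pos * np.log2(pos)))     # 0 * log2(0) is taken to be 0
        if H > best_H:
            best_p, best_H = p_idx, H
    return best_p
```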

5.2 Distance branching

Next, we describe three branching rules with a geometric motivation. The first one, called distance branching, is based on the intuition that clusters should be rather compact (as opposed to being spread out). Given the current LP solution with its suggestion for the centroids \(c^j\), \(j \in [k]\), the variable \(\bar{x}_{pj}\in \bar{X}\) selected for branching is the one corresponding to the data point p and cluster j that are farthest apart from each other, i.e., we find

$$\begin{aligned} (p^*, j^*) \in \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{\{(p, j) \in P \times [k]:\bar{x}_{pj}\in \bar{X}\}} \Vert p -c^j\Vert . \end{aligned}$$

Then, we branch on the fractional variable \(\bar{x}_{p^*j^*}\), creating two subproblems by adding either \(x_{p^*j^*} = 1\) or \(x_{p^*j^*} = 0\). If an optimal cluster is indeed compact, then the 0-subproblem contains an optimal solution. Otherwise, in the 1-subproblem, the convexity propagator has the potential to also fix additional variables corresponding to data points that lie between \(p^*\) and the remaining points of cluster \(j^*\).
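The selection step of distance branching can be sketched analogously to the entropy rule (illustrative naming; c_bar are the centroid values of the current LP solution):

```python
import numpy as np

def distance_branching_candidate(P, c_bar, x_bar, eps=1e-6):
    """Return (p*, j*) with fractional x_bar[p, j] maximizing ||p - c^j||."""
    best, best_dist = None, -1.0
    for p_idx, point in enumerate(P):
        for j, cj in enumerate(c_bar):
            if eps < x_bar[p_idx, j] < 1.0 - eps:  # fractional branching candidate
                dist = np.linalg.norm(point - cj)
                if dist > best_dist:
                    best, best_dist = (p_idx, j), dist
    return best
```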

5.3 Centrality branching

Since the distance branching rule is tailored towards the extremes of compact vs. far spread-out clusters, the centrality branching rule takes a more balanced approach by selecting a point whose distance to a cluster is not too big. Given the current LP solution with its suggestion for the centroids \(c^j\), \(j \in [k]\), we would like to branch on the non-integral variable \(x_{pj}\) corresponding to the data point p and cluster j that is lying in the center of the cloud of unassigned data points. To obtain a cheap evaluation, we take the point \(p^*\) that is in the center of all centroids, i.e.,

$$\begin{aligned} p^* \in \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\{p\in P:\bar{x}_{pj}\in \bar{X}, \, j\in [k]\}} \ \sum _{j\in [k]} \Vert p -c^j\Vert . \end{aligned}$$

For this point \(p^*\), we select an arbitrary fractional variable \(\bar{x}_{p^*j^*}\), creating two subproblems by adding either \(x_{p^*j^*} = 1\) or \(x_{p^*j^*} = 0\). If the distance of \(p^*\) to cluster \(j^*\) is not too small, then there is a chance that the convexity propagator can fix further data points to be contained in cluster \(j^*\) in the 1-subproblem. In contrast to the distance branching rule, however, the 0-subproblem also becomes relevant for the convexity propagator, as it might fix some variables to 0 based on its cone propagation.

5.4 Pairs-in-the-middle-of-pairs branching

The next branching rule is a variation of the last one, but now we branch on more general linear inequalities. Given the current LP solution with its suggestion for the centroids \(c^j\), \(j \in [k]\), we would like to branch on the sum of a pair of non-integral variables corresponding to the two data points located in the middle of a pair of clusters.

First assume that there are only two clusters with corresponding centroids \(c^1\) and \(c^2\). We want to find the points that are nearest to the point lying halfway between the centroids \(c^1\) and \(c^2\). Any point \(p\in P\) that lies on the line segment between \(c^1\) and \(c^2\) minimizes the sum \(\Vert p -c^1\Vert + \Vert p -c^2\Vert \) by the triangle inequality of the Euclidean distance. By the same reasoning, the smaller this sum is, the closer point p is to the line segment between the two centroids. However, there can be multiple points p with the same value of \(\Vert p -c^1\Vert + \Vert p -c^2\Vert \). We are interested in the ones that are nearest to the middle. Thus, we penalize longer distances by minimizing the sum of squares \(\Vert p -c^1\Vert ^2 + \Vert p -c^2\Vert ^2\) instead.

Now assume that there are more than two clusters whose centroids have pairwise the exact same distance to each other. Then, looking for the two points \(p \in P\) that minimize \(\Vert p -c^j\Vert ^2 + \Vert p -c^{j'}\Vert ^2\) over \(j,j'\in [k]\) with \(j\ne j'\) still gives us the desired points. Let p and q be the selected points and j and \(j'\) be the selected clusters. We then compute the sums in the current LP solution, \(\bar{x}_{pj} + \bar{x}_{qj}\) and \(\bar{x}_{pj'} + \bar{x}_{qj'}\), and select the sum that is most fractional or least fractional; both versions are tested in our numerical experiments. Suppose that the first sum is selected, which means that the selected cluster is j. Then, we branch on

$$\begin{aligned} x_{pj} + x_{qj} \le \lfloor \bar{x}_{pj} + \bar{x}_{qj} \rfloor \end{aligned}$$

and

$$\begin{aligned} x_{pj} + x_{qj} \ge \lceil \bar{x}_{pj} + \bar{x}_{qj} \rceil . \end{aligned}$$

6 Primal heuristics

Primal heuristics try to find feasible solutions of good quality in a short amount of time. Having good feasible solutions at hand early in the solving process is crucial. Feasible solutions help to prune branch-and-bound nodes based on bounding as well as to perform further fixings and reductions. Moreover, a user may already be satisfied with the quality of the heuristic solution, such that the solving process can be stopped at an early stage. In this section we present three primal heuristics for the MSSC problem.

6.1 A root-node heuristic

To obtain a first feasible point, i.e., a point for warm-starting, we use the k-means algorithm, which is the most popular heuristic for finding a feasible solution for the MSSC problem; see, e.g., Lloyd [39] and MacQueen [40]. It consists of two main steps. First, given an initial guess for the location of the centroids, each data point is assigned to the nearest centroid. Afterward, each centroid is updated by calculating the mean of the data points assigned to this centroid. This process is repeated until the centroids no longer change. To obtain an initial guess for the location of the centroids, we use the “furthest point heuristic”, also known as “Maxmin” [28]. The idea is to select the first centroid randomly within the respective bounding box and then obtain new centroids one by one. In each iteration, the next centroid is the point that is the furthest (max) from its nearest (min) existing centroid. Here, we choose the first data point as the first centroid. For a comparison of several initialization heuristics, see, e.g., Fränti and Sieranoja [23].
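A compact sketch of this root-node heuristic (Maxmin initialization followed by Lloyd iterations; the function names are ours and only illustrate the procedure described above):

```python
import numpy as np

def maxmin_init(P, k):
    """Furthest-point ("Maxmin") initialization; the first data point is the first centroid."""
    centroids = [P[0]]
    for _ in range(1, k):
        dists = np.min(np.linalg.norm(P[:, None, :] - np.array(centroids)[None, :, :],
                                      axis=2), axis=1)
        centroids.append(P[np.argmax(dists)])      # furthest from its nearest centroid
    return np.array(centroids)

def k_means(P, k, max_iter=100):
    """Lloyd's algorithm started from the Maxmin initialization."""
    c = maxmin_init(P, k)
    for _ in range(max_iter):
        assign = np.argmin(np.linalg.norm(P[:, None, :] - c[None, :, :], axis=2), axis=1)
        new_c = np.array([P[assign == j].mean(axis=0) if np.any(assign == j) else c[j]
                          for j in range(k)])
        if np.allclose(new_c, c):                  # centroids no longer change
            break
        c = new_c
    return assign, c
```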

6.2 A rounding heuristic

Feasible solutions can be obtained at each node by applying a rounding scheme to the LP solution. We use the rounding heuristic proposed by Sherali and Desai [55]. For completeness, we also describe it here.

Given a non-integral LP solution \((\tilde{x}, \tilde{c})\), or \((\tilde{x}, \tilde{c}, \tilde{\eta })\) for Model (2), at a node of the branch-and-bound tree, we round the non-integral \(\tilde{x}\)-solution to the closest feasible binary solution \(\bar{x}\) while respecting the decisions that have already been made, i.e., if a data point is already assigned to a cluster, it remains in that cluster. First, we ensure that there are no empty clusters by finding, for each \(j\in [k]\) with \(P_j = \emptyset \), a point \(\bar{p} \in {{\,\mathrm{arg\,max}\,}}\left\{ \tilde{x}_{pj} : p\in P {\setminus } \bigcup _{j' \in [k]} P_{j'} \right\} \) and setting \(\bar{x}_{\bar{p} j} = 1\). To break a tie, the point with the smallest index is chosen. Then, to ensure that the point \(\bar{p}\) is only in one cluster, we set \(\bar{x}_{\bar{p} j^{'}} = 0\) for all \(j^{'} \in [k] \setminus \{j\}\).

Furthermore, for each data point \(p \in P\) such that \(\tilde{x}_{pj}\), \(j\in [k]\), is not yet rounded, we find a cluster \(j^*\) such that \(\tilde{x}_{pj^*} = \max \left\{ \tilde{x}_{pj} : j \in [k] \right\} \). Again, we break ties by selecting the cluster with the smallest index. Then, we set \(\bar{x}_{pj^*} = 1\) and \(\bar{x}_{pj} = 0\) for all \(j \in [k] {\setminus } \{j^*\}\). With \(\bar{x}\) at hand, we can then compute the centroids for each cluster, see Observation 2.1, and obtain a feasible solution for the MSSC problem.
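The rounding scheme can be sketched as follows (illustrative naming; x_tilde is the fractional LP solution and fixed_to_one encodes assignments already made by branching):

```python
import numpy as np

def round_lp_solution(x_tilde, fixed_to_one=()):
    """Round a fractional assignment to a feasible 0/1 assignment: respect fixed
    assignments, repair empty clusters first, then round every remaining point
    to its largest fractional value (ties broken by smallest index)."""
    n, k = x_tilde.shape
    x_bar = np.zeros((n, k), dtype=int)
    assigned = np.full(n, -1)
    for (p, j) in fixed_to_one:                    # decisions already made
        x_bar[p, j] = 1
        assigned[p] = j
    for j in range(k):                             # make sure no cluster stays empty
        if not np.any(assigned == j):
            free = np.where(assigned < 0)[0]
            p = free[np.argmax(x_tilde[free, j])]  # argmax breaks ties by smallest index
            x_bar[p, j] = 1
            assigned[p] = j
    for p in np.where(assigned < 0)[0]:            # round the remaining points
        j = int(np.argmax(x_tilde[p]))
        x_bar[p, j] = 1
        assigned[p] = j
    return x_bar
```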

6.3 An improvement heuristic

Given a feasible solution \((\bar{x},\bar{c})\) of Model (1) or \((\bar{x},\bar{c},\bar{\eta })\) of Model (2), we try to improve this solution by evaluating the loss function (i.e., the intra-variance) within each cluster. For that, consider the weighted value of the loss function restricted to cluster \(C_j\) as

$$\begin{aligned} F_{j} = \frac{1}{|C_j|} \sum _{p \in C_j} \Vert p -\bar{c}^j\Vert ^2. \end{aligned}$$

It may happen that some clusters have a large loss function value, while other clusters may have a very small one. Thus, we may find a better solution (regarding the sum of all losses) by splitting a cluster into two smaller clusters and joining two other clusters. This heuristic has been proposed in Burgard et al. [8], where the motivation for its development is explained in more detail.

The procedure is described as follows. For each pair of clusters \((C_{j_1}, C_{j_2})\), we compute their joint centroid and the corresponding total loss via

$$\begin{aligned} c^{j_1j_2} = \frac{ 1 }{ | C_{j_1} | + |C_{j_2}| } \sum _{p \in C_{j_1} \cup C_{j_2}} p \end{aligned}$$

and

$$\begin{aligned} F_{j_{1}j_2} = \frac{ 1 }{ |C_{j_1}| + |C_{j_2}| } \sum _{p \in C_{j_1} \cup C_{j_2}}\Vert p - c^{j_{1}j_2} \Vert ^2. \end{aligned}$$

Now, consider the set

$$\begin{aligned} \Psi {:}{=}\left\{ (C_{j_1}, C_{j_2}, C_{j_3}): F_{j_{1}j_2} < F_{j_3} \right\} , \end{aligned}$$

which is the set of all possible combinations of three clusters such that the total loss within two joined clusters is smaller than the total loss within a third cluster. Note that the set \(\Psi \) can be empty. If so, this means that we cannot obtain a better solution by joining two clusters and splitting another one. On the other hand, i.e., if there exists \((C_{j_1}, C_{j_2}, C_{j_3}) \in \Psi \), then the total loss of the joined clusters \(C_{j_1}\) and \(C_{j_2}\) is smaller than the total loss within cluster \(C_{j_3}\). Thus, we obtain a better solution by joining \(C_{j_1}\) and \(C_{j_2}\) and by splitting cluster \(C_{j_3}\) into two smaller clusters. To this end, we update the centroids in such a way that the clusters \(C_{j_1}\) and \(C_{j_2}\) are now one cluster with centroid \(c^{j_1j_2}\), cluster \(C_{j_3}\) receives two new centroids, and the other centroids remain the same, i.e.,

$$\begin{aligned} \hat{c}^{j_1} \leftarrow c^{j_1j_2}, \quad \hat{c}^{j_2} \leftarrow \tilde{c}, \quad \hat{c}^{j_3} \leftarrow \tilde{c}', \\ \hat{c}^j \leftarrow \bar{c}^j \quad \text {for all} \quad j \notin \{j_1,j_2,j_3\}, \end{aligned}$$

where \(\tilde{c}\) and \(\tilde{c}'\) are obtained as follows. First we find the two furthest points in \(C_{j_3}\) to be the initial guesses for the location of the centroids, i.e.,

$$\begin{aligned} (\tilde{c}, \tilde{c}') \in \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{p, p^{'} \in C_{j_3}} \left\{ \Vert p - p' \Vert ^2 \right\} . \end{aligned}$$

Next, each point in \(C_{j_3}\) is assigned to the closest centroid, either \(\tilde{c}\) or \(\tilde{c}'\). Then, the centroids \(\tilde{c}\) and \(\tilde{c}'\) are updated based on this assignment. Now, the update of the assignments and centroids is repeated until they do not change anymore. This way we obtain the new centroids \(\tilde{c}\) and \(\tilde{c}'\) that give us the desired splitting of cluster \(C_{j_3}\).

Finally, if the set \(\Psi \) has more than one element, then we repeat the process starting with the element \((C_{j_1}, C_{j_2}, C_{j_3})\) that gives the minimum ratio \(F_{j_{1}j_2} / F_{j_3}\). Each time an element \((C_{j_1}, C_{j_2}, C_{j_3})\) is used, we exclude all the elements that contain \(C_{j_1}\), \(C_{j_2}\), or \(C_{j_3}\), because these clusters have already been modified.

With \(\hat{c}\) at hand, we can easily compute \(\hat{x}\) and, thus, obtain a new feasible solution \((\hat{x}, \hat{c})\). If the objective function value of this new solution is better, then we have found an improvement over \((\bar{x}, \bar{c})\).
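The candidate selection of this improvement heuristic can be sketched as follows (illustrative naming); the actual split of cluster \(C_{j_3}\) is then performed by the 2-means step described above, initialized with the two furthest points of \(C_{j_3}\).

```python
import numpy as np

def average_loss(points):
    """F_j: average squared distance of the points to their barycenter."""
    center = points.mean(axis=0)
    return float(np.sum(np.linalg.norm(points - center, axis=1) ** 2) / len(points))

def join_split_candidates(clusters):
    """clusters: list of (n_j, d) arrays.  Return the triples (j1, j2, j3) with
    F_{j1 j2} < F_{j3}, sorted by increasing ratio F_{j1 j2} / F_{j3}."""
    k = len(clusters)
    F = [average_loss(C) for C in clusters]
    psi = []
    for j1 in range(k):
        for j2 in range(j1 + 1, k):
            F12 = average_loss(np.vstack([clusters[j1], clusters[j2]]))
            for j3 in range(k):
                if j3 not in (j1, j2) and F12 < F[j3]:
                    psi.append((F12 / F[j3], j1, j2, j3))
    psi.sort()                                     # smallest ratio first
    return [(j1, j2, j3) for _, j1, j2, j3 in psi]
```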

7 Symmetry breaking

Note that both Models (1) and (2) are symmetric with respect to cluster assignments. That is, once a feasible solution has been found, one can generate equivalent (symmetric) solutions by exchanging the labels of the clusters. Such symmetries are known to deteriorate the performance of search-based approaches like branch-and-bound, because symmetric subproblems are created repeatedly without providing the solver with new information. These cluster symmetries can be handled in both models by imposing additional restrictions on the x-variables. If we interpret x as a binary matrix whose columns are labeled by clusters, then we can handle symmetries by enforcing that the columns of x are sorted in lexicographically non-increasing order. Since each row of the matrix x has exactly one 1-entry due to (1b) and (2c), the lexicographic sorting can be imposed by orbitopal fixing, see Kaibel et al. [37], and by separating the symmetry handling inequalities developed by Kaibel and Pfetsch [36].

8 Numerical experiments

In this section, we report extensive computational results that show the benefits of the techniques proposed in Sects. 3–6. To this end, we have incorporated all these techniques into the state-of-the-art solver SCIP; see Gamrath et al. [26]. As a reference for comparison, we use plain SCIP for both problem formulations (1) and (2). That is, we solve the two formulations without our problem-specific enhancements but with symmetry handling enabled.

To conduct the experiments, we use different test sets from the literature, which contain both real-world as well as synthetic instances. The test sets and the general computational setup are described in Sects. 8.1 and 8.2, respectively. Then, in Sect. 8.3, we start the discussion of the numerical results for the case \(k=2\). We evaluate the benefits of each particular technique and indicate which setting performs best. Next, in Sect. 8.4, we repeat the discussion but for the case \(k=3\). Finally, in Sect. 8.5, we present results on a larger test set in order to draw solid and comprehensive conclusions about the performance of the novel techniques.

8.1 Test sets

We evaluate the impact of the presented algorithmic ideas for solving the MSSC problem using both synthetic and real-world test sets. To be able to draw conclusions on a reliable basis, we have collected all publicly available instances that have been used in the related literature for solving the MSSC problem to global optimality. Thus, to the best of our knowledge, our results are based on the largest publicly available test set for the MSSC problem consisting of realistic instances. Specifically, we use the instances that have been used in Aloise and Hansen [4], Sherali and Desai [55], as well as in Aloise et al. [5]. Since these instances come from different sources, we provide the source for every instance in Table 1. The synthetic test set has been proposed in Fränti and Sieranoja [22]. The authors show that these synthetic instances cover a wide range of classic MSSC instances. In particular, the test set contains instances with different degrees of overlap, density, and sparsity of data points.

Table 1 Information about the test sets

Note that some of the synthetic instances contain data points with very large coordinate values. In preliminary experiments, we have observed that this leads to very large big-M values in Model (2), which in turn causes numerical instabilities. To avoid numerical issues, we therefore re-scale these instances as follows. First, we shift the data points such that, in each coordinate, the minimum and maximum value have the same absolute value, i.e., the data are distributed symmetrically around the origin. Then, we re-scale the data points if they do not fit into \([-10^3, 10^3]^d\). More precisely, for each dimension \(i\in [d]\), we compute the maximum and minimum coordinate value, obtaining \(\bar{v}_i\) and \(\underline{v}_i\), respectively. Then, we take \(u_i = 0.5(\bar{v}_i + \underline{v}_i)\) and shift each data point p, obtaining \(\hat{p}_i = p_i - u_i\) for all \(i\in [d]\). If we do this for all data points, they get centered around the origin. Now, if \(\underline{w}_i = \underline{v}_i - u_i < -10^3\) or \(\bar{w}_i = \bar{v}_i - u_i > 10^3\) holds, we re-scale the data. The desired new bounds then are \(\underline{z}_i = -10^3\) and \(\bar{z}_i = 10^3\). Thus, the re-scaled data point \(\tilde{p}\) is

$$\begin{aligned} \tilde{p}_i = \frac{\hat{p}_i - \underline{w}_i}{\bar{w}_i - \underline{w}_i} \cdot (\bar{z}_i - \underline{z}_i) + \underline{z}_i, \quad i\in [d]. \end{aligned}$$

The corresponding instances that needed to be re-scaled are s1, s2, s3, s4, and unbalance.
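A sketch of this re-scaling (illustrative naming; the bound \(10^3\) is the one used above):

```python
import numpy as np

def rescale_instance(P, bound=1e3):
    """Center each coordinate of the data around the origin and re-scale it to
    [-bound, bound] if it exceeds that range."""
    P = np.asarray(P, dtype=float)
    v_min, v_max = P.min(axis=0), P.max(axis=0)
    u = 0.5 * (v_max + v_min)
    P_hat = P - u                                  # centered data points
    w_min, w_max = v_min - u, v_max - u
    for i in range(P.shape[1]):
        if w_min[i] < -bound or w_max[i] > bound:
            P_hat[:, i] = ((P_hat[:, i] - w_min[i]) / (w_max[i] - w_min[i])
                           * 2.0 * bound - bound)  # map coordinate i to [-bound, bound]
    return P_hat
```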

8.2 Computational setup

To conduct our experiments, we use SCIP 7.0.3 as a branch-and-bound framework. All LP relaxations are solved using CPLEX 12.8. Our novel techniques discussed in Sects. 3–6 are implemented as SCIP plugins written in C/C++ and our code is publicly available on GitHub (git hash 19003a37). To handle symmetries, we use the orbitope constraint handler plugin of SCIP, which implements orbitopal fixing and the symmetry handling inequalities as mentioned in Sect. 7. To compute convex hulls and cones in the convexity propagator proposed in Sect. 4.2, we use the Qhull C++ interface proposed by Barber et al. [6]. We also conducted experiments using the CDD library [24] for computing convex hulls and cones, but due to numerical instabilities therein, we decided to use Qhull. Moreover, since we observed that many of SCIP’s internal heuristics require a lot of running time without generating a feasible solution, we disabled these heuristics. A list of the disabled heuristics can be found in “Appendix A”. All computations were performed on a computer with two Intel Xeon CPU E5-2699 v4 at 2.20 GHz (\(2 \times 44\) threads) and 756 GB RAM. The time limit for all computations is 1 h per instance.

In the following, we discuss the impact of our techniques on solving the MSSC problem for \(k=2\) and \(k=3\) clusters. We only report aggregated results in the discussion and refer the reader to “Appendix B” for results per instance. The tables that we present show, for both the quadratic and the epigraph formulation, the mean number of nodes in the branch-and-bound tree (column #nodes), the mean running time per instance in seconds (time), and the number of solved instances (#solved). Instances that cannot be solved within the time limit contribute 3600 s to the mean time value. Moreover, we report the setting used, where each of the following subsections describes how the settings are encoded in the tables. All mean values of measurements \(t_1,\dots ,t_n\) are provided as shifted geometric means \(\prod _{i = 1}^n (t_i + s)^{\nicefrac {1}{n}} - s\) to reduce the impact of outliers. For times we use a shift of \(s=10\) and for node counts a shift of \(s=100\).
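For reference, the shifted geometric mean used in all tables can be computed as follows (our own helper name):

```python
import numpy as np

def shifted_geometric_mean(values, shift):
    """prod_i (t_i + s)^(1/n) - s; shift s = 10 for times and s = 100 for node counts."""
    values = np.asarray(values, dtype=float)
    return float(np.exp(np.mean(np.log(values + shift))) - shift)
```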

8.3 Discussion of the numerical results for 2 clusters

We start with the discussion of the numerical results for the case when there are 2 clusters. First, we apply plain SCIP to all the 25 instances presented in Table 1. Afterward, we gradually enable our techniques in SCIP and evaluate the benefits of each particular technique as well as the benefits of different combinations of techniques. To allow for a concise encoding, we abbreviate the different techniques as described below. Whether a technique is enabled (resp. disabled) is encoded by 1 (resp. 0) in the corresponding tables.

8.3.1 Primal heuristics

We start by evaluating the impact of primal heuristics. A summary of the obtained results is presented in Table 2, where “round.”, “impr.”, and “init” serve as abbreviations for the rounding, improvement, and root-node heuristic, respectively. Recall that the improvement heuristic is only active for \(k>2\), i.e., it has no effect in the experiments discussed next. The first row shows the results obtained by plain SCIP. It can be directly seen that the MSSC problem is extremely hard to solve. Note that SCIP is able to solve only 1 instance to global optimality, regardless of which model is used. Enabling all primal heuristics still does not allow more instances to be solved. However, we can see that the mean running time decreases in both models, where the impact is larger for the quadratic model. That is, the single instance that can be solved is solved approximately 3.9% faster using heuristics.

Table 2 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using different heuristics for 2 clusters

Let us stress that, based on preliminary experiments, the main difficulty of solving the MSSC problem to global optimality is to obtain good dual bounds in a reasonable amount of time. For this reason, the impact of heuristics on the solving process is expected to be minor in comparison to the impact of techniques that improve the dual bound. However, since we needed to disable many of SCIP’s internal heuristics as described above, we enable all our heuristics in the following experiments as their running time is low and they produce good solutions.

8.3.2 Propagators

The results of our experiments regarding propagators are summarized in Table 3, where “bary.”, “conv.”, “cone”, and “dist.” abbreviate the barycenter, convexity, cone, and distance propagator, respectively. However, we do not include the results using the distance propagator here, since in preliminary numerical experiments we observed that this propagator is not able to derive many reductions if used alone. In later experiments, we will enable it again to investigate whether it is able to improve the solution process if also other components are enabled.

The first row of results in Table 3 corresponds to the setting where only primal heuristics are enabled. It can be directly seen that as more propagators are enabled, more instances are solved. Without propagators only 1 instance is solved to global optimality. Using all our propagators, we are able to solve 8 instances with the quadratic model. Thus, the geometric ideas incorporated into the propagators are an important component to solve the MSSC problem effectively. In particular, plain SCIP is not able to make use of the simple geometric observations on its own. In the following, we discuss the benefits of each particular propagator in more detail.

Table 3 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using different propagators for 2 clusters

8.3.3 Barycenter propagator

By using the barycenter propagator and the quadratic model, many more nodes can be processed compared with the previous setting and, more importantly, in significantly less time. The reason for this is that the barycenter propagator is able to perform many reductions, which in turn simplify the LP relaxations. As a consequence, the dual bounds obtained with the quadratic model improve drastically when the barycenter propagator is used. This can be clearly seen in Fig. 2, where we plot the instances vs. the corresponding gap between the primal and dual bounds.

Fig. 2 Instance ID vs. gap (in percent, log-scale) for 2 clusters. Since the y-axis is in log-scale, instances solved to global optimality by a particular method (gap zero) do not appear in the plot. Gaps equal to or larger than \(10^4\) are capped at the limit of \(10^4\) in the plot

This already demonstrates the great benefit that the barycenter propagator adds to the solution process. As discussed in Sect. 4.1.1, the barycenter propagator is less powerful for the epigraph model than for the quadratic model, which is also reflected in the results. Nevertheless, it allows 1 more instance to be solved to global optimality and it slightly reduces the running times. Looking at Fig. 2 again, we also see that the gaps improve for many instances.

8.3.4 Convexity+Cone propagator

The convexity propagator is based on geometric ideas and is extremely powerful. Using only this propagator and the heuristics, we can already solve 3 more instances if the quadratic model is used, and 2 more instances if the epigraph model is used. Without the convexity propagator, these instances cannot be solved. This technique drastically helps the solution process of the MSSC problem. Besides allowing more instances to be solved, it also requires only half of the time that was needed before. Moreover, the number of nodes that need to be processed to solve the instances also reduces significantly.

Using cone propagation in combination with the convexity propagator, this effect is even more pronounced. It allows 1 more instance to be solved if the quadratic model is used and 2 more instances if the epigraph model is used, and it results in much lower mean running times.

8.3.5 Barycenter+Convexity+Cone propagators

Although the barycenter and convexity-cone propagators alone already improve SCIP’s performance significantly, their combination allows 3 further instances to be solved in the quadratic model. This results in a significant reduction of the running time by approximately 46%. Interestingly, the mean number of nodes in the combined setting is roughly twice as large as when only the convexity-cone propagator is used. This again shows that the reductions found by the propagators drastically simplify the structure of the relaxations, e.g., because fixed x-variables remove non-convex expressions from the quadratic model. These reductions allow SCIP to process more nodes, which in turn allows more instances to be solved. In the epigraph model, the combination of the three propagators does not qualitatively change the results. Although 1 fewer instance can be solved, the gaps improve for many instances; see Fig. 3.
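To see why such fixings simplify the quadratic model, consider—as a sketch, assuming the objective contains bilinear terms coupling a binary assignment variable \(x_{ij} \in \{0,1\}\) with the squared distance from data point \(a_i\) to the center variable \(c_j\)—a single term
\[
  x_{ij}\,\Vert a_i - c_j\Vert^2 \;=\;
  \begin{cases}
    0 & \text{if } x_{ij} \text{ is fixed to } 0,\\
    \Vert a_i - c_j\Vert^2 & \text{if } x_{ij} \text{ is fixed to } 1.
  \end{cases}
\]
In both cases, the non-convex product disappears and only a constant or a convex quadratic in \(c_j\) remains, which is why fixings found by the propagators tighten the relaxations.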

Fig. 3
figure 3

Comparison of gaps using different propagators for 2 clusters

We conclude that, for both the quadratic and the epigraph model, our propagation algorithms are an important component for solving the MSSC problem to global optimality. In particular, using combinations of these propagators creates synergies that allow more instances to be solved than with a single propagator alone, where the effect is more prominent for the quadratic model.

8.3.6 Cutting planes

Next, we evaluate the impact of cutting planes on SCIP. As before, we enable all heuristics and, due to the positive effect of the propagators, also the convexity-cone and barycenter propagators. Preliminary numerical results showed that localizing the cardinality cuts does not, in general, have a positive impact on the solution process. Therefore, we focus only on the outer-approximation (OA) cuts. Since these cuts are only applicable to the epigraph model, we concentrate only on the epigraph model in the following discussion. Table 4 summarizes our results.
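To recall the flavor of these cuts, suppose—as an illustrative assumption, not necessarily the exact form used in our implementation—that the epigraph model contains convex constraints of the type \(\eta_{ij} \ge \Vert a_i - c_j\Vert^2\) with epigraph variable \(\eta_{ij}\) and center variable \(c_j\). Since the right-hand side is convex in \(c_j\), a gradient-based OA cut at a reference point \(\bar{c}_j\) reads
\[
  \eta_{ij} \;\ge\; \Vert a_i - \bar{c}_j\Vert^2 + 2\,(\bar{c}_j - a_i)^\top (c_j - \bar{c}_j),
\]
i.e., the first-order Taylor underestimator of the quadratic at \(\bar{c}_j\), which is valid everywhere by convexity.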

Table 4 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using or not the OA cuts for 2 clusters

At first glance, it seems that OA cuts only have a minor impact on SCIP’s performance, as the number of solved instances does not change. Comparing the gaps with and without OA cuts, however, reveals a clear impact; see Fig. 4. For 8 instances, we observe a change in the gap if cuts are enabled. In 3 cases, the gaps slightly degrade when OA cuts are enabled. For the remaining 5 instances, however, OA cuts reduce the gap, in some cases drastically. Thus, although no clear trend is visible, we may conclude that OA cuts are helpful when solving MSSC problems. The effect of OA cuts is less pronounced than the effect of the propagators, which might be explained by the fact that OA cuts do not exploit the specific problem structure of the MSSC problem. In contrast, our novel propagation techniques are tailored to the MSSC problem and thus allow for stronger reductions.

Fig. 4
figure 4

Comparison of gaps using propagators with the OA cuts enabled or not for 2 clusters

8.3.7 Distance propagator

As reported above, the distance propagator alone is not able to significantly improve SCIP’s performance. For this reason, we test its effect when further components are enabled as well. From Table 5, we can see that using the distance propagator together with our other techniques has a slightly positive effect. Therefore, it is also enabled in the experiments discussed next.

Table 5 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using or not the distance propagator for 2 clusters

8.3.8 Branching rules

The last components to be tested are the branching rules. We have implemented all branching rules described in Sect. 5. Preliminary numerical experiments, however, revealed that only the entropy and distance branching rules may be beneficial for some instances. In contrast, the centrality and pairs-in-the-middle-of-pairs rules harm the solution process, leading to a less well-performing code. Therefore, in the following discussion we focus only on the branching rules that have a positive impact on some instances. We present the summary results in Table 6. By “standard” we refer to SCIP’s default branching rule.

Our experiments show that no branching rule dominates the others. On the one hand, by using the distance branching rule and the quadratic model, 1 additional instance can be solved. On the other hand, the number of nodes and the required time increase. If the epigraph model is used, then the entropy branching rule performs best: The number of solved instances remains the same but the running times are slightly lower. The overall impact of the branching rules, however, seems to heavily depend on the instance to be solved, so we cannot declare a clear winner.

Table 6 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using different branching rules for 2 clusters

8.3.9 Best setting

To conclude the discussion of the numerical results for \(k=2\), we show a comparison of plain SCIP with the best combination of the techniques proposed in this paper. The latter comprises primal heuristics, propagators, OA cuts (for the epigraph model), and the standard branching rules of SCIP, since our branching rules and the standard branching rules perform equally well on average. This comparison is shown in Fig. 5.

Fig. 5
figure 5

Running times and gaps comparison between plain SCIP and SCIP enabled with the best setting for 2 clusters

Regarding the performance of plain SCIP, we emphasize that the dual bounds found by SCIP in the quadratic model are very weak, which leads to very large gaps. In contrast, if the epigraph formulation is used, better dual bounds can be obtained with plain SCIP. The plots show that, despite their simplicity, our novel geometric ideas drastically improve the performance of SCIP, thus adding powerful methods to the toolbox for solving the MSSC problem to global optimality if \(k = 2\).

These methods work particularly well for instances with 2-dimensional data and at most 2048 data points, as almost all such instances from our test set (see Table 1) can be solved to global optimality within the time limit using the quadratic model. The only exception is instance 11, which terminates after one hour with a gap of 19.02%. For more detailed results of the best setting we refer the reader to Table 21 in Appendix B.

8.4 Discussion of the numerical results for 3 clusters

We now turn our attention to the experiments for \(k=3\). The MSSC problem is much harder to solve to global optimality in this setting. We proceed as in the last section.

8.4.1 Primal heuristics

The summary results of plain SCIP and SCIP enabled with our primal heuristics are presented in Table 7.

Table 7 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using different heuristics for 3 clusters

By using plain SCIP and the epigraph model, we can solve only 1 instance to global optimality. Enabling heuristics does not change the number of solved instances. This is in line with the observations made above: the main difficulty in solving the MSSC problem to global optimality is to provide tight dual bounds, which are not provided by primal heuristics. However, we observe that by enabling the heuristics in the epigraph model, the primal-dual gap improves substantially for many instances; see Fig. 6. For this reason, we enable heuristics in the following experiments.

8.4.2 Propagators

Next, we evaluate the impact of our propagation techniques for \(k=3\). In Table 8, we show the summarized results obtained by activating our primal heuristics and the propagators. Taking a general look at the results and comparing the first row (SCIP + heuristics) with the last row, we see that 2 more instances can be solved to global optimality, using either the quadratic or the epigraph model. Thus, although the MSSC problem for \(k=3\) is much harder to solve than for \(k=2\), the propagation techniques are still helpful; see also Fig. 7, where we compare the gaps obtained by using the propagators. In what follows, we discuss the benefits of the individual propagators in turn.

Fig. 6
figure 6

Comparison of gaps using or not the heuristics for 3 clusters

Table 8 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using different propagators for 3 clusters
Fig. 7
figure 7

Comparison of gaps using different propagators for 3 clusters

8.4.3 Barycenter propagator

By enabling only the barycenter propagator, we see that more nodes can be explored in the branch-and-bound tree. In particular, when the quadratic model is used, there is a huge difference in the number of nodes when comparing the first two rows of Table 8. Moreover, the barycenter propagator allows one more instance to be solved that could not be solved before by the quadratic model, resulting in a much lower mean running time. Finally, although the number of solved instances increases only slightly, we can see that the barycenter propagator also has a very positive effect on the other instances: the primal-dual gap improves significantly, in particular for the quadratic model; see Fig. 7.

Besides the improvement in the gaps for the epigraph model, the barycenter propagator also allows more nodes to be explored while reducing the running times considerably. Therefore, also for \(k=3\), the barycenter propagator helps to simplify the relaxations used in branch-and-bound.

8.4.4 Convexity+Cone propagator

If we use only the convexity propagator and compare the results with those of plain SCIP with heuristics enabled, then we see that more nodes can be processed in significantly less time. Additionally enabling the cone propagator allows two more instances to be solved in the epigraph model; see Fig. 7 again. Regarding the quadratic model, using the convexity and cone propagators allows more nodes to be processed in the branch-and-bound tree. This, however, comes at the price that the solvable instance requires more time. We conclude that the convexity and cone propagators enhance the solution process, where the effect is more dominant for the epigraph model.

8.4.5 Barycenter+Convexity+Cone propagators

The quadratic model clearly benefits from using the barycenter propagator together with convexity and cone propagation, as one more instance can be solved and in less time. For the epigraph model, no significant change regarding running times can be observed; however, the combination has a positive effect on some gaps; see Fig. 7.

8.4.6 Cutting planes

Regarding the cutting planes, we again focus only on the OA cuts, since the localized version of the cardinality cuts does not positively affect the solution process in general. In Table 9, we present the aggregated results for the epigraph model with and without OA cuts. Note that by using the OA cuts we can solve the same number of instances while requiring less time and far fewer nodes. The OA cuts help mainly in terms of dual bounds; see also Fig. 8. Therefore, we conclude that the OA cuts are also beneficial for the epigraph model in the harder case of \(k=3\).

Table 9 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using OA cuts for 3 clusters
Fig. 8
figure 8

Comparison of gaps using different propagators and using or not the OA cuts for 3 clusters

8.4.7 Distance propagator

The distance propagator does not impact the solution process if used standalone. The reason is that, for most of the instances, the propagator does not find any reduction, which is to be expected because fixing x-variables to 0 is less powerful than fixing them to 1 (as the convexity propagator does, for instance). However, as more reductions and fixings are made by other components, the distance propagator starts to help slightly; see the summary results shown in Table 10.

Table 10 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using or not the distance propagator for 3 clusters

The distance propagator improves the solution process of both models, since more nodes can be explored in less time while the same number of instances is solved. Therefore, we enable this propagator as well in the following experiments.

8.4.8 Branching rules

Finally, we report on the effect of the branching rules. Again, the centrality and pairs-in-the-middle-of-pairs rules hinder the solution process. Therefore, we focus only on the entropy and distance branching rules, whose results are summarized in Table 11. Overall, the branching rules do not change the results substantially. In Fig. 9, we see that the entropy branching rule has a positive effect on the gaps of four instances for the quadratic model, but harms the solution process for two other instances. Moreover, it requires more time. For the epigraph model, it clearly performs worse. The standard and distance branching rules perform equally well for the quadratic model; see Fig. 9 again. For the epigraph model, the standard branching rule performs best.

Table 11 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using different branching rules for 3 clusters
Fig. 9
figure 9

Comparison of gaps using different branching rules for 3 clusters

8.4.9 Best setting

To finalize the discussion of the numerical results for the case \(k=3\), we again compare plain SCIP with the best setting so far. The best setting comprises the primal heuristics, the four propagators, the OA cuts (for the epigraph model), and the standard branching rules that are part of SCIP, all enabled. The comparison is presented in Fig. 10.

Fig. 10
figure 10

Running times and gaps comparison between plain SCIP and SCIP enabled with the best setting for 3 clusters

As for the case \(k=2\), the dual bounds obtained with the epigraph model are better than the ones obtained with the quadratic model if plain SCIP is used. Although not many instances can be solved using our techniques, the improvement that they bring to the solution process is significant in terms of primal-dual gaps. The instances solved to global optimality within the time limit are the smallest instances in our test set in terms of data points, i.e., instances 2 and 3 can be solved by the quadratic model and the epigraph model additionally solves instance 12; see Table 1 for details about these instances and Table 31 in the Appendix for detailed results per instance.

8.5 Discussion of the numerical results for samples of instances

The results discussed in the last two sections indicate that our techniques are highly beneficial for SCIP when solving the MSSC problem with a number of data points that is not too large. On the tested instances, this roughly means \(n < 1000\) and \(d=2\). To further support this hypothesis, we conducted experiments on a broader test set of smaller instances. We formed 10 new sub-test sets by extracting samples of sizes \(\{100, 200, \ldots , 1000\}\) from the two-dimensional and large instances a1, a2, a3, s1, s2, s3, s4. Each sub-test set comprises 7 instances, which yields 70 additional instances in total. To sample the data, we use the Python routine skopt.sampler.Sobol; for Sobol sequences, see Sobol’ [56]. In Fig. 11, we show two examples of these samples (or sub-instances); a sketch of the sampling step is given after the figure. The blue points represent the original data points, while the red crosses represent the obtained sample. We believe that these samples extract meaningful information from the larger instances, as the samples look similar to the full instances.

Fig. 11
figure 11

Example of samples extracted from the s1 instance. The sample sizes are: 200 (left) and 500 (right). The blue points represent the original data points, while the red crosses represent the sampled data points
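The following minimal Python sketch illustrates how such a sub-instance can be generated. Only the use of skopt.sampler.Sobol is taken from the description above; the helper name sample_subinstance and the nearest-neighbor mapping from Sobol points back to original data points are illustrative assumptions rather than the exact procedure used in our experiments.

import numpy as np
from skopt.sampler import Sobol  # scikit-optimize

def sample_subinstance(points, n_samples, seed=0):
    """Illustrative sketch: draw a Sobol sample in the bounding box of the
    data and map each Sobol point to its nearest original data point."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    box = [(float(l), float(u)) for l, u in zip(lo, hi)]
    sobol_pts = np.asarray(Sobol().generate(box, n_samples, random_state=seed))
    # Nearest original data point for every Sobol point (Euclidean distance).
    dists = np.linalg.norm(points[None, :, :] - sobol_pts[:, None, :], axis=2)
    idx = np.unique(dists.argmin(axis=1))  # duplicate hits are dropped
    return points[idx]

# Example: a 200-point sub-instance from synthetic 2-dimensional data.
rng = np.random.default_rng(42)
data = rng.normal(size=(5000, 2))
sample = sample_subinstance(data, n_samples=200)
print(sample.shape)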

In the following, we investigate how our novel methods scale with increasing problem size. We focus on the case \(k=2\), because the instances for \(k=3\) are still very challenging to solve and drawing reliable conclusions is difficult.

8.5.1 Primal heuristics and propagators

Table 12 shows aggregated results obtained by enabling our primal heuristics and propagators in SCIP. We can immediately see that SCIP’s performance clearly improves as more of our components are enabled: more instances can be solved and the running times decrease drastically. In particular, enabling all our techniques allows us to solve, with the quadratic model, all 7 instances per sub-test set for up to 500 data points. Using the epigraph model, all instances with up to 900 data points can be solved to global optimality within the time limit. Moreover, for both models, we can solve almost all instances with 1000 data points if all our methods are enabled. Without our techniques, SCIP cannot solve a single instance, even if the number of data points is only 100.

Table 12 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using different propagators for the sampled instances and 2 clusters

8.5.2 Cutting planes

We again focus only on the OA cuts and on the epigraph model. The summarized results are presented in Table 13. Interestingly, the OA cuts do more harm than good in this experiment. We thus conclude that, since these cuts are not based on problem-specific clustering ideas, they do not help as much as the propagators do for solving the MSSC problem effectively.

Table 13 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using or not the OA cuts for the epigraph model

8.5.3 Branching rules

Now we evaluate the performance of our branching rules. Again, we focus only on the two branching rules that yield some improvement in the solution process of the MSSC problem. The comparison results are displayed in Table 14.

Table 14 Comparison of mean number of nodes, mean running time per instance (in seconds), and number of solved instances using different branching rules

The entropy branching rule performs better if the epigraph model is used, whereas both the distance and the standard branching rules of SCIP perform equally well if the quadratic model is used.

8.5.4 Best setting

As in the previous sections, we conclude the analysis by comparing plain SCIP with the best setting; see Fig. 12. One can clearly see the significant improvement achieved by using our techniques. Moreover, it is also visible that the instances become harder as their sizes grow. Therefore, if the MSSC problem at hand is not too large, i.e., the number of data points \(n\) is around 1000, the dimension \(d\) is 2, and the number of clusters \(k\) is 2, then our techniques can be used to efficiently solve the problem to global optimality in just a few seconds. However, it is important to note that this holds only for the instances we consider, because the difficulty of an MSSC problem does not solely depend on the size of the instance but also on the structure of the given data points.

Fig. 12
figure 12

Running times and gaps comparison between plain SCIP and SCIP enabled with the best setting for the sampled instances and for 2 clusters

From this experiment, we also conclude that, in general, the quadratic model performs better regarding running times, whereas the epigraph model performs better regarding the dual bounds, which in turn leads to more instances being solved.

9 Conclusion

Solving the MSSC problem to global optimality is a very challenging task that has already received considerable attention in the literature. Nevertheless, the problem is far from being “practically solved”. In this paper, we propose different techniques (including propagation, cutting planes, branching rules, and primal heuristics) that can be incorporated in a branch-and-bound framework for solving the problem. Our extensive numerical study shows that these novel techniques significantly help to improve the solution process. On the one hand, we can now solve instances that have not been solvable before. On the other hand, the optimality gaps for those instances that remain unsolved are significantly reduced.

Not surprisingly, there are still some ideas left for future research. Let us sketch two of them. First, we have shown that our techniques can be used to globally solve instances of moderate size. Thus, our methods could also be used within solution approaches for the MSSC problem that rely on reducing the dimension or the size of the originally given problem; see, e.g., Hua et al. [35]. Second, there exist variants of the MSSC problem with additional side constraints, as discussed in, e.g., Liberti and Manca [38]. Such side constraints allow for solution techniques that are feasibility-based, whereas all our techniques are optimality-based. Hence, a combination of both could yield an overall branch-and-bound framework that is even more effective for side-constrained MSSC problems.