Mixed-Integer Quadratic Optimization and Iterative Clustering Techniques for Semi-Supervised Support Vector Machines

Among the most famous algorithms for solving classification problems are support vector machines (SVMs), which find a separating hyperplane for a set of labeled data points. In some applications, however, labels are only available for a subset of points. Furthermore, this subset can be non-representative, e.g., due to self-selection in a survey. Semi-supervised SVMs tackle the setting of labeled and unlabeled data and can often improve the reliability of the results. Moreover, additional information about the size of the classes can be available from undisclosed sources. We propose a mixed-integer quadratic optimization (MIQP) model that covers the setting of labeled and unlabeled data points as well as the overall number of points in each class. Since the MIQP's solution time rapidly grows as the number of variables increases, we introduce an iterative clustering approach to reduce the model's size. Moreover, we present an update rule for the required big-$M$ values, prove the correctness of the iterative clustering method, and derive tailored dimension-reduction and warm-starting techniques. Our numerical results show that our approach leads to a similar accuracy and precision as the MIQP formulation but at much lower computational cost. Thus, we can solve larger problems. With respect to the original SVM formulation, we observe that our approach has even better accuracy and precision for biased samples.


Introduction
Support vector machines (SVMs) are a standard approach for supervised binary classification (Boser et al. 1992; Cortes and Vapnik 1995). The core idea is to find a separating hyperplane that optimally splits the feature space into a positive and a negative side according to the positive and negative labels of the data.
Obtaining labels for all units of interest can be costly. This is especially the case if one has to conduct a classic survey to obtain the labels. In this case, it would be favorable to train the SVM on only partly labeled data. This yields a semi-supervised learning setting. Bennett and Demiriz (1998) formulate and solve the semi-supervised SVM (S³VM) as a mixed-integer linear problem (MILP). Many strategies for solving the S³VM have been proposed in the following decades, such as the transductive approach (TSVM) by Joachims (2002) and Xu Yu (2012) or manifold regularization (LapSVM) by Belkin et al. (2006) and Melacci and Belkin (2009). Some researchers also consider a balancing constraint, as done in meanS³VM by Kontonatsios et al. (2017) and in c³SVM by Chapelle et al. (2006). Moreover, the balancing constraint proposed by Chapelle and Zien (2005) enforces that the proportion of unlabeled and labeled data on both sides is similar to the proportion given by the labeled data.
In many cases, however, aggregated information about the number of positive and negative cases in a population is known from an external source. For example, in population surveys, there are population figures from official statistics agencies. This setting is studied, e.g., by Burgard et al. (2021), who develop a cardinality-constrained multinomial logit model and apply it in the context of micro-simulations. As another example, in some businesses, the total number of positive labels could be known but not which customer has a positive or a negative label. An intuitive example is a supermarket for which the amount of cash payments is known. However, this information is not ex-post attributable to the individual customers. We propose to add this aggregated additional information to the optimization model by imposing a cardinality constraint on the predicted labels for the unlabeled data. As will be shown in our numerical experiments, this improves the accuracy of the classification of the unlabeled data. Furthermore, the inclusion of such a cardinality constraint is very useful in the case in which the labeled data is not a representative sample from the population. When obtaining the labels from process data or from online surveys, the inclusion process of the labeled data is generally not known. This setting is subsumed under the term non-probability sampling. In this case, inverse inclusion probability weighting, as typically done in survey sampling, is not applicable. If the inclusion process is not controlled, strong over- or under-coverage of relevant information in the data set is possible and should be taken into account in the analysis. Not accounting for possible biases in the data generally leads to biased results.
We propose a big-$M$-based MIQP to solve the semi-supervised SVM problem with a cardinality constraint for the unlabeled data. Here, we restrict ourselves to the linear kernel. Other kernels such as Gaussian and polynomial ones can, in principle, be used as well. However, this would lead to additional nonlinear constraints in our mixed-integer model and would thus significantly increase the computational challenge of solving the problem. Although we strongly suspect that the problem is NP-hard, we have no proof for it since we focus here on solution techniques and not on a formal complexity analysis of the problem. The cardinality constraint helps to account for biased samples since the number of positive predictions on the population is bounded by the constraint. The computation time for this MIQP grows rapidly with the number of variables, especially for an increasing number of integer variables. We develop an algorithm that uses a clustering-based model reduction to decrease the computation time. Similar reduction approaches can be found for the classic SVM using, e.g., fuzzy clustering (Almasi and Rouhani 2016; Cervantes et al. 2006), clustering-based convex hulls (Birzhandi and Youn 2019), and k-means clustering (Almeida et al. 2000; Yao et al. 2013). We prove the correctness of our iterative clustering method and further show that it computes feasible points for the original problem. Hence, it also delivers proper upper bounds. Within our iterative approach, we additionally derive a scheme for updating the required big-$M$ values and present tailored dimension-reduction as well as warm-starting techniques.
The paper is organized as follows. In Section 2, we describe our optimization problem and the big-$M$-based MIQP formulation. Afterward, the clustering-based model reduction technique is presented in Section 3. There, we also present our algorithm that combines the model reduction and the MIQP formulation. In Section 4, we discuss some algorithmic improvements such as the handling of data points that are far away from the hyperplane and the choice of $M$ in the big-$M$ formulation. In Section 5, we present how to use the solution of our algorithm to obtain the solution of the initial MIQP formulation by fixing some points on the correct side of the hyperplane. Finally, numerical results are reported and discussed in Section 6 and we conclude in Section 7.
An MIQP Formulation for a Cardinality-Constrained Semi-Supervised SVM

Let $X \in \mathbb{R}^{d \times N}$ be the data matrix with $X_l = [x_1, \dots, x_n]$ being the labeled data and $X_u = [x_{n+1}, \dots, x_N]$ being the unlabeled data. Hence, we have $x_i \in \mathbb{R}^d$ for all $i \in [1, N] := \{1, \dots, N\}$. We set $m := N - n$ and $y \in \{-1, 1\}^n$ is the vector of class labels for the labeled data. When the data is linearly separable, the SVM provides a hyperplane $(\omega, b)$ that separates the positively and negatively labeled data. In the case that the data is not linearly separable, the standard approach is to use the $\ell_2$-SVM by Cortes and Vapnik (1995) given by
$$\min_{\omega, b, \xi} \ \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{n} \xi_i^2 \quad \text{s.t.} \quad y_i(\omega^\top x_i + b) \geq 1 - \xi_i, \ i \in [1, n]. \tag{P1}$$
Here and in what follows, $\|\cdot\|$ denotes the Euclidean norm. However, other norms such as the 1- or the max-norm could be used as well. For being able to include unlabeled data in the optimization process, Bennett and Demiriz (1998) propose the semi-supervised SVM (S³VM). In many applications, aggregated information on the labels is available, e.g., from census data. In the following, we assume that the total number $\tau$ of positive labels for the unlabeled data is known from an external source.
We adapt the idea of the S³VM such that we can use $\tau$ as additional information in the optimization model. Our goal is to find optimal parameters $\omega^* \in \mathbb{R}^d$, $b^* \in \mathbb{R}$, $\xi^* \in \mathbb{R}^n$, and $\eta^* \in \mathbb{R}^2$ that solve the optimization problem
$$\min_{\omega, b, \xi, \eta} \ \frac{1}{2}\|\omega\|^2 + C_1 \sum_{i=1}^{n} \xi_i^2 + C_2(\eta_1 + \eta_2) \tag{P2a}$$
$$\text{s.t.} \quad y_i(\omega^\top x_i + b) \geq 1 - \xi_i, \ i \in [1, n], \tag{P2b}$$
$$\sum_{i=n+1}^{N} h_{\omega,b}(x_i) + \eta_1 - \eta_2 = \tau, \tag{P2c}$$
$$\xi \geq 0, \ \eta \geq 0, \tag{P2d}$$
with
$$h_{\omega,b}(x) = \begin{cases} 1, & \omega^\top x + b \geq 0, \\ 0, & \text{otherwise.} \end{cases}$$
Note that the objective function in (P2a) is a compromise between maximizing the distance between the two classes and minimizing the classification error for the labeled and the unlabeled data. The penalty parameters $C_1 > 0$ and $C_2 > 0$ control the importance of the slack variables $\xi$ and $\eta$, respectively. Constraint (P2b) enforces on which side of the hyperplane the labeled data $x_i$ should lie. Constraint (P2c) ensures that we have $\tau$ unlabeled data points on the positive side. If $\eta_1^* > 0$ holds for a solution $(\omega^*, b^*, \xi^*, \eta^*)$, then fewer than $\tau$ unlabeled points are classified as positive. On the other hand, if $\eta_2^* > 0$ holds, more than $\tau$ unlabeled points are classified as positive. If $\eta_1^* = \eta_2^* = 0$ holds, exactly $\tau$ unlabeled points are classified as positive. Note that, if a very high value is assigned to $C_1$ or $C_2$, the objective function value is dominated by these slack variables.

The function $h_{\omega,b}(\cdot)$ in Constraint (P2c) is not continuous, which means that Problem (P2) cannot be easily solved by standard solvers. A typical way to overcome this issue is to add binary variables that turn the enforcement of a constraint on or off. By introducing binary variables $z_i \in \{0, 1\}$, $i \in [n+1, N]$, we can reformulate Problem (P2) using the following big-$M$ formulation:
$$\min_{\omega, b, \xi, \eta, z} \ \frac{1}{2}\|\omega\|^2 + C_1 \sum_{i=1}^{n} \xi_i^2 + C_2(\eta_1 + \eta_2) \tag{P3a}$$
$$\text{s.t.} \quad y_i(\omega^\top x_i + b) \geq 1 - \xi_i, \ i \in [1, n], \tag{P3b}$$
$$\omega^\top x_i + b \geq -M(1 - z_i), \ i \in [n+1, N], \tag{P3c}$$
$$\omega^\top x_i + b \leq M z_i, \ i \in [n+1, N], \tag{P3d}$$
$$\sum_{i=n+1}^{N} z_i + \eta_1 - \eta_2 = \tau, \tag{P3e}$$
$$\xi \geq 0, \tag{P3f}$$
$$\eta \geq 0, \tag{P3g}$$
$$z_i \in \{0, 1\}, \ i \in [n+1, N], \tag{P3h}$$
where $M$ needs to be chosen sufficiently large. As $z_i$ is binary, Constraints (P3c) and (P3d) enforce $\omega^\top x_i + b \geq 0$ if $z_i = 1$ and $\omega^\top x_i + b \leq 0$ if $z_i = 0$. If $x_i$ lies on the hyperplane, i.e., $\omega^\top x_i + b = 0$, Constraints (P3c) and (P3d) hold both for $z_i = 1$ and $z_i = 0$. In this case, the point can be counted either on the positive or on the negative side. For this reason, Problem (P3) is not formally equivalent to Problem (P2). Reformulation (P3) is a mixed-integer quadratic problem (MIQP) in which all constraints are linear but the objective function is quadratic. We refer to this problem as CS³VM.
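To make the role of the slack variables $\eta_1$ and $\eta_2$ in the cardinality constraint concrete, the following minimal Python sketch (our naming; not the Julia implementation used in Section 6) counts the unlabeled points predicted positive by a hyperplane and derives the slacks.

```python
import numpy as np

def cardinality_slacks(w, b, X_u, tau):
    """Count the unlabeled points classified as positive by the
    hyperplane (w, b) and return the slacks (eta1, eta2) of the
    cardinality constraint: positives + eta1 - eta2 = tau.

    Points exactly on the hyperplane are counted as positive here;
    in the MIQP, their side is decided by the binary variable z_i.
    """
    scores = X_u @ w + b               # signed evaluation of each point
    positives = int(np.sum(scores >= 0.0))
    eta1 = max(tau - positives, 0)     # too few points predicted positive
    eta2 = max(positives - tau, 0)     # too many points predicted positive
    return positives, eta1, eta2
```

Both slacks are zero exactly when the prediction matches the external count $\tau$.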
Having stated our first model, let us shed some light on the results depending on whether the standard SVM or CS³VM is used. Figure 1 shows a two-dimensional example data set and the corresponding hyperplanes for SVM and CS³VM. In this case, $\tau = 11$, i.e., 11 unlabeled points belong to the positive class. Note that SVM only classifies 6 unlabeled points as positive, while CS³VM classifies 11 as such. The point that lies on the CS³VM hyperplane is classified as positive because the binary variable corresponding to this point is 1. This example shows that using $\tau$ as additional information can improve the classification of unlabeled points.
In the big-$M$ formulation, the choice of $M$ is crucial. If $M$ is too small, the problem can become infeasible or optimal solutions can be cut off. If $M$ is chosen too large, the respective continuous relaxations usually lead to bad lower bounds and solvers may encounter numerical troubles. The choice of $M$ is discussed in the following lemma and theorem. In Lemma 1, we show how $M$ is related to the objective function and the given data. This is then used in Theorem 2 to derive a provably correct big-$M$.

Lemma 1. Given a feasible point for Problem (P3) with objective function value $\bar{f}$, an optimal solution $(\omega^*, b^*, \xi^*, \eta^*, z^*)$ of (P3) satisfies
$$\|\omega^*\| \leq \sqrt{2\bar{f}} \quad \text{and} \quad |b^*| \leq \|\omega^*\| \max_{i \in [1,N]} \|x_i\| + 1$$
and, consequently, every optimal solution satisfies (P3c) and (P3d) for
$$M = 2\sqrt{2\bar{f}} \max_{i \in [1,N]} \|x_i\| + 1.$$

Proof. Due to optimality, we get $\frac{1}{2}\|\omega^*\|^2 \leq \bar{f}$ and, thus, $\|\omega^*\| \leq \sqrt{2\bar{f}}$. The second inequality is shown by contradiction. To this end, we w.l.o.g. assume that $\tilde{b} = \|\omega^*\| \max_{i \in [1,N]} \|x_i\| + 1 + \delta$ is part of an optimal solution for some $\delta > 0$. Using the inequality of Cauchy-Schwarz then yields
$$(\omega^*)^\top x_i + \tilde{b} \geq \tilde{b} - \|\omega^*\| \|x_i\| \geq 1 + \delta > 1 \quad \text{for all } i \in [1, N].$$
Hence, for all $i \in [1, n]$ with $y_i = 1$, we get $\tilde{\xi}_i = 0$ from Constraint (P3b) and the objective function. Moreover, for $i \in [1, n]$ with $y_i = -1$, the same reasoning implies $\tilde{\xi}_i \geq 2 + \delta$. Besides that, for the unlabeled data $i \in [n+1, N]$, Constraint (P3d) enforces $\tilde{z}_i = 1$, so that $\tilde{\eta}_1 = 0$ and $\tilde{\eta}_2 = m - \tau$ hold. This means that the objective function value for the point $(\omega^*, \tilde{b}, \tilde{\xi}, \tilde{\eta}, \tilde{z})$ is bounded from below by
$$\frac{1}{2}\|\omega^*\|^2 + C_1 \sum_{i:\, y_i = -1} (2 + \delta)^2 + C_2 (m - \tau).$$
However, if we set $\hat{b} := \|\omega^*\| \max_{i \in [1,N]} \|x_i\| + 1$, we still get $(\omega^*)^\top x_i + \hat{b} \geq 1$ for all $i \in [1, N]$, i.e., $\hat{z}_i = 1$ for all $i \in [n+1, N]$, $\hat{\eta}_1 = 0$, $\hat{\eta}_2 = m - \tau$, and $\hat{\xi}_i = 0$ for $i$ with $y_i = 1$. Moreover, for $i \in [1, n]$ with $y_i = -1$, from Constraint (P3b) we obtain $\hat{\xi}_i \leq \tilde{\xi}_i - \delta$. All this implies that the objective function value for the point $(\omega^*, \hat{b}, \hat{\xi}, \hat{\eta}, \hat{z})$ is strictly smaller, which contradicts the assumption that $\tilde{b}$ is part of an optimal solution. Hence, $|b^*| \leq \|\omega^*\| \max_{i \in [1,N]} \|x_i\| + 1$ holds, which proves the second inequality. Note further that, again by Cauchy-Schwarz, every optimal solution satisfies
$$|(\omega^*)^\top x_i + b^*| \leq \|\omega^*\| \|x_i\| + |b^*| \leq 2\sqrt{2\bar{f}} \max_{i \in [1,N]} \|x_i\| + 1 \quad \text{for all } i \in [1, N],$$
so (P3c) and (P3d) hold for the stated $M$. □

We now use the result from the last technical lemma to obtain a provably correct big-$M$.
Theorem 2. A valid big-$M$ for Problem (P3) is given by
$$M = 2\sqrt{2\bar{f}} \max_{i \in [1,N]} \|x_i\| + 1 \quad \text{with} \quad \bar{f} = 4 C_1 |\{i \in [1,n] : y_i = -1\}| + C_2 (m - \tau). \tag{1}$$

Proof. Consider the feasible point of (P3) given by $\omega = 0 \in \mathbb{R}^d$ and $b = 1$. Since $\omega^\top x_i + b = 1$ holds for all $i \in [1, N]$, Constraint (P3b) implies $\xi_i = 0$ for all $i$ with $y_i = 1$ and $\xi_i = 2$ for all $i$ with $y_i = -1$. Moreover, using Constraints (P3c)-(P3e) leads to $z_i = 1$ for all $i \in [n+1, N]$, $\eta_1 = 0$, and $\eta_2 = m - \tau$, which implies that the objective function value for the point $(\omega, b, \xi, \eta, z)$ is given by $\bar{f}$ as in (1). Finally, from Lemma 1, we get that $M$ as in (1) is valid. □

Clustering-Based Model Reduction

In Model (P3) of the last section, each binary variable is related to an unlabeled point. The larger the number of unlabeled data points, the larger the number of binary variables and, hence, the larger the computational burden of solving Problem (P3). To reduce this computational burden, we propose to cluster the unlabeled data. This way, only one binary variable per cluster is needed. For every cluster, we use its centroid as its representative point. To obtain clusterings, we use minimum sum-of-squares clustering (MSSC). The MSSC problem is NP-hard; see, e.g., Aloise et al. (2009), Dasgupta (2007), and Mahajan et al. (2012). However, we do not need a globally optimal solution of the MSSC problem, as will be shown below. Given a number $k$ of clusters and a matrix $S = [s_1, \dots, s_p] \in \mathbb{R}^{d \times p}$ of given points, the goal of the MSSC is to find mean vectors $c_1, \dots, c_k$ that minimize
$$\sum_{j=1}^{k} \sum_{s_i \in C_j} \|s_i - c_j\|^2$$
with $C_j \subset \mathbb{R}^d$ being the set of data points that are assigned to cluster $j$.
We solve this problem heuristically using the k-means algorithm (Lloyd 1982; MacQueen 1967) for $S = X_u$, i.e., we cluster the unlabeled data. Then, instead of using all unlabeled data as in the last section, we only use the clusters' centroids $c_1, \dots, c_k$ and the numbers $e_1, \dots, e_k$ of data points in each cluster to obtain the problem
$$\min_{\omega, b, \xi, \eta, z} \ \frac{1}{2}\|\omega\|^2 + C_1 \sum_{i=1}^{n} \xi_i^2 + C_2(\eta_1 + \eta_2) \tag{P4a}$$
$$\text{s.t.} \quad y_i(\omega^\top x_i + b) \geq 1 - \xi_i, \ i \in [1, n], \tag{P4b}$$
$$\omega^\top c_j + b \geq -M(1 - z_j), \ j \in [1, k], \tag{P4c}$$
$$\omega^\top c_j + b \leq M z_j, \ j \in [1, k], \tag{P4d}$$
$$\sum_{j=1}^{k} e_j z_j + \eta_1 - \eta_2 = \tau, \tag{P4e}$$
$$\xi \geq 0, \ \eta \geq 0, \ z_j \in \{0, 1\}, \ j \in [1, k]. \tag{P4f}$$
A valid big-$M$ is still given by (1), as shown in the next proposition.
Proof. The proof follows the same lines as the proofs of Lemma 1 and Theorem 2 with the additional observation that, for all $j \in [1, k]$, the centroid $c_j$ is a convex combination of data points and, hence, $\|c_j\| \leq \max_{i \in [1,N]} \|x_i\|$ holds. □

It can happen that the hyperplane given by $(\omega^*, b^*)$ that results from the solution of Problem (P4) cuts through some cluster. This means that not all data points of the cluster actually lie on the same side of the hyperplane. If this happens, the solution of Problem (P4) does not satisfy the cardinality constraint (P3e) of Problem (P3). To fix this, we propose an iterative method that is formally listed in Algorithm 1. Note that the use of the k-means algorithm is helpful here as it automatically provides the convex hulls of the clusters. Hence, it is easy to check whether the hyperplane cuts through some cluster or not.
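For illustration, the initial clustering in Step 1 of Algorithm 1 can be sketched with a plain Lloyd iteration in Python (our naming; any k-means implementation can be substituted). It returns the centroids $c_j$ and the cluster sizes $e_j$ that enter Problem (P4).

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd's algorithm on the rows of X (points x features).
    Returns the centroids c_1..c_k, the cluster sizes e_1..e_k, and
    the point-to-cluster assignment."""
    rng = np.random.default_rng(seed)
    # initialize centroids with k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centroids as cluster means (keep empty clusters fixed)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    sizes = np.bincount(labels, minlength=k)
    return centroids, sizes, labels
```

The sizes $e_j$ are exactly the weights of the binary variables in the cardinality constraint (P4e).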
If Algorithm 1 terminates, all points in a cluster lie on the same side of the final hyperplane. This implies that the cardinality constraint (P3e) is satisfied. Note that the k-means algorithm is only called once to initialize the clustering. For all other iterations, we manually split clusters if they are cut by the hyperplane of the respective iteration and compute the new centroids directly.
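This manual splitting step can be sketched as follows (illustrative Python, our naming): the points of a cut cluster are divided by the sign of $\omega^\top x + b$ and the two new centroids are the means of the parts, so that neither new cluster is cut.

```python
import numpy as np

def split_cut_cluster(points, w, b):
    """If the hyperplane (w, b) cuts the cluster, split it into its
    positive-side and negative-side parts and return the new
    (centroid, size, points) triples; otherwise return the cluster
    unchanged."""
    scores = points @ w + b
    pos, neg = points[scores >= 0.0], points[scores < 0.0]
    if len(pos) == 0 or len(neg) == 0:
        # cluster is not cut: keep it as is
        return [(points.mean(axis=0), len(points), points)]
    return [(pos.mean(axis=0), len(pos), pos),
            (neg.mean(axis=0), len(neg), neg)]
```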
Algorithm 1: Re-Clustering Method (RCM)
1 Set t = 1, compute M as in (1), compute a clustering of X_u in k_1 many clusters using the k-means algorithm, and obtain the centroids c_1, ..., c_{k_1} as well as the numbers e_1, ..., e_{k_1} of data points in each cluster.
2 Solve Problem (P4) to compute the hyperplane (ω^t, b^t) as well as ξ^t, η^t, z^t.
3 if at least one cluster is cut by the hyperplane (ω^t, b^t) then
4   Determine all clusters that are cut by the hyperplane (ω^t, b^t).
5   for each cluster that is cut by the hyperplane (ω^t, b^t) do
6     Split the cluster into two new clusters so that neither of the two new clusters is cut by the hyperplane (ω^t, b^t).
7     Update the centroids of the newly created clusters.
8   Set t ← t + 1 and go to Step 2.
9 else
10   Return the hyperplane (ω^t, b^t) as well as ξ^t, η^t, z^t.
end

The next theorem establishes that Algorithm 1 always terminates after finitely many iterations.
Theorem 3. Suppose that $e_j \geq 1$ for all $j \in [1, k_1]$ after Step 1 of Algorithm 1. Then, Algorithm 1 terminates after at most $m - k_1$ iterations, where $m$ is the number of unlabeled data points and $k_1$ is the number of initial clusters.
Proof. Observe that, since we cluster $m$ unlabeled points, the maximum number of clusters we can obtain is $m$. Besides that, if Algorithm 1 does not terminate in an iteration $t$, at least one cluster is split in Step 6. Because we start with $k_1$ clusters and increase the number of clusters by at least one in each iteration, the maximum number of iterations is $m - k_1$. □

Note that the point obtained by Algorithm 1 is not necessarily a minimizer of Problem (P3). However, the objective function value of the point obtained by Algorithm 1 is an upper bound for the optimal objective function value of Problem (P3).
Proof. For all clusters $C_j$, $j \in \{1, \dots, k_t\}$, where $t$ is the final iteration of Algorithm 1, we set $\bar{z}_i = \hat{z}_j$ for all $i$ with $x_i \in C_j$. We now show that $(\bar{\omega}, \bar{b}, \bar{\xi}, \bar{\eta}, \bar{z})$ is a feasible point for Problem (P3). Indeed, Constraints (P3b), (P3f), (P3g), and (P3h) are clearly fulfilled. Furthermore, since $\sum_{i \in C_j} \bar{z}_i = e_j \hat{z}_j$ holds for all $j \in [1, k_t]$, using (P4e) we see that Constraint (P3e) is satisfied as well. Besides that, $\|c_j\| \leq \max_{i \in [1,N]} \|x_i\|$ holds and, as in Lemma 1, we get that $M$ is large enough. Moreover, by construction, for all $i \in \{n+1, \dots, N\}$ with $\bar{z}_i = 1$, $x_i$ belongs to a cluster $C_j$ with $\bar{\omega}^\top c_j + \bar{b} \geq 0$. Using the fact that all points in $C_j$ are on the same side of the hyperplane, this side must be the positive one. This fact together with (2) and (3) implies $\bar{\omega}^\top x_i + \bar{b} \geq 0$. Similarly, for all $i \in \{n+1, \dots, N\}$ with $\bar{z}_i = 0$, we get $\bar{\omega}^\top x_i + \bar{b} \leq 0$, and (P3c) as well as (P3d) are fulfilled. Because $(\bar{\omega}, \bar{b}, \bar{\xi}, \bar{\eta}, \bar{z})$ is a feasible point for Problem (P3), its objective function value $\bar{f}$ is an upper bound for Problem (P3). □

Note, finally, that since the point obtained from Algorithm 1 is feasible for Problem (P3), we can use it for warm starting.

Further Algorithmic Enhancements
In order to reduce computational costs, we propose two additional enhancements. The first one (see Section 4.1) makes use of the fact that the SVM is mostly influenced by data points that are close to the separating hyperplane. The second one (see Section 4.2) introduces a rule for updating $M$ in each iteration of Algorithm 1.

4.1. Handling Points Far From the Hyperplane. In Algorithm 1, the number of clusters increases in each iteration. Hence, the time to solve Problem (P4) in general increases from iteration to iteration. As in the original SVM, the points closest to the hyperplane influence the resulting hyperplane more than the other points. Obviously, eliminating points that do not strongly influence the hyperplane decreases the size of the problem. Some approaches to eliminate such points have also been proposed for the original SVM. For a survey, see, e.g., Birzhandi et al. (2002). However, most of these approaches are heuristics and do not necessarily yield a feasible point of the problem.
The idea for our setting is the following. Clusters that are far away from the hyperplane could be omitted, as this will not change the solution. The farther a cluster is from the hyperplane in an iteration, the less likely it is that the cluster will be split or change sides completely in a future iteration. Hence, the clusters farthest from the current hyperplane mainly add information about their side and capacity. However, in a later iteration, such a cluster may become relevant again. Thus, we need a way to discard detailed information on certain clusters as well as a way to reactivate the discarded clusters if necessary.
We propose the following procedure to reduce the number of clusters that have to be considered in the current iteration of the algorithm. If the number of clusters exceeds a fixed value $k^+$, we first fix the cluster with the centroid farthest from the hyperplane as a kind of residual cluster on a side if this side has points far from the hyperplane. Second, we discard all clusters in which all points are farther from the hyperplane than some threshold $\Delta_t$ and assign them to the residual cluster on their side of the hyperplane. This way, the cardinality constraint remains valid. Moreover, all formerly discarded clusters are checked for re-consideration. If a discarded cluster has a point with a distance to the hyperplane of less than $\Delta_t$ or if any point in the cluster changed the side, the cluster is reactivated.
Let $\bar{S} = (s_{\alpha(1)}, \dots, s_{\alpha(d)})^\top$ be the vector of increasingly sorted values of $S = \{s_1, \dots, s_d\}$ and let $a \in (0, 1)$. The $a$-quantile of $S$ is computed as proposed by Hyndman and Fan (1996). Given a parameter $\bar{\Delta}_t \in (0, 1)$, we choose $\Delta_t$ in each iteration $t$ as the $\bar{\Delta}_t$-quantile of the distances of the points to the current hyperplane; see (4). Note that if, in an iteration $t$, a point in some discarded cluster changed the side, the vector $z$ as part of the current solution does not fit to this change. This happens when, e.g., $(\omega^{t-1})^\top x_i + b^{t-1} > 0$ and $(\omega^t)^\top x_i + b^t < 0$ but $z_j^t > 0$ with $C_j$ being the cluster with the centroid farthest from the hyperplane on the positive side. To avoid that this happens too often, $\bar{\Delta}_{t+1}$ is increased by a fixed value $\bar{\Delta} \in (0, 1)$ whenever there is some point in some discarded cluster that has changed sides.
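As a minimal Python sketch (our naming), assuming that $\Delta_t$ is taken as a quantile of the point-to-hyperplane distances; the exact Hyndman-Fan quantile definition used in Equation (4) corresponds to one of the types selectable via the `method` argument of `np.quantile`:

```python
import numpy as np

def distance_threshold(w, b, X, level):
    """Return the level-quantile of the distances of the points in X
    (rows) to the hyperplane (w, b). Clusters whose points all lie
    farther away than this threshold are candidates for discarding;
    the concrete quantile type is an assumption in this sketch."""
    dist = np.abs(X @ w + b) / np.linalg.norm(w)
    return float(np.quantile(dist, level))
```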
Motivated by the above discussion, we add new steps to Algorithm 1, which results in Algorithm 2. In Step 5, if the number of clusters exceeds $k^+$, clusters far from the hyperplane are discarded. In Steps 9 and 10, discarded clusters with a point that changed sides or that is closer to the hyperplane than $\Delta_t$ are reactivated. In Step 12, $\bar{\Delta}_t$ is updated.

4.2. Updating the Big-M. As discussed in Section 2, $M$ needs to be sufficiently large. However, the bigger $M$ is, the more likely we are to face numerical issues. As shown in Section 2, the smaller the objective function value provided by a feasible point, the smaller the value of $M$ can be chosen. Based on that, we update $M$ in each iteration with the aim of decreasing it. We do this by adding Step 16 in Algorithm 2 and the next theorem justifies this.
Theorem 5. Consider $X$, $y$, $C_1$, $C_2$, $\tau$, as well as $c_1, \dots, c_{k_t}$ and $e_1, \dots, e_{k_t}$ in an iteration $t$ of Algorithm 1. Then, the optimal solution $(\omega^t, b^t, \xi^t, \eta^t, z^t)$ of Problem (P4) provides an upper bound $\bar{f}^t$ for Problem (P3) and, by Lemma 1, $M^{t+1} = 2\sqrt{2\bar{f}^t} \max_{i \in [1,N]} \|x_i\| + 1$ is a valid big-$M$ for the next iteration.

Algorithm 2: Improved Re-Clustering Method (IRCM)
1 Set t = 1, compute M^t as in (1), cluster X_u in k_1 clusters using k-means, leading to the centroids c_1, ..., c_{k_1} and the numbers e_1, ..., e_{k_1} of data points in each cluster.
2 Solve Problem (P4) to compute the hyperplane (ω^t, b^t) as well as ξ^t, η^t, z^t.
3 Compute Δ_t as in (4).
for each cluster that is cut by the hyperplane (ω^t, b^t) do
  Split the cluster into two new clusters so that neither of the two new clusters is cut by the hyperplane (ω^t, b^t).
  Update the centroids of the newly created clusters.
Theorem 6. Algorithm 2 terminates after at most $2m - k_1 + (1 - \bar{\Delta}_1)/\bar{\Delta}$ iterations, where $m$ is the number of unlabeled data points, $k_1$ is the number of initial clusters, and $\bar{\Delta}_1$, $\bar{\Delta}$ are inputs of Algorithm 2.
Proof. In Algorithm 2, the number of iterations can only be greater than in Algorithm 1 if there is some iteration $t$ for which $J^t \neq \emptyset$ holds but the hyperplane does not cut any cluster. In each iteration in which this happens, $\bar{\Delta}_t$ is increased and, in the worst case, i.e., after at most $(1 - \bar{\Delta}_1)/\bar{\Delta}$ such increases, we get $\bar{\Delta}_t = 1$. This implies that, for all further iterations $t$, no cluster is discarded. Thus, no cluster is added to the set $G^t$. Since $|G^t| \leq m$ and $J^t \subset G^t$, Algorithm 2 can only have $m$ more iterations with $J^t \neq \emptyset$. This means that the maximum number of iterations is $2m - k_1 + (1 - \bar{\Delta}_1)/\bar{\Delta}$. □

Although Theorem 6 shows that, in the worst case, Algorithm 2 can take more iterations than Algorithm 1 to terminate, Algorithm 2 solves problems with fewer binary variables in every iteration, which means that the time per iteration will be lower compared to Algorithm 1.
Theorem 7. The objective function value obtained by Algorithm 2 is an upper bound for the optimal objective function value of Problem (P3).
Proof. Since Algorithm 2 terminates when no cluster changes the side and no cluster is cut by the hyperplane, the proof is the same as for Theorem 5. □

As before, we can use the point obtained from Algorithm 2 to warm start Problem (P3).

Using IRCM for Warm-Starting
As stated in Theorem 7, the solution found by Algorithm 2 is feasible for Problem (P3). Hence, we can use it for warm-starting the solution process of Problem (P3). The next lemma establishes that unlabeled points can be fixed to lie on one side of the hyperplane.
Lemma 8. Let $(\bar{\omega}, \bar{b}, \bar{\xi}, \bar{\eta}, \bar{z})$ be a feasible point of Problem (P3) with objective function value $\bar{f}$. Furthermore, let $(\omega^*, b^*, \xi^*, \eta^*, z^*)$ be an optimal solution of Problem (P3) with objective function value $f^*$. Set
$$P_u := \{x_i : i \in [n+1, N], \ (\omega^*)^\top x_i + b^* \geq 0\}, \quad N_u := \{x_i : i \in [n+1, N], \ (\omega^*)^\top x_i + b^* < 0\},$$
let $S_p \subseteq P_u$, $S_n \subseteq N_u$ be arbitrarily chosen subsets, and let $x_s \notin S_n$ be an unlabeled point with $\bar{\omega}^\top x_s + \bar{b} < 0$. Then, the objective function value $\hat{f}$ given by any feasible point of Problem (P5) with $M$ as defined in (8) satisfies the following properties: (a) $\hat{f}$ is an upper bound for $f^*$; (b) if $\hat{f}$ is the optimal objective function value of Problem (P5) and $\hat{f} > f^*$, then $(\omega^*)^\top x_s + b^* < 0$ holds, i.e., $x_s \in N_u$.
(a) The points that satisfy Constraints (P5b)-(P5k) are feasible for Problem (P3) and provide an objective function value $\hat{f}$. Since $f^*$ is the optimal objective function value of Problem (P3), $f^* \leq \hat{f}$ holds. (b) Assume, by contradiction, that $(\omega^*)^\top x_s + b^* \geq 0$ holds. This means that $(\omega^*, b^*, \xi^*, \eta^*, z^*)$ satisfies (P5b)-(P5k). Moreover, since $\hat{f}$ is the optimal objective function value of Problem (P5), we get $\hat{f} \leq f^*$. However, $f^* \leq \hat{f}$ holds by (a). Thus, $\hat{f} = f^*$, which contradicts $\hat{f} > f^*$. □

Note that the last lemma can be adapted for the case $\bar{\omega}^\top x_s + \bar{b} > 0$. In this case, Constraint (P5g) needs to be replaced with (9) and (b) needs to be replaced with $(\omega^*)^\top x_s + b^* > 0$, i.e., $x_s \in P_u$. Note that the more points we fix on one side, the faster Problem (P3) tends to be solved, as there are fewer binary variables. Moreover, the solution of Problem (P3) can be found by solving the restricted problem in which the points in $S_p$ and $S_n$ are fixed to the positive and the negative side, respectively, where $S_p$ and $S_n$ are subsets of $P_u$ and $N_u$.
Consider now a given and fixed parameter $B_{\max}$, a factor $\gamma \in (1, m/B_{\max}]$, and let $\beta$ be $\gamma B_{\max}$ rounded to the nearest integer. While the number of fixed points is smaller than $B_{\max}$, we do the following. For $i \in \{1, \dots, \beta\}$, if $\bar{\omega}^\top x_{\alpha(i)} + \bar{b} < 0$ holds, we try to solve Problem (P5) using the time limit $T_{\max}$ and the upper bound $\bar{f}$. If there is a feasible point of this problem, we set $(\bar{\omega}, \bar{b}, \bar{\xi}, \bar{\eta}, \bar{z})$ to this point and update the objective function value $\bar{f}$ accordingly. If no feasible point could be computed and if the time limit was not reached, we fix $x_{\alpha(i)}$ to be on the negative side.
Similarly, we proceed in the same way if $\bar{\omega}^\top x_{\alpha(i)} + \bar{b} > 0$ holds, with (P5g) replaced by (9). The method is formally described in Algorithm 3. Finally, note that, although Problem (P5) is an MIQP, it is a feasibility problem, which is often easier to solve in practice than an optimization problem. Besides that, if the point obtained from Algorithm 2 is close to the optimum of Problem (P3), many unlabeled points will be fixed and Problem (P3) will be faster to solve.

Numerical Results
In this section, we present and discuss our computational results that illustrate the benefits of knowing the total number of points in each class of the unlabeled data and of using our approaches to speed up the solution process. We evaluate this on different test sets from the literature. The test sets are described in Section 6.1, while the computational setup is depicted in Section 6.2. The evaluation criteria are described in Section 6.3 and the numerical results are discussed in Section 6.4.

6.1. Test Sets. For the computational analysis of the proposed approaches, we consider the subset of instances presented by Olson et al. (2017) that are suitable for classification problems and that have at most three classes. We restrict ourselves to instances of at most three classes to obtain an overall test set of manageable size. Repeated instances are removed and instances with missing information are reduced to the observations without missing information. If three classes are given in an instance, we transform them into two classes such that the class with label 1 represents the positive class and the other two classes represent the negative class. This results in a final test set of 97 instances; see Table 1 in Appendix A.

Algorithm 3: Improved & Warm-Started Re-Clustering Method (WIRCM)
Solve the problem (P5) with (P5g) replaced by (9), using f̄ as an upper bound and a time limit of T_max.
To avoid numerical instabilities, we re-scale all data sets as follows. For each coordinate $j \in [1, d]$, we compute a shift $m_j$ and shift each coordinate $j$ of all data points $x_i$ via $\tilde{x}_i^j = x_i^j - m_j$. If we do this for all data points, they get centered around the origin. Moreover, if a coordinate $j$ of the re-scaled points is still large, i.e., if $\tilde{l}_j = l_j - m_j < -10^2$ or $\tilde{u}_j = u_j - m_j > 10^2$ holds, it is re-scaled linearly to the interval $[\underline{v}, \bar{v}]$ with $\bar{v} = 10^2$ and $\underline{v} = -10^2$. The corresponding 29 instances that we re-scaled are marked with an asterisk in Table 1. Note that we use a linear transformation to scale the data sets. Hence, after computing the hyperplane for the scaled data, the respective hyperplane for the original data can be computed ex post by applying another suitably chosen linear transformation.
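This pre-processing can be sketched in Python as follows; the centering via the interval midpoint and the symmetric mapping into $[-10^2, 10^2]$ are assumptions standing in for the exact formulas, which may differ in detail.

```python
import numpy as np

def rescale(X, v_lo=-100.0, v_hi=100.0):
    """Center each coordinate of X (rows are points) and, where values
    still exceed [v_lo, v_hi], map them linearly into that interval.
    Both the midpoint shift m_j and the symmetric scaling are
    illustrative choices."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    m = (lo + hi) / 2.0                    # assumed shift m_j
    Xc = (X - m).astype(float)
    lo_c, hi_c = Xc.min(axis=0), Xc.max(axis=0)
    out = Xc.copy()
    for j in range(X.shape[1]):
        if lo_c[j] < v_lo or hi_c[j] > v_hi:
            span = max(hi_c[j], -lo_c[j])  # symmetric scale keeps 0 fixed
            out[:, j] = Xc[:, j] * (v_hi / span)
    return out
```

Since the map is linear per coordinate, a hyperplane computed on the scaled data can be transformed back to the original data, as noted above.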
In our computational study, we want to highlight the importance of cardinality constraints, especially for the case of non-representative biased samples. Biased samples occur frequently in non-probability surveys, which are surveys for which the inclusion process is not monitored and, hence, the inclusion probabilities are unknown as well. Correction methods like inverse inclusion probability weighting are therefore not applicable. For an insight into inverse inclusion probability weighting, see Skinner and D'arrigo (2011) and the references therein.
To mimic this situation, we create 5 biased samples with 10 % of the data being labeled for each instance. Different from a simple random sample, in which each point has an equal probability of being chosen as labeled data, in the biased sample, the labeled data is chosen with a probability of 85 % for being on the positive side of the hyperplane. Then, for each instance, with a time limit of 3600 s, we apply the approaches listed in Section 6.2. In Appendix C, we also provide the results under simple random sampling, which produces unbiased samples. We see that the results from the proposed methods are similar to those of the plain SVM in that setting. Hence, besides the additional computational burden, there is no downside to using the proposed method in the case of an unknown sampling process.

6.2. Computational Setup. Our algorithm has been implemented in Julia 1.8.5 and we use Gurobi 9.5.2 and JuMP (Dunning et al. 2017) to solve Problems (P1), (P3), and (P4). All computations were executed on the high-performance cluster "Elwetritsch", which is part of the "Alliance of High-Performance Computing Rheinland-Pfalz" (AHRP). We used a single Intel XEON SP 6126 core with 2.6 GHz and 64 GB RAM.
For each of the 485 instances described in Section 6.1, the following approaches are compared: (a) SVM as given in Problem (P1), where only labeled data are considered; (b) CS³VM as given in Problem (P3) with $M$ as given in (1); (c) IRCM as described in Algorithm 2; (d) WIRCM as described in Algorithm 3. Based on our preliminary experiments, we set the penalty parameters $C_1 = C_2 = 1$. For WIRCM, we impose a time limit of $T_{\max} = 40$ s for solving Problem (P5). Moreover, we choose $\gamma = 1.2$ and an instance-dependent maximum number $B_{\max}$ of unlabeled points that can be fixed. Finally, for IRCM and WIRCM, we set $\bar{\Delta}_1 = 0.8$, $\bar{\Delta} = 0.1$, $k^+ = 50$, and an instance-dependent initial number of clusters $k_1$. A more detailed discussion of the choice of hyperparameters is given in Appendix D.

6.3. Evaluation Criteria. The first evaluation criterion is the run time of SVM, CS³VM, IRCM, and WIRCM. The results will help to contextualize other evaluation criteria such as accuracy and precision. To compare run times, we use empirical cumulative distribution functions (ECDFs). Specifically, for $S$ being a set of solvers (or approaches as above) and for $P$ being a set of problems, we denote by $t_{p,s} \geq 0$ the run time of approach $s \in S$ applied to problem $p \in P$ in seconds. If $t_{p,s} > 3600$, we consider problem $p$ as not being solved by approach $s$. With these notations, the performance profile of approach $s$ is the graph of the function
$$\rho_s(t) = \frac{|\{p \in P : t_{p,s} \leq t\}|}{|P|}. \tag{10}$$
The second evaluation criterion is based on Theorem 5, where we show that the objective function value of the point obtained by IRCM is an upper bound for CS³VM and, consequently, for Problem (P4) that is solved within WIRCM. Note that SVM also provides a feasible point for CS³VM and, consequently, provides an upper bound as well. Given the solution $(\omega, b, \xi)$ of the SVM, we compute the binary variables $z_i$, $i \in [n+1, N]$, according to the side of the hyperplane on which $x_i$ lies, fixing $z_i$ for points with $\omega^\top x_i + b = 0$ as well. Finally, we compute $\eta$ from Constraint (P3e), so that the objective function value can be computed as in (P3a). Based on that, we compare how close the objective
function values obtained from SVM, CS 3 VM, IRCM, and WIRCM are to the optimal solution.To this end, we use ECDFs, for which we replace t p,s by f p,s in Equation ( 10) with where f * p is the optimal objective function value of problem p and b p,s is the objective function value obtained by approach s.
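As an illustration, the run-time ECDF comparison can be sketched in a few lines of Python; the run-time values below are hypothetical and the helper is not part of the paper's code:

```python
def ecdf(run_times, time_limit=3600.0):
    """Return a step function tau -> fraction of problems solved within
    tau seconds; runs exceeding the time limit count as unsolved."""
    n = len(run_times)
    solved = [t for t in run_times if t <= time_limit]
    return lambda tau: sum(1 for t in solved if t <= tau) / n

# Hypothetical run times t_{p,s} in seconds for one approach s;
# the 4000 s run exceeds the limit and counts as unsolved.
profile = ecdf([0.5, 30.0, 200.0, 4000.0])
print(profile(100.0), profile(3600.0))  # 0.5 0.75
```

The same helper applies to the bound-quality comparison by passing the relative gaps f_{p,s} instead of run times.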
Besides that, for each instance and for each approach described in Section 6.2, after computing the hyperplane (ω, b), we classify all points x_i as being on the positive side if ω⊤x_i + b > 0 and as being on the negative side if ω⊤x_i + b < 0 holds. For CS³VM and WIRCM, if the hyperplane (ω, b) satisfies ω⊤x_i + b = 0 for some unlabeled point x_i, we classify this point as positive or negative depending on the respective binary variable z_i. For IRCM, in contrast, if ω⊤x_i + b = 0 for some unlabeled point x_i, we classify this point as positive or negative depending on z_j with j such that x_i ∈ C_j. For the labeled points in these three approaches and for all points in the SVM, if ω⊤x_i + b = 0 holds, we classify the point as being on the correct side. Note that for the cases in which the IRCM takes more than 3600 s to solve an instance, we use the last hyperplane found by the algorithm. If we hit the time limit in Gurobi when solving CS³VM (either standalone or in the final phase of WIRCM), we take the best solution found so far.
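A minimal sketch of this classification rule in plain Python (illustrative only; the tie-breaking via z follows the CS³VM/WIRCM case described above):

```python
def classify(omega, b, X, z=None):
    """Assign +1/-1 by the sign of omega^T x + b; points lying exactly
    on the hyperplane fall back to the binary variables z, if given."""
    labels = []
    for i, x in enumerate(X):
        score = sum(w * xj for w, xj in zip(omega, x)) + b
        if score > 0:
            labels.append(1)
        elif score < 0:
            labels.append(-1)
        else:  # tie: use the binary variable of the unlabeled point
            labels.append(z[i] if z is not None else 1)
    return labels

# The third point lies on the hyperplane and is resolved via z.
print(classify([1.0, -1.0], 0.0, [[2, 1], [1, 2], [1, 1]], z=[1, 1, -1]))
# -> [1, -1, -1]
```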
Knowing the true label of all points, we then distinguish four categories: a point is a true positive (TP) or a true negative (TN) if it is classified correctly in the positive or negative class, respectively, a false positive (FP) if it is misclassified into the positive class, and a false negative (FN) if it is misclassified into the negative class. Based on that, we compute two classification metrics, for which a higher value indicates a better classification. The first one is accuracy (AC). It measures the proportion of correctly classified points and is given by

AC = (TP + TN) / (TP + TN + FP + FN). (12)

The second metric is precision (PR). It measures the proportion of correctly classified points among all positively classified points and is computed by

PR = TP / (TP + FP). (13)

The main comparison in terms of accuracy and precision is w.r.t. the "true hyperplane", i.e., the solution of Problem (P1) on the complete data with all N points and all labels available. The main question is how close the accuracy and precision are to those of the true hyperplane. Hence, we compute the ratios of the accuracy and precision according to

AC = AC_s / AC_true and PR = PR_s / PR_true, (14)

where AC_s and PR_s denote the values obtained by approach s and AC_true and PR_true are computed as in (12) and (13) for the true hyperplane.
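Both metrics follow directly from the four counts; a small sketch with made-up labels (not the paper's evaluation code):

```python
def accuracy_precision(y_true, y_pred):
    """AC = (TP + TN) / (TP + TN + FP + FN) and PR = TP / (TP + FP)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == -1 and p == -1 for t, p in zip(y_true, y_pred))
    fp = sum(t == -1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == -1 for t, p in zip(y_true, y_pred))
    ac = (tp + tn) / (tp + tn + fp + fn)
    pr = tp / (tp + fp) if tp + fp > 0 else float("nan")
    return ac, pr

ac, pr = accuracy_precision([1, 1, -1, -1, 1], [1, -1, -1, 1, 1])
print(ac)  # 0.6 (3 of 5 points classified correctly)
```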
We also compare the measures with the SVM method, which only considers the information of the labeled data. For this purpose, we compute the differences

AC = AC_s − AC_SVM and PR = PR_s − PR_SVM, (15)

where AC_SVM and PR_SVM are computed as in (12) and (13) for the SVM hyperplane. To keep the numerical results section concise, we report on recall and the false positive rate in Appendix B.
6.4.1. Run Time.
Figure 2 shows the ECDFs for the measured run times. Clearly, SVM is the fastest algorithm. This is expected since the SVM, in contrast to the other approaches, does not include any binary variables related to the unlabeled points. It can also be seen that IRCM outperforms both CS³VM and WIRCM. This shows that clustering the unlabeled data points significantly decreases the run time. However, we need to be careful with the interpretation of these run times: termination of SVM and IRCM does not imply that a globally optimal point is found, whereas this is guaranteed for CS³VM and WIRCM. The quality of the points found by SVM and IRCM is discussed in the next section. The figure also clearly indicates that Problem (P2) is rather challenging: Even IRCM, which terminates within the time limit (indicated by the gray and dashed vertical line) for the most instances, only does so for 57 % of the instances. Note that WIRCM has the worst efficiency. This necessarily is the case since, due to Step 1 of Algorithm 3, its run time always includes the run time of IRCM. To shed some light on the scalability of the approaches, we also present a brief analysis of the run times in dependence of the number of samples in Appendix E.
6.4.2. Quality of the Obtained Upper Bounds.
As discussed in the last section, for some instances, none of the three approaches that actually consider the unlabeled data terminates within the given time limit. This means that we do not obtain the optimal objective function value for these instances, which, moreover, we can only provably obtain with CS³VM and WIRCM. In fact, we have the optimal solution for 179 instances. These are the baseline instances for Figure 3, which shows the ECDFs for the upper bound quality as defined in (11). Note that the objective function value obtained by SVM is very far from the optimal value, while IRCM finds an objective function value rather close to the optimal value (with f_{p,s} ≤ 0.2; see the gray dashed vertical line) in 90 % of these instances. Besides that, WIRCM outperforms CS³VM in this comparison, which means that using IRCM as a warm start improves the result. The consequences of these results are the following: If one is interested in obtaining a rather good feasible point as quickly as possible, one should use IRCM. If one is able to spend some more run time, one should use WIRCM. Hence, both novel methods derived in this paper have their advantages over just solving CS³VM with a standard MIQP solver.

6.4.3. Accuracy.
For some instances, none of the three approaches that actually tackle the unlabeled data terminates within the given time limit. Hence, our first comparison only considers instances for which CS³VM terminates within the time limit. As can be seen in Figure 4, the relative accuracy AC (w.r.t.
the true hyperplane) of CS³VM is closer to 1 than the relative accuracy of SVM, especially for the unlabeled data. This means that using the unlabeled points as well as the cardinality constraint allows to reproduce the classification of the true hyperplane with higher accuracy than the standard SVM does. Besides that, the relative accuracy of the SVM is more spread than that of the other approaches, indicating that there is comparably more variation in the results than in those of CS³VM. The box in the boxplot depicts the range of the middle 50 % of the values; 25 % of the values lie below and 25 % above the box.
Figure 5 shows that, in almost 75 % of the cases, CS³VM has AC values larger than zero, where zero means the same accuracy as the SVM itself. In the other 25 % of the cases, the AC of CS³VM is slightly smaller than that of SVM.
The second comparison considers only the three approaches that actually use the unlabeled data, i.e., CS³VM, IRCM, and WIRCM, on all instances. As can be seen in Figure 6, even though IRCM does not have an optimality guarantee, it has a better relative accuracy AC than the hyperplane obtained from CS³VM within the time limit. Consequently, since the hyperplane obtained from IRCM is used as a warm start in WIRCM, the latter also has better accuracy. Figure 7 shows that, in almost 75 % of the cases, CS³VM, IRCM, and WIRCM have AC values larger than zero. That is, in general, our methods have greater accuracy than the SVM. However, some cases show worse AC values for our methods than for the SVM. This happens because, for some instances, the methods (mainly CS³VM; see also Figure 2) do not terminate within the time limit. Hence, we expect that the number of negative values would decrease if we increased the time limit.
6.4.4. Precision.
We again separate the comparisons as in Section 6.4.3. Figure 8 shows that the SVM's relative precision PR is lower than the relative precision of CS³VM. This means that CS³VM reproduces the classification of the true hyperplane with higher precision than the original SVM. Hence, SVM has more false-positive results. This happens because the biased sample is more likely to contain positively labeled data and, having no information about the unlabeled data, the SVM ends up classifying points on the positive side. As can be seen in Figure 9, CS³VM has PR values slightly larger than 0, which is the baseline here that refers to the SVM itself. This means that CS³VM is slightly more precise than the SVM.
Figure 10 shows that the PR values of IRCM and WIRCM are less spread than those of CS³VM. The most likely reason is that CS³VM terminates on fewer instances than IRCM and WIRCM. As can be seen in Figure 11, IRCM and WIRCM also have PR values slightly larger than 0. This means that our methods are slightly more precise than the SVM. The negative outliers most likely have the same cause as those for the respective accuracy values.

Conclusion
For many classification problems, it can be costly to obtain labels for the entire population of interest. However, aggregate information on how many points are in each class can be available from external sources. For this situation, we proposed a semi-supervised SVM that can be modeled via a big-M-based MIQP formulation. We also presented a rule for updating the big-M values in an iterative re-clustering method and derived further computational techniques such as tailored dimension reduction and warm-starting to reduce the computational cost.
In the case of simple random samples, our proposed semi-supervised methods perform as well as the classic SVM approach. However, in many applications, the available data comes from non-probability samples. Hence, there is the risk of obtaining biased samples. Our numerical study shows that, in this setting, our approaches have better accuracy and precision than the original SVM formulation.
The problem of considering a cardinality constraint is computationally challenging. Our proposed clustering approach significantly helps to decrease the run time and to find an objective function value that is very close to the optimal value. Besides that, the clustering approach maintains the same accuracy and precision as the MIQP formulation. Moreover, using the clustering approach as a warm start and fixing some unlabeled points on one side of the hyperplane improves the quality of the objective function value even further. Hence, the newly proposed methods lead to a significant improvement compared to just solving the classic MIQP formulation using a standard solver.
Despite these contributions, there is still room for improvement and future work. First, we only considered the linear SVM kernel. For future work, the development of methods for other kernels, such as a Gaussian kernel, can be a valuable topic. Moreover, the use of norms other than the 2-norm could be analyzed as well, and the formal hardness of the considered problem should be settled. Finally, the adaptation of our approaches to multiclass SVMs using a one-vs.-rest strategy may be another promising direction.

Appendix B. Recall and False Positive Rate

Recall (RE) measures the proportion of correctly classified points among all truly positive points and is given by

RE = TP / (TP + FN). (16)

The false positive rate (FPR) measures the proportion of misclassified points among all truly negative points and is computed by

FPR = FP / (FP + TN). (17)

This quantity is important in some applications such as quality control, where a false positive can cause more issues than a false negative. Note that for the FPR, a lower value indicates a better classification. The main comparison in terms of recall and false positive rate is w.r.t. the "true hyperplane", i.e., the solution of Problem (P1) on the complete data with all N points and all labels available. The main question is how close the recall and false positive rate are to those of the true hyperplane. Hence, we compute the ratios of the recall and false positive rate according to

RE = RE_s / RE_true and FPR = FPR_s / FPR_true, (18)

where RE_true and FPR_true are computed as in (16) and (17) for the true hyperplane.
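Analogously to accuracy and precision, recall and the false positive rate follow from the four counts TP, TN, FP, and FN; a short illustrative sketch (not the paper's evaluation code):

```python
def recall_fpr(y_true, y_pred):
    """RE = TP / (TP + FN) and FPR = FP / (FP + TN)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == -1 for t, p in zip(y_true, y_pred))
    fp = sum(t == -1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == -1 and p == -1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), fp / (fp + tn)

re, fpr = recall_fpr([1, 1, -1, -1], [1, -1, 1, -1])
print(re, fpr)  # 0.5 0.5
```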
As can be seen in Figure 12, the SVM's relative recall is slightly larger than that of the other methods. As in Section 6.4.4, this happens because the biased sample is more likely to contain positively labeled data and, having no information about the unlabeled data, the SVM ends up classifying points on the positive side.
Figure 13 shows that CS³VM, IRCM, and WIRCM have lower FPR values than the original SVM. This means that the newly proposed methods have a lower false positive rate than the original SVM. The fact that CS³VM terminates for fewer instances than IRCM explains why IRCM has a lower relative false positive rate than CS³VM. Finally, since WIRCM uses IRCM for warm-starting, it also achieves better relative false positive rates than CS³VM.

Appendix C. Numerical Results for Simple Random Samples
In Section 6, we focused our computational study on non-representative, biased samples. The common baseline scenario to check the performance of estimators is to apply them to simple random samples. Hence, for completeness, we also present the results under simple random sampling. That is, each unit in the data set has the same probability π_i = n/N of being included in the sample of size n. The instances are the same as described in Section 6.1, and the computational setup follows the description in Section 6.2. As before, the evaluation criteria are AC and PR as in (14) as well as RE and FPR as in (18).
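Drawing such a sample amounts to picking n of the N unit indices without replacement, so that every unit has inclusion probability π_i = n/N; a minimal sketch using Python's standard library (illustrative only):

```python
import random

def simple_random_sample(N, n, seed=0):
    """Sample n of N unit indices without replacement; each unit has
    the same inclusion probability n / N."""
    return random.Random(seed).sample(range(N), n)

idx = simple_random_sample(N=1000, n=100)
print(len(idx), len(set(idx)))  # 100 distinct indices
```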
Figures 14 and 15 show similar accuracy and precision for all approaches. This is expected, as the sample is not biased and hence the cardinality constraint does not contribute relevant additional information to the problem. Therefore, the SVM does not tend to classify the points as positive, as is the case for the biased samples. The outliers, mainly present for CS³VM, are due to those instances that are not solved within the time limit. As can be seen in Figures 16 and 17, recall and false positive rate are also similar for all approaches. Hence, for simple random samples, our approaches yield almost the same results as the SVM. Note that for the biased samples, they outperform the SVM. Hence, in cases in which the type of sample is not known, it is "safe" to use the newly proposed approaches for classification.

Appendix D. Choosing the Hyperparameters
Each parameter of Algorithms 2 and 3 as well as of Problems (P1) and (P3) can be chosen from a range. In Table 2, we present plausible ranges for these parameters.
Clearly, C1 ∈ ℝ≥0 holds. However, the closer the value is to 1, the more equally important are maximizing the margin and minimizing the classification error for the labeled data. The range of C2 is based on C1 in order to indicate how much more important the unlabeled data are compared to the labeled data. Again, we choose C2 = 1 so that both types of data have the same importance. Besides that, our preliminary tests showed that if C2 is much larger than C1, the focus shifts to minimizing the classification error for the unlabeled data, which implies focusing on the binary variables and, hence, leads to larger run times. For choosing the other parameters, we considered the first three data sets presented in Table 1 and varied the parameter choices in a preliminary numerical study. Based on the results, we now discuss how to choose the remaining parameters. The parameter k1 can be between 2 and m since we cluster m unlabeled points. Note that the smaller k1 is, the less time per iteration is needed since there are fewer binary variables. However, more iterations may be needed to find the solution. On the other hand, the bigger k1 is, the more time per iteration is required. We choose to start with a small value of k1 because in preliminary numerical tests, when the algorithm terminated, the number of clusters never exceeded m/3. Moreover, in our preliminary tests, if the algorithm exceeded k_t = 50 in some iteration t, solving Problem (P4) took a lot of time. To decrease this time, we reduced the number of clusters by eliminating the ones far from the hyperplane. This is the reason why we choose k⁺ = 50.
The parameter ∆1 indicates that clusters with a distance to the hyperplane greater than the ∆1-quantile of all distances are deactivated. It is between 0.5 and 0.9 because a value smaller than 0.5 would mean removing points that are too close to the hyperplane, which implies that many clusters could be reactivated in later iterations. On the other hand, for values larger than 0.9, almost no clusters can be deactivated. We choose 0.8 because in our preliminary numerical tests we noticed that with a smaller value, many clusters were activated again, which increased the required time per iteration. The range of ∆ is justified by the fact that the maximum value of ∆t is 1 for all t. We chose 0.1 because the larger the value, the smaller the possibility to eliminate clusters becomes. If chosen smaller, ∆t and ∆t+1 would be very similar and some clusters would be deactivated and reactivated several times.
Because we have m unlabeled points, we can fix at most m of them, which justifies the range of B_max and the maximum value of γ. Since some points cannot be fixed on either side (they may be on the wrong side, or solving Problem (P5) may take more than T_max), we aim to fix at least 10 % more than B_max many unlabeled points. This is why the minimum value of γ is 1.1. The maximum value of T_max is 100 s because, if chosen smaller, there is often not enough time to solve Problem (P5). On the other hand, if it is chosen larger, the time needed to run Algorithm 3 increases.
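The choices discussed above can be summarized in a small configuration sketch (the values are those used in Section 6.2; the dictionary layout and names are our own, not from the paper's code):

```python
def hyperparameters(m):
    """Chosen values and plausible ranges; m is the number of unlabeled
    points, which bounds the cluster count and the fixable points."""
    return {
        "C1": {"value": 1.0},                        # labeled-error penalty
        "C2": {"value": 1.0},                        # unlabeled-error penalty
        "k1": {"range": (2, m)},                     # initial cluster count
        "k_plus": {"value": 50},                     # cap on cluster count
        "Delta1": {"value": 0.8, "range": (0.5, 0.9)},
        "Delta": {"value": 0.1},
        "gamma": {"value": 1.2, "min": 1.1},         # upper bound tied to m
        "B_max": {"range": (0, m)},                  # fixable unlabeled points
        "T_max": {"value": 40.0, "max": 100.0},      # seconds for (P5)
    }

params = hyperparameters(m=500)
print(params["gamma"]["value"])  # 1.2
```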

Appendix E. Run Times in Dependence of the Number of Data Points
In this section, we complement Section 6 by presenting the run times in dependence of the number of points in the data set in order to shed some light on the scalability of our approaches. To this end, we split the entire collection of data sets into three subsets.
The first subset only considers the 46 data sets with N ≤ 500. As can be seen in Figure 18 (top), IRCM solves more than 75 % of these instances, while CS³VM and WIRCM solve more than 50 %. The second subset contains the 11 data sets with N ∈ (500, 1500]. Figure 18 (middle) shows that, for these test sets, IRCM solves about 40 % of the instances, while CS³VM and WIRCM solve more than 10 %. The last subset contains the 21 data sets with N > 1500. Figure 18 (bottom) shows that CS³VM and WIRCM do not solve any of these instances, while IRCM solves about 20 %. As expected, the larger the number of points and, thus, the larger the resulting models, the fewer instances can be solved within the time limit.

Figure 1. A 2-dimensional example (left) and the hyperplanes resulting from the SVM and the CS³VM (right).

Figure 3. ECDFs for the quality of the obtained upper bounds.

Figure 4. Relative accuracy AC w.r.t. the true hyperplane; see (14). Only those instances are considered for which CS³VM terminated. Left: Comparison for all data points. Right: Comparison only for unlabeled data points.

Figure 6. Relative accuracy AC w.r.t. the true hyperplane; see (14). Left: Comparison for all data points. Right: Comparison only for unlabeled data points.

Figure 11. Precision values PR w.r.t. the SVM; see (15). Left: Comparison for all data points. Right: Comparison only for unlabeled data points.

Figure 13. Relative false positive rate FPR w.r.t. the true hyperplane; see (18). Left: Comparison for all data points. Right: Comparison only for unlabeled data points.

Figure 15. Relative precision PR w.r.t. the true hyperplane; see (14), for the simple random samples. Left: Comparison for all data points. Right: Comparison only for unlabeled data points.
Figure 8. Relative precision PR w.r.t. the true hyperplane; see (14). Only those instances are considered for which CS³VM terminated. Left: Comparison for all data points. Right: Comparison only for unlabeled data points.

Table 2. Plausible ranges for the hyperparameters.