1 Introduction

Support vector machines (SVMs) are a standard approach for supervised binary classification (Boser et al. 1992; Cortes and Vapnik 1995). The core idea is to find a separating hyperplane that optimally splits the feature space into a positive and a negative side according to the positive and negative labels of the data.

Obtaining labels for all units of interest can be costly. This is especially the case if one has to conduct a classic survey to obtain the labels. In such situations, it would be favorable to train the SVM on only partly labeled data, which yields a semi-supervised learning setting. Bennett and Demiriz (1998) formulate and solve the semi-supervised SVM (S\(^3\)VM) as a mixed-integer linear problem (MILP). Many strategies for solving the S\(^3\)VM have been proposed in the following decades, such as the transductive approach (TSVM) by Joachims (2002) and Yu et al. (2012) or manifold regularization (LapSVM) by Belkin et al. (2006) and Melacci and Belkin (2009). Some researchers also consider a balancing constraint, as done in meanS\(^3\)VM by Kontonatsios et al. (2017) and in c\(^3\)SVM by Chapelle et al. (2006). Moreover, the balancing constraint proposed by Chapelle and Zien (2005) enforces that the class proportions among the unlabeled data are similar to the proportions given by the labeled data.

In many cases, however, aggregated information about the number of positive and negative cases in a population is known from an external source. For example, in population surveys, there are population figures from official statistics agencies. This setting is studied, e.g., by Burgard et al. (2021), who develop a cardinality-constrained multinomial logit model and apply it in the context of micro-simulations. As another example, in some businesses, the total number of positive labels may be known but not which customer has a positive or a negative label. An intuitive example is a supermarket for which the number of cash payments is known. However, this information is not ex-post attributable to the individual customers. We propose to add this aggregated additional information to the optimization model by imposing a cardinality constraint on the predicted labels for the unlabeled data. As will be shown in our numerical experiments, this improves the accuracy of the classification of the unlabeled data. Furthermore, the inclusion of such a cardinality constraint is very useful in the case in which the labeled data is not a representative sample from the population. When obtaining the labels from process data or from online surveys, the inclusion process of the labeled data is generally not known. Such data sets are subsumed under the term non-probability sample. In this case, inverse inclusion probability weighting, as typically done in survey sampling, is not applicable. If the inclusion process is not controlled, strong over- or under-coverage of relevant information in the data set is possible and should be taken into account in the analysis. Not accounting for possible biases in the data generally leads to biased results.

We propose a big-M-based MIQP to solve the semi-supervised SVM problem with a cardinality constraint for the unlabeled data. Here, we restrict ourselves to the linear kernel. Other kernels such as Gaussian and polynomial ones can, in principle, be used as well. However, this would lead to additional nonlinear constraints in our mixed-integer model and would thus significantly increase the computational challenge of solving the problem. Although we strongly suspect that the problem is NP-hard, we do not provide a proof since we focus here on solution techniques and not on a formal complexity analysis of the problem. The cardinality constraint helps to account for biased samples since the number of positive predictions on the population is bounded by the constraint. The computation time for this MIQP grows rapidly with the number of variables, especially for an increasing number of integer variables. We develop an algorithm that uses a clustering-based model reduction to reduce the computation time. Similar reduction approaches can be found for the classic SVM using, e.g., fuzzy clustering (Almasi and Rouhani 2016; Cervantes et al. 2006), clustering-based convex hulls (Birzhandi and Youn 2019), and k-means clustering (de Almeida et al. 2000; Yao et al. 2013). We prove the correctness of our iterative clustering method and further show that it computes feasible points for the original problem. Hence, it also delivers proper upper bounds. Within our iterative approach, we additionally derive a scheme for updating the required big-M values and present tailored dimension-reduction as well as warm-starting techniques.

The paper is organized as follows. In Sect. 2, we describe our optimization problem and the big-M-based MIQP formulation. Afterward, the clustering-based model reduction technique is presented in Sect. 3. There, we also present our algorithm that combines the model reduction and the MIQP formulation. In Sect. 4, we discuss some algorithmic improvements such as the handling of data points that are far away from the hyperplane and the choice of M in the big-M formulation. In Sect. 5, we present how to use the solution of our algorithm to obtain the solution of the initial MIQP formulation by fixing some points on the correct side of the hyperplane. Finally, in Sect. 6, numerical results are reported and discussed and we conclude in Sect. 7.

2 An MIQP formulation for a cardinality-constrained semi-supervised SVM

Let \(X\in \mathbb {R}^{d \times N}\) be the data matrix with \(X_l = [x^1, \dotsc , x^n]\) being the labeled data and \(X_u = [x^{n+1}, \dotsc , x^N]\) being the unlabeled data. Hence, we have \(x^i \in {\mathbb {R}}^d\) for all \(i \in [1,N] \mathrel {{\mathop :}{=}}\{1,\ldots , N\}.\) We set \(m \mathrel {{\mathop :}{=}}N - n\) and denote by \(y \in \{-1,1\}^n\) the vector of class labels for the labeled data. If the data is linearly separable, the SVM provides a hyperplane \((\omega , b)\) that separates the positively and negatively labeled data. If the data is not linearly separable, the standard approach is to use the \(\ell _2\)-SVM by Cortes and Vapnik (1995) given by

$$\begin{aligned} \min _{\omega ,b,\xi } \quad&\frac{\Vert \omega \Vert ^2}{2} + C_1 \sum _{i=1}^n \xi _i \end{aligned}$$
(P1a)
$$\begin{aligned} \text {s.t.}\quad&y_i (\omega ^\top x^i -b) \ge 1 - \xi _i, \quad i \in [1,n], \end{aligned}$$
(P1b)
$$\begin{aligned}&\xi _i \ge 0, \quad i \in [1, n]. \end{aligned}$$
(P1c)

Here and in what follows, \(\Vert \cdot \Vert \) denotes the Euclidean norm. However, other norms such as the 1- or the max-norm could be used as well. To be able to include unlabeled data in the optimization process, Bennett and Demiriz (1998) propose the semi-supervised SVM (S\(^3\)VM). In many applications, aggregated information on the labels is available, e.g., from census data. In the following, we assume that the total number \(\tau \) of positive labels among the unlabeled data is known from an external source. We adapt the idea of the S\(^3\)VM such that we can use \(\tau \) as additional information in the optimization model. Our goal is to find optimal parameters \(\omega ^* \in \mathbb {R}^d,\) \(b^* \in \mathbb {R},\) \(\xi ^* \in \mathbb {R}^n,\) and \(\eta ^* \in \mathbb {R}^2\) that solve the optimization problem

$$\begin{aligned} \min _{\omega ,b,\xi ,\eta } \quad&\frac{\Vert \omega \Vert ^2}{2} + C_1 \sum _{i=1}^n \xi _i +C_2 (\eta _1 + \eta _2) \end{aligned}$$
(P2a)
$$\begin{aligned} \text {s.t.}\quad&y_i (\omega ^\top x^i +b) \ge 1 - \xi _i, \quad i \in [1,n], \end{aligned}$$
(P2b)
$$\begin{aligned}&\tau - \eta _ 1 \le \sum _{i=n+1}^N h_{\omega ,b}(x^i) \le \tau + \eta _2, \end{aligned}$$
(P2c)
$$\begin{aligned}&\xi _i \ge 0, \quad i \in [1, n] , \end{aligned}$$
(P2d)
$$\begin{aligned}&\eta _1, \eta _2 \ge 0, \end{aligned}$$
(P2e)

with

$$\begin{aligned} h_{\omega ,b}(x) = {\left\{ \begin{array}{ll} 1, &{}\text {if}\ \omega ^\top x + b \ge 0,\\ 0, &{}\text {otherwise}. \end{array}\right. } \end{aligned}$$

Note that the objective function in (P2a) is a compromise between maximizing the distance between the two classes and minimizing the classification error for the labeled and the unlabeled data. The penalty parameters \(C_1 > 0 \) and \(C_2>0\) control the importance of the slack variables \(\xi \) and \(\eta ,\) respectively. Constraint (P2b) enforces on which side of the hyperplane the labeled data \(x^i\) should lie. Constraint (P2c) ensures that approximately \(\tau \) unlabeled data points lie on the positive side. If \(\eta ^*_1>0\) holds for a solution \((\omega ^*, b^*,\xi ^*, \eta ^*),\) then fewer than \(\tau \) unlabeled points are classified as positive. On the other hand, if \(\eta ^*_2 > 0\) holds, more than \(\tau \) unlabeled points are classified as positive. If \(\eta ^*_1 = \eta ^*_2 =0\) holds, exactly \(\tau \) unlabeled points are classified as positive. Note that if a very high value is assigned to \(C_1\) or \(C_2,\) the objective function value is dominated by the corresponding slack variables.

The function \(h_{\omega ,b}(\cdot )\) in Constraint (P2c) is not continuous, which means that Problem (P2) cannot be easily solved by standard solvers. A typical way to overcome this problem is to add binary variables to turn on or off the enforcement of a constraint. By introducing binary variables \(z_i \in \{0,1\},\) \(i \in [n+1, N],\) we can reformulate the optimization Problem (P2) using the following big-M formulation:

$$\begin{aligned} \min _{\omega ,b,\xi ,\eta , z} \quad&\frac{\Vert \omega \Vert ^2}{2} + C_1 \sum _{i=1}^n \xi _i +C_2 (\eta _1 + \eta _2) \end{aligned}$$
(P3a)
$$\begin{aligned} \text {s.t.}\quad&y_i (\omega ^\top x^i +b) \ge 1 - \xi _i, \quad i \in [1,n], \end{aligned}$$
(P3b)
$$\begin{aligned}&\omega ^\top x^i +b \le z_i M, \quad i \in [n+1,N], \end{aligned}$$
(P3c)
$$\begin{aligned}&\omega ^\top x^i +b \ge -(1-z_i)M, \quad i \in [n+1,N], \end{aligned}$$
(P3d)
$$\begin{aligned}&\tau - \eta _1 \le \sum _{i=n+1}^N z_i \le \tau + \eta _2, \end{aligned}$$
(P3e)
$$\begin{aligned}&\xi _i \ge 0, \quad i \in [1, n], \end{aligned}$$
(P3f)
$$\begin{aligned}&\eta _1, \eta _2 \ge 0, \end{aligned}$$
(P3g)
$$\begin{aligned}&z_i \in \{0,1\}, \quad i \in [n+1, N], \end{aligned}$$
(P3h)

where M needs to be chosen sufficiently large. As \(z_i\) is binary, Constraints (P3c) and (P3d) lead to

$$\begin{aligned} \omega ^\top x^i +b > 0 \implies z_i = 1, \quad i \in [n+1,N],\\ \omega ^\top x^i +b < 0 \implies z_i = 0,\quad i \in [n+1,N]. \end{aligned}$$

If \(x^i\) lies on the hyperplane, i.e., \(\omega ^\top x^i +b = 0,\) Constraints (P3c) and (P3d) hold for both \(z_i = 1\) and \(z_i = 0.\) In this case, the point can be counted either on the positive or on the negative side. For this reason, Problem (P3) is not formally equivalent to Problem (P2). Reformulation (P3) is a mixed-integer quadratic problem (MIQP) in which all constraints are linear but the objective function is quadratic. We refer to this problem as CS\(^3\)VM.
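To make the formulation tangible, the following is a minimal sketch of how Problem (P3) could be set up with JuMP and Gurobi, the modeling stack we use for the experiments in Sect. 6.2. All function and parameter names are our own and solver settings such as the time limit are omitted.

```julia
# Sketch: setting up Problem (P3) with JuMP and Gurobi. Xl is the d×n matrix of labeled
# points, y their labels in {-1, 1}, Xu the d×m matrix of unlabeled points, τ the known
# number of positive unlabeled points, and M a valid big-M. All names are our own.
using JuMP, Gurobi

function build_cs3vm(Xl, y, Xu, τ, M; C1 = 1.0, C2 = 1.0)
    d, n = size(Xl)
    m = size(Xu, 2)
    model = Model(Gurobi.Optimizer)
    @variable(model, ω[1:d])
    @variable(model, b)
    @variable(model, ξ[1:n] >= 0)
    @variable(model, η[1:2] >= 0)
    @variable(model, z[1:m], Bin)
    # (P3b): soft-margin constraints for the labeled data
    @constraint(model, [i = 1:n], y[i] * (ω' * Xl[:, i] + b) >= 1 - ξ[i])
    # (P3c)/(P3d): big-M coupling between the side of x^i and the binary variable z_i
    @constraint(model, [i = 1:m], ω' * Xu[:, i] + b <= z[i] * M)
    @constraint(model, [i = 1:m], ω' * Xu[:, i] + b >= -(1 - z[i]) * M)
    # (P3e): cardinality constraint with slack variables η₁ and η₂
    @constraint(model, τ - η[1] <= sum(z))
    @constraint(model, sum(z) <= τ + η[2])
    # (P3a): margin term plus penalties for the slacks
    @objective(model, Min, 0.5 * (ω' * ω) + C1 * sum(ξ) + C2 * (η[1] + η[2]))
    return model
end
```

Calling optimize!(model) then solves the MIQP, and the hyperplane can be read off via value.(model[:ω]) and value(model[:b]).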

Having stated our first model, let us shed some light on the results depending on whether the standard SVM or CS\(^3\)VM is used. Figure 1 shows a 2-dimensional example data set and the corresponding hyperplanes for SVM and CS\(^3\)VM. In this case, \(\tau = 11,\) i.e., 11 unlabeled points belong to the positive class. Note that SVM only classifies 6 unlabeled points as positive, while CS\(^3\)VM classifies 11 as such. The point that lies on the CS\(^3\)VM hyperplane is classified as positive because the binary variable corresponding to this point is 1. This example shows that using \(\tau \) as additional information can improve the classification of unlabeled points.

Fig. 1

A 2-dimensional example (left) and the hyperplanes resulting from the SVM and the CS\(^3\)VM (right)

In the big-M formulation, the choice of M is crucial. If M is too small, the problem can become infeasible or optimal solutions could be cut off. If M is chosen too large, the respective continuous relaxations usually lead to bad lower bounds and solvers may encounter numerical troubles. The choice of M is discussed in the following lemma and theorem. In Lemma 1 we show how M is related to the objective function and the given data. This is then used in Theorem 2 to derive a provably correct big-M.

Lemma 1

Given a feasible point for Problem (P3) with an objective function value f,  an optimal solution \((\omega ^*, b^*, \xi ^*, \eta ^*, z^*)\) of (P3) satisfies

$$\begin{aligned} \Vert \omega ^* \Vert \le \sqrt{2f} \quad \text {and}\quad \vert b^* \vert \le \Vert \omega ^* \Vert \max _{i \in [1,N]} \Vert x^i \Vert + 1 \end{aligned}$$

and,  consequently,  every optimal solution satisfies (P3c) and (P3d) for

$$\begin{aligned} M = 2 \sqrt{2f} \max _{i \in [1,N]} \Vert x^i \Vert + 1. \end{aligned}$$

Proof

Due to optimality, we get

$$\begin{aligned} \frac{\Vert \omega ^* \Vert ^2}{2} \le \frac{\Vert \omega ^* \Vert ^2}{2} + C_1 \sum _{i = 1}^n \xi ^*_i + C_2(\eta ^*_1 + \eta ^*_2)\le f \implies \Vert \omega ^* \Vert \le \sqrt{2f}. \end{aligned}$$

The second inequality is shown by contradiction. To this end, we w.l.o.g. assume that \(\tilde{b} = \Vert \omega ^* \Vert \max _{i \in [1,N]} \Vert x^i \Vert +1 + \delta \) is part of an optimal solution for some \(\delta > 0.\) Using the inequality of Cauchy–Schwarz then yields

$$\begin{aligned} (\omega ^{*})^\top x^i + \tilde{b}&= (\omega ^{*})^\top x^i + \Vert \omega ^* \Vert \max _{j \in [1,N]} \Vert x^j \Vert +1 + \delta \\&\ge - \Vert \omega ^* \Vert \Vert x^i \Vert + \Vert \omega ^* \Vert \max _{j \in [1,N]} \Vert x^j \Vert +1 + \delta \\&> 1 \end{aligned}$$

for all \(i \in [1,N].\) Hence, for all \(i \in [1,n]\) with \(y_i = 1,\) we get \(\tilde{\xi }_i = 0\) from Constraint (P3b) and the objective function. Moreover, for \(i \in [1,n]\) with \(y_i = -1,\) the same reasoning implies

$$\begin{aligned} - (\omega ^{*})^\top x^i - \tilde{b} = 1 - \tilde{\xi }_i \implies \tilde{\xi }_i = 2 + (\omega ^{*})^\top x^i + \Vert \omega ^* \Vert \max _{j \in [1,N]} \Vert x^j \Vert + \delta . \end{aligned}$$

Besides that, for the unlabeled data \(i\in [n+1,N] ,\) since \( (\omega ^{*})^\top x^i + \tilde{b} >1,\) we get \(\tilde{z}_i = 1,\) which leads to

$$\begin{aligned} \sum _{i=n+1}^N \tilde{z}_i = m \implies \tilde{\eta }_1 = 0, \ \tilde{\eta }_2 = m -\tau . \end{aligned}$$

This means that the objective function value for the point \((\omega ^*,\tilde{b}, \tilde{\xi }, \tilde{\eta }, \tilde{z})\) is given by

$$\begin{aligned} \tilde{f} \mathrel {{\mathop :}{=}}\frac{\Vert \omega ^* \Vert ^2}{2} + C_1 \sum _{i : y_i =-1} \left( 2 + (\omega ^{*})^\top x^i + \Vert \omega ^* \Vert \max _{j \in [1,N]} \Vert x^j \Vert + \delta \right) + C_2( m -\tau ). \end{aligned}$$

However, if we set \(\bar{b} \mathrel {{\mathop :}{=}}\Vert \omega ^* \Vert \max _{i \in [1,N]} \Vert x^i \Vert +1, \) we get

$$\begin{aligned} (\omega ^{*})^\top x^i + \bar{b} \ge 1, \quad i \in [1,N], \end{aligned}$$

i.e., \(z_i = 1\) for all \(i \in [n+1,N] ,\) \(\bar{\eta }_1 = 0,\) \(\bar{\eta }_2 = m -\tau ,\) and \(\bar{\xi }_i = 0 \) for i with \(y_i = 1.\) Moreover, for \(i \in [1,n]\) with \(y_i = -1,\) from Constraint (P3b) we obtain

$$\begin{aligned} - (\omega ^{*})^\top x^i - \bar{b} = 1-\bar{\xi }_i \implies \bar{\xi }_i = 2 + (\omega ^{*})^\top x^i + \Vert \omega ^* \Vert \max _{j \in [1,N]} \Vert x^j \Vert . \end{aligned}$$

All this implies that the objective function value \(\bar{f}\) for the point \((\omega ^*,\bar{b}, \bar{\xi }, \bar{\eta }, \bar{z})\) satisfies

$$\begin{aligned} \bar{f} \mathrel {{\mathop :}{=}}\frac{\Vert \omega ^* \Vert ^2}{2} + C_1 \sum _{i{:}\,y_i =-1}(2 + (\omega ^{*})^\top x^i + \Vert \omega ^* \Vert \max _{j \in [1,N]} \Vert x^j \Vert ) \ + C_2( m -\tau ) < \tilde{f}, \end{aligned}$$

which contradicts the assumption that \(\tilde{b}\) is part of an optimal solution. Hence,

$$\begin{aligned} \vert b^* \vert \le \Vert \omega ^* \Vert \max _{i \in [1,N]} \Vert x^i \Vert +1 \end{aligned}$$

holds, which proves the second inequality. Note further that

$$\begin{aligned} (\omega ^{*})^\top x^i + b^* \le \Vert \omega ^* \Vert \Vert x^i \Vert + \vert b^* \vert \le 2\sqrt{2f} \max _{j \in [1,N]} \Vert x^j \Vert + 1 = M \end{aligned}$$

and

$$\begin{aligned} (\omega ^{*})^\top x^i + b^* \ge - \Vert \omega ^* \Vert \Vert x^i \Vert - \vert b^* \vert \ge - 2\sqrt{2f} \max _{j \in [1,N]} \Vert x^j \Vert - 1 = -M \end{aligned}$$

holds for all \(i \in [n+1,N].\) \(\square \)

We now use the result from the last technical lemma to obtain a provably correct big-M.

Theorem 2

A valid big-M for Problem (P3) is given by

$$\begin{aligned} M = 2\sqrt{2(2C_1\bar{n} + C_2(m-\tau ))}\max _{i \in [1,N]}\Vert x^ i \Vert +1 \end{aligned}$$
(1)

with \(\bar{n} \mathrel {{\mathop :}{=}}\vert \{i \in [1,n]: y_i = -1\}\vert .\)

Proof

Consider the feasible point of (P3) given by \(\omega = 0 \in \mathbb {R}^d\) and \(b=1.\) Since \(\omega ^\top x^i + b = 1\) holds for all \(i\in [1,N],\) Constraint (P3b) implies

$$\begin{aligned} \xi _i = {\left\{ \begin{array}{ll} 2, &{}\text {if}\ y_i = -1,\\ 0, &{}\text {otherwise}. \end{array}\right. } \end{aligned}$$

Moreover, using Constraints (P3c)–(P3e) leads to

$$\begin{aligned} z_i = 1, \ i \in [n+1,N], \quad \eta _1 = 0, \quad \eta _2 = m - \tau , \end{aligned}$$

which implies that the objective function for the point \((\omega , b, \xi , \eta , z)\) is given by

$$\begin{aligned} f = 0 + 2C_1\bar{n} + C_2(m-\tau ). \end{aligned}$$

Finally, from Lemma 1, we get

$$\begin{aligned} M = 2\sqrt{2(2C_1\bar{n} + C_2(m-\tau ))}\max _{i \in [1,N]}\Vert x^ i \Vert + 1. \end{aligned}$$

\(\square \)
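For illustration, the valid big-M from (1) can be computed directly from the data. The following sketch does exactly that; the function name and argument layout are our own.

```julia
# Sketch: computing the provably valid big-M from Theorem 2, Eq. (1). X is the d×N matrix
# of all (labeled and unlabeled) data points, y the labels of the labeled data, m the
# number of unlabeled points, and τ the known number of positive unlabeled points.
using LinearAlgebra

function valid_bigM(X, y, m, τ; C1 = 1.0, C2 = 1.0)
    nbar = count(==(-1), y)                 # n̄ = |{i ∈ [1,n] : y_i = -1}|
    f = 2 * C1 * nbar + C2 * (m - τ)        # objective value of the feasible point ω = 0, b = 1
    maxnorm = maximum(norm(X[:, i]) for i in 1:size(X, 2))
    return 2 * sqrt(2 * f) * maxnorm + 1
end
```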

3 A re-clustering method for solving CS\(^3\)VM

In Model (P3) of the last section, each binary variable is related to an unlabeled point. The larger the number of unlabeled data points, the larger the number of binary variables and, hence, the larger the computational burden of solving Problem (P3). To reduce this computational burden, we propose to cluster the unlabeled data. This way, only one binary variable per cluster is needed. For every cluster, we use its centroid as its representative point. To obtain clusterings, we use minimum sum-of-squares clustering (MSSC). The MSSC problem is NP-hard; see, e.g., Aloise et al. (2009), Mahajan et al. (2012), and Dasgupta (2007). However, we do not need a globally optimal solution of the MSSC problem, as will be shown below. Given a number k of clusters and a matrix \(S =[s^1, \dotsc , s^p] \in \mathbb {R}^{d\times p}\) of data points, the goal of the MSSC is to find mean vectors \(c^j \in \mathbb {R}^d,\) \(j \in [1, k],\) that solve the problem

$$\begin{aligned} c^* = {\mathop {\mathrm{arg\,min}}\limits _{c}} \ \ell (S,c), \quad c = (c^j)_{j=1,\ldots ,k}, \end{aligned}$$

where the loss function \(\ell \) is the sum of the squared Euclidean distances, i.e.,

$$\begin{aligned} \ell (S,c) = \sum _{j=1}^k \sum _{s^i \in {\mathcal {C}}_j} \Vert s^i - c^j \Vert ^2 \end{aligned}$$

with \({\mathcal {C}}_j \subset \mathbb {R}^{d}\) being the set of data points that are assigned to cluster j.

We solve this problem heuristically using the k-means algorithm (MacQueen 1967; Lloyd 1982) for \(S = X_u,\) i.e., we cluster the unlabeled data. Then, instead of using all unlabeled data as in the last section, we only use the clusters’ centroids \(c^1, \dotsc , c^k\) and the numbers \(e_1, \dots , e_k\) of data points in each cluster to obtain the problem

$$\begin{aligned} \min _{\omega ,b,\xi ,\eta , z} \quad&\frac{\Vert \omega \Vert ^ 2 }{2} + C_1 \sum _{i=1}^n \xi _i +C_2 (\eta _1 + \eta _2) \end{aligned}$$
(P4a)
$$\begin{aligned} \text {s.t.}\quad&y_i (\omega ^\top x^i+b) \ge 1 - \xi _i, \quad i \in [1,n], \end{aligned}$$
(P4b)
$$\begin{aligned}&\omega ^\top c^j +b \le z_j M, \quad j\in [1,k], \end{aligned}$$
(P4c)
$$\begin{aligned}&\omega ^\top c^j +b \ge -(1-z_j) M, \quad j \in [1,k], \end{aligned}$$
(P4d)
$$\begin{aligned}&\tau - \eta _1 \le \sum _{j=1}^k e_jz_j \le \tau + \eta _2, \end{aligned}$$
(P4e)
$$\begin{aligned}&\xi _i \ge 0, \quad i \in [1, n], \end{aligned}$$
(P4f)
$$\begin{aligned}&\eta _1, \eta _2 \ge 0, \end{aligned}$$
(P4g)
$$\begin{aligned}&z_j \in \{0,1\}, \quad j \in [1,k]. \end{aligned}$$
(P4h)

A valid big-M is still given by (1) as shown in the next proposition.

Proposition 1

If \(e_j \ge 1\) for all \(j \in [1,k],\) a valid big-M for Problem (P4) is given by (1).

Proof

The proof follows the same lines as the proofs of Lemma 1 and Theorem 2 with the additional observation that for all \(j \in [1,k],\) it holds

$$\begin{aligned} \Vert c^j \Vert = \frac{ 1 }{e_j} \left\| \sum _{i : x^i\in {\mathcal {C}}_j}x^i\right\| \le \frac{e_j \max _{i \in [n+1,N]}\Vert x^i \Vert }{e_j} = \max _{i \in [n+1,N]}\Vert x^i \Vert . \end{aligned}$$

\(\square \)
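In practice, the initial clustering could, for instance, be obtained as follows. The sketch assumes the Clustering.jl package, which is our choice for illustration; the method itself does not depend on a particular k-means implementation.

```julia
# Sketch: obtaining the inputs of Problem (P4), i.e., the centroids c^1, …, c^k and the
# cluster sizes e_1, …, e_k, from the unlabeled data. We assume the Clustering.jl package
# here; any other k-means implementation works as well.
using Clustering

function cluster_unlabeled(Xu::Matrix{Float64}, k::Int)
    res = kmeans(Xu, k)                                 # Xu is d×m, one column per point
    centroids = res.centers                             # d×k matrix of centroids c^j
    assignments = res.assignments                       # cluster index of every unlabeled point
    sizes = [count(==(j), assignments) for j in 1:k]    # e_j = |C_j|
    return centroids, assignments, sizes
end
```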

It can happen that the hyperplane given by \((\omega ^*,b^*)\) that results from the solution of Problem (P4) cuts through some cluster. This means that not all data points of the cluster actually lie on the same side of the hyperplane. If this happens, the solution of Problem (P4) does not satisfy the cardinality constraint (P3e) of Problem (P3). To fix this, we propose an iterative method that is formally listed in Algorithm 1. Note that the use of the k-means algorithm is helpful here as it automatically provides the convex hulls of the clusters. Hence, it is easy to check if the hyperplane cuts through some cluster or not.

Algorithm 1

Re-Clustering Method (RCM)

If Algorithm 1 terminates, all points of each cluster lie on the same side of the final hyperplane, which implies that the cardinality constraint (P3e) is satisfied. Note that the k-means algorithm is only called once to initialize the clustering. For all other iterations, we manually split clusters if they are cut by the hyperplane of the respective iteration and compute the new centroids directly; a sketch of this splitting step is given below.
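The following sketch illustrates this splitting step for a single cluster; the function name and the handling of points exactly on the hyperplane (counted on the positive side, consistent with \(h_{\omega ,b}\)) are our choices.

```julia
# Sketch of the splitting step: if the hyperplane (w, b) cuts a cluster, the cluster is
# split into its positive-side and negative-side points and both parts get new centroids.
using Statistics

function split_if_cut(points::Matrix{Float64}, w::Vector{Float64}, b::Float64)
    vals = [w' * points[:, i] + b for i in 1:size(points, 2)]
    pos = findall(>=(0), vals)
    neg = findall(<(0), vals)
    if isempty(pos) || isempty(neg)
        return nothing                                # cluster is not cut: keep it as is
    end
    cpos = vec(mean(points[:, pos]; dims = 2))        # centroid of the positive part
    cneg = vec(mean(points[:, neg]; dims = 2))        # centroid of the negative part
    return (pos, cpos), (neg, cneg)
end
```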

The next theorem establishes that Algorithm 1 always terminates after finitely many iterations.

Theorem 3

Suppose that \(e_j \ge 1\) for all \(j\in [1,k^1]\) after Step 1 of Algorithm 1. Then, Algorithm 1 terminates after at most \(m-k^1\) iterations, where m is the number of unlabeled data points and \(k^1\) is the number of initial clusters.

Proof

Observe that since we cluster m unlabeled points, the maximum number of clusters we can obtain is m. Besides that, if Algorithm 1 does not terminate in an iteration t, at least one cluster is split in Step 6. Because we start with \(k^1\) clusters and since in each iteration we increase the number of clusters by at least one, the maximum number of iterations is \(m-k^1.\) \(\square \)

Note that the point obtained by Algorithm 1 is not necessarily a minimizer of Problem (P3). However, the objective function value of the point obtained by Algorithm 1 is an upper bound for the objective function value of Problem (P3).

Theorem 4

Let \((\bar{\omega }, \bar{b}, \bar{\xi }, \bar{\eta }, \bar{z})\) be the point returned by Algorithm 1. Then,  \((\bar{\omega }, \bar{b}, \bar{\xi }, \bar{\eta }, \bar{z})\) is feasible for Problem (P3) with

$$\begin{aligned} M = 2\sqrt{2\bar{f}}\max _{i \in [1,N]} \Vert x^i \Vert +1 \end{aligned}$$

and,  consequently, 

$$\begin{aligned} \bar{f} \mathrel {{\mathop :}{=}}\frac{\Vert \bar{\omega } \Vert ^ 2 }{2} + C_1\sum _{i=1}^n\bar{ \xi _i} +C_2 (\bar{\eta }_1 + \bar{\eta }_2) \end{aligned}$$

is an upper bound of Problem (P3).

Proof

For all clusters \({\mathcal {C}}_j,\) \(j\in \{1, \dotsc , k^t\},\) where t is the final iteration of Algorithm 1, we set \(\tilde{z}_i = \bar{z}_j\) for all i with \(x^i \in {\mathcal {C}}_j.\) We now show that \((\bar{\omega }, \bar{b}, \bar{\xi }, \bar{\eta }, \tilde{z})\) is a feasible point for Problem (P3). Indeed, Constraints (P3b), (P3f), (P3g), and (P3h) are clearly fulfilled. Furthermore, since

$$\begin{aligned} \sum _{i \in {\mathcal {C}}_j} \tilde{z}_i = e_j\bar{z}_j \end{aligned}$$

for all \(j \in [1, k^t],\) using (P4e) we get

$$\begin{aligned} \sum _{i=n+1}^N \tilde{z}_i = \sum _{j=1}^{k^{t}} e_j \bar{z}_j \implies \tau - \bar{\eta }_1 \le \sum _{i=n+1}^N \tilde{z}_i \le \tau + \bar{\eta }_2 \end{aligned}$$

and Constraint (P3e) is satisfied. Besides that,

$$\begin{aligned} \frac{\Vert \bar{\omega } \Vert ^2}{2} \le \bar{f} \implies \Vert \bar{\omega } \Vert \le \sqrt{2\bar{f}} \end{aligned}$$
(2)

holds and as in Lemma 1, we get

$$\begin{aligned} \vert \bar{b} \vert \le \Vert \bar{\omega } \Vert \max _{i \in [1,N]} \Vert x^i \Vert +1. \end{aligned}$$
(3)

Moreover, by construction, for all \(i \in \{n+1, \dots N\}\) with \(\tilde{z}_i = 1,\) \(x^i\) belongs to a cluster \({\mathcal {C}}_j\) such that \(\bar{\omega }^\top c^j +\bar{b}\ge 0 .\) Using the fact that all points in \({\mathcal {C}}_j\) are on the same side of the hyperplane, this side must be the positive one. This fact together with (2) and (3) implies

$$\begin{aligned} -(1- \tilde{z}_i) M = 0\le \bar{\omega }^\top x^i + \bar{b}&\le \Vert \bar{\omega }\Vert \max _{i \in [1,N]} \Vert x^i \Vert + \vert \bar{b} \vert \\&\le 2\sqrt{2\bar{f}}\max _{i \in [1,N]} \Vert x^i \Vert + 1 = M = \tilde{z}_i M. \end{aligned}$$

Similarly, for all \(i \in \{n+1, \dots N\}\) with \(\tilde{z}_i = 0,\) we get

$$\begin{aligned} -M = -(1- \tilde{z}_i) M \le \bar{\omega }^\top x^i + \bar{b} \le 0 = \tilde{z}_i M \end{aligned}$$

and (P3c) as well as (P3d) are fulfilled. Because \((\bar{\omega }, \bar{b}, \bar{\xi }, \bar{\eta }, \tilde{z})\) is a feasible point for Problem (P3), \(\bar{f}\) is an upper bound for Problem (P3). \(\square \)

Note, finally, that since the point obtained from Algorithm 1 is feasible for Problem (P3), we can use it for warm starting.
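In JuMP, such a warm start could, for example, be passed to the solver as a MIP start via set_start_value. The sketch below refers to the variables of the Problem (P3) model from Sect. 2; the names of the starting values are hypothetical.

```julia
# Sketch: passing the feasible point from Algorithm 1 to the solver as a MIP start.
# `model` is the JuMP model of Problem (P3); w_rcm, b_rcm, ξ_rcm, η_rcm are the values
# returned by Algorithm 1, and z_rcm is the vector z̃ constructed in the proof of
# Theorem 4 (one entry per unlabeled point).
using JuMP

set_start_value.(model[:ω], w_rcm)
set_start_value(model[:b], b_rcm)
set_start_value.(model[:ξ], ξ_rcm)
set_start_value.(model[:η], η_rcm)
set_start_value.(model[:z], z_rcm)
```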

4 Further algorithmic enhancements

In order to reduce computational costs, we propose two additional enhancements. The first one (see Sect. 4.1) makes use of the fact that the SVM is mostly influenced by data points that are close to the separating hyperplane. The second one (see Sect. 4.2) introduces a rule for updating M in each iteration of Algorithm 1.

4.1 Handling points far from the hyperplane

In Algorithm 1, the number of clusters increases in each iteration. Hence, the time to solve Problem (P4) increases from iteration to iteration in general. Like in the original SVM, the points closest to the hyperplane influence the resulting hyperplane more than the other points. Obviously, eliminating points that do not strongly influence the hyperplane decreases the size of the problem. Some approaches to eliminate these points have also been proposed for the original SVM. For a survey, see, e.g., Birzhandi et al. (2002). However, most of these approaches are heuristics and do not necessarily yield a feasible point of the problem.

The idea for our setting is the following. Clusters that are far away from the hyperplane could be omitted as this will not change the solution. The farther a cluster is from the hyperplane in an iteration, the less likely it is that the cluster will be split or change sides completely in a future iteration. Hence, the clusters farthest from the current hyperplane mainly add information about their side and capacity. However, in a later iteration, the cluster may become relevant again. Thus, we need to find a way to discard detailed information on certain clusters but also a way to reactivate the discarded clusters if necessary.

We propose the following procedure to reduce the number of clusters that have to be considered in the current iteration of the algorithm. If the number of clusters exceeds a fixed value \(k^+,\) we first fix the cluster with the centroid farthest from the hyperplane as a kind of residual cluster on a side if this side has points far from the hyperplane. Second, we discard all clusters in which all points are farther from the hyperplane than some threshold \(\Delta ^t\) and assign them to the residual cluster on their side of the hyperplane. This way, the cardinality constraint remains valid. Moreover, all formerly discarded clusters are checked for re-consideration. If a discarded cluster has a point with a distance to the hyperplane less than \(\Delta ^t\) or if any point in the cluster changed sides, the cluster is reactivated.

Let \(\bar{S} = (s_{\alpha (1)}, \dotsc , s_{\alpha (d)})^\top \) be the vector of increasingly sorted values of \(S = \{s_1, \dots , s_d\}\) and let \(a \in (0,1).\) The a-quantile of S,  as proposed by Hyndman and Fan (1996), is given by

$$\begin{aligned} P_S(a) \mathrel {{\mathop :}{=}}s_{\alpha (q)} + \frac{s_{\alpha (q)} - s_{\alpha (r)}}{q - r} \left( (d-1)a - q +1 \right) \end{aligned}$$

with

$$\begin{aligned} q \mathrel {{\mathop :}{=}}\max _{i \in [1,d]} \left\{ i{:}\,\frac{i-1}{d-1} \le a \right\} , \quad r \mathrel {{\mathop :}{=}}\min _{i \in [1,d]} \left\{ i{:}\,\frac{i-1}{d-1} \ge a \right\} . \end{aligned}$$

Given a parameter \(\hat{\Delta }^t \in (0,1),\) we choose \(\Delta ^t\) in each iteration t according to

$$\begin{aligned} \Delta ^t = P_{D^{t}}(\hat{\Delta }^t) \quad \text {with} \quad D^{t}_j = \left| (\omega ^t)^\top c_{j} + b^t \right| \quad \text {for all } j \in \left[ 1, k^t\right] . \end{aligned}$$
(4)
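A small sketch of how the quantile and the resulting threshold \(\Delta ^t\) in (4) could be computed is given below; the helper names are ours and the quantile follows the formula of Hyndman and Fan (1996) stated above.

```julia
# Sketch: the quantile P_S(a) of Hyndman and Fan (1996) and the threshold Δ^t from (4),
# computed from the centroid distances to the current hyperplane.
function quantile_HF(S::Vector{Float64}, a::Float64)
    s = sort(S)                              # increasingly sorted values s_{α(1)}, …, s_{α(d)}
    d = length(s)
    h = (d - 1) * a + 1                      # then q = ⌊h⌋ and r = ⌈h⌉ as in the text
    q = clamp(floor(Int, h), 1, d)
    r = clamp(ceil(Int, h), 1, d)
    q == r && return s[q]
    return s[q] + (s[r] - s[q]) * (h - q)
end

# distances D^t_j of the centroids to the hyperplane and the resulting threshold Δ^t
distances(w, b, centroids) = [abs(w' * centroids[:, j] + b) for j in 1:size(centroids, 2)]
threshold(w, b, centroids, Δhat) = quantile_HF(distances(w, b, centroids), Δhat)
```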

Note that if, in an iteration t, a point in some discarded cluster changed sides, the vector z that is part of the current solution does not reflect this change. This happens when, e.g., \((\omega ^{t-1})^\top x^i +b^{t-1} > 0\) and \((\omega ^{t})^\top x^i +b^{t} < 0\) but \(z^t_j>0\) with \({\mathcal {C}}_j\) being the cluster with the centroid farthest from the hyperplane on the positive side. To avoid that this happens too often, \(\hat{\Delta }^{t+1}\) is increased by a fixed value \(\tilde{\Delta } \in (0,1)\) whenever there is some point in some discarded cluster that has changed sides.

Motivated by the above discussion, we add new steps to Algorithm 1, which results in Algorithm 2. In Step 5, if the number of clusters exceeds \(k^+,\) clusters far from the hyperplane are discarded. In Steps 9 and 10, discarded clusters that contain a point that changed sides or that lies closer to the hyperplane than \(\Delta ^t\) are reactivated. In Step 12, \(\hat{\Delta }^t\) is updated.

Algorithm 2

Improved Re-Clustering Method (IRCM)

4.2 Updating the Big-M

As discussed in Sect. 2, M needs to be sufficiently large. However, the larger M is, the more likely we are to face numerical issues. As shown in Sect. 2, the smaller the objective function value provided by a feasible point, the smaller the value of M can be chosen. Based on that, we update M in each iteration with the aim of decreasing it. We do this by adding Step 16 in Algorithm 2 and the next theorem justifies this.

Theorem 5

Consider \(X, y, C_1, C_2, \tau ,\) as well as \(c^1,\ldots , c^{k^t}\) and \(e_1, \dots, e_{k^t}\) in an iteration t of Algorithm 1. Then, the optimal solution \((\bar{\omega }^t,\bar{b}^t, \bar{\xi }^t, \bar{\eta }^t, \bar{z}^t)\) of Problem (P4) provides an upper bound

$$\begin{aligned} \tilde{f}_t \mathrel {{\mathop :}{=}}\frac{\Vert \bar{\omega }^t \Vert ^2}{2} + C_1 \sum _{i=1}^n \bar{\xi _i} + C_2(\tilde{\eta }_1 + \tilde{\eta }_2 ), \end{aligned}$$
(5)

with

$$\begin{aligned} \tilde{z}_j = {\left\{ \begin{array}{ll} 1, &{} \text {if } \left( \bar{\omega }^{t}\right) ^\top c^j + \bar{b}^t \ge 0,\\ 0, &{} \text {otherwise}, \end{array}\right. } \quad j \in \left[ 1, k^{t+1}\right] , \end{aligned}$$
(6)

and

$$\begin{aligned} \tilde{\eta }_1 = \max \left\{ 0, \tau - \sum _{j=1}^{k^{t+1}} e_j\tilde{z}_j\right\} , \quad \tilde{\eta }_2 = \max \left\{ 0, \sum _{j=1}^{k^{t+1}} e_j\tilde{z}_j - \tau \right\} , \end{aligned}$$
(7)

for Problem (P4) with \(c^1,\ldots , c^{k^{t+1}}\) and \(e_1, \ldots, e_{k^{t+1}}\) as updated in iteration t and with

$$\begin{aligned} M = 2 \sqrt{2\tilde{f_t}} \max _{i \in [1,N]} \Vert x^i \Vert + 1. \end{aligned}$$
(8)

Proof

Consider \(\tilde{z}\) as given in (6) and \(\tilde{\eta }_1, \tilde{\eta }_2\) as given in (7). We now show that \((\bar{\omega }^t,\bar{b}^t, \bar{\xi }^t, \tilde{\eta }, \tilde{z} )\) is a feasible point for Problem (P4). Indeed, Constraints (P4b) and (P4e)–(P4h) are clearly satisfied. Moreover, the point \((\bar{\omega }^t,\bar{b}^t, \bar{\xi }^t, \tilde{\eta }, \tilde{z})\) has the objective function value \(\tilde{f}_t\) given by (5) and, hence,

$$\begin{aligned} \Vert \bar{\omega }^t \Vert \le \sqrt{2\tilde{f}_t }, \quad \vert \bar{b}^t \vert \le \Vert \bar{\omega }^t \Vert \max _{i \in [1,N]} \Vert x^i \Vert +1, \end{aligned}$$

see the proof of Lemma 1. This together with \(\Vert c^j \Vert \le \max _{i \in [n+1,N]}\Vert x^i \Vert \) implies

$$\begin{aligned} \left( \bar{\omega }^t\right) ^\top c^j + \bar{b}^t \le \Vert \bar{\omega }^t \Vert \max _{i \in [n+1,N]}\Vert x^i \Vert + \vert \bar{b}^t \vert \le 2 \sqrt{2\tilde{f}_t} \max _{i \in [1,N]} \Vert x^i \Vert +1 = M \end{aligned}$$

and

$$\begin{aligned} \left( \bar{\omega }^t\right) ^\top c^j + \bar{b}^t \ge -M. \end{aligned}$$

Hence, Constraints (P4c) and (P4d) are satisfied. Since \((\bar{\omega }^t, \bar{b}^t, \bar{\xi }^t, \tilde{z}, \tilde{\eta })\) is a feasible point for Problem (P4), \(\tilde{f}_t\) is an upper bound for Problem (P4). \(\square \)

Using Theorem 5, we can update M in each iteration of Algorithm 2 as in (8). The following theorem establishes that, like Algorithm 1, Algorithm 2 always terminates after finitely many iterations.

Theorem 6

Algorithm 2 terminates after at most

$$\begin{aligned} 2m-k^1 + \frac{(1-\hat{\Delta }^1)}{\tilde{\Delta }} \end{aligned}$$

iterations,  where m is the number of unlabeled data points,  \(k^1\) is the number of initial clusters,  and \(\hat{\Delta }^1, \tilde{\Delta }\) are inputs of Algorithm 2.

Proof

In Algorithm 2, the number of iterations can only be greater than in Algorithm 1 if there is some iteration t for which \({\mathcal {J}}^t \ne \emptyset \) holds but the hyperplane does not cut any cluster. At each iteration in which this happens, \(\hat{\Delta }^t\) is increased and, in the worst case, i.e., for

$$\begin{aligned} \hat{t} \mathrel {{\mathop :}{=}}m-k^1 + \frac{(1-\hat{\Delta }^1)}{\tilde{\Delta }}, \end{aligned}$$

we get \(\hat{\Delta }^{\hat{t}} = 1.\) This implies that for all further iterations t

$$\begin{aligned} \Delta ^t = \max _{j \in [1,k^t]} \vert (\omega ^t)^\top c^j + b^t\vert \end{aligned}$$

holds. Thus, no cluster is added to the set \({\mathcal {G}}^t.\) Since \(\vert {\mathcal {G}}^{\hat{t}} \vert \le m\) and \({\mathcal {J}}^t \subset {\mathcal {G}}^{\hat{t}},\) Algorithm 2 can only have m more iterations with \({\mathcal {J}}^t \ne \emptyset .\) This means that the maximum number of iterations is \(2\,m - k^1 + (1-\hat{\Delta }^1)/\tilde{\Delta }.\) \(\square \)

Although Theorem 6 shows that, in the worst case, Algorithm 2 can take more iterations than Algorithm 1 to terminate, Algorithm 2 solves problems with fewer binary variables in every iteration, which means that the time per iteration is expected to be lower than for Algorithm 1.

Note that the objective function value obtained by Algorithm 2 is an upper bound for the objective function value of Problem (P3).

Theorem 7

Let \((\bar{\omega }, \bar{b}, \bar{\xi }, \bar{\eta }, \bar{z})\) be the point returned by Algorithm 2. Then,  \((\bar{\omega }, \bar{b}, \bar{\xi }, \bar{\eta }, \bar{z})\) is feasible for Problem (P3) with

$$\begin{aligned} M = 2\sqrt{2\bar{f}}\max _{i \in [1,N]} \Vert x^i \Vert + 1 \end{aligned}$$

and,  consequently,

$$\begin{aligned} \bar{f} \mathrel {{\mathop :}{=}}\frac{\Vert \bar{\omega } \Vert ^ 2 }{2} + C_1\sum _{i=1}^n\bar{ \xi _i} +C_2 (\bar{\eta }_1 + \bar{\eta }_2) \end{aligned}$$

is an upper bound of Problem (P3).

Proof

Since Algorithm 2 terminates when no cluster changes the side and no cluster is cut by the hyperplane, the proof is the same as for Theorem 4. \(\square \)

As before, we can use the point obtained from Algorithm 2 to warm start Problem (P3).

5 Using IRCM for warm-starting

As stated in Theorem 7, the solution found by Algorithm 2 is feasible for Problem (P3). Hence, we can use it for warm-starting the solution process of Problem (P3). The next lemma establishes that unlabeled points can be fixed to lie on one side of the hyperplane.

Lemma 8

Let \((\bar{\omega }, \bar{b}, \bar{\xi }, \bar{\eta }, \bar{z})\) be a feasible point of Problem (P3) with objective function value \(\bar{f}.\) Furthermore, let \((\omega ^*,b^*, \xi ^ *, \eta ^*, z^*)\) be an optimal solution of Problem (P3) with objective function value \(f^*.\) Set

$$\begin{aligned} P_u&\mathrel {{\mathop :}{=}}\left\{ i \in [n+1,N]:(\omega ^*)^{\top } x^i + b^* >0 \right\} ,\\ N_u&\mathrel {{\mathop :}{=}}\left\{ i \in [n+1,N]:(\omega ^*)^{\top } x^i + b^* < 0\right\} , \end{aligned}$$

and let \(S_p \subseteq P_u,\) \(S_n \subseteq N_u\) be arbitrarily chosen subsets and let \(x^s \notin S_n\) be an unlabeled point with \(\bar{\omega }^{\top } x^s + \bar{b} <0.\) Then,  the objective function value \(\tilde{f}\) given by any feasible point of the problem

$$\begin{aligned} \min _{\omega ,b,\xi ,\eta , z} \quad&\frac{\Vert \omega \Vert ^ 2 }{2} + C_1 \sum _{i=1}^n \xi _i +C_2 (\eta _1 + \eta _2) \end{aligned}$$
(P5a)
$$\begin{aligned} \text {s.t} \quad&y_i (\omega ^{\top } x^i +b) \ge 1 - \xi _i, \quad i \in [1,n], \end{aligned}$$
(P5b)
$$\begin{aligned}&\omega ^{\top } x^i +b \le z_iM, \quad i\in [n+1,N] \setminus (\{s\} \cup S_p \cup S_n), \end{aligned}$$
(P5c)
$$\begin{aligned}&\omega ^{\top } x^i +b \ge -(1-z_i)M, \quad i\in [n+1,N] \setminus (\{s\} \cup S_p \cup S_n), \end{aligned}$$
(P5d)
$$\begin{aligned}&\omega ^{\top } x^i +b \ge 0, \quad i\in S_p, \end{aligned}$$
(P5e)
$$\begin{aligned}&\omega ^{\top } x^i +b \le 0, \quad i\in S_n, \end{aligned}$$
(P5f)
$$\begin{aligned}&0 \le \omega ^{\top } x^s +b \le z_sM, \end{aligned}$$
(P5g)
$$\begin{aligned}&\tau - \eta _1 \le \vert S_p\vert + \sum _{i\in [n+1,N]\setminus ( S_p \cup S_n) } z_i \le \tau + \eta _2, \end{aligned}$$
(P5h)
$$\begin{aligned}&\xi _i \ge 0, \quad i \in [1, n], \end{aligned}$$
(P5i)
$$\begin{aligned}&\eta _1, \eta _2 \ge 0, \end{aligned}$$
(P5j)
$$\begin{aligned}&z_i \in \{0,1\}, \quad i\in [n+1,N]\setminus ( S_p \cup S_n), \end{aligned}$$
(P5k)

with M as defined in (8), satisfies the following properties:

  (a) \(\tilde{f}\) is an upper bound for \(f^*;\)

  (b) if \(\tilde{f}\) is the optimal objective function value of Problem (P5) and \(\bar{f} < \tilde{f}\) is satisfied, it holds that \((\omega ^*)^{\top } x^s + b^* <0,\) i.e., \(x^s \in N_u.\)

Proof

  (a) Every point that satisfies Constraints (P5b)–(P5k) is also feasible for Problem (P3) and provides an objective function value \(\tilde{f}.\) Since \(f^*\) is the optimal objective function value of Problem (P3), \(f^* \le \tilde{f}\) holds.

  (b) Assume, by contradiction, that \((\omega ^*)^{\top } x^s + b^* \ge 0\) holds. This means that \((\omega ^*,b^*, \xi ^ *, \eta ^*, z^*)\) satisfies (P5b)–(P5k), i.e., it is feasible for Problem (P5) with objective function value \(f^*.\) Since \(\tilde{f}\) is the optimal objective function value of Problem (P5), we get \(\tilde{f} \le f^*\) and, together with (a), \(f^* = \tilde{f}.\) However, \(f^* \le \bar{f}\) holds. Thus,

    $$\begin{aligned} f^* \le \bar{f} < \tilde{f} = f^* \end{aligned}$$

    yields a contradiction. \(\square \)

Note that the last lemma can be adapted for the case \(\bar{\omega }^{\top } x^s + \bar{b}> 0.\) In this case, Constraint (P5g) needs to be replaced with

$$\begin{aligned} -(1-z_s)M \le \omega ^{\top } x^s +b \le 0 \end{aligned}$$
(9)

and (b) needs to be replaced with \((\omega ^*)^{\top } x^s + b^* >0,\) i.e., \(x^s \in P_u.\) Note that the more points are fixed to one side, the faster Problem (P3) tends to be solved, as it contains fewer binary variables.

Moreover, the solution of Problem (P3) can be found by solving the problem

$$\begin{aligned} \min _{\omega ,b,\xi ,\eta , z} \quad&\frac{\Vert \omega \Vert ^ 2 }{2} + C_1 \sum _{i=1}^n \xi _i +C_2 (\eta _1 + \eta _2) \end{aligned}$$
(P6a)
$$\begin{aligned} \text {s.t.}\quad&y_i (\omega ^{\top } x^i +b) \ge 1 - \xi _i, \quad i \in [1,n], \end{aligned}$$
(P6b)
$$\begin{aligned}&\omega ^{\top } x^i +b \le z_iM, \quad i\in [n+1,N] \setminus ( S_p \cup S_n), \end{aligned}$$
(P6c)
$$\begin{aligned}&\omega ^{\top } x^i +b \ge -(1-z_i)M, \quad i\in [n+1,N] \setminus ( S_p \cup S_n), \end{aligned}$$
(P6d)
$$\begin{aligned}&\omega ^{\top } x^i +b \ge 0, \quad i\in S_p, \end{aligned}$$
(P6e)
$$\begin{aligned}&\omega ^{\top } x^i +b \le 0, \quad i\in S_n, \end{aligned}$$
(P6f)
$$\begin{aligned}&\tau - \eta _1 \le \vert S_p\vert + \sum _{i\in [n+1,N]\setminus ( S_p \cup S_n) } z_i \le \tau + \eta _2, \end{aligned}$$
(P6g)
$$\begin{aligned}&\xi _i \ge 0, \quad i \in [1, n], \end{aligned}$$
(P6h)
$$\begin{aligned}&\eta _1, \eta _2 \ge 0 \end{aligned}$$
(P6i)
$$\begin{aligned}&z_i \in \{0,1\}, \quad i\in [n+1,N] \setminus ( S_p \cup S_n), \end{aligned}$$
(P6j)

where \(S_p\) and \(S_n\) are subsets of \(P_u\) and \(N_u,\) respectively.

Based on these results, we propose the following. We compute the point \((\bar{\omega },\bar{b}, \bar{\xi },\bar{\eta }, \bar{z})\) using Algorithm 2, leading to an objective function value \(\bar{f}\) for Problem (P3). Afterward, we sort the indices \(i \in [n+1, N],\) indicated by the permutation \(\alpha {:}\,[n+1, N] \rightarrow [n+1, N],\) so that \(\vert \bar{\omega }^{\top }x^{\alpha (i)} + \bar{b} \vert \ge \vert \bar{\omega }^{\top }x^{\alpha (i+1)} + \bar{b} \vert \) holds.

Consider now a given and fixed parameter \(B_{\max },\) a factor \(\gamma \in (1, m / B_{\max }],\) and let \(\beta \) be \(\gamma B_{\max }\) rounded to the nearest integer. While the number of fixed points is smaller than \(B_{\max },\) we do the following. For \(i \in \{1,\dotsc , \beta \},\) if \(\bar{\omega }^{\top }x^{\alpha (i)} + \bar{b} < 0\) holds, we try to solve Problem (P5) with \(s = \alpha (i),\) using the time limit \(T_{\max }\) and the upper bound \(\bar{f}.\) If there is a feasible point of this problem, we set \((\bar{\omega },\bar{b}, \bar{\xi },\bar{\eta }, \bar{z})\) to this point and update the objective function value \(\bar{f}\) accordingly. If no feasible point could be computed and if the time limit was not reached, we fix \(x^{\alpha (i)}\) to lie on the negative side.

We proceed analogously if \(\bar{\omega }^{\top }x^{\alpha (i)} + \bar{b} > 0\) holds, with (P5g) replaced by (9). The method is formally described in Algorithm 3. Finally, note that although Problem (P5) is an MIQP, it is solved as a feasibility problem here, i.e., we only search for a point with an objective function value below \(\bar{f},\) which is often easier in practice than solving an optimization problem to global optimality. Besides that, if the point obtained from Algorithm 2 is close to the optimum of Problem (P3), many unlabeled points can be fixed and Problem (P3) becomes faster to solve.

Algorithm 3

Improved & Warm-Started Re-Clustering Method (WIRCM)
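A rough sketch of the fixing loop of Algorithm 3 is given below. The helper solve_P5_feasibility is hypothetical: it is assumed to build Problem (P5) with point i forced to the side opposite to the incumbent, to impose the objective cutoff \(\bar{f}\) and the time limit \(T_{\max },\) and to report either a feasible point or the status :infeasible or :timelimit.

```julia
# Sketch of the fixing loop in Algorithm 3. `solve_P5_feasibility` is a hypothetical
# helper that returns a named tuple with fields status (:feasible, :infeasible, or
# :timelimit), w, b, f. Sp and Sn collect the indices fixed to the positive and
# negative side, respectively.
function fix_points!(Sp, Sn, w, b, fbar, Xu; Bmax, γ = 1.2, Tmax = 40.0)
    m = size(Xu, 2)
    β = round(Int, γ * Bmax)
    # sort unlabeled points by decreasing distance to the current hyperplane (permutation α)
    α = sortperm([abs(w' * Xu[:, i] + b) for i in 1:m]; rev = true)
    for i in α[1:min(β, m)]
        length(Sp) + length(Sn) >= Bmax && break
        side = (w' * Xu[:, i] + b < 0) ? :negative : :positive
        result = solve_P5_feasibility(i, side, Sp, Sn, fbar, Tmax)   # hypothetical helper
        if result.status == :feasible            # better incumbent with value below fbar found
            w, b, fbar = result.w, result.b, result.f
        elseif result.status == :infeasible      # no better point with i on the opposite side
            side == :negative ? push!(Sn, i) : push!(Sp, i)
        end                                      # :timelimit → leave point i free
    end
    return Sp, Sn, w, b, fbar
end
```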

6 Numerical results

In this section, we present and discuss our computational results that illustrate the benefits of knowing the total number of unlabeled points in each class and of using our approaches to speed up the solution process. We evaluate this on different test sets from the literature. The test sets are described in Sect. 6.1 and the computational setup in Sect. 6.2. The evaluation criteria are given in Sect. 6.3 and the numerical results are discussed in Sect. 6.4.

6.1 Test sets

For the computational analysis of the proposed approaches, we consider the subset of instances presented by Olson et al. (2017) that are suitable for classification problems and that have at most three classes. We restrict ourselves to instances of at most three classes to obtain an overall test set of manageable size. Repeated instances are removed and instances with missing information are reduced to the observations without missing information. If three classes are given in an instance, we transform them into two classes such that the class with label 1 represents the positive class, and the other two classes represent the negative class. This results in a final test set of 97 instances; see Table 1 in “Appendix A”.

To avoid numerical instabilities, we re-scale all data sets as follows. For each coordinate \(j \in [1,d],\) we compute

$$\begin{aligned} l_j = \min _{i \in [1,N]}\{x^i_j\}, \quad u_j = \max _{i \in [1,N]}\{x^i_j\}, \quad m_j = 0.5 \left( l_j + u_j \right) \end{aligned}$$

and shift each coordinate j of all data points \(x^i\) via \(\bar{x}^i_j = x^i_j - m_j.\) If we do this for all data points, they get centered around the origin. Moreover, if a coordinate j of the re-scaled points is still large, i.e., if \(\tilde{l}_j = l_j - m_j < -10^{2}\) or \(\tilde{u}_j = u_j - m_j > 10^{2}\) holds, it is re-scaled via

$$\begin{aligned} \tilde{x}^i_j = (\overline{v} - \underline{v} ) \frac{\bar{x}^i_j - \tilde{l}_j}{\tilde{u}_j-\tilde{l}_j} + \underline{v}, \end{aligned}$$

with \(\overline{v} = 10^2\) and \( \underline{v} = -10^{2}.\) The corresponding 29 instances that we re-scaled are marked with an asterisk in Table 1. Note that we use a linear transformation to scale the datasets. Hence, after computing the hyperplane for the scaled data, the respective hyperplane for the original data can also be computed ex post by applying another suitably chosen linear transformation as well.

In our computational study, we want to highlight the importance of cardinality constraints, especially for the case of non-representative biased samples. Biased samples occur frequently in non-probability surveys, which are surveys for which the inclusion process is not monitored and, hence, the inclusion probabilities are unknown as well. Correction methods like inverse inclusion probability weighting are therefore not applicable. For an insight into inverse inclusion probability weighting, see Skinner and D’arrigo (2011) and references therein.

To mimic this situation, we create 5 biased samples with \({10\,\mathrm{\%}}\) of the data being labeled for each instance. Different from a simple random sample, in which each point has an equal probability of being chosen as labeled data, in the biased sample a labeled point is drawn from the positive side of the hyperplane with probability \({85\,\mathrm{\%}}.\) Then, for each instance, with a time limit of 3600 s, we apply the approaches listed in Sect. 6.2. In Appendix C, we also provide the results under simple random sampling, which produces unbiased samples. We see that the results from the proposed methods are similar to those of the plain SVM in that setting. Hence, besides the additional computational burden, there is no downside to using the proposed methods in case of an unknown sampling process.

6.2 Computational setup

Our algorithm has been implemented in Julia 1.8.5 and we use Gurobi 9.5.2 and JuMP (Dunning et al. 2017) to solve Problem (P1), (P3), and (P4). All computations were executed on the high-performance cluster “Elwetritsch”, which is part of the “Alliance of High-Performance Computing Rheinland-Pfalz” (AHRP). We used a single Intel XEON SP 6126 core with \(2.6~\hbox {GHz}\) and \(64~\hbox {GB}\) RAM.

For each one of the 485 instances described in Sect. 6.1, the following approaches are compared:

  (a) SVM as given in Problem (P1), where only labeled data are considered;

  (b) CS\(^3\)VM as given in Problem (P3) with M as given in (1);

  (c) IRCM as described in Algorithm 2;

  (d) WIRCM as described in Algorithm 3.

Based on our preliminary experiments, we set the penalty parameters \(C_1 = C_2 = 1.\) For WIRCM, we impose a time limit for solving Problem (P5) of \(T_{\max } = 40\) s. Moreover, we choose \(\gamma = 1.2\) and the maximum number \(B_{\max }\) of unlabeled points that can be fixed as

$$\begin{aligned} B_{\max } = {\left\{ \begin{array}{ll} 0.2m, &{}\text {if } m \in [1,100],\\ 0.25m, &{}\text {if } m \in (100,500],\\ 0.35m, &{}\text {if } m \in (500,1000],\\ 0.45m, &{}\text {otherwise}. \end{array}\right. } \end{aligned}$$

Finally, for IRCM and WIRCM, we set \(\hat{\Delta }^1 = 0.8,\) \(\tilde{\Delta } = 0.1,\) \(k^+ = 50,\) and the initial number of clusters is set to

$$\begin{aligned} k^1 = {\left\{ \begin{array}{ll} 10, &{}\text {if } m \in [1,500],\\ 20, &{}\text {if } m \in (500,1000],\\ 50, &{}\text {otherwise}. \end{array}\right. } \end{aligned}$$

A more detailed discussion of the choice of hyperparameters is given in Appendix D.

6.3 Evaluation criteria

The first evaluation criterion is the run time of SVM, CS\(^3\)VM, IRCM, and WIRCM. The results will help to contextualize other evaluation criteria such as accuracy and precision. To compare run times, we use empirical cumulative distribution functions (ECDFs). Specifically, for S being a set of solvers (or approaches as above) and for P being a set of problems, we denote by \(t_{p,s} \ge 0\) the run time of approach \(s \in S\) applied to problem \(p \in P\) in seconds. If \(t_{p,s} > 3600,\) we consider problem p as not being solved by approach s. With these notations, the performance profile of approach s is the graph of the function \(\gamma _s{:}\, [0, \infty ) \rightarrow [0,1]\) given by

$$\begin{aligned} \gamma _s(\sigma ) = \frac{1}{\vert P \vert }\big \vert \left\{ p \in P{:}\,t_{p,s} \le \sigma \right\} \big \vert . \end{aligned}$$
(10)
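For illustration, the ECDF value in (10) for a single approach could be computed as follows; instances not solved within 3600 s keep their recorded run time above the time limit and are thus never counted for the plotted values of \(\sigma .\)

```julia
# Sketch: the ECDF γ_s(σ) from (10) for one approach s, where `times` collects the run
# times t_{p,s} over all problems p.
ecdf_value(times::Vector{Float64}, σ::Real) = count(<=(σ), times) / length(times)
```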

The second evaluation criterion is based on Theorem 7, where we show that the objective function value of the point obtained by IRCM is an upper bound for CS\(^3\)VM, which is also the problem ultimately solved within WIRCM. Note that SVM also provides a feasible point for CS\(^3\)VM and, consequently, provides an upper bound as well. Given the solution \((\omega ,b, \xi )\) of the SVM, we compute the binary variables \(z_i,\) \(i\in [n+1,N],\) as follows:

$$\begin{aligned} { z_i = {\left\{ \begin{array}{ll} 1, &{}\text {if } \omega ^\top x^i + b >0,\\ 0, &{}\text {if } \omega ^\top x^i + b <0. \end{array}\right. }} \end{aligned}$$

If \(\omega ^\top x^i + b =0\) for some \(x^i,\) we set

$$\begin{aligned} z_i = {\left\{ \begin{array}{ll} 1, &{}\text {if } \sum _{j \in [n+1,N]{:}\, \omega ^\top x^j + b \ne 0 } z_j \le \tau ,\\ 0, &{}\text {otherwise}. \end{array}\right. } \end{aligned}$$

Finally, we set

$$\begin{aligned} \eta _1 = \max \left\{ 0, \tau - \sum _{i=n+1}^N z_i\right\} , \quad \eta _2 = \max \left\{ 0, \sum _{i=n+1}^N z_i - \tau \right\} , \end{aligned}$$

and the objective function value can be computed as

$$\begin{aligned} \frac{\Vert \omega \Vert ^ 2 }{2} + C_1 \sum _{i=1}^n \xi _i +C_2 (\eta _1 + \eta _2). \end{aligned}$$

Based on that, we compare how close the objective function values obtained from SVM, CS\(^3\)VM, IRCM, and WIRCM are to the optimal objective function value. To this end, we use ECDFs, for which we replace \(t_{p,s}\) by \(f_{p,s}\) in Eq. (10) with

$$\begin{aligned} {f}_{p,s} \mathrel {{\mathop :}{=}}\frac{b_{p,s}-f^*_{p}}{f^*_{p}}, \end{aligned}$$
(11)

where \(f^*_{p}\) is the optimal objective function value of problem p and \(b_{p,s}\) is the objective function value obtained by approach s.

Besides that, for each instance and for each approach described in Sect. 6.2, after computing the hyperplane \((\omega , b),\) we classify all points \(x^i\) as being on the positive side if \(\omega ^\top x^i + b > 0\) and as being on the negative side if \(\omega ^\top x^i + b <0\) holds. For CS\(^3\)VM and WIRCM, if the hyperplane \((\omega , b)\) satisfies \(\omega ^\top x^i + b = 0\) for some unlabeled point \(x^i,\) we classify this point as positive or negative depending on the respective binary variable \(z_i.\) On the other hand, for IRCM, if \(\omega ^\top x^i + b = 0\) for some unlabeled point \(x^i,\) we classify this point as positive or negative depending on \(z_j\) with j so that \(x^i \in {\mathcal {C}}_j.\) For the labeled points in these three approaches and for all points in the SVM, if \(\omega ^\top x^i + b = 0\) holds, we classify the point on the correct side. Note that for the cases in which the IRCM or WIRCM takes more than 3600 s to solve the instance, we use the last hyperplane found by the algorithm. If we hit the time limit in Gurobi when solving CS\(^3\)VM (either standalone or in the final phase of the WIRCM), we take the best solution found so far.

Knowing the true label of all points, we then distinguish all points in four categories: true positive (TP) or true negative (TN) if the point is classified correctly in the positive or negative class, respectively, as well as false positive (FP) if the point is misclassified in the positive class and as false negative (FN) if the point is misclassified in the negative class. Based on that we compute two classification metrics, for which a higher value indicates a better classification. The first one is accuracy \((\text {AC}).\) It measures the proportion of correctly classified points and is given by

$$\begin{aligned} \text {AC}\mathrel {{\mathop :}{=}}\frac{\text {TP}+ \text {TN}}{\text {TP}+ \text {TN}+ \text {FP}+ \text {FN}} \in [0,1]. \end{aligned}$$
(12)

The second metric is precision \((\text {PR}).\) It measures the proportion of correctly classified points among all positively classified points and is computed by

$$\begin{aligned} \text {PR}\mathrel {{\mathop :}{=}}\frac{\text {TP}}{\text {TP}+ \text {FP}} \in [0,1]. \end{aligned}$$
(13)

The main comparison in terms of accuracy and precision is w.r.t. the “true hyperplane”, i.e., the solution of Problem (P1) on the complete data with all N points and all labels available. The main question is how close the accuracy and precision are to those of the true hyperplane. Hence, we compute the ratios of the accuracy and precision according to

$$\begin{aligned} \widehat{\text {AC}} \mathrel {{\mathop :}{=}}\frac{\text {AC}}{\text {AC}_{\text {true}}}, \quad \widehat{\text {PR}} \mathrel {{\mathop :}{=}}\frac{\text {PR}}{\text {PR}_{\text {true}}}, \end{aligned}$$
(14)

where \(\text {AC}_{\text {true}}\) and \(\text {PR}_{\text {true}}\) are computed as in Eqs. (12) and (13) for the true hyperplane.

We also compare the measures with the SVM method, which only considers the information of the labeled data. For this purpose, we compute

$$\begin{aligned} \overline{\text {AC}} \mathrel {{\mathop :}{=}}\frac{\text {AC}-\text {AC}_{\text {SVM}}}{\text {AC}_{\text {SVM}}}, \quad \overline{\text {PR}} \mathrel {{\mathop :}{=}}\frac{\text {PR}-\text {PR}_{\text {SVM}}}{\text {PR}_{\text {SVM}}}, \end{aligned}$$
(15)

where \(\text {AC}_{\text {SVM}}\) and \(\text {PR}_{\text {SVM}}\) are computed as in (12) and (13) for the SVM hyperplane. To keep the numerical results section concise, we report on recall and the false positive rate in Appendix B.
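The metrics (12)–(15) are straightforward to compute; the following sketch (with names of our own choosing) summarizes the computation for a single instance and approach.

```julia
# Sketch: the classification metrics (12)–(15) for one instance and one approach;
# pred and truth hold the predicted and true labels in {-1, 1}.
function accuracy_precision(pred::Vector{Int}, truth::Vector{Int})
    TP = count(i -> pred[i] ==  1 && truth[i] ==  1, eachindex(pred))
    TN = count(i -> pred[i] == -1 && truth[i] == -1, eachindex(pred))
    FP = count(i -> pred[i] ==  1 && truth[i] == -1, eachindex(pred))
    FN = count(i -> pred[i] == -1 && truth[i] ==  1, eachindex(pred))
    AC = (TP + TN) / (TP + TN + FP + FN)     # accuracy (12)
    PR = TP / (TP + FP)                      # precision (13)
    return AC, PR
end

# ratios (14) w.r.t. the true hyperplane and relative improvements (15) w.r.t. the SVM
ratios(AC, PR, ACtrue, PRtrue) = (AC / ACtrue, PR / PRtrue)
improvements(AC, PR, ACsvm, PRsvm) = ((AC - ACsvm) / ACsvm, (PR - PRsvm) / PRsvm)
```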

6.4 Numerical results

6.4.1 Run time

Figure 2 shows the ECDFs for the measured run times. Clearly, SVM is the fastest approach. This is expected as the SVM does not include any binary variables related to the unlabeled points, in contrast to the other approaches. It can be seen that the IRCM outperforms both CS\(^3\)VM and WIRCM. This shows that clustering the unlabeled data points significantly decreases the run time. However, we need to be careful with the interpretation of these run times since termination of SVM and IRCM does not imply that a globally optimal point is found, whereas this is guaranteed for CS\(^3\)VM and the WIRCM. The quality of the points found by SVM and IRCM will be discussed in the next section. The figure also clearly indicates that Problem (P2) is rather challenging: Even the IRCM, which terminates within the time limit (indicated by the gray dashed vertical line) for the largest share of instances, only does so for \({57\,\mathrm{\%}}\) of the instances. Note that the WIRCM has the worst efficiency. This obviously needs to be the case since, due to Step 1 of Algorithm 3, its run time always includes the run time of the IRCM. To shed some light on the scalability of the approaches, we also present a brief analysis of the run times as a function of the number of samples in Appendix E.

Fig. 2

ECDFs for run time (in seconds)

6.4.2 Quality of the obtained upper bounds

As discussed in the last section, for some instances none of the three approaches that actually consider the unlabeled data terminates within the given time limit. This means we do not obtain the optimal objective function value for these instances, which, moreover, can only be provably obtained by CS\(^3\)VM and the WIRCM. In fact, we have the optimal solution for 179 instances. These are the baseline instances for Fig. 3, which shows the ECDFs for the upper bound quality as defined in (11). Note that the objective function value obtained by SVM is very far from the optimal value, while the IRCM finds an objective function value rather close to the optimal value (with \(f_{p,s} \le 0.2,\) see the gray dashed vertical line) in \({90\,\mathrm{\%}}\) of these instances. Besides that, the WIRCM outperforms CS\(^3\)VM in this comparison, which means that using the IRCM as a warm start improves the result.

Fig. 3: ECDFs for the quality of the obtained upper bounds

The consequences of the results so far are as follows. If one is interested in obtaining a rather good feasible point as quickly as possible, one should use the IRCM. If one is able to spend some more run time, one should use the WIRCM. Hence, both novel methods derived in this paper have advantages over just solving the CS\(^3\)VM with a standard MIQP solver.

6.4.3 Accuracy

For some instances, none of the three approaches that actually tackle the unlabeled data terminates within the given time limit. Hence, our first comparison only considers those instances for which CS\(^3\)VM terminates within the time limit.

Fig. 4: Relative accuracy \(\widehat{\text {AC}}\) w.r.t. the true hyperplane; see (14). Only those instances are considered for which CS\(^3\)VM terminated. Left: Comparison for all data points. Right: Comparison only for unlabeled data points

As can be seen in Fig. 4, the relative accuracy \(\widehat{\text {AC}}\) (w.r.t. the true hyperplane) of CS\(^3\)VM is closer to 1 than the relative accuracy of SVM, especially for the unlabeled data. This means that using the unlabeled points as well as the cardinality constraint allows us to reproduce the classification of the true hyperplane with higher accuracy than the standard SVM does. Besides that, the relative accuracy of the SVM is more spread out than that of the other approaches, indicating that there is comparably more variation in its results than in those of CS\(^3\)VM. The box in the boxplot depicts the range of the middle 50 % of the values; 25 % of the values are below and 25 % are above the box.

Fig. 5: Accuracy values \(\overline{\text {AC}}\) w.r.t. the SVM; see (15). Only those instances are considered for which CS\(^3\)VM terminated. Left: Comparison for all data points. Right: Comparison only for unlabeled data points

Figure 5 shows that, in almost 75 % of the cases, CS\(^3\)VM has \(\overline{\text {AC}}\) values larger than zero, where zero means the same accuracy as the SVM itself. In the remaining 25 % of the cases, the \(\overline{\text {AC}}\) values of CS\(^3\)VM are only slightly negative, i.e., the accuracy of CS\(^3\)VM is slightly smaller than that of the SVM.

The second comparison considers only the three approaches that actually use the unlabeled data, i.e., CS\(^3\)VM, the IRCM, and the WIRCM, on all instances. As can be seen in Fig. 6, even though the IRCM does not have an optimality guarantee, it has a better relative accuracy \(\widehat{\text {AC}}\) than the hyperplane obtained from CS\(^3\)VM within the time limit. Consequently, since the hyperplane obtained from the IRCM is used as a warm start in the WIRCM, the latter also has better accuracy.

Fig. 6: Relative accuracy \(\widehat{\text {AC}}\) w.r.t. the true hyperplane; see (14). Left: Comparison for all data points. Right: Comparison only for unlabeled data points

Figure 7 shows that, in almost 75 % of the cases, CS\(^3\)VM, the IRCM, and the WIRCM have \(\overline{\text {AC}}\) values larger than zero. That is, in general, our methods have greater accuracy than the SVM. However, some cases show worse \(\overline{\text {AC}}\) values for our methods than for the SVM. This happens because, for some instances, the methods (mainly CS\(^3\)VM; see also Fig. 2) do not terminate within the time limit. Hence, we expect the number of negative values to decrease if the time limit were increased.

Fig. 7: Accuracy values \(\overline{\text {AC}}\) w.r.t. the SVM; see (15). All instances are considered. Left: Comparison for all data points. Right: Comparison only for unlabeled data points

6.4.4 Precision

We again separate the comparisons as in Sect. 6.4.3. Figure 8 shows that the SVM’s relative precision \(\widehat{\text {PR}}\) is lower than the relative precision of CS\(^3\)VM. This means that CS\(^3\)VM reproduces the classification of the true hyperplane with higher precision than the original SVM. Hence, the SVM has more false-positive results. This happens because the biased sample is more likely to contain positively labeled data and, having no information about the unlabeled data, the SVM ends up classifying points on the positive side. As can be seen in Fig. 9, CS\(^3\)VM has \(\overline{\text {PR}}\) values slightly above 0, which is the baseline here and refers to the SVM itself. This means that CS\(^3\)VM is slightly more precise than the SVM.

Fig. 8: Relative precision \(\widehat{\text {PR}}\) w.r.t. the true hyperplane; see (14). Only those instances are considered for which CS\(^3\)VM terminated. Left: Comparison for all data points. Right: Comparison only for unlabeled data points

Fig. 9: Precision values \(\overline{\text {PR}}\) w.r.t. the SVM; see (15). Only those instances are considered for which CS\(^3\)VM terminated. Left: Comparison for all data points. Right: Comparison only for unlabeled data points

Fig. 10: Relative precision \(\widehat{\text {PR}}\) w.r.t. the true hyperplane; see (14). Left: Comparison for all data points. Right: Comparison only for unlabeled data points

Fig. 11: Precision values \(\overline{\text {PR}}\) w.r.t. the SVM; see (15). Left: Comparison for all data points. Right: Comparison only for unlabeled data points

Figure 10 shows that the \(\widehat{\text {PR}}\) values of the IRCM and the WIRCM are less spread out than those of CS\(^3\)VM. The reason most likely is that the CS\(^3\)VM approach terminates on fewer instances than the IRCM and the WIRCM. As can be seen in Fig. 11, the IRCM and the WIRCM also have \(\overline{\text {PR}}\) values slightly above 0. This means that our methods are slightly more precise than the SVM. The negative outliers are most likely due to the same reason as those for the respective accuracy values.

7 Conclusion

For many classification problems, it can be costly to obtain labels for the entire population of interest. However, aggregate information on how many points are in each class can be available from external sources. For this situation, we proposed a semi-supervised SVM that can be modeled via a big-M-based MIQP formulation. We also presented a rule for updating the big-M in an iterative re-clustering method and derived further computational techniques such as tailored dimension reduction and warm-starting to reduce the computational cost.

In the case of simple random samples, our proposed semi-supervised methods perform as well as the classic SVM approach. However, in many applications, the available data comes from non-probability samples. Hence, there is the risk of obtaining biased samples. Our numerical study shows that, in this setting, our approaches have better accuracy and precision than the original SVM formulation.

The problem of considering a cardinality constraint is computationally challenging. Our proposed clustering approach significantly helps to decrease the run time and to find an objective function value that is very close to the optimal one. Besides that, the clustering approach maintains the same accuracy and precision as the MIQP formulation. Moreover, using the clustering approach as a warm start and fixing some unlabeled points to one side of the hyperplane further improves the quality of the objective function value. Hence, the newly proposed methods lead to a significant improvement compared to just solving the classic MIQP formulation with a standard solver.

Despite these contributions, there is still room for improvement and future work. First, we only considered the linear SVM kernel. The development of methods for other kernels, such as the Gaussian kernel, could be a valuable topic for future work. Moreover, the use of norms other than the 2-norm could be analyzed as well, and the formal hardness of the considered problem should be settled. Finally, the adaptation of our approaches to multiclass SVMs using a one-vs.-rest strategy may be another reasonable direction for future work.