1 Introduction

We are interested in convex constrained optimization problems with an additional cardinality constraint. In other words, we are interested in finding sparse solutions of those optimization problems, i.e., solutions with a limited number of nonzero elements, as required in many areas including image and signal processing, mathematical statistics, machine learning, and portfolio optimization, among others. One effective way to ensure the sparsity of the obtained solution is to impose a cardinality constraint, where the number of nonzero elements of the solution is bounded in advance.

To be precise, let us consider the following constrained optimization problem:

$$\begin{aligned} \displaystyle \min _{x} f(x) \;\;\;\; \text{ subject } \text{ to } \;\;\;\; x\in \mathit{\Omega} \; \text{ and } \; \Vert x\Vert _0 \le \alpha , \end{aligned}$$
(1)

where \(f:\mathbb {R}^{n}\rightarrow \mathbb {R}\) is continuously differentiable, \(1\le \alpha < n\) is a given natural number, \(\Omega\) is a convex subset of \(\mathbb {R}^{n}\) (which will change depending on the considered application), and the \(L_0\) (quasi) norm \(\Vert x\Vert _0\) denotes the number of nonzero components of x. The sparsity constraint \(\Vert x\Vert _0\le \alpha\) is also called the cardinality constraint. The assumption \(\alpha < n\) is natural, since otherwise the cardinality constraint could simply be discarded.

The main difference between problem (1) and a standard convex constrained optimization problem is that the cardinality constraint, despite the notation, is not a norm, and is neither continuous nor convex. Because of the intractability of the so-called zero norm \(\Vert x\Vert _0\), the 1-norm \(\Vert x\Vert _1\) has frequently been considered instead to develop good approximate algorithms. However, to impose a required level of sparsity, the use of the zero norm in (1) is much more effective.

Optimization problems with cardinality constraints are (strongly) NP-hard [5, 16] and can be solved by global techniques from discrete or combinatorial optimization (see, e.g., [4, 14, 23]). However, in a more general setting, a continuous reformulation has recently been proposed and analyzed in [12] to deal with this difficult cardinality constraint. The main idea is to address the continuous counterpart of problem (1):

$$\begin{aligned} \begin{array}{ll} \displaystyle \min _{x,y} & f(x) \\ \text{subject to:} & x\in \mathit{\Omega} , \\ & e^{\top } y \ge n-\alpha , \\ & x_i y_i = 0, \; \text{for all}\; 1\le i \le n, \\ & 0 \le y_i \le 1, \; \text{for all}\; 1\le i \le n, \end{array} \end{aligned}$$
(2)

where \(e\in \mathbb {R}^{n}\) denotes the vector of ones. We note that the last n constraints define a simple box in the auxiliary variable vector \(y\in \mathbb {R}^{n}\). A more difficult reformulation substitutes the simple box by a set of binary constraints, given by either \(y_i = 0\) or \(y_i = 1\) for all i. In that case, the problem is an integer programming problem (much harder to solve) for which several algorithmic ideas have already been developed (see, e.g., [4, 5, 14, 19, 36]). Here, we will focus on the continuous formulation (2), which will play a key role in our algorithmic proposal. For additional theoretical properties, including the equivalence between the original version (1) and the continuous relaxed version (2), see [12, 25, 28, 29].
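To see the intuition behind this equivalence, note that any x that is feasible for (1) can be lifted to a feasible pair of (2) by letting y mark the zero pattern of x:

$$y_i = \left\{ \begin{array}{ll} 1, & x_i = 0, \\ 0, & x_i \ne 0, \end{array} \right. \qquad \text{ so that } \qquad e^{\top } y = n - \Vert x\Vert _0 \ge n-\alpha \;\; \text{ and } \;\; x\circ y = 0.$$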

As a consequence of the so-called Hadamard constraint (\(x\circ y = 0\), i.e., \(x_i y_i = 0\) for all i), the formulation (2) is a nonconvex problem, even when the original cardinality constrained problem (except for the cardinality constraint, of course) was convex. Thus, one cannot in general expect to obtain global minima. But if one is, for example, interested in obtaining local solutions or good starting points for a global method, this continuous formulation (2) can be useful.

In this work, we will pay special attention to those problems for which the set \(\Omega\) is the intersection of a finite collection of convex sets, each of which is very easy to project onto. In that case, the main idea is to take advantage of the fact that two of the constraints in (2), namely, \(e^{\top } y \ge n-\alpha\) and \(0 \le y_i \le 1\) for all i, also define “easy-to-project” convex sets, and so an alternating projection scheme can be conveniently applied to project onto the intersection of all the constraints involved in (2), except for the Hadamard constraint. For computing a solution candidate of the continuous formulation (2), we can then use a suitable low-cost convex constrained scheme, such as gradient-type methods in which the objective function includes f(x) plus a suitable penalization term that guarantees that the Hadamard constraint is also satisfied at the solution. In Section 2, we will describe and analyze a general penalty method to satisfy the Hadamard constraint that appears in the relaxed formulation (2). In Section 3, we will describe a suitable alternating projection scheme as well as a suitable low-cost gradient-type projection method that can be combined with the penalty method of Section 2. We close Section 3 by showing the combined algorithm that represents the main contribution of our work. Concerning specific applications, in Section 4, we will consider in detail the standard mean-variance limited diversified portfolio selection problem (see, e.g., [13,14,15, 19, 21, 23]). In Section 5, we will present a numerical study to illustrate the computational performance of the proposed scheme on a variety of data sets involving real-world capital market indices from major stock markets. For each considered data set, we will focus our attention on the efficient frontier produced for different values of the limited number of allowed assets. In Section 6, we will present some final comments and perspectives.

2 A Penalization Strategy for the Hadamard Constraint

Let us consider again the continuous formulation (2), and let us focus our attention on the Hadamard constraint \(x\circ y = 0\) (i.e., \(x_i y_i = 0\) for all i). This particular constraint is the only one that does not define a convex set. The others define convex sets onto which it is easy to project, as discussed in the previous section. To see that the set of vectors \((x,y) \in \mathbb {R}^{2n}\) such that \(x\circ y = 0\) is not convex, it is enough to consider the two pairs of 2-dimensional vectors \((x_1,y_1) = (1,0,0,1)\) and \((x_2,y_2) = (0,1,1,0)\). Both pairs are clearly in that set, but their convex combination \(\frac{1}{2}(x_1,y_1) + \frac{1}{2}(x_2,y_2) = \frac{1}{2}e\) is not.

A classical and straightforward approach to force the Hadamard condition at the solution, while keeping the feasible set of our problem as the intersection of a finite collection of easy convex sets, is to add a penalization term \(\tau h(x,y)\) to the objective function and consider instead the following formulation:

$$\begin{aligned} \begin{array}{ll} \displaystyle \min _{x,y} & f(x) + \tau h(x,y) \\ \text{subject to:} & x\in \mathit{\Omega} , \\ & e^{\top } y \ge n-\alpha , \\ & 0 \le y_i \le 1, \; \text{for all}\; 1\le i \le n, \end{array} \end{aligned}$$
(3)

where \(\tau >0\) is a penalization parameter that needs to be properly chosen, and the function \(h:\mathbb {R}^{2n}\rightarrow \mathbb {R}\) is continuously differentiable and chosen to satisfy the following two properties: \(h(x,y)\ge 0\) for all feasible vectors x and y, and \(h(x,y) = 0\) if and only if \(x\circ y = 0\). Clearly, the function \(h(x,y)\) is crucial and should be conveniently chosen depending on the considered application. However, a default option that satisfies all the required properties is given by \(h(x,y)= \sum _{1\le i\le n} x_i^2 y_i^2\).
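As an illustration, a minimal Matlab sketch of this default penalty and its partial gradients (the function name is ours) could read:

function [h, gx, gy] = hadamard_penalty(x, y)
% Default penalty h(x,y) = sum_i x_i^2 y_i^2 and its partial gradients.
p  = x .* y;           % componentwise (Hadamard) product
h  = sum(p.^2);        % h(x,y) >= 0, and h(x,y) = 0 iff x o y = 0
gx = 2 * (y.^2) .* x;  % dh/dx_i = 2 x_i y_i^2
gy = 2 * (x.^2) .* y;  % dh/dy_i = 2 x_i^2 y_i
end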

Applying now a penalty scheme, problem (3) can be reduced to a sequence of convex constrained problems of the following form:

$$\begin{aligned} \displaystyle \min _{x,y} \;\;\;\; f(x) + \tau _k h(x,y), \;\;\;\; \text{ subject to } \;\;\;\; (x,y) \in \mathit{\widehat{\Omega}}, \end{aligned}$$
(4)

where \(\tau _k >0\) is the penalty parameter that increases at every k to penalize the Hadamard-constraint violation, and the closed convex set \(\widehat{\Omega }\) is given by

$$\mathit{\widehat{\Omega }} = \{(x,y)\in \mathbb {R}^{2n}: x\in \mathit{\Omega} , \;\; e^{\top } y \ge n-\alpha , \;\; 0 \le y_i\le 1, \; i=1,\ldots ,n\}.$$

Under some mild assumptions and a specific choice of the sequence \(\{\tau _k\}\), it can be established that the sequence of solutions of problem (4) converges to a solution of (2) (see, e.g., [22] and [30, Sect. 12.1]). We will assume that problem (2) attains global minimizers; since f is a continuous function, it is enough to assume that one of the closed and convex sets involved in the definition of \(\mathit{\Omega}\) in (2) is bounded. Here, for the sake of completeness, we summarize the convergence properties of the proposed penalty scheme (4).

Theorem 1

If for all k, \(\tau _{k+1}> \tau _k>0\) and \((x_k,y_k)\) is a global solution of (4), then

$$\begin{aligned} f(x_k) + \tau _k h(x_k,y_k)&\le f(x_{k+1}) + \tau _{k+1} h(x_{k+1},y_{k+1}) \\ h(x_{k+1},y_{k+1})&\le h(x_k,y_k) \\ f(x_k)&\le f(x_{k+1}) . \end{aligned}$$

Moreover, if \(\bar{x}\) is a global solution of problem (2), then for all k

$$f(x_k) \;\le \; f(x_k) + \tau _k h(x_k,y_k) \;\le \; f(\bar{x}) \;.$$

Finally, if \(\tau _k \rightarrow \infty\) and \(\{(x_k,y_k)\}\) is the sequence of global minimizers obtained by solving (4), then any limit point of \(\{(x_k,y_k)\}\) is a global minimizer of (2).

Remark 1

In the proof of the last statement of Theorem 1 (see, e.g., [30, Sect. 12.1]), the requirement \(\tau _k \rightarrow \infty\) is used only to guarantee that \(h(x_k,y_k)\rightarrow 0\) when \(k\rightarrow \infty\), i.e., to guarantee that \(x_k\circ y_k \rightarrow 0\). To obtain the convergence result, what matters is that the Hadamard product itself goes to zero, even if \(0< \tau _k < \infty\) for all k. This fact will play a key role in our numerical study (Section 5).

We would like to close this section with a pertinent result [28, Theorem 4] that establishes a one-to-one correspondence between minimizers of problems (1) and (2), whenever the obtained solution \(\bar{x}\) satisfies the cardinality constraint with equality, i.e., \(\Vert \bar{x}\Vert _0=\alpha\).

Theorem 2

Let \((\bar{x},\bar{y})\) be a local minimizer of the relaxed problem (2). Then, \(\Vert \bar{x}\Vert _0=\alpha\) if and only if \(\bar{y}\) is unique, that is, if there exists exactly one \(\bar{y}\) such that \((\bar{x},\bar{y})\) is a local minimizer of (2). In this case, the components of \(\bar{y}\) are binary (i.e., \(\bar{y}_i =0\) or \(\bar{y}_i=1\) for all \(1\le i\le n\)) and \(\bar{x}\) is a local minimizer of (1).

3 Dykstra’s Method and the SPG Method

For every k, a low-cost projected gradient method can be used to compute a solution candidate of the optimization problem (4). For a given vector \(\tilde{x}\in \mathbb {R}^{2n}\), a convenient tool for finding the required projections onto \(\mathit{\widehat{\Omega}}\) is Dykstra’s alternating projection algorithm [11], which projects in a clever way onto the convex sets, say \(\mathit{\Omega} _1,\dots ,\mathit{\Omega} _p\), individually, completing a cycle that is repeated iteratively; as with any other iterative method, it can be stopped prematurely.

In Dykstra’s method, it is assumed that the projections onto each of the individual sets \(\mathit{\Omega} _i\) are relatively simple to compute, e.g., boxes, spheres, subspaces, half-spaces, and hyperplanes. The algorithm has been adapted and used for solving a wide variety of applications and has been combined with several optimization techniques, including outer approximation strategies for solving nonlinearly constrained problems (see, e.g., [2, 17, 33]). For a review of Dykstra’s method, its properties and applications, as well as many other alternating projection schemes, see, e.g., [18, 20].

Dykstra’s algorithm generates two sequences: the iterates \(\{ x_{\ell }^{i} \}\) and the increments \(\{I_{\ell }^{i}\}\). These sequences are defined by the following recursive formulae:

$$\begin{aligned} \begin{array}{ll} x^{0}_{\ell } = x_{\ell -1}^{p}, & \\ x_{\ell }^{i} = P_{\mathit{\Omega} _i}(x_{\ell }^{i-1}-I_{\ell -1}^{i}), & i=1,2, \ldots , p, \\ I_{\ell }^{i} = x_{\ell }^{i}-(x_{\ell }^{i-1}-I_{\ell -1}^{i}), & i=1,2, \ldots , p, \end{array} \end{aligned}$$
(5)

for \(\ell \in \mathbb {Z}^+\) with initial values \(x_0^p=\tilde{x}\) and \(I_0^i=0\) for \(i=1,2, \dots , p\).

The sequence of increments plays a fundamental role in the convergence of the sequence \(\{ x^{i}_{\ell } \}\) to the unique optimal solution \(x^{*}=P_{\mathit{\widehat{\Omega}}}(\tilde{x})\). Boyle and Dykstra [11] established the key convergence theorem associated with algorithm (5): for any \(i=1, 2, \ldots ,p\) and any given \(\tilde{x}\), the sequence \(\{ x_{\ell }^i \}\) generated by (5) converges to \(x^{*}=P_{\mathit{\widehat{\Omega}}}(\tilde{x})\) (i.e., \(\Vert x_{\ell }^{i}-x^{*}\Vert \rightarrow 0\) as \(\ell \rightarrow \infty\)). Concerning the rate of convergence, it is well known that Dykstra’s algorithm exhibits a linear rate of convergence in the polyhedral case [18, 20], which is the case in all problems considered here (see Section 5). Finally, the stopping criterion associated with Dykstra’s algorithm is a delicate issue; a discussion of this topic and the development of some robust stopping criteria are fully described in [10]. Based on that, here we will stop the iterations when

$$\begin{aligned} \sum ^p_{i=1} \Vert I_{\ell -1}^i-I_\ell ^i \Vert ^{2}\le \varepsilon , \end{aligned}$$
(6)

where \(\varepsilon > 0\) is a small given tolerance.
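For illustration purposes, a minimal Matlab sketch of cycle (5) together with stopping test (6) could be organized as follows, where proj is a cell array of projection operators \(P_{\mathit{\Omega} _i}\) and all names are ours:

function x = dykstra(xtilde, proj, epsilon, maxcycles)
% Dykstra's alternating projection scheme (5), stopped with criterion (6).
p = numel(proj);
x = xtilde(:);
I = zeros(numel(x), p);              % increments: I_0^i = 0
for ell = 1:maxcycles
    delta = 0;                       % accumulates sum_i ||I_{l-1}^i - I_l^i||^2
    for i = 1:p
        z      = x - I(:,i);         % x_l^{i-1} - I_{l-1}^i
        xnew   = proj{i}(z);         % x_l^i = P_{Omega_i}(z)
        Inew   = xnew - z;           % I_l^i = x_l^i - z
        delta  = delta + norm(Inew - I(:,i))^2;
        I(:,i) = Inew;
        x      = xnew;
    end
    if delta <= epsilon, break; end  % stopping criterion (6)
end
end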

Since the gradient \(\nabla f(x,y)\) of \(f(x,y) = f(x) + \tau h(x,y)\) is available for each fixed \(\tau >0\), projected gradient (PG) methods provide an interesting low-cost option for solving (4). They are simple and easy to code, and avoid the need for matrix factorizations (no Hessian matrix is used). There have been many variations of the early PG methods. They all share the property of maintaining feasibility of the iterates by frequently projecting trial steps onto the feasible convex set. In particular, a well-established and effective scheme is the so-called Spectral Projected Gradient (SPG) method (see Birgin et al. [6,7,8,9]).

The SPG algorithm starts with \((x_0,y_0) \in \mathbb {R}^{2n}\) and moves at every iteration j along the internal projected gradient direction \(d_j=P_{\mathit{\widehat{\Omega}}}((x_j,y_j) -\alpha _j \nabla f(x_j,y_j)) - (x_j,y_j)\), where \(d_j \in \mathbb {R}^{2n}\) and \(\alpha _j\) is the well-known spectral choice of step length (see [9]):

$$\alpha _j = \frac{\langle s_{j-1}, s_{j-1}\rangle }{\langle s_{j-1}, (\nabla f(x_j,y_j) - \nabla f(x_{j-1},y_{j-1})) \rangle },$$

and \(s_{j-1}= (x_j,y_j) - (x_{j-1},y_{j-1})\). In the case of rejection of the first trial point, \((x_j,y_j)+d_j\), the next ones are computed along the same direction, i.e., \((x_{+},y_{+})=(x_j,y_j) + \lambda d_j\), using a nonmonotone line search to choose \(0<\lambda \le 1\) such that the following condition holds

$$f(x_{+},y_{+}) \le \max _{0\le l \le \min \{j,M-1\}} f(x_{j-l},y_{j-l}) + \gamma \lambda \langle d_j, \nabla f(x_j,y_j) \rangle ,$$

where \(M \ge 1\) is a given integer and \(\gamma\) is a small positive number. Therefore, the projection onto \(\mathit{\widehat{\Omega}}\) must be performed only once per iteration. More details can be found in [6] and [7]. In practice, \(\gamma = 10^{-4}\) and a typical value for the nonmonotone parameter is \(M=10\), but the performance of the method may vary with this parameter, and some fine tuning may be adequate for specific applications.

Another key feature of the SPG method is to accept the initial spectral step-length as often as possible while ensuring global convergence. For this reason, the SPG method employs a non-monotone line search that does not impose functional decrease at every iteration. The global convergence of the SPG method combined with Dykstra’s algorithm to obtain the required projection per iteration can be found in [8, Section 3].
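To fix ideas, the following Matlab sketch outlines one possible implementation of the SPG iteration just described (function and variable names are ours, and the safeguards are simplifications); fg returns the objective value and gradient of \(f(x)+\tau h(x,y)\) for the current \(\tau\), and P computes the projection onto \(\mathit{\widehat{\Omega}}\), e.g., via the Dykstra routine sketched above:

function z = spg(z, fg, P, tol, maxit, M, gamma)
% Spectral projected gradient method with a nonmonotone line search.
[fz, g] = fg(z);
fhist   = -inf(M, 1);  fhist(1) = fz;          % window of last (at most M) f-values
alpha   = 1;                                   % initial step length
for j = 1:maxit
    if norm(P(z - g) - z) <= tol, break; end   % stationarity test
    d = P(z - alpha*g) - z;                    % projected gradient direction
    lambda = 1;                                % try the full step first
    while true                                 % nonmonotone backtracking
        znew = z + lambda*d;
        [fnew, gnew] = fg(znew);
        if fnew <= max(fhist) + gamma*lambda*(g'*d), break; end
        lambda = lambda/2;
        if lambda < 1e-16, break; end          % guard, for a sketch only
    end
    s = znew - z;  w = gnew - g;
    if s'*w > 0
        alpha = (s'*s)/(s'*w);                 % spectral choice of step length
    else
        alpha = 1;                             % simple safeguard
    end
    z = znew;  g = gnew;
    fhist(mod(j, M) + 1) = fnew;               % update the nonmonotone window
end
end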

Summing up, our proposed combined algorithm is now described in detail.


Algorithm Penalty-SPG (PSPG)

S0: Given \(\tau _{-1} >0\), and vectors \(x_{-1}\) and \(y_{-1}\); set \(k=0\).

S1: Compute \(\tau _k > \tau _{k-1}\).

S2: Set \(x_{k,0} = x_{k-1}\) and \(y_{k,0} = y_{k-1}\), and from \((x_{k,0},y_{k,0})\) apply the SPG method to (10), until

$$\Vert P_{\widehat{\Omega }}((x_{k,m_{k}},y_{k,m_{k}}) - \nabla f(x_{k,m_{k}},y_{k,m_{k}})) - (x_{k,m_{k}},y_{k,m_{k}})\Vert _2\le { tol_1}$$

is satisfied at some iteration \(m_{k}\ge 1\). Set \(x_{k} = x_{k,m_{k}}\) and \(y_{k} = y_{k,m_{k}}\).

S3: If

$$h(x_{k},y_{k}) \le { tol_2} \; \text{ and } \; |f(x_{k})-f(x_{k-1})| \le tol_2,$$

then stop. Otherwise, set \(k= k+1\) and return to S1.

We note that at any iteration \(k\ge 1\), Step S2 of Algorithm PSPG starts from \((x_{k-1},y_{k-1})\), which is the previous solution of (4), obtained using \(\tau _{k-1}\). We also note that to stop the SPG iterations, we monitor the value of \(\Vert P_{\mathit{\widehat{\Omega}}}((x_k,y_k) - \nabla f(x_k,y_k)) - (x_k,y_k)\Vert _2\). It is worth recalling that if \(\Vert P_{\mathit{\widehat{\Omega}}}((x,y) - \nabla f(x,y)) - (x,y)\Vert _2=0\), then \((x,y) \in \mathit{\widehat{\Omega}}\) is stationary for problem (4) (see, e.g., [6, 8]). Each SPG iteration uses Dykstra’s alternating projection scheme to obtain the required projection onto \(\mathit{\widehat{\Omega}}\), and this internal iterative process is stopped when (6) is satisfied.
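To make the combination concrete, a minimal Matlab driver for Algorithm PSPG could look as follows. Here fx returns f and its gradient; hadamard_penalty, spg, and the projection P are as sketched above; all names are ours; and the simple doubling of the penalty parameter in S1 is only a placeholder for the controlled rule actually used in Section 5:

function [x, y] = pspg(x, y, fx, P, tau, tol1, tol2, maxouter)
% Penalty-SPG (PSPG): a minimal sketch of steps S0-S3.
n = numel(x);  fprev = Inf;
for k = 0:maxouter
    tau = 2*tau;                                    % S1 (placeholder rule)
    fg  = @(z) penalized(z, n, fx, tau);            % objective of (4)
    z   = spg([x; y], fg, P, tol1, 1000, 10, 1e-4); % S2: solve the subproblem
    x = z(1:n);  y = z(n+1:2*n);
    h  = hadamard_penalty(x, y);
    fk = fx(x);
    if h <= tol2 && abs(fk - fprev) <= tol2         % S3: stopping test
        break;
    end
    fprev = fk;
end
end

function [fval, g] = penalized(z, n, fx, tau)
% f(x) + tau*h(x,y) and its gradient, for z = (x; y).
x = z(1:n);  y = z(n+1:2*n);
[fv, gf]    = fx(x);
[h, gx, gy] = hadamard_penalty(x, y);
fval = fv + tau*h;
g    = [gf + tau*gx; tau*gy];
end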

4 Cardinality Constrained Optimal Portfolio Problem

Let the vector \(v\in \mathbb {R}^n\) and the symmetric positive semi-definite matrix \(Q\equiv [\sigma _{ij}]_{i,j=1,\ldots ,n}\in \mathbb {R}^{n\times n}\) be the given mean return vector and variance-covariance matrix of the n available risky assets, respectively. The entry \(\sigma _{ij}\) of Q is the covariance between assets i and j for \(i,j=1,\ldots ,n\), with \(\sigma _{ii}=\sigma _i^2\) and \(\sigma _{ij}=\sigma _{ji}\). Following the pioneering work of Markowitz [32], the mean-variance portfolio selection problem can be formulated as (1), where the objective function is given by

$$\begin{aligned} f(x) = \frac{1}{2} x^{\top }Qx, \end{aligned}$$
(7)

and the convex set \(\mathit{\Omega} = \{x\in \mathbb {R}^n: v^{\top } x \ge \rho , \;\; e^{\top } x=1, \;\; {0 \le x_i\le u_i}, \; i=1,\ldots ,n\}\) represents, respectively, the constraint of a minimum expected return level \(\rho\), the budget constraint (\(e^{\top } x = \sum _{i=1}^n x_i = 1\) means that all available wealth will be invested), and lower (\(x\ge 0\) excludes short sale) and upper bounds \(u_i\) for each \(x_i\). Notice that the minimization of f(x), involving the given covariance matrix Q, accounts for the minimization of the variance, while the return is expected to be at least \(\rho\). Notice also that, as previously discussed, in this case the set \(\mathit{\Omega}\) is the intersection of three easy convex sets: a half-space, a hyperplane, and a box; closed-form projections onto each of them are sketched below. The additional constraint in (1), \(\Vert x\Vert _0 \le \alpha\) for \(0<\alpha <n\), plays a key role here and indicates that, among the n available risky options, we can invest in at most \(\alpha\) assets (cardinality constraint). The solution vector x denotes an investment portfolio, and each \(x_i\) represents the fraction held of asset i. It should be mentioned that other inequality and/or equality constraints can be added to the problem to represent additional real-life requirements, e.g., transaction costs [3, 26].
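Each of these three sets, as well as the two convex constraints on y in (2), admits a closed-form projection; a minimal Matlab sketch (the handle names are ours) is:

% Closed-form projections used by Dykstra's method for the portfolio problem:
P_ret  = @(x) x + max(0, (rho - v'*x)/(v'*v)) * v;  % half-space v'x >= rho
P_bud  = @(x) x + (1 - sum(x))/numel(x);            % hyperplane e'x = 1
P_box  = @(x) min(max(x, 0), u);                    % box 0 <= x <= u
P_card = @(y) y + max(0, (n - alpha - sum(y))/n);   % half-space e'y >= n - alpha
P_unit = @(y) min(max(y, 0), 1);                    % box 0 <= y <= 1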

Now, as discussed above, our main idea is to consider the continuous formulation (2) instead of the optimization problem (1). For the portfolio selection problem, we would end up with the following problem that involves the auxiliary vector y:

$$\begin{aligned} \begin{array}{ll} \displaystyle \min _{x,y} & \frac{1}{2} x^{\top }Qx \\ \text{subject to:} & v^{\top } x \ge \rho , \\ & e^{\top } x = 1, \\ & 0 \le x_i\le u_i, \; \text{ for all }\; 1\le i \le n, \\ & e^{\top } y \ge n-\alpha , \\ & x\circ y = 0, \\ & 0 \le y_i \le 1, \; \text{ for all }\; 1\le i \le n, \end{array} \end{aligned}$$
(8)

where the upper bound vector \(u\in \mathbb {R}^n\) and \(\rho >0\) are given. Note that the vector y appears only in the last three constraints, while the vector x appears in the first three constraints and also in the (non-convex) Hadamard constraint \(x\circ y = 0\).

As discussed in Section 2, a natural way to force the Hadamard condition at the solution, while keeping the feasible set of our problem as the intersection of a finite collection of easy convex sets, is to add the term \(\tau h(x,y)\) to the objective function, where our convenient choice here is \(h(x,y)= x^{\top }y\):

$$\begin{aligned} f(x,y) = \frac{1}{2} x^{\top }Qx + \tau x^{\top }y, \end{aligned}$$
(9)

where \(\tau >0\) is a penalization parameter that needs to be properly chosen, as described in Section 2. Since the vectors x and y will be forced by the alternating projection scheme to have all their entries greater than or equal to zero, \(h(x,y)= x^{\top }y\ge 0\) for any feasible pair (x, y), and forcing \(x^{\top }y=0\) is equivalent to forcing the Hadamard condition \(x_i y_i = 0\) for all i. Notice that setting \(\tau =0\) when solving (8) with f(x, y) given by (9) minimizes the risk, independently of the Hadamard condition. On the other hand, if \(\tau >0\) is sufficiently large as compared to the size of Q, then the term \(x^{\top }y\) must be zero at the solution. Hence, choosing \(\tau >0\) represents an explicit trade-off between the risk and the Hadamard condition.

Our algorithmic proposal consists of solving a sequence of penalized problems, as described in Section 2, using the SPG scheme together with Dykstra’s alternating projection method (a combination that, from now on, will be denoted as the SPG method) to solve problem (8) without the complementarity constraint \(x\circ y=0\), using the objective function given by (9). That is, for a sequence of increasing penalty parameters \(\tau _k>0\), we will solve the following problems:

$$\begin{aligned} \begin{array}{ll} \displaystyle \min _{x,y} & \frac{1}{2} x^{\top }Qx + \tau _k x^{\top }y\\ \text{subject to:} & v^{\top } x \ge \rho , \\ & e^{\top } x = 1, \\ & 0 \le x_i\le u_i, \; \text{ for all } 1\le i \le n, \\ & e^{\top } y \ge n-\alpha , \\ & 0 \le y_i \le 1, \; \text{ for all } 1\le i \le n. \end{array} \end{aligned}$$
(10)

Since the function \(h(x,y)= x^{\top }y\) satisfies the properties mentioned in Section 2, if we choose the sequence of parameters \(\{\tau _k\}\) such that \(h(x_k,y_k)\) goes to zero when k goes to infinity, then Theorem 1 guarantees the convergence of the proposed scheme.

Before showing some computational results in our next section, let us recall that the gradient and the Hessian of the objective function f at every pair (x, y) are given by

$$\begin{aligned} \nabla f(x,y) = \left( \begin{array}{c} Qx + \tau _k y \\ \tau _k x\end{array}\right) \;\; \text{ and } \;\; \nabla ^2 f(x,y) = \left( \begin{array}{cc} Q & \tau _k I \\ \tau _k I & 0 \end{array}\right) . \end{aligned}$$

Notice that, for any \(\tau _k >0\), \(\nabla ^2 f(x,y)\) is symmetric and indefinite.
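In code, the penalized objective (9) and this gradient can be bundled, for a fixed value tauk, into a single handle for the SPG routine (a sketch; fg_port is our name):

% Value and gradient of (9) at the pair (x, y), for given Q and penalty tauk:
fg_port = @(x, y, tauk) deal(0.5*(x'*(Q*x)) + tauk*(x'*y), ...
                             [Q*x + tauk*y; tauk*x]);
% Usage: [fval, grad] = fg_port(x, y, tauk);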

5 Computational Results

To illustrate the advantages of our proposed combined scheme, we present the results of some numerical experiments on a simple academic problem (\(n=6\)) and also on some data sets involving real-world capital market indices from major stock markets. All the experiments were performed using Matlab R2022 with double precision on an Intel\(^{\circledR }\) Quad-Core i7-1165G7 at 4.70 GHz with 16 GB of RAM, under 64-bit Windows 10 Pro.

For our experiments, we use Algorithm PSPG described in Section 3, setting \(x_{-1}=(1/n)e\), \(y_{-1} = 0\), \(tol_1=10^{-6}\), and \(tol_2=10^{-8}\). We recall that for the portfolio problems \(h(x_{k},y_{k}) = x_{k}^{\top }y_{k}\). The value of \(\Vert P_{\mathit{\widehat{\Omega}}}((x_k,y_k) - \nabla f(x_k,y_k)) - (x_k,y_k)\Vert _2\) will be denoted as the pgnorm at iteration k (see the tables below). Concerning the nonmonotone line search strategy used by the SPG method, we set \(\gamma = 10^{-4}\) and \(M=10\). Dykstra’s alternating projection scheme is stopped when (6) is satisfied with \(\varepsilon = 10^{-8}\).

To explore the behavior of Algorithm PSPG, we will vary the minimum expected return parameter \(\rho >0\) and the cardinality bound, a positive integer \(1\le \alpha <n\). In all cases, we set the upper bound vector \(u=e\), where e is the vector of ones. Of course, for certain combinations of all those parameters, the problem might be infeasible. We will discuss possible choices of these parameters that guarantee that the feasible region of problem (10) is not empty.

To keep a balanced trade-off between the risk and the Hadamard condition, it is convenient to choose the initial parameter \(\tau _{-1}>0\) of the same order of magnitude as the largest eigenvalue of Q. For that, we proceed as follows: set \(z= Qe\) and \({ {\tau _{-1}}} = z^{\top }Qz/(z^{\top }z)\), i.e., a Rayleigh quotient of Q with a suitable vector z, which produces a good estimate of \(\lambda _{\max }(Q)\). This choice worked well for the vast majority of the test examples. According to Remark 1, to observe convergence, we need to drive the inner product \(x_k^{\top }y_k\) down to zero. For that, we increase the penalization parameter as follows:

$$\begin{aligned} \tau _{k+1}=\delta _{k+1}\tau _k\;\; \text{ where } \;\; \delta _{k+1}=\delta _k+\frac{(n-\alpha )\rho }{n}\dfrac{|v^{\top }x_{k+1}|}{\sqrt{x_{k+1}^{\top }Qx_{k+1}}} \;\; \text{ and } \;\; \delta _{-1}=1. \end{aligned}$$
(11)

We note that in practice, this formula increases the penalty parameter in a controlled way taking into account the ratio between the absolute value of the current return \(|v^{\top }x_{k+1}|\) and the current risk \(\sqrt{x_{k+1}^{\top }Qx_{k+1}}\). In all the reported experiments, the controlled sequence \(\{\tau _k\}\) given by (11) was enough to guarantee that the Hadamard product goes down to zero.
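In code, one outer iteration of the update (11) amounts to two lines (variable names ours, with delta initialized to 1 according to \(\delta _{-1}=1\)):

% Controlled update (11) of the penalty parameter after outer iteration k;
% x holds the current iterate x_{k+1}:
delta = delta + ((n - alpha)*rho/n) * abs(v'*x) / sqrt(x'*(Q*x));
tau   = delta * tau;   % tau_{k+1} = delta_{k+1} * tau_k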

Concerning the choice of the expected return, based on [13, 36], and in order to consider feasible problems, we study the behavior of our combined scheme on an interval \([\rho _{\min }, \rho _{\max }]\) of possible values of the parameter \(\rho\), obtained as follows. Let \(\rho _{\min }=v^{\top }x_{\min }\) and \(\rho _{\max }=v^{\top }x_{\max }\), where \(x_{\min } =\arg \min _{x}\; \dfrac{1}{2}x^{\top }Qx+\tau x^{\top }y\;\) and \(\;x_{\max } =\arg \max _{x} \; v^{\top }x-\tau x^{\top }y\), both subject to \(e^{\top }x=1\), \(e^{\top } y \ge n-\alpha\), \(0 \le x_i\le u_i\), and \(0 \le y_i \le 1\), for all \(1\le i \le n\). These two auxiliary optimization problems are solved in advance, only once for each considered problem, using in turn the proposed Algorithm PSPG; for that, we fix the same parameters and start from the same initial values indicated above. Once the interval \([\rho _{\min }, \rho _{\max }]\) has been obtained, a suitable return \(\rho\) is chosen as follows. For a fixed \(0<\tilde{\epsilon }<1\), let \(\rho =\rho _{\min }+\tilde{\epsilon }(\rho _{\max }-\rho _{\min })\); if this value is nonnegative, we keep it; else, if \(|\rho |\le v_{\max }\), we set \(\rho =\tilde{\epsilon }|\rho |\); otherwise, we set \(\rho =\tilde{\epsilon } v_{\max }\). Here, \(v_{\min }=\min \{v_1,\ldots ,v_n\}\) and \(v_{\max }=\max \{v_1,\ldots ,v_n\}\).

For our first data set, we consider a simple portfolio problem with \(n=6\) available assets, denoted as Simple-case, for which the mean return vector v and the covariance matrix Q are given by

$$v=(0.021\;\; 0.04\;\; -0.034\;\; -0.028\;\; -0.005\;\; 0.006)^{\top },$$
$$Q = \left[ \begin{array}{rrrrrr} 0.038 & 0.020 & 0.017 & 0.014 & 0.019 & 0.017 \\ 0.020 & 0.043 & 0.015 & 0.013 & 0.021 & 0.014 \\ 0.017 & 0.015 & 0.034 & 0.011 & 0.014 & 0.014 \\ 0.014 & 0.013 & 0.011 & 0.044 & 0.014 & 0.011 \\ 0.019 & 0.021 & 0.014 & 0.014 & 0.040 & 0.014 \\ 0.017 & 0.014 & 0.014 & 0.011 & 0.014 & 0.046 \end{array} \right] .$$

We note that Q is symmetric and positive definite (\(\lambda _{\min }(Q) =1.79 \times 10^{-2}\) and \(\lambda _{\max }(Q)=1.17\times 10^{-1}\)). Notice that assets three, four, and five have negative average returns. The purpose of this simple example is to demonstrate properties of the problem and the proposed algorithm in an easy-to-follow fashion. For the other data sets, involving real-world capital market indices, we consider some larger problems obtained from Beasley’s OR Library (http://people.brunel.ac.uk/~mastjjb/jeb/info.html), built from weekly price data from March 1992 to September 1997, which we will denote as Port1 (Hang Seng index with \(n=31\)), Port2 (DAX index with \(n=85\)), Port3 (FTSE 100 index with \(n=89\)), Port4 (S&P 100 index with \(n=98\)), and Port5 (Nikkei index with \(n=225\)) (see also [1, 15, 24]).

The key properties, to be discussed and illustrated in the rest of this section, are the influence of the cardinality constraint on the feasible set in the risk-return plane, the efficient frontier, and the quality of the solutions obtained by Algorithm PSPG. The feasible set is usually represented in the risk-return plane, showing all possible combinations of assets that satisfy the constraints. In general, the feasible set for the classical problem without cardinality constraint has the so-called bullet shape. The efficient frontier is the set of optimal portfolios that offer the highest expected return for a given level of risk, or the lowest risk for a given level of expected return.

Introducing the cardinality constraint might complicate the feasible set in the sense that the set shrinks, as we will now show. Starting with the feasible interval for the expected return, we report in Table 1 the values \(\rho _{\max }\le v_{\max }\) and \(\rho _{\min } \ge v_{\min }\), for \(\alpha =5\) and all the considered data sets.

Table 1 Return value with \(\alpha = 5\) for all data sets

Let us now take a closer look at the Simple-case. If we solve the original Markowitz problem [32], i.e., the minimal variance portfolio problem \(\displaystyle \min _{x} \frac{1}{2} x^{\top }Qx\) subject to \(e^{\top } x = 1\), for the Simple-case problem, we obtain

$$\bar{x}=(0.0961,0.1168,0.2625,0.2140,0.1429,0.1677)^{\top },$$

risk \(\sqrt{\bar{x}^{\top }Q\bar{x}}=0.1379\), and expected return \(v^{\top }\bar{x}=-0.0079\). Solving the same problem with the additional constraint \(x\ge 0\), we get the same solution; thus, the minimal variance portfolio is the same as the minimal variance portfolio without short sale. In Fig. 1, we present, for the Simple-case problem, the return and risk for all 6 assets, the minimal variance portfolio, denoted by MVP, the classical Markowitz portfolio without short sale and with the expected return constraint \(v^{\top }x\ge \rho =0.002\), denoted by MP, as well as the efficient frontier for different values of the cardinality bound \(\alpha\). Clearly, for \(\alpha = 6\), i.e., without cardinality constraint, we get a classical convex efficient frontier, while for smaller values of \(\alpha\), the curves are discontinuous and deformed (see, e.g., [15] for similar observations).

Fig. 1 Risk versus return, using Algorithm PSPG for the Simple-case problem

For the Simple-case problem, with \(n=6\) available assets, an approximation of the feasible set is shown in Fig. 2, obtained by running a simulation based on finite sampling. In our simulation, we pay more attention to the left side in order to observe the bullet shape; as a consequence, only a few scattered points are shown on the right side of the figure. We note that for a larger value of \(\alpha\), we get a larger feasible set. We also note that the bullet shape is not affected by the cardinality constraint but, as expected, the set shrinks as the number of zero elements increases.

Fig. 2 Feasible set for the Simple-case and \(\alpha =2,3,4,5\), and 6

The same conclusions apply to the larger data sets coming from real assets. In Fig. 3, we show the approximate feasible set for Port1. We note that, once again, the area shrinks as \(\alpha\) decreases, and that the same holds for all considered cases.

Fig. 3 Feasible set for Port1 and \(\alpha =6, 11, 16, 21, 26\), and 31

The efficient frontier for Port1 is shown in Fig. 4. Again, we observe that the efficient frontier is deformed by the value of the cardinality constraint, and when \(\alpha < n\), it is not a convex curve. For the sake of completeness, in the Appendix, we provide some tables with more detailed results, varying the cardinality constraints, for all considered data sets. We can observe in all figures and tables the effectiveness of our low-cost continuous approach (Algorithm PSPG).

Fig. 4 Risk versus return, using Algorithm PSPG for Port1 and \(\alpha =6,11,16,21,26\), and 31

Additionally, we compare our approach to IBM ILOG CPLEX Optimization Studio, Version 22.1.0.0. CPLEX is a mixed integer quadratic programming (MIQP) solver. We note that for these problems, the solution provided by CPLEX is the globally optimal one, up to the provided tolerances. The goal of the comparison is to investigate the quality of the solutions obtained by PSPG and CPLEX in terms of risk and return. We also report CPU time, although CPLEX is implemented in a low-level language and therefore requires significantly less execution time than our high-level Matlab implementation; hence, CPU time might be misleading. For solving the problems with CPLEX, we consider the following MIQP formulation instead of (1):

$$\begin{aligned} \begin{array}{ll} \displaystyle \min _{x,y} & \frac{1}{2} x^{\top }Qx\\ \text{subject to:} & v^{\top } x \ge \rho , \\ & e^{\top } x = 1, \\ & e^{\top } y \ge n-\alpha , \\ & 0 \le x_i\le 1, \; \text{ for all } 1\le i \le n, \\ & x_i+y_i\le 1, \; \text{ for all } 1\le i \le n, \\ & y_i \in \{0,1\}. \end{array} \end{aligned}$$

Notice that in the above formulation, we do not have the Hadamard constraint; instead, we have \(x_i + y_i \le 1\) together with \(y_i \in \{0,1\}\). CPLEX is designed to work with linear constraints, and since \(0\le x_i\le 1\), the constraint \(x_i + y_i \le 1\) with binary \(y_i\) forces \(x_i = 0\) whenever \(y_i = 1\), which amounts to the same condition. It is worth mentioning that (1) can also be formulated as a convex mixed integer non-linear program (MINLP) (see, e.g., [27]). Therefore, using a convenient formulation, (1) can also be solved by branch-and-bound, e.g., using BARON [34] or SCIP [35], or by outer approximation strategies [2], e.g., using SHOT [31].

The details of the tests for all considered data sets are presented in Tables 3, 4, 5, 6, 7, and 8 in the Appendix. One can easily see that PSPG produces solutions with slightly higher risk and significantly better return. In Table 5, we observe that CPLEX needs a very large number of iterations to solve the problem for \(\alpha \le 20\), which corresponds to the fact that PSPG needed a special value of \(\tau _{-1}\) for these values of \(\alpha\), as well as large values of the penalty parameter \(\tau\). Thus, this behavior is associated with the data of Port2. In some other cases, reported in the tables in the Appendix, we can observe a rather large number of CPLEX iterations for small values of \(\alpha\), while PSPG solved the same problems with reasonably small values of the penalty parameters.

An interesting observation from the literature, confirmed by our experiments, is the fact that the optimal portfolio without cardinality constraint is in fact sparse. In Table 2, we report the number of assets obtained by our algorithm and CPLEX, which is in accordance with the results reported in [13, Figure 5] and [14, Section 5.2.2]. We can observe that the number of assets in the unconstrained mean-variance optimal portfolio satisfies \(\Vert x^*\Vert _0\le 12\) for Port1, \(\Vert x^*\Vert _0\le 40\) for Port4, and \(\Vert x^*\Vert _0\le 15\) for Port5.

Table 2 Performance of Algorithm PSPG for all cases when \(n=\alpha\)

As noticed above, the feasible set of (8) is contained in the feasible set of (10). In addition, if a solution of (10) satisfies the Hadamard condition, then it is also a solution of (8). Then, by Theorem 2, if \(({x}^*,{y}^*)\) is a local minimizer of (10) satisfying \(\Vert {x}^*\Vert _0=\alpha\), then the components of \({y}^*\) are binary, \({y}^*\) is unique, and \(x^*\) is a local minimizer of (1). In fact, for the solutions reported in Tables 3, 4, and 6 in the Appendix, whenever \(\Vert x^*\Vert _0=\alpha\), the components of \({y}^*\) are binary. When the cardinality constraint is not active, \(y^*\) may have non-binary entries; interestingly, for Port1 with \(\alpha =n=31\), we obtain a binary \(y^*\) even though the cardinality constraint is not active (\(\Vert x^*\Vert _0=12\)). Another interesting example is detected for Port3 with \(\alpha =n=89\), in which we also obtain a binary \(y^*\) but \(\Vert x^*\Vert _0=34\).

6 Conclusions and Final Remarks

Taking advantage of a recently developed continuous formulation, we have developed and analyzed a low-cost and effective computational scheme for finding a solution candidate of convex constrained optimization problems that also include a “hard-to-deal-with” cardinality constraint. As in many applications, we assume that the region defined by the convex constraints can be written as the intersection of a finite collection of “easy-to-project” convex sets. Under this continuous formulation, to fulfill the cardinality constraint, the Hadamard condition \(x\circ y = 0\) must be satisfied between the solution vector x and an auxiliary vector y. In our scheme, this condition is achieved by adding a non-negative penalty term h(x, y) and using a classical penalization strategy. For each penalty subproblem, a convex constrained problem must be solved, which in our proposal is achieved by combining two low-cost computational schemes: the spectral projected gradient (SPG) method and Dykstra’s alternating projection method.

To illustrate the computational performance of our combined scheme, we have considered in detail the standard mean-variance limited diversified portfolio selection problem, which involves obtaining the proportion of the initial budget that should be allocated to a limited number of the available assets. For this specific application, we proposed a natural differentiable choice of the penalty term (given by \(h(x,y)= x^{\top }y\)) that must be driven to zero, which allowed us to develop a simple way of increasing the associated penalty parameter in a controlled and bounded way. In our numerical study, we have included a variety of data sets involving real-world capital market indices. For these data sets, we have produced the feasible sets and also the efficient frontier (a curve illustrating the trade-off between risk and return) for different values of the limited number of allowed assets. In each case, we highlighted the differences that arise in the shape of these efficient frontiers as compared with the unconstrained one. The presented numerical study includes a comparison with CPLEX, a professional software package for general mixed integer programming problems. The comparison is presented in terms of quality of solution (higher return, lower risk), and PSPG appears to be competitive.

In our modeling of the portfolio problem, we have bounded the proportion to be invested in each of the selected assets between 0 and 1. However, without altering our proposed scheme, stricter upper limits (less than 1) can be imposed on some particular assets. Clearly, this would require a more careful analysis of the feasible options for the expected return. Moreover, it could also be interesting from a portfolio point of view to allow negative entries in some of the proportions to be invested, which can be accomplished by allowing negative values in the lower bounds of the solution vector. In that case, the penalization term that forces the Hadamard condition needs to be chosen accordingly (e.g., \(h(x,y) = \sum _{i=1}^n (x_i^2 y_i)\)).