Convergent Inexact Penalty Decomposition Methods for Cardinality-Constrained Problems

In this manuscript, we consider the problem of minimizing a smooth function under a cardinality constraint, i.e., the constraint requiring that the ℓ0-norm of the vector of variables cannot exceed a given threshold value. A well-known approach in the literature is the class of penalty decomposition methods, where a sequence of penalty subproblems, depending on the original variables and on new auxiliary variables, is inexactly solved by a two-block decomposition method. The inner iterates of the decomposition method require exact minimizations with respect to the two blocks of variables. The computation of the global minimum with respect to the original variables may be prohibitive when the objective function is nonconvex. In order to overcome this nontrivial issue, we propose a modified penalty decomposition method, where the exact minimizations with respect to the original variables are replaced by suitable line searches along gradient-related directions. We also present a derivative-free penalty decomposition algorithm for black-box optimization. We state convergence results for the proposed methods, and we report the results of preliminary computational experiments.


Introduction
In this work, we consider the problem of minimizing a smooth function with a sparsity constraint (cardinality constraint). Optimization problems where sparse solutions are sought arise frequently in modern science and engineering. Just as examples, applications of sparse optimization include compressed sensing in signal processing [1,2], best subset selection [3][4][5][6] and sparse inverse covariance estimation [7,8] in statistics, sparse portfolio selection [9] in decision science, and neural network compression in machine learning [10,11].
It is well known that optimization problems involving the ℓ0 norm are NP-hard [12,13]. Hence, classes of algorithms have been proposed through the years to approximately solve cardinality-constrained problems. Examples of effective methods are given by the Lasso [14] and other ℓp-reformulation approaches [15,16].
Particularly useful algorithms designed to deal with cardinality-constrained optimization problems are the greedy sparse simplex method [17] and the class of penalty decomposition (PD) methods [18]. The former is specifically designed for problems of the form (1), while the latter has been devised to deal with cardinality-constrained problems characterized by the presence of further standard constraints. These methods, based on different approaches, enjoy theoretical convergence properties and are computationally efficient in the solution of cardinality-constrained problems. However, they require the exact solution, at each iteration, of suitable subproblems (of dimension 1 in the case of the greedy sparse simplex method, and of dimension n for PD methods). This may be prohibitive when either the objective function is nonconvex or the finite termination of an algorithm applied to a convex subproblem cannot be guaranteed. This latter issue typically occurs when the convex function is not quadratic. Note that there are several applications of sparse optimization involving nonconvex objective functions (see, e.g., [10]).
The aim of the present work is to tackle cardinality-constrained problems by defining convergent algorithms that do not require computing the exact solution of (possibly nonconvex) subproblems. To this aim, we focus on the approach of the PD methods and we present two contributions: (a) the definition of a PD algorithm performing inexact minimizations by an Armijo-type line search [19] along gradient-related directions; (b) the definition of a derivative-free PD method for sparse black-box optimization.
The two algorithms share the penalty decomposition approach, but differ significantly in the inexact minimization steps and in the definition of the inner stopping criterion. We perform a theoretical analysis of the proposed methods, and we state convergence results that are equivalent to those of the original PD methods [18] but, in general, weaker than those of the greedy sparse simplex method [17]. Finally, we remark that, to our knowledge, convergent derivative-free methods for cardinality-constrained problems were not known, and this makes the derivative-free algorithm proposed in the present work particularly attractive. The paper is organized as follows. In Sect. 2, we address optimality conditions for problem (1); we also describe the PD method originally introduced in [18]. In Sect. 3, we propose a modified version of the PD algorithm and we state global convergence results. In Sect. 4, we present a derivative-free PD method for black-box optimization and we prove the global convergence of the proposed method. The results of preliminary computational experiments, limited to a class of convex problems, are reported in Sect. 5 and show the validity of the proposed approach. Finally, Sect. 6 contains some concluding remarks.

Background
In this work, we consider the following optimization problem:

min_{x ∈ R^n} f(x)   s.t.   ‖x‖_0 ≤ s,    (1)

where f : R^n → R is a continuously differentiable function, ‖x‖_0 is the ℓ0 norm of x, i.e., the number of its nonzero components, and s is an integer such that 0 < s < n.
Throughout the paper, we make the following assumption.

Assumption 2.1
The function f : R^n → R is continuously differentiable and coercive on R^n, i.e., for all sequences {x^k} such that x^k ∈ R^n and lim_{k→∞} ‖x^k‖ = ∞, we have lim_{k→∞} f(x^k) = +∞.
The above assumption implies that problem (1) admits a solution. Necessary optimality conditions for problem (1) have been stated in [17], where the basic feasible (BF) property has been introduced. We recall this notion hereafter.

Definition 2.1
We say that a point x̄ ∈ R^n is a BF-vector if it satisfies the conditions stated in [17]. It can be easily shown that if x is an optimal solution of problem (1), then x is a BF-vector.
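For the reader's convenience, the BF conditions can be written as follows (our restatement, following the definition given in [17]; the precise formulation should be checked against that reference):

```latex
\bar{x} \in \mathbb{R}^n \ \text{with} \ \|\bar{x}\|_0 \le s \ \text{is a BF-vector if}
\quad
\begin{cases}
\nabla_i f(\bar{x}) = 0 \ \ \forall\, i \in \operatorname{supp}(\bar{x}), & \text{if } \|\bar{x}\|_0 = s,\\[2pt]
\nabla f(\bar{x}) = 0, & \text{if } \|\bar{x}\|_0 < s.
\end{cases}
```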
Necessary optimality conditions for cardinality-constrained problems with additional nonlinear constraints have been studied in [18]. Such conditions have been used to study the convergence of the PD method proposed in the same work. In the case of problem (1), the aforementioned necessary optimality conditions are simplified. In particular, on the basis of the convergence analysis performed in [18], we introduce the following definition. If x is an optimal solution, then x satisfies the Lu-Zhang first-order optimality conditions [18]. It can be easily verified that a BF-vector satisfies the Lu-Zhang conditions. The converse is not necessarily true, i.e., the Lu-Zhang conditions are weaker than the optimality conditions expressed by the BF property. We show this with the following example.
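In the setting of problem (1), the Lu-Zhang conditions can be summarized as follows (again our restatement; see [18] for the formal definition):

```latex
\exists\, I \subseteq \{1,\dots,n\} \ \text{with} \ |I| = s,\ \operatorname{supp}(\bar{x}) \subseteq I,
\quad \text{such that} \quad \nabla_i f(\bar{x}) = 0 \ \ \forall\, i \in I.
```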

The Projection onto the Feasible Set
Consider the problem of computing the orthogonal projection of a vector x̄ onto the feasible set, i.e., problem (2). Since the feasible set is not convex, the solution of (2) is not necessarily unique. A globally optimal solution can be computed in closed form by taking the s components of x̄ with the largest absolute value [17]. To formally characterize the solution, let us define the index set I(x) of the largest nonzero variables (in absolute value) at a generic point x ∈ R^n. In general, the index set I(x) is not uniquely defined. Then, from (4), the solution of problem (2) is obtained by retaining the components of x̄ indexed by I(x̄) and setting the remaining ones to zero.
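This closed-form projection can be sketched as follows (a minimal illustration; the function name is ours, and ties among equal-magnitude components are broken arbitrarily, consistently with the non-uniqueness of I(x)):

```python
import numpy as np

def project_cardinality(x_bar, s):
    """Orthogonal projection of x_bar onto {y in R^n : ||y||_0 <= s}:
    keep the s components of largest absolute value, zero out the rest."""
    y = np.zeros_like(x_bar)
    idx = np.argpartition(np.abs(x_bar), -s)[-s:]  # indices of the s largest |x_i|
    y[idx] = x_bar[idx]
    return y
```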

The Penalty Decomposition Method
Applying the classical variable splitting technique [20], problem (1) can be equivalently expressed as

min_{x, y ∈ R^n} f(x)   s.t.   x = y,  ‖y‖_0 ≤ s.    (5)

For simplicity, in the following, we will denote Y = {y ∈ R^n : ‖y‖_0 ≤ s}. The quadratic penalty function associated with (5) is

q_τ(x, y) = f(x) + (τ/2) ‖x − y‖²,

where τ > 0 is the penalty parameter.
In [18], the penalty decomposition (PD) method (see Algorithm 1) was proposed to solve Problem (5). In particular, the approach is that of approximately solving a sequence of penalty subproblems by a two-block decomposition method. The algorithm starts from a point (x 0 , y 0 ) that is feasible for problem (5). At every iteration, the algorithm performs the block coordinate descent (BCD) method [21,22] w.r.t. the two blocks of variables x and y, until an approximate stationary point of the penalty function w.r.t. the x block is attained. Then, the penalty parameter τ k is increased for the successive iteration, where a higher degree of accuracy is required to approximate a stationary point.
Note that, as discussed in Sect. 2.1, the y-update step can be performed by computing the closed-form solution of the related subproblem. At the beginning of each iteration, before starting the BCD loop, a test is performed to ensure that the points of the generated sequence belong to a compact level set. This is done in order to guarantee that the sequence generated by the PD method is bounded, so that it admits limit points. In [18] it is proved that each limit point is feasible and satisfies Lu-Zhang conditions.
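To fix ideas, the structure of the inner two-block loop can be sketched as follows (a simplified illustration for a single penalty subproblem; `argmin_x` stands for any exact solver of the x-subproblem, and all names are ours — the actual Algorithm 1 also includes the level-set test and the update of the penalty parameter):

```python
import numpy as np

def project_cardinality(x, s):
    """Closed-form y-update: keep the s largest-magnitude components of x."""
    y = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -s)[-s:]
    y[idx] = x[idx]
    return y

def bcd_inner_loop(argmin_x, grad_x_q, y0, tau, eps, s, max_iter=1000):
    """Inner two-block loop of the PD scheme for a fixed penalty parameter tau:
    alternate the x-update (argmin_x solves min_u q_tau(u, v)) with the
    closed-form y-update, until the gradient of the penalty function with
    respect to the x block is small."""
    v = y0.copy()
    u = y0.copy()
    for _ in range(max_iter):
        u = argmin_x(v, tau)               # x-block minimization
        v = project_cardinality(u, s)      # y-block: projection onto Y
        if np.linalg.norm(grad_x_q(u, v, tau)) <= eps:
            break
    return u, v
```

For a quadratic f, the x-update has a closed form, so the loop above can be run as-is; for general f, `argmin_x` would be an iterative solver, which is precisely the issue addressed by the inexact variants below.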

An Inexact Penalty Decomposition Method
Algorithm 1 has been shown to be effective in practice [18]. However, it requires computing, in the inner iterations of the block decomposition method, the exact solution of a sequence of subproblems in the x variables (see steps 5 and 10). This may be prohibitive when either the objective function is nonconvex or the finite termination of an algorithm applied to a convex subproblem cannot be guaranteed. On the other hand, the convergence analysis performed in [18] relies strongly on the assumption that the global minima of the subproblems in the x variables are computed. In order to overcome this nontrivial issue while preserving global convergence properties, we propose a modified version of the algorithm, suitable even for problems with nonconvex objective function.

Algorithm 2: InexactPenaltyDecomposition
The proposed procedure is described in Algorithm 2. The exact minimization with respect to the x variables is replaced by an Armijo-type line search along the steepest descent direction of the penalty function. The line search procedure along a descent direction d is shown in Algorithm 3.
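For illustration, the backtracking scheme can be sketched as follows (a minimal sketch, assuming condition (6) is the classical Armijo sufficient-decrease condition; the function and parameter names are ours):

```python
import numpy as np

def armijo_line_search(phi, grad_phi, x, d, gamma=1e-5, beta=0.5, j_max=60):
    """Backtracking Armijo search along a descent direction d: return the
    first step beta**j satisfying
        phi(x + beta**j * d) <= phi(x) + gamma * beta**j * grad_phi(x) @ d.
    Assumes grad_phi(x) @ d < 0, i.e., d is a descent direction."""
    phi_x = phi(x)
    slope = grad_phi(x) @ d   # directional derivative, negative by assumption
    step = 1.0
    for _ in range(j_max):
        if phi(x + step * d) <= phi_x + gamma * step * slope:
            return step
        step *= beta
    return step  # safeguard; termination is guaranteed in theory
```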
We recall some well-known properties for the Armijo-type line search, later used in the convergence analysis. These results can be found, for instance, in [19].

Algorithm 3: ArmijoLineSearch
It can be easily seen that the algorithm is well defined, i.e., there exists a finite integer j such that β^j satisfies the acceptability condition (6). Moreover, the following result holds.

Remark 3.1
Step 12 of Algorithm 2 can be modified in order to make the algorithm more general. More specifically, the steepest descent direction −∇_x q_{τ_k}(u^ℓ, v^ℓ) could be replaced by any gradient-related direction d^ℓ. In this sense, we have the possibility of arbitrarily defining the updated point u^{ℓ+1}, provided that a suitable sufficient decrease condition is satisfied. It can be easily seen that this modification does not spoil the theoretical analysis we are going to carry out hereafter, while it may bring significant benefits from a computational perspective.

Remark 3.2
As outlined by [18], the stopping condition at line 10 of Algorithm 2 is useful for establishing the convergence properties of the algorithm, but, in practice, different rules could be employed with benefits in terms of efficiency. For example, the progress of the decreasing sequence {q_{τ_k}(u^ℓ, v^ℓ)} might be taken into account. As for the main loop, the whole algorithm can be stopped in practice as soon as x^k and y^k are sufficiently close.
We now address the properties of the inexact penalty decomposition method. Let us introduce the level set L_0(f) = {x ∈ R^n : f(x) ≤ f(x^0)}. Note that L_0(f) is compact, since f is continuous and coercive on R^n. First, we show that q_τ(x, y) is also a coercive function.
Proof Let us consider any pair of sequences {x^k} and {y^k} such that at least one of the following conditions holds: lim_{k→∞} ‖x^k‖ = ∞ or lim_{k→∞} ‖y^k‖ = ∞.
Suppose first that there exists an infinite subset K_1 ⊆ K such that ‖y^k‖ ≤ M for some M > 0 and for all k ∈ K_1. Recalling that f is coercive on R^n, from (7), (8) we obtain a contradiction with (9). Then, we must have lim_{k→∞, k∈K} ‖y^k‖ = ∞. As f is coercive and continuous, it admits a minimum over R^n. Let f* be the minimum value of f. Then, we can conclude that, for any infinite set K, q_τ(x^k, y^k) → ∞, and this contradicts (9).

Now, we can prove that Algorithm 2 is well defined, i.e., that the cycle between step 10 and step 14 terminates in a finite number of inner iterations.

Proposition 3.2
Algorithm 2 cannot infinitely cycle between step 10 and step 14, i.e., for each outer iteration k ≥ 0, the algorithm determines in a finite number of inner iterations a point (x^{k+1}, y^{k+1}) satisfying (11).

Proof Suppose by contradiction that, at a certain iteration k, the sequence {(u^ℓ, v^ℓ)} is infinite. From the instructions of the algorithm, the sequence {q_{τ_k}(u^ℓ, v^ℓ)} is monotonically decreasing. Hence, for all ℓ ≥ 0, the point (u^ℓ, v^ℓ) belongs to a level set of q_{τ_k}, which is compact, since q_{τ_k} is continuous and coercive. Recalling the continuity of the gradient, and taking into account the instructions of the algorithm, we have that d^ℓ → −∇_x q_{τ_k}(ū, v̄) for ℓ ∈ K and ℓ → ∞, and hence ‖d^ℓ‖ ≤ M for some M > 0 and for all ℓ ∈ K. The sequence {q_{τ_k}(u^ℓ, v^ℓ)} is monotonically decreasing, q_{τ_k}(u, v) is continuous and bounded below on the level set, and hence the sequence converges. From (12), it follows that ∇_x q_{τ_k}(u^ℓ, v^ℓ) → 0 for ℓ ∈ K; thus, the stopping criterion of step 10 is satisfied in a finite number of iterations, and this contradicts the fact that {(u^ℓ, v^ℓ)} is an infinite sequence.
Before stating the global convergence result, we prove that the sequence generated by the algorithm admits limit points and that every limit point (x,ȳ) is such thatx is feasible for the original problem (1).

Proposition 3.3
Let {(x^k, y^k)} be the sequence generated by Algorithm 2. Then, {(x^k, y^k)} admits cluster points, and every cluster point (x̄, ȳ) is such that x̄ = ȳ and ‖x̄‖_0 ≤ s.
Proof Consider a generic iteration k. The instructions of the algorithm imply, for all ℓ ≥ 0, a monotonic decrease of the penalty function, and hence we can write (13). From the definition of (u^0, v^0), we either have (u^0, v^0) = (x^k, y^k) or (u^0, v^0) = (x^0, y^0). In the former case, we have, by the definition of x_trial, that q_{τ_k}(u^0, v^0) ≤ f(x^0), where the last inequality holds, as in this case the condition at line 6 is satisfied. In the latter case, we have q_{τ_k}(u^0, v^0) = f(x^0). Then, in both cases, from (13) it follows (14). We also have f(x^{k+1}) ≤ q_{τ_k}(x^{k+1}, y^{k+1}), and hence we can conclude that, for all k ≥ 0, f(x^{k+1}) ≤ f(x^0). Therefore, the points of the sequence {x^k} belong to the compact set L_0(f); this implies that {x^k} is a bounded sequence and that (15) holds for all k. From (15), dividing by τ_k and taking limits for k → ∞, recalling that τ_k → ∞, we obtain lim_{k→∞} ‖x^{k+1} − y^{k+1}‖ = 0.
Therefore, since {x^k} is a bounded sequence, from (16) it follows that {(x^k, y^k)} admits cluster points. Again from (16), it follows x̄ = ȳ. Finally, as ‖y^k‖_0 ≤ s for all k, recalling the lower semicontinuity of the ℓ0-norm ‖·‖_0, we can conclude that ‖x̄‖_0 = ‖ȳ‖_0 ≤ s.
We are ready to state the global convergence result.

Theorem 3.1
Let {(x^k, y^k)} be the sequence generated by Algorithm 2. Then, every cluster point (x̄, ȳ) of the sequence is such that x̄ satisfies the Lu-Zhang conditions for problem (1).

Proof From Proposition 3.3, it follows x̄ = ȳ and ‖x̄‖_0 ≤ s. Using (11) of Proposition 3.2, for all k ≥ 0, we have ‖∇_x q_{τ_k}(x^{k+1}, y^{k+1})‖ ≤ ε_k, so that, taking limits for k ∈ K and k → ∞, as ε_k → 0, we can write lim_{k→∞, k∈K} ∇_x q_{τ_k}(x^{k+1}, y^{k+1}) = 0. From the instructions of the algorithm, we have y^{k+1} ∈ arg min_{y ∈ Y} ‖y − x^{k+1}‖².

From (4) it follows
where we recall that the index set I(x^{k+1}) contains at most s elements, corresponding to the nonzero components of x^{k+1} with the largest absolute value.
Note that |I(x^{k+1})| < s implies ‖x^{k+1}‖_0 < s and hence y^{k+1} = x^{k+1}. Therefore, we can write (19). The index set I(x^{k+1}) is a subset of the finite set {1, . . . , n}; therefore, there exists an infinite subset K_1 ⊆ K such that I(x^{k+1}) = I for all k ∈ K_1. Let Ī = I(x̄). We show that Ī ⊆ I. Indeed, assume by contradiction that there exists i ∈ Ī such that i ∉ I. Hence, ȳ_i = x̄_i ≠ 0, while y^{k+1}_i = 0 for all k ∈ K_1. This is a contradiction, since y^{k+1} → ȳ for k → ∞, k ∈ K_1. Therefore, we have the following possible cases: (i) |I| = s, I = Ī; (ii) |I| < s; (iii) |I| = s, I ⊃ Ī. We now prove each case separately. (i) Let i ∈ I = Ī; using the first condition of (19), it follows τ_k(x^{k+1}_i − y^{k+1}_i) = 0 for all k ∈ K_1. Therefore, recalling the continuity of the gradient, we can write ∇_i f(x̄) = 0 for all i ∈ I, i.e., Lu-Zhang conditions hold with the set I = Ī. (ii) Let i ∈ {1, . . . , n}; similarly to the previous case, using the second condition of (19), we obtain ∇_i f(x̄) = 0 for all i ∈ {1, . . . , n}, i.e., Lu-Zhang conditions hold taking any subset of {1, . . . , n} of cardinality s that contains Ī.
(iii) Let i ∈ I. By the same reasonings of case (i), we can write ∇_i f(x̄) = 0 for all i ∈ I, i.e., Lu-Zhang conditions hold with the set I.
Putting everything together, we have from (i), (ii) and (iii) that Lu-Zhang conditions are always satisfied.
As we can see, the proposed inexact version of the algorithm enjoys the same convergence properties as the original, exact one. We also provide, in the following remark, a better characterization of the algorithm, showing that the limit points are often BF-vectors.

Remark 3.3
We note that, in both case (i) and case (ii), x̄ satisfies the BF optimality conditions. Moreover, note also that:
- If there exists a subsequence K̃ ⊆ K s.t. ‖x^k‖_0 = ‖x̄‖_0 for all k ∈ K̃, the only possible cases are (i) and (ii). Indeed, let us consider a further subsequence K_2 ⊆ K̃, such that I(x^{k+1}) = I for every k ∈ K_2, for some I ⊂ {1, . . . , n}. We know that K_2 exists and that I ⊇ Ī. Since ‖x^{k+1}‖_0 = ‖x̄‖_0 ≤ s for every k ∈ K_2, I and Ī are the index sets of nonzero variables of x^{k+1} and x̄, respectively, which have the same cardinality. Therefore, it cannot be I ⊃ Ī. It follows that I = Ī, so we fall into either case (i) or case (ii), and thus x̄ satisfies BF conditions.
- If there exists a subsequence K̃ ⊆ K such that ‖x^{k+1}‖_0 < s for all k ∈ K̃, we can again define K_2 ⊆ K̃ such that I(x^{k+1}) = I for every k ∈ K_2, for some I ⊂ {1, . . . , n}. In this case, we have |I| = ‖x^{k+1}‖_0 < s and case (ii) applies. It follows that x̄ is a BF-vector.

A Derivative-Free Penalty Decomposition Method
First-order information about the objective function is fundamental for the PD methods we have considered thus far. However, there are applications where the objective function is obtained by direct measurements or it is the result of a complex system of calculations, so that its analytical expression is not available and the computation of its values may be affected by the presence of noise. Hence, in these cases the gradient cannot be explicitly calculated or approximated.
Such a lack of information has an impact on the applicability of Algorithm 2. In particular, the x-update step and the inner loop stopping criterion can no longer be employed as they are.
In this section, we propose a derivative-free modification of Algorithm 2 that, similarly to [23][24][25], updates x by line search steps along the coordinate axes and employs a stopping criterion based on the length of such steps.
The derivative-free PD method is described by Algorithm 4. At the x-update step, we employ as search directions the coordinate directions and their opposites. A tentative step length α̃_i is associated with each of these directions. At every iteration, all search directions are considered one at a time, and a derivative-free line search is performed along each direction, according to Algorithm 5. If the tentative step size does not provide a sufficient decrease, it is reduced for the next iteration. If, on the other hand, the tentative step size provides a sufficient decrease, an extrapolation procedure is carried out; the tentative step size for that same direction at the successive iteration will be the longest one tried in the extrapolation phase that provides a sufficient decrease. That same step length is also used to move along the considered direction, provided it is at least as large as ε_k; otherwise, no movement is made along that direction. The inner loop stops when all tentative step sizes have become smaller than ε_k.
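The expansion mechanism can be illustrated by the following sketch (our simplified version, assuming a sufficient-decrease condition of the form f(x + αd) ≤ f(x) − γα², as is common in derivative-free line searches; the exact condition of Algorithm 5 may differ, and all names are ours):

```python
import numpy as np

def df_linesearch(phi, x, d, alpha, gamma=1e-5, sigma=2.0):
    """Derivative-free line search with extrapolation: if the tentative step
    alpha yields the sufficient decrease phi(x + a*d) <= phi(x) - gamma*a**2,
    keep expanding a by the factor sigma while the condition still holds;
    return 0.0 on failure (the caller then shrinks the tentative step)."""
    phi_x = phi(x)
    if phi(x + alpha * d) > phi_x - gamma * alpha**2:
        return 0.0                       # failure: no sufficient decrease
    while phi(x + sigma * alpha * d) <= phi_x - gamma * (sigma * alpha)**2:
        alpha *= sigma                   # extrapolation phase
    return alpha
```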

Algorithm 5: LineSearch
Hereafter, we show that Algorithm 4 enjoys the same convergence properties as Algorithm 2. First, we prove that the line search procedure does not loop infinitely inside our procedure.

Proposition 4.1 Algorithm 5 cannot infinitely cycle between steps 5 and 8.
Proof Assume by contradiction that Algorithm 5 does not terminate. Then, for j = 0, 1, . . ., the sufficient decrease condition is satisfied with step size σ^j α^0, and the prescribed decrease grows unboundedly with j. Taking limits for j → ∞, we obtain that f(x + σ^j α^0 d) → −∞, and this contradicts the fact that f is bounded below, since f is continuous and coercive.
Note that, as shown by Proposition 3.1, q_{τ_k} is coercive on R^n × R^n. We now prove that Algorithm 4 is well defined, i.e., that the inner loop terminates in a finite number of iterations.
Note that K_1 and K_2 cannot both be finite. Then, we analyze the following two cases: K_1 infinite (Case I) and K_2 infinite (Case II).
Case (I). Taking limits for ℓ ∈ K_1, ℓ → ∞, recalling that {q_{τ_k}(u^ℓ, v^ℓ)} tends to a finite limit, we get lim_{ℓ→∞, ℓ∈K_1} α̃^ℓ_i = 0, and hence, for ℓ ∈ K_1 sufficiently large, we have α̃^ℓ_i ≤ ε_k.
Case (II). For every ℓ ∈ K_2, let m_ℓ be the maximum index in {0, 1, . . .} such that m_ℓ ∈ K_1 and m_ℓ < ℓ (m_ℓ is the index of the last iteration in K_1 preceding ℓ); we can assume m_ℓ = 0 if the index m_ℓ does not exist, that is, if K_1 is empty. Then, we can write (21); this and the fact that δ ∈ (0, 1) imply lim_{ℓ→∞, ℓ∈K_2} α̃^ℓ_i = 0. Thus, for ℓ ∈ K_2 sufficiently large, we have α̃^ℓ_i ≤ ε_k.
Next, we prove a technical result used later.

Thus, in both cases we can write (26) for some ρ^ℓ_i ∈ ]0, cε_k[, with c = max{σ, 1/δ}. Since α̃^{ℓ+1}_i ≤ ε_k for all i = 1, . . . , 2n, from the instructions of the algorithm we have u^{ℓ+1} = u^ℓ and, consequently, v^{ℓ+1} = v^ℓ. Hence, equation (26) holds with u^ℓ = x^{k+1} and v^ℓ = y^{k+1}.

Now, we prove that the sequence generated by the algorithm admits limit points and that every limit point is feasible for the original problem.
From the definition of (u^0, v^0), we either have (u^0, v^0) = (x^k, y^k) or (u^0, v^0) = (x^0, y^0). In the former case, for some i ∈ {1, . . . , 2n} we have, by the definition of x_trial, that q_{τ_k}(u^0, v^0) ≤ f(x^0). In the latter case, we have q_{τ_k}(u^0, v^0) = f(x^0). Then, in both cases it follows (27). The rest of the proof follows the same reasonings used in the proof of Proposition 3.3, starting from the condition corresponding to (27), i.e., condition (14).

From (4) it follows
where we recall that the index set I(x^{k+1}) contains at most s elements, corresponding to the nonzero components of x^{k+1} with the largest absolute value. Note that |I(x^{k+1})| < s implies ‖x^{k+1}‖_0 < s and hence y^{k+1} = x^{k+1}. Therefore, we can write (28). The index set I(x^{k+1}) is a subset of the finite set {1, . . . , n}; therefore, there exists an infinite subset K_1 ⊆ K such that I(x^{k+1}) = I for all k ∈ K_1.
Let Ī = I(x̄). We have already shown in the proof of Theorem 3.1 that Ī ⊆ I. We consider the following possible cases: (i) |I| = s, I = Ī; (ii) |I| < s; (iii) |I| = s, I ⊃ Ī.
We now prove each case separately. (i) Let i ∈ I = Ī; using the first condition of (28), with c = max{σ, 1/δ}, and taking limits for k → ∞, k ∈ K_1, recalling that ε_k → 0, that ρ^k_i, ρ^k_{i+n} ∈ ]0, cε_k[, and the continuity of the gradient, we get ∇_i f(x̄) = 0 for all i ∈ I, i.e., Lu-Zhang conditions hold with the set I = Ī. (ii) Similarly to the previous case, using the second condition of (28), we can prove that ∇f(x̄) = 0, i.e., Lu-Zhang conditions hold taking any subset of {1, . . . , n} of cardinality s that contains Ī. (iii) Let i ∈ I. By the same reasonings of case (i), we can write ∇_i f(x̄) = 0 for all i ∈ I, i.e., Lu-Zhang conditions hold with the set I. Putting everything together, we have, from (i), (ii) and (iii), that Lu-Zhang conditions are always satisfied.

Preliminary Computational Experiments
In this section, we show the results of preliminary computational experiments, performed to assess the validity of the proposed approach. The purpose of these preliminary experiments is to evaluate the inexact minimization strategy of the proposed algorithm (in both its gradient-based and derivative-free versions), compared with the exact minimization approach of the original PD method. To this aim, we consider the problem of sparse logistic regression, where the objective function is convex, but the solution of the subproblems in the x variables cannot be obtained in closed form, i.e., it requires the adoption of an iterative method.

Test Problems
The problem of sparse logistic regression [26] has important applications, for instance, in machine learning [27,28]. Given a dataset with N samples {z_1, . . . , z_N}, each having n features, and N corresponding labels {d_1, . . . , d_N} belonging to {−1, 1}, the sparse logistic regression problem can be formulated as follows:

min_{x ∈ R^n} (1/N) Σ_{i=1}^N log(1 + exp(−d_i z_i^T x))   s.t.   ‖x‖_0 ≤ s.    (29)

The benchmark for this experiment is made up of 18 problems of the form (29), obtained as described hereafter. We employed 6 binary classification datasets, listed in Table 1; all the datasets are from the UCI Machine Learning Repository [29]. For each dataset, we removed data points with missing values; moreover, we one-hot encoded the categorical variables and standardized the remaining ones to zero mean and unit standard deviation. For every dataset, we chose 3 different values of s, in order to define 3 different problems of the form (29). The considered values of s correspond to 25%, 50% and 75% of the number n of features of the dataset.
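As an illustration, the objective of (29) and its gradient can be evaluated as follows (a sketch assuming the standard averaged logistic loss; `Z` stacks the samples row-wise, `d` collects the labels, and the names are ours):

```python
import numpy as np

def logistic_loss_grad(x, Z, d):
    """Value and gradient of f(x) = (1/N) * sum_i log(1 + exp(-d_i z_i^T x)).
    The cardinality constraint ||x||_0 <= s is handled by the PD scheme,
    not inside this oracle."""
    margins = d * (Z @ x)                        # d_i * z_i^T x
    loss = np.mean(np.logaddexp(0.0, -margins))  # stable log(1 + e^{-m})
    probs = 1.0 / (1.0 + np.exp(margins))        # sigmoid(-m_i)
    grad = -(Z.T @ (d * probs)) / len(d)
    return loss, grad
```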

Implementation Details
Algorithms 1, 2 and 4 have been implemented in Python 3.6. The algorithms start from the feasible initial point x 0 = y 0 = 0 ∈ R n . Their common parameters have been set as follows: τ 0 = 1 and θ = 1.1. The three algorithms differ only in the x-minimization step. Concerning the line search parameters of Algorithm 2, we set γ = 10 −5 and β = 0.5. As for the derivative-free Algorithm 4, we set δ = 0.5, γ = 10 −5 , σ = 2.
The x-minimization step for Algorithm 1 has been performed by the BFGS solver [19] included in the SciPy library [30]. In particular, the inner iterations of the BFGS solver have been stopped whenever the current point u^{ℓ+1} is such that ‖∇_x q_{τ_k}(u^{ℓ+1}, v^ℓ)‖ ≤ 10^{−5}, i.e., when the current point is a good approximation of a stationary point and hence, the penalty function q_{τ_k} being strictly convex with respect to u, of the global minimizer.
For a fair comparison, we employ for the three PD procedures the same stopping criteria for the outer and the inner loop. Specifically, we used the practical stopping criteria proposed in [18]: the inner loop stops when the decrease of the value of the penalty function is sufficiently small, i.e., when q_{τ_k}(u^ℓ, v^ℓ) − q_{τ_k}(u^{ℓ+1}, v^{ℓ+1}) ≤ ε_in, with ε_in = 10^{−4}; the outer loop is stopped when x and y are sufficiently close, i.e., as soon as ‖x^{k+1} − y^{k+1}‖ ≤ ε_out, with ε_out = 10^{−4}.
All the experiments have been carried out on an Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz machine with 4 physical cores (8 threads) and 16 GB RAM.

Numerical Results
The three algorithms, namely Algorithm 1 (exact PD), Algorithm 2 (inexact PD) and Algorithm 4 (DFPD), have been compared using performance profiles [31]. We recall that, in performance profiles, each curve represents, for a given performance metric, the cumulative distribution of the ratio between the result obtained by a solver on a problem instance and the best result obtained by any considered solver on that instance. The results of the comparison are shown in Figure 1.
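For reference, the profile curves can be constructed in a few lines (a generic sketch of the Dolan-Moré construction, not the exact script used for Figure 1; `T` and `taus` are illustrative names):

```python
import numpy as np

def performance_profile(T, taus):
    """Dolan-More performance profile: T[p, s] is the metric (e.g., runtime)
    of solver s on problem p; returns rho[s, j] = fraction of problems on
    which solver s is within a factor taus[j] of the best solver."""
    ratios = T / T.min(axis=1, keepdims=True)      # per-problem ratios r_{p,s}
    return np.array([[np.mean(ratios[:, s] <= t) for t in taus]
                     for s in range(T.shape[1])])
```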
From the results in Figure 1b, we can observe that the performances of the three algorithms, in terms of attained objective function values, are quite close, with only slight fluctuations. It is worth remarking that different local minima can be attained by different algorithms, even for equal starting points, because of the nonconvex nature of problem (29).
On the other hand, as shown in Figure 1a, the inexact version of the PD algorithm clearly outperforms the other two algorithms in terms of efficiency. This aspect can be valuable in connection with a global optimization strategy, where many local minimizations have to be performed and the availability of an efficient local solver may be useful. The derivative-free algorithm is about an order of magnitude slower than its gradient-based counterpart, which is reasonable, considering that the size of the considered problems is quite large from the perspective of derivative-free optimization. In fact, the speed gap between gradient-based and derivative-free methods on problems of relatively large size is usually even wider; here, the gap is mitigated by the large set of instructions shared by all the versions of the algorithm.
On the whole, the computational experience, although limited to a single class of problems, confirms the validity of the proposed approach. We remark that we tested the simplest implementation of the proposed algorithm, that is, performing, in the x-minimization step, a single line search along the steepest descent direction. Benefits, in terms of attained function values, could be obtained by performing more iterations of a descent method and by introducing a suitable inner stopping criterion. As already observed, this can be done to improve the effectiveness of the algorithm while preserving its global convergence properties.

Fig. 1 Performance profiles of runtime (a) and attained objective value (b) for the exact, inexact and derivative-free penalty decomposition algorithms, on the 18 sparse logistic regression problems

Conclusions
In this paper, we have proposed two penalty decomposition-based methods for smooth cardinality-constrained problems. In the first method, based on gradient information, the exact minimization step of the original penalty decomposition method is replaced by line searches along gradient-related directions. The contribution related to this algorithm lies in the fact that it represents a viable technique whenever a closed-form solution of the subproblems in the original variables is not available (in both the convex and nonconvex cases). The second method is a derivative-free algorithm for sparse black-box optimization. We remark that, to our knowledge, convergent derivative-free algorithms for cardinality-constrained problems were not previously available, so that the presented method seems to yield an important contribution in the field of sparse optimization.

We state global convergence results for the new penalty decomposition algorithms. We note that the theoretical analysis is quite different from that of the related literature and that it presents substantial differences for the two proposed algorithms. Although the main focus of the work is theoretical, we have also reported the results of preliminary computational experiments performed with the proposed penalty decomposition methods. The obtained results, although limited to a single class of problems, show the validity of the proposed approach.

Further work will regard the extension of the presented algorithms to problems with additional equality and inequality constraints, which, similarly to what is done in [18], might be handled by moving them into the quadratic penalty term. Another interesting theoretical investigation might concern the substitution of the line search step with a trust-region framework. Such a modification, which we consider to be reasonable, would in fact require nontrivial changes to the convergence analysis.
Finally, the application of the derivative-free algorithm to real sparse black-box problems would be of great interest.