An optimal subgradient algorithm for large-scale bound-constrained convex optimization

This paper shows that the optimal subgradient algorithm (OSGA)—which uses first-order information to solve convex optimization problems with optimal complexity—can be used to efficiently solve arbitrary bound-constrained convex optimization problems. This is done by constructing an explicit method as well as an inexact scheme for solving the bound-constrained rational subproblem required by OSGA. This leads to an efficient implementation of OSGA on large-scale problems in applications arising from signal and image processing, machine learning and statistics. Numerical experiments demonstrate the promising performance of OSGA on such problems. A software package implementing OSGA for bound-constrained convex problems is available.

1. Introduction. Let V be a finite-dimensional real vector space and V* its dual space. In this paper we consider the bound-constrained convex minimization problem

    min f(x)   s.t.  x ∈ x,                                          (1.1)

where f : x → R is a smooth or nonsmooth convex function, A : R^n → R^m is a linear operator, and x = [x̲, x̄] is an axiparallel box in V in which x̲ and x̄ are the vectors of lower and upper bounds on the components of x, respectively. (Lower bounds are allowed to take the value −∞, and upper bounds the value +∞.) Throughout the paper, ⟨g, x⟩ denotes the value of g ∈ V* at x ∈ V. A subgradient of the objective function f at x is a vector g(x) ∈ V* satisfying

    f(z) ≥ f(x) + ⟨g(x), z − x⟩   for all z ∈ V.

It is assumed that the set of optimal solutions of (1.1) is nonempty and that first-order information about the objective function (i.e., for any x ∈ x, the function value f(x) and some subgradient g(x) at x) is available from a first-order black-box oracle.

Motivation & history. Bound-constrained optimization is an important problem appearing in many fields of science and engineering, where the parameters describing physical quantities are constrained to lie in a given range. Furthermore, it plays a prominent role in the development of general constrained optimization methods, since many methods reduce the solution of the general problem to the solution of a sequence of bound-constrained problems.
There are many bound-constrained algorithms and solvers for smooth and nonsmooth optimization; here we mention only those related to our study. Lin & Moré [28] and Kim et al. [24] proposed Newton and quasi-Newton methods for bound-constrained optimization. In 1995, Byrd et al. [14] proposed a limited-memory algorithm called LBFGS-B for general smooth nonlinear bound-constrained optimization. Branch et al. [13] proposed a trust-region method for this problem, and Neumaier & Azmi [36] solved it with a limited-memory algorithm. The smooth bound-constrained problem was also addressed by Birgin et al. [9] using a nonmonotone spectral projected gradient method and by Hager & Zhang [25,26] using an active-set strategy and an affine-scaling scheme, respectively. Limited-memory bundle methods for bound-constrained nonsmooth problems were proposed by Karmitsa & Mäkelä [20,21].
In recent years convex optimization has received much attention because it arises in many applications and is suitable for problems involving high-dimensional data. The particular case of bound-constrained convex optimization with a smooth or nonsmooth objective function also appears in a variety of applications, of which we mention the following.

Bound-constrained linear inverse problems. Given A, W ∈ R^{m×n}, b ∈ R^m and λ ∈ R, for m ≥ n, the bound-constrained least-squares problem is given by

    min f(x) := ½ ‖Ax − b‖₂² + λϕ(x)   s.t.  x ∈ x,                  (1.2)

and the bound-constrained l1 problem is given by

    min f(x) := ‖Ax − b‖₁ + λϕ(x)   s.t.  x ∈ x,                     (1.3)

where x = [x̲, x̄] is a box and ϕ is a smooth or nonsmooth regularizer, often a weighted power of a norm; see Section 4 for examples.
The problems (1.2) and (1.3) commonly arise in the context of control and inverse problems, especially in imaging problems such as denoising, deblurring and inpainting. Morini et al. [29] formulated the bound-constrained least-squares problem (1.2) as a nonlinear system of equations and proposed an iterative method based on a reduced Newton's method. Recently, Zhang & Morini [39] used alternating direction methods to solve these problems. More recently, Chan et al. [16], Boţ et al. [11], and Boţ & Hendrich [10] proposed alternating direction methods, primal-dual splitting methods, and a Douglas-Rachford primal-dual method, respectively, to solve both (1.2) and (1.3) in some applications.

Content. In this paper we show that the optimal subgradient algorithm OSGA proposed by Neumaier [33] can be used to solve bound-constrained problems of the form (1.1). To run OSGA, one needs to solve a rational auxiliary subproblem; we investigate efficient schemes for solving this subproblem in the presence of bounds on its variables. To this end, we show that the solution of the subproblem has a one-dimensional piecewise linear representation and can be computed by solving a sequence of one-dimensional piecewise rational optimization problems. We also give an iterative scheme that solves OSGA's subproblem approximately by solving a one-dimensional nonlinear equation, and we report numerical results demonstrating the performance of OSGA on problems from applications. More specifically, Section 2 gives an overview of OSGA; Section 3 investigates properties of the solution of the subproblem (2.1) that lead to two algorithms for solving it efficiently; Section 4 reports numerical results with artificial and real data, where the inexact scheme is used to solve OSGA's subproblem for large-scale imaging problems; and Section 5 gives some conclusions.
2. Overview of OSGA. OSGA (see Algorithm 1) is an optimal subgradient algorithm proposed by Neumaier [33] that constructs, for a problem (1.1) with an arbitrary nonempty, closed and convex domain (not necessarily a box), a sequence of iterates whose function values converge with the optimal complexity to the minimum of (1.1). Furthermore, OSGA requires no information on global parameters such as Lipschitz constants of function values and gradients.
Apart from first-order information at the iterates, OSGA requires an efficient routine for finding a maximizer x = U(γ, h) and the optimal objective value E(γ, h) of an auxiliary problem of the form

    E(γ, h) := sup_{x ∈ x} E_{γ,h}(x),                               (2.1)

where it is known that the supremum is positive. The function E_{γ,h} : x → R is defined by

    E_{γ,h}(x) := −(γ + ⟨h, x⟩)/Q(x),                                (2.2)

with γ ∈ R, h ∈ V*, and Q : x → R is a prox-function, i.e., a strongly convex function satisfying inf_{x ∈ x} Q(x) > 0 and

    Q(z) ≥ Q(x) + ⟨g_Q(x), z − x⟩ + ½ ‖z − x‖²   for all x, z ∈ x,    (2.3)

where g_Q(x) is a subgradient of Q at x and the convexity parameter is σ = 1. In particular, with Q0 > 0, if x ∈ R^n we may take

    Q(x) := Q0 + ½ ‖x − x0‖₂²,                                       (2.4)

where ‖x‖₂ = (Σ_i x_i²)^{1/2} is the Euclidean norm, and if x ∈ R^{m×n} we may take

    Q(x) := Q0 + ½ ‖x − x0‖_F²,                                      (2.5)

where ‖x‖_F = (Σ_{i,k} x_{ik}²)^{1/2} is the Frobenius norm. In [1,33], the unconstrained version of (2.1) with the prox-function (2.4) or (2.5) is solved in closed form. Numerical experiments and comparisons with some state-of-the-art solvers [1,5,6] show the promising behaviour of OSGA on practical unconstrained problems. A version of OSGA that solves structured nonsmooth problems with the complexity O(ε^{−1/2}) is discussed in [4]. In [3], it is shown that the OSGA subproblem can be solved for many simple domains. In this paper we show that for the above choices of Q(x) and an arbitrary box x, the subproblem (2.1) can be solved efficiently. It follows that OSGA is applicable to bound-constrained convex problems as well.
It is shown in Neumaier [33] that OSGA satisfies the optimal complexity bounds O(ε^{−2}) for Lipschitz continuous nonsmooth convex functions and O(ε^{−1/2}) for smooth convex functions with Lipschitz continuous gradients; cf. Nemirovsky & Yudin [30] and Nesterov [31,32]. Since the underlying problem (1.1) is a special case of the problem considered in [33], the complexity of OSGA remains valid for (1.1).
3. Solving the bound-constrained subproblem. Without loss of generality, we consider V = R^n; it is not hard to adapt the results to V = R^{m×n} and other spaces. The method is related to ones used in several earlier papers. In 1980, Helgason et al. [27] characterized the solution of a singly constrained quadratic problem with bound constraints. Later, Pardalos & Kovoor [37] developed an O(n) algorithm for this problem, using binary search to solve the associated Kuhn-Tucker system. The problem was also solved by Dai & Fletcher [17] using a projected gradient method, and Zhang et al. [40] solved the linear support vector machine problem by a cutting plane method employing a similar technique.
In the papers mentioned, the key is showing that the problem can be reduced to a piecewise linear problem in a single dimension. To apply this idea to the present problem, we prove that (2.1) is equivalent to a one-dimensional minimization problem and then develop a procedure to calculate its minimizer. We write

    x(λ)_i := min( max( x0_i − λh_i, x̲_i ), x̄_i ),  i = 1, …, n,      (3.1)

for the projection of x0 − λh onto the box x.

Proposition 3.1. For h ≠ 0, the maximum of the subproblem (2.1) is attained at x̂ := x(λ), where λ > 0 or λ = +∞ is the inverse of the value of the maximum.
Proof. The function E_{γ,h} : V → R defined by (2.2) is continuously differentiable, and from Proposition 2.1 in [33] we have e := E(γ, h) > 0. At the maximizer x̂, the defining relation gives e Q(x̂) = −γ − ⟨h, x̂⟩, and the first-order optimality conditions for (2.1) yield, for i = 1, 2, …, n, the corresponding complementarity conditions at the bounds. Since e > 0, we may define λ := e^{−1} and find that, for i = 1, 2, …, n, x̂_i is the projection of x0_i − λh_i onto [x̲_i, x̄_i]. This implies that x̂ = x(λ). ∎

Proposition 3.1 gives the key feature of the solution of the subproblem (2.1): it can be written in the form (3.1) with only one unknown, λ. In the remainder of this section, we focus on deriving the optimal λ.
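In the Euclidean setting of (2.4), the projection (3.1) is simply a componentwise clipping. The following sketch (in Python, with hypothetical function names; this is an illustration, not part of the OSGA package) shows the computation of x(λ):

```python
import numpy as np

def x_of_lambda(lam, x0, h, lo, hi):
    """Componentwise projection of x0 - lam*h onto the box [lo, hi]:
    the candidate maximizer x(lam) from Proposition 3.1 (sketch)."""
    return np.clip(x0 - lam * h, lo, hi)

# tiny illustration: both components hit a bound for large lambda
x0 = np.array([0.5, 0.5]); h = np.array([1.0, -1.0])
print(x_of_lambda(1.0, x0, h, 0.0, 1.0))   # components clipped at 0 and 1
```

For a fixed λ this costs O(n), which is what makes the one-dimensional reduction of the subproblem attractive.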
Example 3.1. Let us first consider the very special case in which x is the nonnegative orthant, i.e., x̲_i = 0 and x̄_i = +∞ for i = 1, …, n. Nonnegativity constraints are important in many applications; see [7,18,19,22,23]. In this case we consider the quadratic prox-function (3.4) with Q0 > 0; in [1] it is proved that (3.4) is a prox-function. Using this prox-function and (3.1), we obtain x(λ) = x0 − λh − (x0 − λh)₋, where (z)₋ = min{0, z} componentwise. By Proposition 2.2 of [33] and the substitution λ = 1/e, we obtain a quadratic equation in e. Since we search for the maximum e, the solution is the larger root of this equation. This shows that for the nonnegativity constraint the subproblem (2.1) can be solved in closed form; for a general bound-constrained problem, however, a much more sophisticated scheme is needed to solve (2.1).
To derive the optimal λ ≥ 0 in Proposition 3.1, we first determine its permissible range, provided by the three conditions considered in (3.3) and leading to an interval [λ̲_i, λ̄_i] for each component of x. In particular, if h_i = 0 then, since x0 is a feasible point, x̂_i = x0_i − λh_i = x0_i satisfies the third condition in (3.3), so λ is not bounded above by this component. If h_i ≠ 0, we distinguish three cases according to which condition in (3.3) holds; in each case any nonpositive candidate lower bound of the form −(x̄_i − x0_i)/h_i or −(x0_i − x̲_i)/h_i is not restrictive, and the sign of h_i determines which bound of the box can become active. This case analysis produces the conditions (3.6)–(3.10) used below. As a result, the following proposition is valid.
Proposition 3.2. The optimal λ lies in the interval determined by the componentwise bounds λ̲_i and λ̄_i computed from the case analysis above.

Proposition 3.2 implies that each element of x satisfies exactly one of the conditions (3.6)–(3.10). Thus, for each i = 1, …, n with h_i ≠ 0, we have a single breakpoint

    λ_i := (x0_i − x̲_i)/h_i  if h_i > 0,    λ_i := (x0_i − x̄_i)/h_i  if h_i < 0.    (3.12)

Sorting the n bounds λ_i, i = 1, …, n, in increasing order, augmenting the resulting list by 0 and +∞, and deleting possible duplicates, we obtain a list of m + 1 ≤ n + 2 distinct breakpoints, denoted by

    0 = λ̂_0 < λ̂_1 < ⋯ < λ̂_m = +∞.                                   (3.13)

The next proposition gives an explicit representation of x(λ).

Proposition 3.3. The solution x(λ) of the auxiliary problem (2.1) defined by (3.1) has, on each interval [λ̂_k, λ̂_{k+1}], the form

    x(λ) = p_k + λ q_k,                                              (3.14)

with the vectors p_k, q_k ∈ R^n given componentwise in (3.15).

Proof. In Case (i), Proposition 3.2 implies λ_i = λ̲_i, while λ_i = λ̄_i is not possible; therefore either (3.9) or (3.10) holds, depending on the sign of h_i. In Case (ii), Proposition 3.2 implies λ_i = λ̄_i, while λ_i = λ̲_i is not possible; therefore either (3.7) or (3.8) holds. If h_i < 0, then (3.8) holds, i.e., p_{k,i} = x̄_i and q_{k,i} = 0; otherwise (3.7) holds, implying p_{k,i} = x̲_i and q_{k,i} = 0. This proves the claim. ∎

Proposition 3.3 exhibits the solution x(λ) of the auxiliary problem (2.1) as a piecewise linear function of λ. In the next result, we show that solving the problem (2.1) is equivalent to maximizing a one-dimensional piecewise rational function.
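Assuming the Euclidean setup (2.4), the breakpoints (3.12) and the piecewise linearity asserted in Proposition 3.3 can be illustrated by the following sketch; the function names and test data are hypothetical, and +∞ is omitted from the list since x(λ) is constant beyond the last finite breakpoint:

```python
import numpy as np

def breakpoints(x0, h, lo, hi):
    """Positive values of lam at which a component of
    x(lam) = clip(x0 - lam*h, lo, hi) hits a bound (sketch).
    x(lam) is linear between consecutive breakpoints."""
    pts = [0.0]
    for i in range(len(x0)):
        if h[i] > 0:                     # component decreases, hits lower bound
            pts.append((x0[i] - lo[i]) / h[i])
        elif h[i] < 0:                   # component increases, hits upper bound
            pts.append((x0[i] - hi[i]) / h[i])
    return np.unique(pts)                # sorted, duplicates removed

x0 = np.array([0.5, 0.3, 0.8]); h = np.array([1.0, -0.5, 0.25])
lo = np.zeros(3); hi = np.ones(3)
lam = breakpoints(x0, h, lo, hi)

# piecewise linearity check: on each interval, the value at the midpoint
# equals the average of the endpoint values
for k in range(len(lam) - 1):
    a, b = lam[k], lam[k + 1]
    xa = np.clip(x0 - a * h, lo, hi)
    xb = np.clip(x0 - b * h, lo, hi)
    xm = np.clip(x0 - 0.5 * (a + b) * h, lo, hi)
    assert np.allclose(xm, 0.5 * (xa + xb))
```

With these data the breakpoints are 0, 0.5, 1.4, and 3.2, so x(λ) consists of three finite linear pieces plus a constant tail.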
Proposition 3.4. The maximal value of the subproblem (2.1) is the maximum of the piecewise rational function e(λ), defined for λ ∈ [λ̂_k, λ̂_{k+1}] by

    e(λ) = (a_k + b_k λ) / (c_k + d_k λ + s_k λ²),                   (3.16)

where, substituting the representation (3.14) into (2.2),

    a_k = −(γ + ⟨h, p_k⟩),  b_k = −⟨h, q_k⟩,
    c_k = Q0 + ½ ‖p_k − x0‖²,  d_k = ⟨p_k − x0, q_k⟩,  s_k = ½ ‖q_k‖²,

as defined in the proposition.
Since Q0 > 0, we have c_k > 0 and the denominator of (3.16) is bounded away from zero, implying 4 s_k c_k > d_k². It remains to verify s_k > 0. The definition of q_k in (3.15) implies that h_i ≠ 0 for i ∈ I = {i : λ̂_{k+1} ≤ λ_i}, so q_k ≠ 0 and hence s_k > 0.
The next result leads to a systematic way to maximize the one-dimensional rational problem (3.16).
Proposition 3.5. Let a, b, c, d, and s be real constants with c > 0, s > 0, and 4sc > d², and consider the rational function

    φ(λ) := (a + bλ) / (c + dλ + sλ²).                               (3.18)

(i) If b ≠ 0, the global maximum of φ is attained at one of the two roots of the quadratic equation bsλ² + 2asλ + ad − bc = 0.
(ii) If b = 0 and a > 0, the global maximum is attained at λ̂ = −d/(2s).
(iii) If b = 0 and a ≤ 0, the maximum is φ(λ̂) = 0, attained at λ̂ = +∞ for a < 0 and at all λ ∈ R for a = 0.
Proof. The denominator of (3.18) is positive for all λ ∈ R if and only if the stated conditions on the coefficients hold. Differentiating φ and using the first-order optimality condition, we obtain

    φ′(λ) = (−bsλ² − 2asλ + bc − ad) / (c + dλ + sλ²)².

To solve φ′(λ) = 0, we consider the solutions of the quadratic equation bsλ² + 2asλ + ad − bc = 0. Using the assumption 4sc > d², its discriminant 4s(a²s − abd + b²c) is nonnegative, implying that φ′(λ) = 0 has at least one solution.
(i) If b ≠ 0, the discriminant is positive, implying that there exist two solutions. Solving φ′(λ) = 0, the stationary points of the function are the two roots of the above quadratic equation, and comparing the values of φ at these points identifies the global maximizer. ∎

3.2. Inexact solution of OSGA's rational subproblem (2.1). In this subsection we give a scheme to compute an inexact solution of the subproblem (2.1).
We here use the quadratic prox-function (3.4). In view of Proposition 3.1 and Theorem 3.1 in [3], the solution of the subproblem (2.1) is given by x(λ) defined in (3.1), where λ can be computed by solving the one-dimensional nonlinear equation (3.25). The solution of OSGA's subproblem can then be found by Algorithm 3 (OSS) in [3]. In [3] it is shown that for many convex domains the nonlinear equation (3.25) can be solved explicitly; for bound-constrained problems, however, it can only be solved approximately.
As discussed in [3], the one-dimensional nonlinear equation can be solved by zero-finding schemes such as the bisection method or the secant-bisection scheme described in Chapter 5 of [35]. One can also use MATLAB's fzero function, which combines bisection, inverse quadratic interpolation, and the secant method. In the next section we use this inexact solution of OSGA's rational subproblem (2.1) to solve large-scale imaging problems.
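Equation (3.25) itself is not reproduced here; from the relation e Q(x̂) = −γ − ⟨h, x̂⟩ with λ = 1/e, one plausible scalar residual is φ(λ) = Q(x(λ)) + λ(γ + ⟨h, x(λ)⟩). The bisection sketch below is built on that assumption, with the quadratic prox-function and hypothetical data chosen so that a sign change exists (γ + ⟨h, x⟩ < 0 on the box); it is an illustration, not the package's OSS routine:

```python
import numpy as np

def phi(lam, x0, h, gamma, lo, hi, Q0):
    """Residual of the (assumed) scalar equation
    Q(x(lam)) + lam*(gamma + <h, x(lam)>) = 0,
    with the quadratic prox Q(x) = Q0 + 0.5*||x - x0||^2."""
    x = np.clip(x0 - lam * h, lo, hi)
    Q = Q0 + 0.5 * np.dot(x - x0, x - x0)
    return Q + lam * (gamma + np.dot(h, x))

def solve_lambda(x0, h, gamma, lo, hi, Q0, tol=1e-12):
    """Bisection on phi; phi(0) = Q0 > 0, and the bracket is grown
    until phi becomes negative (assumes gamma + <h, x> < 0 on the box)."""
    a, b = 0.0, 1.0
    while phi(b, x0, h, gamma, lo, hi, Q0) > 0:
        b *= 2.0                         # grow bracket until sign change
    while b - a > tol:
        m = 0.5 * (a + b)
        if phi(m, x0, h, gamma, lo, hi, Q0) > 0:
            a = m
        else:
            b = m
    return 0.5 * (a + b)

x0 = np.full(3, 0.5); h = np.array([1.0, -0.5, 0.2])
lo = np.zeros(3); hi = np.ones(3)
lam = solve_lambda(x0, h, gamma=-10.0, lo=lo, hi=hi, Q0=0.5)
e = 1.0 / lam                            # maximal value of the subproblem
```

In practice a secant-bisection hybrid or fzero-style solver converges in fewer residual evaluations than plain bisection, which is why those schemes are recommended in [3,35].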

4. Applications and numerical experiments.
In this section we report numerical results showing the performance of OSGA compared with some state-of-the-art algorithms on both artificial and real data.
A software package implementing OSGA for solving unconstrained and bound-constrained convex optimization problems is publicly available at http://homepage.univie.ac.at/masoud.ahookhosh/. The package is written in MATLAB, with the parameters δ = 0.9, α_max = 0.7, κ = κ′ = 0.5, and Ψ_target = −∞. We also use the prox-function (2.4) with Q0 = ½ ‖x0‖² + ε, where ε is the machine precision. Some examples for each class of problems are provided to show how the user can apply the package, and the interface to each subprogram is fully documented in the corresponding file. Moreover, the OSGA user's manual [2] describes the design of the package and how users can solve their own problems.
The algorithms considered in the comparison use the default parameter values reported in the corresponding papers or packages. All implementations were executed on a Toshiba Satellite Pro L750-176 laptop with an Intel Core i7-2670QM processor and 8 GB RAM.

4.1. Experiment with artificial data.
In this subsection we solve the problem (1.1) with the four objective functions (4.1), combining a least-squares or l1 data-fitting term with an l2² or l1 regularizer (denoted L22L22R, L22L1R, L1L22R, and L1L1R). The problem data are generated with i_laplace.m, a code generating an ill-posed test problem based on the inverse Laplace transformation from the Regularization Tools MATLAB package, available at http://www.imm.dtu.dk/~pcha/Regutools/; n is the problem dimension. The lower and upper bounds on the variables are generated by x̲ = 0.05*ones(n) and x̄ = 0.95*ones(n), respectively. Since among the problems (4.1) only L22L22R is differentiable, we need nonsmooth algorithms for comparison with OSGA. In our experiment we consider two versions of OSGA: OSGA-1, which uses BCSS for solving the subproblem (2.1), and OSGA-2, which uses the inexact solution described in Subsection 3.2. They are compared with PSGA-1 (a projected subgradient algorithm with nonsummable diminishing step-size) and PSGA-2 (a projected subgradient algorithm with nonsummable diminishing step-length); see [12].
We solve all of the above problems with dimensions n = 2000 and n = 5000. The results for L22L22R and L22L1R are illustrated in Table 1 and Figure 1, and the results for L1L22R and L1L1R are summarized in Table 2 and Figure 2. More precisely, Figures 1 and 2 show the relative error of function values (4.2) versus iterations, where f̂ denotes the minimum and f0 denotes the function value at the initial point x0.
In our experiments, PSGA-1 and PSGA-2 use the step-sizes α := 5/(√k ‖g_k‖) and α := 0.1/√k, respectively, where k is the iteration counter and g_k is a subgradient of f at x_k. The algorithms are stopped after 100 iterations. In Figure 1, Subfigures (a) and (b) show that OSGA-1 and OSGA-2 substantially outperform PSGA-1 and PSGA-2 with respect to the relative error of function values δ_k (4.2); however, they need more running time. In this case OSGA-1 and OSGA-2 are competitive, with OSGA-1 performing better. Subfigure (c) of Figure 1 shows that OSGA-2 produces the best results while the others are competitive, and Subfigure (d) demonstrates that OSGA-1 attains the best results while PSGA-2 and OSGA-2 are competitive but much better than PSGA-1. In Figure 2, Subfigures (a) and (b) show that OSGA-1 and OSGA-2 are comparable but much better than PSGA-1 and PSGA-2, and Subfigures (c) and (d) show that the best results are produced by OSGA-1 and OSGA-2, respectively.
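A projected subgradient baseline of the kind used above can be sketched in a few lines; the diminishing step-size α_k = 0.1/√k matches PSGA-2's rule, while the test objective, box, and names below are illustrative, not the problems (4.1):

```python
import numpy as np

def psga(f, subgrad, proj, x0, iters=100):
    """Projected subgradient method with step alpha_k = 0.1/sqrt(k):
    x_{k+1} = proj(x_k - alpha_k * g_k). The best iterate is tracked,
    since subgradient methods are not descent methods."""
    x = x0.copy()
    best = x.copy()
    for k in range(1, iters + 1):
        g = subgrad(x)
        x = proj(x - 0.1 / np.sqrt(k) * g)
        if f(x) < f(best):
            best = x.copy()
    return best

# illustrative nonsmooth test problem: min ||x - c||_1 over the box [0,1]^3
c = np.array([0.2, 0.9, 0.5])
f = lambda x: np.sum(np.abs(x - c))       # minimum value 0, attained at c
subgrad = lambda x: np.sign(x - c)        # a subgradient of f at x
proj = lambda x: np.clip(x, 0.0, 1.0)     # projection onto the box
x_best = psga(f, subgrad, proj, np.zeros(3))
```

This baseline attains only the generic O(1/√k) rate, which is consistent with the large gap to OSGA observed in Figures 1 and 2.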
4.2. Experiment with real data. A common model in image restoration is the unconstrained problem

    min f(x) := ½ ‖Ax − b‖₂² + λϕ(x),                                (4.4)

where ϕ is a smooth or nonsmooth regularizer such as ϕ(x) = ‖x‖_ITV or ϕ(x) = ‖x‖_ATV. Among the various regularizers, the total variation is especially popular due to its strong edge-preserving feature. Two important types of total variation, the isotropic and the anisotropic, see [15], are defined for x ∈ R^{m×n} by

    ‖x‖_ITV = Σ_{i=1}^{m−1} Σ_{j=1}^{n−1} ( (x_{i+1,j} − x_{i,j})² + (x_{i,j+1} − x_{i,j})² )^{1/2}
              + Σ_{i=1}^{m−1} |x_{i+1,n} − x_{i,n}| + Σ_{j=1}^{n−1} |x_{m,j+1} − x_{m,j}|,

    ‖x‖_ATV = Σ_{i=1}^{m−1} Σ_{j=1}^{n−1} ( |x_{i+1,j} − x_{i,j}| + |x_{i,j+1} − x_{i,j}| )
              + Σ_{i=1}^{m−1} |x_{i+1,n} − x_{i,n}| + Σ_{j=1}^{n−1} |x_{m,j+1} − x_{m,j}|.

The common drawback of the unconstrained problem (4.4) is that it usually gives a solution outside the dynamic range of the image, which is either [0, 1] or [0, 255] for 8-bit gray-scale images. Hence one has to project the unconstrained solution to the dynamic range of the image; however, the quality of the projected images is not always acceptable. As a result, it is worthwhile to solve a bound-constrained problem of the form (1.2) in place of the unconstrained problem (4.4), where the bounds are defined by the dynamic range of the image; see [8,16,38].
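The two total variation regularizers above can be implemented with forward differences; boundary conventions vary in the literature, so the sketch below (taking the differences at the last row and column as zero) is one common choice rather than necessarily the exact discretization of [15]:

```python
import numpy as np

def tv(x, isotropic=True):
    """Discrete total variation of a 2-D image via forward differences;
    the last row/column differences are set to zero (one common convention)."""
    dx = np.zeros_like(x)
    dy = np.zeros_like(x)
    dx[:-1, :] = x[1:, :] - x[:-1, :]     # vertical forward difference
    dy[:, :-1] = x[:, 1:] - x[:, :-1]     # horizontal forward difference
    if isotropic:
        return np.sum(np.sqrt(dx**2 + dy**2))
    return np.sum(np.abs(dx) + np.abs(dy))
```

Note that the anisotropic value always dominates the isotropic one, since |a| + |b| ≥ (a² + b²)^{1/2} componentwise, and both vanish on constant images, which is the edge-preserving flatness the regularizer rewards.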
The comparison of the quality of the recovered images is made via the so-called peak signal-to-noise ratio (PSNR), defined by

    PSNR = 20 log₁₀( √(mn) / ‖x − x_t‖_F ),

and the improvement in signal-to-noise ratio (ISNR), defined by

    ISNR = 20 log₁₀( ‖y − x_t‖_F / ‖x − x_t‖_F ),

where ‖·‖_F is the Frobenius norm, x_t denotes the m × n true image, y is the observed image, x is the restored image, and pixel values are in [0, 1].
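Assuming images with pixel values in [0, 1], these two quality measures can be computed directly from the definitions above (a sketch with hypothetical function names):

```python
import numpy as np

def psnr(x, x_true):
    """Peak signal-to-noise ratio (dB) for images with pixel values in [0, 1]."""
    m, n = x_true.shape
    return 20 * np.log10(np.sqrt(m * n) / np.linalg.norm(x - x_true, 'fro'))

def isnr(x, y, x_true):
    """Improvement in SNR (dB) of the restored image x over the observation y."""
    return 20 * np.log10(np.linalg.norm(y - x_true, 'fro')
                         / np.linalg.norm(x - x_true, 'fro'))
```

For example, a restoration whose pointwise error is uniformly 0.1 has a PSNR of exactly 20 dB, and halving the error norm relative to the observation gains about 6 dB of ISNR.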
4.2.1. Experiment with l2² isotropic total variation. We here consider image restoration from a blurred/noisy observation using the model (1.2) equipped with the isotropic total variation regularizer. We employ OSGA, MFISTA (a monotone version of FISTA proposed by Beck & Teboulle [8]), ADMM (an alternating direction method proposed by Chan et al. [16]), and a projected subgradient algorithm PSGA (with nonsummable diminishing step-size; see [12]). In our implementation we use the original codes of MFISTA and ADMM provided by the authors, with minor adaptations to solve the problem (1.2) and to stop after a fixed number of iterations.
We now restore the 512 × 512 blurred/noisy Barbara image. Let y be a blurred/noisy version of this image generated by a 9 × 9 uniform blur and additive Gaussian noise with zero mean and standard deviation 10^{−3/2}. Our experiments show that the algorithms are sensitive to the regularization parameter λ; hence we consider three different values, λ = 1 × 10⁻², λ = 7 × 10⁻³, and λ = 4 × 10⁻³. All algorithms are stopped after 50 iterations. The results are summarized in Table 3 and Figures 3 and 4.
The results of Table 3 and Figure 3 show that the function values, PSNR, and ISNR produced by the algorithms are sensitive to the regularization parameter λ, although the function values are less sensitive. Subfigures (a), (c), and (e) show that OSGA gives the best performance in terms of function values. According to Subfigures (b), (d), and (f), the best ISNR is attained for λ = 4 × 10⁻³; the algorithms are comparable with each other, but OSGA outperforms the others slightly. Figure 4 illustrates the resulting deblurred images for λ = 4 × 10⁻³.

4.2.2. Experiment with l1 isotropic total variation. In this subsection we study image restoration from a blurred/noisy observation using the model (1.3) equipped with the isotropic total variation regularizer, i.e., (1.3) with ϕ(x) = ‖x‖_ITV. We consider recovering the 256 × 256 Fingerprint image from a blurred/noisy observation constructed by a 7 × 7 Gaussian kernel with standard deviation 5 and salt-and-pepper impulsive noise with noise level 40%. The algorithms are stopped after 50 iterations. Three regularization parameters, λ = 3 × 10⁻¹, λ = 1 × 10⁻¹, and λ = 8 × 10⁻², are considered. The results are presented in Table 4 and Figures 5 and 6.
The results of Figure 5 demonstrate that the algorithms are sensitive to the regularization parameter. For example, ADMM attains the best function value and an acceptable ISNR compared with the others for λ = 3 × 10⁻¹; however, for λ = 1 × 10⁻¹ and λ = 8 × 10⁻², it behaves worse than the others. In terms of ISNR, DRPD1, DRPD2, and OSGA are competitive, but OSGA obtains the better results. The resulting images for λ = 8 × 10⁻² are illustrated in Figure 6, demonstrating that ADMM could not recover the image properly, while DRPD1, DRPD2, and OSGA reconstruct acceptable approximations, with OSGA attaining the best PSNR.

5. Concluding remarks. This paper addresses an optimal subgradient algorithm, OSGA, for solving bound-constrained convex optimization problems. It is shown that the solution of OSGA's subproblem has a piecewise linear form in a single dimension. We then give two schemes to solve the resulting one-dimensional problem, the first solving OSGA's subproblem exactly and the second inexactly. In particular, the first scheme uses the piecewise linear solution to translate the subproblem into a one-dimensional piecewise rational problem; the optimizer is then found on each interval, and the candidates are compared to give the global solution. The second scheme substitutes the piecewise linear solution into the subproblem and obtains a one-dimensional nonlinear equation that can be solved with zero finders. Numerical results are reported, showing the efficiency of OSGA compared with some state-of-the-art algorithms.

Table 3: Result summary for the l2² isotropic total variation

Table 4: Results for the l1 isotropic total variation