An optimal subgradient algorithm for large-scale convex optimization in simple domains

This paper shows that the optimal subgradient algorithm OSGA, proposed in \cite{NeuO}, can be used for solving structured large-scale convex constrained optimization problems. Only first-order information is required, and the optimal complexity bounds for both smooth and nonsmooth problems are attained. More specifically, we consider two classes of problems: (i) a convex objective over a simple closed convex domain, where the orthogonal projection onto this feasible domain is efficiently available; (ii) a convex objective with a simple convex functional constraint. If we equip OSGA with an appropriate prox-function, the OSGA subproblem can be solved either in a closed form or by a simple iterative scheme, which is especially important for large-scale problems. We report numerical results for some applications to show the efficiency of the proposed scheme. A software package implementing OSGA for the above domains is available.

1. Introduction. Convex optimization has been shown to provide efficient algorithms for computing reliable solutions in a broad range of applications. Many applications arising in applied sciences and engineering, such as signal and image processing, machine learning, statistics, and general inverse problems, can be addressed by a convex optimization problem involving high-dimensional data. In practice, solving a nonsmooth convex problem is usually more difficult and costly than a smooth one. More precisely, for a prescribed accuracy parameter ε, the optimal complexity to achieve an ε-solution is O(ε^{-2}) for nonsmooth Lipschitz continuous problems and O(ε^{-1/2}) for smooth problems with Lipschitz continuous gradient; see [52,53].
Thanks to their low memory requirements and simple structure, first-order methods have received much attention during the past few decades; indeed, they deal successfully with large-scale problems. In general, convex optimization problems can be solved by gradient-type algorithms [3,21,22,38], conjugate gradient methods [41,45,46], and spectral gradient methods [12,23,63] for smooth objectives, and by subgradient-type methods [27,51,57], proximal gradient methods [62,32], smoothing techniques [15,24,34,55], bundle-type algorithms [48,49], and primal-dual first-order methods [25,26,28] for nonsmooth objectives. Moreover, both classes can be addressed by (zero-order) coordinate descent methods and derivative-free methods. The current paper only addresses first-order methods and assumes that first-order black-box information, namely function values and subgradients of the objective function, is available.
Historically, gradient descent and subgradient methods were the first numerical schemes proposed to solve optimization problems with smooth and nonsmooth convex objective functions, respectively. In practice, they are too slow, especially for badly scaled problems. This is reflected in their worst-case complexity bounds for reaching an ε-solution: the gradient descent method achieves a complexity of the order O(ε^{-1}), which is not optimal for smooth problems, while subgradient methods attain the worst-case complexity of the order O(ε^{-2}). In 1983, Nemirovski & Yudin [52] derived optimal worst-case complexity bounds of first-order methods to achieve an ε-solution for several classes of problems, such as Lipschitz continuous nonsmooth problems and smooth problems with Lipschitz continuous gradient. If an algorithm attains the optimal worst-case complexity bound for a class of problems, it is called optimal. Optimal first-order methods date back to Nesterov [54] in 1983. This optimal first-order method is interesting both theoretically and computationally, attracting many researchers to the development of such schemes, for example Auslender & Teboulle [9], Beck & Teboulle [16], Devolder et al. [33], Gonzaga et al. [39,40], Lan [49], Lan et al. [50], Nesterov [55,56,58], Neumaier [59], and Tseng [65]. Computational comparisons for composite functions show that optimal Nesterov-type first-order methods are substantially superior to the gradient descent and subgradient methods; see, for example, Ahookhosh [1] and Becker et al. [18].
Content. In this paper we consider structured convex constrained optimization problems frequently observed in applications and develop OSGA to solve such problems efficiently. Two classes of convex domains are considered, namely, simple convex domains, for which the orthogonal projection is cheaply available, and sublevel sets of convex functions, referred to as functional domains. For problems with a simple domain, we first introduce an appropriate prox-function and then show that the solution of OSGA's subproblem is obtained by a projection onto the domain followed by solving a one-dimensional nonlinear equation. It is shown that, if an explicit formula for the projection is available, the nonlinear equation can be solved in a closed form in many interesting cases. We also establish the optimality conditions for functional domains and show that for some simple functions they lead to a closed-form solution. Finally, we report some numerical results for applications to show the efficiency of OSGA in comparison with some state-of-the-art algorithms.
The remainder of this paper is organized as follows. In the next section, we review the basic idea of OSGA. Section 3 considers structured convex constrained minimization over simple domains, Section 4 treats functional constraints, and both discuss how to solve the associated OSGA subproblem. We report numerical results in Section 5, and our conclusions are given in Section 6.
Notation and preliminaries. Let V be a real finite-dimensional vector space endowed with the norm ‖·‖, and let V* denote the dual space of all linear functionals on V, where the bilinear pairing ⟨g, x⟩ denotes the value of the functional g ∈ V* at x ∈ V. σ(x) is the function that takes a matrix x ∈ R^{m×n} and returns the vector of its singular values in nonincreasing order. If x is a positive definite matrix, we denote it by x ≻ 0. We also denote by x = Σ_{i=1}^n λ_i u_i u_i^T and x = Σ_i σ_i u_i v_i^T the eigenvalue decomposition and the singular value decomposition of x, respectively. For a function f : V → (−∞, ∞], a vector g ∈ V* is called a subgradient of f at x if f(x) is finite and f(z) ≥ f(x) + ⟨g, z − x⟩ for all z ∈ V. The set ∂f(x) of all subgradients is called the subdifferential of f at x.
We call a nonempty, closed, and convex subset C of V a simple convex domain if the orthogonal projection
P_C(y) := argmin_{x∈C} ½‖x − y‖₂² (1.1)
can be computed efficiently for every y ∈ V. Note that P_C(y) is unique since x ↦ ½‖x − y‖₂² is strongly convex. Computing the orthogonal projection is a well-studied topic in convex optimization, and the projection operator is available for many domains C either in a closed form or by a simple iterative scheme. Table 1 gives some practically interesting convex domains, the associated projection operators, and references for the formulas or iterative schemes.
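For illustration, the following MATLAB snippet sketches a few standard projection operators of the kind listed in Table 1; the formulas are classical, and the handle-based style is our own choice for this sketch, not the interface of the OSGA package.

    % A few classical projection operators P_C (a sketch, not the package interface).
    proj_nneg = @(y) max(y, 0);                                  % nonnegative orthant x >= 0
    proj_box  = @(y, lo, hi) min(max(y, lo), hi);                % box lo <= x <= hi
    proj_ball = @(y, xi) y * min(1, xi / norm(y));               % Euclidean ball ||x||_2 <= xi
    proj_half = @(y, a, b) y - (max(a'*y - b, 0) / (a'*a)) * a;  % halfspace <a,x> <= b

    % Example: project a random point onto the unit ball.
    y = randn(5, 1);
    u = proj_ball(y, 1);    % now norm(u) <= 1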
2. A review of OSGA. In this section we consider the convex optimization problem
min f(x) s.t. x ∈ C, (2.1)
where f : C → R is a convex function defined on a nonempty, closed, and convex subset C of V. The main objective is to find a solution u ∈ C using only first-order information, i.e., function values and subgradients. OSGA (see Algorithm 1) is an optimal subgradient algorithm for problem (2.1) that constructs a sequence of iterates whose function values converge to the minimum with the optimal complexity. Moreover, OSGA requires no information regarding global parameters such as Lipschitz constants of function values and gradients. The primary objective is to monotonically reduce bounds on the error f(x_b) − f* of function values, where f* is the minimum and x_b is the best known point.
OSGA considers linear relaxations of f of the form
f(z) ≥ γ + ⟨h, z⟩ for all z ∈ C, (2.2)
where γ ∈ R and h ∈ V*, and a continuously differentiable prox-function Q : C → R, i.e., a function satisfying
Q0 := inf_{z∈C} Q(z) > 0 (2.3)
and the strong convexity condition
Q(z) ≥ Q(x) + ⟨g_Q(x), z − x⟩ + (σ/2)‖z − x‖² for all x, z ∈ C, (2.4)
where σ = 1, g_Q(x) denotes the gradient of Q at x ∈ C, and ‖·‖ is a norm defined on V. OSGA solves a sequence of maximization problems of the form
E(γ, h) := sup_{z∈C} E_{γ,h}(z), (2.5)
where it is known that the supremum is positive. The function E_{γ,h} : C → R is defined by
E_{γ,h}(z) := −(γ + ⟨h, z⟩)/Q(z). (2.6)
If u is the solution of this problem, then it is assumed that e = E(γ, h) and u = U(γ, h) are readily computable.
In [59], it is shown that OSGA attains the bound
0 ≤ f(x_b) − f* ≤ η Q(x*) (2.7)
on function values, where x* is any solution of (2.1). Hence, by decreasing the error factor η, convergence to an ε-minimizer x_b is guaranteed for any accuracy tolerance ε > 0. In [59], it is also shown that the number of iterations needed to achieve such a minimizer is of the order O(ε^{-1/2}) for smooth f with Lipschitz continuous gradient and of the order O(ε^{-2}) for Lipschitz continuous nonsmooth f, which is optimal in both cases; cf. Nemirovsky & Yudin [52] and Nesterov [53]. The algorithm does not need to know the global Lipschitz parameters and has a low memory requirement. Hence, if the subproblem (2.5) can be solved efficiently, OSGA is appropriate for solving large-scale problems. Numerical results reported by Ahookhosh in [1] and Ahookhosh & Neumaier in [4,5], for unconstrained problems, and by Ahookhosh & Neumaier in [6,7], for constrained problems, show the promising behavior of OSGA on practical problems. In the next section we show that, by selecting a suitable prox-function, OSGA's subproblem (2.5) can be solved efficiently for structured convex constrained problems.
Algorithm 1: OSGA (optimal subgradient algorithm). In each iteration, OSGA updates the parameters α, h, γ, η, and u using the following scheme discussed in [59]:
Algorithm 2: PUS (parameters updating scheme).
3. Structured convex constrained problems in simple domains. In this paper we consider the convex constrained optimization problem
min f(Ax) s.t.
x ∈ C, (3.1)
where f : R^m → R is convex and lower semicontinuous, A : R^n → R^m is a linear operator, and C is a simple convex domain. We call problem (3.1) a simple domain problem. This problem appears in many applications such as signal and image processing, machine learning, statistics, and inverse problems.
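Since OSGA only queries function values and subgradients, the first-order oracle for the composed objective x ↦ f(Ax) in (3.1) is obtained from an oracle for f by the chain rule; the sketch below (our illustration, with hypothetical handles fval and fsub for f and one of its subgradients) shows the only place where A enters.

    % First-order oracle for x -> f(A*x), given an oracle for f (a sketch).
    % fval(z) returns f(z); fsub(z) returns some g in the subdifferential of f at z.
    fval = @(z) norm(z, 1);            % example: f = ||.||_1
    fsub = @(z) sign(z);               % a subgradient of ||.||_1
    A    = randn(20, 50);

    obj  = @(x) fval(A*x);             % function value of the composition
    gsub = @(x) A' * fsub(A*x);        % chain rule: A^T g is a subgradient at x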

Example 3.1 (Image restoration).
The process of reconstructing or estimating a true image from a degraded observation is known as image restoration, also called deblurring or deconvolution. Image restoration is addressed by solving a constraint satisfaction problem of the form
Ax = y, x ∈ C,
where C is a convex domain that is commonly a box or the nonnegativity constraint. This is an ill-posed problem, see Neumaier [60], and it is normally handled by the regularized least-squares problem
min ½‖Ax − y‖₂² + λϕ(x) s.t. x ∈ C, (3.2)
or the regularized l1 problem
min ‖Ax − y‖₁ + λϕ(x) s.t. x ∈ C, (3.3)
where ϕ is a suitable regularizer such as ‖·‖₁, ‖·‖_ITV, or ‖·‖_ATV. The regularizers ‖·‖_ITV and ‖·‖_ATV are respectively called isotropic and anisotropic total variation, see, for example, [29], where for x ∈ R^{m×n} they are defined by
‖x‖_ITV = Σ_{i<m,j<n} √((x_{i+1,j} − x_{i,j})² + (x_{i,j+1} − x_{i,j})²) + Σ_{i<m} |x_{i+1,n} − x_{i,n}| + Σ_{j<n} |x_{m,j+1} − x_{m,j}|,
‖x‖_ATV = Σ_{i<m,j<n} (|x_{i+1,j} − x_{i,j}| + |x_{i,j+1} − x_{i,j}|) + Σ_{i<m} |x_{i+1,n} − x_{i,n}| + Σ_{j<n} |x_{m,j+1} − x_{m,j}|.
Example 3.2 (Basis pursuit problem). Let A : R^n → R^m be a linear operator with m < n and let y ∈ R^m. The basis pursuit problem is the constrained minimization problem
min ‖x‖₁ s.t. Ax = y, (3.4)
which determines an l1-minimal solution x of the underdetermined linear system Ax = y. This problem appears in many applications such as signal and image processing and compressed sensing; see [19,20,31,35,67,68,69] and references therein.
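To make the two regularizers concrete, the following sketch evaluates the discrete isotropic and anisotropic TV of an image; for brevity it keeps only the interior differences and drops the boundary terms of the definitions above.

    % Discrete TV of an m-by-n image X (interior part only; boundary terms dropped).
    function [itv, atv] = tv_norms(X)
      Dv = X(2:end, 1:end-1) - X(1:end-1, 1:end-1);  % vertical differences
      Dh = X(1:end-1, 2:end) - X(1:end-1, 1:end-1);  % horizontal differences
      itv = sum(sum(sqrt(Dv.^2 + Dh.^2)));           % isotropic TV
      atv = sum(abs(Dv(:))) + sum(abs(Dh(:)));       % anisotropic TV
    end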
According to the features of the objective functions, (3.2) can be solved by Nesterov-type optimal methods; however, (3.3) and (3.4) cannot. Since OSGA only needs first-order information, it can deal with all of these problems without exploiting the particular structure of each problem. In the remainder of this section, we establish how OSGA can be used to efficiently solve problem (3.1). Since the underlying problem (3.1) is a special case of problem (2.1) considered in [59], the complexity bounds of OSGA remain valid for both smooth and nonsmooth problems.
The quadratic function
Q(x) := Q0 + ½‖x‖₂², (3.5)
with Q0 > 0, is a prox-function; see, e.g., [1]. We now show that the solution of OSGA's subproblem (2.5) can be found either in a closed form or by a simple iterative scheme. In particular, we address some convex domains for which a closed-form solution of the associated OSGA subproblem (2.5) can be found.
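Indeed, it is straightforward to verify that (3.5) satisfies the two defining properties of a prox-function: Q ≥ Q0 > 0 is immediate, and strong convexity with σ = 1 follows (using g_Q(x) = x) from the exact expansion
Q(z) − Q(x) − ⟨g_Q(x), z − x⟩ = ½‖z‖₂² − ½‖x‖₂² − ⟨x, z − x⟩ = ½‖z − x‖₂².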
The next result shows that the solution of the auxiliary subproblem (2.5) is given by the orthogonal projection (1.1) of y := −e^{-1}h onto the domain C, followed by solving a one-dimensional nonlinear equation to determine e.
Theorem 3.3. Let u be a minimizer of (2.5), and let e = E_{γ,h}(u) > 0. Then
u = P_C(−e^{-1}h), (3.6)
where e is a solution of the univariate equation
ϕ(e) := e Q(P_C(−e^{-1}h)) + γ + ⟨h, P_C(−e^{-1}h)⟩ = 0. (3.7)
Proof. From Proposition 5.1 in [59], at the minimizer u we obtain e Q(u) + γ + ⟨h, u⟩ = 0. The first-order optimality condition for this problem is
e g_Q(u) + h ∈ −N_C(u),
where N_C(u) denotes the normal cone to C at u. With the prox-function (3.5) this reads
⟨e u + h, z − u⟩ ≥ 0 for all z ∈ C, (3.8)
i.e., u is a solution of the minimization problem inf_{z∈C} ⟨e u + h, z − u⟩. Since e > 0, this variational inequality characterizes u = P_C(y) with y = −e^{-1}h, giving the result.
Theorem 3.3 gives a way to compute a solution of OSGA's subproblem (2.5) by a projection onto the domain C followed by the solution of a one-dimensional nonlinear equation. This equation can be solved exactly for some projection operators, see Table 2. Otherwise, one can solve it approximately using zero-finding schemes, see, e.g., Chapter 5 of [61]. We apply the results of Theorem 3.3 in the next scheme (Algorithm 3: OSS) to solve OSGA's subproblem (2.5). To implement Algorithm 3 (OSS), we first need to solve the projection problem (1.1) effectively, see Table 1. If one solves the equation ϕ(e) = 0 approximately and an initial interval [a, b] is available such that ϕ(a)ϕ(b) < 0, then a solution can be computed to ε-accuracy using the bisection scheme in O(log₂((b − a)/ε)) iterations, see, for example, [61]. However, it is preferable to use a more sophisticated zero finder like the secant bisection scheme (Algorithm 5.2.6 in [61]). If an interval [a, b] with a sign change is available, one can also use MATLAB's fzero function, which combines the bisection scheme, inverse quadratic interpolation, and the secant method.
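The following MATLAB sketch renders the projection-plus-bisection idea of Theorem 3.3 and Algorithm 3 (OSS) under the assumptions of this section (prox-function (3.5) and a user-supplied projector); it is our schematic rendering, not the package code, and a sign-changing bracket [a, b] for ϕ is assumed to be given.

    % Sketch of OSS: solve phi(e) = e*Q(P_C(-h/e)) + gamma + <h, P_C(-h/e)> = 0
    % by bisection on a given bracket [a,b], then recover u = P_C(-h/e).
    function [u, e] = oss(gamma, h, Q0, proj, a, b, tol)
      Q   = @(x) Q0 + 0.5 * norm(x)^2;                 % prox-function (3.5)
      phi = @(e) e * Q(proj(-h/e)) + gamma + h' * proj(-h/e);
      if phi(a) * phi(b) > 0, error('no sign change on [a,b]'); end
      while b - a > tol
        c = 0.5 * (a + b);
        if phi(a) * phi(c) <= 0, b = c; else, a = c; end
      end
      e = 0.5 * (a + b);
      u = proj(-h/e);
    end

For the domains of Table 2, of course, the bisection is unnecessary since e is available in closed form.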
In the following we investigate special domains C for which the nonlinear equation ϕ(e) = 0 can be solved explicitly, see Table 2.
Proposition 3.1. If C = {x ∈ V | Ax = b} is an affine set, where A has full row rank, then the subproblem (2.5) is solved by u = P_C(−e^{-1}h), where
P_C(y) = y − Aᵀ(AAᵀ)^{-1}(Ay − b), (3.10)
e = (−β2 + √(β2² + 4β1β3))/(2β1), (3.11)
and, with c := Aᵀ(AAᵀ)^{-1}b,
β1 := Q0 + ½‖c‖₂², β2 := γ + ⟨h, c⟩, β3 := ½‖h − Aᵀ(AAᵀ)^{-1}Ah‖₂². (3.12)
Proof. The projection operator onto C is given by (3.10). This and y = −e^{-1}h give
u = −e^{-1}(h − Aᵀ(AAᵀ)^{-1}Ah) + c.
This, together with (3.7), yields the quadratic equation β1 e² + β2 e − β3 = 0, where β1, β2, and β3 are defined in (3.12). Since the subproblem (2.5) is a maximization, the bigger root of this equation is selected, which is given by (3.11).
Proposition 3.2. If C = {x ∈ V | aᵀx = b} is a hyperplane, then the subproblem (2.5) is solved by u = P_C(−e^{-1}h), where
P_C(y) = y − ((⟨a, y⟩ − b)/‖a‖₂²) a, (3.13)
and e is given by (3.11) with
β1 := Q0 + b²/(2‖a‖₂²), β2 := γ + b⟨a, h⟩/‖a‖₂², β3 := ½‖h − (⟨a, h⟩/‖a‖₂²) a‖₂². (3.14)
Proof. Since the hyperplane C = {x ∈ V | aᵀx = b} is an affine set, this is a special case of Proposition 3.1.
Proposition 3.3. If C = {x ∈ V | ⟨a, x⟩ ≤ b} is a halfspace, then the subproblem (2.5) is solved by u = P_C(−e^{-1}h), where
P_C(y) = y − (max{⟨a, y⟩ − b, 0}/‖a‖₂²) a, (3.15)
and e = max{e1, e2}, where e1 is given by (3.11) with β1 := Q0, β2 := γ, β3 := ½‖h‖₂², and e2 is given by (3.11) with β1, β2, β3 defined in (3.14).
Proof. The projection operator onto C is given by (3.15). This leads to two cases. If ⟨a, h⟩ ≥ −eb, we obtain u = −e^{-1}h, so (3.7) yields β1 e² + β2 e − β3 = 0 with β1 := Q0, β2 := γ, and β3 := ½‖h‖₂². This identity leads to a solution of the form (3.11), say e1. If ⟨a, h⟩ < −eb, (3.13) is valid and e is computed by (3.11) with β1, β2, and β3 defined in (3.14), say e2. After computing e1 and e2, we check whether the inequalities ⟨a, h⟩ ≥ −e1 b and ⟨a, h⟩ < −e2 b are satisfied. Since the subproblem (2.5) has a solution, at least one of the conditions has to be satisfied. If only one of them is satisfied, the corresponding e and (3.6) give the solution. If both of them hold, we take the solution with the bigger e.
Proposition 3.4. If C = {x ∈ V | x ≥ 0} is the nonnegative orthant, then the subproblem (2.5) is solved by u = P_C(−e^{-1}h), where
P_C(y) = max{y, 0}, (3.18)
and e is given by (3.11) with
β1 := Q0, β2 := γ, β3 := ½‖min{h, 0}‖₂². (3.19)
Proof. The projection operator onto C is given by (3.18), leading to u = max{−e^{-1}h, 0} = −e^{-1} min{h, 0}. This and (3.7) imply e Q(P_C(−e^{-1}h)) + γ + ⟨h, P_C(−e^{-1}h)⟩ = 0, i.e., β1 e² + β2 e − β3 = 0, where β1, β2, and β3 are defined in (3.19), giving the result.
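For completeness, the short computation behind (3.19) (using the prox-function (3.5)) is:
u = −e^{-1} min{h, 0} implies ⟨h, u⟩ = −(1/e)‖min{h, 0}‖₂² and ‖u‖₂² = (1/e²)‖min{h, 0}‖₂²,
so that
ϕ(e) = e Q0 + γ − (1/(2e))‖min{h, 0}‖₂² = 0, which is equivalent to Q0 e² + γ e − ½‖min{h, 0}‖₂² = 0.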
To solve bound-constrained problems with OSGA, we developed an algorithm that can find the global solution of the subproblem (2.5) by solving a sequence of one-dimensional rational optimization problems; see Algorithm 3 in [6]. Notice that the constraint C := {x ∈ V | ‖x‖∞ ≤ ξ} is a special case of the bound-constrained setting with lower bound −ξ1 and upper bound ξ1, where 1 is the n-dimensional vector with all elements equal to one.
4. Solving structured problems with a functional constraint. In this section we consider the structured convex constrained problem
min f(Ax) s.t. φ(x) ≤ ξ, (4.1)
where φ : V → R is a simple smooth or nonsmooth, real-valued, convex loss function, and ξ is a real constant. We call problem (4.1) a functional constraint problem. While it is the special case of (3.1) with
C := {x ∈ V | φ(x) ≤ ξ}, (4.2)
one can solve OSGA's subproblem (2.5) directly by using the KKT optimality conditions, which is especially useful when no efficient method for finding the projection onto C is known. Indeed, if a nonsmooth problem can be reformulated in the form (4.1) with a smooth f and a nonsmooth φ, then OSGA can solve this nonsmooth problem with a complexity of the order O(ε^{-1/2}), which is optimal for smooth problems.
Example 4.1 (Linear inverse problem). Let A : R^n → R^m be an ill-conditioned or singular linear operator and let y ∈ R^m be a vector of observations. The linear inverse problem is the quest of finding x ∈ R^n such that
y = Ax + ν,
with unknown but small additive noise ν ∈ R^m. The problem is solvable if one knows additional qualitative information about x. This qualitative information is encoded in a constraint on x, under which the Euclidean norm of ν is minimized. Constrained optimization problems resulting from two typical qualitative constraints are
min ½‖Ax − y‖₂² s.t. ‖x‖₂ ≤ ξ (4.3)
and
min ½‖Ax − y‖₂² s.t. ‖x‖₁ ≤ ξ, (4.4)
in which ξ is a nonnegative real constant. This type of problem often occurs in applied sciences and engineering, see [44,64].
In the remainder of this section we assume that the functional constraint satisfies the Cottle constraint qualification [10]:
(H1) For all x ∈ C, either φ(x) < ξ or 0 ∉ ∂φ(x).
We also need the following result.
Proof. Define the function given in (4.7). Since this function is differentiable, the claim follows by differentiating both sides of the defining equality.
In view of the KKT optimality conditions for inequality constrained nonsmooth problems, see [10], we have the optimality condition (4.8) for (4.1). Now, by substituting (4.7) into (4.8), setting e := −(γ + ⟨h, u⟩)/Q(u), and distinguishing between µ = 0 and µ > 0, we obtain either (4.5) or (4.6).
Theorem 4.2 gives the optimality conditions for a general function φ; however, in view of Theorem 3.3, it is especially useful when the projection onto C = {x | φ(x) ≤ ξ} is not efficiently available. In the remainder of this section, we derive the solution of OSGA's subproblem (2.5) for some φ, such as ‖·‖₂ and ‖·‖_{1,2}, that appear in many applications. We already solved OSGA's subproblem (2.5) with the constraint C = {x | ‖x‖₂ ≤ ξ} in Proposition 3.5, but to show how to apply Theorem 4.2 we study it again in the next result.
Case (i). The condition (4.5) holds. Then we have u = −e^{-1}h. By substituting this into the identity E_{γ,h}(u) = e, we get β1 e² + β2 e − β3 = 0. Using the bigger root of this equation, we have e given by (3.11) with β1 = Q0, β2 = γ, and β3 = ½‖h‖₂². Case (ii). The condition (4.6) holds. Then the optimality condition implies that there exists λ such that u = λh. By substituting this into φ(u) = ‖u‖₂ = ξ, we get |λ|‖h‖₂ = ξ; the maximizing choice is λ = −ξ/‖h‖₂, i.e., u = −ξh/‖h‖₂. Now, substituting u into (4.9), we obtain
e = (ξ‖h‖₂ − γ)/(Q0 + ½ξ²).
This gives the result.
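A remark (ours, for clarity): case (i) applies exactly when the resulting unconstrained maximizer is feasible, since
‖u‖₂ = ‖−e^{-1}h‖₂ = ‖h‖₂/e ≤ ξ if and only if ‖h‖₂ ≤ ξe;
otherwise the constraint is active and case (ii) applies.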
In 2006, Yuan and Lin [70] proposed an interesting regularizer, called the grouped LASSO, for linear regression. Later, Kim et al. [44] proposed a constrained ridge regression model using the constraint ‖x‖_{1,2} := Σ_{i=1}^m ‖x_{g_i}‖₂ ≤ ξ, where x = (x_{g_1}, ..., x_{g_m}) and ‖·‖_{1,2} is the so-called l_{1,2} group norm. We consider this constraint in the next result.
Proof. Similar to Proposition 4.3, we consider u ≠ 0. In view of Proposition 4.1, we may apply Theorem 4.2, leading to two cases: (i) (4.5) holds; (ii) (4.6) holds. Case (i). The condition (4.5) holds. Then u = −e^{-1}h; substituting this into E_{γ,h}(u) = e and using the bigger root of the resulting quadratic, we get e given by (3.11) with β1 := Q0, β2 := γ, and β3 := ½‖h‖₂². Case (ii). The condition (4.6) holds. If h_{g_i} = 0, then u_{g_i} = 0. Now let h_{g_i} ≠ 0. Substituting u_{g_i} = τ_i h_{g_i} into the optimality condition and summing over the groups, together with ‖u‖_{1,2} = ξ, determines the coefficients τ_i.
Substituting this into (4.11) gives u, and substituting u into E_{γ,h}(u) = e gives e, completing the proof.

5. Numerical experiments.
A software package for solving unconstrained and simply constrained convex optimization problems with OSGA is publicly available at http://homepage.univie.ac.at/masoud.ahookhosh/. The package is written in MATLAB; it uses the parameters δ = 0.9, α_max = 0.7, κ = κ' = 0.5, Ψ_target = −∞, and the prox-function (3.5) with Q0 = ½‖x0‖₂² + ε, where ε is the machine precision. A user manual [2] describes the design and use of the package. Some examples are included as illustrations.
This section discusses numerical results and comparisons of OSGA with some state-of-the-art first-order solvers on some ridge regression and image deblurring problems. All numerical results were produced with version 1.1 of the above software. The algorithms used for comparison employ the default parameter values reported in the corresponding papers or packages. All numerical experiments were executed on a Toshiba Satellite Pro L750-176 laptop with an Intel Core i7-2670QM processor and 8 GB RAM.

Ridge regression.
In this subsection we consider the l2-constrained least-squares problem of the form (4.3) (the so-called ridge regression, see [47]) and report some numerical results.
The problem is generated by i_laplace.m, where n = 5000 is the problem dimension; i_laplace.m is an ill-posed test problem generator based on the inverse Laplace transformation from the Regularization Tools MATLAB package, which is available at http://www.imm.dtu.dk/~pcha/Regutools/. Since (4.3) is smooth and the projection onto C = {x ∈ R^n | ‖x‖₂ ≤ ξ} is available (see Table 1), we employ the projected gradient algorithm (PGA), the spectral projected gradient method [23] with the Grippo et al. nonmonotone term [37] (SPG-G), the spectral projected gradient method with the Amini et al. nonmonotone term [8] (SPG-A), and OSGA (see Proposition 4.3) to solve this minimization problem. The parameters of SPG-G and SPG-A are the same as those reported in the associated papers. The algorithms are stopped after 500 iterations.
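Schematically, this experiment can be set up along the following lines (a sketch under our assumptions: the generator call [A, y, xtrue] = i_laplace(n) of Regularization Tools and a plain projected gradient baseline; the exact driver in the OSGA package may differ).

    % Sketch of the ridge regression experiment (4.3) with the i_laplace test problem.
    n = 5000; xi = 15;
    [A, y, xtrue] = i_laplace(n);             % ill-posed inverse Laplace problem
    f    = @(x) 0.5 * norm(A*x - y)^2;        % objective of (4.3)
    g    = @(x) A' * (A*x - y);               % its gradient
    proj = @(x) x * min(1, xi / norm(x));     % projection onto ||x||_2 <= xi

    x = zeros(n, 1);                          % PGA baseline with step 1/L
    t = 1 / normest(A)^2;                     % L = ||A||_2^2 (estimated)
    for k = 1:500
      x = proj(x - t * g(x));
    end
    fprintf('f(x) = %.4e\n', f(x));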
In Table 3 we consider ξ = 10, 15, 20, 25 and report the best attained function values and the running time. The results imply that OSGA attains the best running times, and for ξ = 15 it gives the best function values. To see the results of the implementation in more detail, we demonstrate the relative error of function values in Figure 1, where f* denotes the minimum and f0 the function value at the initial point x0.

Image deblurring with nonnegativity constraint.
As discussed in Section 3, inverse problems appear in many fields of applied sciences and engineering. This particularly happens when researchers use digital images to record and analyze experimental results in fields such as astronomy, medical sciences, biology, geophysics, and physics. In these cases, observing blurred and noisy images is a common phenomenon caused by environmental effects and imperfections in the imaging system.
In many applications, the variable x describes physical quantities that are only meaningful if each component of x is nonnegative. This constraint is referred to as the nonnegativity constraint; it is especially useful for restoring blurred and noisy images, see [11,42,43,66].
We restore the 256 × 256 blurred and noisy MR-brain image using the model (3.2) equipped with the isotropic total variation regularizer. The true image is available at http://graphics.stanford.edu/data/voldata/. The blurred/noisy image y is generated by a 9 × 9 uniform blur and additive Gaussian noise with zero mean and standard deviation 10^{-3}. For restoring the image, we use OSGA (see Proposition 3.4), MFISTA (a monotone version of FISTA proposed by Beck & Teboulle in [17]), ADMM (an alternating direction method proposed by Chan et al. in [30]), and PSGA (a projected subgradient scheme with nonsummable diminishing step size, see [27]). The original codes of MFISTA and ADMM provided by the authors are used. Since the methods are sensitive to the regularization parameter λ, three different regularization parameters are used. The algorithms are stopped after 100 iterations. The comparison concerning the quality of the recovered image is made via the so-called peak signal-to-noise ratio (PSNR), defined for images with pixel values in [0,1] by
PSNR = 20 log10(√(mn)/‖x − x̄‖_F),
where x̄ denotes the true m × n image. The results show that OSGA is competitive with the other methods and even better than them in the sense of ISNR. The deblurred images produced by the algorithms considered are illustrated in Figure 3 for λ = 10^{-4}. We also consider the restoration of the 641 × 641 blurred/noisy Dione image using (3.3). The true image is available at http://photojournal.jpl.nasa.gov/Help/ImageGallery.html. The blurred/noisy image is constructed with a 7 × 7 Gaussian kernel with standard deviation 5 and salt-and-pepper impulsive noise with level 50%. To recover the image, we use DRPD-1 and DRPD-2 (Douglas-Rachford primal-dual schemes proposed by Boţ & Hendrich in [25]), ADMM, and OSGA. The algorithms are stopped after 100 iterations, and three different regularization parameters are considered. The results are reported in Table 5 and Figures 4 and 5.
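For reference, the quality measures can be computed as in the following sketch (standard definitions assumed; xtrue, y, and x denote the true, observed, and restored images here).

    % PSNR (images scaled to [0,1]) and ISNR of a restored image x.
    psnr = 20 * log10(sqrt(numel(xtrue)) / norm(x(:) - xtrue(:)));
    isnr = 10 * log10(norm(y(:) - xtrue(:))^2 / norm(x(:) - xtrue(:))^2);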
The results of Table 5 show that OSGA outperforms the others in the sense of PSNR. Figure 4 indicates that OSGA attains the best function values for λ = 10^{-1} and λ = 5 × 10^{-2}; however, ADMM attains the best function value for λ = 5 × 10^{-1}. It also shows that OSGA is comparable with, or even better than, the others regarding ISNR. The resulting images for λ = 10^{-1} are illustrated in Figure 5, demonstrating that all the algorithms restore the image with acceptable quality, while OSGA obtains the best function value and PSNR.
6. Conclusions. In this paper an optimal subgradient method, OSGA, is applied to solving structured convex constrained optimization problems. More specifically, finding a solution of OSGA's subproblem is investigated in the presence of some convex constraints. Two types of convex constraints are considered, namely, simple convex domains, for which the orthogonal projection onto the domain is efficiently available, and functional constraints, defined as sublevel sets of simple convex functions. In each case some interesting examples are discussed for which OSGA's subproblem can be solved efficiently. Numerical results and comparisons with some state-of-the-art algorithms are reported, showing that OSGA is efficient and reliable for solving convex optimization problems arising in applications.


Table 1: List of some available projection operators for simple convex domains.

Table 3: Result summary for the ridge regression.