Variable Smoothing for Weakly Convex Composite Functions

We study minimization of a structured objective function, being the sum of a smooth function and a composition of a weakly convex function with a linear operator. Applications include image reconstruction problems with regularizers that introduce less bias than the standard convex regularizers. We develop a variable smoothing algorithm, based on the Moreau envelope with a decreasing sequence of smoothing parameters, and prove a complexity of O(ε^{-3}) to achieve an ε-approximate solution. This bound interpolates between the O(ε^{-2}) bound for the smooth case and the O(ε^{-4}) bound for the subgradient method. Our complexity bound is in line with other works that deal with structured nonsmoothness of weakly convex functions.


Introduction
We study minimization of the sum of a smooth function and a nonsmooth, weakly convex function composed with a linear operator. The case in which the nonsmooth regularizer is convex has been studied extensively; see [1,2]. Weakly convex functions (which can be expressed as the difference between a convex function and a quadratic) share some properties with convex functions but include many interesting nonconvex cases, as we discuss in Sect. 2.1. For example, any smooth function with a uniformly Lipschitz continuous gradient is a weakly convex function.
Our approach makes use of a smooth approximation of the weakly convex function known as the Moreau envelope, parametrized by a positive scalar μ. Since the gradient of the Moreau envelope is obtained by applying a proximal operator to the function, our method is suitable for problems where this proximal operator can be evaluated at reasonable cost. Our method requires O(ε^{-3}) iterations to obtain an ε-approximate stationary point. The remainder of the paper is organized as follows. Section 2 is concerned with other problem formulations related to ours and describes specific problems with weakly convex regularizers. In Sect. 3, we give the necessary mathematical preliminaries, including a detailed discussion of the notion of stationarity we use. Section 4 describes our approach and its convergence properties. In Sect. 5, we highlight the difference between the variable smoothing technique and a simple proximal gradient approach, for the case in which the linear operator is not present in the weakly convex term.

Problem Class and Algorithmic Approach
The problem we address in this paper has the form

min_{x ∈ R^d} F(x) := h(x) + g(Ax), (1)

for a smooth function h : R^d → R, a weakly convex function g : R^n → R (generally nonsmooth) and a matrix A ∈ R^{n×d}. For some ρ ≥ 0, we say that g : R^n → R is ρ-weakly convex if g + (ρ/2)‖·‖² is convex.
When g is a smooth function with a uniformly Lipschitz continuous gradient, with Lipschitz constant L, then g is weakly convex with ρ = L. Other interesting weakly convex functions are discussed in Sect. 2.1. The Moreau envelope g_μ is a smooth approximation of g, parametrized by a positive scalar μ. The Moreau envelope and the closely related proximal operator are defined as follows.

Definition 2.1 For a proper, ρ-weakly convex and lower semicontinuous function g : R^n → R, the Moreau envelope of g with the parameter μ ∈ ]0, ρ^{-1}[ is the function from R^n to R defined by

g_μ(y) := min_{z ∈ R^n} { g(z) + (1/(2μ))‖z − y‖² }.

The proximal operator of the function μg is the arg min of the right-hand side in this definition, that is,

prox_{μg}(y) := arg min_{z ∈ R^n} { g(z) + (1/(2μ))‖z − y‖² }.

Note that prox_{μg}(y) is defined uniquely by this formula, because the function being minimized is strongly convex. We describe in Lemma 3.1 the relationship between ∇g_μ(y) and prox_{μg}(y), which is key to our algorithm.
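To make these definitions concrete, here is a small numerical sketch (not from the paper; the penalty and its parameters are illustrative choices) that evaluates the Moreau envelope and proximal operator of the one-dimensional minimax concave penalty by brute-force minimization over a fine grid, and checks by finite differences that the envelope gradient equals μ^{-1}(y − prox_{μg}(y)), as stated later in Lemma 3.1.

```python
import numpy as np

# Moreau envelope and prox of the 1-D minimax concave penalty (MCP),
# a (1/THETA)-weakly convex function, via brute-force minimization.
LAM, THETA = 1.0, 2.0   # MCP parameters; weak-convexity modulus rho = 1/THETA

def mcp(x):
    ax = abs(x)
    return LAM * ax - x ** 2 / (2 * THETA) if ax <= LAM * THETA else THETA * LAM ** 2 / 2

def envelope_and_prox(y, mu, grid):
    # g_mu(y) = min_z g(z) + (z - y)^2 / (2 mu); return value and minimizer.
    vals = np.array([mcp(z) + (z - y) ** 2 / (2 * mu) for z in grid])
    k = vals.argmin()
    return vals[k], grid[k]

grid = np.linspace(-5.0, 5.0, 200001)
y, mu = 1.7, 0.4                     # mu < 1/rho = THETA, so the prox is unique

env, p = envelope_and_prox(y, mu, grid)

# Finite-difference check of grad g_mu(y) = (y - prox_{mu g}(y)) / mu.
h = 1e-4
num_grad = (envelope_and_prox(y + h, mu, grid)[0]
            - envelope_and_prox(y - h, mu, grid)[0]) / (2 * h)
print(p, abs(num_grad - (y - p) / mu))   # prox point, small discrepancy
```

For these parameters the subproblem is (μ^{-1} − ρ)-strongly convex, so the grid minimizer is a good proxy for the unique prox point.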
Steps of our algorithm have the form

x_{k+1} := x_k − γ ∇(h + g_μ ∘ A)(x_k),

for some step length γ. Accelerated versions of these approaches have been proposed for convex problems in [3][4][5]. The use of acceleration makes the analysis more complicated than for the gradient case; see [6,7].

Composite Problems
We discuss several instances of problems of the form (1).

Regularization with ‖·‖_1 (LASSO). Functions that are "sharp" around zero have a long history as regularizers that induce sparsity in the solution vector x. Foremost among such functions is the vector norm ‖·‖_1, which is used for example in sparse least-squares regression (also known as LASSO [8]):

min_x (1/2)‖Mx − b‖² + λ‖x‖_1, (2)

for a data matrix M and observation vector b. This formulation is convex and forms a special case of (1) in which A is given by the identity. Regularization with the norm ‖·‖_1 is used also in logistic regression [9].

Other Convex Regularizers. The case of problems (1) in which g is nonsmooth and convex (with possible smooth and/or nonsmooth additive terms) has received a great deal of attention in the literature on convex optimization and applications; see for example [1,4,10-12]. The most notable applications are found in inverse problems involving images. In particular, discrete (an)isotropic Total Variation (TV) denoising has the form

min_x (1/2)‖x − b‖² + λ‖∇x‖_1,

where b is the observed (noisy) image and ∇ denotes the discretized gradient in two or three dimensions. TV deblurring problems have the form

min_x (1/2)‖Bx − b‖² + λ‖∇x‖_1,

where B is the blurring operator; see [1,13].
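As a computational aside, the proximal operator of μ‖·‖_1 is the componentwise soft-thresholding operator; the sketch below (with illustrative values) also shows why the ℓ_1 penalty biases large coefficients, a point taken up next.

```python
import numpy as np

# prox of mu * ||.||_1 is soft thresholding, applied componentwise.
def soft_threshold(y, mu):
    return np.sign(y) * np.maximum(np.abs(y) - mu, 0.0)

y = np.array([3.0, -0.2, 0.7])
print(soft_threshold(y, 0.5))
# For |y_i| > mu the output is y_i shrunk by exactly mu, so the operator
# never approaches the identity even for large arguments; this constant
# shrinkage is the source of the bias of the l1 regularizer.
```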
Other examples of convex problems of the form (1) include generalized convex feasibility [4] and support vector machine classification [5]. A typical formulation of the latter problem has h(x) = (λ/2)‖x‖² and g(Ax) = Σ_{i=1}^n φ(y_i a_i^T x), where φ(s) = max{1 − s, 0} is the hinge loss and the rows of A are y_i a_i^T, i = 1, 2, ..., n, where (y_i, a_i) ∈ {−1, 1} × R^d are the training points and their labels.

Weakly Convex Regularizers. The use of the ℓ_1 regularizer in (2) tends to depress the magnitude of nonzero elements of the solution, resulting in bias. This phenomenon is a consequence of the fact that the proximal operator of the ℓ_1-norm, often called the soft-thresholding operator, does not approach the identity even for large arguments. For this reason, nonconvex alternatives to ‖·‖_1 are often used to reduce bias. These include ℓ_p-norms (with 0 < p < 1), which are not weakly convex, and several weakly convex regularizers, which we now describe. The minimax concave penalty (MCP), introduced in [14] and used in [15,16], is a family of functions r_{λ,θ} : R → R_+ involving two positive parameters λ and θ, and defined by

r_{λ,θ}(x) := λ|x| − x²/(2θ) if |x| ≤ λθ, and r_{λ,θ}(x) := θλ²/2 otherwise.
(Note that this function satisfies the definition of ρ-weak convexity with ρ = θ^{-1}.) The proximal operator of this function (called the firm threshold in [17]) can be written in the following closed form when θ > γ:

prox_{γ r_{λ,θ}}(y) = 0 if |y| ≤ γλ; sign(y) θ(|y| − γλ)/(θ − γ) if γλ < |y| ≤ θλ; y if |y| > θλ.

The fractional penalty function (cf. [16,18]) φ_a : R → R_+ (for parameter a > 0) is another weakly convex alternative. The smoothly clipped absolute deviation (SCAD) [19] (cf. [16]) is defined for parameters λ > 0 and θ > 2 as follows:

r(x) = λ|x| if |x| ≤ λ; (2θλ|x| − x² − λ²)/(2(θ − 1)) if λ < |x| ≤ θλ; λ²(θ + 1)/2 if |x| > θλ.

(This function is (θ − 1)^{-1}-weakly convex.) Since these functions approach (or attain) a finite value as their argument grows in magnitude, they do not introduce as much bias in the solution as does the ℓ_1 norm, and their proximal operators approach the identity for large arguments.
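The firm-thresholding formula for the MCP prox can be sanity-checked numerically; the sketch below (parameters are illustrative) compares it against a brute-force evaluation of the defining arg min on a fine grid.

```python
import numpy as np

LAM, THETA = 1.0, 2.0        # MCP parameters (illustrative)

def mcp(x):
    ax = abs(x)
    return LAM * ax - x ** 2 / (2 * THETA) if ax <= LAM * THETA else THETA * LAM ** 2 / 2

def firm_threshold(y, gamma):
    # prox of gamma * MCP, valid when THETA > gamma
    ay = abs(y)
    if ay <= gamma * LAM:
        return 0.0
    if ay <= THETA * LAM:
        return np.sign(y) * THETA * (ay - gamma * LAM) / (THETA - gamma)
    return y                 # identity for large |y|: no shrinkage bias

gamma = 0.5
grid = np.linspace(-6.0, 6.0, 120001)
for y in [0.3, 1.2, -1.5, 5.0]:
    vals = np.array([mcp(z) + (z - y) ** 2 / (2 * gamma) for z in grid])
    brute = grid[vals.argmin()]
    assert abs(brute - firm_threshold(y, gamma)) < 1e-3
print("firm threshold matches brute-force prox")
```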
The regularizers of this section, and the convex ‖·‖_1 regularizer, have been used mostly in the simple additive setting for a smooth data fidelity term h and a nonsmooth regularizer g, for example in least-squares or logistic regression [15] and compressed sensing (cf. [17]).

Weakly Convex Composite Losses. The use of weakly convex functions composed with linear operators has been explored in the robust statistics literature. An early instance is the Tukey biweight function [20], in which g(Ax) has the form

g(Ax) = Σ_{i=1}^n φ(A_{i·} x), with φ(θ) = 1 − (1 − θ²)³ for |θ| ≤ 1 and φ(θ) = 1 otherwise, (3)

where A_{i·} denotes the ith row of A. This function behaves like the usual least-squares loss when θ² ≪ 1 but asymptotes at 1. It is ρ-weakly convex with ρ = 6. A slightly different definition of the Tukey biweight function appears in [21, Section 2.1]. This same reference also mentions another nonconvex loss function, the Cauchy loss, which has the form (3) except that φ is defined by

φ(θ) = (ξ²/2) log(1 + θ²/ξ²)

for some parameter ξ. This function is also weakly convex.
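As a quick numerical check (assuming the standard form of the Tukey biweight used above), the second derivative of φ stays above −6, consistent with the claimed 6-weak convexity:

```python
import numpy as np

# phi(t) = 1 - (1 - t^2)^3 on |t| <= 1, and 1 otherwise. Its second
# derivative on |t| <= 1 is 6(1 - t^2)(1 - 5 t^2), which is bounded
# below by -4.8 (attained at t^2 = 3/5), hence above -6.
t = np.linspace(-1.5, 1.5, 300001)
phi = np.where(np.abs(t) <= 1.0, 1.0 - (1.0 - t ** 2) ** 3, 1.0)
d2 = np.gradient(np.gradient(phi, t), t)
print(d2.min())   # approximately -4.8, up to discretization error
```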

Complexity Bounds for Weakly Convex Problems
To put our results in perspective, we provide a review of the literature on complexity bounds for optimization problems related to our formulation (1), including weakly convex functions. In all cases, these are bounds on the number of iterations required to find an approximately stationary point, where our measure of stationarity is based on the norm of the gradient of the Moreau envelope (a smooth proxy). The best-known complexity for black-box subgradient optimization of weakly convex functions is O(ε^{-4}). This result is proved for the stochastic subgradient method in [22], but as in the convex case, there is no known improvement in the deterministic setting. As in convex optimization, subgradient methods are quite general and implementable for weakly convex functions. However, when more structure is present in the function, algorithms that achieve better complexity can be devised. In particular, when the proximal operator of the nonsmooth weakly convex function can be calculated analytically, complexity bounds of O(ε^{-2}) can be proved (see Sect. 5), the same bounds as for steepest descent methods in the smooth nonconvex case. This means that the entire difficulty introduced by the nonsmoothness can be mitigated as long as the nonsmoothness can be treated by a proximal operator.
For convex optimization problems, bounds of O(ε^{-1}) can be obtained for gradient methods on smooth functions and O(ε^{-1/2}) for accelerated gradient methods. These same bounds can also be obtained for nonsmooth problems provided that the nonsmooth term is handled by a proximal operator. When an explicit proximal operator is not available and subgradient methods have to be used, the complexity reverts to O(ε^{-2}).
It is possible to retain the O(ε^{-2}) rate when the weakly convex part is accessed only through a local convex model. The paper [23] studies optimization problems of the type

min_x h(x) + g(c(x)),

where h is convex, proper and closed; g is convex and Lipschitz continuous; and c is smooth. (Under these assumptions, the composition g ∘ c is weakly convex.) The O(ε^{-2}) bound is proved for an algorithm in which the (convex) subproblem

min_y h(y) + g(c(x) + ∇c(x)(y − x)) + (1/(2t))‖y − x‖² (4)

is solved explicitly. In the more realistic case in which (4) must be solved by an iterative procedure, a bound of Õ(ε^{-3}) is obtained in [23]. (The symbol Õ hides logarithmic terms.) Functions of the form g(c(x)) have also been studied in [24] for the case of a smooth nonlinear vector function c and a prox-regular g. This formulation is more general than those considered in this paper, both in that all weakly convex functions are prox-regular and in the nonlinearity of the inner map c. The subproblems in [24] have a form similar to (4), and while convergence results are proved in the latter paper, it does not contain rate-of-convergence or complexity results.
A different weakly convex structure is explored in [25], in which the weak convexity stems from a smooth saddle point problem. This paper studies the problem min_x max_{y∈Y} l(x, y) for a smooth function l; a complexity bound comparable to ours is proved for a method that uses only gradient evaluations.
In light of the considerations above, the complexity bound of O(ε^{-3}) for our algorithm seems almost inevitable. It interpolates between the setting without structural assumptions about the nonsmoothness (black-box subgradient) and perfect structural knowledge of the nonsmoothness (explicit knowledge of the proximal operator).
In Sect. 5, we treat the simpler setting in which the linear operator from (1) is the identity, so that F(x) = h(x) + g(x). Similar problems have been analyzed before, for example, in [15,17]. However, it is assumed in [17] that convexity in the data fidelity term h compensates for nonconvexity in the regularizer g such that the overall objective function F remains convex. (We make no such assumption here.) The paper [15] does not make such restrictive assumptions and proves convergence but not complexity bounds.

Preliminaries
The concept of subgradient of a convex function can be adapted to weakly convex functions via the following definition.

Definition 3.1 (Fréchet subdifferential) Let g : R^n → R be a function and ȳ a point such that g(ȳ) is finite. Then, the Fréchet subdifferential of g at ȳ, denoted by ∂g(ȳ), is the set of all vectors v ∈ R^n such that

g(y) ≥ g(ȳ) + ⟨v, y − ȳ⟩ + o(‖y − ȳ‖) as y → ȳ. (5)

Modifying the convex case, in which subgradients are the slopes of linear functions that underestimate g but coincide with it at ȳ, Fréchet subgradients do so up to first order. This definition makes sense for arbitrary functions, but for lower semicontinuous ρ-weakly convex functions, more can be said. For example, for this class of functions we know that subgradients satisfy the following stronger version of (5): for all v ∈ ∂g(ȳ) and all y,

g(y) ≥ g(ȳ) + ⟨v, y − ȳ⟩ − (ρ/2)‖y − ȳ‖².

Further, if we assume the weakly convex function to be continuous at a point y, then its subdifferential is nonempty at y. Both of these claims can be verified directly by adding (ρ/2)‖·‖² to g and considering the convex subdifferential; see [26, Lemma 2.1]. Another nice property of weakly convex functions is that the definition of the Moreau envelope (see Definition 2.1) extends without modification to weakly convex functions, subject only to a restriction on the parameter μ. The proximal operator of Definition 2.1 also extends to this setting, and this operator and the Moreau envelope fulfil the same identity as in the convex setting.
Lemma 3.1 Let g : R^n → R be a proper, ρ-weakly convex and lower semicontinuous function, and let μ ∈ ]0, ρ^{-1}[. Then the Moreau envelope g_μ is continuously differentiable with

∇g_μ(y) = μ^{-1}(y − prox_{μg}(y)). (6)

This gradient is max{μ^{-1}, ρ/(1 − ρμ)}-Lipschitz continuous. In particular, a gradient step with respect to the Moreau envelope corresponds to a proximal step, that is,

y − μ∇g_μ(y) = prox_{μg}(y).

Lemma 3.1 not only clarifies the smoothness of the Moreau envelope, but also gives a way of computing its gradient via the prox operator. Obviously, a smooth representation whose gradient could not be computed would be of only limited usefulness from an algorithmic standpoint. The only difference between the weakly convex and convex settings is that the Moreau envelope need not be convex in the former case.

Stationarity
We say that a point x̄ is a stationary point for a function if the Fréchet subdifferential of the function contains 0 at x̄. The concept of near stationarity is not quite so straightforward. We motivate our approach by looking first at the simple additive composite problem, also discussed in Sect. 5, which corresponds to setting A = I in (1), that is,

min_x h(x) + g(x). (7)

Stationarity for (7) means that 0 ∈ ∂(h + g)(x), that is, −∇h(x) ∈ ∂g(x). A natural definition for ε-approximate stationarity would thus be

dist(−∇h(x), ∂g(x)) ≤ ε, (8)

where dist denotes the distance from a point to a set, given for a point x ∈ R^d and a set S ⊆ R^d by dist(x, S) := inf_{z∈S} ‖x − z‖. However, since we are running gradient descent on the smoothed problem, our algorithm will naturally compute and detect points that satisfy a threshold condition of the form

‖∇h(x) + ∇g_μ(x)‖ ≤ ε. (9)

The next lemma helps to clarify the relationship between these two conditions.

Lemma 3.2 Let g : R^n → R be a proper, ρ-weakly convex, and lower semicontinuous function, and let μ ∈ ]0, ρ^{-1}[. Then,

∇g_μ(x) ∈ ∂g(prox_{μg}(x)). (10)

Proof From Definition 2.1, the point prox_{μg}(x) minimizes the strongly convex function z ↦ g(z) + (1/(2μ))‖z − x‖², so that

μ^{-1}(x − prox_{μg}(x)) ∈ ∂g(prox_{μg}(x)),

from which the result follows when we use (6).
(This result is proved for the case of g convex in [23, Lemma 2.1], with essentially the same proof.) This lemma tells us that when (9) holds, then (8) is nearly satisfied, except that in the argument of ∂g, x is replaced by prox_{μg}(x). In general, however, prox_{μg}(x) might be arbitrarily far away from x. We can remedy this issue by requiring g to be Lipschitz continuous also.

Lemma 3.3 Let g : R^n → R be a ρ-weakly convex function that is L_g-Lipschitz continuous, and let μ ∈ ]0, ρ^{-1}[. Then, the Moreau envelope g_μ is Lipschitz continuous with

‖∇g_μ(x)‖ ≤ L_g (11)

and

‖x − prox_{μg}(x)‖ ≤ μ L_g. (12)

Proof Lipschitz continuity is equivalent to bounded subgradients [28], so by (10), we have ‖∇g_μ(x)‖ ≤ L_g for all x ∈ R^n, proving (11). The bound (12) follows immediately when we use the fact that x − prox_{μg}(x) = μ∇g_μ(x) from Lemma 3.1.
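Bounds (11) and (12) are easy to visualize in one dimension; the sketch below checks them for the convex (hence 0-weakly convex), 1-Lipschitz function g = |·|, whose prox is soft thresholding:

```python
import numpy as np

mu, L_g = 0.3, 1.0
xs = np.linspace(-2.0, 2.0, 1001)
prox = np.sign(xs) * np.maximum(np.abs(xs) - mu, 0.0)   # prox_{mu g}, g = |.|
grad_env = (xs - prox) / mu                             # Lemma 3.1: grad of g_mu

assert np.all(np.abs(grad_env) <= L_g + 1e-12)          # bound (11)
assert np.all(np.abs(xs - prox) <= mu * L_g + 1e-12)    # bound (12)
print("bounds (11) and (12) hold at all sample points")
```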

Stationarity for the Composite Problem
It follows immediately from (10) in Lemma 3.2 that for μ ∈ ]0, ρ^{-1}[, we have for all x that

A*∇g_μ(Ax) ∈ A*∂g(prox_{μg}(Ax)). (13)

Extending the results of the previous section to the case of a general linear operator A in (1) requires some work. Stationarity for (1) requires that 0 ∈ ∇h(x) + A*∂g(Ax), so ε-near stationarity requires

dist(−∇h(x), A*∂g(Ax)) ≤ ε. (14)

Our method can compute a point x such that

dist(−∇h(x), A*∂g(z)) ≤ ε, where z := prox_{μg}(Ax); (15)

provided that g is L_g-Lipschitz continuous, we have from (12) that

‖Ax − z‖ ≤ μ L_g. (16)

The bound in (15) measures criticality, while the bound in (16) concerns feasibility. The bounds (15), (16) are not a perfect match with (14), since the subdifferentials of h and g ∘ A are evaluated at different points.

Surjectivity of A. When A is surjective, we can perturb the x that satisfies (15), (16) to a nearby point x* that satisfies a bound of the form (14). Since A is surjective, z := prox_{μg}(Ax) is in the range of A, and we can define x* to be a solution of Ax* = z, which is given explicitly by

x* := x + A*(AA*)^{-1}(z − Ax). (17)

The operator norm of the pseudoinverse A*(AA*)^{-1} is bounded by the inverse of the smallest singular value σ_min(A) of A, so when g is L_g-Lipschitz continuous, we have from (16) that

‖x* − x‖ ≤ σ_min(A)^{-1}‖z − Ax‖ ≤ σ_min(A)^{-1} μ L_g. (18)

The point x* is approximately stationary in the sense of (14), for μ sufficiently small, because Ax* = z, so that

dist(−∇h(x*), A*∂g(Ax*)) ≤ dist(−∇h(x), A*∂g(z)) + ‖∇h(x*) − ∇h(x)‖ ≤ ε + L_∇h σ_min(A)^{-1} μ L_g, (19)

using (15), (18), and the Lipschitz continuity of ∇h. By choosing μ small, x* will thus be an approximate solution in the stronger sense (14), and not just in the weaker sense of (15), (16), which is all we can guarantee when A is not surjective.
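The perturbation step for surjective A can be sketched numerically (random data stands in for x and for prox_{μg}(Ax)):

```python
import numpy as np

# For surjective A, x* = x + A^T (A A^T)^{-1} (z - A x) satisfies A x* = z,
# and ||x* - x|| <= ||z - A x|| / sigma_min(A). Data here is illustrative.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))          # full row rank with probability 1
x = rng.standard_normal(5)
z = rng.standard_normal(3)               # stands in for prox_{mu g}(A x)

x_star = x + A.T @ np.linalg.solve(A @ A.T, z - A @ x)
sigma_min = np.linalg.svd(A, compute_uv=False).min()

print(np.linalg.norm(A @ x_star - z))    # ~ 0: x* is exactly feasible
```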

Variable Smoothing
We describe our variable smoothing approaches for the problem (1), where we assume that h is L_∇h-smooth; g is possibly nonsmooth, ρ-weakly convex, and L_g-Lipschitz continuous; and A is a nonzero linear continuous operator. For convenience, we define the smoothed approximation F_k : R^d → R based on the Moreau envelope with parameter μ_k as follows:

F_k(x) := h(x) + g_{μ_k}(Ax).

We note from Lemma 3.1 and the chain rule that

∇F_k(x) = ∇h(x) + A*∇g_{μ_k}(Ax) = ∇h(x) + μ_k^{-1} A*(Ax − prox_{μ_k g}(Ax)). (20)

The quantity L_k defined by

L_k := L_∇h + ‖A‖² max{μ_k^{-1}, ρ/(1 − ρμ_k)} (21)

is a Lipschitz constant of ∇F_k; see Lemma 3.1. When ρμ_k ≤ 1/2, the maximum in (21) is achieved by μ_k^{-1}, so in this case we can define

L_k := L_∇h + μ_k^{-1}‖A‖². (22)

An Elementary Approach
Our first algorithm takes gradient descent steps on the smoothed problem, that is,

x_{k+1} := x_k − γ_k ∇F_k(x_k), (23)

for certain values of the parameter μ_k and step size γ_k. From (20), the formula (23) is equivalent to

x_{k+1} := x_k − γ_k (∇h(x_k) + μ_k^{-1} A*(Ax_k − prox_{μ_k g}(Ax_k))).

Our basic algorithm is described next.

Algorithm 1 Variable Smoothing
Require: x_1 ∈ R^d;
for k = 1, 2, ... do
  μ_k := (2ρ)^{-1} k^{-1/3} and γ_k := 1/L_k, with L_k from (22);
  x_{k+1} := x_k − γ_k ∇F_k(x_k);
end for

We now state the convergence result for Algorithm 1. This result and later results make use of the quantity

F* := inf_{x ∈ R^d} F(x), (24)

which is finite if F is bounded below (and possibly in other circumstances too).
(When F* = −∞, the claim of the theorem is vacuously true.) We also make use of the following quantity, defined when A is surjective, as in (17):

x*_k := x_k + A*(AA*)^{-1}(prox_{μ_k g}(Ax_k) − Ax_k). (25)

Theorem 4.1 Suppose that Algorithm 1 is applied to the problem (1), where g is ρ-weakly convex and ∇h and g are Lipschitz continuous with constants L_∇h and L_g, respectively. We have

min_{1≤j≤K} dist(−∇h(x_j), A*∂g(prox_{μ_j g}(Ax_j))) ≤ C K^{-1/3},

where C := 2((L_∇h + 2ρ‖A‖²)(F_1(x_1) − F* + μ_1 L_g²))^{1/2} and F* is defined as in (24). If A is also surjective, then for x*_k defined as in (25), the iterate x*_j corresponding to the index j attaining the minimum above satisfies dist(−∇h(x*_j), A*∂g(Ax*_j)) ≤ C K^{-1/3} + L_∇h σ_min(A)^{-1} μ_j L_g.

Before proving this theorem, we state and prove a lemma that relates the function values of two Moreau envelopes with two different smoothing parameters. In the convex case, such statements are well known, but in the nonconvex case this result is novel.

Lemma 4.1 Let g : R^n → R be a proper, closed, and ρ-weakly convex function, and let μ_2 and μ_1 be parameters such that 0 < μ_2 ≤ μ_1 < ρ^{-1}. Then, we have

g_{μ_2}(x) ≤ g_{μ_1}(x) + (μ_1 − μ_2)(μ_1/(2μ_2))‖∇g_{μ_1}(x)‖².

If, in addition, g is L_g-Lipschitz continuous, we have

g_{μ_2}(x) ≤ g_{μ_1}(x) + (μ_1 − μ_2)(μ_1/(2μ_2)) L_g².

Proof By using the definition of the Moreau envelope, together with Lemma 3.1, we obtain

g_{μ_2}(x) ≤ g(prox_{μ_1 g}(x)) + (1/(2μ_2))‖prox_{μ_1 g}(x) − x‖²
        = g_{μ_1}(x) + (1/(2μ_2) − 1/(2μ_1))‖prox_{μ_1 g}(x) − x‖²
        = g_{μ_1}(x) + (μ_1 − μ_2)(μ_1/(2μ_2))‖∇g_{μ_1}(x)‖²,

proving the first claim. The second claim follows immediately from (11).

Proof of Theorem 4.1
Since L_k = 1/γ_k is a Lipschitz constant of ∇F_k, we have for any k = 1, 2, ... that

F_k(x_{k+1}) ≤ F_k(x_k) + ⟨∇F_k(x_k), x_{k+1} − x_k⟩ + (L_k/2)‖x_{k+1} − x_k‖².

By substituting the definition of x_{k+1} from (23), we have

F_k(x_{k+1}) ≤ F_k(x_k) − (γ_k/2)‖∇F_k(x_k)‖². (26)

From Lemma 4.1, we have for all x that

F_{k+1}(x) ≤ F_k(x) + (μ_k − μ_{k+1})(μ_k/(2μ_{k+1})) L_g² ≤ F_k(x) + (μ_k − μ_{k+1}) L_g²,

where we used in the second inequality that μ_k/μ_{k+1} ≤ 2. We set x = x_{k+1} and substitute into (26) to obtain

F_{k+1}(x_{k+1}) ≤ F_k(x_k) − (γ_k/2)‖∇F_k(x_k)‖² + (μ_k − μ_{k+1}) L_g². (27)

By summing both sides of this expression over k = 1, 2, ..., K and telescoping, we deduce that

Σ_{k=1}^K (γ_k/2)‖∇F_k(x_k)‖² ≤ F_1(x_1) − F_{K+1}(x_{K+1}) + (μ_1 − μ_{K+1}) L_g².

Since F_{K+1}(x_{K+1}) ≥ F* − (μ_{K+1}/2) L_g² by the Lipschitz continuity of g, we have from (27) that

Σ_{k=1}^K (γ_k/2)‖∇F_k(x_k)‖² ≤ F_1(x_1) − F* + μ_1 L_g². (28)

Now we observe that

Σ_{k=1}^K γ_k = Σ_{k=1}^K (L_∇h + 2ρ k^{1/3}‖A‖²)^{-1} ≥ (L_∇h + 2ρ‖A‖²)^{-1} Σ_{k=1}^K k^{-1/3} ≥ (3/4)(L_∇h + 2ρ‖A‖²)^{-1} K^{2/3},

where the final inequality can be checked numerically. Therefore, by substituting into (28), we have

min_{1≤j≤K} ‖∇F_j(x_j)‖² ≤ (8/3)(L_∇h + 2ρ‖A‖²)(F_1(x_1) − F* + μ_1 L_g²) K^{-2/3} ≤ C² K^{-2/3},

and so

min_{1≤j≤K} ‖∇F_j(x_j)‖ ≤ C K^{-1/3}, (29)

where C := 2((L_∇h + 2ρ‖A‖²)(F_1(x_1) − F* + μ_1 L_g²))^{1/2}, as in the statement of the theorem. By combining this bound with (15), and defining z_j := prox_{μ_j g}(Ax_j), we obtain

min_{1≤j≤K} dist(−∇h(x_j), A*∂g(z_j)) ≤ C K^{-1/3}, (30)

where we deduce from (12) that

‖Ax_j − z_j‖ ≤ μ_j L_g = (2ρ)^{-1} L_g j^{-1/3}. (31)

The second statement concerning surjectivity of A follows from the considerations made in (17) to (19).
There is a mismatch between the two bounds in this theorem. The first bound (the criticality bound) indicates that during the first K = O(ε^{-3}) iterations, we will encounter an iteration j at which the first-order optimality condition is satisfied to within a tolerance of ε. However, this bound could have been satisfied at an early iteration (that is, j ≪ ε^{-3}), for which value the second (feasibility) bound, on ‖Ax_j − prox_{μ_j g}(Ax_j)‖, may not be particularly small. The next section describes an algorithm that remedies this defect.
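To illustrate Algorithm 1 in action, here is a compact sketch on a toy instance (all problem data, parameters, and the choice of an MCP regularizer are illustrative, not the paper's experiments):

```python
import numpy as np

# Algorithm 1 on min_x 0.5*||B x - b||^2 + sum_i MCP((A x)_i), with the
# envelope gradient evaluated via the MCP prox (firm thresholding).
rng = np.random.default_rng(1)
B = rng.standard_normal((8, 4))
b = rng.standard_normal(8)
A = rng.standard_normal((6, 4))
LAM, THETA = 0.5, 2.0
rho = 1.0 / THETA                          # MCP is (1/THETA)-weakly convex
L_h = np.linalg.norm(B.T @ B, 2)           # Lipschitz constant of grad h
A2 = np.linalg.norm(A, 2) ** 2

def firm(y, gamma):                        # prox of gamma*MCP; needs THETA > gamma
    ay = np.abs(y)
    return np.where(ay <= gamma * LAM, 0.0,
           np.where(ay <= THETA * LAM,
                    np.sign(y) * THETA * (ay - gamma * LAM) / (THETA - gamma), y))

x = np.zeros(4)
crit = []
for k in range(1, 2001):
    mu = k ** (-1.0 / 3.0) / (2.0 * rho)   # mu_k = (2 rho)^{-1} k^{-1/3} < THETA
    gamma_k = 1.0 / (L_h + A2 / mu)        # gamma_k = 1 / L_k, with L_k from (22)
    grad = B.T @ (B @ x - b) + A.T @ (A @ x - firm(A @ x, mu)) / mu
    x = x - gamma_k * grad
    crit.append(np.linalg.norm(grad))      # criticality measure ||grad F_k(x_k)||

print(crit[0], min(crit))                  # the minimum decreases like k^{-1/3}
```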

An Epoch-Wise Approach with Improved Convergence Guarantees
We describe a variant of Algorithm 1 in which the steps are organized into a series of epochs, each of which is twice as long as the one before. We show that there is some iteration j = O(ε^{-3}) such that both ‖Ax_j − prox_{μ_j g}(Ax_j)‖ and dist(−∇h(x_j), A*∂g(z_j)) are smaller than the given tolerance ε, where z_j = prox_{μ_j g}(Ax_j).

Algorithm 2 Variable Smoothing with Epochs
Require: x_1 ∈ R^d and tolerance ε > 0;
for l = 0, 1, 2, ... do
  for k = 2^l, 2^l + 1, ..., 2^{l+1} − 1 do
    μ_k := (2ρ)^{-1} k^{-1/3} and γ_k := 1/L_k, with L_k from (22);
    x_{k+1} := x_k − γ_k ∇F_k(x_k);
    S_l := min_{2^l ≤ j ≤ k} ‖∇F_j(x_j)‖;
    if S_l ≤ ε then
      if ‖Ax_{k+1} − prox_{μ_{k+1} g}(Ax_{k+1})‖ ≤ ε then STOP; end if
    end if
  end for
end for

Theorem 4.2 Suppose that Algorithm 2 is applied to problem (1), under the assumptions of Theorem 4.1, with C defined as in Theorem 4.1. Then the algorithm terminates after at most 2 max{C³, (2ρ)^{-3} L_g³} ε^{-3} iterations.

Proof As in (27), by using monotonicity of {F_k(x_k)} (up to the smoothing terms) and discarding nonnegative terms, we have for each epoch l that

Σ_{k=2^l}^{2^{l+1}−1} (γ_k/2)‖∇F_k(x_k)‖² ≤ F_1(x_1) − F* + μ_1 L_g².

With the same arguments as in the earlier proof, since each of the 2^l terms in the epoch satisfies γ_k ≥ ((L_∇h + 2ρ‖A‖²)(2^{l+1})^{1/3})^{-1}, we obtain

Σ_{k=2^l}^{2^{l+1}−1} γ_k ≥ (1/2)(L_∇h + 2ρ‖A‖²)^{-1} (2^{l+1})^{2/3}.

Therefore, we have

min_{2^l ≤ j ≤ 2^{l+1}−1} ‖∇F_j(x_j)‖ ≤ C (2^{l+1})^{-1/3}. (32)

Noting that z_j := prox_{μ_j g}(Ax_j), we have as in (30) that

min_{2^l ≤ j ≤ 2^{l+1}−1} dist(−∇h(x_j), A*∂g(z_j)) ≤ C (2^{l+1})^{-1/3},

as previously. Further, we have for 2^l ≤ j ≤ 2^{l+1} − 1 that

‖Ax_j − z_j‖ ≤ μ_j L_g ≤ (2ρ)^{-1} L_g (2^l)^{-1/3}. (33)

From (32) and (33), we deduce that Algorithm 2 must terminate before the end of epoch l, that is, before 2^{l+1} iterations have been completed, where l is the first nonnegative integer such that both C(2^{l+1})^{-1/3} ≤ ε and (2ρ)^{-1} L_g (2^l)^{-1/3} ≤ ε. Thus, termination occurs after at most 2 max{C³, (2ρ)^{-3} L_g³} ε^{-3} iterations.
For the case of A surjective, we have the following stronger result.

Theorem 4.3 Suppose that the assumptions of Theorem 4.1 hold, that A is surjective, and let x*_k be defined as in (25). Then, for some j = O(ε^{-3}), the point x*_j satisfies the stronger near-stationarity condition (14), with a tolerance proportional to ε.

Proof With the considerations made in the previous proof, as well as the one made in (17) to (19), we can choose l to be the smallest nonnegative integer such that the epoch-l bounds (32), (33), together with the perturbation term L_∇h σ_min(A)^{-1} μ_j L_g from (19), fall below the required tolerance. The claim then holds for some j ≤ 2^{l+1}.
Although Algorithm 2 seems more complicated than Algorithm 1, the steps are the same. The only difference is that for the second algorithm, we do not search for the iterate that minimizes criticality across all iterations but only across at most the last k/2 iterations, where k is the total number of iterations.

Remark 4.1
For both versions of our proposed method, we use an explicit choice of smoothing parameters, choosing μ_k proportional to k^{-1/3}. This specific dependence on k achieves a balance between criticality and feasibility: as can be seen from (29) (criticality measure) and (31) (feasibility measure), both measures decrease like k^{-1/3}. A slower decrease in μ_k would result in a faster decrease in the criticality measure but a slower decrease in the feasibility measure, and vice versa.

Remark 4.2
Our technique does not adapt in an obvious way to the case in which g is actually convex. Typically, we know in advance whether or not h and g in (1) are convex; if they are, we could choose one of the well-established methods that make use of gradients, proximal operators, and possibly acceleration. See, for example, the proximal accelerated gradient approach of [3], which achieves a rate of O(k^{-1}). A method in the spirit of [29], which automatically adapts to convexity and simultaneously achieves the optimal rates for both nonconvex and convex problems, would be desirable, but is outside the scope of this work.

Proximal Gradient
Here we derive a complexity bound for the proximal gradient algorithm applied to the more elementary problem (7) studied in Sect. 3.1, that is,

min_x F(x) := h(x) + g(x), (34)

for h : R^d → R an L_∇h-smooth function and g : R^d → R a possibly nonsmooth, ρ-weakly convex function. Such a bound has not been made explicit before, to the authors' knowledge, though it is a fairly straightforward consequence of existing results. The bound makes an interesting comparison with the result in Sect. 4, where the nonsmoothness issue becomes more complicated due to the composition with a linear operator. In this section, we assume that a closed-form proximal operator is available for g, and we show that the complexity bound of O(ε^{-2}) is of the same order as for gradient descent applied to smooth nonconvex functions. Standard proximal gradient applied to problem (34), for a given step size λ ∈ ]0, min{ρ^{-1}/2, L_∇h^{-1}}] and initial point x_1, is as follows:

x_{k+1} := arg min_x { h(x_k) + ⟨∇h(x_k), x − x_k⟩ + (1/(2λ))‖x − x_k‖² + g(x) } = prox_{λg}(x_k − λ∇h(x_k)), (35)

where the choice of λ ensures that the function to be minimized in (35) is (λ^{-1} − ρ)-strongly convex, so that x_{k+1} is uniquely defined. We have the following convergence result.
Theorem 5.1 Suppose that the proximal gradient method (35) is applied to problem (34), where h is L_∇h-smooth and g is ρ-weakly convex. Then

min_{1≤j≤K} dist(0, ∂F(x_{j+1})) ≤ (λ^{-1} + L_∇h)(4λ(F(x_1) − F*))^{1/2} K^{-1/2},

where F* is defined in (24).
Proof Note first that the result is vacuous if F* = −∞, so we assume henceforth that F* is finite. From the optimality conditions for the strongly convex subproblem in (35), we have λ^{-1}(x_k − x_{k+1}) − ∇h(x_k) ∈ ∂g(x_{k+1}), and therefore

λ^{-1}(x_k − x_{k+1}) − ∇h(x_k) + ∇h(x_{k+1}) ∈ ∂F(x_{k+1}),

so that dist(0, ∂F(x_{k+1})) ≤ (λ^{-1} + L_∇h)‖x_k − x_{k+1}‖. On the other hand, by comparing the values of the subproblem objective at x_k and x_{k+1}, and using its (λ^{-1} − ρ)-strong convexity together with λ ≤ L_∇h^{-1} and λ ≤ ρ^{-1}/2, we obtain

F(x_{k+1}) ≤ F(x_k) − (1/(4λ))‖x_{k+1} − x_k‖².

Summing over k = 1, ..., K and telescoping yields Σ_{k=1}^K ‖x_{k+1} − x_k‖² ≤ 4λ(F(x_1) − F*), so that

min_{1≤j≤K} dist(0, ∂F(x_{j+1})) ≤ (λ^{-1} + L_∇h)(4λ(F(x_1) − F*))^{1/2} K^{-1/2},

completing the proof.

This theorem indicates that the proximal gradient algorithm requires at most O(ε^{-2}) iterations to find an ε-approximate stationary point. This bound contrasts with the bound O(ε^{-3}) of Sect. 4 for the case of general A. Moreover, the O(ε^{-2}) bound has the same order as the bound for gradient descent applied to general smooth nonconvex optimization.
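A sketch of iteration (35) with an MCP regularizer follows (the data and parameters are illustrative; monotone decrease of F is guaranteed by the step-size condition λ ≤ 1/L_∇h):

```python
import numpy as np

# Proximal gradient for min_x 0.5*||B x - b||^2 + sum_i MCP(x_i), with
# step lambda <= min{(2 rho)^{-1}, 1/L_h}; the prox of lambda*MCP is the
# firm-thresholding operator (valid since lambda < THETA).
rng = np.random.default_rng(2)
B = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
LAM, THETA = 0.4, 2.5
rho = 1.0 / THETA
L_h = np.linalg.norm(B.T @ B, 2)
step = min(0.5 / rho, 1.0 / L_h)

def firm(y, gamma):                        # prox of gamma*MCP
    ay = np.abs(y)
    return np.where(ay <= gamma * LAM, 0.0,
           np.where(ay <= THETA * LAM,
                    np.sign(y) * THETA * (ay - gamma * LAM) / (THETA - gamma), y))

def F(x):                                  # objective value
    ax = np.abs(x)
    pen = np.where(ax <= LAM * THETA, LAM * ax - x ** 2 / (2 * THETA),
                   THETA * LAM ** 2 / 2)
    return 0.5 * np.sum((B @ x - b) ** 2) + pen.sum()

x = np.zeros(5)
vals = [F(x)]
for _ in range(300):
    x = firm(x - step * (B.T @ (B @ x - b)), step)   # iteration (35)
    vals.append(F(x))

assert all(v2 <= v1 + 1e-10 for v1, v2 in zip(vals, vals[1:]))  # monotone decrease
print(vals[0], vals[-1])
```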

Conclusions
We consider a standard problem formulation in which a nonsmooth regularizer is composed with a linear transformation of the input variables and added to a smooth function. In most works, the regularizer is assumed to be convex, but we extend here to the case in which it is only weakly convex. This extension allows for regularizers that induce desired properties, such as sparsity, while introducing less bias than their convex counterparts.
Two examples from the statistics literature are the minimax concave penalty (MCP) and the smoothly clipped absolute deviation (SCAD). We propose a novel method based on the variable smoothing framework and show a complexity of O(ε^{-3}) to obtain an ε-approximate solution. This iteration complexity falls strictly between the iteration complexity of smooth (nonconvex) problems (O(ε^{-2})) and that of the black-box subgradient method for weakly convex functions (O(ε^{-4})), which assumes no knowledge of the structure of the nonsmoothness.
A performance comparison between our smoothed approach and the black-box subgradient algorithm on an image denoising problem that uses an MCP total variation regularizer is shown in Fig. 1.

Fig. 1 Progress of our smoothing algorithm and a naive subgradient algorithm on an image denoising problem, where the minimax concave penalty is used instead of the ℓ_1-norm in the anisotropic total variation, showing better performance by the smoothing approach. Top: the difference of consecutive iterates scaled by the inverse of the stepsize, representing the norm of the (sub)gradient used at each iteration. Middle: relative difference between the objective function at the current iterate and an approximate minimum. Bottom: the quality of the resulting reconstruction measured via the structural similarity index measure; see [30].

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License; see http://creativecommons.org/licenses/by/4.0/.