Solving structured nonsmooth convex optimization with complexity \(\mathcal {O}(\varepsilon ^{-1/2})\)
Abstract
This paper describes an algorithm for solving structured nonsmooth convex optimization problems using the optimal subgradient algorithm (OSGA), which is a first-order method with the complexity \(\mathcal {O}(\varepsilon ^{-2})\) for Lipschitz continuous nonsmooth problems and \(\mathcal {O}(\varepsilon ^{-1/2})\) for smooth problems with Lipschitz continuous gradient. If the nonsmoothness of the problem is manifested in a structured way, we reformulate the problem so that it can be solved efficiently by a new setup of OSGA (called OSGAO) with the complexity \(\mathcal {O}(\varepsilon ^{-1/2})\). Further, to solve the reformulated problem, we equip OSGAO with an appropriate prox-function for which the OSGAO subproblem can be solved either in a closed form or by a simple iterative scheme, which decreases the computational cost of applying the algorithm to large-scale problems. We show that applying the new scheme is feasible for many problems arising in applications. Some numerical results are reported confirming the theoretical foundations.
Keywords
Structured nonsmooth convex optimization · Subgradient methods · Proximity operator · Optimal complexity · First-order black-box information
Mathematics Subject Classification
90C25 · 90C60 · 49M37 · 65K05 · 68Q25
1 Introduction
Subgradient methods are a class of first-order methods developed to solve convex nonsmooth optimization problems, dating back to the 1960s; see, e.g., (Polyak 1987; Shor 1985). In general, they only need function values and subgradients, and they not only inherit the basic features of general first-order methods, such as low memory requirements and simple structure, but are also able to deal with every convex optimization problem. They are suitable for solving convex problems with a large number of variables, say several millions. Although these features make them very attractive for applications involving high-dimensional data, they usually suffer from slow convergence, which ultimately limits the attainable accuracy. In 1983, Nemirovsky and Yudin (1983) derived the worst-case complexity bound of first-order methods for several classes of problems to achieve an \(\varepsilon \)-solution, which is \(\mathcal {O}(\varepsilon ^{-2})\) for Lipschitz continuous nonsmooth problems and \(\mathcal {O}(\varepsilon ^{-1/2})\) for smooth problems with Lipschitz continuous gradients. The slow convergence of subgradient methods means that they often need a number of iterations close to this worst-case bound to reach an \(\varepsilon \)-solution.
In Nemirovsky and Yudin (1983), it was proved that the subgradient, subgradient projection, and mirror descent methods attain the optimal complexity of first-order methods for solving Lipschitz continuous nonsmooth problems; here, the mirror descent method is a generalization of the subgradient projection method, cf. (Beck and Teboulle 2003; Beck et al. 2010). Nesterov (2011), Nesterov (2006) proposed some primal-dual subgradient schemes, which attain the complexity \(\mathcal {O}(\varepsilon ^{-2})\) for Lipschitz continuous nonsmooth problems. Juditsky and Nesterov (2014) proposed a primal-dual subgradient scheme for uniformly convex functions with an unknown convexity parameter, which attains a complexity close to the optimal bound. Nesterov (1983), and later in Nesterov (2004), proposed some gradient methods for solving smooth problems with Lipschitz continuous gradients attaining the complexity \(\mathcal {O}(\varepsilon ^{-1/2})\). He also proposed, in Nesterov (2005a, b), some smoothing methods for structured nonsmooth problems. Smoothing methods have also been studied by many authors; see, e.g., Beck and Teboulle (2012), Boţ and Hendrich (2013), Boţ and Hendrich (2015), and Devolder et al. (2012).
In many fields of applied science and engineering, such as signal and image processing, geophysics, economics, machine learning, and statistics, there exist many applications that can be modeled as a convex optimization problem in which the objective function is a composite of a smooth function with Lipschitz continuous gradients and a nonsmooth function; see Ahookhosh (2016) and references therein. Studying this class of problems using first-order methods has dominated the convex optimization literature in recent years. Nesterov (2013, 2015) proposed some gradient methods for solving composite problems attaining the complexity \(\mathcal {O}(\varepsilon ^{-1/2})\). For this class of problems, other first-order methods with the complexity \(\mathcal {O}(\varepsilon ^{-1/2})\) have been developed by Auslender and Teboulle (2006), Beck and Teboulle (2012), Chen et al. (2014, 2017, 2015, 2014), Devolder et al. (2013), Gonzaga and Karas (2013), Gonzaga et al. (2013), Lan (2015), Lan et al. (2011), and Tseng (2008). In particular, Neumaier (2016) proposed an optimal subgradient algorithm (OSGA) attaining the complexity \(\mathcal {O}(\varepsilon ^{-2})\) for Lipschitz continuous nonsmooth problems and the complexity \(\mathcal {O}(\varepsilon ^{-1/2})\) for smooth problems with Lipschitz continuous gradients. OSGA is a black-box method and does not require global information about the objective function such as Lipschitz constants.
1.1 Content
This paper focuses on a class of structured nonsmooth convex constrained optimization problems that generalizes composite problems and is frequently found in applications. OSGA behaves well for composite problems in applications; see Ahookhosh (2016) and Ahookhosh and Neumaier (2017), Ahookhosh and Neumaier (2017); however, it does not attain the complexity \(\mathcal {O}(\varepsilon ^{-1/2})\) for this class of problems. Hence, we first reformulate the problem considered so that only the smooth part remains in the objective, at the cost of adding a functional constraint to the feasible domain. Afterward, we propose a suitable prox-function, provide a new setup of OSGA (called OSGAO) for the reformulated problem, and show that solving the OSGAO auxiliary problem for the reformulated problem is equivalent to solving a proximal-like problem. It is shown that this proximal-like subproblem can be solved efficiently for many problems appearing in applications, either in a closed form or by a simple iterative scheme. Thanks to this reformulation, the problem can be solved by OSGAO with the complexity \(\mathcal {O}(\varepsilon ^{-1/2})\). Finally, some numerical results are reported suggesting good behavior of OSGAO.
The underlying function of the OSGA subproblem is quasiconcave, and finding its solution is the most costly part of the algorithm; hence, solving this subproblem efficiently is crucial but not trivial. For unconstrained problems, we found a closed-form solution for the subproblem and studied the numerical behavior of OSGA in Ahookhosh (2016) and (Ahookhosh and Neumaier 2013, 2016). In Ahookhosh and Neumaier (2017), we gave a one-projection version of OSGA and provided a framework for solving the subproblem over simple convex domains or with simple functional constraints. In particular, we describe a scheme to compute the global solution of the OSGA subproblem for bound constraints in Ahookhosh and Neumaier (2017). Let us emphasize that the subproblem of OSGAO is constrained by a simple convex set and simple functional constraints, which differs from the one used in Ahookhosh (2016), Ahookhosh and Neumaier (2013), Ahookhosh and Neumaier (2016), Ahookhosh and Neumaier (2017), Ahookhosh and Neumaier (2017), and leads to solving a proximal-like problem.
This paper comprises six sections, including this introductory section. In the next section, we briefly review the main idea of OSGA. In Sect. 3, we give a reformulation of the basic problem considered and show that solving the OSGAO subproblem is equivalent to solving a proximal-like problem. Section 4 points out how the proximal-like subproblem can be solved in many interesting cases. Some numerical results are reported in Sect. 5, and conclusions are given in Sect. 6.
1.2 Preliminaries and notation
The subdifferential of \(\phi (x) = \Vert Wx\Vert \) is given in the next result, for an arbitrary norm \(\Vert \cdot \Vert \) in \(\mathbb {R}^n\) and a matrix \(W\in \mathbb {R}^{m\times n}\). For a proof of this result, see Proposition 2.1.17 in Ahookhosh (2015).
Proposition 1
In the next example, we show how Proposition 1 is applied to \(\phi = \Vert \cdot \Vert _\infty \), which will be needed in Sect. 4. The subdifferential of other norms of \(\mathbb {R}^n\) can be computed with Proposition 1 in the same way.
Example 2
2 A review of optimal subgradient algorithm (OSGA)
If the best function value \(f_{x_b}\) is stored and updated, then each iteration of OSGA only requires the computation of two function values \(f_x\) and \(f_{x'}\) (Lines 6 and 11) and one subgradient \(g_x\) (Line 6).
3 Structured nonsmooth convex optimization
Problems of the form (12) appear in many applications in the fields of signal and image processing, machine learning, statistics, economics, geophysics, and inverse problems. Let us consider the following example.
Example 3
3.1 New setup of optimal subgradient algorithm (OSGAO)
This section describes the subproblem (10) for a problem of the form (13). To this end, we introduce a prox-function and employ it to derive an inexpensive solution of the subproblem. We generally assume that the domain C is simple enough that \(\eta \) and \((\widehat{u},\widetilde{u})\) can be computed cheaply, in \(\mathcal {O}(n\log n)\) operations, say.
Proposition 4
Proof
Using (25), (26), and (27), this follows similarly to Proposition 2.1 in Neumaier (2016). \(\square \)
Proposition 5
Proof
The subsequent result gives a systematic way of solving the OSGAO subproblem (26) for problems of the form (13).
Theorem 6
Proof
In Theorem 6, if \(C = \mathcal {V}\), the problem (32) reduces to the classical proximity operator \(\widehat{u} = \text {prox}_{\lambda \phi } (y)\) defined in (3). Hence, the problem (32) is called proximal-like. Therefore, the word “simple” in the definition of C means that the problem (32) can be solved efficiently either in a closed form or by an inexpensive iterative scheme. To illustrate Theorem 6, we give the following example.
Example 7
To implement Algorithm 3 (SUS), we need a reliable nonlinear solver to deal with the system of nonlinear equations (39) and a routine that solves the proximal-like problem (32) effectively. In Sect. 4, we investigate solving the proximal-like problem (32) for some practically important loss functions \(\phi \). Algorithm 2 requires two solutions of the subproblem (26) (u in Line 6 and \(u'\) in Line 10), which are provided by Line 3 of SUS (with similar notation for \(u'\)).
3.2 Convergence analysis and complexity
In this section, we establish the complexity bounds of OSGAO for Lipschitz continuous nonsmooth problems and smooth problems with Lipschitz continuous gradients. We also show that if \(\widetilde{f}\) is strictly convex, the sequence generated by OSGAO is convergent to \(\widehat{x}\).
We make the following assumptions:
 (H1)
The objective function \(\widetilde{f}\) is proper and convex;
 (H2)
The sublevel set \(N_{\widetilde{f}}(x_0,\widetilde{x}_0) := \{(x,\widetilde{x}) \in \widetilde{C} \mid \widetilde{f}(x,\widetilde{x}) \le \widetilde{f}(x_0,\widetilde{x}_0)\}\) is bounded, for the starting point \((x_0,\widetilde{x}_0)\in \mathcal {V}\times \mathbb {R}\).
Since the underlying problem (13) is a special case of the problem (7) considered by Neumaier (2016), the complexity results of OSGAO are the same as those of OSGA.
Theorem 8
 (i)
(Nonsmooth complexity bound) If the points generated by Algorithm 2 stay in a bounded region of the interior of \(\widetilde{C}\), or if \(\widetilde{f}\) is Lipschitz continuous in \(\widetilde{C}\), the total number of iterations needed to reach a point with \(\widetilde{f}(x,\widetilde{x})\le \widetilde{f}(\widehat{x}, x^*)+\varepsilon \) is at most \(\mathcal {O}((\varepsilon ^2+\mu \varepsilon )^{-1})\). Thus the asymptotic worst-case complexity is \(\mathcal {O}(\varepsilon ^{-2})\) when \(\mu =0\) and \(\mathcal {O}(\varepsilon ^{-1})\) when \(\mu >0\).
 (ii)
(Smooth complexity bound) If \(\widetilde{f}\) has Lipschitz continuous gradients with Lipschitz constant L, the total number of iterations needed by Algorithm 2 to reach a point with \(\widetilde{f}(x,\widetilde{x})\le \widetilde{f}(\widehat{x}, x^*)+\varepsilon \) is at most \(\mathcal {O}(\varepsilon ^{-1/2})\) if \(\mu =0\), and at most \(\displaystyle \mathcal {O}(\sqrt{L/\mu }\,|\log \varepsilon |)\) if \(\mu >0\).
Proof
Since all assumptions of Theorems 4.1 and 4.2, Propositions 5.2 and 5.3, and Theorem 5.1 of Neumaier (2016) are satisfied, the results remain valid. \(\square \)
Indeed, if a nonsmooth problem can be reformulated as (13) with a nonsmooth loss function \(\phi \), then OSGAO can solve the reformulated problem with the complexity \(\mathcal {O}(\varepsilon ^{-1/2})\) for an arbitrary accuracy parameter \(\varepsilon \). The next result shows that the sequence \(\{(x_k,\widetilde{x}_k)\}\) generated by OSGAO converges to \((\widehat{x}, x^*)\) if the objective \(\widetilde{f}\) is strictly convex and \((\widehat{x}, x^*) \in \text {int}~ \widetilde{C}\), where \(\text {int}~ \widetilde{C}\) denotes the interior of \(\widetilde{C}\).
Proposition 9
Suppose that \(\widetilde{f}\) is strictly convex. Then the sequence \(\{(x_k,\widetilde{x}_k)\}\) generated by OSGAO converges to \((\widehat{x}, x^*)\) if \((\widehat{x}, x^*) \in \text {int}~ \widetilde{C}\).
Proof
4 Solving proximallike subproblem
In this section, we show that the proximal-like problem (32) can be solved in a closed form for many special cases appearing in applications. To this end, we first consider unconstrained problems (\(C=\mathcal {V}\)) and then study some problems with simply constrained domains (\(C \ne \mathcal {V}\)). Although finding proximal points is a mature area in convex nonsmooth optimization (cf. (Combettes and Pesquet 2011; Parikh and Boyd 2013)), we here address the solution of several proximal-like problems of the form (32) appearing in applications that, to the best of our knowledge, have not been studied in the literature.
4.1 Unconstrained examples (\(C = \mathcal {V}\))
We here consider several interesting unconstrained proximal problems appearing in applications and explain how the associated OSGAO auxiliary problem (32) can be solved.
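Each of the closed-form formulas derived below can be sanity-checked numerically against the definition of the proximity operator. As a warm-up, the following Python/NumPy sketch (ours, not from the paper; names are illustrative) verifies in one dimension that soft thresholding solves \(\min _u \frac{1}{2}(u-y)^2 + \lambda |u|\), the scalar instance of (3) with \(\phi = |\cdot |\):

```python
import numpy as np

def prox_numeric_1d(phi, lam, y, lo=-10.0, hi=10.0, num=200001):
    """Brute-force 1-D proximity operator:
    argmin_u 0.5*(u - y)**2 + lam*phi(u), by grid search."""
    u = np.linspace(lo, hi, num)
    vals = 0.5 * (u - y) ** 2 + lam * phi(u)
    return u[np.argmin(vals)]

def soft_threshold(y, lam):
    """Closed-form prox of lam*|.| (shrinkage operator)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# The two should agree up to the grid resolution (1e-4 here).
for y in (-2.3, -0.4, 0.0, 1.7):
    assert abs(prox_numeric_1d(np.abs, 0.5, y) - soft_threshold(y, 0.5)) < 1e-3
```

The same grid-search check can be applied componentwise to any of the separable operators below.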
In recent years, interest in regularization with weighted norms has grown with the emergence of many applications; see, e.g., (Daubechies et al. 2010; Rauhut and Ward 2016). Let d be a vector in \(\mathbb {R}^n\) such that \(d_i \ne 0\) for \(i = 1, \ldots ,n\). Then, we define the weight matrix \(D:= \text {diag}(d)\), which is a diagonal matrix with \(D_{i,i} = d_i\) for \(i = 1, \ldots ,n\). It is clear that D is invertible. The next two results show how to compute a solution of the problem (32) for special cases of \(\phi \) arising frequently in applications.
Proposition 10
Proof
See Proposition 6.2.1 in Ahookhosh (2015). \(\square \)
Proposition 11
Proof
 (i) \(\Vert D^{-1}y\Vert _2 \le \lambda \); (ii) \(\Vert D^{-1}y\Vert _2 > \lambda \).
 Case (i).
Let \(\Vert D^{-1}y\Vert _2 \le \lambda \). Then, we show that \(u=0\) satisfies (43). If \(u=0\), Proposition 1 implies \(\partial \phi (0) = \{g \in \mathcal {V}^* \mid \Vert D^{-1}g\Vert _2 \le 1 \}\). Using this, we get that \(u=0\) satisfies (43) if \(y \in \{g \in \mathcal {V}^* \mid \Vert D^{-1}g\Vert _2 \le \lambda \}\), leading to \(\text {prox}_{\lambda \phi }(y) = 0\).
Case (ii). Let \(\Vert D^{-1}y\Vert _2 > \lambda \). Case (i) implies \(u \ne 0\). Proposition 1 implies \(\partial \phi (u) = \{D^T Du/\Vert Du\Vert _2\}\), and the optimality condition (5) yields$$\begin{aligned} u - y + \lambda ~ D^T\frac{Du}{\Vert Du\Vert _2} = 0. \end{aligned}$$By this and setting \(\tau = \Vert Du\Vert _2\), we get$$\begin{aligned} \left( 1 + \frac{\lambda d_i^2}{\tau } \right) u_i - y_i = 0, \end{aligned}$$leading to$$\begin{aligned} u_i = \frac{\tau y_i}{\tau + \lambda d_i^2}, \end{aligned}$$for \(i=1, \ldots , n\). Substituting this into \(\tau = \Vert Du\Vert _2\) implies$$\begin{aligned} \sum _{i=1}^n \frac{d_i^2 y_i^2}{(\tau + \lambda d_i^2)^2} = 1. \end{aligned}$$We define the function \(\psi : {]0,+\infty [} \rightarrow \mathbb {R}\) by$$\begin{aligned} \psi (\tau ):= \sum _{i=1}^n \frac{d_i^2 y_i^2}{(\tau + \lambda d_i^2)^2} - 1, \end{aligned}$$where it is clear that \(\psi \) is decreasing and$$\begin{aligned} \lim _{\tau \rightarrow 0} \psi (\tau ) = \frac{1}{\lambda ^2} \sum _{i=1}^n \frac{y_i^2}{d_i^2} - 1 = \frac{1}{\lambda ^2} \left( \Vert D^{-1}y\Vert _2^2 - \lambda ^2 \right) > 0, ~~~ \lim _{\tau \rightarrow +\infty } \psi (\tau ) = -1. \end{aligned}$$It can be deduced from \(\Vert D^{-1}y\Vert _2 > \lambda \) and the intermediate value theorem that there exists \(\widehat{\tau } \in {]0,+\infty [}\) such that \(\psi (\widehat{\tau })=0\), giving the result. \(\square \)
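The proof above reduces the weighted \(\ell _2\) prox to a one-dimensional root-finding problem for \(\psi \). A minimal Python/NumPy sketch (ours, not the paper's implementation; bisection is used in place of whatever scalar solver one might prefer):

```python
import numpy as np

def prox_weighted_l2(y, d, lam, tol=1e-12, max_iter=200):
    """Prox of phi(u) = ||D u||_2 with D = diag(d), d_i != 0 (sketch of Prop. 11):
    returns argmin_u 0.5*||u - y||_2^2 + lam*||D u||_2."""
    d2 = d * d
    if np.linalg.norm(y / d) <= lam:       # Case (i): u = 0 is optimal
        return np.zeros_like(y)
    # Case (ii): find tau > 0 with psi(tau) = 0, psi decreasing.
    psi = lambda tau: np.sum(d2 * y**2 / (tau + lam * d2) ** 2) - 1.0
    lo, hi = 0.0, 1.0
    while psi(hi) > 0:                     # bracket the root
        hi *= 2.0
    for _ in range(max_iter):              # bisection
        mid = 0.5 * (lo + hi)
        if psi(mid) > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    tau = 0.5 * (lo + hi)
    return tau * y / (tau + lam * d2)
```

With \(D = I\) this reduces to the usual \(\ell _2\)-norm shrinkage \(u = (1-\lambda /\Vert y\Vert _2)_+\, y\), which gives a quick consistency check.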
Grouped variables typically appear in highdimensional statistical learning problems. For example, in data mining applications, categorical features are encoded by a set of dummy variables forming a group. Another interesting example is learning sparse additive models in statistical inference, where each component function can be represented using basis expansions and thus can be treated as a group. For such problems (see (Liu et al. 2010) and references therein), it is more natural to select groups of variables instead of individual ones when a sparse model is preferred.
In the following two results, we show how the proximity operator \(\text {prox}_{\lambda \phi } (\cdot )\) can be computed for the mixednorms \(\phi (\cdot ) = \Vert \cdot \Vert _{1,2}\) and \(\phi (\cdot ) = \Vert \cdot \Vert _{1,\infty }\), which are especially important in the context of sparse optimization and sparse recovery with grouped variables.
Proposition 12
Proof
 Case (i).
Let \(\Vert y_{g_i}\Vert _2 \le \lambda \). Then, we show that \(u_{g_i}=0\) satisfies (45). If \(u_{g_i}=0\), Proposition 1 implies \(\partial \phi (0_{g_i}) = \{g \in \mathbb {R}^{n_i} \mid \Vert g\Vert _2 \le 1 \}\). By substituting this into (45), we get that \(u_{g_i}=0\) satisfies (45) if \(y_{g_i} \in \{g \in \mathbb {R}^{n_i} \mid \Vert g\Vert _2 \le \lambda \}\), which leads to \(\text {prox}_{\lambda \phi }(y_{g_i}) = 0_{g_i}\). Since the right-hand side of (44) is also zero, (44) holds.
Case (ii). Let \(\Vert y_{g_i}\Vert _2 > \lambda \). Then, Case (i) implies that \(u_{g_i} \ne 0\). From Proposition 1, we obtain$$\begin{aligned} \partial \phi (u_{g_i}) = \left\{ \frac{u_{g_i}}{\Vert u_{g_i}\Vert _2} \right\} , \end{aligned}$$(46)for \(i=1, \ldots , m\) with \(\Vert y_{g_i}\Vert _2 > \lambda \). Then (45) and (46) imply$$\begin{aligned} u_{g_i} - y_{g_i}+\lambda \frac{u_{g_i}}{\Vert u_{g_i}\Vert _2} = 0, \end{aligned}$$leading to$$\begin{aligned} \left( 1 + \frac{\lambda }{\Vert u_{g_i}\Vert _2} \right) u_{g_i} = y_{g_i}, \end{aligned}$$giving \(u_{g_i} = \mu _i y_{g_i}\) for some \(\mu _i > 0\). Substituting this into the previous identity and solving with respect to \(\mu _i\) yield$$\begin{aligned} \mu _i = \left( 1-\frac{\lambda }{\Vert y_{g_i}\Vert _2} \right) _+,~~~u_{g_i} = \mu _i y_{g_i}, \end{aligned}$$completing the proof. \(\square \)
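The resulting formula is the familiar group soft-thresholding (block shrinkage) operator. A minimal Python/NumPy sketch (ours; the group index layout is a hypothetical choice, any partition of the coordinates works):

```python
import numpy as np

def prox_group_l12(y, groups, lam):
    """Prox of the mixed norm phi(u) = sum_i ||u_{g_i}||_2 (sketch of Prop. 12).
    `groups` is a list of index arrays partitioning {0, ..., n-1}."""
    u = np.empty_like(y)
    for g in groups:
        norm_g = np.linalg.norm(y[g])
        # Group-wise shrinkage: zero out small groups, shrink the rest.
        scale = max(1.0 - lam / norm_g, 0.0) if norm_g > 0 else 0.0
        u[g] = scale * y[g]
    return u
```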
Proposition 13
Proof
 Case (i).
Let \(\Vert y_{g_i}\Vert _1 \le \lambda \). Then, we show that \(u_{g_i}=0\) satisfies (51). If \(u_{g_i}=0\), the subdifferential of \(\phi \) derived in Example 2 is \(\partial \phi (0_{g_i}) = \{g \in \mathbb {R}^{n_i} \mid \Vert g\Vert _1 \le 1 \}\). By substituting this into (51), we get that \(u_{g_i}=0\) satisfies (51) if \(y_{g_i} \in \{g \in \mathbb {R}^{n_i} \mid \Vert g\Vert _1 \le \lambda \}\), i.e., \(\text {prox}_{\lambda \phi }(y_{g_i}) = 0_{g_i}\).
Case (ii). Let \(\Vert y_{g_i}\Vert _1 > \lambda \). From Case (i), we have \(u_{g_i} \ne 0\). We show that$$\begin{aligned} u_{g_i}^j = \left\{ \begin{array}{ll} \text {sign}(y_{g_i}^j)\, u_\infty ^i &{} ~~\text {if}~ j \in \mathcal {I}_{g_i},\\ y_{g_i}^j &{} ~~\text {otherwise}, \end{array} \right. \end{aligned}$$(52)with \(\mathcal {I}_{g_i}\) defined in (49) and \(u_\infty ^i\) defined in (48), satisfies (51). Using the subdifferential of \(\phi \) derived in Example 2, it suffices to find coefficients \(\beta _{g_i}^j\), for \(j \in \mathcal {I}_{g_i}\), such that$$\begin{aligned} u_{g_i} - y_{g_i} + \lambda \sum _{j \in \mathcal {I}_{g_i}} \beta _{g_i}^j~ \text {sign}(u_{g_i}^j)\, e_j = 0, \end{aligned}$$(53)where$$\begin{aligned} \beta _{g_i}^j \ge 0 ~~\text {for}~ j\in \mathcal {I}_{g_i}, ~~~\sum _{j \in \mathcal {I}_{g_i}} \beta _{g_i}^j = 1. \end{aligned}$$(54)Let \(u_{g_i}\) be the vector defined in (52). We define$$\begin{aligned} \beta _{g_i}^j := \frac{|y_{g_i}^j| - u_\infty ^i}{\lambda }, \end{aligned}$$(55)for \(j \in \mathcal {I}_{g_i} = \{l_1^i, \ldots , l_{\widehat{k}_i}^i\}\). We show that the choice (55) satisfies (53). We first show \(u_\infty ^i >0\): it follows from (48) and (50) if \(\widehat{k}_i < n\), and from \(\Vert y_{g_i}\Vert _1> \lambda \) if \(\widehat{k}_i=n\). Using (52) and (55), we come to$$\begin{aligned} \begin{aligned} u_{g_i}^j - y_{g_i}^j + \lambda \beta _{g_i}^j~ \text {sign}(u_{g_i}^j)&= \text {sign}(y_{g_i}^j)\, u_\infty ^i-y_{g_i}^j + \left( |y_{g_i}^j| - u_\infty ^i \right) \text {sign}(\text {sign}(y_{g_i}^j)\, u_\infty ^i)\\&= \text {sign}(y_{g_i}^j)\, u_\infty ^i-y_{g_i}^j + \left( |y_{g_i}^j|-u_\infty ^i\right) \text {sign}(y_{g_i}^j) = 0, \end{aligned} \end{aligned}$$for \(j \in \mathcal {I}_{g_i}\). For \(j \not \in \mathcal {I}_{g_i}\), we have \(u_{g_i}^j - y_{g_i}^j = 0\). Hence, (53) is satisfied componentwise. It remains to show that (54) holds. From (50), we have \(|y_{g_i}^j| \ge u_\infty ^i\) for \(j \in \mathcal {I}_{g_i}\); this and (55) yield \(\beta _{g_i}^j \ge 0\) for \(j \in \mathcal {I}_{g_i}\). It can be deduced from (48) that$$\begin{aligned} \sum _{j \in \mathcal {I}_{g_i}} \beta _{g_i}^j = \frac{1}{\lambda } \sum _{j \in \mathcal {I}_{g_i}} |y_{g_i}^j| - \frac{\widehat{k}_i}{\lambda } u_\infty ^i = \frac{1}{\lambda } \sum _{j \in \mathcal {I}_{g_i}} |y_{g_i}^j| - \frac{1}{\lambda }\left( \sum _{j \in \mathcal {I}_{g_i}} |y_{g_i}^j|-\lambda \right) = 1, \end{aligned}$$giving the results. \(\square \)
Corollary 14
Proof
The proof is straightforward from Proposition 13 by setting \(m=1\), \(n_1=n\), \(y_{g_1}=y\), and \(\mathcal {I}_{g_1}=\mathcal {I}\). \(\square \)
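As an independent cross-check of the single-group formula of Corollary 14, one can evaluate \(\text {prox}_{\lambda \Vert \cdot \Vert _\infty }\) via the Moreau decomposition \(\text {prox}_{\lambda \phi }(y) = y - P_{\lambda B_1}(y)\), where \(P_{\lambda B_1}\) is the Euclidean projection onto the \(\ell _1\)-ball of radius \(\lambda \). A Python/NumPy sketch (ours; the projection uses the standard sorting scheme, not the paper's algorithm):

```python
import numpy as np

def project_l1_ball(y, radius):
    """Euclidean projection onto {x : ||x||_1 <= radius}, radius > 0, via sorting."""
    if np.abs(y).sum() <= radius:
        return y.copy()
    a = np.sort(np.abs(y))[::-1]                       # sorted magnitudes, descending
    cumsum = np.cumsum(a)
    k = np.arange(1, len(a) + 1)
    rho = np.nonzero(a - (cumsum - radius) / k > 0)[0][-1]
    theta = (cumsum[rho] - radius) / (rho + 1.0)       # soft-threshold level
    return np.sign(y) * np.maximum(np.abs(y) - theta, 0.0)

def prox_linf(y, lam):
    """Prox of lam*||.||_inf via the Moreau decomposition
    prox_{lam*phi}(y) = y - P_{lam*B_1}(y)."""
    return y - project_l1_ball(y, lam)
```

For instance, \(y=(3,1)\) and \(\lambda =2\) give \(\text {prox}_{\lambda \Vert \cdot \Vert _\infty }(y) = (1,1)\), the same clipped point the sorting formula of Corollary 14 produces.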
4.2 Constrained examples (\(C \ne \mathcal {V}\))
Proposition 15
Proof
 Case (i).
\(u=0\). Then, we have \(\omega (y,\lambda )\le 0\).
Case (ii). \(u \ne 0\). Then, \(\omega (y,\lambda ) > 0\). Proposition 1 yields$$\begin{aligned} \partial \phi (u) = \{ g \in \mathcal {V}^* \mid \Vert D^{-1} g\Vert _\infty = 1,~~ \langle g,u \rangle = \Vert Du\Vert _1\}, \end{aligned}$$leading to$$\begin{aligned} \sum _{i=1}^n (g_i u_i - d_i |u_i|) = 0. \end{aligned}$$Since \(|g_i| \le d_i\), each term of this sum is nonpositive, so \(g_i u_i = d_i |u_i|\), for \(i = 1, \ldots , n\). This implies that \(g_i = d_i~ \text {sign} (u_i)\) if \(u_i \ne 0\). This and the definition of \(N_C(u)\) yield$$\begin{aligned} u_i - y_i + \lambda (\partial \Vert Du\Vert _1)_i \left\{ \begin{array}{ll} \ge 0 &{}~~ \text {if}~ u_i = \underline{x}_i,\\ \le 0 &{}~~ \text {if}~ u_i = \overline{x}_i,\\ = 0 &{}~~ \text {if}~ \underline{x}_i< u_i < \overline{x}_i, \end{array} \right. \end{aligned}$$for \(i = 1, \ldots , n\), and equivalently, for \(u \ne 0\),$$\begin{aligned} u_i - y_i + \lambda d_i~ \text {sign} (u_i) \left\{ \begin{array}{ll} \ge 0 &{}~~ \text {if}~ u_i = \underline{x}_i,\\ \le 0 &{}~~ \text {if}~ u_i = \overline{x}_i,\\ = 0 &{}~~ \text {if}~ \underline{x}_i< u_i < \overline{x}_i, \end{array} \right. \end{aligned}$$(64)for \(i = 1, \ldots , n\). If \(u_i = \underline{x}_i\), substituting \(u_i = \underline{x}_i\) in (64) implies \(\underline{x}_i - y_i + \lambda d_i~ \text {sign} (\underline{x}_i) \ge 0\). If \(u_i = \overline{x}_i\), substituting \(u_i = \overline{x}_i\) in (64) gives \(\overline{x}_i - y_i + \lambda d_i~ \text {sign} (\overline{x}_i) \le 0\). If \(\underline{x}_i< u_i < \overline{x}_i\), there are three possibilities: (a) \(u_i > 0\); (b) \(u_i < 0\); (c) \(u_i = 0\). In Case (a), \(\text {sign} (u_i) = 1\) and (64) lead to \(u_i = y_i - \lambda d_i > 0\). In Case (b), \(\text {sign} (u_i) = -1\) and (64) imply \(u_i = y_i+ \lambda d_i <0\). In Case (c), we end up with \(u_i = 0\), completing the proof. \(\square \)
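Since the objective of (32) is separable in this case, Proposition 15 amounts componentwise to soft thresholding followed by clipping to the box. A Python/NumPy sketch (ours, assuming \(d > 0\); for a one-dimensional convex objective, the constrained minimizer is the projection of the unconstrained one onto the interval):

```python
import numpy as np

def prox_weighted_l1_box(y, d, lam, lower, upper):
    """Prox-like operator of lam*||D u||_1 over the box [lower, upper]
    (sketch of Prop. 15, with D = diag(d), d > 0)."""
    # Componentwise soft threshold with weight lam*d_i ...
    shrunk = np.sign(y) * np.maximum(np.abs(y) - lam * d, 0.0)
    # ... then clip to the box.
    return np.clip(shrunk, lower, upper)
```

With \(\underline{x} = -\infty \) and \(\overline{x} = +\infty \) this recovers the unconstrained weighted soft-thresholding operator.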
Proposition 16
Proof
Proposition 17
Proof
 Case (i).
\(u=0\). Then, we have \(\omega (y,\lambda _2)\le 0\).
Case (ii). \(u \ne 0\). Then, \(\omega (y,\lambda _2)>0\). From (68) and the definition of \(N_C(u)\), we obtain$$\begin{aligned} u_i - y_i + \lambda _1 u_i + \lambda _2\, \partial (d_i|u_i|) \left\{ \begin{array}{ll} \ge 0 &{}~~ \text {if}~~ u_i = \underline{x}_i,\\ \le 0 &{}~~ \text {if}~~ u_i = \overline{x}_i,\\ = 0 &{}~~ \text {if}~~ \underline{x}_i< u_i < \overline{x}_i, \end{array} \right. \end{aligned}$$for \(i = 1, \ldots , n\). This leads to$$\begin{aligned} (1+\lambda _1) u_i - y_i + \lambda _2 d_i~ \text {sign} (u_i) \left\{ \begin{array}{ll} \ge 0 &{}\quad \text {if}\quad u_i = \underline{x}_i,\\ \le 0 &{}\quad \text {if}\quad u_i = \overline{x}_i,\\ = 0 &{}\quad \text {if}\quad \underline{x}_i< u_i < \overline{x}_i, \end{array} \right. \end{aligned}$$(70)for \(i = 1, \ldots , n\). If \(u_i = \underline{x}_i\), substituting \(u_i = \underline{x}_i\) in (70) gives \((1+\lambda _1)\underline{x}_i - y_i + \lambda _2 d_i~\text {sign} (\underline{x}_i) \ge 0\). If \(u_i = \overline{x}_i\), substituting \(u_i = \overline{x}_i\) in (70) implies \((1+\lambda _1) \overline{x}_i - y_i + \lambda _2 d_i~ \text {sign} (\overline{x}_i) \le 0\). If \(\underline{x}_i< u_i < \overline{x}_i\), there are three possibilities: (a) \(u_i > 0\); (b) \(u_i < 0\); (c) \(u_i = 0\). In Case (a), \(\text {sign}(u_i) = 1\) and (70) imply \(u_i =(y_i - \lambda _2 d_i)/(1+\lambda _1) > 0\). In Case (b), \(\text {sign}(u_i) = -1\) and (70) imply \(u_i = (y_i+ \lambda _2 d_i)/(1+\lambda _1) <0\). In Case (c), we get \(u_i = 0\), giving the result. \(\square \)
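Analogously, the case analysis of Proposition 17 amounts componentwise to scaled soft thresholding clipped to the box. A Python/NumPy sketch (ours, assuming \(d > 0\); function names are illustrative):

```python
import numpy as np

def prox_enet_box(y, d, lam1, lam2, lower, upper):
    """Prox-like operator with an elastic-net-type term
    (lam1/2)*||u||_2^2 + lam2*||D u||_1 over the box [lower, upper]
    (sketch of Prop. 17, with D = diag(d), d > 0)."""
    # Soft threshold with weight lam2*d_i, scaled by 1/(1 + lam1) ...
    shrunk = np.sign(y) * np.maximum(np.abs(y) - lam2 * d, 0.0) / (1.0 + lam1)
    # ... then clip to the box.
    return np.clip(shrunk, lower, upper)
```

Setting `lam1 = 0` recovers the operator of Proposition 15.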
Consider nonnegativity constraints \(x \ge 0\). These constraints are important in many applications, especially if x describes physical quantities; see, e.g., (Esser et al. 2013; Kaufman and Neumaier 1996, 1997). Since nonnegativity constraints can be regarded as a special case of a bound-constrained domain, Propositions 15, 16, and 17 can be used to derive the corresponding results.
5 Numerical experiments and application
We here report some numerical results to compare the performance of OSGAO with OSGA and some state-of-the-art methods. In our comparison, we consider PGA (proximal gradient algorithm (Parikh and Boyd 2013)), NSDSG (nonsummable diminishing subgradient algorithm (Boyd et al. 2003)), FISTA (Beck and Teboulle’s fast proximal gradient algorithm (Beck and Teboulle 2012)), NESCO (Nesterov’s composite optimal algorithm (Nesterov 2013)), NESUN (Nesterov’s universal gradient algorithm (Nesterov 2015)), NES83 (Nesterov’s 1983 optimal algorithm (Nesterov 1983)), NESCS (Nesterov’s constant step optimal algorithm (Nesterov 2004)), and NES05 (Nesterov’s 2005 optimal algorithm (Nesterov 2005a)). We adapt NES83, NESCS, and NES05 by passing a subgradient in place of the gradient so that they can be applied to nonsmooth problems (see Ahookhosh (2016)). The codes of these algorithms are written in MATLAB, and we use the parameters proposed in the associated papers.
5.1 Experiment with random data
5.1.1 \(\ell _1\) minimization
Here, we consider the \(\ell _1\) minimization problem (19), reformulate it as a minimization problem of the form (13) with the objective and the constraint given in (22), and solve the reformulated problem by OSGAO. We then report some numerical results and a comparison among OSGAO, OSGA and some stateoftheart methods. For OSGAO and OSGA, we here set \(\mu =0\).
Averages (only integer part) of \(N_i\) and T for PGA, FISTA, NESCO, NESUN, OSGA, and OSGAO for solving \(\ell _1\) minimization problem with several regularization parameters
Reg. Par.  OSGAO \(N_i\)/T  OSGA \(N_i\)/T  PGA \(N_i\)/T  FISTA \(N_i\)/T  NESCO \(N_i\)/T  NESUN \(N_i\)/T
\(\lambda = 1\)  100  12  2277  103  5000  138  3316  138  3189  385  2614  223 
\(\lambda = 10^{-1}\)  100  12  1497  72  4680  141  1940  87  1162  153  1376  125 
\(\lambda = 10^{-2}\)  100  12  638  31  5000  156  1024  48  617  85  735  69 
\(\lambda = 10^{-3}\)  100  12  773  38  5000  154  1241  60  749  102  890  85 
\(\lambda = 10^{-4}\)  100  12  783  36  5000  138  1287  51  775  87  922  73 
\(\lambda = 10^{-5}\)  100  12  462  22  5000  148  744  34  450  59  536  49 
Averages (only integer part) of \(N_i\) and T for NSDSG, NES83, NESCS, NES05, OSGA, and OSGAO for solving \(\ell _1\) minimization problem with several regularization parameters
Reg. Par.  OSGAO \(N_i\)/T  OSGA \(N_i\)/T  NSDSG \(N_i\)/T  NES83 \(N_i\)/T  NESCS \(N_i\)/T  NES05 \(N_i\)/T
\(\lambda = 1\)  100  12  2277  103  5000  162  3352  169  4508  224  3318  106 
\(\lambda = 10^{-1}\)  100  12  1497  72  5000  152  2167  105  4021  193  1947  60 
\(\lambda = 10^{-2}\)  100  12  638  31  5000  138  1142  46  3956  167  1029  26 
\(\lambda = 10^{-3}\)  100  12  773  38  5000  148  1386  62  4482  200  1248  35 
\(\lambda = 10^{-4}\)  100  12  783  36  5000  142  1434  64  4949  216  1229  37 
\(\lambda = 10^{-5}\)  100  12  462  22  5000  150  831  37  3572  161  749  21 
The results of Tables 1 and 2 show that OSGAO attains the best number of iterations and running time for the \(\ell _1\) minimization problem, where each entry is the average over 10 runs for the associated regularization parameter. In Fig. 1, subfigures (a) and (b) show performance profiles with the measures \(N_i\) and T comparing proximal-based methods, where OSGAO outperforms the others substantially. Subfigures (c) and (d) display performance profiles with the measures \(N_i\) and T comparing subgradient-based methods, where OSGAO again performs much better than the others with respect to both measures. Further, from Fig. 2, it can be seen that the worst results are obtained by NSDSG and PGA, while FISTA, NESCO, NESUN, NES83, NESCS, NES05, and OSGA are comparable to some extent; however, OSGAO is significantly superior to all the other methods.
5.1.2 Elastic net minimization
We now consider the elastic net minimization problem (20), reformulate it as a minimization problem of the form (13) with the objective and the constraint given in (23), and solve the reformulated problem by OSGAO. We then give some numerical results and a comparison among OSGAO, OSGA, and some state-of-the-art solvers. For OSGAO and OSGA, we here set \(\mu =\lambda _1/2\).
Averages (integer part only) of \(N_i\) and T for PGA, FISTA, NESCO, NESUN, OSGA, and OSGAO for solving the elastic net problem (20) with several regularization parameters
Reg. Par.  OSGAO \(N_i\)/T  OSGA \(N_i\)/T  PGA \(N_i\)/T  FISTA \(N_i\)/T  NESCO \(N_i\)/T  NESUN \(N_i\)/T
\(\lambda = 1\)  100  12  4781  222  5000  143  4904  215  4756  609  4071  371 
\(\lambda = 10^{-1}\)  100  12  1128  52  5000  143  1517  66  908  110  1078  94 
\(\lambda = 10^{-2}\)  100  12  652  31  5000  148  1038  45  626  78  744  63 
\(\lambda = 10^{-3}\)  100  12  474  23  5000  151  762  33  460  55  549  48 
\(\lambda = 10^{-4}\)  100  12  513  25  5000  147  839  37  506  62  602  54 
\(\lambda = 10^{-5}\)  100  12  661  32  5000  148  1076  47  649  79  772  68 
Averages (only integer part) of \(N_i\) and T for NSDSG, NES83, NESCS, NES05, OSGA, and OSGAO for solving the elastic net problem with several regularization parameters
Reg. Par.  OSGA-O  OSGA  NSDSG  NES83  NESCS  NES05
  \(N_i\)  T  \(N_i\)  T  \(N_i\)  T  \(N_i\)  T  \(N_i\)  T  \(N_i\)  T
\(\lambda = 1\)  100  12  4781  222  5000  147  4949  221  5000  226  4904  145
\(\lambda = 10^{-1}\)  100  12  1128  52  5000  156  1692  80  4677  218  1523  47
\(\lambda = 10^{-2}\)  100  12  652  31  5000  146  1158  50  4225  182  1044  28
\(\lambda = 10^{-3}\)  100  12  474  23  5000  144  8516  36  3058  130  766  20
\(\lambda = 10^{-4}\)  100  12  513  25  5000  144  935  40  3120  135  844  23
\(\lambda = 10^{-5}\)  100  12  661  32  5000  148  1203  52  4589  201  1083  29
5.2 Sparse recovery (compressed sensing)
In recent years, there has been increasing interest in finding sparse solutions of problems in various areas of applied mathematics using structured models. In most cases, the problem involves high-dimensional data with a small number of available measurements, and its core is an optimization problem of the form (19) or (20). Thanks to the sparsity of solutions and the structure of the problems, these optimization problems can be solved in reasonable time even for extremely high-dimensional data sets. Sparse recovery, basis pursuit, lasso, wavelet-based deconvolution, and compressed sensing are some examples; the latter has received a great deal of attention in recent years, cf. (Candès 2006; Donoho 2006).
6 Conclusions
This paper discusses the solution of structured nonsmooth convex optimization problems with the complexity \(\mathcal {O}(\varepsilon ^{-1/2})\), which is optimal for smooth problems with Lipschitz continuous gradients. First, if the nonsmoothness of the problem is manifested in a structured way, the problem is reformulated so that the objective is smooth with Lipschitz continuous gradient, at the price of adding a functional constraint to the feasible domain. Then, a new setup of the optimal subgradient algorithm (OSGA-O) is developed to solve the reformulated problem with the complexity \(\mathcal {O}(\varepsilon ^{-1/2})\).
Next, it is proved that the OSGA-O auxiliary problem is equivalent to a proximal-like problem, which is well studied due to its appearance in Nesterov-type optimal methods for composite minimization. For several problems arising in applications, either an explicit formula or a simple iterative scheme for solving the corresponding proximal-like problem is provided.
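As an illustration of the "simple iterative scheme" case (this is a generic example, not one of the paper's specific subproblems): the projection onto an \(\ell _1\)-ball has no closed form, but it reduces to soft-thresholding at a level \(\theta \) determined by a scalar monotone equation, which bisection solves:

```python
import numpy as np

def proj_l1_ball(z, r, tol=1e-10):
    """Project z onto {x : ||x||_1 <= r}.

    The projection is soft(z, theta) = sign(z) * max(|z| - theta, 0),
    where theta >= 0 solves ||soft(z, theta)||_1 = r. The map
    theta -> ||soft(z, theta)||_1 is continuous and nonincreasing,
    so a simple bisection on [0, max|z_i|] finds theta.
    """
    if np.abs(z).sum() <= r:
        return z.copy()               # already feasible, nothing to do
    lo, hi = 0.0, np.abs(z).max()
    while hi - lo > tol:
        theta = 0.5 * (lo + hi)
        if np.maximum(np.abs(z) - theta, 0.0).sum() > r:
            lo = theta                # threshold too small, still infeasible
        else:
            hi = theta
    theta = 0.5 * (lo + hi)
    return np.sign(z) * np.maximum(np.abs(z) - theta, 0.0)
```

For example, projecting \(z=(3,1)\) onto the ball of radius 2 gives \((2,0)\), obtained with the threshold \(\theta =1\).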
Finally, some numerical results on random data and a sparse recovery problem are reported, indicating the good behavior of OSGA-O compared with some state-of-the-art first-order methods and confirming the theoretical foundations.
Acknowledgements
Open access funding provided by University of Vienna. Thanks to Stephen M. Robinson and Defeng Sun for their comments about solving nonsmooth equations. We are very grateful to the anonymous referees for a careful reading and many useful suggestions, which improved the paper.
References
 Ahookhosh M (2015) High-dimensional nonsmooth convex optimization via optimal subgradient methods. PhD Thesis, University of Vienna
 Ahookhosh M (2016) Optimal subgradient algorithms with application to large-scale linear inverse problems (submitted). http://arxiv.org/abs/1402.7291
 Ahookhosh M, Amini K, Kimiaei M (2015) A globally convergent trust-region method for large-scale symmetric nonlinear systems. Numer Funct Anal Optim 36:830–855
 Ahookhosh M, Neumaier A (2013) High-dimensional convex optimization via optimal affine subgradient algorithms. In: ROKS workshop, pp 83–84
 Ahookhosh M, Neumaier A (2016) An optimal subgradient algorithm with subspace search for costly convex optimization problems (submitted). http://www.optimizationonline.org/DB_FILE/2015/04/4852.pdf
 Ahookhosh M, Neumaier A (2017) An optimal subgradient algorithm for large-scale bound-constrained convex optimization. Math Methods Oper Res. doi:10.1007/s00186-017-0585-1
 Ahookhosh M, Neumaier A (2017) Optimal subgradient algorithms for large-scale convex optimization in simple domains. Numer Algorithms. doi:10.1007/s11075-017-0297-x
 Auslender A, Teboulle M (2006) Interior gradient and proximal methods for convex and conic optimization. SIAM J Optim 16:697–725
 Beck A, Teboulle M (2003) Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper Res Lett 31(3):167–175
 Beck A, Teboulle M (2012) Smoothing and first order methods: a unified framework. SIAM J Optim 22:557–580
 Beck A, Ben-Tal A, Guttmann-Beck N, Tetruashvili L (2010) The CoMirror algorithm for solving nonsmooth constrained convex problems. Oper Res Lett 38(6):493–498
 Boţ RI, Hendrich C (2013) A double smoothing technique for solving unconstrained nondifferentiable convex optimization problems. Comput Optim Appl 54(2):239–262
 Boţ RI, Hendrich C (2015) On the acceleration of the double smoothing technique for unconstrained convex optimization problems. Optimization 64(2):265–288
 Boyd S, Xiao L, Mutapcic A (2003) Subgradient methods. Notes for EE392o, Stanford University. http://www.stanford.edu/class/ee392o/subgrad_method.pdf
 Candès E (2006) Compressive sampling. In: Proceedings of the International Congress of Mathematicians, vol 3, Madrid, Spain, pp 1433–1452
 Chen Y, Lan G, Ouyang Y (2014) Optimal primal-dual methods for a class of saddle point problems. SIAM J Optim 24(4):1779–1814
 Chen Y, Lan G, Ouyang Y (2017) Accelerated scheme for a class of variational inequalities. doi:10.1007/s10107-017-1161-4
 Chen Y, Lan G, Ouyang Y (2015) An accelerated linearized alternating direction method of multipliers. SIAM J Imaging Sci 8(1):644–681
 Chen Y, Lan G, Ouyang Y, Zhang W (2014) Fast bundle-level type methods for unconstrained and ball-constrained convex optimization. http://arxiv.org/pdf/1412.2128v1.pdf
 Combettes P, Pesquet J-C (2011) Proximal splitting methods in signal processing. In: Bauschke H, Burachik R, Combettes P, Elser V, Luke D, Wolkowicz H (eds) Fixed-point algorithms for inverse problems in science and engineering. Springer, New York, pp 185–212
 Daubechies I, DeVore R, Fornasier M, Güntürk CS (2010) Iteratively reweighted least squares minimization for sparse recovery. Commun Pure Appl Math 63(1):1–38
 Devolder O, Glineur F, Nesterov Y (2013) First-order methods of smooth convex optimization with inexact oracle. Math Program 146:37–75
 Devolder O, Glineur F, Nesterov Y (2012) Double smoothing technique for large-scale linearly constrained convex optimization. SIAM J Optim 22(2):702–727
 Dolan ED, Moré JJ (2002) Benchmarking optimization software with performance profiles. Math Program 91(2):201–213
 Donoho DL (2006) Compressed sensing. IEEE Trans Inf Theory 52(4):1289–1306
 Esser E, Lou Y, Xin J (2013) A method for finding structured sparse solutions to nonnegative least squares problems with applications. SIAM J Imaging Sci 6(4):2010–2046
 Figueiredo MAT, Nowak RD, Wright SJ (2007) Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems. IEEE J Sel Top Signal Process 1(4):586–597
 Hansen N, Auger A, Ros R, Finck S, Posik P (2010) Comparing results of 31 algorithms from the black-box optimization benchmarking BBOB-2009. In: Proc. Workshop GECCO, pp 1689–1696
 Gonzaga CC, Karas EW (2013) Fine tuning Nesterov's steepest descent algorithm for differentiable convex programming. Math Program 138:141–166
 Gonzaga CC, Karas EW, Rossetto DR (2013) An optimal algorithm for constrained differentiable convex optimization. SIAM J Optim 23(4):1939–1955
 Juditsky A, Nesterov Y (2014) Deterministic and stochastic primal-dual subgradient algorithms for uniformly convex minimization. Stoch Syst 4(1):44–80
 Kaufman L, Neumaier A (1996) PET regularization by envelope guided conjugate gradients. IEEE Trans Med Imaging 15:385–389
 Kaufman L, Neumaier A (1997) Regularization of ill-posed problems by envelope guided conjugate gradients. J Comput Graph Stat 6(4):451–463
 Lagarias JC, Reeds JA, Wright MH, Wright PE (1998) Convergence properties of the Nelder–Mead simplex method in low dimensions. SIAM J Optim 9:112–147
 Lan G (2015) Bundle-level type methods uniformly optimal for smooth and nonsmooth convex optimization. Math Program 149(1):1–45
 Lan G, Lu Z, Monteiro RDC (2011) Primal-dual first-order methods with \(O(1/\varepsilon )\) iteration-complexity for cone programming. Math Program 126:1–29
 Li DH, Yamashita N, Fukushima M (2001) Nonsmooth equation based BFGS method for solving KKT systems in mathematical programming. J Optim Theory Appl 109(1):123–167
 Liu H, Zhang J, Jiang X, Liu J (2010) The group Dantzig selector. J Mach Learn Res Proc Track 9:461–468
 Nemirovsky AS, Yudin DB (1983) Problem complexity and method efficiency in optimization. Wiley, New York
 Nesterov Y (2004) Introductory lectures on convex optimization: a basic course. Kluwer, Dordrecht
 Nesterov Y (1983) A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Doklady AN SSSR (in Russian) 269:543–547. English translation: Soviet Math Dokl 27:372–376 (1983)
 Nesterov Y (2005) Smooth minimization of nonsmooth functions. Math Program 103:127–152
 Nesterov Y (2005) Excessive gap technique in nonsmooth convex minimization. SIAM J Optim 16:235–249
 Nesterov Y (2011) Barrier subgradient method. Math Program 127:31–56
 Nesterov Y (2006) Primal-dual subgradient methods for convex problems. Math Program 120:221–259
 Nesterov Y (2013) Gradient methods for minimizing composite objective function. Math Program 140:125–161
 Nesterov Y (2015) Universal gradient methods for convex optimization problems. Math Program 152(1):381–404
 Neumaier A (2016) OSGA: a fast subgradient algorithm with optimal complexity. Math Program 158(1):1–21
 Neumaier A (2001) Introduction to numerical analysis. Cambridge University Press, Cambridge
 Pang JS, Qi L (1993) Nonsmooth equations: motivation and algorithms. SIAM J Optim 3:443–465
 Parikh N, Boyd S (2013) Proximal algorithms. Found Trends Optim 1(3):123–231
 Polyak B (1987) Introduction to optimization. Optimization Software Inc., Publications Division, New York
 Potra FA, Qi L, Sun D (1998) Secant methods for semismooth equations. Numer Math 80(2):305–324
 Qi L (1995) Trust region algorithms for solving nonsmooth equations. SIAM J Optim 5:219–230
 Qi L, Sun D (1999) A survey of some nonsmooth equations and smoothing Newton methods. Prog Optim 30:121–146
 Rauhut H, Ward R (2016) Interpolation via weighted \(\ell _1\)-minimization. Appl Comput Harmon Anal 40(2):321–351
 Shor NZ (1985) Minimization methods for non-differentiable functions. Springer Series in Computational Mathematics. Springer, Berlin
 Sun D, Han J (1997) Newton and quasi-Newton methods for a class of nonsmooth equations and related problems. SIAM J Optim 7(2):463–480
 Tseng P (2008) On accelerated proximal gradient methods for convex-concave optimization. Technical report, Mathematics Department, University of Washington. http://pages.cs.wisc.edu/~brecht/cs726docs/Tseng.APG.pdf
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.