1 Introduction

Subgradient methods are a class of first-order methods for solving convex nonsmooth optimization problems, dating back to the 1960s; see, e.g., Polyak (1987) and Shor (1985). In general, they only need function values and subgradients; they not only inherit the basic features of general first-order methods, such as a low memory requirement and a simple structure, but are also able to deal with every convex optimization problem. They are therefore suitable for solving convex problems with a large number of variables, say several million. Although these features make them attractive for applications involving high-dimensional data, they usually suffer from slow convergence, which ultimately limits the attainable accuracy. Nemirovsky and Yudin (1983) derived worst-case complexity bounds for first-order methods to achieve an \(\varepsilon \)-solution for several classes of problems: \(\mathcal {O}(\varepsilon ^{-2})\) iterations for Lipschitz continuous nonsmooth problems and \(\mathcal {O}(\varepsilon ^{-1/2})\) iterations for smooth problems with Lipschitz continuous gradients. The slow convergence of subgradient methods means that they often need a number of iterations close to this worst-case bound to reach an \(\varepsilon \)-solution.

In Nemirovsky and Yudin (1983), it was proved that the subgradient, subgradient projection, and mirror descent methods attain the optimal complexity of first-order methods for solving Lipschitz continuous nonsmooth problems; here, the mirror descent method is a generalization of the subgradient projection method, cf. Beck and Teboulle (2003) and Beck et al. (2010). Nesterov (2006, 2011) proposed primal-dual subgradient schemes that attain the complexity \(\mathcal {O}(\varepsilon ^{-2})\) for Lipschitz continuous nonsmooth problems. Juditsky and Nesterov (2014) proposed a primal-dual subgradient scheme for uniformly convex functions with an unknown convexity parameter, which attains a complexity close to the optimal bound. Nesterov (1983), and later Nesterov (2004), proposed gradient methods for solving smooth problems with Lipschitz continuous gradients that attain the complexity \(\mathcal {O}(\varepsilon ^{-1/2})\). He also proposed smoothing methods for structured nonsmooth problems in Nesterov (2005a, b). Smoothing methods have also been studied by many authors; see, e.g., Beck and Teboulle (2012), Boţ and Hendrich (2013, 2015), and Devolder et al. (2012).

In many fields of applied science and engineering, such as signal and image processing, geophysics, economics, machine learning, and statistics, there are many applications that can be modeled as a convex optimization problem in which the objective is a composition of a smooth function with Lipschitz continuous gradients and a nonsmooth function; see Ahookhosh (2016) and the references therein. The study of this class of problems with first-order methods has dominated the convex optimization literature in recent years. Nesterov (2013, 2015) proposed gradient methods for solving composite problems that attain the complexity \(\mathcal {O}(\varepsilon ^{-1/2})\). For this class of problems, other first-order methods with the complexity \(\mathcal {O}(\varepsilon ^{-1/2})\) have been developed by Auslender and Teboulle (2006), Beck and Teboulle (2012), Chen et al. (2014, 2015, 2017), Devolder et al. (2013), Gonzaga and Karas (2013), Gonzaga et al. (2013), Lan (2015), Lan et al. (2011), and Tseng (2008). In particular, Neumaier (2016) proposed the optimal subgradient algorithm (OSGA), attaining the complexity \(\mathcal {O}(\varepsilon ^{-2})\) for Lipschitz continuous nonsmooth problems and \(\mathcal {O}(\varepsilon ^{-1/2})\) for smooth problems with Lipschitz continuous gradients. OSGA is a black-box method and does not require global information about the objective function such as Lipschitz constants.

1.1 Content

This paper focuses on a class of structured nonsmooth convex constrained optimization problems that generalizes the composite problems frequently found in applications. OSGA behaves well on composite problems in practice; see Ahookhosh (2016) and Ahookhosh and Neumaier (2017); however, it does not attain the complexity \(\mathcal {O}(\varepsilon ^{-1/2})\) for this class of problems. Hence, we first reformulate the problem considered so that only the smooth part remains in the objective, at the cost of adding a functional constraint to the feasible domain. Afterward, we propose a suitable prox-function, provide a new setup of OSGA, called OSGA-O, for the reformulated problem, and show that solving the OSGA-O auxiliary problem is equivalent to solving a proximal-like problem. We show that this proximal-like subproblem can be solved efficiently for many problems appearing in applications, either in a closed form or by a simple iterative scheme. Thanks to this reformulation, the problem can be solved by OSGA-O with the complexity \(\mathcal {O}(\varepsilon ^{-1/2})\). Finally, some numerical results are reported, suggesting good behavior of OSGA-O.

The function underlying the OSGA subproblem is quasi-concave, and finding its maximizer is the most costly part of the algorithm. Hence, solving this subproblem efficiently is crucial, but it is not a trivial task. For unconstrained problems, we found a closed-form solution of the subproblem and studied the numerical behavior of OSGA in Ahookhosh (2016) and Ahookhosh and Neumaier (2013, 2016). In Ahookhosh and Neumaier (2017), we gave a one-projection version of OSGA and provided a framework for solving the subproblem over simple convex domains or with simple functional constraints. In particular, we described a scheme to compute the global solution of the OSGA subproblem for bound constraints in Ahookhosh and Neumaier (2017). Let us emphasize that the subproblem of OSGA-O is constrained by a simple convex set together with simple functional constraints, which differs from the subproblems used in Ahookhosh (2016) and Ahookhosh and Neumaier (2013, 2016, 2017), and which leads to solving a proximal-like problem.

This paper consists of six sections, including this introductory section. In the next section, we briefly review the main idea of OSGA. In Sect. 3, we reformulate the basic problem under consideration and show that solving the OSGA-O subproblem is equivalent to solving a proximal-like problem. Section 4 points out how the proximal-like subproblem can be solved in many interesting cases. Some numerical results are reported in Sect. 5, and conclusions are given in Sect. 6.

1.2 Preliminaries and notation

Let \(\mathcal {V}\) be a finite-dimensional vector space endowed with the norm \(\Vert \cdot \Vert \), and let \(\mathcal {V}^*\) denote its dual space, formed by all linear functionals on \(\mathcal {V}\), where the bilinear pairing \(\langle g,x\rangle \) denotes the value of the functional \(g \in \mathcal {V}^*\) at \(x \in \mathcal {V}\). The associated dual norm of \(\Vert \cdot \Vert \) is defined by

$$\begin{aligned} \Vert g\Vert _* = \sup _{z\in \mathcal {V}} \{\langle g,z \rangle : \Vert z\Vert \le 1\}. \end{aligned}$$

If \(\mathcal {V} = \mathbb {R}^n\), then, for \(1 \le p \le \infty \),

$$\begin{aligned} \Vert x\Vert _p = \left( \sum _{i=1}^n |x_i|^p\right) ^{1/p},~~ \Vert x\Vert _{1,p} = \sum _{i=1}^m \Vert x_{g_i}\Vert _p, \end{aligned}$$

where \(x = (x_{g_1}, \ldots , x_{g_m}) \in \mathbb {R}^{n_1} \times \cdots \times \mathbb {R}^{n_m}\) in which \(n_1 + \cdots + n_m = n\). We set \((x)_+ = \max (x,0)\). For a function \(f: \mathcal {V} \rightarrow \overline{\mathbb {R}} = \mathbb {R}\cup \{\pm \infty \}\),

$$\begin{aligned} \text {dom}f = \{ x \in \mathcal {V} ~|~ f(x) < +\infty \} \end{aligned}$$

denotes its effective domain, and f is called proper if \(\text {dom}f \ne \emptyset \) and \(f(x) > -\infty \) for all \(x \in \mathcal {V}\). Let C be a subset of \(\mathcal {V}\). In particular, if C is a box, we denote it by \(\mathbf {x}= [\underline{x},\overline{x}]\), where \(\underline{x}\) and \(\overline{x}\) are the vectors of lower and upper bounds on the components of x, respectively. A vector \(g \in \mathcal {V}^* \) is called a subgradient of f at x if \(f(x) \in \mathbb {R}\) and

$$\begin{aligned} f(y) \ge f(x) + \langle g,y-x \rangle ~~~ \hbox {for all}~ y \in \mathcal {V}. \end{aligned}$$

The set of all subgradients is called the subdifferential of f at x and is denoted by \(\partial f(x)\). If \(f: \mathcal {V} \rightarrow \mathbb {R}\) is nonsmooth and convex, then Fermat’s optimality condition for the nonsmooth convex optimization problem

$$\begin{aligned} \begin{array}{ll} \min &{} f(x)\\ \text {s.t.} &{} x \in C \end{array} \end{aligned}$$

is given by

$$\begin{aligned} 0 \in \partial f(x) + N_C(x), \end{aligned}$$
(1)

where \(N_C(x)\) is the normal cone of C at x defined by

$$\begin{aligned} N_C(x) = \{p \in \mathcal {V} \mid \langle p, x-z \rangle \ge 0 ~~ \forall z \in C\}. \end{aligned}$$
(2)

The proximal-like operator \(\text {prox}_{\lambda f}^C(y)\) is the unique optimizer of the optimization problem

$$\begin{aligned} \text {prox}_{\lambda f}^C(y) := \mathop {\hbox { argmin}}_{x \in C} \frac{1}{2} \Vert x-y\Vert _2^2 + \lambda f(x), \end{aligned}$$
(3)

where \(\lambda >0\). From (1), the first-order optimality condition of (3) is given by

$$\begin{aligned} 0 \in x-y + \lambda \partial f(x) + N_C(x). \end{aligned}$$
(4)

If \(C=\mathcal {V}\), then (4) is simplified to

$$\begin{aligned} 0 \in x-y + \lambda \partial f(x), \end{aligned}$$
(5)

giving the classical proximity operator. A function f is called strongly convex with the convexity parameter \(\sigma >0\) if and only if

$$\begin{aligned} f(z)\ge f(x)+\langle g,z-x\rangle +\frac{\sigma }{2}\Vert z-x\Vert _2^2 ~~~\text {for all}~ x,z\in \mathcal {V} \end{aligned}$$
(6)

where g denotes any subgradient of f at x, i.e., \(g \in \partial f(x)\).
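For a concrete instance: with \(C = \mathcal {V} = \mathbb {R}^n\) and \(f = \Vert \cdot \Vert _1\), the proximity operator (3) is the well-known soft-thresholding map. The sketch below (plain Python, for illustration only; the helper name `soft_threshold` is ours) computes \(\text {prox}_{\lambda \Vert \cdot \Vert _1}(y)\) and verifies the optimality condition (5) componentwise.

```python
import random

def soft_threshold(y, lam):
    """prox_{lam * ||.||_1}(y): componentwise sign(y_i) * max(|y_i| - lam, 0)."""
    return [(1 if t > 0 else -1) * max(abs(t) - lam, 0.0) for t in y]

random.seed(0)
y = [random.uniform(-2, 2) for _ in range(8)]
lam = 0.7
u = soft_threshold(y, lam)

# Optimality condition (5): 0 in u - y + lam * subdiff(||.||_1)(u).
# If u_i != 0, the subdifferential of |.| is {sign(u_i)}, so u_i - y_i + lam*sign(u_i) = 0;
# if u_i == 0, it is the interval [-lam, lam], so we need |y_i| <= lam.
for ui, yi in zip(u, y):
    if ui != 0.0:
        assert abs(ui - yi + lam * (1 if ui > 0 else -1)) < 1e-12
    else:
        assert abs(yi) <= lam + 1e-12
print("optimality condition (5) verified")
```

This closed form is the reason problems regularized by \(\Vert \cdot \Vert _1\) are considered "simple" in the sense used later.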

The subdifferential of \(\phi (x) = \Vert Wx\Vert \) is given in the next result, for an arbitrary norm \(\Vert \cdot \Vert \) on \(\mathbb {R}^n\) and a matrix \(W\in \mathbb {R}^{m\times n}\). For a proof of this result, see Proposition 2.1.17 in Ahookhosh (2015).

Proposition 1

Let \(\phi :\mathbb {R}^n \rightarrow \mathbb {R},~ \phi (x) = \Vert Wx\Vert \), where \(W\in \mathbb {R}^{m\times n}\) is an invertible matrix and \(\Vert \cdot \Vert \) is any norm of \(\mathbb {R}^n\). Then

$$\begin{aligned} \partial \phi (x)= \left\{ \begin{array}{ll} \{g \in \mathbb {R}^n \mid \Vert W^{-T}g\Vert \le 1\} &{} ~~ \text {if} ~ x=0,\\ \{g \in \mathbb {R}^n \mid \Vert W^{-T}g\Vert = 1,~ \langle g,x \rangle = \Vert Wx\Vert \} &{} ~~ \text {if} ~ x \ne 0. \end{array} \right. \end{aligned}$$

In particular, if \(\Vert \cdot \Vert \) is self-dual (\(\Vert \cdot \Vert =\Vert \cdot \Vert _*\)), we have

$$\begin{aligned} \partial \phi (x) = \left\{ \begin{array}{ll} \{g \in \mathbb {R}^n \mid \Vert W^{-T}g\Vert _* \le 1\} &{} ~~ \text {if} ~ x=0, \\ W^T \frac{Wx}{\Vert Wx\Vert } &{} ~~ \text {if} ~ x \ne 0. \end{array} \right. \end{aligned}$$

In the next example, we show how Proposition 1 applies to \(\phi = \Vert \cdot \Vert _\infty \), which will be needed in Sect. 4. The subdifferentials of other norms on \(\mathbb {R}^n\) can be computed from Proposition 1 in the same way.

Example 2

We use Proposition 1 to derive the subdifferential of \(\phi = \Vert \cdot \Vert _\infty \) at an arbitrary point x. We first recall that the dual norm of \(\Vert \cdot \Vert _\infty \) is \(\Vert \cdot \Vert _1\). If \(x = 0\), Proposition 1 implies

$$\begin{aligned} \partial \phi (0)= & {} \{ g \in \mathbb {R}^n ~|~ \Vert g\Vert _1 \le 1 \}\\= & {} \left\{ g \in \mathbb {R}^n ~|~ g = \sum _{i=1}^n \beta _i e_i,~ \beta \in [-1,1],~ \sum _{i=1}^n |\beta _i| \le 1 \right\} , \end{aligned}$$

If \(x \ne 0\), Proposition 1 gives

$$\begin{aligned} \partial \phi (x)= & {} \left\{ g \in \mathbb {R}^n ~|~ \Vert g\Vert _1 = 1, \langle g,x \rangle = \Vert x\Vert _\infty = \max _{1 \le i \le n} |x_i| \right\} \\= & {} \left\{ g \in \mathbb {R}^n ~|~ \sum _{j =1}^n |g_j| = 1, \sum _{j =1}^n g_j x_j = \Vert x\Vert _\infty \right\} . \end{aligned}$$

More explicitly, for \(x \ne 0\), we set

$$\begin{aligned} \mathcal {I} := \{i \in \{1, \ldots , n\} ~|~ \Vert x\Vert _\infty = |x_i| \} \end{aligned}$$

and note that \(\Vert x\Vert _\infty = \sum _{i \in \mathcal {I}} \beta _i |x_i|\) for any weights \(\beta _i \ge 0\) with \(\sum _{i \in \mathcal {I}} \beta _i = 1\), leading to

$$\begin{aligned} \partial \phi (x) = \left\{ g \in \mathbb {R}^n ~|~ g = \sum _{i \in \mathcal {I}} \beta _i~ \text {sign}(x_i) e_i,~~ \beta _i \ge 0,~~ \sum _{i \in \mathcal {I}} \beta _i = 1 \right\} . \end{aligned}$$
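As a numerical sanity check of Example 2, one valid subgradient for \(x \ne 0\) is obtained by putting the whole weight \(\beta \) on a single index attaining the maximum. The following sketch (plain Python, illustrative; the helper names are ours) builds this subgradient and verifies the subgradient inequality \(\Vert z\Vert _\infty \ge \Vert x\Vert _\infty + \langle g,z-x \rangle \) at random points.

```python
import random

def linf(x):
    """The l_inf norm of x."""
    return max(abs(t) for t in x)

def linf_subgrad(x):
    """One subgradient of ||.||_inf at x != 0: put all weight on a single
    index attaining the max (Example 2 with beta concentrated on one i)."""
    i = max(range(len(x)), key=lambda j: abs(x[j]))
    g = [0.0] * len(x)
    g[i] = 1.0 if x[i] > 0 else -1.0
    return g

random.seed(1)
x = [random.uniform(-3, 3) for _ in range(6)]
g = linf_subgrad(x)

# Subgradient inequality: ||z||_inf >= ||x||_inf + <g, z - x> for all z.
for _ in range(1000):
    z = [random.uniform(-5, 5) for _ in range(6)]
    lhs = linf(z)
    rhs = linf(x) + sum(gi * (zi - xi) for gi, zi, xi in zip(g, z, x))
    assert lhs >= rhs - 1e-12
print("subgradient inequality verified")
```

Note that \(\Vert g\Vert _1 = 1\) and \(\langle g,x \rangle = \Vert x\Vert _\infty \) hold by construction, which is exactly the characterization in Proposition 1.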

2 A review of optimal subgradient algorithm (OSGA)

In this section, we briefly review the main idea of the optimal subgradient algorithm (OSGA) proposed by Neumaier (2016). To this end, we first consider the convex constrained minimization problem

$$\begin{aligned} \begin{array}{ll} \min &{}~ f(x)\\ \mathop {\hbox { s.t.~}}&{}~ x \in C, \end{array} \end{aligned}$$
(7)

where \(f:C\rightarrow \overline{\mathbb {R}}\) is a proper and convex function defined on a nonempty, closed, and convex subset C of \(\mathcal {V}\). The aim is to derive a solution \(\widehat{x}\in C\) using first-order black-box information, i.e., function values and subgradients. OSGA (see Algorithm 2) is an optimal subgradient algorithm for problem (7) that constructs a sequence of iterates whose function values converge to the minimum with the optimal complexity. The primary objective is to monotonically reduce bounds on the error \(f(x_b)-\widehat{f}\) of the function values, where \(\widehat{f}:=f(\widehat{x})\) and \(x_b\) is the best point found so far.

In detail, OSGA considers a linear relaxation of f given by

$$\begin{aligned} f(x)\ge \gamma +\langle h,x\rangle ~{\hbox { for all}~}x\in C, \end{aligned}$$
(8)

where \(\gamma \in \mathbb {R}\) and \(h\in \mathcal {V}^*\), together with a continuously differentiable prox-function \(Q:C\rightarrow \mathbb {R}\) satisfying (6) and

$$\begin{aligned} Q_0:=\inf _{x\in C} Q(x) >0. \end{aligned}$$
(9)

Moreover, OSGA requires an efficient routine for finding a maximizer \(u:=U(\gamma ,h)\) and the optimal objective value \(\eta :=E(\gamma ,h)\) of the auxiliary problem

$$\begin{aligned} \begin{array}{ll} \sup &{} E_{\gamma ,h}(x)\\ \mathop {\hbox { s.t.~}}&{} x \in C, \end{array} \end{aligned}$$
(10)

where it is known that the supremum \(\eta \) is positive and the function \(E_{\gamma ,h}: C \rightarrow \mathbb {R}\) is defined by

$$\begin{aligned} E_{\gamma ,h}(x):= -\frac{\gamma +\langle h,x\rangle }{Q(x)}, \end{aligned}$$
(11)

with \(\gamma \in \mathbb {R}\), \(h\in \mathcal {V}^*\).
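For orientation, in the unconstrained case \(C = \mathcal {V} = \mathbb {R}^n\) with the prox-function \(Q(x) = Q_0 + \frac{1}{2}\Vert x\Vert _2^2\), the subproblem (10) admits a closed-form solution: stationarity of (11) gives \((\gamma +\langle h,x\rangle )\,x = Q(x)\,h\), so the maximizer lies on the line spanned by h, and a scalar quadratic equation determines it. The sketch below (plain Python, for illustration only; all names are ours) computes this maximizer and checks its optimality against random perturbations.

```python
import math, random

def E(gamma, h, Q0, x):
    """E_{gamma,h}(x) = -(gamma + <h,x>) / Q(x) with Q(x) = Q0 + 0.5*||x||_2^2."""
    return -(gamma + sum(a * b for a, b in zip(h, x))) / (Q0 + 0.5 * sum(t * t for t in x))

def subproblem_max(gamma, h, Q0):
    """Closed-form maximizer of E over R^n (h != 0): stationarity forces x = c*h
    with 0.5*||h||^2 * c^2 + gamma*c - Q0 = 0; the root below yields eta > 0."""
    hh = sum(t * t for t in h)                              # ||h||_2^2
    c = (-gamma - math.sqrt(gamma ** 2 + 2 * Q0 * hh)) / hh
    u = [c * t for t in h]
    return u, E(gamma, h, Q0, u)

random.seed(2)
h = [random.uniform(-1, 1) for _ in range(5)]
gamma, Q0 = -0.3, 1.0
u, eta = subproblem_max(gamma, h, Q0)
assert eta > 0
for _ in range(1000):                                       # u should beat random perturbations
    x = [ui + random.uniform(-2, 2) for ui in u]
    assert E(gamma, h, Q0, x) <= eta + 1e-12
print("eta =", eta)
```

Of the two roots of the scalar quadratic, the one used above makes \(\gamma +\langle h,u\rangle = -\sqrt{\gamma ^2+2Q_0\Vert h\Vert _2^2} < 0\), so \(\eta > 0\); the other root gives the minimum along the line.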

In Neumaier (2016), it is shown that OSGA attains the following bound on function values

$$\begin{aligned} 0\le f(x_b) -\widehat{f}\le \eta Q(\widehat{x}). \end{aligned}$$

Hence, by decreasing the error factor \(\eta \), the convergence to an \(\varepsilon \)-minimizer \(x_b\) is guaranteed by

$$\begin{aligned} 0\le f(x_b) -\widehat{f}\le \varepsilon , \end{aligned}$$

for some target tolerance \(\varepsilon >0\). In Neumaier (2016), it is shown that the number of iterations needed to achieve such an \(\varepsilon \)-minimizer is \(\mathcal {O}(\varepsilon ^{-1/2})\) for smooth f with Lipschitz continuous gradients and \(\mathcal {O}(\varepsilon ^{-2})\) for Lipschitz continuous nonsmooth f, which is optimal in both cases, cf. Nemirovsky and Yudin (1983). The algorithm does not require knowledge of global Lipschitz constants and has a low memory requirement. Hence, if the subproblem (10) can be solved efficiently, OSGA is appropriate for solving large-scale problems. In the next section, we show that OSGA can solve some structured nonsmooth problems with the complexity \(\mathcal {O}(\varepsilon ^{-1/2})\). Moreover, it is shown that, by selecting a suitable prox-function Q, the subproblem (10) can be solved efficiently for this class of problems.

As discussed in Neumaier (2016), OSGA uses the following scheme to update the parameters \(\alpha \), h, \(\gamma \), \(\eta \), and u:

[Algorithms 1 and 2 (OSGA) appear here as figures.]

If the best function value \(f_{x_b}\) is stored and updated, then each iteration of OSGA only requires the computation of two function values \(f_x\) and \(f_{x'}\) (Lines 6 and 11) and one subgradient \(g_x\) (Line 6).

3 Structured nonsmooth convex optimization

Let us consider the convex constrained problem

$$\begin{aligned} \begin{array}{ll} \min &{}~ f(\mathcal {A}x, \phi (x))\\ \mathop {\hbox { s.t.~}}&{}~ x \in C, \end{array} \end{aligned}$$
(12)

where \(f: \mathcal {U} \times \mathbb {R}\rightarrow \mathbb {R}\) is a proper and convex function that is smooth with Lipschitz continuous gradients with respect to both arguments and monotonically increasing with respect to the second argument, \(\mathcal {A}: \mathcal {V} \rightarrow \mathcal {U}\) is a linear operator, \(C \subseteq \mathcal {V}\) is a simple convex domain, and \(\phi :\mathcal {V} \rightarrow \mathbb {R}\) is a simple nonsmooth, real-valued, and convex loss function. This class of convex problems generalizes the composite problems considered in Nesterov (2013, 2015). As discussed in Sect. 2, applied directly, OSGA only attains the complexity \(\mathcal {O}(\varepsilon ^{-2})\) for this class of problems. Hence, we aim to reformulate problem (12) in such a way that OSGA attains the complexity \(\mathcal {O}(\varepsilon ^{-1/2})\). We reformulate problem (12) in the form

$$\begin{aligned} \begin{array}{ll} \min &{}~ \widetilde{f}(x, \xi )\\ \mathop {\hbox { s.t.~}}&{}~ (x, \xi ) \in \widetilde{C}, \end{array} \end{aligned}$$
(13)

where

$$\begin{aligned}&\widetilde{f}: \mathcal {V} \times \mathbb {R}\rightarrow \mathbb {R},~~~\widetilde{f}(x, \xi ) := f(\mathcal {A}x, \xi ), \end{aligned}$$
(14)
$$\begin{aligned}&\widetilde{C} := \{ (x, \xi ) \in \mathcal {V} \times \mathbb {R} ~|~ x \in C,~~\phi (x)\le \xi \}. \end{aligned}$$
(15)

By the assumptions on f, the reformulated function \(\widetilde{f}\) is smooth and has Lipschitz continuous gradients. OSGA can therefore handle problems of the form (13) with the complexity \(\mathcal {O}(\varepsilon ^{-1/2})\), at the price of adding a functional constraint to the feasible domain C. In the next subsection, we show how OSGA can effectively handle (13) with the feasible domain \(\widetilde{C}\). The version of OSGA that takes advantage of the structure of problem (13) is called OSGA-O.

Problems of the form (12) appear in many applications in signal and image processing, machine learning, statistics, economics, geophysics, and inverse problems. Let us consider the following example.

Example 3

(composite minimization) We consider the composite minimization problem

$$\begin{aligned} \begin{array}{ll} \min &{}~ f(\mathcal {A}x) + \phi (x)\\ \mathop {\hbox { s.t.~}}&{}~ x \in C, \end{array} \end{aligned}$$
(16)

where \(f: \mathcal {U} \rightarrow \overline{\mathbb {R}}\) is a smooth, proper, and convex function, \(\mathcal {A}:\mathcal {V} \rightarrow \mathcal {U}\) is a linear operator, and \(\phi : \mathcal {V} \rightarrow \mathbb {R}\) is a simple but nonsmooth, real-valued, and convex loss function. In this case, we reformulate (16) in the form (13) by setting \( \widetilde{f}(x,\xi ) := f(\mathcal {A}x) + \xi \). Let us now consider the linear inverse problem

$$\begin{aligned} y = \mathcal {A}x + \nu , \end{aligned}$$
(17)

where \(x \in \mathbb {R}^n\) is the original object, \(y \in \mathbb {R}^m\) is an observation, and \(\nu \in \mathbb {R}^m\) is additive or impulsive noise. The objective is to recover x from y by solving (17). In practice, this problem is typically underdetermined and ill-conditioned, and \(\nu \) is unknown. Hence, x is typically recovered by solving one of the minimization problems

$$\begin{aligned}&\begin{array}{ll} \min &{}~~ \displaystyle \frac{1}{2} \Vert y-\mathcal {A}x\Vert _2^2 + \frac{1}{2} \lambda \Vert x\Vert _2^2\\ \mathop {\hbox { s.t.~}}&{}~~ x \in \mathbb {R}^n, \end{array} \end{aligned}$$
(18)
$$\begin{aligned}&\begin{array}{ll} \min &{}~~ \displaystyle \frac{1}{2} \Vert y-\mathcal {A}x\Vert _2^2 + \lambda \Vert x\Vert _1\\ \mathop {\hbox { s.t.~}}&{}~~ x \in \mathbb {R}^n, \end{array} \end{aligned}$$
(19)

or

$$\begin{aligned} \begin{array}{ll} \min &{}~~ \displaystyle \frac{1}{2} \Vert y-\mathcal {A}x\Vert _2^2 + \frac{1}{2} \lambda _1 \Vert x\Vert _2^2 + \lambda _2 \Vert x\Vert _1\\ \mathop {\hbox { s.t.~}}&{}~~ x \in \mathbb {R}^n. \end{array} \end{aligned}$$
(20)

These problems can be reformulated in the form (13) by setting

$$\begin{aligned}&\widetilde{f}(x,\xi ) := \frac{1}{2} \Vert y-\mathcal {A}x\Vert _2^2 + \xi , ~~ \phi (x) := \frac{1}{2} \lambda \Vert x\Vert _2^2, \end{aligned}$$
(21)
$$\begin{aligned}&\widetilde{f}(x,\xi ) := \frac{1}{2} \Vert y-\mathcal {A}x\Vert _2^2 + \xi , ~~ \phi (x) := \lambda \Vert x\Vert _1, \end{aligned}$$
(22)

or

$$\begin{aligned} \widetilde{f}(x,\xi ) := \frac{1}{2} \Vert y-\mathcal {A}x\Vert _2^2 + \xi , ~~ \phi (x) := \frac{1}{2} \lambda _1 \Vert x\Vert _2^2 + \lambda _2 \Vert x\Vert _1, \end{aligned}$$
(23)

respectively.
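The reformulation (22) is easy to check numerically: along the slice \(\xi = \phi (x)\) of \(\widetilde{C}\), the reformulated objective coincides with the composite objective of (19), and since \(\widetilde{f}\) is increasing in \(\xi \), the constraint \(\phi (x)\le \xi \) is active at any solution. A minimal sketch (plain Python; the small dense matrix A and all data are illustrative stand-ins for the operator \(\mathcal {A}\)):

```python
import random

random.seed(3)
m, n, lam = 4, 6, 0.5
A = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(m)]
y = [random.uniform(-1, 1) for _ in range(m)]

def phi(x):                                   # phi(x) = lam * ||x||_1, cf. (22)
    return lam * sum(abs(t) for t in x)

def residual_term(x):                         # 0.5 * ||y - A x||_2^2
    r = [yi - sum(a * xj for a, xj in zip(row, x)) for row, yi in zip(A, y)]
    return 0.5 * sum(t * t for t in r)

def f_orig(x):                                # composite objective of (19)
    return residual_term(x) + phi(x)

def f_tilde(x, xi):                           # reformulated objective, cf. (22)
    return residual_term(x) + xi

for _ in range(100):
    x = [random.uniform(-2, 2) for _ in range(n)]
    # With xi = phi(x), both objectives agree ...
    assert abs(f_tilde(x, phi(x)) - f_orig(x)) < 1e-12
    # ... and f_tilde only grows as xi increases above phi(x).
    assert f_tilde(x, phi(x) + 1.0) > f_tilde(x, phi(x))
print("reformulation check passed")
```

The same check applies verbatim to (21) and (23) by swapping the definition of phi.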

3.1 New setup of optimal subgradient algorithm (OSGA-O)

This section describes the subproblem (10) for problems of the form (13). To this end, we introduce a prox-function and employ it to derive an inexpensive solution of the subproblem. We generally assume that the domain C is simple enough that \(\eta \) and \((\widehat{u},\widetilde{u})\) can be computed cheaply, in \(\mathcal {O}(n\log n)\) operations, say.

Let \(Q: \mathcal {V} \times \mathbb {R}\rightarrow \mathbb {R}\) be a function defined by

$$\begin{aligned} Q(x,\widetilde{x}):= Q_0 + \frac{1}{2}\left( \Vert x\Vert _2^2+\widetilde{x}^2\right) , \end{aligned}$$
(24)

where \(Q_0 > 0\). From \(g_Q (x,\widetilde{x}) = (x~ \widetilde{x})^T\), we obtain

$$\begin{aligned} \begin{aligned}&Q(z,\widetilde{z}) + \langle g_Q(z,\widetilde{z}),(x-z, \widetilde{x}-\widetilde{z}) \rangle + \frac{1}{2} \Vert (x - z,\widetilde{x} - \widetilde{z})^T\Vert _2^2 \\&\quad = Q_0 + \frac{1}{2} \left\langle (z , \widetilde{z})^T, (z,\widetilde{z})^T \right\rangle + \left\langle (z,\widetilde{z})^T,(x-z,\widetilde{x}-\widetilde{z})^T \right\rangle \\&\quad + \frac{1}{2} \left\langle (x-z,\widetilde{x}-\widetilde{z})^T, (x-z,\widetilde{x}-\widetilde{z})^T \right\rangle \\&\quad = Q_0 + \frac{1}{2} \left\langle (z,\widetilde{z})^T, (x,\widetilde{x})^T \right\rangle + \frac{1}{2} \left\langle (x,\widetilde{x})^T,(x-z,\widetilde{x}-\widetilde{z})^T \right\rangle \\&\quad = Q_0 + \frac{1}{2} \left\langle (x,\widetilde{x})^T, (x,\widetilde{x})^T \right\rangle = Q_0 + \frac{1}{2}\left( \Vert x\Vert _2^2+\widetilde{x}^2\right) \\&\quad = Q(x,\widetilde{x}). \end{aligned} \end{aligned}$$

This means that Q is strongly convex with convexity parameter 1 and, since \(Q_0>0\), that \(Q(x,\widetilde{x}) > 0\); hence Q is a prox-function. We now replace the linear relaxation (8) by

$$\begin{aligned} \widetilde{f}(x,\widetilde{x})\ge \gamma +\langle h,x\rangle + \widetilde{h}\widetilde{x} ~{\hbox { for all}~}(x,\widetilde{x})\in \widetilde{C}. \end{aligned}$$
(25)

Using this linear relaxation and the prox-function (24), the subproblem (10) is rewritten in the form

$$\begin{aligned} \begin{array}{ll} \sup &{} E_{\gamma , h, \widetilde{h}}(x,\widetilde{x})\\ \mathop {\hbox { s.t.~}}&{} (x,\widetilde{x}) \in \widetilde{C}, \end{array} \end{aligned}$$
(26)

where \(E_{\gamma , h, \widetilde{h}}: \mathcal {V} \times \mathbb {R}\rightarrow \mathbb {R}\) is differentiable and given by

$$\begin{aligned} E_{\gamma , h, \widetilde{h}} (x,\widetilde{x}) := - \frac{\gamma + \langle h,x \rangle + \widetilde{h}\widetilde{x}}{Q(x,\widetilde{x})}. \end{aligned}$$
(27)

Let \(( \widehat{u}, \widetilde{u})\in \mathcal {V} \times \mathbb {R}\) be a maximizer of (26) and \(\eta = E_{\gamma , h, \widetilde{h}} (\widehat{u},\widetilde{u})\). The next result gives a bound on the error \(\widetilde{f}(x_b,\widetilde{x}_b) -\widehat{f}\), which is important for providing the complexity analysis of OSGA-O.

Proposition 4

Let \(\gamma _b:=\gamma -\widetilde{f}(x_b,\widetilde{x}_b)\), \((\widehat{u},\widetilde{u}):=U(\gamma _b,h,\widetilde{h})\), and \(\eta :=E(\gamma _b,h,\widetilde{h})\). Then, we have

$$\begin{aligned} 0\le \widetilde{f}(x_b,\widetilde{x}_b) -\widehat{f}\le \eta Q(\widehat{x}, x^*), \end{aligned}$$
(28)

where \((\widehat{x}, x^*)\) is the solution of (13). In particular, if \((x_b,\widetilde{x}_b)\) is not yet optimal, then the choice \((\widehat{u},\widetilde{u})\) implies \(\eta =E(\gamma _b,h, \widetilde{h})>0\).

Proof

Using (25), (26), and (27), this follows similarly to Proposition 2.1 in Neumaier (2016). \(\square \)

Proposition 5

The maximizer \((\widehat{u},\widetilde{u})\) of (26) and the associated \(\eta \) satisfy

$$\begin{aligned}&\gamma +\langle h,\widehat{u}\rangle +\widetilde{h}\widetilde{u}=-\eta Q(\widehat{u}, \widetilde{u}), \end{aligned}$$
(29)
$$\begin{aligned}&\langle \eta \widehat{u}+h,x-\widehat{u}\rangle + (\eta \widetilde{u} + \widetilde{h})(\widetilde{x}-\widetilde{u})\ge 0 ~~~\text {for ~all}~ (x,\widetilde{x})\in \widetilde{C}. \end{aligned}$$
(30)

Proof

The problem (26) and the definition (27) imply that the function \(\zeta :C \times \mathbb {R}\rightarrow \mathbb {R}\) defined by

$$\begin{aligned} \zeta (x,\widetilde{x}):=\gamma +\langle h,x\rangle +\widetilde{h}\widetilde{x}+\eta Q(x,\widetilde{x}) \end{aligned}$$

is nonnegative on \(\widetilde{C}\) and vanishes at \((x,\widetilde{x})=(\widehat{u},\widetilde{u})\), i.e., the identity (29) holds. Since \(\zeta \) is continuously differentiable with gradient \(g_\zeta (x,\widetilde{x})= (\eta x+h, \eta \widetilde{x}+\widetilde{h})^T\), the first-order optimality condition at \((\widehat{u},\widetilde{u})\) holds, i.e.,

$$\begin{aligned} \langle \eta \widehat{u}+h,x-\widehat{u}\rangle +(\eta \widetilde{u}+\widetilde{h})(\widetilde{x}-\widetilde{u})\ge 0 \end{aligned}$$
(31)

for all \((x,\widetilde{x})\in \widetilde{C}\), giving the results. \(\square \)

The subsequent result gives a systematic way of solving the OSGA-O subproblem (26) for problems of the form (13).

Theorem 6

Let \(( \widehat{u}, \widetilde{u})\in \mathcal {V} \times \mathbb {R}\) be a maximizer of (26) and \(\eta = E_{\gamma , h, \widetilde{h}} (\widehat{u},\widetilde{u})\). Then, for \(y := -\eta ^{-1} h\), \(\lambda := \widetilde{u} + \eta ^{-1} \widetilde{h}\), we have

$$\begin{aligned} \widetilde{u} := \phi (\widehat{u}),~~~ \widehat{u} := \mathop {\hbox { argmin}}_{x\in C} \frac{1}{2}\Vert x-y\Vert _2^2+\lambda \phi (x). \end{aligned}$$
(32)

Furthermore, \(\eta \) and \(\lambda \) can be computed by solving the two-dimensional system of equations

$$\begin{aligned} \left\{ \begin{array}{l} \displaystyle \phi (\widehat{u}) + \eta ^{-1} \widetilde{h} - \lambda = 0,\\ \\ \displaystyle \eta \left( \frac{1}{2}(\Vert \widehat{u}\Vert _2^2 + \phi (\widehat{u})^2) + Q_0 \right) + \gamma + \langle h,\widehat{u} \rangle + \widetilde{h} \phi (\widehat{u}) = 0. \end{array} \right. \end{aligned}$$
(33)

Proof

From Proposition 5, at the maximizer \((\widehat{u}, \widetilde{u})\) of (26), we obtain

$$\begin{aligned} \eta \left( \frac{1}{2}(\Vert \widehat{u}\Vert _2^2 + (\widetilde{u})^2) + Q_0 \right) = -\gamma - \langle h,\widehat{u} \rangle - \widetilde{h} \widetilde{u} \end{aligned}$$
(34)

and

$$\begin{aligned} \langle \eta \widehat{u} + h, x-\widehat{u} \rangle + (\eta \widetilde{u} + \widetilde{h})(\widetilde{x}-\widetilde{u}) \ge 0~~~ \text {for~all}~ (x, \widetilde{x}) \in C \times \mathbb {R},~ \phi (x) \le \widetilde{x}. \end{aligned}$$
(35)

We conclude the proof in the next two parts:

In the first part, we show that (35) is equivalent to the following two inequalities:

$$\begin{aligned} \left\{ \begin{array}{l} \eta \widetilde{u} + \widetilde{h} \ge 0,\\ \\ \langle \eta \widehat{u} + h, x-\widehat{u} \rangle + (\eta \widetilde{u} + \widetilde{h})(\phi (x)-\widetilde{u}) \ge 0~~~ \text {for~all}~ x \in C. \end{array} \right. \end{aligned}$$
(36)

Assuming that these two inequalities hold, we prove (35). From \(\phi (x) \le \widetilde{x}\) and \(\eta \widetilde{u} + \widetilde{h} \ge 0\), we obtain

$$\begin{aligned} \begin{aligned} \langle \eta \widehat{u} + h, x-\widehat{u} \rangle + (\eta \widetilde{u} + \widetilde{h})(\widetilde{x}-\widetilde{u}) \ge \langle \eta \widehat{u} + h, x-\widehat{u} \rangle + (\eta \widetilde{u} + \widetilde{h})(\phi (x)-\widetilde{u}) \ge 0. \end{aligned} \end{aligned}$$

We now assume (35) and prove (36). The inequality \(\eta \widetilde{u} + \widetilde{h} \ge 0\) must hold; otherwise, by selecting \(\widetilde{x}\) large enough, we would get

$$\begin{aligned} \langle \eta \widehat{u} + h, x-\widehat{u} \rangle + (\eta \widetilde{u} + \widetilde{h})(\widetilde{x}-\widetilde{u}) < 0, \end{aligned}$$

which contradicts (35). Moreover, taking \(\widetilde{x} = \phi (x)\) in (35) yields the second inequality of (36).

In the second part, setting \(\widetilde{u} = \phi (\widehat{u})\), the second inequality of (36) shows that the objective below is nonnegative and vanishes at \(x = \widehat{u}\), so \(\widehat{u}\) is a solution of the minimization problem

$$\begin{aligned} \inf _{x \in C}~ \langle \eta \widehat{u} + h, x-\widehat{u} \rangle + (\eta \widetilde{u} + \widetilde{h}) (\phi (x)-\widetilde{u}). \end{aligned}$$

The first-order optimality condition (1) of this problem leads to

$$\begin{aligned} 0 \in \widehat{u} + \eta ^{-1} h + (\widetilde{u} + \eta ^{-1}\widetilde{h})~ \partial \phi (\widehat{u}) + N_C(\widehat{u}). \end{aligned}$$
(37)

On the other hand, by writing the first-order optimality condition (4) for the problem

$$\begin{aligned} \begin{array}{ll} \min &{} \displaystyle \frac{1}{2}\Vert x-y\Vert _2^2+\lambda \phi (x)\\ \text {s.t.} &{} x\in C, \end{array} \end{aligned}$$

we get

$$\begin{aligned} 0 \in \widehat{u} - y + \lambda ~ \partial \phi (\widehat{u}) + N_C(\widehat{u}). \end{aligned}$$
(38)

By comparing (37) and (38) and setting \(y = -\eta ^{-1} h\), \(\lambda = \widetilde{u} + \eta ^{-1}\widetilde{h}\), we conclude that both problems have the same minimizer \(\widehat{u}\). Since \(\widetilde{u} = \phi (\widehat{u})\), we obtain

$$\begin{aligned} \lambda = \widetilde{u} + \eta ^{-1} \widetilde{h} = \phi (\widehat{u}) + \eta ^{-1} \widetilde{h}. \end{aligned}$$

Using this and substituting \(\widetilde{u} = \phi (\widehat{u})\) in (34), \(\eta \) and \(\lambda \) are found by solving the system of nonlinear equations (33). This completes the proof. \(\square \)

In Theorem 6, if \(C = \mathcal {V}\), problem (32) reduces to the classical proximity operator \(\widehat{u} = \text {prox}_{\lambda \phi } (y)\) defined in (3). Hence, problem (32) is called proximal-like. Accordingly, the word “simple” in the definition of C means that problem (32) can be solved efficiently, either in a closed form or by an inexpensive iterative scheme. To give a clearer view of Theorem 6, we consider the following example.

Example 7

Let us consider the \(\ell _1\)-regularized least squares problem (19). Then, the problem can be reformulated as

$$\begin{aligned} \begin{array}{ll} \min &{} \displaystyle \frac{1}{2} \Vert y-Ax\Vert _2^2 + \xi \\ \mathop {\hbox { s.t.~}}&{} \displaystyle \Vert x\Vert _1 \le \xi . \end{array} \end{aligned}$$

Since \(\phi = \Vert \cdot \Vert _1\), the solution of (32) is given componentwise by \(\widehat{u}_i = \text {sign}(y_i)(|y_i|-\lambda )_+\) with \(y = -\eta ^{-1}h\) (see Table 6.1 in Ahookhosh (2015)). Substituting this into (33) gives

$$\begin{aligned} \left\{ \begin{array}{l} \displaystyle f_1(\eta , \lambda ) :=\sum _{i=1}^n (|y_i|-\lambda )_+ + \eta ^{-1} \widetilde{h} - \lambda = 0,\\ \\ \displaystyle f_2(\eta , \lambda ) :=\eta \left( \frac{1}{2}\left( \sum _{i=1}^n (|y_i|-\lambda )_+^2 + \left( \sum _{i=1}^n (|y_i|-\lambda )_+\right) ^2\right) + Q_0 \right) \\ \qquad \qquad \qquad + \gamma + \sum _{i=1}^n \left( h_i\, \text {sign}(y_i)+\widetilde{h}\right) (|y_i|-\lambda )_+ = 0. \end{array} \right. \end{aligned}$$

This is a two-dimensional system of nonsmooth equations that can be reformulated as a nonlinear least squares problem; see, e.g., (Pang and Qi 1993).
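To test a nonlinear solver on such a system, the residuals can be evaluated directly. The sketch below substitutes the soft-thresholded point \(\widehat{u}_i = \text {sign}(y_i)(|y_i|-\lambda )_+\) into \(\phi (\widehat{u}) = \Vert \widehat{u}\Vert _1\), \(\Vert \widehat{u}\Vert _2^2\), and \(\langle h, \widehat{u}\rangle \); the function names and the scalar inputs \(\widetilde{h}\), \(Q_0\), \(\gamma \) are illustrative assumptions, not fixed by the text.

```python
import math

def f1(eta, lam, y, h_tilde):
    """Residual f1 of Example 7: sum_i (|y_i| - lam)_+ + h_tilde/eta - lam."""
    return sum(max(abs(yi) - lam, 0.0) for yi in y) + h_tilde / eta - lam

def f2(eta, lam, y, h, h_tilde, Q0, gamma):
    """Residual f2, evaluated via the soft-thresholded point
    u_i = sign(y_i)(|y_i| - lam)_+ substituted into the general expression
    eta*(0.5*(||u||_2^2 + phi(u)^2) + Q0) + gamma + <h, u> + h_tilde*phi(u)."""
    u = [math.copysign(max(abs(yi) - lam, 0.0), yi) for yi in y]
    phi_u = sum(abs(ui) for ui in u)                 # phi(u) = ||u||_1
    sq = sum(ui * ui for ui in u)                    # ||u||_2^2
    inner = sum(hi * ui for hi, ui in zip(h, u))     # <h, u>
    return eta * (0.5 * (sq + phi_u ** 2) + Q0) + gamma + inner + h_tilde * phi_u
```

For fixed \(\eta \), the residual \(f_1\) is strictly decreasing in \(\lambda \), which a bracketing zero finder can exploit.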

Theorem 6 leads to the two-dimensional nonlinear system

$$\begin{aligned} F(\eta , \lambda ) := (f_1(\eta , \lambda ), f_2(\eta , \lambda ))^T = 0, \end{aligned}$$
(39)

where

$$\begin{aligned} \begin{array}{l} \displaystyle f_1(\eta , \lambda ) := \phi (\widehat{u}) + \eta ^{-1} \widetilde{h} - \lambda ,\\ \displaystyle f_2(\eta , \lambda ) := \eta \left( \frac{1}{2}(\Vert \widehat{u}\Vert _2^2 + \phi (\widehat{u})^2) + Q_0 \right) + \gamma + \langle h,\widehat{u} \rangle + \widetilde{h} \phi (\widehat{u}), \end{array} \end{aligned}$$

in which \(\widehat{u}\) solves the proximal-like problem (32) and \(\eta , \lambda >0\); Example 7 gives the explicit form of \(f_1(\eta , \lambda )\) and \(f_2(\eta , \lambda )\) for the \(\ell _1\) case. The system of nonsmooth equations (39) can be handled by replacing the vector \((\eta , \lambda )\) with \((|\eta |, |\lambda |)\) and solving

$$\begin{aligned} \begin{array}{ll} \min &{}~ \displaystyle \frac{1}{2} \Vert F(|\eta |,|\lambda |)\Vert _2^2\\ \mathop {\hbox { s.t.~}}&{}~ \eta ,\lambda \in \mathbb {R}\end{array} \end{aligned}$$
(40)

if \(f_1(\eta , \lambda )\) and \(f_2(\eta , \lambda )\) are nonsmooth. The problems (39) and (40), such as those arising in Example 7, can be solved by semismooth Newton methods or smoothing Newton methods (Qi and Sun 1999), quasi-Newton methods (Sun and Han 1997; Li et al. 2001), secant methods (Potra et al. 1998), and trust-region methods (Ahookhosh et al. 2015; Qi 1995).

In view of Theorem 6, we now provide a systematic way of solving the OSGA-O subproblem (26), which is summarized in the next scheme.

[Algorithm 3 (SUS): a scheme for solving the OSGA-O subproblem (26)]

To implement Algorithm 3 (SUS), we need a reliable nonlinear solver for the system of nonlinear equations (39) and a routine that solves the proximal-like problem (32) efficiently. In Sect. 4, we investigate solving the proximal-like problem (32) for some practically important loss functions \(\phi \). Algorithm 2 requires two solutions of the subproblem (26) (u in Line 6 and \(u'\) in Line 10), which are provided by Line 3 of SUS (with similar notation for \(u'\)).

3.2 Convergence analysis and complexity

In this section, we establish the complexity bounds of OSGA-O for Lipschitz continuous nonsmooth problems and smooth problems with Lipschitz continuous gradients. We also show that if \(\widetilde{f}\) is strictly convex, the sequence generated by OSGA-O converges to \((\widehat{x}, x^*)\).

To guarantee the existence of a minimizer for OSGA-O, we assume the following conditions.

(H1) :

The objective function \(\widetilde{f}\) is proper and convex;

(H2) :

The upper level set \(N_{\widetilde{f}}(x_0,\widetilde{x}_0) := \{(x,\widetilde{x}) \in \widetilde{C} \mid \widetilde{f}(x,\widetilde{x}) \le \widetilde{f}(x_0,\widetilde{x}_0)\}\) is bounded, for the starting point \((x_0,\widetilde{x}_0)\in \mathcal {V}\times \mathbb {R}\).

Since \(\widetilde{f}\) is convex and continuous, the upper level set \(N_{\widetilde{f}}(x_0,\widetilde{x}_0)\) is convex and closed; since \(\mathcal {V}\times \mathbb {R}\) is a finite-dimensional vector space, (H2) then implies that \(N_{\widetilde{f}}(x_0,\widetilde{x}_0)\) is compact. It follows from the continuity and properness of the objective function \(\widetilde{f}\) that it attains its global minimum on the upper level set \(N_{\widetilde{f}}(x_0,\widetilde{x}_0)\). Therefore, there is at least one minimizer \((\widehat{x}, x^*)\).

Since the underlying problem (13) is a special case of the problem (7) considered by Neumaier (2016), the complexity results of OSGA-O are the same as those of OSGA.

Theorem 8

Suppose that \(\widetilde{f}-\mu Q\) is convex and \(\mu \ge 0\). Then we have

  1. (i)

    (Nonsmooth complexity bound) If the points generated by Algorithm 2 stay in a bounded region of the interior of \(\widetilde{C}\), or if \(\widetilde{f}\) is Lipschitz continuous in \(\widetilde{C}\), the total number of iterations needed to reach a point with \(\widetilde{f}(x,\widetilde{x})\le \widetilde{f}(\widehat{x}, x^*)+\varepsilon \) is at most \(\mathcal {O}((\varepsilon ^2+\mu \varepsilon )^{-1})\). Thus the asymptotic worst case complexity is \(\mathcal {O}(\varepsilon ^{-2})\) when \(\mu =0\) and \(\mathcal {O}(\varepsilon ^{-1})\) when \(\mu >0\).

  2. (ii)

    (Smooth complexity bound) If \(\widetilde{f}\) has Lipschitz continuous gradients with Lipschitz constant L, the total number of iterations needed by Algorithm 2 to reach a point with \(\widetilde{f}(x,\widetilde{x})\le \widetilde{f}(\widehat{x}, x^*)+\varepsilon \) is at most \(\mathcal {O}(\varepsilon ^{-1/2})\) if \(\mu =0\), and at most \(\displaystyle \mathcal {O}(|\log \varepsilon |\sqrt{L/\mu })\) if \(\mu >0\).

Proof

Since all assumptions of Theorems 4.1 and 4.2, Propositions 5.2 and 5.3, and Theorem 5.1 of Neumaier (2016) are satisfied, the results remain valid. \(\square \)

Indeed, if a nonsmooth problem can be reformulated as (13) with a nonsmooth loss function \(\phi \), then OSGA-O solves the reformulated problem with the complexity \(\mathcal {O}(\varepsilon ^{-1/2})\) of smooth problems, for an arbitrary accuracy parameter \(\varepsilon \). The next result shows that the sequence \(\{(x_k,\widetilde{x}_k)\}\) generated by OSGA-O converges to \((\widehat{x}, x^*)\) if the objective \(\widetilde{f}\) is strictly convex and \((\widehat{x}, x^*) \in \text {int}~ \widetilde{C}\), where \(\text {int}~ \widetilde{C}\) denotes the interior of \(\widetilde{C}\).

Proposition 9

Suppose that \(\widetilde{f}\) is strictly convex, then the sequence \(\{(x_k,\widetilde{x}_k)\}\) generated by OSGA-O is convergent to \((\widehat{x}, x^*)\) if \((\widehat{x}, x^*) \in \text {int}~ \widetilde{C}\).

Proof

Since \(\widetilde{f}\) is strictly convex, the minimizer \((\widehat{x}, x^*)\) is unique. By \((\widehat{x}, x^*) \in \text {int}~ \widetilde{C}\), there exists a small \(\delta >0\) such that the neighborhood

$$\begin{aligned} N(\widehat{x}, x^*) := \{(x,\widetilde{x}) \in \widetilde{C} \mid \Vert (x,\widetilde{x})-(\widehat{x}, x^*)\Vert \le \delta \} \end{aligned}$$

is contained in \(\widetilde{C}\) and it is a convex and compact set. Let \((x_\delta ,\widetilde{x}_\delta )\) be a minimizer of the problem

$$\begin{aligned} \begin{array}{ll} \min &{}~ \widetilde{f}(x,\widetilde{x})\\ \text {s.t.} &{}~ (x,\widetilde{x}) \in \partial N(\widehat{x}, x^*), \end{array} \end{aligned}$$
(41)

where \(\partial N(\widehat{x}, x^*)\) denotes the boundary of \(N(\widehat{x}, x^*)\). Set \(\varepsilon _\delta := \widetilde{f}(x_\delta ,\widetilde{x}_\delta )-\widehat{f}\) and consider the upper level set

$$\begin{aligned} N_{\widetilde{f}}(x_\delta ,\widetilde{x}_\delta ) := \{(x,\widetilde{x}) \in \widetilde{C} \mid \widetilde{f}(x,\widetilde{x}) \le \widetilde{f}(x_\delta ,\widetilde{x}_\delta ) = \widehat{f}+\varepsilon _\delta \}. \end{aligned}$$

Now, Theorem 8 implies that the algorithm attains an \(\varepsilon _\delta \)-solution of (13) in a finite number \(\kappa \) of iterations. Hence, after \(\kappa \) iterations, the best point \((x_b,\widetilde{x}_b)\) attained by OSGA-O satisfies \(\widetilde{f}(x_b,\widetilde{x}_b) \le \widehat{f}+\varepsilon _\delta \), i.e., \((x_b,\widetilde{x}_b) \in N_{\widetilde{f}}(x_\delta ,\widetilde{x}_\delta )\). We now show that \(N_{\widetilde{f}}(x_\delta ,\widetilde{x}_\delta ) \subseteq N(\widehat{x}, x^*)\). To prove this by contradiction, suppose that there exists \((x,\widetilde{x}) \in N_{\widetilde{f}}(x_\delta ,\widetilde{x}_\delta ) \setminus N(\widehat{x}, x^*)\). Since \((x,\widetilde{x}) \not \in N(\widehat{x}, x^*)\), we have \(\Vert (x,\widetilde{x})-(\widehat{x}, x^*)\Vert > \delta \). Therefore, there exists \(\lambda _0 \in {]0,1[}\) such that

$$\begin{aligned} \Vert \lambda _0 (x,\widetilde{x}) + (1-\lambda _0)(\widehat{x}, x^*) - (\widehat{x}, x^*)\Vert = \delta , \end{aligned}$$

i.e., \(\lambda _0 (x,\widetilde{x}) + (1-\lambda _0)(\widehat{x}, x^*) \in \partial N(\widehat{x}, x^*)\). It follows from (41), the strict convexity of \(\widetilde{f}\), and \(\widetilde{f}(x,\widetilde{x}) \le \widetilde{f}(x_\delta ,\widetilde{x}_\delta )\) that

$$\begin{aligned} \begin{aligned} \widetilde{f}(x_\delta ,\widetilde{x}_\delta )&\le \widetilde{f}(\lambda _0 (x,\widetilde{x}) + (1-\lambda _0)(\widehat{x}, x^*)) < \lambda _0 \widetilde{f}(x,\widetilde{x}) + (1-\lambda _0) \widetilde{f}(\widehat{x}, x^*)\\&\le \lambda _0 \widetilde{f}(x_\delta ,\widetilde{x}_\delta ) + (1-\lambda _0) \widetilde{f}(x_\delta ,\widetilde{x}_\delta ) = \widetilde{f}(x_\delta ,\widetilde{x}_\delta ), \end{aligned} \end{aligned}$$

which is a contradiction. Hence \(N_{\widetilde{f}}(x_\delta ,\widetilde{x}_\delta ) \subseteq N(\widehat{x}, x^*)\), so \((x_b,\widetilde{x}_b) \in N(\widehat{x}, x^*)\); since \(\delta >0\) can be taken arbitrarily small, the sequence converges to \((\widehat{x}, x^*)\), giving the result. \(\square \)

4 Solving proximal-like subproblem

In this section, we show that the proximal-like problem (32) can be solved in a closed form for many special cases appearing in applications. To this end, we first consider unconstrained problems (\(C=\mathcal {V}\)) and then study some problems with simple constrained domains (\(C \ne \mathcal {V}\)). Although finding proximal points is a mature area in convex nonsmooth optimization (cf. (Combettes and Pesquet 2011; Parikh and Boyd 2013)), we here address the solution of several proximal-like problems of the form (32) appearing in applications that, to the best of our knowledge, have not been studied in the literature.

4.1 Unconstrained examples (\(C = \mathcal {V}\))

We here consider several interesting unconstrained proximal problems appearing in applications and explain how the associated OSGA-O auxiliary problem (32) can be solved.

In recent years, interest in regularization with weighted norms has grown alongside many emerging applications; see, e.g., (Daubechies et al. 2010; Rauhut and Ward 2016). Let d be a vector in \(\mathbb {R}^n\) such that \(d_i \ne 0\) for \(i = 1, \ldots ,n\). Then, we define the weight matrix \(D:= \text {diag}(d)\), a diagonal matrix with \(D_{i,i} = d_i\) for \(i = 1, \ldots ,n\); clearly, D is invertible. The next two results show how to compute a solution of the problem (32) for special cases of \(\phi \) arising frequently in applications.

Proposition 10

Let \(D:= \text {diag}(d)\), where \(d \in \mathbb {R}^n\) with \(d_i \ne 0\), for \(i = 1, \ldots ,n\). If \(\phi (x) = \Vert Dx\Vert _1\), the proximity operator (32) is given by

$$\begin{aligned} \left( \text {prox}_{\lambda \phi }(y) \right) _i = \text {sign}(y_i)(|y_i|-\lambda |d_i|)_+, \end{aligned}$$
(42)

for \(i = 1, \ldots , n\).

Proof

See Proposition 6.2.1 in Ahookhosh (2015). \(\square \)
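Formula (42) is a componentwise soft-thresholding step with threshold \(\lambda |d_i|\) and is straightforward to implement; a minimal sketch with plain Python lists (the function name is illustrative):

```python
import math

def prox_weighted_l1(y, d, lam):
    """Proximity operator of phi(x) = ||Dx||_1 with D = diag(d), cf. (42):
    componentwise soft thresholding at the level lam*|d_i|."""
    return [math.copysign(max(abs(yi) - lam * abs(di), 0.0), yi)
            for yi, di in zip(y, d)]
```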

Proposition 11

Let \(D:= \text {diag}(d)\), where \(d \in \mathbb {R}^n\) and \(d_i \ne 0\), for \(i = 1, \ldots ,n\). If \(\phi (x) = \Vert Dx\Vert _2\), the proximity operator (32) is given by \(\text {prox}_{\lambda \phi }(y)=0\) if \(\Vert D^{-1}y\Vert _2 \le \lambda \) and otherwise, for \(i=1, \ldots , n\),

$$\begin{aligned} \left( \text {prox}_{\lambda \phi }(y) \right) _i = \frac{\tau y_i}{\tau + \lambda d_i^2}, \end{aligned}$$

where \(\tau \) is the unique solution of the one-dimensional nonlinear equation

$$\begin{aligned} \sum _{i=1}^n \frac{d_i^2 y_i^2}{(\tau + \lambda d_i^2)^2} - 1 = 0. \end{aligned}$$

Proof

The optimality condition (5) shows that \(u = \text {prox}_{\lambda \phi } (y)\) if and only if

$$\begin{aligned} 0 \in u -y + \lambda ~ \partial \Vert Du\Vert _2. \end{aligned}$$
(43)

We consider two cases: (i) \(\Vert D^{-1}y\Vert _2 \le \lambda \); (ii) \(\Vert D^{-1}y\Vert _2 > \lambda \).

    1. Case (i).

      Let \(\Vert D^{-1}y\Vert _2 \le \lambda \). Then, we show that \(u=0\) satisfies (43). If \(u=0\), Proposition 1 implies \(\partial \phi (0) = \{g \in \mathcal {V}^* \mid \Vert D^{-1}g\Vert _2 \le 1 \}\). Hence, \(u=0\) satisfies (43) if \(y \in \{g \in \mathcal {V}^* \mid \Vert D^{-1}g\Vert _2 \le \lambda \}\), leading to \(\text {prox}_{\lambda \phi }(y) = 0\).

    2. Case (ii).

      Let \(\Vert D^{-1}y\Vert _2 > \lambda \). Case (i) implies \(u \ne 0\). Proposition 1 implies \(\partial \phi (u) = \{D^T Du/\Vert Du\Vert _2\}\), and the optimality condition (5) yields

      $$\begin{aligned} u-y+ \lambda ~ D^T\frac{Du}{\Vert Du\Vert _2} = 0. \end{aligned}$$

      By this and setting \(\tau = \Vert Du\Vert _2\), we get

      $$\begin{aligned} \left( 1 + \frac{\lambda d_i^2}{\tau } \right) u_i - y_i = 0, \end{aligned}$$

      leading to

      $$\begin{aligned} u_i = \frac{\tau y_i}{\tau + \lambda d_i^2}, \end{aligned}$$

      for \(i=1, \ldots , n\). Substituting this into \(\tau = \Vert Du\Vert _2\) implies

      $$\begin{aligned} \sum _{i=1}^n \frac{d_i^2 y_i^2}{(\tau + \lambda d_i^2)^2} = 1. \end{aligned}$$

      We define the function \(\psi : {]0,+\infty [} \rightarrow \mathbb {R}\) by

      $$\begin{aligned} \psi (\tau ):= \sum _{i=1}^n \frac{d_i^2 y_i^2}{(\tau + \lambda d_i^2)^2} - 1, \end{aligned}$$

      where it is clear that \(\psi \) is strictly decreasing and

      $$\begin{aligned} \lim _{\tau \rightarrow 0} \psi (\tau ) = \frac{1}{\lambda ^2} \sum _{i=1}^n \frac{y_i^2}{d_i^2} - 1 = \frac{1}{\lambda ^2} \left( \Vert D^{-1}y\Vert _2^2 - \lambda ^2 \right) , ~~~ \lim _{\tau \rightarrow +\infty } \psi (\tau ) = -1. \end{aligned}$$

      It can be deduced from \(\Vert D^{-1}y\Vert _2 > \lambda \) and the intermediate value theorem that there exists a unique \(\widehat{\tau } \in {]0,+\infty [}\) such that \(\psi (\widehat{\tau })=0\), giving the result.\(\square \)

We here emphasize that if \(D=I\) (I denotes the identity matrix) then the proximity operator for \(\phi (\cdot ) = \Vert \cdot \Vert _2\) is given by

$$\begin{aligned} \text {prox}_{\lambda \phi }(y) = (1-\lambda /\Vert y\Vert _2)_+ y, \end{aligned}$$

cf. (Parikh and Boyd 2013). If one solves the equation \(\psi (\tau )=0\) approximately and an initial interval \([a, b]\) with \(\psi (a)\psi (b)<0\) is available, then a solution can be computed to an \(\varepsilon \)-accuracy by the bisection scheme in \(\mathcal {O}(\log _2((b-a)/\varepsilon ))\) iterations; see, e.g., (Neumaier 2001). However, it is preferable to use a more sophisticated zero finder such as the secant bisection scheme (Algorithm 5.2.6, (Neumaier 2001)). If an interval \([a, b]\) with a sign change is available, one can also use the MATLAB \(\mathtt {fzero}\) function, which combines the bisection scheme, inverse quadratic interpolation, and the secant method.
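Combining Proposition 11 with a bracketing bisection gives a complete, dependency-free routine for \(\text {prox}_{\lambda \phi }\) with \(\phi = \Vert D \cdot \Vert _2\). The sketch below brackets a sign change of \(\psi \) by doubling and then bisects; the function name and the tolerance are illustrative choices.

```python
def prox_weighted_l2(y, d, lam, tol=1e-12):
    """Proximity operator of phi(x) = ||Dx||_2, D = diag(d) with d_i != 0
    (Proposition 11): returns 0 if ||D^{-1}y||_2 <= lam, otherwise solves
    psi(tau) = 0 by bisection and rescales y componentwise."""
    if sum((yi / di) ** 2 for yi, di in zip(y, d)) <= lam ** 2:
        return [0.0] * len(y)

    def psi(tau):
        return sum(di**2 * yi**2 / (tau + lam * di**2) ** 2
                   for yi, di in zip(y, d)) - 1.0

    # psi is strictly decreasing with psi(0+) > 0 and psi(+inf) = -1,
    # so doubling b until psi(b) <= 0 brackets the unique root.
    a, b = 0.0, 1.0
    while psi(b) > 0.0:
        a, b = b, 2.0 * b
    while b - a > tol * max(1.0, b):
        m = 0.5 * (a + b)
        if psi(m) > 0.0:
            a = m
        else:
            b = m
    tau = 0.5 * (a + b)
    return [tau * yi / (tau + lam * di**2) for yi, di in zip(y, d)]
```

For \(D=I\) this reproduces the closed form \((1-\lambda /\Vert y\Vert _2)_+ y\) quoted above.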

Grouped variables typically appear in high-dimensional statistical learning problems. For example, in data mining applications, categorical features are encoded by a set of dummy variables forming a group. Another interesting example is learning sparse additive models in statistical inference, where each component function can be represented using basis expansions and thus can be treated as a group. For such problems (see (Liu et al. 2010) and references therein), it is more natural to select groups of variables instead of individual ones when a sparse model is preferred.

In the following two results, we show how the proximity operator \(\text {prox}_{\lambda \phi } (\cdot )\) can be computed for the mixed-norms \(\phi (\cdot ) = \Vert \cdot \Vert _{1,2}\) and \(\phi (\cdot ) = \Vert \cdot \Vert _{1,\infty }\), which are especially important in the context of sparse optimization and sparse recovery with grouped variables.

Proposition 12

Let \(\phi (\cdot ) = \Vert \cdot \Vert _{1,2}\). Then, the proximity operator (32) is given by

$$\begin{aligned} (\text {prox}_{\lambda \phi }(y))_{g_i} = \left( 1 - \frac{\lambda }{\Vert y_{g_i}\Vert _2} \right) _+ y_{g_i}, \end{aligned}$$
(44)

for \(i = 1, \ldots , m\).

Proof

Since \(u = (u_{g_1}, \ldots , u_{g_m}) \in \mathbb {R}^{n_1} \times \ldots \times \mathbb {R}^{n_m}\) and \(\phi \) is separable with respect to the grouped variables, we fix the index \(i \in \{1, \ldots , m\}\). The optimality condition (5) shows that \(u_{g_i} = \text {prox}_{\lambda \phi } (y_{g_i})\) if and only if

$$\begin{aligned} 0 \in u_{g_i} -y_{g_i} + \lambda ~ \partial \Vert u_{g_i}\Vert _2, \end{aligned}$$
(45)

for \(i = 1, \ldots , m\). We now consider two cases: (i) \(\Vert y_{g_i}\Vert _2 \le \lambda \); (ii) \(\Vert y_{g_i}\Vert _2 > \lambda \).

  1. Case (i).

    Let \(\Vert y_{g_i}\Vert _2 \le \lambda \). Then, we show that \(u_{g_i}=0\) satisfies (45). If \(u_{g_i}=0\), Proposition 1 implies \(\partial \phi (0_{g_i}) = \{g \in \mathbb {R}^{n_i} \mid \Vert g\Vert _2 \le 1 \}\). By substituting this into (45), we get that \(u_{g_i}=0\) satisfies (45) if \(y_{g_i} \in \{g \in \mathbb {R}^{n_i} \mid \Vert g\Vert _2 \le \lambda \}\), which leads to \(\text {prox}_{\lambda \phi }(y_{g_i}) = 0_{g_i}\). Since the right-hand side of (44) is also zero, (44) holds.

  2. Case (ii).

    Let \(\Vert y_{g_i}\Vert _2 > \lambda \). Then, Case (i) implies that \(u_{g_i} \ne 0\). From Proposition 1, we obtain

    $$\begin{aligned} \partial \phi (u_{g_i}) = \left\{ \frac{u_{g_i}}{\Vert u_{g_i}\Vert _2} \right\} , \end{aligned}$$
    (46)

    where \(i=1, \ldots , m\) and \(\Vert y_{g_i}\Vert _2 > \lambda \). Then (45) and (46) imply

    $$\begin{aligned} u_{g_i} - y_{g_i}+\lambda \frac{u_{g_i}}{\Vert u_{g_i}\Vert _2} = 0, \end{aligned}$$

    leading to

    $$\begin{aligned} \left( 1 + \frac{\lambda }{\Vert u_{g_i}\Vert _2} \right) u_{g_i} = y_{g_i} \end{aligned}$$

    giving \(u_{g_i} = \mu _i y_{g_i}\) for some scalar \(\mu _i > 0\). Substituting this into the previous identity and solving it with respect to \(\mu _i\) yield

    $$\begin{aligned} \mu _i = \left( 1-\frac{\lambda }{\Vert y_{g_i}\Vert _2} \right) _+,~~~u_{g_i} = \mu _i y_{g_i}, \end{aligned}$$

    completing the proof. \(\square \)
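Formula (44) is block soft thresholding applied to each group independently; a minimal sketch, with the groups given as a list of lists and a hypothetical function name:

```python
import math

def prox_group_l2(y_groups, lam):
    """Proximity operator of the mixed norm ||.||_{1,2} (Proposition 12):
    each group is scaled by (1 - lam/||y_g||_2)_+, i.e. block soft thresholding."""
    out = []
    for g in y_groups:
        norm = math.sqrt(sum(v * v for v in g))
        scale = max(1.0 - lam / norm, 0.0) if norm > 0.0 else 0.0
        out.append([scale * v for v in g])
    return out
```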

Proposition 13

Let \(\phi (\cdot ) = \Vert \cdot \Vert _{1,\infty }\). Then, the proximity operator (32) is given by

$$\begin{aligned} (\text {prox}_{\lambda \phi }(y_{g_i}))_{g_i}^j = \left\{ \begin{array}{ll} 0 &{} ~~\text {if}~ \Vert y_{g_i}\Vert _1 \le \lambda ,\\ \text {sign}(y_{g_i}^j) u_\infty ^i &{} ~~\text {if}~ \Vert y_{g_i}\Vert _1>\lambda ,~ j \in \mathcal {I}_{g_i},\\ y_{g_i}^j &{} ~~\text {if}~ \Vert y_{g_i}\Vert _1>\lambda ,~ j \not \in \mathcal {I}_{g_i}, \end{array} \right. \end{aligned}$$
(47)

for \(i = 1, \ldots , m\), where

$$\begin{aligned} u_\infty ^i := \frac{1}{\widehat{k}_i} \left( \sum _{j \in \mathcal {I}_{g_i}} |y_{g_i}^j| - \lambda \right) \end{aligned}$$
(48)

with

$$\begin{aligned} \mathcal {I}_{g_i} := \{l_{g_i}^1, \ldots , l_{g_i}^{\widehat{k}_i}\} \end{aligned}$$
(49)

in which \(\widehat{k}_i\) is the smallest \(k \in \{1, \ldots , n_i-1\}\) such that

$$\begin{aligned} \frac{1}{k} \left( \sum _{j=1}^{k} v_{g_i}^j-\lambda \right) \ge v_{g_i}^{k+1}, \end{aligned}$$
(50)

where \(v_{g_i}^j := |y_{g_i}^{l_{g_i}^j}|\) and \(l_{g_i}^1, \ldots , l_{g_i}^{n_i}\) is a permutation of \(1, \ldots , n_i\) such that \(v_{g_i}^1 \ge v_{g_i}^2 \ge \ldots \ge v_{g_i}^{n_i}\). If (50) is not satisfied for any \(k \in \{1, \ldots , n_i-1\}\), then \(\widehat{k}_i=n_i\), for \(i = 1, \ldots , m\).

Proof

Since \(u = (u_{g_1}, \ldots , u_{g_m}) \in \mathbb {R}^{n_1} \times \ldots \times \mathbb {R}^{n_m}\) and \(\phi \) is separable with respect to the grouped variables, we fix the index \(i \in \{1, \ldots , m\}\). The optimality condition (5) shows that \(u_{g_i} = \text {prox}_{\lambda \phi } (y_{g_i})\) if and only if

$$\begin{aligned} 0 \in u_{g_i} -y_{g_i} + \lambda ~ \partial \Vert u_{g_i}\Vert _\infty . \end{aligned}$$
(51)

We now consider two cases: (i) \(\Vert y_{g_i}\Vert _1 \le \lambda \); (ii) \(\Vert y_{g_i}\Vert _1 > \lambda \).

  1. Case (i).

    Let \(\Vert y_{g_i}\Vert _1 \le \lambda \). Then, we show that \(u_{g_i}=0\) satisfies (51). If \(u_{g_i}=0\), the subdifferential of \(\phi \) derived in Example 2 is \(\partial \phi (0_{g_i}) = \{g \in \mathbb {R}^{n_i} \mid \Vert g\Vert _1 \le 1 \}\). By substituting this into (51), we get that \(u_{g_i}=0\) satisfies (51) if \(y_{g_i} \in \{g \in \mathbb {R}^{n_i} \mid \Vert g\Vert _1 \le 1 \}\), i.e., \(\text {prox}_{\lambda \phi }(y_{g_i}) = 0_{g_i}\).

  2. Case (ii).

    Let \(\Vert y_{g_i}\Vert _1 > \lambda \). From Case (i), we have \(u_{g_i} \ne 0\). We show that

    $$\begin{aligned} u_{g_i}^j = \left\{ \begin{array}{ll} \text {sign}(y_{g_i}^j) u_\infty ^i &{} ~~\text {if}~ j \in \mathcal {I}_{g_i},\\ y_{g_i}^j &{} ~~\text {otherwise}, \end{array} \right. \end{aligned}$$
    (52)

    with \(\mathcal {I}_{g_i}\) defined in (49), satisfies (51). Hence, using the subdifferential of \(\phi \) derived in Example 2, there exist coefficients \(\beta _{g_i}^j\), for \(j \in \mathcal {I}_{g_i}\), such that

    $$\begin{aligned} u_{g_i} - y_{g_i} + \lambda \sum _{j \in \mathcal {I}_{g_i}} \beta _{g_i}^j~ \text {sign}(u_{g_i}^j) e_j = 0, \end{aligned}$$
    (53)

    where

    $$\begin{aligned} \beta _{g_i}^j \ge 0 ~~ j\in \mathcal {I}_{g_i}, ~~~\sum _{j \in \mathcal {I}_{g_i}} \beta _{g_i}^j = 1. \end{aligned}$$
    (54)

    Let \(u_{g_i}\) be the vector defined in (52). We define

    $$\begin{aligned} \beta _{g_i}^j := \frac{|y_{g_i}^j| - u_\infty ^i}{\lambda }, \end{aligned}$$
    (55)

    for \(j \in \mathcal {I}_{g_i} = \{l_{g_i}^1, \ldots , l_{g_i}^{\widehat{k}_i}\}\) with \(u_\infty ^i\) defined in (48). We show that the choice (55) satisfies (53). We first show \(u_\infty ^i >0\): this follows from (48) and (50) if \(\widehat{k}_i < n_i\), and from \(\Vert y_{g_i}\Vert _1> \lambda \) if \(\widehat{k}_i=n_i\). Using (52) and (55), we come to

    $$\begin{aligned} \begin{aligned} u_{g_i}^j - y_{g_i}^j \!+\! \lambda \beta _{g_i}^j~ \text {sign}(u_{g_i}^j)&\!=\! \text {sign}(y_{g_i}^j) u_\infty ^i-y_{g_i}^j \!+\! \left( |y_{g_i}^j| \!-\! u_\infty ^i \right) \text {sign}(\text {sign}(y_{g_i}^j) u_\infty ^i)\\&= \text {sign}(y_{g_i}^j) u_\infty ^i-y_{g_i}^j + \left( |y_{g_i}^j|-u_\infty ^i\right) \text {sign}(y_{g_i}^j) = 0, \end{aligned} \end{aligned}$$

    for \(j \in \mathcal {I}_{g_i}\). For \(j \not \in \mathcal {I}_{g_i}\), we have \(u_{g_i}^j - y_{g_i}^j = 0\). Hence, (53) is satisfied componentwise. It remains to show that (54) holds. From (50), we have that \(|y_{g_i}^j| \ge u_\infty ^i\), for \(j \in \mathcal {I}_{g_i}\). This and (55) yield \(\beta _{g_i}^j \ge 0\) for \(j \in \mathcal {I}_{g_i}\). It can be deduced from (48) that

    $$\begin{aligned} \sum _{j = 1}^{\widehat{k}_i} \beta _{g_i}^j = \frac{1}{\lambda } \sum _{j = 1}^{\widehat{k}_i} |y_{g_i}^j| - \frac{\widehat{k}_i}{\lambda } u_\infty ^i = \frac{1}{\lambda } \sum _{j = 1}^{\widehat{k}_i} |y_{g_i}^j| - \frac{1}{\lambda }\left( \sum _{j = 1}^{\widehat{k}_i} |y_{g_i}^j|-\lambda \right) = 1, \end{aligned}$$

    giving the results. \(\square \)

Corollary 14

Let \(\phi (\cdot ) = \Vert \cdot \Vert _\infty \). Then, the proximity operator (32) is given by

$$\begin{aligned} (\text {prox}_{\lambda \phi }(y))_i = \left\{ \begin{array}{ll} 0 &{} ~~\text {if}~\Vert y\Vert _1 \le \lambda ,\\ \text {sign}(y_i) u_\infty &{} ~~\text {if}~ \Vert y\Vert _1> \lambda ,~ i \in \mathcal {I},\\ y_i &{} ~~\text {if}~ \Vert y\Vert _1 > \lambda ,~ i \not \in \mathcal {I}, \end{array} \right. \end{aligned}$$
(56)

for \(i = 1, \ldots , n\), where

$$\begin{aligned} u_\infty := \frac{1}{\widehat{k}} \left( \sum _{i \in \mathcal {I}} |y_i| - \lambda \right) \end{aligned}$$
(57)

with

$$\begin{aligned} \mathcal {I} := \{l_1, \ldots , l_{\widehat{k}}\} \end{aligned}$$
(58)

in which \(\widehat{k}\) is the smallest \(k \in \{1, \ldots , n-1\}\) such that

$$\begin{aligned} \frac{1}{k} \left( \sum _{i=1}^{k} v_i-\lambda \right) \ge v_{k+1}, \end{aligned}$$
(59)

where \(v_i := |y_{l_i}|\) and \(l_1, \ldots , l_n\) is a permutation of \(1, \ldots , n\) such that \(v_1 \ge v_2 \ge \cdots \ge v_n\). If (59) is not satisfied for \(k \in \{1, \ldots , n-1\}\), then \(\widehat{k}=n\).

Proof

The proof is straightforward from Proposition 13 by setting \(m=1\), \(n_1=n\), \(y_{g_1}=y\), and \(\mathcal {I}_{g_1}=\mathcal {I}\). \(\square \)
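Corollary 14 can be implemented by sorting the \(|y_i|\) once: since the proof shows \(|y_i| \ge u_\infty \) for \(i \in \mathcal {I}\) and \(|y_i| \le u_\infty \) otherwise, (56) amounts to clipping each \(|y_i|\) at the level \(u_\infty \). A sketch (illustrative function name):

```python
import math

def prox_linf(y, lam):
    """Proximity operator of lam*||.||_inf (Corollary 14): if ||y||_1 > lam,
    clip each |y_i| at the level u_inf from (57); otherwise return 0."""
    n = len(y)
    if sum(abs(t) for t in y) <= lam:
        return [0.0] * n
    v = sorted((abs(t) for t in y), reverse=True)
    k_hat, run = n, 0.0
    for k in range(1, n):            # smallest k satisfying (59), else k_hat = n
        run += v[k - 1]
        if (run - lam) / k >= v[k]:
            k_hat = k
            break
    u_inf = (sum(v[:k_hat]) - lam) / k_hat
    return [math.copysign(min(abs(t), u_inf), t) for t in y]
```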

4.2 Constrained examples (\(C \ne \mathcal {V}\))

In this section, we consider the subproblem (32) and show how it can be solved for some \(\phi \) and C. More precisely, we solve the minimization problem

$$\begin{aligned} \begin{array}{ll} \min &{}~ \displaystyle \frac{1}{2} \Vert x-y\Vert _2^2 + \lambda \phi (x)\\ \text {s.t.} &{}~ x \in C, \end{array} \end{aligned}$$

where \(\phi (x)\) is a simple convex function and C is a simple domain. We consider a few examples of this form.

Proposition 15

Let \(\phi (x) = \Vert Dx\Vert _1\) and \(C = [\underline{x}, \overline{x}]\), where D is a diagonal matrix. Then, the global minimizer of the subproblem (32) is given by

$$\begin{aligned} (\text {prox}_{\lambda \phi }^C(y))_i = \left\{ \begin{array}{ll} \underline{x}_i &{}~~ \text {if}~\omega (y,\lambda )> 0,~ \underline{x}_i - y_i + \lambda |d_i|~ \text {sign} (\underline{x}_i) \ge 0,\\ \overline{x}_i &{}~~ \text {if}~ \omega (y,\lambda )> 0,~ \overline{x}_i - y_i + \lambda |d_i|~ \text {sign} (\overline{x}_i) \le 0,\\ y_i - \lambda |d_i| &{}~~ \text {if}~ \omega (y,\lambda )> 0,~ y_i> \lambda |d_i|,\\ y_i+ \lambda |d_i| &{}~~ \text {if}~ \omega (y,\lambda ) > 0,~ y_i < -\lambda |d_i|,\\ 0 &{}~~ \text {otherwise}, \end{array} \right. \end{aligned}$$
(60)

for \(i = 1, \ldots , n\), where

$$\begin{aligned} \displaystyle \omega (y,\lambda ) :=\sum _{y_i+\lambda |d_i|<0}(y_i+\lambda |d_i|)\underline{x}_i+\sum _{y_i+\lambda |d_i|>0}(y_i+\lambda |d_i|)\overline{x}_i. \end{aligned}$$
(61)

Proof

The optimality condition (4) shows that \(u = \text {prox}_{\lambda \phi }^C (y)\) if and only if

$$\begin{aligned} 0 \in u -y + \lambda ~ \partial \Vert Du\Vert _1 + N_C(u), \end{aligned}$$
(62)

where \(N_C(u)\) is the normal cone of C at u defined in (2). We show that \(u=0\) if and only if \(\omega (y,\lambda )\le 0\). We first note that

$$\begin{aligned} N_C(0) = \left\{ p \in \mathcal {V} ~\big |~ \forall z \in [\underline{x}, \overline{x}], \langle p, z \rangle \le 0 \right\} = \left\{ p \in \mathcal {V} ~\Big |~ \sum _{p_i<0} p_i \underline{x}_i+\sum _{p_i>0}p_i \overline{x}_i \le 0 \right\} . \end{aligned}$$
(63)

Then, (62) shows that \(u=0\) if and only if there exists \(p\in N_C(0)\cap (y-\lambda \, \partial \phi (0))\). By Proposition 1, this is possible if and only if

$$\begin{aligned} \min \left\{ \sum _{p_i<0} p_i \underline{x}_i+\sum _{p_i>0}p_i \overline{x}_i~\Big |~ p=\lambda g, ~\Vert D^{-1} g\Vert _\infty \le 1 \right\} \le 0. \end{aligned}$$

The solution of this problem is \(p = y-\lambda |D\mathbf{1}|\), where \(\mathbf{1}\) is the vector of all ones. Hence, the minimum of this problem is given by (61). This implies \(u=0\) if and only if \(\omega (y,\lambda )\le 0\). We, therefore, consider two cases:

  1. Case (i).

    \(u=0\). Then, we have \(\omega (y,\lambda )\le 0\).

  2. Case (ii).

    \(u \ne 0\). Then, \(\omega (y,\lambda ) > 0\). Proposition 1 yields

    $$\begin{aligned} \partial \phi (u) = \{ g \in \mathcal {V}^* \mid \Vert D^{-1} g\Vert _\infty = 1,~~ \langle g,u \rangle = \Vert Du\Vert _1\}, \end{aligned}$$

    leading to

    $$\begin{aligned} \sum _{i=1}^n (g_i u_i - |d_i u_i|) = 0. \end{aligned}$$

    Since \(\Vert D^{-1}g\Vert _\infty = 1\) implies \(g_i u_i \le |d_i u_i|\) for every i, the sum can vanish only if \(g_i u_i = |d_i u_i|\), for \(i = 1, \ldots , n\). This implies that \(g_i = |d_i|~ \text {sign} (u_i)\) if \(u_i \ne 0\). This and the definition of \(N_C(u)\) yield

    $$\begin{aligned} u_i - y_i + \lambda (\partial \Vert Du\Vert _1)_i \left\{ \begin{array}{ll} \ge 0 &{}~~ \text {if}~ u_i = \underline{x}_i,\\ \le 0 &{}~~ \text {if}~ u_i = \overline{x}_i,\\ = 0 &{}~~ \text {if}~ \underline{x}_i< u_i < \overline{x}_i, \end{array} \right. \end{aligned}$$

    for \(i = 1, \ldots , n\), and equivalently for \(u \ne 0\), we get

    $$\begin{aligned} u_i - y_i + \lambda |d_i|~ \text {sign} (u_i) \left\{ \begin{array}{ll} \ge 0 &{}~~ \text {if}~ u_i = \underline{x}_i,\\ \le 0 &{}~~ \text {if}~ u_i = \overline{x}_i,\\ = 0 &{}~~ \text {if}~ \underline{x}_i< u_i < \overline{x}_i, \end{array} \right. \end{aligned}$$
    (64)

    for \(i = 1, \ldots , n\). If \(u_i = \underline{x}_i\), substituting \(u_i = \underline{x}_i\) in (64) implies \(\underline{x}_i - y_i + \lambda |d_i|~ \text {sign} (\underline{x}_i) \ge 0\). If \(u_i = \overline{x}_i\), substituting \(u_i = \overline{x}_i\) in (64) gives \(\overline{x}_i - y_i + \lambda |d_i|~ \text {sign} (\overline{x}_i) \le 0\). If \(\underline{x}_i< u_i < \overline{x}_i\), there are three possibilities: (a) \(u_i > 0\); (b) \(u_i < 0\); (c) \(u_i = 0\). In Case (a), \(\text {sign} (u_i) = 1\) and (64) lead to \(u_i = y_i - \lambda |d_i| > 0\). In Case (b), \(\text {sign} (u_i) = - 1\) and (64) imply \(u_i = y_i+ \lambda |d_i| <0\). In Case (c), we end up with \(u_i = 0\), completing the proof. \(\square \)
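Since the objective in Proposition 15 is separable across coordinates and each one-dimensional subproblem is convex, its minimizer over \([\underline{x}_i, \overline{x}_i]\) is the projection of the unconstrained soft-thresholded minimizer onto that interval. The sketch below uses this componentwise route as a numerical cross-check; it is a sketch under that observation, not the \(\omega \)-based case analysis of the proposition, and the function name is illustrative.

```python
import math

def prox_box_weighted_l1(y, d, lam, lb, ub):
    """Componentwise sketch for Proposition 15: for each i, minimize
    0.5*(u - y_i)^2 + lam*|d_i|*|u| over [lb_i, ub_i] by soft thresholding
    and then projecting onto the interval (valid for 1-D convex objectives)."""
    out = []
    for yi, di, lo, hi in zip(y, d, lb, ub):
        u = math.copysign(max(abs(yi) - lam * abs(di), 0.0), yi)  # soft threshold
        out.append(min(max(u, lo), hi))                           # project onto box
    return out
```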

Proposition 16

Let \(\phi (x) = \frac{1}{2} \Vert x\Vert _2^2\) and \(C = [\underline{x}, \overline{x}]\). Then, the global minimizer of the subproblem (32) is given by

$$\begin{aligned} (\text {prox}_{\lambda \phi }^C(y))_i = \left\{ \begin{array}{ll} \underline{x}_i &{}~~ \text {if}~ (1+\lambda ) \underline{x}_i \ge y_i,\\ \overline{x}_i &{}~~ \text {if}~ (1+\lambda ) \overline{x}_i \le y_i,\\ y_i/(1+\lambda ) &{}~~ \text {if}~ \underline{x}_i< y_i/(1+\lambda ) < \overline{x}_i, \end{array} \right. \end{aligned}$$
(65)

for \(i = 1, \ldots , n\).

Proof

The function \(\phi (x) = \frac{1}{2} \Vert x\Vert _2^2\) is differentiable with gradient

$$\begin{aligned} \nabla \phi (x) = x. \end{aligned}$$

This and the definition of \(N_C(u)\) imply

$$\begin{aligned} u_i - y_i + \lambda u_i \left\{ \begin{array}{ll} \ge 0 &{}~~ \text {if}~ u_i = \underline{x}_i,\\ \le 0 &{}~~ \text {if}~ u_i = \overline{x}_i,\\ = 0 &{}~~ \text {if}~ \underline{x}_i< u_i < \overline{x}_i, \end{array} \right. \end{aligned}$$
(66)

for \(i = 1, \ldots , n\). If \(u_i = \underline{x}_i\), substituting \(u_i = \underline{x}_i\) in (66) implies \((1+\lambda ) \underline{x}_i \ge y_i\). If \(u_i = \overline{x}_i\), substituting \(u_i = \overline{x}_i\) in (66) yields \((1+\lambda ) \overline{x}_i \le y_i\). If \(\underline{x}_i< u_i < \overline{x}_i\), then \(u_i = y_i/(1+\lambda )\), giving the result. \(\square \)
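Formula (65) is simply \(y_i/(1+\lambda )\) clipped componentwise to the box; a one-line sketch (illustrative function name):

```python
def prox_box_l2sq(y, lam, lb, ub):
    """Proposition 16: prox of 0.5*lam*||x||_2^2 over the box [lb, ub]
    is y/(1+lam) clipped componentwise, cf. (65)."""
    return [min(max(yi / (1.0 + lam), lo), hi)
            for yi, lo, hi in zip(y, lb, ub)]
```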

Proposition 17

Let \(\phi (x) = \frac{1}{2} \lambda _1 \Vert x\Vert _2^2 + \lambda _2 \Vert Dx\Vert _1\) and \(C = [\underline{x}, \overline{x}]\). Then the global minimizer of the subproblem (32) is determined by

$$\begin{aligned} (\text {prox}_{\lambda \phi }^C(y))_i = \left\{ \begin{array}{ll} \underline{x}_i &{} \quad \text {if}~~ \omega (y,\lambda _2)> 0,~(1+\lambda _1)\underline{x}_i - y_i + \lambda _2 |d_i|~ \text {sign} (\underline{x}_i) \ge 0,\\ \overline{x}_i &{} \quad \text {if}~~ \omega (y,\lambda _2)> 0,~ (1+\lambda _1)\overline{x}_i - y_i + \lambda _2 |d_i|~ \text {sign} (\overline{x}_i) \le 0,\\ 1/(1+\lambda _1)(y_i - \lambda _2 |d_i|) &{} \quad \text {if}~~ \omega (y,\lambda _2)> 0,~y_i> \lambda _2 |d_i|,\\ 1/(1+\lambda _1)(y_i + \lambda _2 |d_i|) &{} \quad \text {if}~~ \omega (y,\lambda _2) > 0,~ y_i < -\lambda _2 |d_i|,\\ 0 &{} \quad \text {otherwise}, \end{array} \right. \end{aligned}$$
(67)

for \(i = 1, \ldots , n\), where \(\omega (y,\lambda _2)\) is given by (61) with \(\lambda \) replaced by \(\lambda _2\).

Proof

Since \(\mathcal {V}\) is finite-dimensional and \(\text {dom} \left( \frac{1}{2} \lambda _1 \Vert x\Vert _2^2 \right) \cap \text {dom} \left( \lambda _2 \Vert Dx\Vert _1 \right) \ne \emptyset \), the subdifferential sum rule gives

$$\begin{aligned} \partial \left( \frac{1}{2} \lambda _1 \Vert x\Vert _2^2 + \lambda _2 \Vert Dx\Vert _1 \right) = \lambda _1 \partial \left( \frac{1}{2} \Vert x\Vert _2^2 \right) + \lambda _2 \partial \left( \Vert Dx\Vert _1 \right) . \end{aligned}$$
(68)

The optimality condition (4) shows that \(u = \text {prox}_{\lambda \phi }^C (y)\) if and only if

$$\begin{aligned} 0 \in u -y + \lambda _1 u + \lambda _2~ \partial \Vert Du\Vert _1 + N_C(u). \end{aligned}$$
(69)

By (69), we have \(u=0\) if and only if there exists \(p\in N_C(0)\cap (y-\lambda _2\, \partial \Vert D \cdot \Vert _1(0))\), where \(N_C(0)\) is defined by (63). By Proposition 1, this is possible if and only if

$$\begin{aligned} \min \left\{ \sum _{p_i<0} p_i \underline{x}_i+\sum _{p_i>0}p_i \overline{x}_i~\Big |~ p=\lambda _2 g, ~\Vert D^{-1} g\Vert _\infty \le 1 \right\} \le 0. \end{aligned}$$

The solution of this problem is \(p = y-\lambda _2 |D\mathbf{1}|\), where \(\mathbf{1}\) is the vector of all ones. Hence the minimum of this problem is given by (61). This implies \(u=0\) if and only if \(\omega (y,\lambda _2)\le 0\). We, therefore, consider two cases:

  1. Case (i).

    \(u=0\). Then, we have \(\omega (y,\lambda _2)\le 0\).

  2. Case (ii).

    \(u \ne 0\). Then, \(\omega (y,\lambda _2)>0\). From (68) and the definition of \(N_C(u)\), we obtain

    $$\begin{aligned} u_i - y_i + \lambda _1 u_i + \lambda _2 \partial |d_iu_i| \left\{ \begin{array}{ll} \ge 0 &{}~~ \text {if}~~ u_i = \underline{x}_i,\\ \le 0 &{}~~ \text {if}~~ u_i = \overline{x}_i,\\ = 0 &{}~~ \text {if}~~ \underline{x}_i< u_i < \overline{x}_i, \end{array} \right. \end{aligned}$$

    for \(i = 1, \ldots , n\). This leads to

    $$\begin{aligned} (1+\lambda _1) u_i - y_i + \lambda _2 |d_i|~ \text {sign} (u_i) \left\{ \begin{array}{ll} \ge 0 &{}\quad \text {if}\quad u_i = \underline{x}_i,\\ \le 0 &{}\quad \text {if}\quad u_i = \overline{x}_i,\\ = 0 &{}\quad \text {if}\quad \underline{x}_i< u_i < \overline{x}_i, \end{array} \right. \end{aligned}$$
    (70)

    for \(i = 1, \ldots , n\). If \(u_i = \underline{x}_i\), substituting \(u_i = \underline{x}_i\) in (70) gives \((1+\lambda _1)\underline{x}_i - y_i + \lambda _2 |d_i|~\text {sign} (\underline{x}_i) \ge 0\). If \(u_i = \overline{x}_i\), substituting \(u_i = \overline{x}_i\) in (70) implies \((1+\lambda _1) \overline{x}_i - y_i + \lambda _2 |d_i|~ \text {sign} (\overline{x}_i) \le 0\). If \(\underline{x}_i< u_i < \overline{x}_i\), there are three possibilities: (i) \(u_i > 0\); (ii) \(u_i < 0\); (iii) \(u_i = 0\). In the first case, \(\text {sign}(u_i) = 1\) and (70) imply \(u_i =(y_i - \lambda _2 |d_i|)/(1+\lambda _1) > 0\). In the second case, \(\text {sign}(u_i) = - 1\) and (70) imply \(u_i = (y_i+ \lambda _2 |d_i|)/(1+\lambda _1) <0\). In the third case, we get \(u_i = 0\), giving the result. \(\square \)
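The componentwise rule above can be checked numerically. The sketch below (Python; the function name and the soft-threshold-then-clip formulation are ours, and the \(\omega (y,\lambda _2)\) gate of (67) is not modeled explicitly, since the \(u_i=0\) branch emerges from the threshold) solves each separable scalar problem \(\min _{u \in [\underline{x}_i, \overline{x}_i]} \frac{1}{2}(u-y_i)^2 + \frac{1}{2}\lambda _1 u^2 + \lambda _2 |d_i||u|\); clipping to the box is valid because each scalar problem is convex:

```python
import numpy as np

def prox_box_elastic_net(y, lam1, lam2, d, lo, hi):
    """Componentwise minimizer of
         0.5*(u - y_i)^2 + 0.5*lam1*u^2 + lam2*|d_i|*|u|
       over the box [lo_i, hi_i].  Soft-threshold, scale by
       1/(1 + lam1), then clip to the box; the clip is valid
       since each 1-D problem is convex (an illustrative
       sketch, not the paper's code)."""
    t = lam2 * np.abs(d)
    u = np.sign(y) * np.maximum(np.abs(y) - t, 0.0) / (1.0 + lam1)
    return np.clip(u, lo, hi)
```

For components with \(\underline{x}_i< u_i < \overline{x}_i\) this reproduces the interior branches of (67), and the clip reproduces the boundary branches.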

Consider nonnegativity constraints \(x \ge 0\). Such constraints are important in many applications, especially when x describes physical quantities; see, e.g., (Esser et al. 2013; Kaufman and Neumaier 1996, 1997). Since nonnegativity constraints are a special case of a bound-constrained domain, Propositions 15, 16, and 17 can be used to derive the corresponding results for nonnegativity constraints.

5 Numerical experiments and application

We now report numerical results comparing the performance of OSGA-O with OSGA and some state-of-the-art methods. In our comparison, we consider PGA (proximal gradient algorithm (Parikh and Boyd 2013)), NSDSG (nonsummable diminishing subgradient algorithm (Boyd et al. 2003)), FISTA (Beck and Teboulle’s fast proximal gradient algorithm (Beck and Teboulle 2012)), NESCO (Nesterov’s composite optimal algorithm (Nesterov 2013)), NESUN (Nesterov’s universal gradient algorithm (Nesterov 2015)), NES83 (Nesterov’s 1983 optimal algorithm (Nesterov 1983)), NESCS (Nesterov’s constant step optimal algorithm (Nesterov 2004)), and NES05 (Nesterov’s 2005 optimal algorithm (Nesterov 2005a)). We adapt NES83, NESCS, and NES05 to nonsmooth problems by passing a subgradient in place of the gradient (see Ahookhosh (2016)). All algorithms are implemented in MATLAB with the parameters proposed in the associated papers.

5.1 Experiment with random data

We consider solving an underdetermined system

$$\begin{aligned} A x = y, \end{aligned}$$
(71)

where \(A \in \mathbb {R}^{m \times n}\) (\(m < n\)) and \(y \in \mathbb {R}^m\). Underdetermined systems of linear equations appear frequently in applications of linear inverse problems, such as those in signal and image processing, geophysics, economics, machine learning, and statistics. The objective is to recover x from the observed vector y and the matrix A by means of an optimization model. Due to the ill-conditioning of the problem, the most popular optimization models are (18), (19), and (20), where (18) is smooth and (19) and (20) are nonsmooth. In Sect. 5.1.1, we report numerical results for the \(\ell _1\) minimization problem (19), and in Sect. 5.1.2, we give results for the elastic net minimization problem (20). We set \(m = 5000\) and \(n = 10000\), and the data A, y, and \(x_0\) for problem (19) are randomly generated by

$$\begin{aligned} \mathtt {A=rand(m,n),~~ y=rand(1,m),~~ x_0=rand(1,n),} \end{aligned}$$

where \(\mathtt {rand}\) generates uniformly distributed random numbers between 0 and 1 and \(x_0\) is a starting point for algorithms.
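For illustration, the instance and the objective can be reproduced at a small scale. The sketch below (Python with toy dimensions; it assumes (19) is the standard model \(\frac{1}{2}\Vert Ax-y\Vert _2^2 + \lambda \Vert x\Vert _1\), and all names are ours) also takes one proximal-gradient step of size 1/L, which cannot increase the objective:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 50, 100, 0.1             # toy stand-ins for m = 5000, n = 10000
A = rng.random((m, n))               # A = rand(m,n)
y = rng.random(m)                    # y = rand(1,m), stored as a vector
x0 = rng.random(n)                   # x_0 = rand(1,n), the starting point

def f(x):
    # assumed form of (19): 0.5*||A x - y||_2^2 + lam*||x||_1
    r = A @ x - y
    return 0.5 * r @ r + lam * np.abs(x).sum()

# one proximal-gradient (PGA/ISTA-type) step with step size 1/L, where
# L = ||A||_2^2 is the Lipschitz constant of the smooth part's gradient
L = np.linalg.norm(A, 2) ** 2
z = x0 - (A.T @ (A @ x0 - y)) / L
x1 = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-threshold
```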

We divide the solvers into two classes: (i) proximal-based methods (PGA, FISTA, NESCO, and NESUN), which can be directly applied to nonsmooth problems; (ii) subgradient-based methods (NSDSG, NES83, NESCS, and NES05), which require the nonsmooth first-order oracle, where NES83, NESCS, and NES05 are adapted to take a subgradient in place of the gradient. We set

$$\begin{aligned} \widehat{L} := \max _{1 \le i \le n} \Vert a_i\Vert ^2, \end{aligned}$$

where \(a_i\) (\(i = 1, 2, \ldots , n\)) is the i-th column of A. In the implementation, NESCS, NES05, PGA, and FISTA use \(L = 10^4 \widehat{L}\), and NSDSG employs \(\alpha _0 = 10^{-7}\). Algorithm 1, for both OSGA and OSGA-O, uses the parameters

$$\begin{aligned} \delta = 0.9,~~ \alpha _{max} = 0.7,~~ \kappa = \kappa ' = 0.5, \end{aligned}$$

and the prox-function (24) with \(Q_0 = \frac{1}{2} \Vert x_0\Vert _2 + \epsilon \), where \(\epsilon \) is the machine precision. All numerical experiments were executed on a PC with an Intel Core i7-3770 CPU (3.40 GHz) and 8 GB RAM. To solve the nonlinear system of equations (33), we first consider the nonlinear least-squares problem (40) and solve it by the MATLAB internal function \(\mathtt {fminsearch}\),Footnote 1 a derivative-free solver that handles both smooth and nonsmooth problems. In our implementation, we apply OSGA-O to the problem, stop it after 100 iterations, and save the best function value attained (\(f_s\)); we then run the other solvers until either the same function value is achieved or the number of iterations reaches 5000. In our comparison, \(N_i\) and T denote the total number of iterations and the running time, respectively.
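The column-norm constant \(\widehat{L}\) is cheap to form. A minimal sketch (Python, with small placeholder dimensions; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 100                        # small stand-ins for m = 5000, n = 10000
A = rng.random((m, n))

# Lhat = max_i ||a_i||_2^2 over the columns a_i of A
Lhat = (A ** 2).sum(axis=0).max()

# the experiments scale this bound: L = 1e4 * Lhat for NESCS, NES05,
# PGA, and FISTA, which then take proximal-gradient steps of size 1/L
L = 1e4 * Lhat
step = 1.0 / L
```

Note that \(\widehat{L} = \max _i \Vert A e_i\Vert _2^2 \le \Vert A\Vert _2^2\), so this column-norm bound never exceeds the spectral-norm Lipschitz constant.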

To display the results, we used the Dolan and Moré performance profile (Dolan and Moré 2002) with the performance measures \(N_i\) and T. In this procedure, the performance of each algorithm is measured by the ratio of its computational outcome versus the best numerical outcome of all algorithms. This performance profile offers a tool to statistically compare the performance of algorithms. Let \(\mathcal {S}\) be a set of all algorithms and \(\mathcal {P}\) be a set of test problems. For each problem p and algorithm s, \(t_{p,s}\) denotes the computational outcome with respect to the performance index, which is used in the definition of the performance ratio

$$\begin{aligned} r_{p,s}:=\frac{t_{p,s}}{\min \{t_{p,s}:s\in \mathcal {S}\}}. \end{aligned}$$
(72)

If an algorithm s fails to solve a problem p, the procedure sets \(r_{p,s}:=r_\text {failed}\), where \(r_\text {failed}\) should be strictly larger than any performance ratio (72). Let \(n_p\) be the number of problems in the experiment. For any factor \(\tau \in \mathbb {R}\), the overall performance of an algorithm s is given by

$$\begin{aligned} \rho _{s}(\tau ):=\frac{1}{n_{p}}\text {size}\{p\in \mathcal {P}:r_{p,s}\le \tau \}. \end{aligned}$$

Here, \(\rho _{s}(\tau )\) is the probability that the performance ratio \(r_{p,s}\) of an algorithm \(s\in \mathcal {S}\) is within a factor \(\tau \) of the best possible ratio. The function \(\rho _{s}(\tau )\) is a distribution function for the performance ratio. In particular, \(\rho _{s}(1)\) gives the probability that an algorithm s wins over all other considered algorithms, and \(\lim _{\tau \rightarrow r_{\text {failed}}}\rho _{s}(\tau )\) gives the probability that algorithm s solves all considered problems. Therefore, the performance profile can be considered as a measure of efficiency among all considered algorithms. In the figures of this section, \(\tau \) is shown on the x-axis, while \(P(r_{p,s}\le \tau :1\le s\le n_{s})\) is shown on the y-axis.
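Under the definitions above, computing \(\rho _s(\tau )\) takes only a few lines. The sketch below (Python; names are ours) builds the profile from an outcome table \(t_{p,s}\), with NaN marking a failed run so that it receives the large ratio \(r_\text {failed}\):

```python
import numpy as np

def performance_profile(t, tau, r_failed=1e6):
    """rho[s] = fraction of problems p with ratio r[p, s] <= tau,
       where r[p, s] = t[p, s] / min_s t[p, s] as in (72); failed
       runs (NaN entries of t) get the ratio r_failed."""
    t = np.asarray(t, dtype=float)
    best = np.nanmin(t, axis=1, keepdims=True)      # best outcome per problem
    r = np.where(np.isnan(t), r_failed, t / best)   # performance ratios
    return (r <= tau).mean(axis=0)
```

\(\rho _s(1)\) is then the fraction of problems on which solver s is (one of) the best, and \(\rho _s(\tau )\) is nondecreasing in \(\tau \).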

5.1.1 \(\ell _1\) minimization

Here, we consider the \(\ell _1\) minimization problem (19), reformulate it as a minimization problem of the form (13) with the objective and the constraint given in (22), and solve the reformulated problem by OSGA-O. We then report some numerical results and a comparison among OSGA-O, OSGA and some state-of-the-art methods. For OSGA-O and OSGA, we here set \(\mu =0\).

Let us consider six different regularization parameters, apply PGA, FISTA, NESCO, NESUN, OSGA, and OSGA-O to (19) with 10 randomly generated data sets for each regularization parameter, and report the numerical results in Table 1. Then, we use NSDSG, NES83, NESCS, NES05, OSGA, and OSGA-O to solve (19) with 10 randomly generated data sets for each regularization parameter and report the numerical results in Table 2. We give a comparison among these algorithms in Fig. 1 for all 60 problems with the performance profiles of \(N_i\) and T. We illustrate function values versus iterations for both classes of solvers with the regularization parameters \(\lambda =1,~10^{-1},~10^{-2},~10^{-3},~10^{-4},~10^{-5}\) in Fig. 2.

Table 1 Averages (only integer part) of \(N_i\) and T for PGA, FISTA, NESCO, NESUN, OSGA, and OSGA-O for solving \(\ell _1\) minimization problem with several regularization parameters
Table 2 Averages (only integer part) of \(N_i\) and T for NSDSG, NES83, NESCS, NES05, OSGA, and OSGA-O for solving \(\ell _1\) minimization problem with several regularization parameters
Fig. 1
figure 1

Performance profiles for the number of iterations \(N_i\) and the running time T for the \(\ell _1\) minimization problem: a, b display the results for \(N_i\) and T of PGA, FISTA, NESCO, NESUN, OSGA, and OSGA-O; c, d, respectively, illustrate the results for \(N_i\) and T of NSDSG, NES83, NESCS, NES05, OSGA, and OSGA-O. In all of these subfigures OSGA-O attains the best results with respect to both measures \(N_i\) and T

Fig. 2
figure 2

A comparison among first-order methods for solving the \(\ell _1\) minimization problem: a–f illustrate a comparison of function values versus iterations among PGA, FISTA, NESCO, NESUN, OSGA, and OSGA-O for \(\lambda =1,~10^{-1},~10^{-2},~10^{-3},~10^{-4},~10^{-5}\), respectively; g–l display a comparison of function values versus iterations among NSDSG, NES83, NESCS, NES05, OSGA, and OSGA-O for \(\lambda =1,~10^{-1},~10^{-2},~10^{-3},~10^{-4},~10^{-5}\), respectively

The results of Tables 1 and 2 show that OSGA-O attains the best number of iterations and running time for the \(\ell _1\) minimization problem, where each entry of these tables is the average over the 10 runs associated with each regularization parameter. In Fig. 1, subfigures (a) and (b) show performance profiles with measures \(N_i\) and T comparing the proximal-based methods, where OSGA-O substantially outperforms the others. Subfigures (c) and (d) display performance profiles for measures \(N_i\) and T comparing the subgradient-based methods, where OSGA-O again performs much better than the others with respect to both measures. Further, from Fig. 2, it can be seen that the worst results are obtained by NSDSG and PGA, while FISTA, NESCO, NESUN, NES83, NESCS, NES05, and OSGA are comparable to some extent; however, OSGA-O is significantly superior to all the other methods.

5.1.2 Elastic net minimization

We now consider the elastic net minimization problem (20), reformulate it as a minimization problem of the form (13) with the objective and the constraint given in (23), and solve the reformulated problem by OSGA-O. We then give some numerical results and a comparison among OSGA-O, OSGA, and some state-of-the-art solvers. For OSGA-O and OSGA, we here set \(\mu =\lambda _1/2\).

Let us consider six different regularization parameters \(\lambda _1=\lambda _2=1,~10^{-1},~10^{-2},~10^{-3},~10^{-4},~10^{-5}\). For each of these parameters, we generate the random data 10 times and report numerical results of PGA, FISTA, NESCO, NESUN, OSGA, and OSGA-O in Table 3 and numerical results of NSDSG, NES83, NESCS, NES05, OSGA, and OSGA-O in Table 4. For these 60 problems, we illustrate the performance profile for the measures \(N_i\) and T in Fig. 3. We then display function values versus iterations for both classes of solvers with \(\lambda _1=\lambda _2=1,~10^{-1},~10^{-2},~10^{-3},~10^{-4},~10^{-5}\) in Fig. 4.

Table 3 Averages (only integer part) of \(N_i\) and T for PGA, FISTA, NESCO, NESUN, OSGA, and OSGA-O for solving the elastic net problem (20) with several regularization parameters
Table 4 Averages (only integer part) of \(N_i\) and T for NSDSG, NES83, NESCS, NES05, OSGA, and OSGA-O for solving the elastic net problem with several regularization parameters

The results of Tables 3 and 4 show that the best number of iterations (\(N_i\)) and running time (T) are obtained by OSGA-O. From the results of Fig. 3, it can be seen that OSGA-O considerably outperforms the others with respect to \(N_i\) and T among both the proximal-type and subgradient-type methods. It is also clear that the second-best algorithm is OSGA. In Fig. 4, we can see that the worst results are obtained by NSDSG and PGA, while FISTA, NESCO, NESUN, NES83, NESCS, NES05, and OSGA behave competitively; OSGA-O, however, performs significantly better than all of them.

Fig. 3
figure 3

Performance profiles for the number of iterations \(N_i\) and the running time T for the elastic net problem: a, b display the results for \(N_i\) and T of PGA, FISTA, NESCO, NESUN, OSGA, and OSGA-O; c, d, respectively, illustrate the results for \(N_i\) and T of NSDSG, NES83, NESCS, NES05, OSGA, and OSGA-O. In all of these subfigures OSGA-O attains the best results with respect to both measures \(N_i\) and T

Fig. 4
figure 4

A comparison among first-order methods for solving the elastic net problem: a–f illustrate a comparison of function values versus iterations among PGA, FISTA, NESCO, NESUN, OSGA, and OSGA-O for \(\lambda =1,~10^{-1},~10^{-2},~10^{-3},~10^{-4},~10^{-5}\), respectively; g–l display a comparison of function values versus iterations among NSDSG, NES83, NESCS, NES05, OSGA, and OSGA-O for \(\lambda =1,~10^{-1},~10^{-2},~10^{-3},~10^{-4},~10^{-5}\), respectively

Fig. 5
figure 5

Function values versus iterations for the \(\ell _1\) minimization and elastic net problems: a, b display the results for PGA, FISTA, NESCO, NESUN, OSGA, and OSGA-O for the \(\ell _1\) minimization; c, d illustrate the results for NSDSG, NES83, NESCS, NES05, OSGA, and OSGA-O for the elastic net problem

5.2 Sparse recovery (compressed sensing)

In recent years, there has been increasing interest in finding sparse solutions of structured models in various areas of applied mathematics. In most cases, the problem involves high-dimensional data with a small number of available measurements, and its core is an optimization problem of the form (19) or (20). Thanks to the sparsity of solutions and the structure of the problems, these optimization problems can be solved in reasonable time even for extremely high-dimensional data sets. Sparse recovery, basis pursuit, lasso, wavelet-based deconvolution, and compressed sensing are some examples, the last of which has received a great deal of attention in recent years, cf. (Candès 2006; Donoho 2006).

Let us consider a linear inverse problem of the form (71), which we solve via the minimization problems (19) and (20). We set \(n = 4096\) and \(m = 1024\). The problem is generated by the same procedure as in the GPSR package (Figueiredo et al. 2007), available at

http://www.lx.it.pt/~mtf/GPSR/

which is

$$\begin{aligned} \begin{array}{l} \mathtt {n\_spikes = floor(0.01*n);} \mathtt {p = zeros(n,1);~} \mathtt {q = randperm(n);} \\ \mathtt {p(q(1:n\_spikes)) = sign(randn(n\_spikes,1));} \mathtt {B = randn(m,n);~} \\ \mathtt {B = orth(B')';} \mathtt {bf = B*p;~ } \mathtt {b = bf+sigma*randn(m,1);} \end{array} \end{aligned}$$

with \(\lambda =\lambda _1=\lambda _2=\frac{1}{2}\max (|A^Tb|)\). We conclude this section by solving this sparse recovery problem with OSGA-O, OSGA, and the other methods described in the previous section. We show the results in Fig. 5. In this implementation, we apply OSGA-O to the problem, stop it after 10 iterations, and save the best function value attained (\(f_s\)); we then run the other solvers until either the same function value is achieved or the number of iterations reaches 5000. From Fig. 5, it is clear that OSGA-O attains the best performance compared with the others for both the \(\ell _1\) minimization and elastic net problems.
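A rough Python analogue of the MATLAB generator above (the noise level \(\mathtt {sigma}\) is not specified in the text, so the default below is a placeholder; \(\mathtt {orth(B')'}\) is realized via a reduced QR factorization of \(B^T\), and the function name is ours):

```python
import numpy as np

def gpsr_instance(m=1024, n=4096, sigma=1e-2, seed=0):
    """Sketch of the GPSR-style sparse-recovery instance quoted above."""
    rng = np.random.default_rng(seed)
    n_spikes = int(np.floor(0.01 * n))       # n_spikes = floor(0.01*n)
    p = np.zeros(n)
    q = rng.permutation(n)                   # q = randperm(n)
    p[q[:n_spikes]] = np.sign(rng.standard_normal(n_spikes))
    B = rng.standard_normal((m, n))          # B = randn(m,n)
    Q, _ = np.linalg.qr(B.T)                 # B = orth(B')': orthonormal rows
    B = Q.T
    b = B @ p + sigma * rng.standard_normal(m)
    lam = 0.5 * np.abs(B.T @ b).max()        # lambda = 0.5*max(|A^T b|)
    return B, p, b, lam
```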

6 Conclusions

This paper discusses the solution of structured nonsmooth convex optimization problems with the complexity \(\mathcal {O}(\varepsilon ^{-1/2})\), which is optimal for smooth problems with Lipschitz continuous gradients. First, if the nonsmoothness of the problem is manifested in a structured way, the problem is reformulated so that the objective is smooth with Lipschitz continuous gradients, at the price of adding a functional constraint to the feasible domain. Then, a new setup of the optimal subgradient algorithm (OSGA-O) is developed to solve the reformulated problem with the complexity \(\mathcal {O}(\varepsilon ^{-1/2})\).

Next, it is proved that the OSGA-O auxiliary problem is equivalent to a proximal-like problem, which is well-studied due to its appearance in Nesterov-type optimal methods for composite minimization. For several problems appearing in applications, either an explicit formula or a simple iterative scheme for solving the corresponding proximal-like problems is provided.

Finally, numerical results with random data and a sparse recovery problem are presented, indicating the good behavior of OSGA-O compared to some state-of-the-art first-order methods and confirming the theoretical foundations.