
Journal of Optimization Theory and Applications, Volume 178, Issue 3, pp 673–698

Envelope Functions: Unifications and Further Properties

  • Pontus Giselsson
  • Mattias Fält

Open Access Article

Abstract

Forward–backward and Douglas–Rachford splitting are methods for structured nonsmooth optimization. With the aim of using smooth optimization techniques for nonsmooth problems, the forward–backward and Douglas–Rachford envelopes were recently proposed. Under specific problem assumptions, these envelope functions have favorable smoothness and convexity properties, and their stationary points coincide with the fixed-points of the underlying algorithm operators. This allows for solving such nonsmooth optimization problems by minimizing the corresponding smooth convex envelope function. In this paper, we present a general envelope function that unifies and generalizes existing ones. We provide properties of the general envelope function that sharpen corresponding known results for the special cases. We also present a new interpretation of the underlying methods as majorization–minimization algorithms applied to their respective envelope functions.

Keywords

First-order methods · Envelope functions · Nonsmooth optimization · Smooth reformulations · Large-scale optimization

Mathematics Subject Classification

90C30 · 47J25

1 Introduction

Many convex optimization problems can be reformulated into a problem of finding a fixed-point of a nonexpansive operator. This is the basis for many first-order optimization algorithms such as forward–backward splitting [1], Douglas–Rachford splitting [2, 3], the alternating direction method of multipliers (ADMM) [4, 5, 6] and its linearized versions [7], the three-operator splitting method [8], and (generalized) alternating projections [9, 10, 11, 12, 13, 14].

In these methods, a fixed-point is found by performing an averaged iteration of the nonexpansive mapping. This scheme guarantees global convergence, but the rate of convergence can be slow. A well-studied approach for improving practical convergence—that has proven very successful in practice—is preconditioning of the problem data; see, e.g., [15, 16, 17, 18, 19, 20, 21] for a limited selection of such methods. The underlying idea is to incorporate static second-order information in the respective algorithms.

The performance of the forward–backward and the Douglas–Rachford methods can be further improved by exploiting the properties of the recently proposed forward–backward envelope [22, 23] and Douglas–Rachford envelope [24]. As shown in [22, 23, 24], the stationary points of these envelope functions agree with the fixed-points of the corresponding algorithm operator. Under certain assumptions, they have favorable properties such as convexity and Lipschitz continuity of the gradient. These properties enable nonsmooth problems to be solved by finding a stationary point of a smooth and convex envelope function. In [22, 23], truncated Newton methods and quasi-Newton methods are applied to the forward–backward envelope function to improve local convergence. During the review process of this paper, these works were extended to the nonconvex setting in [25, 26] for both forward–backward splitting and Douglas–Rachford splitting.

A unifying property of forward–backward and Douglas–Rachford splitting (for convex optimization) is that they are averaged iterations of a nonexpansive mapping. This mapping is composed of two nonexpansive mappings that are gradients of functions. Based on this observation, we present a general envelope function that has the forward–backward envelope and the Douglas–Rachford envelope as special cases. Other special cases include the Moreau envelope and the ADMM envelope [27], since they are special cases of the forward–backward and Douglas–Rachford envelopes, respectively. We also explicitly characterize the relationship between the ADMM and Douglas–Rachford envelopes as being essentially the negatives of each other.

The analyses of the envelope functions in [22, 23, 24] require, translated to our setting, that the function that defines one of the nonexpansive operators in the composition is twice continuously differentiable. In this paper, we analyze the proposed general envelope function in the more restrictive setting of this twice continuously differentiable function being quadratic, or equivalently, its gradient being affine. We show that if the Hessian matrix of this function is nonsingular, the stationary points of the envelope coincide with the fixed-points of the nonexpansive operator. We provide sharp quadratic upper and lower bounds on the envelope function that improve corresponding results for the known special cases in the literature. One implication of these bounds is that the gradient of the envelope function is Lipschitz continuous with constant two. If, in addition, the aforementioned Hessian matrix is positive semidefinite, the envelope function is convex, implying that a fixed-point of the nonexpansive operator can be found by minimizing a smooth and convex envelope function.

We also provide an interpretation of the basic averaged fixed-point iteration as a majorization–minimization step on the envelope function. We show that the majorizing function is a quadratic upper bound, which is slightly more conservative than the provided sharp quadratic upper bound. We also note that using the sharp quadratic upper bound as majorizing function would result in computationally more expensive algorithm iterations.

Our contributions are as follows: (i) we propose a general envelope function that has several known envelope functions as special cases; (ii) we provide properties of the general envelope that sharpen (sometimes considerably) and generalize corresponding known results for the special cases; (iii) we provide an interpretation of the basic averaged iteration as a suboptimal majorization–minimization step on the envelope; and (iv) we provide new insights on the relation between the Douglas–Rachford envelope and the ADMM envelope.

2 Preliminaries

2.1 Notation

We denote by \(\mathbb {R}\) the set of real numbers, \(\mathbb {R}^n\) the set of real n-dimensional vectors, and \(\mathbb {R}^{m\times n}\) the set of real \(m\times n\)-matrices. Further, \(\overline{\mathbb {R}}:=\mathbb {R}\cup \{\infty \}\) denotes the extended real line. We denote inner products on \(\mathbb {R}^n\) by \(\langle \cdot ,\cdot \rangle \) and their induced norms by \(\Vert \cdot \Vert \). We define the scaled norm \(\Vert x\Vert _P:=\sqrt{\langle Px,x\rangle }\), where P is a positive definite operator (defined in Definition 2.2). We will use the same notation for scaled semi-norms, i.e., \(\Vert x\Vert _P:=\sqrt{\langle Px,x\rangle }\), where P is a positive semidefinite operator (defined in Definition 2.1). The identity operator is denoted by \(\mathrm {Id}\). The conjugate function is denoted and defined by \(f^{*}(y)\triangleq \sup _{x}\left\{ \langle y,x\rangle -f(x)\right\} \). The adjoint operator to a linear operator \(L:\mathbb {R}^n\rightarrow \mathbb {R}^m\) is defined as the unique operator \(L^*:\mathbb {R}^m\rightarrow \mathbb {R}^n\) that satisfies \(\langle Lx,y\rangle =\langle x,L^*y\rangle \). The linear operator \(L:\mathbb {R}^n\rightarrow \mathbb {R}^n\) is self-adjoint if \(L=L^*\). The notation \({\mathrm{argmin}}_x f(x)\) refers to any element that minimizes f. Finally, \(\iota _C\) denotes the indicator function for the set C that satisfies \(\iota _C(x)=0\) if \(x\in C\) and \(\iota _C(x)=\infty \) if \(x\not \in C\).

2.2 Background

In this section, we introduce some standard definitions that can be found, e.g., in [28, 29].

2.2.1 Operator Properties

Definition 2.1

(Positive semidefinite) A linear operator \(L:\mathbb {R}^n\rightarrow \mathbb {R}^n\) is positive semidefinite, if it is self-adjoint and all eigenvalues \(\lambda _i(L)\ge 0\).

Remark 2.1

An equivalent characterization of a positive semidefinite operator is that \(\langle Lx,x\rangle \ge 0\) for all \(x\in \mathbb {R}^n\).

Definition 2.2

(Positive definite) A linear operator \(L:\mathbb {R}^n\rightarrow \mathbb {R}^n\) is positive definite, if it is self-adjoint and if all eigenvalues \(\lambda _i(L)\ge m\) with \(m>0\).

Remark 2.2

An equivalent characterization of a positive definite operator L is that \(\langle Lx,x\rangle \ge m\Vert x\Vert ^2\) for some \(m>0\) and all \(x\in \mathbb {R}^n\).

Definition 2.3

(Lipschitz continuous) A mapping \(T:\mathbb {R}^n\rightarrow \mathbb {R}^n\) is \(\delta \)-Lipschitz continuous with \(\delta \ge 0\) if
$$\begin{aligned} \Vert Tx-Ty\Vert \le \delta \Vert x-y\Vert \end{aligned}$$
holds for all \(x,y\in \mathbb {R}^n\). If \(\delta =1\), then T is nonexpansive and if \(\delta \in [0,1[\), then T is \(\delta \)-contractive.

Definition 2.4

(Averaged) A mapping \(T:\mathbb {R}^n\rightarrow \mathbb {R}^n\) is \(\alpha \)-averaged if there exists a nonexpansive mapping \(S:\mathbb {R}^n\rightarrow \mathbb {R}^n\) and an \(\alpha \in ]0,1]\) such that \(T=(1-\alpha )\mathrm {Id}+\alpha S\).

Definition 2.5

(Negatively averaged) A mapping \(T:\mathbb {R}^n\rightarrow \mathbb {R}^n\) is \(\beta \)-negatively averaged with \(\beta \in ]0,1]\) if \(-T\) is \(\beta \)-averaged.

Remark 2.3

For notational convenience, we have included \(\alpha =1\) and \(\beta =1\) in the definitions of (negative) averagedness; both of these cases are equivalent to nonexpansiveness. For \(\alpha \in ]0,1[\) and \(\beta \in ]0,1[\), averagedness is a stronger property than nonexpansiveness. For more on negatively averaged operators, see [21], where they were introduced.

If a gradient operator \(\nabla f\) is \(\alpha \)-averaged and \(\beta \)-negatively averaged, then it must hold that \(\alpha +\beta \ge 1\). This follows immediately from Lemma 3.1.

Definition 2.6

(Cocoerciveness) A mapping \(T:\mathbb {R}^n\rightarrow \mathbb {R}^n\) is \(\delta \)-cocoercive with \(\delta > 0\) if \(\delta T\) is \(\tfrac{1}{2}\)-averaged.

Remark 2.4

This definition implies that cocoercive mappings T can be expressed as
$$\begin{aligned} T=\tfrac{1}{2\delta }(\mathrm {Id}+S), \end{aligned}$$
(1)
where S is a nonexpansive operator. Therefore, 1-cocoercivity is equivalent to \(\tfrac{1}{2}\)-averagedness (which is also called firm nonexpansiveness).

2.2.2 Function Properties

Definition 2.7

(Strongly convex) Let \(P:\mathbb {R}^n\rightarrow \mathbb {R}^n\) be positive definite. A proper and closed function \(f:\mathbb {R}^n\rightarrow \overline{\mathbb {R}}\) is \(\sigma \)-strongly convex w.r.t. \(\Vert \cdot \Vert _P\) with \(\sigma >0\) if \(f-\tfrac{\sigma }{2}\Vert \cdot \Vert _P^2\) is convex.

Remark 2.5

If f is differentiable, \(\sigma \)-strong convexity w.r.t. \(\Vert \cdot \Vert _P\) can equivalently be defined by requiring that
$$\begin{aligned} \tfrac{\sigma }{2}\Vert x-y\Vert _P^2\le f(x)-f(y)-\langle \nabla f(y),x-y\rangle \end{aligned}$$
(2)
holds for all \(x,y\in \mathbb {R}^n\). If \(P=\mathrm {Id}\), i.e., if the norm is the induced norm, we merely say that f is \(\sigma \)-strongly convex. If \(\sigma =0\), the function is convex.

There are many smoothness definitions for functions in the literature. We will use the following, which describes the existence of majorizing and minimizing quadratic functions.

Definition 2.8

(Smooth) Let \(P:\mathbb {R}^n\rightarrow \mathbb {R}^n\) be positive semidefinite. A function \(f:\mathbb {R}^n\rightarrow \mathbb {R}\) is \(\beta \)-smooth w.r.t. \(\Vert \cdot \Vert _P\) with \(\beta \ge 0\) if it is differentiable and
$$\begin{aligned} -\tfrac{\beta }{2}\Vert x-y\Vert _P^2\le f(x)-f(y)-\langle \nabla f(y),x-y\rangle \le \tfrac{\beta }{2}\Vert x-y\Vert _P^2 \end{aligned}$$
(3)
holds for all \(x,y\in \mathbb {R}^n\).

2.2.3 Connections

Our main result (see Theorem 3.1) is that the envelope function satisfies upper and lower bounds of the form
$$\begin{aligned} \tfrac{1}{2}\langle M(x-y),x-y\rangle \le f(x)-f(y)-\langle \nabla f(y),x-y\rangle \le \tfrac{1}{2}\langle L(x-y),x-y\rangle \end{aligned}$$
(4)
for all \(x,y\in \mathbb {R}^n\) and for different linear operators \(M,L:\mathbb {R}^n\rightarrow \mathbb {R}^n\). Depending on M and L, we get different properties of f and its gradient \(\nabla f\). Some of these are stated below. The results follow immediately from Lemma D.2 in Appendix D and the definitions of smoothness and strong convexity in Definitions 2.7 and 2.8, respectively.

Proposition 2.1

Assume that \(L=-M=\beta I\) with \(\beta \ge 0\) in (4). Then, (4) is equivalent to \(\nabla f\) being \(\beta \)-Lipschitz continuous.

Proposition 2.2

Assume that \(M=\sigma I\) and \(L=\beta I\) with \(0\le \sigma \le \beta \) in (4). Then, (4) is equivalent to \(\nabla f\) being \(\beta \)-Lipschitz continuous and f being \(\sigma \)-strongly convex.

Proposition 2.3

Assume that \(L=-M\) and that L is positive definite. Then, (4) is equivalent to f being 1-smooth w.r.t. \(\Vert \cdot \Vert _L\).

Proposition 2.4

Assume that M and L are positive definite. Then, (4) is equivalent to f being 1-smooth w.r.t. \(\Vert \cdot \Vert _L\) and 1-strongly convex w.r.t. \(\Vert \cdot \Vert _M\).
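As a numerical illustration of these connections (a sketch of ours, not from the paper; the matrix Q and the constants sigma and beta are arbitrary choices), the two-sided bound (4) with \(M=\sigma I\) and \(L=\beta I\) can be checked directly for a quadratic function whose Hessian has eigenvalues in \([\sigma ,\beta ]\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Quadratic f(x) = 0.5<Qx, x> with eigenvalues of Q in [sigma, beta],
# so grad f = Q is beta-Lipschitz and f is sigma-strongly convex.
sigma, beta = 0.5, 2.0
Q = np.diag([0.5, 1.2, 2.0])

def f(x): return 0.5 * x @ Q @ x
def grad_f(x): return Q @ x

# For a quadratic, f(x) - f(y) - <grad f(y), x - y> = 0.5<Q(x-y), x-y>,
# so (4) with M = sigma*I and L = beta*I should hold for all pairs (x, y).
checks = []
for _ in range(100):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    d = x - y
    bregman = f(x) - f(y) - grad_f(y) @ d
    checks.append(sigma / 2 * d @ d - 1e-12 <= bregman <= beta / 2 * d @ d + 1e-12)
bounds_hold = all(checks)
```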

3 Envelope Function

In [22, 24], the forward–backward and Douglas–Rachford envelope functions are proposed. Under certain problem data assumptions, these envelope functions have favorable properties; they are convex, they have Lipschitz continuous gradients, and their minimizers are fixed-points of the nonexpansive operator S that defines the respective algorithms. In this section, we will present a general envelope function that has the forward–backward and Douglas–Rachford envelopes as special cases. We will also provide properties of the general envelope that are sharper than what is known for the special cases.

We assume that the nonexpansive operator S that defines the algorithm is a composition of \(S_1\) and \(S_2\), i.e., \(S=S_2S_1\), where \(S_1\) and \(S_2\) satisfy the following basic assumptions (that sometimes will be sharpened or relaxed).

Assumption 3.1

Suppose that:
  1. (i)

    \(S_1:\mathbb {R}^n\rightarrow \mathbb {R}^n\) and \(S_2:\mathbb {R}^n\rightarrow \mathbb {R}^n\) are nonexpansive.

     
  2. (ii)

    \(S_1=\nabla f_1\) and \(S_2=\nabla f_2\) for some differentiable functions \(f_1:\mathbb {R}^n\rightarrow \mathbb {R}\) and \(f_2:\mathbb {R}^n\rightarrow \mathbb {R}\).

     
  3. (iii)

    \(f_1:\mathbb {R}^n\rightarrow \mathbb {R}\) is twice continuously differentiable.

     
These assumptions are met for our algorithms of interest; see Sect. 4 for details. In this general framework, we propose the following envelope function:
$$\begin{aligned} F(x):=\langle \nabla f_1(x),x\rangle -f_1(x)-f_2(\nabla f_1(x)), \end{aligned}$$
(5)
which has gradient
$$\begin{aligned} \nabla F(x)&=\nabla ^2f_1(x)x+\nabla f_1(x)-\nabla f_1(x)-\nabla ^2f_1(x)\nabla f_2(\nabla f_1(x))\nonumber \\&=\nabla ^2f_1(x)(x-\nabla f_2(\nabla f_1(x)))\nonumber \\&=\nabla ^2f_1(x)(x-S_2S_1x). \end{aligned}$$
(6)
If the Hessian \(\nabla ^2f_1(x)\) is nonsingular for all x, then the set of stationary points of the envelope coincides with the fixed-points of \(S_2S_1\).

Proposition 3.1

Suppose that Assumption 3.1 holds and that \(\nabla ^2f_1(x)\) is nonsingular for all \(x\in \mathbb {R}^n\). Let
$$\begin{aligned} X^\star&:= \{x\in \mathbb {R}^n:\nabla F(x)=0\},&\mathrm{{fix}}(S_2S_1)=\{x\in \mathbb {R}^n:S_2S_1x=x\}. \end{aligned}$$
Then, \(X^\star =\mathrm{{fix}}(S_2S_1)\).

Proof

The statement follows trivially from (6). \(\square \)
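To illustrate the general construction (a sketch of ours, not from the paper; the specific choices of \(f_1\) and \(f_2\) are hypothetical), take \(f_1(x)=\sum _i\log \cosh x_i\), so that \(S_1=\nabla f_1=\tanh \) is nonexpansive with everywhere nonsingular Hessian, and \(f_2=0.4\Vert \cdot \Vert ^2\), so that \(S_2=0.8\,\mathrm {Id}\) is nonexpansive. The gradient formula (6) and Proposition 3.1 can then be checked numerically:

```python
import numpy as np

# f1(x) = sum_i log cosh x_i  =>  S1 = grad f1 = tanh (nonexpansive),
# with Hessian diag(sech^2 x) nonsingular for every x.
# f2(y) = 0.4 ||y||^2         =>  S2 = grad f2 = 0.8*Id (nonexpansive).
def f1(x): return np.sum(np.log(np.cosh(x)))
def S1(x): return np.tanh(x)
def f2(y): return 0.4 * y @ y
def S2(y): return 0.8 * y

def F(x):        # envelope (5)
    return S1(x) @ x - f1(x) - f2(S1(x))

def grad_F(x):   # formula (6): Hess f1(x) applied to (x - S2 S1 x)
    hess_diag = 1.0 / np.cosh(x) ** 2
    return hess_diag * (x - S2(S1(x)))

rng = np.random.default_rng(1)
x = rng.standard_normal(4)

# finite-difference check of (6)
eps = 1e-6
fd = np.array([(F(x + eps * e) - F(x - eps * e)) / (2 * eps)
               for e in np.eye(4)])
grad_matches = np.allclose(fd, grad_F(x), atol=1e-6)

# x = 0 is the unique fixed point of S2 S1 = 0.8*tanh and, consistent with
# Proposition 3.1, a stationary point of the envelope.
stationary_at_fixed_point = np.allclose(grad_F(np.zeros(4)), 0.0)
```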

In Sect. 4, we show that the forward–backward and Douglas–Rachford envelopes are special cases of (5). In this section, we will provide properties of the general envelope under the following restriction to Assumption 3.1.

Assumption 3.2

Suppose that Assumption 3.1 holds and that, in addition, \(S_1:\mathbb {R}^n\rightarrow \mathbb {R}^n\) is affine, i.e., \(S_1x=Px+q\) and \(f_1(x)=\tfrac{1}{2}\langle Px,x\rangle +\langle q,x\rangle \), where \(P\in \mathbb {R}^{n\times n}\) is a self-adjoint nonexpansive linear operator and \(q\in \mathbb {R}^n\).

Remark 3.1

That P is a self-adjoint nonexpansive linear operator means that it is symmetric with eigenvalues in the interval \([-1,1]\).

When \(S_1=\nabla f_1=P(\cdot )+q\) is affine, the first two terms in the envelope function definition in (5) satisfy
$$\begin{aligned} \langle \nabla f_1(x),x\rangle -f_1(x)=\langle Px+q,x\rangle -\left( \tfrac{1}{2}\langle Px,x\rangle +\langle q,x\rangle \right) =\tfrac{1}{2}\langle Px,x\rangle . \end{aligned}$$
Therefore, the general envelope function in (5) reduces to
$$\begin{aligned} F(x)=\tfrac{1}{2}\langle Px,x\rangle - f_2(\nabla f_1(x)) \end{aligned}$$
(7)
and its gradient (6) becomes
$$\begin{aligned} \nabla F(x) = P(x-S_2S_1x). \end{aligned}$$
(8)
The remainder of this section is devoted to providing smoothness and convexity properties of the envelope function under Assumption 3.2.

3.1 Basic Properties of the Envelope Function

The following two results are special cases and direct corollaries of a more general result in Theorem 3.1, to be presented later. Proofs are therefore omitted.

Proposition 3.2

Suppose that Assumption 3.2 holds. Then, the gradient of F is 2-Lipschitz continuous. That is, \(\nabla F\) satisfies
$$\begin{aligned} \Vert \nabla F(x)-\nabla F(y)\Vert \le 2\Vert x-y\Vert \end{aligned}$$
for all \(x,y\in \mathbb {R}^n\).

Proposition 3.3

Suppose that Assumption 3.2 holds and that P, which defines the linear part of \(S_1\), is positive semidefinite. Then, F is convex.

If P is positive semidefinite, then the envelope function F is convex and differentiable with a Lipschitz continuous gradient. This implies, e.g., that all stationary points are minimizers. If P is positive definite we know from Proposition 3.1 that the set of stationary points coincides with the fixed-point set of \(S=S_2S_1\). Therefore, a fixed-point to \(S_2S_1\) can be found by minimizing the smooth convex envelope function F.
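The following sketch (our own illustration; P, q, and the box constraint are arbitrary choices) carries this out: it minimizes the envelope (7) by plain gradient descent and recovers a fixed-point of \(S_2S_1\). Here \(S_2\) is the projection onto a box, which is a proximal operator and hence the gradient of a convex function (cf. Sect. 4.1):

```python
import numpy as np

# Affine S1: S1 x = P x + q with P positive definite and nonexpansive.
P = np.diag([0.9, 0.6])
q = np.array([2.0, -0.3])

# S2: projection onto the box C = [-1, 1]^2, the prox of the indicator of C.
# f2 below is the corresponding conjugate r*_{iota_C}, coordinatewise.
def S2(y): return np.clip(y, -1.0, 1.0)

def f2(y):
    return np.sum(np.where(np.abs(y) <= 1, 0.5 * y**2, np.abs(y) - 0.5))

def F(x):        # envelope (7); the q-terms cancel as shown in the text
    return 0.5 * x @ P @ x - f2(P @ x + q)

def grad_F(x):   # gradient (8)
    return P @ (x - S2(P @ x + q))

# P is positive semidefinite, so F is convex (Proposition 3.3), and grad F
# is 2-Lipschitz (Proposition 3.2): gradient descent with step 1/2 converges.
x = np.zeros(2)
for _ in range(1000):
    x = x - 0.5 * grad_F(x)

# The stationary point found is a fixed point of S2 S1 (Proposition 3.1);
# for this data it can be computed by hand: x* = (1, -0.75).
is_fixed_point = np.allclose(S2(P @ x + q), x, atol=1e-8)
found_expected = np.allclose(x, [1.0, -0.75], atol=1e-6)
```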

3.2 Finer Properties of the Envelope Function

In this section, we establish sharp upper and lower bounds for the envelope function (7). These results use stronger assumptions on \(S_2\) than nonexpansiveness, namely that \(S_2\) is \(\alpha \)-averaged and \(\beta \)-negatively averaged:

Assumption 3.3

The operator \(S_2\) is \(\alpha \)-averaged and \(\beta \)-negatively averaged with \(\alpha \in ]0,1]\) and \(\beta \in ]0,1]\).

Before we proceed, we state a result on how averaged and negatively averaged gradient operators can equivalently be characterized. The result is proven in Appendix A.

Lemma 3.1

Assume that f is differentiable. Then, \(\nabla f\) is \(\alpha \)-averaged with \(\alpha \in ]0,1]\) and \(\beta \)-negatively averaged with \(\beta \in ]0,1]\) if and only if
$$\begin{aligned} -\tfrac{2\alpha -1}{2}\Vert x-y\Vert ^2\le f(x)-f(y)-\langle \nabla f(y),x-y\rangle \le \tfrac{2\beta -1}{2}\Vert x-y\Vert ^2 \end{aligned}$$
(9)
holds for all \(x,y\in \mathbb {R}^n\), which holds if and only if
$$\begin{aligned} -\,(2\alpha -1)\Vert x-y\Vert ^2\le \langle \nabla f(x)-\nabla f(y),x-y\rangle \le (2\beta -1)\Vert x-y\Vert ^2 \end{aligned}$$
(10)
holds for all \(x,y\in \mathbb {R}^n\).

These properties relate to smoothness and strong convexity properties of f. More precisely, they imply that f is \(\max ((2\alpha -1),(2\beta -1))\)-smooth and, if \(\alpha >\tfrac{1}{2}\), \((2\alpha -1)\)-strongly convex. With this interpretation in mind, we state the main theorem.

Theorem 3.1

Suppose that Assumptions 3.2 and 3.3 hold. Further, let \(\delta _{\alpha }=2\alpha -1\) and \(\delta _{\beta }=2\beta -1\). Then, the envelope function F in (7) satisfies
$$\begin{aligned} F(x)-F(y)-\langle \nabla F(y),x-y\rangle \ge \tfrac{1}{2} \left\langle \left( P-\delta _{\beta }P^2\right) (x-y),x-y\right\rangle \end{aligned}$$
and
$$\begin{aligned} F(x)-F(y)-\langle \nabla F(y),x-y\rangle \le \tfrac{1}{2}\left\langle \left( P+\delta _{\alpha }P^2\right) (x-y),x-y\right\rangle \end{aligned}$$
for all \(x,y\in \mathbb {R}^n\). Furthermore, the bounds are tight.

A proof of this result is found in “Appendix B”.
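For a quadratic \(f_2\), both the envelope F and the bounds of Theorem 3.1 are quadratic forms, so the theorem reduces to two semidefinite inequalities that can be checked by eigenvalue computations. The following sketch (our own; P and H are arbitrary illustrative choices, with H constructed so that \(S_2=\nabla f_2=H\) satisfies (10)) also exhibits tightness: each bound is attained along an eigendirection of P:

```python
import numpy as np

alpha, beta = 0.75, 0.9                  # S2 alpha-averaged, beta-negatively averaged
d_a, d_b = 2 * alpha - 1, 2 * beta - 1   # delta_alpha = 0.5, delta_beta = 0.8

P = np.diag([0.9, 0.7])   # self-adjoint, nonexpansive, positive definite
H = np.diag([d_b, -d_a])  # Hessian of f2, eigenvalues in [-d_a, d_b]: (10) holds

# With f2 quadratic, F in (7) is quadratic with Hessian P - P H P, so the
# Bregman distance of F equals 0.5 <(P - P H P) d, d>.
hess_F = P - P @ H @ P

lower = P - d_b * P @ P   # Theorem 3.1 lower-bound operator
upper = P + d_a * P @ P   # Theorem 3.1 upper-bound operator

eig_lo = np.linalg.eigvalsh(hess_F - lower)
eig_hi = np.linalg.eigvalsh(upper - hess_F)

bounds_hold = eig_lo.min() >= -1e-12 and eig_hi.min() >= -1e-12
# Tightness: the extreme eigenvalues of H make each inequality an equality
# along one eigendirection of P.
lower_tight = np.isclose(eig_lo.min(), 0.0)
upper_tight = np.isclose(eig_hi.min(), 0.0)
```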

Utilizing the connections established in Sect. 2.2.3, we next derive different properties of the envelope function. In particular, we provide conditions under which the envelope function is convex and strongly convex.

Corollary 3.1

Suppose that the assumptions of Theorem 3.1 hold and that P is positive semidefinite. Then,
$$\begin{aligned} \tfrac{1}{2}\Vert x-y\Vert _{P-\delta _{\beta }P^2}^2\le F(x)-F(y)-\langle \nabla F(y),x-y\rangle \le \tfrac{1}{2}\Vert x-y\Vert _{P+\delta _{\alpha }P^2}^2 \end{aligned}$$
and F is convex and 1-smooth w.r.t. \(\Vert \cdot \Vert _{P+\delta _{\alpha } P^2}\). If in addition P is positive definite and either of the following holds:
  1. (i)

    P is contractive,

     
  2. (ii)

    \(\beta \in ]0,1[\), i.e., \(\delta _{\beta }\in ]-1,1[\),

     
then F is 1-strongly convex w.r.t. \(\Vert \cdot \Vert _{P-\delta _{\beta }P^2}\) and 1-smooth w.r.t. \(\Vert \cdot \Vert _{P+\delta _{\alpha } P^2}\).

Proof

The results follow from Theorem 3.1, the definition of (strong) convexity, and by utilizing Lemma D.3 in “Appendix D” to show that the smallest eigenvalue of \(P-\delta _{\beta }P^2\) is nonnegative and positive, respectively. \(\square \)

Less sharp, but unscaled, versions of these bounds can easily be obtained from Theorem 3.1.

Corollary 3.2

Suppose that the assumptions of Theorem 3.1 hold. Then,
$$\begin{aligned} \tfrac{\beta _l}{2}\Vert x-y\Vert ^2\le F(x)-F(y)-\langle \nabla F(y),x-y\rangle \le \tfrac{\beta _u}{2}\Vert x-y\Vert ^2, \end{aligned}$$
where \(\beta _l=\lambda _{\min }(P-\delta _{\beta }P^2)\) and \(\beta _u=\lambda _{\max }(P+\delta _{\alpha }P^2)\).

Values of \(\beta _l\) and \(\beta _u\) for different assumptions on P, \(\delta _{\alpha }\) and \(\delta _{\beta }\) can be obtained from Lemma D.3 in “Appendix D”.

The results in Theorem 3.1 and its corollaries are stated for \(\alpha \)-averaged and \(\beta \)-negatively averaged operators \(S_2=\nabla f_2\). Using Lemmas 3.1 and D.2, we conclude that \(\delta \)-contractive operators are \(\alpha \)-averaged and \(\beta \)-negatively averaged with \(\alpha \) and \(\beta \) satisfying \(\delta =\delta _{\alpha }=\delta _{\beta }\). This gives the following result.

Proposition 3.4

Suppose that Assumption 3.2 holds and that \(S_2\) is \(\delta \)-Lipschitz continuous with \(\delta \in [0,1]\). Then, all results in this section hold with \(\delta _{\beta }\) and \(\delta _{\alpha }\) replaced by \(\delta \).

If instead \(S_2=\nabla f_2\) is \(\tfrac{1}{\delta }\)-cocoercive, it can be shown (see [28, Definition 4.4] and [30, Theorem 2.1.5]) that
$$\begin{aligned} 0\le f_2(x)-f_2(y)-\langle \nabla f_2(y),x-y\rangle \le \tfrac{\delta }{2}\Vert x-y\Vert ^2. \end{aligned}$$
In view of Lemma 3.1, we can state the following result.

Proposition 3.5

Suppose that Assumption 3.2 holds and that \(S_2\) is \(\tfrac{1}{\delta }\)-cocoercive with \(\delta \in ]0,1]\). Then, all results in this section hold with \(\delta _{\beta }=\delta \) and \(\delta _{\alpha }=0\).

3.3 Majorization–Minimization Interpretation of Averaged Iteration

As noted in [22, 24], the forward–backward and Douglas–Rachford splitting methods are variable metric gradient methods applied to their respective envelope functions. In our setting, with \(S_1\) being affine, they reduce to fixed-metric scaled gradient methods. In this section, we provide a different interpretation. We show that a step in the basic iteration is obtained by performing majorization–minimization on the envelope. The majorizing function is closely related to the upper bound provided in Corollary 3.1.

The interpretation is valid under the assumption that P is positive definite, besides being nonexpansive. This implies that the envelope is convex, see Corollary 3.1. It is straightforward to verify that \(P+\delta _{\alpha }P^2\preceq (1+\delta _{\alpha })P\). Therefore, we can construct the following more conservative upper bound to the envelope, compared to Corollary 3.1:
$$\begin{aligned} F(x)\le F(y)+\langle \nabla F(y),x-y\rangle +\tfrac{1+\delta _{\alpha }}{2}\Vert x-y\Vert _{P}^2. \end{aligned}$$
(11)
Minimizing this majorizer, evaluated at \(y=x^k\), in every iteration k gives
$$\begin{aligned} x^{k+1}&= \mathop {{\mathrm{argmin}}}\limits _{x}\{F(x^k)+\langle \nabla F(x^{k}),x-{x}^{k}\rangle +\frac{1+\delta _{\alpha }}{2}\Vert x-x^k\Vert _P^2\}\\&=x^{k}-\frac{1}{1+\delta _{\alpha }}P^{-1}\nabla F(x^k)\\&=x^{k}-\frac{1}{1+\delta _{\alpha }} P^{-1}P(x^k-S_2S_1x^k)\\&=x^{k}-\frac{1}{1+\delta _{\alpha }}(x^k-S_2S_1x^k)\\&=\left( 1-\frac{1}{1+\delta _{\alpha }}\right) x^{k}+\frac{1}{1+\delta _{\alpha }} S_2S_1x^k, \end{aligned}$$
which is the basic method with \(\tfrac{1}{1+\delta _{\alpha }}\)-averaging. It is well known that the gradient method converges with step-length \(\alpha \in ]0,\tfrac{2}{L}[\), where L is a Lipschitz constant of the gradient. In this case, the upper bound (11) guarantees a Lipschitz constant for \(\nabla F\) of \(L=1+\delta _{\alpha }\) in the \(\Vert \cdot \Vert _P\)-norm; see Lemma D.2. Selecting a step-length within the allowed range yields an averaged iteration with \(\tfrac{1}{1+\delta _{\alpha }}\) replaced by \(\alpha \in ]0,\tfrac{2}{1+\delta _{\alpha }}[\).
The upper bound (11) used to arrive at the averaged iteration is not sharp. Using instead the sharp majorizer from Corollary 3.1, yields the following algorithm:
$$\begin{aligned} x^{k+1}&= \mathop {{\mathrm{argmin}}}\limits _{x}\left\{ F(x^k)+\langle \nabla F(x^k),x-x^k\rangle +\tfrac{1}{2}\Vert x-x^k\Vert _{P+\delta _{\alpha }P^2}^2\right\} \\&=x^k-(\mathrm {Id}+\delta _{\alpha }P)^{-1}P^{-1}\nabla F(x^k)\\&=x^k-(\mathrm {Id}+\delta _{\alpha }P)^{-1} P^{-1}P(x^k-S_2S_1x^k)\\&=x^k-(\mathrm {Id}+\delta _{\alpha }P)^{-1}(x^k-S_2S_1x^k)\\&=(\mathrm {Id}-(\mathrm {Id}+\delta _{\alpha }P)^{-1})x^k+(\mathrm {Id}+\delta _{\alpha }P)^{-1} S_2S_1x^k. \end{aligned}$$
This differs from the basic averaged iteration in that \((1+\delta _{\alpha })^{-1}\mathrm {Id}\) in the basic method is replaced by \((\mathrm {Id}+\delta _{\alpha }P)^{-1}\). The drawback of using this tighter majorizer is that the iterations become more expensive.
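A small numerical sketch (ours; the linear \(S_2\) and all data are illustrative) confirms both observations: minimizing the conservative majorizer (11) reproduces the averaged iteration exactly, and the sharp-majorizer iteration converges to the same fixed-point:

```python
import numpy as np

# Illustrative affine setup: S1 x = P x + q, and S2 = grad f2 linear and
# 0.5-contractive, so delta_alpha = delta_beta = 0.5 (Proposition 3.4).
P = np.diag([0.9, 0.6])
q = np.array([0.3, -0.2])
A = np.diag([0.5, -0.3])          # S2 y = A y
d_a = 0.5                         # delta_alpha

def S(x): return A @ (P @ x + q)          # S2 S1
def grad_F(x): return P @ (x - S(x))      # gradient (8)

x = np.array([1.0, 1.0])

# Minimizing the conservative majorizer (11) at y = x ...
mm_step = x - (1 / (1 + d_a)) * np.linalg.solve(P, grad_F(x))
# ... reproduces the basic iteration with 1/(1+delta_alpha)-averaging:
avg_step = (1 - 1 / (1 + d_a)) * x + (1 / (1 + d_a)) * S(x)
steps_agree = np.allclose(mm_step, avg_step)

# The sharp majorizer replaces (1+d_a)^{-1} Id by (Id + d_a P)^{-1}; both
# iterations converge to the same fixed point of S2 S1.
M = np.eye(2) + d_a * P
y = np.array([1.0, 1.0])
z = np.array([1.0, 1.0])
for _ in range(300):
    y = (1 - 1 / (1 + d_a)) * y + (1 / (1 + d_a)) * S(y)
    z = z - np.linalg.solve(M, np.linalg.solve(P, grad_F(z)))
same_limit = np.allclose(y, z, atol=1e-10) and np.allclose(S(y), y, atol=1e-10)
```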

Neither of these methods is likely the most efficient way to find a stationary point of the envelope function (or, equivalently, a fixed-point of \(S_2S_1\)). At least in the convex setting (for the envelope), there are numerous alternative methods for minimizing smooth functions, such as truncated Newton methods, quasi-Newton methods, and nonlinear conjugate gradient methods. See [31] for an overview of such methods and [22, 23] for some of these methods applied to the forward–backward envelope. Evaluating which of these is most efficient and devising new methods to improve performance is outside the scope of this paper.

4 Special Cases

In this section, we show that our envelope in (5) has four known special cases, namely the Moreau envelope [32], the forward–backward envelope [22, 23], the Douglas–Rachford envelope [24], and the ADMM envelope [27] (which is a special case of the Douglas–Rachford envelope).

We also show that our envelope bounds for \(S_1=\nabla f_1\) being affine coincide with or sharpen corresponding results in the literature for the special cases.

4.1 Algorithm Building Blocks

Before we present the special cases, we introduce some functions whose gradients are the operators used in the respective underlying methods. Most importantly, we introduce a function whose gradient is the proximal operator:
$$\begin{aligned} \mathrm{{prox}}_{\gamma f}(z):=\mathop {{\mathrm{argmin}}}\limits _{x}\{f(x)+\tfrac{1}{2\gamma }\Vert x-z\Vert ^2\}, \end{aligned}$$
where \(\gamma >0\) is a parameter.

Proposition 4.1

Suppose that \(f:\mathbb {R}^n\rightarrow \mathbb {R}\cup \{\infty \}\) is proper, closed, and convex and that \(\gamma >0\). The proximal operator \(\mathrm{{prox}}_{\gamma f}\) then satisfies
$$\begin{aligned} \mathrm{{prox}}_{\gamma f} = \nabla r_{\gamma f}^*, \end{aligned}$$
where \(r_{\gamma f}^*\) is the conjugate of
$$\begin{aligned} r_{\gamma f}(x):=\gamma f(x)+\tfrac{1}{2}\Vert x\Vert ^2. \end{aligned}$$
(12)
The reflected proximal operator
$$\begin{aligned} R_{\gamma f}:=2\mathrm{{prox}}_{\gamma f}-\mathrm {Id}\end{aligned}$$
(13)
satisfies \(R_{\gamma f}=\nabla p_{\gamma f}\), where
$$\begin{aligned} p_{\gamma f} := 2r_{\gamma f}^*-\tfrac{1}{2}\Vert \cdot \Vert ^2. \end{aligned}$$
(14)

This proximal map interpretation is from [33, Theorems 31.5, 16.4] and implies that the proximal operator is the gradient of a convex function. The reflected proximal operator interpretation follows trivially from the prox interpretation.

The other algorithm building block that is used in the considered algorithms is the gradient step. The gradient step operator is the gradient of the function \(\tfrac{1}{2}\Vert x\Vert ^2-\gamma f(x)\), i.e.,
$$\begin{aligned} (x-\gamma \nabla f(x))=\nabla \left( \tfrac{1}{2}\Vert x\Vert ^2-\gamma f(x)\right) . \end{aligned}$$
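As a numerical sanity check (our own illustration; the choice \(f=|\cdot |\), the grid resolution, and the tolerances are arbitrary), the prox definition above and the identity \(\mathrm{{prox}}_{\gamma f}=\nabla r_{\gamma f}^*\) of Proposition 4.1 can be verified in one dimension, where \(\mathrm{{prox}}_{\gamma |\cdot |}\) has the well-known soft-thresholding closed form:

```python
import numpy as np

gamma = 0.7
def f(x): return np.abs(x)                    # 1-D test function f = |.|

grid = np.linspace(-10, 10, 200001)           # brute-force search grid

def prox_grid(z):                             # argmin in the prox definition
    return grid[np.argmin(f(grid) + (grid - z) ** 2 / (2 * gamma))]

def r(x): return gamma * f(x) + 0.5 * x ** 2  # r_{gamma f} as in (12)
def r_conj(y):                                # conjugate computed via grid sup
    return np.max(y * grid - r(grid))

def soft(z):                                  # closed form of prox_{gamma |.|}
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

zs = np.array([-2.3, -0.4, 0.0, 0.9, 3.1])
prox_vals = np.array([prox_grid(z) for z in zs])
prox_matches_soft = np.allclose(prox_vals, soft(zs), atol=1e-3)

# Proposition 4.1: prox_{gamma f} = grad r*_{gamma f}; check the gradient
# of the conjugate by central finite differences.
eps = 1e-4
fd = np.array([(r_conj(z + eps) - r_conj(z - eps)) / (2 * eps) for z in zs])
prox_is_grad_of_conjugate = np.allclose(fd, soft(zs), atol=1e-3)
```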

4.2 The Proximal Point Algorithm

The proximal point algorithm solves problems of the form
$$\begin{aligned} {\hbox {minimize }} f(x), \end{aligned}$$
where \(f:\mathbb {R}^n\rightarrow \mathbb {R}\cup \{\infty \}\) is proper, closed, and convex.
The algorithm repeatedly applies the proximal operator of f and is given by
$$\begin{aligned} x^{k+1} = \mathrm{{prox}}_{\gamma f}(x^k), \end{aligned}$$
(15)
where \(\gamma >0\) is a parameter. This algorithm is mostly of conceptual interest since it is often as computationally demanding to evaluate the prox as to minimize the function f itself.
Its envelope function, which is called the Moreau envelope [32], is a scaled version of the envelope F in (7). The scaling factor is \(\gamma ^{-1}\) and the Moreau envelope \(f^{\gamma }\) is obtained by letting \(S_1x=\nabla f_1(x)=x\), i.e., \(P=\mathrm {Id}\) and \(q=0\), and \(f_2=r_{\gamma f}^*\) in (7), where \(r_{\gamma f}\) is defined in (12):
$$\begin{aligned} f^{\gamma }(x)=\gamma ^{-1}F(x)=\gamma ^{-1}\left( \tfrac{1}{2}\Vert x\Vert ^2-r_{\gamma f}^{*}(x)\right) . \end{aligned}$$
(16)
Its gradient satisfies
$$\begin{aligned} \nabla f^{\gamma }(x)=\gamma ^{-1}\left( x-\mathrm{{prox}}_{\gamma f}(x)\right) . \end{aligned}$$
The following properties of the Moreau envelope follow directly from Corollary 3.2 and Proposition 3.5 since the proximal operator is 1-cocoercive (see Remark 2.4 and [28, Proposition 12.27]).

Proposition 4.2

The Moreau envelope \(f^{\gamma }\) in (16) is differentiable and convex and \(\nabla f^{\gamma }\) is \(\gamma ^{-1}\)-Lipschitz continuous.

This coincides with previously known properties of the Moreau envelope, see [28, Chapter 12].
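For \(f=|\cdot |\), the Moreau envelope has the well-known Huber closed form, which makes Proposition 4.2 easy to check numerically. The following sketch (an illustration of ours; the grid and tolerances are arbitrary) verifies the gradient formula above and the \(\gamma ^{-1}\)-Lipschitz bound:

```python
import numpy as np

gamma = 0.5

def prox_abs(x):            # prox of |.| is soft thresholding
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def moreau(x):              # f^gamma in (16), evaluated at u = prox(x);
    p = prox_abs(x)         # for f = |.| this is the Huber function
    return np.abs(p) + (x - p) ** 2 / (2 * gamma)

def grad_moreau(x):         # gamma^{-1}(x - prox_{gamma f}(x)), as stated above
    return (x - prox_abs(x)) / gamma

xs = np.linspace(-3, 3, 601)

# finite-difference check of the gradient formula (interior points)
fd = np.gradient(moreau(xs), xs)
grad_formula_ok = np.allclose(fd[1:-1], grad_moreau(xs)[1:-1], atol=1e-2)

# grad f^gamma is gamma^{-1}-Lipschitz (Proposition 4.2): the difference
# quotients of the gradient never exceed 1/gamma.
g = grad_moreau(xs)
quotients = np.abs(np.diff(g)) / np.diff(xs)
lipschitz_ok = np.max(quotients) <= 1 / gamma + 1e-9
```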

4.3 Forward–Backward Splitting

Forward–backward splitting solves problems of the form
$$\begin{aligned} {\hbox {minimize }} f(x)+g(x), \end{aligned}$$
(17)
where \(f:\mathbb {R}^n\rightarrow \mathbb {R}\) is convex with an L-Lipschitz (or equivalently \(\tfrac{1}{L}\)-cocoercive) gradient, and \(g:\mathbb {R}^n\rightarrow \mathbb {R}\cup \{\infty \}\) is proper, closed, and convex.
The algorithm performs a forward step followed by a backward step, and is given by
$$\begin{aligned} x^{k+1}=\mathrm{{prox}}_{\gamma g}(\mathrm {Id}-\gamma \nabla f)x^k, \end{aligned}$$
(18)
where \(\gamma \in ]0,\tfrac{2}{L}[\) is a parameter.
The envelope function, which is called the forward–backward envelope [22, 23], is a scaled version of the envelope F in (5) and applies when f is twice continuously differentiable. The scaling factor is \(\gamma ^{-1}\) and the forward–backward envelope is obtained by letting \(f_1=\tfrac{1}{2}\Vert \cdot \Vert ^2-\gamma f\) and \(f_2=r_{\gamma g}^*\) in (5), where \(r_{\gamma g}\) is defined in (12). The resulting forward–backward envelope function is
$$\begin{aligned} F_{\gamma }^\mathrm{{FB}}(x)=\gamma ^{-1}\left( \langle x-\gamma \nabla f(x),x\rangle -\left( \tfrac{1}{2}\Vert x\Vert ^2-\gamma f(x)\right) -r_{\gamma g}^*(x-\gamma \nabla f(x))\right) . \end{aligned}$$
The gradient of this function is
$$\begin{aligned} \nabla F_{\gamma }^\mathrm{{FB}}(x)&=\gamma ^{-1}\big ((\mathrm {Id}-\gamma \nabla ^2 f(x))x+(x-\gamma \nabla f(x))-(x-\gamma \nabla f(x))\\&\quad -(\mathrm {Id}-\gamma \nabla ^2 f(x))\mathrm{{prox}}_{\gamma g}(x-\gamma \nabla f(x))\big )\\&=\gamma ^{-1}(\mathrm {Id}-\gamma \nabla ^2 f(x))\left( x-\mathrm{{prox}}_{\gamma g}(x-\gamma \nabla f(x))\right) , \end{aligned}$$
which coincides with the gradient in [22, 23]. As described in [22, 23], the stationary points of the envelope coincide with the fixed-points of the mapping \(\mathrm{{prox}}_{\gamma g}(x-\gamma \nabla f(x))\) if \((\mathrm {Id}-\gamma \nabla ^2 f(x))\) is nonsingular.
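The gradient expression can be sanity-checked numerically. The sketch below evaluates the forward–backward envelope through its model-minimization form (as in [22, 23]) for a quadratic f and \(g=\lambda \Vert \cdot \Vert _1\), and compares the gradient formula above with central finite differences; the problem data and tolerances are our own arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n)); H = M @ M.T   # positive definite Hessian
h = rng.standard_normal(n); lam = 0.1

f = lambda x: 0.5 * x @ H @ x + h @ x
grad_f = lambda x: H @ x + h
prox_g = lambda x, g: np.sign(x) * np.maximum(np.abs(x) - g * lam, 0.0)
g_fun = lambda x: lam * np.abs(x).sum()

gamma = 0.5 / np.linalg.eigvalsh(H).max()

def F_FB(x):
    # value of the minimization model defining the FB envelope,
    # evaluated at its minimizer z = prox_{gamma g}(x - gamma grad f(x))
    z = prox_g(x - gamma * grad_f(x), gamma)
    return f(x) + grad_f(x) @ (z - x) + g_fun(z) + (z - x) @ (z - x) / (2 * gamma)

def grad_F_FB(x):
    # gradient formula: gamma^{-1}(I - gamma H)(x - prox_{gamma g}(x - gamma grad f(x)))
    z = prox_g(x - gamma * grad_f(x), gamma)
    return (np.eye(n) - gamma * H) @ (x - z) / gamma

x = rng.standard_normal(n)
eps = 1e-6
fd = np.array([(F_FB(x + eps * e) - F_FB(x - eps * e)) / (2 * eps)
               for e in np.eye(n)])
print(np.max(np.abs(fd - grad_F_FB(x))))
```

The printed deviation is on the order of the finite-difference error, consistent with the envelope being continuously differentiable.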

4.3.1 \(S_1\) Affine

We provide properties of the forward–backward envelope in the more restrictive setting of \(S_1=\nabla f_1=(\mathrm {Id}-\gamma \nabla f)\) being affine. This applies when f is a convex quadratic, \(f(x)=\tfrac{1}{2}\langle Hx,x\rangle +\langle h,x\rangle \) with \(H\in \mathbb {R}^{n\times n}\) positive semidefinite and \(h\in \mathbb {R}^n\). Then, \(S_1x=Px+q\) with \(P=(\mathrm {Id}-\gamma H)\) and \(q=-\gamma h\).

In this setting, the following result follows immediately from Corollary 3.1 and Proposition 3.5 (where Proposition 3.5 is invoked since \(S_2=\mathrm{{prox}}_{\gamma g}\) is 1-cocoercive, see Remark 2.4 and [28, Proposition 12.27]).

Proposition 4.3

Assume that \(f(x)=\tfrac{1}{2}\langle Hx,x\rangle +\langle h,x\rangle \) and \(\gamma \in ]0,\tfrac{1}{L}[\), where \(L=\lambda _{\max }(H)\). Then, the forward–backward envelope \(F_{\gamma }^\mathrm{{FB}}\) satisfies
$$\begin{aligned} \tfrac{1}{2\gamma }\Vert x-y\Vert _{P-P^2}^2&\le F_{\gamma }^\mathrm{{FB}}(x)-F_{\gamma }^\mathrm{{FB}}(y)-\langle \nabla F_{\gamma }^\mathrm{{FB}}(y),x-y\rangle \le \tfrac{1}{2\gamma }\Vert x-y\Vert _P^2 \end{aligned}$$
for all \(x,y\in \mathbb {R}^n\), where \(P=(\mathrm {Id}-\gamma H)\) is positive definite. If in addition \(\lambda _{\min }(H)=m>0\), then \(P-P^2\) is positive definite and \(F_{\gamma }^\mathrm{{FB}}\) is \(\gamma ^{-1}\)-strongly convex w.r.t. \(\Vert \cdot \Vert _{P-P^2}\).

Less tight bounds for the forward–backward envelope are provided next. These follow immediately from the above and Lemma D.3.

Proposition 4.4

Assume that \(f(x)=\tfrac{1}{2}\langle Hx,x\rangle +\langle h,x\rangle \), that \(\gamma \in ]0,\tfrac{1}{L}[\), where \(L=\lambda _{\max }(H)\), and that \(m=\lambda _{\min }(H)\ge 0\). Then, the forward–backward envelope \(F_{\gamma }^\mathrm{{FB}}\) is \(\gamma ^{-1}(1-\gamma m)\)-smooth and \(\min \left( (1-\gamma m)m,(1-\gamma L)L\right) \)-strongly convex (both w.r.t. the induced norm \(\Vert \cdot \Vert \)).

This result is a less tight version of Proposition 4.3, but is a slight improvement of the corresponding result in [22, Theorem 2.3]. The strong convexity moduli are the same, but our smoothness constant is a factor two smaller.
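The constants in Proposition 4.4 can be read off from the eigenvalues of \(P=\mathrm {Id}-\gamma H\): the upper bound in Proposition 4.3 gives the smoothness constant \(\gamma ^{-1}\lambda _{\max }(P)\), and the lower bound gives the strong convexity modulus \(\gamma ^{-1}\lambda _{\min }(P-P^2)\), which match the closed forms since \(\lambda (1-\gamma \lambda )\) is unimodal in \(\lambda \). A numerical sketch (random data, our own notation):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5
M = rng.standard_normal((n, n)); H = M @ M.T   # positive definite
ev = np.linalg.eigvalsh(H); m, L = ev.min(), ev.max()
gamma = 0.5 / L                                # gamma in ]0, 1/L[

P = np.eye(n) - gamma * H
p = np.linalg.eigvalsh(P)

smooth = np.max(p) / gamma                     # from the upper bound in Prop. 4.3
strong = np.min(p - p**2) / gamma              # from the lower bound in Prop. 4.3

print(np.isclose(smooth, (1 - gamma * m) / gamma))                       # True
print(np.isclose(strong, min((1 - gamma * m) * m, (1 - gamma * L) * L))) # True
```

Since \(\lambda (1-\gamma \lambda )\) attains its minimum over the spectrum at an extreme eigenvalue, the eigenvalue computation and the closed-form expression agree exactly (up to rounding).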

4.4 Douglas–Rachford Splitting

Douglas–Rachford splitting solves problems of the form
$$\begin{aligned} {\hbox {minimize }} f(x)+g(x), \end{aligned}$$
(19)
where \(f:\mathbb {R}^n\rightarrow \mathbb {R}\cup \{\infty \}\) and \(g:\mathbb {R}^n\rightarrow \mathbb {R}\cup \{\infty \}\) are proper, closed, and convex functions.
The algorithm performs two reflection steps (13), then an averaging:
$$\begin{aligned} z^{k+1}=(1-\alpha )z^k+\alpha R_{\gamma g}R_{\gamma f}z^k, \end{aligned}$$
(20)
where \(\gamma >0\) and \(\alpha \in ]0,1[\) are parameters. The objective is to find a fixed-point \(\bar{z}\) of \(R_{\gamma g}R_{\gamma f}\), from which a solution to (19) can be computed as \(\mathrm{{prox}}_{\gamma f}(\bar{z})\), see [28, Proposition 25.1].
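A small numerical sketch of iteration (20), again for \(f(x)=\tfrac{1}{2}\Vert Ax-b\Vert ^2\) and \(g=\lambda \Vert \cdot \Vert _1\) (data and names are our own choices): the prox of the quadratic f is a linear solve, and a solution is recovered from the fixed point via \(\mathrm{prox}_{\gamma f}\).

```python
import numpy as np

rng = np.random.default_rng(2)
A, b, lam = rng.standard_normal((20, 5)), rng.standard_normal(20), 0.1
gamma, alpha = 1.0, 0.5

Q = np.linalg.inv(np.eye(5) + gamma * A.T @ A)
prox_f = lambda z: Q @ (z + gamma * A.T @ b)    # prox of 0.5*||Ax-b||^2
prox_g = lambda z: np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)
R_f = lambda z: 2 * prox_f(z) - z               # reflected proximal operators (13)
R_g = lambda z: 2 * prox_g(z) - z

z = np.zeros(5)
for _ in range(2000):
    z = (1 - alpha) * z + alpha * R_g(R_f(z))   # iteration (20)

x = prox_f(z)   # solution recovered from the fixed point, cf. [28, Prop. 25.1]
print(np.linalg.norm(z - R_g(R_f(z))))          # fixed-point residual
```

Since f is strongly convex here, the fixed-point residual decays geometrically.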
The envelope function in [24], which is called the Douglas–Rachford envelope, is a scaled version of the basic envelope function F in (5) and applies when f is twice continuously differentiable and \(\nabla f\) is Lipschitz continuous. The scaling factor is \((2\gamma )^{-1}\), and the Douglas–Rachford envelope is obtained by letting \(f_1=p_{\gamma f}\), with gradient \(\nabla f_1=S_1=R_{\gamma f}\), and \(f_2 = p_{\gamma g}\) in (5), where \(p_{\gamma g}\) is defined in (14). The Douglas–Rachford envelope function becomes
$$\begin{aligned} F_{\gamma }^\mathrm{{DR}}(z)=(2\gamma )^{-1}\left( \langle R_{\gamma f}(z),z\rangle -p_{\gamma f}(z)-p_{\gamma g}(R_{\gamma f}z)\right) . \end{aligned}$$
(21)
The gradient of this function is
$$\begin{aligned} \nabla F_{\gamma }^\mathrm{{DR}}(z)&=(2\gamma )^{-1}\big (\nabla R_{\gamma f}(z)z+R_{\gamma f}(z)-R_{\gamma f}(z)-\nabla R_{\gamma f}(z)R_{\gamma g}(R_{\gamma f}(z))\big )\\&=(2\gamma )^{-1}\nabla R_{\gamma f}(z)(z-R_{\gamma g}R_{\gamma f}(z)), \end{aligned}$$
which coincides with the gradient in [24] since \(\nabla R_{\gamma f}=2\nabla \mathrm{{prox}}_{\gamma f}-\mathrm {Id}\) and
$$\begin{aligned} z-R_{\gamma g}R_{\gamma f}z&=z-2\mathrm{{prox}}_{\gamma g}(2\mathrm{{prox}}_{\gamma f}(z)-z)+2\mathrm{{prox}}_{\gamma f}(z)-z\\&=2(\mathrm{{prox}}_{\gamma f}(z)-\mathrm{{prox}}_{\gamma g}(2\mathrm{{prox}}_{\gamma f}(z)-z)). \end{aligned}$$
As described in [24], the stationary points of the envelope coincide with the fixed-points of \(R_{\gamma g}R_{\gamma f}\) if \(\nabla R_{\gamma f}\) is nonsingular.

4.4.1 \(S_1\) Affine

We state properties of the Douglas–Rachford envelope in the more restrictive setting of \(S_1=R_{\gamma f}\) being affine. This is obtained for convex quadratic f:
$$\begin{aligned} f(x)=\tfrac{1}{2}\langle Hx,x\rangle +\langle h,x\rangle , \end{aligned}$$
where H is positive semidefinite. The operator \(S_1\) becomes
$$\begin{aligned} S_1(z) = R_{\gamma f}(z) = 2(\mathrm {Id}+\gamma H)^{-1}(z-\gamma h)-z, \end{aligned}$$
which confirms that it is affine. We implicitly define P and q through the relation \(S_1=R_{\gamma f}=P(\cdot )+q\), and note that they are given by the expressions \(P=2(\mathrm {Id}+\gamma H)^{-1}-\mathrm {Id}\) and \(q=-2\gamma (\mathrm {Id}+\gamma H)^{-1} h\), respectively.
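The expressions for P and q can be confirmed numerically: for a randomly generated positive definite H, the reflected proximal operator of the quadratic f is evaluated directly and compared with the affine map \(Pz+q\). A short sketch (random data and NumPy are our own choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
M = rng.standard_normal((n, n)); H = M @ M.T   # positive definite
h = rng.standard_normal(n)
gamma = 0.5 / np.linalg.eigvalsh(H).max()

# prox of the quadratic f(x) = 0.5<Hx,x> + <h,x>:
# (I + gamma H) x = z - gamma h
prox_f = lambda z: np.linalg.solve(np.eye(n) + gamma * H, z - gamma * h)
R_f = lambda z: 2 * prox_f(z) - z              # reflected prox (13)

P = 2 * np.linalg.inv(np.eye(n) + gamma * H) - np.eye(n)
q = -2 * gamma * np.linalg.solve(np.eye(n) + gamma * H, h)

z = rng.standard_normal(n)
print(np.allclose(R_f(z), P @ z + q))  # True
```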

In this setting, the following result follows immediately from Corollary 3.1 since \(S_2=R_{\gamma g}\) is nonexpansive (1-averaged and 1-negatively averaged).

Proposition 4.5

Assume that \(f(x)=\tfrac{1}{2}\langle Hx,x\rangle +\langle h,x\rangle \) and \(\gamma \in ]0,\tfrac{1}{L}[\), where \(L=\lambda _{\max }(H)\). Then, the Douglas–Rachford envelope \(F_{\gamma }^\mathrm{{DR}}\) satisfies
$$\begin{aligned} \tfrac{1}{4\gamma }\Vert z-y\Vert _{P-P^2}^2&\le F_{\gamma }^\mathrm{{DR}}(z)-F_{\gamma }^\mathrm{{DR}}(y)-\langle \nabla F_{\gamma }^\mathrm{{DR}}(y),z-y\rangle \le \tfrac{1}{4\gamma }\Vert z-y\Vert _{P+P^2}^2 \end{aligned}$$
for all \(y,z\in \mathbb {R}^n\), where \(P=2(\mathrm {Id}+\gamma H)^{-1}-\mathrm {Id}\) is positive definite. If in addition \(\lambda _{\min }(H)=m>0\), then \(P-P^2\) is positive definite and \(F_{\gamma }^\mathrm{{DR}}\) is \((2\gamma )^{-1}\)-strongly convex w.r.t. \(\Vert \cdot \Vert _{P-P^2}\).

The following less tight characterization of the Douglas–Rachford envelope follows from the above and Lemma D.3.

Proposition 4.6

Assume that \(f(x)=\tfrac{1}{2}\langle Hx,x\rangle +\langle h,x\rangle \), that \(\gamma \in ]0,\tfrac{1}{L}[\), where \(L=\lambda _{\max }(H)\), and that \(m=\lambda _{\min }(H)\ge 0\). Then, the Douglas–Rachford envelope \(F_{\gamma }^\mathrm{{DR}}\) is \(\tfrac{1-\gamma m}{(1+\gamma m)^2}\gamma ^{-1}\)-smooth and \(\min \left( \tfrac{(1-\gamma m) m}{(1+\gamma m)^2},\tfrac{(1-\gamma L)L}{(1+\gamma L)^2}\right) \)-strongly convex.

This result is more conservative than the one in Proposition 4.5, but improves on [24, Theorem 2]. The strong convexity modulus coincides with the corresponding one in [24, Theorem 2]. The smoothness constant is \(\tfrac{1}{1+\gamma m}\) times that in [24, Theorem 2], i.e., it is slightly smaller.
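For a concrete instance, the constants in Proposition 4.6 can be recovered from the eigenvalues of P: since \(P=2(\mathrm {Id}+\gamma H)^{-1}-\mathrm {Id}\) has eigenvalues \((1-\gamma \lambda )/(1+\gamma \lambda )\) for each eigenvalue \(\lambda \) of H, the extreme eigenvalues of \(P\pm P^2\) from Proposition 4.5 reproduce the closed-form constants. A sketch (random data, our own notation):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
M = rng.standard_normal((n, n)); H = M @ M.T    # positive definite
ev = np.linalg.eigvalsh(H); m, L = ev.min(), ev.max()
gamma = 0.5 / L                                 # gamma in ]0, 1/L[

P = 2 * np.linalg.inv(np.eye(n) + gamma * H) - np.eye(n)
p = np.linalg.eigvalsh(P)

smooth = np.max(p + p**2) / (2 * gamma)          # from the upper bound in Prop. 4.5
strong = np.min(p - p**2) / (2 * gamma)          # from the lower bound in Prop. 4.5

# closed-form constants from Prop. 4.6
smooth_cf = (1 - gamma * m) / (gamma * (1 + gamma * m) ** 2)
strong_cf = min((1 - gamma * m) * m / (1 + gamma * m) ** 2,
                (1 - gamma * L) * L / (1 + gamma * L) ** 2)

print(np.isclose(smooth, smooth_cf), np.isclose(strong, strong_cf))  # True True
```

The agreement is exact (up to rounding) because \(\lambda (1-\gamma \lambda )/(1+\gamma \lambda )^2\) is unimodal in \(\lambda \), so its minimum over the spectrum is attained at \(\lambda =m\) or \(\lambda =L\).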

4.5 ADMM

The alternating direction method of multipliers (ADMM) solves problems of the form (19). It is well known [34] that ADMM can be interpreted as Douglas–Rachford applied to the dual of (19), namely to
$$\begin{aligned} {\hbox {minimize }} f^*(\mu )+g^*(-\mu ). \end{aligned}$$
(22)
So the algorithm is given by
$$\begin{aligned} v^{k+1} = (1-\alpha )v^k+\alpha R_{\rho (g^*\circ -\mathrm {Id})}R_{\rho f^*}v^k, \end{aligned}$$
(23)
where \(\rho >0\) is a parameter, \(R_{\rho f^*}\) is the reflected proximal operator (13) applied to \(f^*\), and \((g^*\circ -\mathrm {Id})\) is the composition that satisfies \((g^*\circ -\mathrm {Id})(\mu )=g^*(-\mu )\).
In accordance with the Douglas–Rachford envelope (21), the ADMM envelope is
$$\begin{aligned} F_{\rho }^\mathrm{{ADMM}}(v)=(2\rho )^{-1}\left( \langle R_{\rho f^*}(v),v\rangle -p_{\rho f^*}(v)-p_{\rho (g^*\circ -\mathrm {Id})}(R_{\rho f^*}v)\right) \end{aligned}$$
(24)
and its gradient becomes
$$\begin{aligned} \nabla F_{\rho }^\mathrm{{ADMM}}(v)=(2\rho )^{-1}\nabla R_{\rho f^*}(v)(v-R_{\rho (g^*\circ -\mathrm {Id})}R_{\rho f^*}(v)). \end{aligned}$$
This envelope function has been used in [27] to accelerate ADMM. In this section, we augment the analysis in [27] by relating the ADMM algorithm and its envelope function to their Douglas–Rachford counterparts. To do so, we need the following result, which is proven in “Appendix C”.

Lemma 4.1

Let \(g:\mathbb {R}^n\rightarrow \mathbb {R}\cup \{\infty \}\) be proper, closed, and convex and let \(\rho >0\). Then,
$$\begin{aligned} R_{\rho g^*}(x)&= -\rho R_{\rho ^{-1}g}(\rho ^{-1}x),\\ R_{\rho (g^*\circ -\mathrm {Id})}(x)&= \rho R_{\rho ^{-1} g}(-\rho ^{-1}x),\\ p_{\rho (g^*\circ -\mathrm {Id})}(y)&= -\rho ^{2}p_{\rho ^{-1}g}(-\rho ^{-1}y), \end{aligned}$$
where \(R_{\rho g}\) is defined in (13) and \(p_{\rho g}\) is defined in (14).
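The two reflected-operator identities in Lemma 4.1 are easy to test numerically. The sketch below uses \(g(x)=\max (x,0)\), whose conjugate is the indicator function of [0, 1], so all proximal operators are available in closed form (this particular g is our own choice, not from the paper).

```python
import numpy as np

rho = 2.5
gamma = 1.0 / rho

# g(x) = max(x, 0); prox_{t g}(x) = x - clip(x, 0, t)
prox_g = lambda x, t: x - np.clip(x, 0.0, t)
# g* = indicator of [0, 1]; its prox (any step size) is the projection
prox_gs = lambda x: np.clip(x, 0.0, 1.0)
# (g* o -Id) = indicator of [-1, 0]
prox_gs_neg = lambda x: np.clip(x, -1.0, 0.0)

R = lambda prox, x: 2 * prox(x) - x            # reflected prox (13)

x = np.linspace(-4.0, 4.0, 17)
lhs1 = R(prox_gs, x)
rhs1 = -rho * R(lambda u: prox_g(u, gamma), x / rho)
lhs2 = R(prox_gs_neg, x)
rhs2 = rho * R(lambda u: prox_g(u, gamma), -x / rho)
print(np.allclose(lhs1, rhs1), np.allclose(lhs2, rhs2))  # True True
```

Both identities hold exactly on the whole grid, including the kinks of the projections.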

Before we state the result, we show that the \(z^k\) sequence in (primal) Douglas–Rachford splitting (20) and the \(v^k\) sequence in ADMM (i.e., dual Douglas–Rachford splitting) in (23) differ only by a constant factor. This is well known [35], but the relation is stated next with a simple proof.

Proposition 4.7

Assume that \(\rho >0\) and \(\gamma >0\) satisfy \(\rho ^{-1}=\gamma \), and that \(z^0 = \rho ^{-1}v^0\). Then \(z^k=\rho ^{-1}v^{k}\) for all \(k\ge 1\), where \(\{z^k\}\) is the primal Douglas–Rachford sequence defined in (20) and \(\{v^k\}\) is the ADMM sequence defined in (23).

Proof

Lemma 4.1 implies that
$$\begin{aligned} v^{k+1}&= (1-\alpha )v^k+\alpha R_{\rho (g^*\circ -\mathrm {Id})}R_{\rho f^*}v^k\\&= (1-\alpha )v^k+\alpha \rho R_{\rho ^{-1} g}(-\rho ^{-1} (-\rho R_{\rho ^{-1} f}(\rho ^{-1}v^k)))\\&= (1-\alpha )v^k+\alpha \rho R_{\rho ^{-1} g}(R_{\rho ^{-1} f}(\rho ^{-1}v^k)). \end{aligned}$$
Multiply by \(\rho ^{-1}\), let \(z^{k} = \rho ^{-1}v^k\), and identify \(\gamma =\rho ^{-1}\) to get
$$\begin{aligned} z^{k+1}&= (1-\alpha )z^{k}+\alpha R_{\gamma g}(R_{\gamma f}(z^k)). \end{aligned}$$
This concludes the proof. \(\square \)
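Proposition 4.7 can also be observed numerically: running (20) and (23) side by side from matched initial points keeps \(z^k=\rho ^{-1}v^k\) at every iteration. In the sketch below, the dual proximal operators are evaluated via the Moreau decomposition \(\mathrm{prox}_{\rho f^*}(v)=v-\rho \,\mathrm{prox}_{\rho ^{-1}f}(\rho ^{-1}v)\); the test problem is our own choice.

```python
import numpy as np

rng = np.random.default_rng(5)
A, b, lam = rng.standard_normal((20, 5)), rng.standard_normal(20), 0.1
gamma = 0.7; rho = 1.0 / gamma; alpha = 0.5

Q = np.linalg.inv(np.eye(5) + gamma * A.T @ A)
prox_f = lambda z: Q @ (z + gamma * A.T @ b)     # prox_{gamma f}, f = 0.5*||Ax-b||^2
prox_g = lambda z: np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)
# dual proxes via the Moreau decomposition
prox_fs = lambda v: v - rho * prox_f(v / rho)            # prox_{rho f*}
prox_gs_neg = lambda v: v + rho * prox_g(-v / rho)       # prox_{rho (g* o -Id)}

R = lambda prox, u: 2 * prox(u) - u              # reflected prox (13)

z = rng.standard_normal(5)
v = rho * z                                      # v^0 = rho z^0
for _ in range(50):
    z = (1 - alpha) * z + alpha * R(prox_g, R(prox_f, z))         # (20)
    v = (1 - alpha) * v + alpha * R(prox_gs_neg, R(prox_fs, v))   # (23)

print(np.allclose(z, v / rho))  # True, as in Proposition 4.7
```

In exact arithmetic the two sequences coincide after rescaling at every step; in floating point the drift stays at rounding level.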

There is also a tight relationship between the ADMM and Douglas–Rachford envelopes. Essentially, they have opposite signs.

Proposition 4.8

Assume that \(\rho >0\) and \(\gamma >0\) satisfy \(\rho =\gamma ^{-1}\) and that \(z=\rho ^{-1}v=\gamma v\). Then,
$$\begin{aligned} F_{\rho }^\mathrm{{ADMM}}(v)&=-F_{\gamma }^\mathrm{{DR}}(z). \end{aligned}$$

Proof

Using Lemma 4.1 several times, \(\gamma =\rho ^{-1}\), and \(z=\rho ^{-1}v\), we conclude that
$$\begin{aligned} F_{\rho }^\mathrm{{ADMM}}(v)&= (2\rho )^{-1}\left( \langle R_{\rho f^*}(v),v\rangle -p_{\rho f^*}(v)-p_{\rho (g^*\circ -\mathrm {Id})}(R_{\rho f^*}(v))\right) \\&=(2\rho )^{-1}\Big (-\rho \langle R_{\rho ^{-1}f}(\rho ^{-1}v),v\rangle +\rho ^2p_{\rho ^{-1} (f\circ -\mathrm {Id})}(-\rho ^{-1} v)\\&\quad +\rho ^2p_{\rho ^{-1}g}(-\rho ^{-1}(-\rho R_{\rho ^{-1}f}(\rho ^{-1}v)))\Big )\\&=-\tfrac{\rho }{2}\left( \langle R_{\rho ^{-1}f}(\rho ^{-1}v),\rho ^{-1} v\rangle -p_{\rho ^{-1} f}(\rho ^{-1} v)-p_{\rho ^{-1}g}( R_{\rho ^{-1}f}(\rho ^{-1}v))\right) \\&=-(2\gamma )^{-1}\left( \langle R_{\gamma f}(z),z\rangle -p_{\gamma f}(z)-p_{\gamma g}( R_{\gamma f}(z))\right) \\&=-F_{\gamma }^\mathrm{{DR}}(z). \end{aligned}$$
This concludes the proof. \(\square \)
This result implies that the ADMM envelope is concave when the DR envelope is convex, and vice versa. We know from Sect. 4.4 that the operator \(S_1=R_{\rho f^*}\) is affine when the conjugate \(f^*\) is quadratic. This holds true if
$$\begin{aligned} f(x)={\left\{ \begin{array}{ll} \tfrac{1}{2}\langle Hx,x\rangle +\langle h,x\rangle , &{} {\hbox {if }} Ax=b,\\ \infty , &{} {\hbox {else}}, \end{array}\right. } \end{aligned}$$
and H is positive definite on the nullspace of A. From Propositions 4.5 and 4.6, we conclude that, for an appropriate choice of \(\rho \), the ADMM envelope is convex, which implies that the Douglas–Rachford envelope is concave.

Remark 4.1

The standard ADMM formulation is applied to solve problems of the form
$$\begin{aligned} {\hbox {minimize }} f(x)+g(y)\quad {\hbox {subject to }} Ax+By=c. \end{aligned}$$
Using infimal post-compositions, also called image functions, the dual of this problem is of the form (22); see, e.g., [36, Appendix B], which is a longer version of [37], for details. Therefore, this setting is also implicitly covered.

5 Conclusions

We have presented an envelope function that unifies the Moreau envelope, the forward–backward envelope, the Douglas–Rachford envelope, and the ADMM envelope. We have provided quadratic upper and lower bounds for the envelope that coincide with or improve on corresponding results in the literature for the special cases. We have also provided a novel interpretation of the underlying algorithms as being majorization–minimization algorithms applied to their respective envelopes. Finally, we have shown how the ADMM and DR envelopes relate to each other.


Acknowledgements

Pontus Giselsson and Mattias Fält are financially supported by the Swedish Foundation for Strategic Research and are members of the LCCC Linnaeus Center at Lund University. Pontus Giselsson is also financed by the Swedish Research Council. The reviewers are gratefully acknowledged for useful comments that have considerably improved the paper.

References

  1. Combettes, P.L.: Solving monotone inclusions via compositions of nonexpansive averaged operators. Optimization 53(5–6), 475–504 (2004)
  2. Douglas, J., Rachford, H.H.: On the numerical solution of heat conduction problems in two and three space variables. Trans. Am. Math. Soc. 82, 421–439 (1956)
  3. Lions, P.L., Mercier, B.: Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal. 16(6), 964–979 (1979)
  4. Gabay, D., Mercier, B.: A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Comput. Math. Appl. 2(1), 17–40 (1976)
  5. Glowinski, R., Marroco, A.: Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires. ESAIM: Math. Model. Numer. Anal. 9, 41–76 (1975)
  6. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
  7. Chambolle, A., Pock, T.: A first-order primal–dual algorithm for convex problems with applications to imaging. J. Math. Imag. Vis. 40(1), 120–145 (2011)
  8. Davis, D., Yin, W.: A three-operator splitting scheme and its optimization applications (2015). arXiv:1504.01032
  9. Gubin, L.G., Polyak, B.T., Raik, E.V.: The method of projections for finding the common point of convex sets. USSR Comput. Math. Math. Phys. 7(6), 1–24 (1967)
  10. Agmon, S.: The relaxation method for linear inequalities. Can. J. Math. 6(3), 382–392 (1954)
  11. Motzkin, T.S., Schoenberg, I.J.: The relaxation method for linear inequalities. Can. J. Math. 6(3), 383–404 (1954)
  12. Eremin, I.I.: Generalization of the Motzkin–Agmon relaxation method. Usp. Mat. Nauk 20(2), 183–188 (1965)
  13. Bregman, L.M.: Finding the common point of convex sets by the method of successive projection. Dokl. Akad. Nauk SSSR 162(3), 487–490 (1965)
  14. von Neumann, J.: Functional Operators. Volume II. The Geometry of Orthogonal Spaces, Annals of Mathematics Studies. Princeton University Press, Princeton (1950). (Reprint of 1933 lecture notes)
  15. Benzi, M.: Preconditioning techniques for large linear systems: a survey. J. Comput. Phys. 182(2), 418–477 (2002)
  16. Bramble, J.H., Pasciak, J.E., Vassilev, A.T.: Analysis of the inexact Uzawa algorithm for saddle point problems. SIAM J. Numer. Anal. 34(3), 1072–1092 (1997)
  17. Hu, Q., Zou, J.: Nonlinear inexact Uzawa algorithms for linear and nonlinear saddle-point problems. SIAM J. Optim. 16(3), 798–825 (2006)
  18. Ghadimi, E., Teixeira, A., Shames, I., Johansson, M.: Optimal parameter selection for the alternating direction method of multipliers (ADMM): quadratic problems. IEEE Trans. Autom. Control 60(3), 644–658 (2015)
  19. Giselsson, P., Boyd, S.: Metric selection in fast dual forward–backward splitting. Automatica 62, 1–10 (2015)
  20. Giselsson, P., Boyd, S.: Linear convergence and metric selection for Douglas–Rachford splitting and ADMM. IEEE Trans. Autom. Control 62(2), 532–544 (2017)
  21. Giselsson, P.: Tight global linear convergence rate bounds for Douglas–Rachford splitting. J. Fixed Point Theory Appl. (2017). https://doi.org/10.1007/s11784-017-0417-1
  22. Patrinos, P., Stella, L., Bemporad, A.: Forward–backward truncated Newton methods for convex composite optimization (2014). arXiv:1402.6655
  23. Stella, L., Themelis, A., Patrinos, P.: Forward–backward quasi-Newton methods for nonsmooth optimization problems. Comput. Optim. Appl. 67(3), 443–487 (2017)
  24. Patrinos, P., Stella, L., Bemporad, A.: Douglas–Rachford splitting: complexity estimates and accelerated variants. In: Proceedings of the 53rd IEEE Conference on Decision and Control, pp. 4234–4239. Los Angeles, CA (2014)
  25. Themelis, A., Stella, L., Patrinos, P.: Forward–backward envelope for the sum of two nonconvex functions: further properties and nonmonotone line-search algorithms (2016). arXiv:1606.06256
  26. Themelis, A., Stella, L., Patrinos, P.: Douglas–Rachford splitting and ADMM for nonconvex optimization: new convergence results and accelerated versions (2017). arXiv:1709.05747
  27. Pejcic, I., Jones, C.N.: Accelerated ADMM based on accelerated Douglas–Rachford splitting. In: 2016 European Control Conference (ECC), pp. 1952–1957 (2016)
  28. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York (2011)
  29. Rockafellar, R.T., Wets, R.J.B.: Variational Analysis. Springer, Berlin (1998)
  30. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, 1st edn. Springer, Dordrecht (2003)
  31. Nocedal, J., Wright, S.: Numerical Optimization. Springer Series in Operations Research and Financial Engineering, 2nd edn. Springer, New York (2006)
  32. Moreau, J.J.: Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France 93, 273–299 (1965)
  33. Rockafellar, R.T.: Convex Analysis, vol. 28. Princeton University Press, Princeton (1970)
  34. Gabay, D.: Applications of the method of multipliers to variational inequalities. In: Fortin, M., Glowinski, R. (eds.) Augmented Lagrangian Methods: Applications to the Solution of Boundary-Value Problems. North-Holland, Amsterdam (1983)
  35. Eckstein, J.: Splitting methods for monotone operators with applications to parallel optimization. Ph.D. thesis, MIT (1989)
  36. Giselsson, P., Fält, M., Boyd, S.: Line search for averaged operator iteration (2016). arXiv:1603.06772
  37. Giselsson, P., Fält, M., Boyd, S.: Line search for averaged operator iteration. In: Proceedings of the 55th Conference on Decision and Control. Las Vegas, USA (2016)
  38. Clarke, F.: Optimization and Nonsmooth Analysis. Wiley, New York (1983)
  39. Sion, M.: On general minimax theorems. Pac. J. Math. 8(1), 171–176 (1958)

Copyright information

© The Author(s) 2018

Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Authors and Affiliations

  1. Department of Automatic Control, Lund University, Lund, Sweden
