1 Introduction

A fundamental generic optimization problem that covers various classes of convex models arising in many modern applications is the well-known composite minimization problem, which consists of minimizing the sum of two nonsmooth extended-valued convex functions, one of which is composed with a linear map:

$$\begin{aligned} \text{(G) } \quad \text{ val }(G):=\min _{\textbf{x}} f(\textbf{A}\textbf{x})+w(\textbf{x}), \end{aligned}$$
(1.1)

where both \(f: \mathbb {R}^m \rightarrow (-\infty ,\infty ]\) and \(w: \mathbb {R}^n \rightarrow (-\infty ,\infty ]\) are proper closed and convex and \(\textbf{A}\in \mathbb {R}^{m \times n}\).

This model is very rich and under specific assumptions on the problem’s data, it has led to the development of fundamental primal and primal-dual optimization algorithms, see e.g., [2, 3, 6] and references therein.

Simple algorithms for solving (G) are based on primal first-order methods, whereby we suppose that only w admits a computationally tractable proximal map [13], and we obviously want to avoid the proximal computation of \(\textbf{x}\mapsto f(\textbf{A}\textbf{x})\), which in general is intractable even when f is prox-tractable (see Footnote 1). A central property required in the non-asymptotic convergence rate analysis (iteration complexity) in terms of function values of such primal methods is the Lipschitz continuity of the function f. Therefore, whenever the Lipschitz continuity of f is absent, as may occur in many applications modeled by problem (G), the use of simple primal-based methods might be impossible. Two examples of such simple algorithms, in which w is prox-tractable and the Lipschitz continuity of f is required, are:

  1. (a)

    Proximal subgradient method [15, 20] The proximal subgradient method takes the form (see Footnote 2) \(\textbf{x}^{k+1} = \textrm{prox}_{t_k w}(\textbf{x}^k-t_k \textbf{A}^T f'(\textbf{A}\textbf{x}^k))\), where \(t_k>0\) is a stepsize, and for any proper closed and convex function \(s: \mathbb {R}^n \rightarrow (-\infty ,\infty ]\),

    $$\begin{aligned} \textrm{prox}_{s}(\textbf{x}) = \displaystyle \mathop {\text{ argmin }}_{\textbf{u}} \left\{ s(\textbf{u})+\frac{1}{2}\Vert \textbf{u}-\textbf{x}\Vert ^2 \right\} \end{aligned}$$

    stands for the proximal map of s [13]. The Lipschitz continuity of f is, however, a key assumption needed for establishing a rate of convergence [2, Section 9.3].

  2. (b)

    Smoothing-based methods A common way to solve (G) is to replace f by a smooth approximation \(f_{\mu }\) (\(\mu >0\) is a smoothing parameter), where by “smooth approximation” we mean that \(f_{\mu }\) is \(\frac{\alpha }{\mu }\)-smooth (\(\alpha >0\)) and that

    $$\begin{aligned} (AS)\qquad f_{\mu }(\textbf{x}) \le f(\textbf{x}) \le f_{\mu }(\textbf{x})+\beta \mu , \qquad \text {for some parameter } \beta >0. \end{aligned}$$

    Then, an accelerated proximal gradient method is employed on the smooth problem \(\min f_{\mu }(\textbf{A}\textbf{x})+w(\textbf{x})\); see [4, 14]. The latter approach, in which the smoothing parameter is fixed in advance, can also be refined into an adaptive smoothing scheme that employs one iteration of an accelerated method on the function \(f_{\mu _k}(\textbf{A}\textbf{x})+w(\textbf{x})\), where \(\mu _k\) is a decreasing sequence that diminishes to zero as k, the dynamic iteration index, increases; see for instance [7, 21]. The existence of such smooth approximations satisfying (AS) is guaranteed when f is Lipschitz continuous. Unfortunately, in general such a guarantee does not exist.
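To illustrate the smoothing in item (b), here is a minimal sketch (in Python/NumPy) with an illustrative choice of f that is not taken from the paper: for \(f=\Vert \cdot \Vert _2\), the function \(f_{\mu }(\textbf{x})=\sqrt{\Vert \textbf{x}\Vert _2^2+\mu ^2}-\mu \) is \(\frac{1}{\mu }\)-smooth and satisfies (AS) with \(\alpha =\beta =1\).

```python
import numpy as np

def f(x):
    # the Lipschitz function being smoothed (illustrative choice): f(x) = ||x||_2
    return np.linalg.norm(x)

def f_mu(x, mu):
    # smooth approximation: (1/mu)-smooth, and f_mu <= f <= f_mu + mu,
    # so condition (AS) holds with alpha = beta = 1
    return np.sqrt(np.dot(x, x) + mu**2) - mu

def grad_f_mu(x, mu):
    # gradient of the smooth approximation
    return x / np.sqrt(np.dot(x, x) + mu**2)
```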

When both f and w admit computationally efficient proximal maps [13], one can consider tackling the composite model (G) by applying primal-dual Lagrangian-based methods, such as the popular Alternating Direction Method of Multipliers (ADMM) scheme [9] and its related variants; see for instance [5, 8, 9, 12, 19] and references therein. However, to obtain rates of convergence in terms of function values for these methods, some type of Lipschitz continuity is often required (see for example [8, Remark 3] and [19]), while improved types of convergence results can be derived only under additional assumptions, see e.g., [18].

Contribution and Outline We introduce a theoretical framework in which the restrictive Lipschitz continuity of the function f of problem (G) is not required. The derivation and the development of our results rely on a powerful fact involving the so-called Pasch-Hausdorff (PH) envelope of a function [10], which consists of the infimal convolution of the given function with a scaled norm, and which generates a Lipschitz continuous function. This is presented in Sect. 2, where we also derive a new dual formulation of the PH envelope which is a key player in our analysis. The main idea is then to replace the function f with its PH envelope, which allows us to construct an exact Lipschitz regularization of problem (G); a simple and useful property which appears to have been overlooked in the literature. We prove that as long as the PH parameter is larger than a bound on the norm of a dual optimal solution, problem (G) and its exact Lipschitz regularization counterpart are equivalent; see Sect. 3. In Sect. 4 we show how the aforementioned equivalence result can be utilized in establishing function value-based rates of convergence in terms of the original data. A bound on the norm of a dual optimal solution, as required by the equivalence result, is not always easy to derive. We address this issue in Sect. 5, where we show that given a Slater point for the general convex model (G), we can evaluate such a bound in terms of this Slater point, without actually needing to compute the dual problem. Throughout the paper, we provide examples and applications which illustrate the potential benefits of our approach.

Notation Vectors are denoted by boldface lowercase letters, e.g., \(\textbf{y}\), and matrices by boldface uppercase letters, e.g., \(\textbf{B}\). The vectors of all zeros and ones are denoted by \(\textbf{0}\) and \(\textbf{e}\) respectively. The underlying spaces are \(\mathbb {R}^n\)-spaces endowed with an inner product \(\langle \cdot , \cdot \rangle \). The closed ball with center \(\textbf{c}\in \mathbb {R}^n\) and radius \(r>0\) w.r.t. a norm \(\Vert \cdot \Vert _a\) is denoted by \(B_a[\textbf{c},r] = B_{\Vert \cdot \Vert _a}[\textbf{c},r]=\{\textbf{x}\in \mathbb {R}^n: \Vert \textbf{x}-\textbf{c}\Vert _a \le r\}\) and the corresponding open ball by \(B_a(\textbf{c},r) = \{\textbf{x}\in \mathbb {R}^n: \Vert \textbf{x}-\textbf{c}\Vert _a < r\}\). Given a matrix \(\textbf{A}\in \mathbb {R}^{m \times n}\), \(\Vert \textbf{A}\Vert \) denotes its spectral norm: \(\Vert \textbf{A}\Vert = \sqrt{\lambda _{\max }(\textbf{A}^T \textbf{A})}\). We use the standard notation \([n]\equiv \{1,2,\ldots ,n\}\) for a positive integer n. For any extended real-valued function h, the conjugate is defined as \(h^*(\textbf{y}) \equiv \max _{\textbf{x}} \left\{ \langle \textbf{x},\textbf{y}\rangle - h(\textbf{x}) \right\} \). For a given set S, the indicator function \(\delta _S(\textbf{x})\) is equal to 0 if \(\textbf{x}\in S\) and \(\infty \) otherwise. Further standard definitions or notations in convex analysis which are not explicitly mentioned here can be found in the classical book [17].

2 The Pasch-Hausdorff Lipschitz Regularization

Assume that \(\mathbb {R}^m\) is endowed with some norm \(\Vert \cdot \Vert _a\). The dual norm is denoted by \(\Vert \cdot \Vert _a^*\) (not to be confused with the Fenchel conjugate). A natural way to “transform” a function \(h: \mathbb {R}^m \rightarrow (-\infty , \infty ]\) into a Lipschitz continuous function is via the Pasch-Hausdorff (PH) envelope [1, Section 12.3] defined for a parameter \(M>0\) as

$$\begin{aligned} h^{[M]}(\textbf{x}) := h\Box (M\Vert \cdot \Vert _a) (\textbf{x}) = \min _{\textbf{z}} \{h(\textbf{z})+M\Vert \textbf{z}-\textbf{x}\Vert _a\}. \end{aligned}$$
(2.1)

It is well known [1, Proposition 12.17] that if a proper function h has an M-Lipschitz minorant (w.r.t. \(\Vert \cdot \Vert _a\)), then \(h^{[M]}\) is the largest M-Lipschitz minorant of h, and the only other case is when \(h^{[M]} \equiv -\infty \). This result does not require any convexity assumption on h.

2.1 A Dual Representation of The Pasch-Hausdorff Envelope

In our setting of problem (G), f is proper closed and convex. In this case, we will now show that the PH envelope admits a dual representation that will be essential to our analysis. This property is stated in the following lemma. For the sake of completeness, we also state and prove the elementary properties asserting that \(f^{[M]}\) is an M-Lipschitz minorant of f. Before proceeding, recall that for any set C, \(\delta _C^*(\textbf{y}) = \sigma _C(\textbf{y}):=\max \{\langle \textbf{y}, \textbf{x}\rangle : \textbf{x}\in C\}\), and \(\displaystyle \mathop {\textrm{ri}}(C)\) stands for the relative interior of C, which is nonempty whenever the set C is nonempty and convex [17, Theorem 6.2].

Lemma 2.1

(Dual representation of \(f^{[M]}\)) Let \(f:\mathbb {R}^m \rightarrow (-\infty ,\infty ]\) be a proper closed and convex function. Suppose that there exists \(\hat{\textbf{y}} \in \displaystyle \mathop {\textrm{dom}}(f^*)\) such that \(\Vert \hat{\textbf{y}}\Vert _a^* < M\) for some \(M>0\). Then

  1. (a)

    It holds that

    $$\begin{aligned} f^{[M]} = (f^*+\delta _{B_{\Vert \cdot \Vert _a^*}[\textbf{0},M]})^*; \end{aligned}$$
    (2.2)
  2. (b)

    \(f^{[M]}\) is real-valued and convex and the minimal value in (2.1) is attained;

  3. (c)

    [1, Proposition 12.17] \(f^{[M]}(\textbf{x})\le f(\textbf{x})\) for all \(\textbf{x}\);

  4. (d)

    [1, Proposition 12.17] \(f^{[M]}: \mathbb {R}^m \rightarrow \mathbb {R}\) is M-Lipschitz continuous w.r.t. the norm \(\Vert \cdot \Vert _a\).

Proof

(a+b) By [17, Theorem 16.4], if \(\displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}(f^*)) \cap B_{\Vert \cdot \Vert _a^*}(\textbf{0},M) \ne \emptyset \), then

$$\begin{aligned} (f^*+\delta _{B_{\Vert \cdot \Vert _a^*}[\textbf{0},M]})^* = f^{**} \Box \delta _{B_{\Vert \cdot \Vert _a^*}[\textbf{0},M]}^*. \end{aligned}$$

Since \(f^{**}=f\) (as f is proper closed and convex), and \(\delta _{B_{\Vert \cdot \Vert _a^*}[\textbf{0},M]}^* = \sigma _{B_{\Vert \cdot \Vert _a^*}[\textbf{0},M]} = M \Vert \cdot \Vert _a\), we obtain that \((f^*+\delta _{B_{\Vert \cdot \Vert _a^*}[\textbf{0},M]})^*= f \Box (M \Vert \cdot \Vert _a)=f^{[M]}\). The result [17, Theorem 16.4] also establishes the finiteness and attainment of the minimal value in (2.1). The convexity of \(f^{[M]}\) follows by the fact that it is a conjugate function, see [2, Theorem 4.3]. What is left is to show that \(\displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}(f^*)) \cap B_{\Vert \cdot \Vert _a^*}(\textbf{0},M) \ne \emptyset \). Indeed, since \(\displaystyle \mathop {\textrm{dom}}(f^*)\) is convex and nonempty (by the convexity and properness of f [2, Theorem 4.5]), it follows that there exists \(\tilde{\textbf{y}} \in \displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}(f^*))\). Therefore, recalling that \(\hat{\textbf{y}} \in \displaystyle \mathop {\textrm{dom}}(f^*)\), by the line segment principle, for any \(\lambda \in (0,1)\) we have that \(\hat{\textbf{y}}+\lambda (\tilde{\textbf{y}}-\hat{\textbf{y}}) \in \displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}(f^*))\). Thus, we can take \(\tilde{\lambda } \in (0,1)\) small enough for which \(\hat{\textbf{y}}+\tilde{\lambda }(\tilde{\textbf{y}}-\hat{\textbf{y}}) \in \displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}(f^*)) \cap B_{\Vert \cdot \Vert _a^*}(\textbf{0},M).\)

(c) Follows from the following elementary argument:

$$\begin{aligned} f^{[M]}(\textbf{x}) = (f \Box M \Vert \cdot \Vert _a)(\textbf{x}) =\min _{\textbf{z}} \{f(\textbf{z})+M\Vert \textbf{x}-\textbf{z}\Vert _a\} \le f(\textbf{x})+M\Vert \textbf{x}-\textbf{x}\Vert _a = f(\textbf{x}). \end{aligned}$$

(d) Note that by part (b) \(f^{[M]}\) is real-valued. Then by the triangle inequality,

$$\begin{aligned} f^{[M]}(\textbf{x}) &= \min _{\textbf{z}} \{ f(\textbf{z})+M \Vert \textbf{z}-\textbf{x}\Vert _a\} \\ &\le \min _{\textbf{z}} \{ f(\textbf{z})+M \Vert \textbf{z}-\textbf{y}\Vert _a\}+M \Vert \textbf{x}-\textbf{y}\Vert _a = f^{[M]}(\textbf{y})+M \Vert \textbf{x}-\textbf{y}\Vert _a. \end{aligned}$$

Changing the roles of \(\textbf{x}\) and \(\textbf{y}\) we also obtain that \(f^{[M]}(\textbf{y}) \le f^{[M]}(\textbf{x}) + M\Vert \textbf{x}-\textbf{y}\Vert _a\), thus establishing the desired result that \(|f^{[M]}(\textbf{x})-f^{[M]}(\textbf{y})| \le M \Vert \textbf{x}-\textbf{y}\Vert _a\) for any \(\textbf{x},\textbf{y}\in \mathbb {R}^m\). \(\square \)

2.2 Some Examples of PH Envelopes

Obviously, computing the PH envelope can be a challenging task. In this section we describe several cases in which its evaluation is tractable. In what follows, for any nonempty set C the distance function with respect to a norm \(\Vert \cdot \Vert _a\) is defined by \(d_{C,\Vert \cdot \Vert _a}(\textbf{x}) = \min _{\textbf{y}\in C} \Vert \textbf{y}-\textbf{x}\Vert _a\). If the distance function is with respect to the Euclidean norm \(\Vert \cdot \Vert = \sqrt{\langle \cdot ,\cdot \rangle }\), then we will simply write \(d_C\).

Example 2.1

(Indicator function) Suppose \(f=\delta _C\) where C is a nonempty closed and convex set. Then the PH envelope of f is given by

$$\begin{aligned} f^{[M]}(\textbf{x}) = (\delta _C \Box M \Vert \cdot \Vert _a)(\textbf{x}) = \min _{\textbf{z}\in C} M\Vert \textbf{z}-\textbf{x}\Vert _a = M d_{C,\Vert \cdot \Vert _a}(\textbf{x}).\end{aligned}$$
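In particular, whenever the projection onto C is easy to compute and \(\Vert \cdot \Vert _a = \Vert \cdot \Vert _2\), evaluating the envelope amounts to a single projection. A minimal sketch (in Python/NumPy) for the illustrative case of a box \(C=[\textbf{lo},\textbf{hi}]\); the function name and this choice of C are ours, for illustration only:

```python
import numpy as np

def ph_envelope_box_indicator(x, lo, hi, M):
    # Example 2.1 with C = [lo, hi]^m and ||.||_a = ||.||_2:
    # (delta_C)^{[M]}(x) = M * d_C(x) = M * ||x - P_C(x)||_2
    proj = np.clip(x, lo, hi)              # Euclidean projection onto the box
    return M * np.linalg.norm(x - proj)
```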

Example 2.2

(Ball-pen) Consider the so-called “ball-pen” function \(f: \mathbb {R}^n \rightarrow (-\infty ,\infty ]\) given by \(f(\textbf{x}) = -\sqrt{1-\Vert \textbf{x}\Vert _2^2}\) with \(\displaystyle \mathop {\textrm{dom}}(f) = B[\textbf{0},1]\). Here we assume that \(\Vert \cdot \Vert _a = \Vert \cdot \Vert _2\). Being an extended real-valued function, f is obviously not Lipschitz continuous. The M-Lipschitz PH envelope is given by

$$\begin{aligned} f^{[M]}(\textbf{x}) = \min _{\textbf{z}} \left\{ -\sqrt{1-\Vert \textbf{z}\Vert _2^2}+M \Vert \textbf{x}-\textbf{z}\Vert _2\right\} . \end{aligned}$$

A rather technical argument that uses the dual representation (2.2) shows that (see Appendix A)

$$\begin{aligned} f^{[M]}(\textbf{x}) = - \sqrt{1- \min \left\{ \Vert \textbf{x}\Vert _2, \frac{M}{\sqrt{M^2+1}}\right\} ^2}+ M\max \left\{ 0, \Vert \textbf{x}\Vert _2-\frac{M}{\sqrt{M^2+1}}\right\} . \end{aligned}$$
(2.3)

A one-dimensional illustration is given in Fig. 1.

Fig. 1 The one-dimensional function \(f(x)=-\sqrt{1-x^2}\) and its 1-Lipschitz PH envelope \(f^{[1]}\). The two functions coincide in the interval \([-\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}]\)
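Expression (2.3) can also be checked numerically against the defining minimization; a minimal sketch in one dimension (the grid resolution and the test points are arbitrary choices of ours):

```python
import numpy as np

def ball_pen_ph_grid(x, M, npts=200001):
    # brute-force evaluation of min_z { -sqrt(1 - z^2) + M|x - z| } over a fine grid in [-1, 1]
    z = np.linspace(-1.0, 1.0, npts)
    return np.min(-np.sqrt(1.0 - z**2) + M * np.abs(x - z))

def ball_pen_ph_closed_form(x, M):
    # formula (2.3) in one dimension
    t = M / np.sqrt(M**2 + 1.0)
    return -np.sqrt(1.0 - min(abs(x), t)**2) + M * max(0.0, abs(x) - t)

for x in (0.0, 0.5, 1.0, 2.0):
    print(x, ball_pen_ph_grid(x, M=1.0), ball_pen_ph_closed_form(x, M=1.0))
```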

Example 2.3

(Minus sum of logs) Let \(\textbf{b}\in \mathbb {R}^m\) and define \(f(\textbf{z}):= \sum _{i=1}^m f_i(z_i)\), where

$$\begin{aligned} f_i(z_i) = \left\{ \begin{array}{ll} -\log (z_i-b_i), &{} z_i>b_i,\\ \infty , &{} \text{ else. } \end{array} \right. \end{aligned}$$

We want to find the M-Lipschitz PH envelope of f w.r.t. the \(\ell _1\)-norm:

$$\begin{aligned} f^{[M]}(\textbf{z}) {=} \min _{\textbf{u}\in \mathbb {R}^m} \{ f(\textbf{u})+M \Vert \textbf{z}-\textbf{u}\Vert _1\} = \sum _{i=1}^m \min _{u_i \in \mathbb {R}} \{ -\log (u_i-b_i)+M|z_i-u_i| :u_i>b_i\}. \end{aligned}$$
(2.4)

Note that in the above we exploited the separability of the \(\ell _1\)-norm, which demonstrates that the choice of norm can be essential to the ability to compute the PH envelope. Indeed, in this case, computing an explicit expression for the PH envelope under the \(\ell _2\)-norm, for example, seems to be a difficult task. By (2.4),

$$\begin{aligned} f^{[M]}(\textbf{z}) = \sum _{i=1}^m h_{b_i}^{[M]}(z_i), \end{aligned}$$

where for any \(c,z \in \mathbb {R}\) we define

$$\begin{aligned} h_c^{[M]}(z)=\min _{u\in \mathbb {R}} \{-\log (u-c)+M |z-u| : u>c\}.\end{aligned}$$
(2.5)

Thus, computing \(f^{[M]}\) amounts to solving the one-dimensional problem (2.5). An explicit expression for \(h_c^{[M]}\) is

$$\begin{aligned} h_c^{[M]}(z) &= \left\{ \begin{array}{ll} -\log (z-c), & z>c+\frac{1}{M},\\ \log (M)+1+Mc-Mz, & \text{ else}, \end{array} \right. \\ &= -\log \left( \max \left\{ z,c+\frac{1}{M} \right\} -c\right) +M \left| \max \left\{ z,c+\frac{1}{M} \right\} -z \right| . \end{aligned}$$
(2.6)

The validity of the above expression for \(h_c^{[M]}\) is shown in Appendix B.
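As with the previous example, (2.6) can be verified numerically against the one-dimensional problem (2.5); a minimal sketch (the grid, its range and the test values are arbitrary choices of ours):

```python
import numpy as np

def h_grid(z, c, M, npts=1_000_001, span=50.0):
    # brute-force evaluation of min_{u > c} { -log(u - c) + M|z - u| } on a fine grid
    u = np.linspace(c + 1e-8, c + span, npts)
    return np.min(-np.log(u - c) + M * np.abs(z - u))

def h_closed_form(z, c, M):
    # formula (2.6)
    v = max(z, c + 1.0 / M)
    return -np.log(v - c) + M * abs(v - z)

for z in (-1.0, 0.3, 2.0):
    print(z, h_grid(z, c=0.0, M=2.0), h_closed_form(z, c=0.0, M=2.0))
```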

3 Exact Lipschitz Regularization for Model (G)

We now return to the general model (G) (equation (1.1)) under the conditions that f and w are proper closed and convex. The main idea is to replace the function f with its PH envelope, which yields the Lipschitz regularized problem

$$\begin{aligned} \text{(G}_M) \quad \text{ val }(G_M):=\min _{\textbf{x}} f^{[M]}(\textbf{A}\textbf{x})+w(\textbf{x}). \end{aligned}$$

The main question that we wish to address is

Under which conditions are problems (G) and (G\(_M\)) equivalent?

By “equivalent” we mean that the optimal sets of the two problems are identical. We will show in Theorem 3.1 below that as long as M is larger than the dual norm of some optimal solution of the dual problem, such an equivalence holds. Since duality arguments are essential in our analysis, we first recall the well-known dual problem of (G):

$$\begin{aligned} \text{(DG) } \quad \max _{\textbf{y}} \left\{ -f^*(\textbf{y})-w^*(-\textbf{A}^T \textbf{y})\right\} . \end{aligned}$$

According to [17, Corollary 31.2.1], to guarantee strong duality, it is sufficient that the constraint qualification

$$\begin{aligned} \exists \hat{\textbf{x}} \in \displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}w), \textbf{A}\hat{\textbf{x}} \in \displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}(f)) \end{aligned}$$
(3.1)

holds. We are now ready to answer the main question stated above.

Theorem 3.1

(Equivalence between (G) and (G\(_M\))) Suppose that f and w are proper closed and convex functions and that condition (3.1) holds. In addition, assume that \(\text{ val }(G)>-\infty \). Let \(\textbf{y}^*\) be an optimal solution of the dual problem (DG) and let \(M>\Vert \textbf{y}^*\Vert _a^*\). Then

  1. (a)

    Problems (G) and \((G_M)\) have the same optimal value.

  2. (b)

    If \(\textbf{x}^*\) is an optimal solution of problem \((G_M)\), then \(f^{[M]}(\textbf{A}\textbf{x}^*)=f(\textbf{A}\textbf{x}^*)\).

  3. (c)

    Problems (G) and \((G_M)\) have the same optimal sets.

Proof

  1. (a)

    By condition (3.1) and the finiteness of \(\text{ val }(G)\), it follows that \(\text{ val }(G)=\text{ val }(DG)\). Since the optimal solution \(\textbf{y}^*\) of the dual problem satisfies \(\Vert \textbf{y}^*\Vert _a^* <M\), it follows that \(\text{ val }(DG)=\text{ val }(R)\), where (R) is the problem

    $$\begin{aligned} (\text{ R}) \quad \max _{\textbf{y}} \left\{ -f^*(\textbf{y})-\delta _{B_{\Vert \cdot \Vert _a^*}[\textbf{0},M]}(\textbf{y})-w^*(-\textbf{A}^T\textbf{y})\right\} . \end{aligned}$$

    Note that by the dual representation (2.2), (R) is actually

    $$\begin{aligned} \max _{\textbf{y}} \left\{ -(f^{[M]})^*(\textbf{y})-w^*(-\textbf{A}^T\textbf{y})\right\} , \end{aligned}$$

    meaning that (R) is the dual problem to \((G_M)\). In particular, by weak duality, \(\text{ val }(G_M) \ge \text{ val }(R)\), which is finite. Moreover, by Lemma 2.1(b), \(\displaystyle \mathop {\textrm{dom}}(f^{[M]})=\mathbb {R}^m\), and thus the condition

    $$\begin{aligned} \exists \hat{\textbf{x}} \in \displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}w), \textbf{A}\hat{\textbf{x}} \in \displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}(f^{[M]})) \end{aligned}$$

    amounts to “\(\exists \hat{\textbf{x}} \in \displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}w)\)”, which trivially holds as the relative interior of a nonempty convex set is always nonempty. Consequently, strong duality between problems (R) and \((G_M)\) holds. We can finally conclude that

    $$\begin{aligned} \text{ val }(G_M) = \text{ val }(R) = \text{ val }(DG) = \text{ val }(G). \end{aligned}$$
  2. (b)

    Note the following observation that follows from part (a): for any \(N>\Vert \textbf{y}^*\Vert _a^*\), it holds that \(\text{ val }(G_N) = \text{ val }(G)\). Suppose that \(\textbf{x}^*\) is an optimal solution of \((G_M)\). Then

    $$\begin{aligned} f^{[M]}(\textbf{A}\textbf{x}^*)+w(\textbf{x}^*) = \text{ val }(G). \end{aligned}$$

    Assume by contradiction that \(f^{[M]}(\textbf{A}\textbf{x}^*) \ne f(\textbf{A}\textbf{x}^*)\). This means that there exists \(\textbf{z}\ne \textbf{A}\textbf{x}^*\) such that \(f^{[M]}(\textbf{A}\textbf{x}^*) = M \Vert \textbf{z}-\textbf{A}\textbf{x}^*\Vert _a+f(\textbf{z})\). Take \(M' \in (\Vert \textbf{y}^*\Vert _a^*, M)\); then \(f^{[M']}(\textbf{A}\textbf{x}^*) \le M' \Vert \textbf{z}-\textbf{A}\textbf{x}^*\Vert _a+f(\textbf{z})< M \Vert \textbf{z}-\textbf{A}\textbf{x}^*\Vert _a+f(\textbf{z}) = f^{[M]}(\textbf{A}\textbf{x}^*)\), and therefore

    $$\begin{aligned} \text{ val }(G_{M'}) \le f^{[M']}(\textbf{A}\textbf{x}^*) +w(\textbf{x}^*) <f^{[M]}(\textbf{A}\textbf{x}^*) +w(\textbf{x}^*)=\text{ val }(G_M), \end{aligned}$$

    which is a contradiction to the observation indicated at the beginning of the proof of this part.

  3. (c)

    Assume that \(\textbf{x}^*\) is an optimal solution of (G). Then

    $$\begin{aligned} \text{ val }(G) = f(\textbf{A}\textbf{x}^*) +w(\textbf{x}^*) {\mathop {\ge }\limits ^{(*)}} f^{[M]}(\textbf{A}\textbf{x}^*)+w(\textbf{x}^*) \ge \text{ val }(G_M), \end{aligned}$$

    where \((*)\) follows from Lemma 2.1(c). However, since by part (a) we have that \(\text{ val }(G_M)=\text{ val }(G)\), it follows that \( f^{[M]}(\textbf{A}\textbf{x}^*)+w(\textbf{x}^*) = \text{ val }(G_M)\) and hence that \(\textbf{x}^*\) is an optimal solution of \((G_M)\). In the opposite direction, assume that \(\textbf{x}^*\) is an optimal solution of problem \((G_M)\). Then by part (b), \(f^{[M]}(\textbf{A}\textbf{x}^*)=f(\textbf{A}\textbf{x}^*)\) and consequently,

    $$\begin{aligned} \text{ val }(G_M) = f^{[M]}(\textbf{A}\textbf{x}^*)+w(\textbf{x}^*) = f(\textbf{A}\textbf{x}^*)+w(\textbf{x}^*) \ge \text{ val }(G), \end{aligned}$$

    and since \(\text{ val }(G)=\text{ val }(G_M)\) (part (a)) we conclude that \(f(\textbf{A}\textbf{x}^*)+w(\textbf{x}^*) = \text{ val }(G)\), meaning that \(\textbf{x}^*\) is an optimal solution of (G).

\(\square \)

Remark 3.1

(Exact Penalty Viewpoint) Theorem 3.1 can be seen as a consequence of a result of Han and Mangasarian [11] on sufficient conditions for exact penalty functions. Specifically, we can rewrite problem (G) as

$$\begin{aligned} \text{(G-Con) } \; \min _{\textbf{x},\textbf{z}} \{ f(\textbf{z})+w(\textbf{x}): \textbf{z}= \textbf{A}\textbf{x}\}. \end{aligned}$$

and consider the penalized problem

$$\begin{aligned} (\text{ G-Pen}_{M}) \; \min _{\textbf{x},\textbf{z}} \{ f(\textbf{z})+w(\textbf{x}) +M \Vert \textbf{z}-\textbf{A}\textbf{x}\Vert _a\}. \end{aligned}$$

Fixing \(\textbf{x}\) and minimizing with respect to \(\textbf{z}\), we obtain problem \((G_M)\). Theorem 4.9 from [11] shows that indeed when M is larger than the dual norm of an optimal dual solution, problem \((\text{ G-Pen}_{M})\) has the same optimal set as problem (G-Con), which readily implies that the optimal sets of (G) and \((G_M)\) coincide. Our simple proof, which is of independent interest, reveals the benefit of the PH envelope for the exact penalty approach. Additionally, our work highlights that existing penalty approaches implicitly generate Lipschitz functions, a property crucial for convergence rate analysis.

What remains is of course the question of how to find a bound on the optimal set of the dual problem. This issue will be studied in Sect. 5.

The objective function of problem \((G_M)\) includes the Lipschitz continuous component \(f^{[M]}\). This enables the use of a basic first-order method to achieve non-asymptotic rates of convergence in terms of function values, as explained in the introduction. However, these rates of convergence will depend on the PH envelope \(f^{[M]}\). In the next section we show that in the case where f is an indicator function, rates of convergence in terms of the original data can be obtained.

4 Algorithm Iteration Complexity for a Constrained Model

We focus on the important constrained model

$$\begin{aligned} (Q) \quad \min \{w(\textbf{x}): \textbf{A}\textbf{x}\in C\}, \end{aligned}$$

where \(w: \mathbb {R}^n \rightarrow (-\infty ,\infty ]\) is proper closed and convex, \(\textbf{A}\in \mathbb {R}^{m \times n}\) and \(C \subseteq \mathbb {R}^m\) is a nonempty closed and convex set. Model (Q) fits model (G) with \(f = \delta _C\).

By Example 2.1, \(f^{[M]}(\textbf{x})=M d_{C,\Vert \cdot \Vert _a}(\textbf{x})\), and hence the M-Lipschitz regularization of problem (Q) is

$$\begin{aligned} (\text{ Q}_M) \quad \min _{\textbf{x}} \{ M d_{C, \Vert \cdot \Vert _a}(\textbf{A}\textbf{x}) + w(\textbf{x})\}. \end{aligned}$$

For example, if \(C = \{\textbf{b}\}\), meaning that problem (Q) is \(\min \{w(\textbf{x}): \textbf{A}\textbf{x}= \textbf{b}\}\), then \((Q_M)\) has the form

$$\begin{aligned} \min _{\textbf{x}} M \Vert \textbf{A}\textbf{x}-\textbf{b}\Vert _a+w(\textbf{x}). \end{aligned}$$

If \(C = \{ \textbf{z}: \textbf{z}\le \textbf{b}\}\), meaning that problem (Q) is \(\min \{ w(\textbf{x}): \textbf{A}\textbf{x}\le \textbf{b}\}\), then in the case where \(\Vert \cdot \Vert _a = \Vert \cdot \Vert _2\), \((Q_M)\) has the form

$$\begin{aligned} \min _{\textbf{x}} M \Vert [\textbf{A}\textbf{x}-\textbf{b}]_+\Vert _2+w(\textbf{x}). \end{aligned}$$
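For this last instance, a valid subgradient of the penalty term \(\textbf{x}\mapsto M \Vert [\textbf{A}\textbf{x}-\textbf{b}]_+\Vert _2\) is available in closed form, so the proximal subgradient method recalled in the introduction can be applied directly. A minimal sketch of one iteration (in Python/NumPy) with the illustrative choice \(w = \Vert \cdot \Vert _1\); the function names and this choice of w are ours, not the paper's:

```python
import numpy as np

def soft_threshold(v, t):
    # proximal map of t*||.||_1 (soft-thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def qm_prox_subgradient_step(x, A, b, M, t):
    """One proximal subgradient step on (Q_M) with C = {z : z <= b}, the Euclidean norm
    and w = ||.||_1, i.e. on the objective M*||[Ax - b]_+||_2 + ||x||_1 (stepsize t > 0)."""
    r = np.maximum(A @ x - b, 0.0)              # constraint violation [Ax - b]_+
    nrm = np.linalg.norm(r)
    g = M * (A.T @ (r / nrm)) if nrm > 0 else np.zeros_like(x)   # a subgradient of the penalty term
    return soft_threshold(x - t * g, t)
```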

Recall that thanks to Theorem 3.1, problem (Q\(_M\)) has the same optimal set as (Q) under the following assumption.

Assumption 1

  1. (a)

    w is proper closed and convex.

  2. (b)

    C is nonempty closed and convex.

  3. (c)

    \(\text{ val }(Q):=\min _{\textbf{A}\textbf{x}\in C} w(\textbf{x}) >-\infty \).

  4. (d)

    \(\exists \hat{\textbf{x}} \in \displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}w), \textbf{A}\hat{\textbf{x}} \in \displaystyle \mathop {\textrm{ri}}(C)\)

  5. (e)

    \(M>\Vert \textbf{y}^*\Vert _a^*\) for some optimal solution \(\textbf{y}^*\) of the dual problem.

Suppose now that we have an algorithm for solving problem \((\text{ Q}_M)\), and that the sequence generated by the algorithm \(\{\textbf{x}^k\}_{k \ge 0}\) satisfies the following complexity result in terms of function values:

$$\begin{aligned} M d_{C,\Vert \cdot \Vert _a}(\textbf{A}\textbf{x}^k)+w(\textbf{x}^k)-M d_{C,\Vert \cdot \Vert _a}(\textbf{A}\textbf{x}^*)-w(\textbf{x}^*)\le \alpha (k), \end{aligned}$$
(4.1)

where \(\alpha : \mathbb {R}_{++} \rightarrow \mathbb {R}_+\) satisfies

$$\begin{aligned} \alpha (t)\rightarrow 0 \text{ as } t \rightarrow \infty \end{aligned}$$
(4.2)

and \(\textbf{x}^*\) is an optimal solution of problem \((\text{ Q}_M)\). We will now show that the complexity result (4.1) can be translated to a complexity result in terms of the original problem (Q) in the sense that we get an \(\alpha (k)\)-rate of convergence in terms of the original objective function and the constraint violation.

Theorem 4.1

Suppose that Assumption 1 holds for model (Q), and assume that a sequence \(\{\textbf{x}^k\}_{k \ge 0}\) satisfies (4.1) with \(\alpha : \mathbb {R}_{++} \rightarrow \mathbb {R}_+\) satisfying (4.2) and \(\textbf{x}^*\) being an optimal solution of problem \((\text{ Q}_M)\). Then \(\textbf{x}^*\) is an optimal solution of (Q) and the following holds for all \(k \ge 0\):

$$\begin{aligned} w(\textbf{x}^k)-w(\textbf{x}^*) &\le \alpha (k), \\ d_{C,\Vert \cdot \Vert _a}(\textbf{A}\textbf{x}^k) &\le \frac{2\alpha (k)}{M-\Vert \textbf{y}^*\Vert _a^*}. \end{aligned}$$

Proof

By Theorem 3.1, \(\textbf{x}^*\) is also an optimal solution of problem (Q), and thus, in particular, \(d_{C,\Vert \cdot \Vert _a}(\textbf{A}\textbf{x}^*)=0\). Therefore, (4.1) can be rewritten as

$$\begin{aligned} M d_{C, \Vert \cdot \Vert _a}(\textbf{A}\textbf{x}^k)+w(\textbf{x}^k)-w(\textbf{x}^*)\le \alpha (k). \end{aligned}$$
(4.3)

By the nonnegativity of the distance function, it follows that \(w(\textbf{x}^k)-w(\textbf{x}^*)\le \alpha (k).\) Take \(M' =\frac{\Vert \textbf{y}^*\Vert _a^*+M}{2}\). Then (4.3) can be written as

$$\begin{aligned} (M-M')d_{C,\Vert \cdot \Vert _a}(\textbf{A}\textbf{x}^k)+ M' d_{C,\Vert \cdot \Vert _a}(\textbf{A}\textbf{x}^k)+w(\textbf{x}^k)-w(\textbf{x}^*)\le \alpha (k). \end{aligned}$$
(4.4)

By Assumption 1(e) one has \(M>\Vert \textbf{y}^*\Vert _a^*\), and hence \(M'>\Vert \textbf{y}^*\Vert _a^*\). It then follows by Theorem 3.1 that \(\textbf{x}^*\) is also a minimizer of

$$\begin{aligned} M' d_{C,\Vert \cdot \Vert _a}(\textbf{A}\textbf{x})+w(\textbf{x}), \end{aligned}$$

which implies in particular that

$$\begin{aligned} M' d_{C,\Vert \cdot \Vert _a}(\textbf{A}\textbf{x}^k)+w(\textbf{x}^k)\ge M'd_{C,\Vert \cdot \Vert _a}(\textbf{A}\textbf{x}^*)+w(\textbf{x}^*) =w(\textbf{x}^*), \end{aligned}$$

and thus, by (4.4), we conclude that

$$\begin{aligned} d_{C,\Vert \cdot \Vert _a}(\textbf{A}\textbf{x}^k) \le \frac{\alpha (k)}{M-M'}=\frac{2\alpha (k)}{M-\Vert \textbf{y}^*\Vert _a^*}. \end{aligned}$$

\(\square \)

5 Bounding the Dual Optimal Solution

The results in Sects. 3 and 4 assume that we are given a bound on the norm of a dual optimal solution. This bound is not always easy to derive. It is very well known that the boundedness of the dual optimal solution set of a given convex problem is guaranteed under the usual Slater condition, see e.g., [17]. In fact, for the classical inequality constrained convex optimization problem, it is possible to exhibit an explicit bound on the norm of dual optimal solutions. More specifically, with \(\{f_{i}\}_{i=0}^m\) convex functions on \(\mathbb {R}^d\), assume that for the convex optimization problem

$$\begin{aligned} (CC)\qquad f_*:= \min \{ f_0(\textbf{x}): f_i(\textbf{x})\le 0, \; i =1,\ldots ,m, \; \textbf{x}\in \mathbb {R}^d\}, \end{aligned}$$

there exists \(\bar{\textbf{x}} \in \mathbb {R}^d\) such that \( f_i(\bar{\textbf{x}}) <0, \; i=1,\ldots ,m\), and that \(f_* >-\infty \). Obviously, \(\bar{\textbf{x}}\) is a Slater point of (CC). Then, it is known and easy to show (see e.g., [5, Exercise 5.3.1, p. 516]) that for any dual optimal solution \(\textbf{y}^* \in \mathbb {R}^m_{+}\), one has

$$\begin{aligned} \Vert \textbf{y}^*\Vert _1 \le \frac{1}{r} \left( f_0(\bar{\textbf{x}}) - f_*\right) , \;\; \text {with}\; r:=\min _{1\le i \le m}(-f_i(\bar{\textbf{x}})). \end{aligned}$$
(5.1)

However, to the best of our knowledge, the derivation of such an explicit bound on an optimal dual solution of the general convex model (G) does not seem to have been addressed in the literature. In this section, we show that given a Slater point of the primal general problem (G), we can evaluate such a bound in terms of the Slater point without actually needing to compute the dual problem. We then illustrate the potential benefits of this theoretical result.

The model that we consider is our general model (G) (equation (1.1)) under the following assumption.

Assumption 2

  1. (a)

    \(f: \mathbb {R}^m \rightarrow (-\infty ,\infty ]\) is proper closed and convex.

  2. (b)

    \(w: \mathbb {R}^n \rightarrow (-\infty , \infty ]\) is proper closed and convex.

  3. (c)

    The optimal set of (G) is nonempty.

For the sake of the analysis in this section, we will assume that \(\displaystyle \mathop {\textrm{dom}}(f)\) has the structure

$$\begin{aligned} \displaystyle \mathop {\textrm{dom}}(f) = \{\textbf{b}\} \times C, \end{aligned}$$

where \(\textbf{b}\in \mathbb {R}^{m_1}\) and \(C \subseteq \mathbb {R}^{m_2}\) is a nonempty closed and convex set (\(m_1+m_2=m\)). We partition \(\textbf{A}\) as \(\textbf{A}= \begin{pmatrix} \textbf{A}_1 \\ \textbf{A}_2 \end{pmatrix}\), where \(\textbf{A}_1 \in \mathbb {R}^{m_1\times n}, \textbf{A}_2\in \mathbb {R}^{m_2 \times n}\). The domain of \(\textbf{x}\mapsto f(\textbf{A}\textbf{x})\) is then given by \(\{\textbf{x}: \textbf{A}_1 \textbf{x}= \textbf{b}, \textbf{A}_2 \textbf{x}\in C\}.\) We assume that \(\textbf{A}_1\) has full row rank (a mild assumption since otherwise we can remove linearly dependent rows). We make the convention that the case \(m_1=0\) corresponds to the situation where the domain of \(\textbf{x}\mapsto f(\textbf{A}\textbf{x})\) is \(\{\textbf{x}: \textbf{A}_2 \textbf{x}\in C\}\) and that the case \(m_2=0\) corresponds to the case where this domain is \(\{\textbf{x}: \textbf{A}_1 \textbf{x}= \textbf{b}\}\). The partition of a vector \(\textbf{z}\in \mathbb {R}^m\) into \(m_1\) and \(m_2\)-length vectors is given by \(\textbf{z}= (\textbf{z}_1^T,\textbf{z}_2^T)^T\), where \(\textbf{z}_1 \in \mathbb {R}^{m_1}, \textbf{z}_2 \in \mathbb {R}^{m_2}\). We will assume that \(\mathbb {R}^m\) is endowed with the norm

$$\begin{aligned} \Vert \textbf{z}\Vert _{\alpha }:=\Vert \textbf{z}\Vert _{\alpha _1,\alpha _2} = \Vert \textbf{z}_1\Vert _{\alpha _1}+\Vert \textbf{z}_2\Vert _{\alpha _2}, \end{aligned}$$

where \(\Vert \cdot \Vert _{\alpha _1}\) and \(\Vert \cdot \Vert _{\alpha _2}\) are norms on \(\mathbb {R}^{m_1}\) and \(\mathbb {R}^{m_2}\) respectively. The dual norm is (as before, \(\textbf{y}_1, \textbf{y}_2\) are the \(m_1\) and \(m_2\)-length blocks of \(\textbf{y}\))

$$\begin{aligned} \Vert \textbf{y}\Vert _{\alpha _1,\alpha _2}^* = \max \{ \Vert \textbf{y}_1\Vert _{\alpha _1}^*, \Vert \textbf{y}_2\Vert _{\alpha _2}^*\}. \end{aligned}$$

Recall that the dual of problem (G) is

$$\begin{aligned} \text{(DG) } \quad \max _{\textbf{y}} \left\{ -f^*(\textbf{y})-w^*(-\textbf{A}^T \textbf{y}) \right\} . \end{aligned}$$

The dual objective function is thus

$$\begin{aligned} q(\textbf{y}) \equiv -f^*(\textbf{y})-w^*(-\textbf{A}^T \textbf{y}) = \min _{\textbf{x}\in \mathbb {R}^n,\textbf{z}\in \mathbb {R}^m} \mathcal {L}(\textbf{x},\textbf{z};\textbf{y}), \end{aligned}$$
(5.2)

where \(\mathcal {L}(\textbf{x},\textbf{z};\textbf{y})\) is the Lagrangian function given by

$$\begin{aligned} \mathcal {L}(\textbf{x},\textbf{z};\textbf{y}) = f(\textbf{z})+w(\textbf{x})+\langle \textbf{A}\textbf{x}-\textbf{z},\textbf{y}\rangle . \end{aligned}$$

Using the partitions of \(\textbf{y}\) and \(\textbf{z}\) to \(m_1\) and \(m_2\)-length vectors \(\textbf{y}= (\textbf{y}_1^T,\textbf{y}_2^T)^T\), \(\textbf{z}= (\textbf{z}_1^T,\textbf{z}_2^T)^T\), the Lagrangian can thus be rewritten as

$$\begin{aligned} \mathcal {L}(\textbf{x},\textbf{z};\textbf{y}) = f(\textbf{z})+w(\textbf{x})+\langle \textbf{A}_1 \textbf{x}-\textbf{z}_1, \textbf{y}_1\rangle +\langle \textbf{A}_2 \textbf{x}-\textbf{z}_2, \textbf{y}_2 \rangle . \end{aligned}$$
(5.3)

Strong duality of the pair (G) and (DG) is guaranteed if we assume, in addition to Assumption 2, the following Slater condition (similar to condition (3.1)):

$$\begin{aligned} \exists \hat{\textbf{x}} \in \displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}(w)) \text{ s.t. } \textbf{A}\hat{\textbf{x}} \in \displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}(f)). \end{aligned}$$
(5.4)

For the sake of the current analysis, we will replace the above condition with a slightly stronger condition: \(\exists \bar{\textbf{x}}: \textbf{A}_1 \bar{\textbf{x}} = \textbf{b}, \textbf{A}_2 \bar{\textbf{x}} \in \displaystyle \mathop {\textrm{int}}(C), \bar{\textbf{x}} \in \displaystyle \mathop {\textrm{int}}(\displaystyle \mathop {\textrm{dom}}(w))\). The exact assumption, in a more quantitative form, is now stated.

Assumption 3

There exist \(r>0, s>0\) and \(\bar{\textbf{x}}\) such that \(\textbf{A}_1 \bar{\textbf{x}} = \textbf{b}\), \(B_{\alpha _2}[\textbf{A}_2 \bar{\textbf{x}}, r] \subseteq C\) and \(B_2[\bar{\textbf{x}}, s] \subseteq \displaystyle \mathop {\textrm{dom}}(w)\).

As usual, if \(m_2=0\), we make the convention that Assumption 3 reduces to “there exist \(s>0\) and \(\bar{\textbf{x}}\) such that \(\textbf{A}_1 \bar{\textbf{x}} = \textbf{b}, B_2[\bar{\textbf{x}}, s] \subseteq \displaystyle \mathop {\textrm{dom}}(w)\)” and in the case where \(m_1=0\) the assumption reduces to “there exist \(r>0, s>0\) and \(\bar{\textbf{x}}\) such that \(B_{\alpha _2}[\textbf{A}_2 \bar{\textbf{x}}, r] \subseteq C, B_2[\bar{\textbf{x}}, s] \subseteq \displaystyle \mathop {\textrm{dom}}(w)\)".

We are now ready to prove the main theorem connecting an upper bound on the norm of optimal dual solutions to a given Slater point.

Theorem 5.1

(Bound on optimal dual solutions) Suppose that Assumptions 2 and 3 hold with \(\bar{\textbf{x}} \in \mathbb {R}^n, r>0\) and \(s>0\). Let \(\textbf{y}\) be an optimal solution of the dual problem (DG). Then

$$\begin{aligned} \Vert \textbf{y}_2\Vert _{\alpha _2}^* \le C_2, \end{aligned}$$

where

$$\begin{aligned} C_2:= \frac{\max _{\textbf{d}\in B_{\alpha _2}[\textbf{0},r]}f(\textbf{A}\bar{\textbf{x}}+\textbf{U}_2 \textbf{d})+w(\bar{\textbf{x}})-\text{ val }(G)}{r} \end{aligned}$$

and

$$\begin{aligned} \Vert \textbf{y}_1\Vert _{\alpha _1}^* \le \frac{sC_2 \Vert \textbf{A}_2\Vert _{2,\alpha _2} +f(\textbf{A}\bar{\textbf{x}})+\max _{\textbf{u}\in B_2[\textbf{0},s]} w(\bar{\textbf{x}}+\textbf{u}) -\text{ val }(G)}{s D_{2,\alpha _1^*}\sigma _{\min }(\textbf{A}_1)}, \end{aligned}$$

where \(\Vert \textbf{A}_2\Vert _{2,\alpha _2} = \max \{ \Vert \textbf{A}_2 \textbf{v}\Vert _{\alpha _2}: \Vert \textbf{v}\Vert _2=1\}\), \(\sigma _{\min }(\textbf{A}_1) = \sqrt{\lambda _{\min }(\textbf{A}_1 \textbf{A}_1^T)}\) is the minimal singular value of \(\textbf{A}_1\), \(D_{2,\alpha _1^*}\) is a constant satisfying \(\Vert \textbf{y}_1\Vert _2 \ge D_{2,\alpha _1^*} \Vert \textbf{y}_1\Vert _{\alpha _1}^*\) for all \(\textbf{y}_1 \in \mathbb {R}^{m_1}\), and \(\textbf{U}_2\in \mathbb {R}^{m \times m_2}\) is the submatrix of \(\textbf{I}_{m}\) comprising its last \(m_2\) columns.

Proof

By the definition of \(\textbf{U}_2\), for any \(\textbf{w}\in \mathbb {R}^{m_2}\), we have that \(\textbf{U}_2 \textbf{w}= \begin{pmatrix} \textbf{0}_{m_1} \\ \textbf{w}\end{pmatrix} \in \mathbb {R}^m\). Define \(\bar{\textbf{z}}_1 = \textbf{b}, \bar{\textbf{z}}_2 = \textbf{A}_2 \bar{\textbf{x}}\). For any \(\textbf{d}\in B_{\alpha _2}[\textbf{0},r], \textbf{u}\in B_2[\textbf{0},s]\), utilizing (5.2) and (5.3) and Assumption 3, we have

$$\begin{aligned} \text{ val }(G) &= \text{ val }(DG) =q(\textbf{y}) \le \mathcal {L}(\bar{\textbf{x}}+\textbf{u},\bar{\textbf{z}}+\textbf{U}_2 \textbf{d};\textbf{y})\\ &= f(\bar{\textbf{z}}+\textbf{U}_2 \textbf{d})+w(\bar{\textbf{x}}+\textbf{u})+\langle \textbf{A}(\bar{\textbf{x}}+ \textbf{u})-\bar{\textbf{z}}-\textbf{U}_2 \textbf{d},\textbf{y}\rangle \\ &= f(\bar{\textbf{z}}+\textbf{U}_2 \textbf{d})+w(\bar{\textbf{x}}+\textbf{u})+\langle \textbf{A}_1 \bar{\textbf{x}}+\textbf{A}_1 \textbf{u}-\bar{\textbf{z}}_1,\textbf{y}_1\rangle + \langle \textbf{A}_2 \bar{\textbf{x}}+\textbf{A}_2 \textbf{u}-\bar{\textbf{z}}_2-\textbf{d},\textbf{y}_2\rangle \\ &= f(\bar{\textbf{z}}+\textbf{U}_2 \textbf{d})+w(\bar{\textbf{x}}+\textbf{u})+\langle \textbf{A}_1 \textbf{u},\textbf{y}_1\rangle + \langle \textbf{A}_2 \textbf{u}-\textbf{d},\textbf{y}_2\rangle , \end{aligned}$$

where the last equality follows from the relations \(\textbf{A}_1 \bar{\textbf{x}} = \bar{\textbf{z}}_1\) and \(\textbf{A}_2 \bar{\textbf{x}} = \bar{\textbf{z}}_2\). Rearranging terms, we obtain that

$$\begin{aligned} \langle \textbf{d}, \textbf{y}_2 \rangle -\langle \textbf{u}, \textbf{A}_1^T \textbf{y}_1+\textbf{A}_2^T \textbf{y}_2\rangle \le f(\bar{\textbf{z}}+\textbf{U}_2 \textbf{d})+w(\bar{\textbf{x}}+\textbf{u})-\text{ val }(G). \end{aligned}$$
(5.5)

Take \(\tilde{\textbf{d}}\) with \(\Vert \tilde{\textbf{d}}\Vert _{\alpha _2}=r\) such that \(\langle \tilde{\textbf{d}}, \textbf{y}_2 \rangle = r\Vert \textbf{y}_2\Vert _{\alpha _2}^*\) (such a \(\tilde{\textbf{d}}\) exists by the definition of the dual norm). Also, define

$$\begin{aligned} \tilde{\textbf{u}} = \left\{ \begin{array}{ll} -s\frac{\textbf{A}_1^T \textbf{y}_1}{\Vert \textbf{A}_1^T \textbf{y}_1\Vert _2},&{} \textbf{A}_1^T \textbf{y}_1 \ne \textbf{0}, \\ \textbf{0}, &{} \textbf{A}_1^T \textbf{y}_1 = \textbf{0}. \end{array} \right. \end{aligned}$$

so that \(\langle \tilde{\textbf{u}}, \textbf{A}_1^T \textbf{y}_1 \rangle = -s \Vert \textbf{A}_1^T \textbf{y}_1\Vert _2\). Plugging \(\textbf{d}= \tilde{\textbf{d}}\) and \(\textbf{u}= \textbf{0}\) in (5.5) yields

$$\begin{aligned} \Vert \textbf{y}_2\Vert _{\alpha _2}^* &\le \frac{f(\bar{\textbf{z}}+\textbf{U}_2 \tilde{\textbf{d}})+w(\bar{\textbf{x}})-\text{ val }(G)}{r}\\ &\le \underbrace{\frac{\max _{\textbf{d}\in B_{\alpha _2}[\textbf{0},r]}f(\textbf{A}\bar{\textbf{x}}+\textbf{U}_2 \textbf{d})+w(\bar{\textbf{x}})-\text{ val }(G)}{r}}_{C_2}. \end{aligned}$$

Plugging \(\textbf{d}=\textbf{0}\) and \(\textbf{u}= \tilde{\textbf{u}}\) in (5.5), we obtain

$$\begin{aligned} s \Vert \textbf{A}_1^T \textbf{y}_1\Vert _2 - \langle \tilde{\textbf{u}}, \textbf{A}_2^T \textbf{y}_2\rangle \le f(\textbf{A}\bar{\textbf{x}})+w(\bar{\textbf{x}}+\tilde{\textbf{u}})-\text{ val }(G). \end{aligned}$$
(5.6)

We have by the Cauchy-Schwarz inequality that (see Footnote 3)

$$\begin{aligned}\langle \tilde{\textbf{u}}, \textbf{A}_2^T \textbf{y}_2\rangle \le \Vert \tilde{\textbf{u}}\Vert _2 \cdot \Vert \textbf{A}_2^T \textbf{y}_2\Vert _2 \le \Vert \tilde{\textbf{u}}\Vert _2 \cdot \Vert \textbf{A}_2^T\Vert _{\alpha _2^*,2} \cdot \Vert \textbf{y}_2\Vert _{\alpha _2}^* \le sC_2 \Vert \textbf{A}_2\Vert _{2,\alpha _2}, \end{aligned}$$

which combined with (5.6) yields

$$\begin{aligned} s\Vert \textbf{A}_1^T \textbf{y}_1\Vert _2 \le sC_2 \Vert \textbf{A}_2\Vert _{2,\alpha _2}+f(\textbf{A}\bar{\textbf{x}})+\max _{\textbf{u}\in B_2[\textbf{0},s]} w(\bar{\textbf{x}}+\textbf{u}) -\text{ val }(G). \end{aligned}$$

Using the fact that \(\Vert \textbf{A}_1^T \textbf{y}_1\Vert _2\ge \sqrt{\lambda _{\min } (\textbf{A}_1 \textbf{A}_1^T)}\Vert \textbf{y}_1\Vert _2\ge D_{2,\alpha _1^*}\sigma _{\min }(\textbf{A}_1) \Vert \textbf{y}_1\Vert _{\alpha _1}^*\), we finally obtain that

$$\begin{aligned} \Vert \textbf{y}_1\Vert _{{\alpha _1}}^* \le \frac{sC_2 \Vert \textbf{A}_2\Vert _{2,\alpha _2} +f(\textbf{A}\bar{\textbf{x}})+\max _{\textbf{u}\in B_2[\textbf{0},s]} w(\bar{\textbf{x}}+\textbf{u}) -\text{ val }(G)}{s D_{2,\alpha _1^*}\sigma _{\min }(\textbf{A}_1)}. \end{aligned}$$

\(\square \)

Remark 5.1

(Case \(m_1=0\) ) In the case where \(m_1=0\), in which the domain of \(\textbf{x}\mapsto f(\textbf{A}\textbf{x})\) is \(\{ \textbf{x}: \textbf{A}\textbf{x}\in C\}\), the result is that under Assumption 3 it holds that

$$\begin{aligned} \Vert \textbf{y}\Vert _{\alpha _2}^* \le \frac{\max _{\textbf{d}\in B_{\alpha _2}[\textbf{0},r]}f(\textbf{A}\bar{\textbf{x}}+\textbf{d})+w(\bar{\textbf{x}})-\text{ val }(G)}{r}. \end{aligned}$$

Remark 5.2

(Case \(m_2=0\) ) In the case where \(m_2=0\), in which the domain of \(\textbf{x}\mapsto f(\textbf{A}\textbf{x})\) is \(\{ \textbf{x}: \textbf{A}\textbf{x}= \textbf{b}\}\), the result is that under Assumption 3 it holds that

$$\begin{aligned} \Vert \textbf{y}\Vert _{\alpha _1}^* \le \frac{f(\textbf{A}\bar{\textbf{x}})+\max _{\textbf{u}\in B_2[\textbf{0},s]} w(\bar{\textbf{x}}+\textbf{u}) - \text{ val }(G)}{s D_{2,\alpha _1^*} \sigma _{\min }(\textbf{A})}. \end{aligned}$$

Application Examples We end this section with some applications illustrating the potential of our results.

Example 5.1

(Basis pursuit) Consider the so-called “basis pursuit” problem

$$\begin{aligned} \min \{ \Vert \textbf{x}\Vert _1 : \textbf{A}\textbf{x}= \textbf{b}\}\end{aligned}$$
(5.7)

that fits the general model (G) with \(w(\textbf{x}) = \Vert \textbf{x}\Vert _1\) and \(f = \delta _{\{\textbf{b}\}}\). Problem (5.7) is a well-known “convex” relaxation of a compressed sensing model. Suppose that \(\bar{\textbf{x}}\) satisfies \(\textbf{A}\bar{\textbf{x}}=\textbf{b}\) and s is an arbitrary positive scalar. If we take \(\Vert \cdot \Vert _{\alpha _1}=\Vert \cdot \Vert _2\), then according to Remark 5.2, the bound that we obtain on \(\Vert \textbf{y}\Vert _2\) is

$$\begin{aligned} \frac{\max _{\textbf{u}\in B_2[\textbf{0},s]} \Vert \bar{\textbf{x}}+\textbf{u}\Vert _1-\text{ val }(G)}{s\sigma _{\min }(\textbf{A})} &\le \frac{\Vert \bar{\textbf{x}}\Vert _1+\max _{\Vert \textbf{u}\Vert _2 \le s} \Vert \textbf{u}\Vert _1-\text{ val }(G)}{s \sigma _{\min }(\textbf{A})}\\ &= \frac{s\sqrt{n}+\Vert \bar{\textbf{x}}\Vert _1-\text{ val }(G)}{s\sigma _{\min }(\textbf{A})}. \end{aligned}$$

Taking \(s \rightarrow \infty \) (as s can be taken arbitrarily large), we obtain the bound

$$\begin{aligned}\Vert \textbf{y}\Vert _2 \le \frac{\sqrt{n}}{\sigma _{\min }(\textbf{A})}.\end{aligned}$$

Invoking Theorem 3.1, and recalling that \(f^{[\gamma ]}(\textbf{z}) = \gamma \Vert \textbf{z}-\textbf{b}\Vert _2\) (see Example 2.1), we can now deduce that problem (5.7) is equivalent to

$$\begin{aligned}\min _{\textbf{x}} \Vert \textbf{x}\Vert _1 + \gamma \Vert \textbf{A}\textbf{x}-\textbf{b}\Vert _2\end{aligned}$$

whenever \(\gamma >\frac{\sqrt{n}}{\sigma _{\min }(\textbf{A})}\).

This provides an exact penalty unconstrained reformulation of problem (5.7) (see e.g., [5, 16]), with an explicit exact penalty parameter.
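This equivalence can be sanity-checked numerically; below is a small sketch on a random instance, assuming the cvxpy package is available (the instance, the seed and the factor 1.01 are arbitrary choices of ours):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n = 10, 40
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n)

# any gamma strictly larger than sqrt(n)/sigma_min(A) is an exact penalty parameter
gamma = 1.01 * np.sqrt(n) / np.linalg.svd(A, compute_uv=False).min()

x_bp = cp.Variable(n)
val_bp = cp.Problem(cp.Minimize(cp.norm1(x_bp)), [A @ x_bp == b]).solve()

x_pen = cp.Variable(n)
val_pen = cp.Problem(cp.Minimize(cp.norm1(x_pen) + gamma * cp.norm(A @ x_pen - b, 2))).solve()

print(val_bp, val_pen)  # by Theorem 3.1 the two optimal values (and optimal sets) coincide
```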

Example 5.2

(Nonsmooth minimization over linear inequalities) Consider the problem

$$\begin{aligned} \min \{ w(\textbf{x}) : \textbf{A}\textbf{x}\le \textbf{b}\},\end{aligned}$$
(5.8)

where w is a real-valued convex function. Denote the rows of \(\textbf{A}\in \mathbb {R}^{m \times n}\) by \(\textbf{a}_1^T,\ldots ,\textbf{a}_m^T\). This problem fits the general model (G) with \(f = \delta _{C}\) where \(C = \{\textbf{z}: \textbf{z}\le \textbf{b}\}\). Let \(\bar{\textbf{x}}\) be a point satisfying \(\textbf{A}\bar{\textbf{x}}<\textbf{b}\). Obviously \(B_{\infty }[\textbf{A}\bar{\textbf{x}},r] \subseteq C\) with \(r = \min _{i \in [m]}\{b_i-\textbf{a}_i^T \bar{\textbf{x}}\}\). Then according to Remark 5.1, we have the following bound (see Footnote 4) on the \(\ell _1\)-norm of the dual optimal solution:

$$\begin{aligned} \Vert \textbf{y}\Vert _1 \le \frac{w(\bar{\textbf{x}})-\text{ val }(G)}{ \min _{i \in [m]}\{b_i-\textbf{a}_i^T \bar{\textbf{x}}\}}.\end{aligned}$$

For a given \(M>0\), the M-Lipschitz counterpart of f is

$$\begin{aligned} f^{[M]}(\textbf{z}) = \min \{M \Vert \textbf{w}-\textbf{z}\Vert _{\infty }: \textbf{w}\le \textbf{b}\}=M \max \{ \max \{z_i-b_i: i \in [m]\}, 0\}.\end{aligned}$$

Thus, by Theorem 3.1, problem (5.8) is equivalent to

$$\begin{aligned} \min \left\{ w(\textbf{x})+ \gamma \max \left\{ \max _{i \in [m]}\{\textbf{a}_i^T \textbf{x}-b_i\}, 0\right\} \right\} ,\end{aligned}$$

as long as \(\gamma >\frac{w(\bar{\textbf{x}})-\text{ val }(G)}{ \min _{i \in [m]}\{b_i-\textbf{a}_i^T \bar{\textbf{x}}\}}\). In the case where \(\textbf{b}>\textbf{0}\) and w is nonnegative, we can choose \(\bar{\textbf{x}}=\textbf{0}\) and use the fact that \(\text{ val }(G)\ge 0\) to obtain the simplified upper bound

$$\begin{aligned}\Vert \textbf{y}\Vert _1 \le \frac{w(\textbf{0})}{ \min _{i \in [m]}b_i}.\end{aligned}$$
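As in the previous example, the resulting exact penalty reformulation can be checked numerically; a small sketch, assuming cvxpy is available and using the illustrative nonnegative choice \(w(\textbf{x})=\Vert \textbf{x}-\textbf{c}\Vert _1\) (the instance and names are ours):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
m, n = 30, 8
A = rng.standard_normal((m, n))
b = rng.uniform(0.5, 2.0, size=m)     # b > 0, so xbar = 0 is a Slater point
c = rng.standard_normal(n)

# simplified bound: ||y||_1 <= w(0)/min_i b_i; any strictly larger gamma is an exact penalty parameter
gamma = 1.01 * np.sum(np.abs(c)) / b.min()

x1 = cp.Variable(n)
v1 = cp.Problem(cp.Minimize(cp.norm1(x1 - c)), [A @ x1 <= b]).solve()

x2 = cp.Variable(n)
v2 = cp.Problem(cp.Minimize(cp.norm1(x2 - c) + gamma * cp.pos(cp.max(A @ x2 - b)))).solve()

print(v1, v2)  # expected to coincide up to solver tolerance
```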

Example 5.3

(Analytic center of polytopes) Consider the problem of finding the analytic center of the set \(P = \{\textbf{x}: \textbf{A}\textbf{x}\ge \textbf{b}\}\) with \(\textbf{A}\in \mathbb {R}^{m \times n}\) and \(\textbf{b}\in \mathbb {R}^m\):

$$\begin{aligned} \text{(AC) } \quad \min _{\textbf{x}\in \mathbb {R}^n} \left\{ \sum _{i=1}^m - \log (\textbf{a}_i^T \textbf{x}-b_i): \textbf{A}\textbf{x}>\textbf{b}\right\} .\end{aligned}$$

Here \(\textbf{a}_1^T,\ldots ,\textbf{a}_m^T\) are the rows of \(\textbf{A}\). This problem fits our model (G) with \(w \equiv 0\) and \(f(\textbf{z}) = \sum _{i=1}^m f_i(z_i)\), where

$$\begin{aligned} f_i(z_i) = \left\{ \begin{array}{ll} -\log (z_i-b_i), &{} z_i>b_i,\\ \infty , &{} \text{ else. } \end{array} \right. \end{aligned}$$

By Example 2.3, we have that

$$\begin{aligned} f^{[M]}(\textbf{z}) = \sum _{i=1}^m h_{b_i}^{[M]}(z_i),\end{aligned}$$

where for any \(c,z \in \mathbb {R},\)

$$\begin{aligned} h_c^{[M]}(z) &= \left\{ \begin{array}{ll} -\log (z-c), & z>c+\frac{1}{M},\\ \log (M)+1+Mc-Mz, & \text{ else}, \end{array} \right. \\ &= -\log \left( \max \left\{ z,c+\frac{1}{M} \right\} -c\right) +M \left| \max \left\{ z,c+\frac{1}{M} \right\} -z \right| . \end{aligned}$$
(5.9)

Since the underlying norm on the primal space is the \(\ell _1\)-norm, it follows that we need to upper bound the \(\ell _{\infty }\)-norm of the dual optimal solution, and this is done using Remark 5.1:

$$\begin{aligned} \Vert \textbf{y}\Vert _{\infty } \le \frac{\max _{\textbf{d}\in B_{1}[\textbf{0},r]}f(\textbf{A}\bar{\textbf{x}}+\textbf{d})-\text{ val }(AC)}{r},\end{aligned}$$
(5.10)

where \(\bar{\textbf{x}}\) is a point satisfying \(\textbf{A}\bar{\textbf{x}}>\textbf{b}\) and \(B_1[\textbf{A}\bar{\textbf{x}}, r] \subseteq \{\textbf{z}: \textbf{z}>\textbf{b}\}\). Since \(\textbf{A}\bar{\textbf{x}}>\textbf{b}\), the choice \(r = \frac{1}{2}\min _{i \in [m]}\{ \textbf{a}_i^T \bar{\textbf{x}}-b_i\}\) implies the inclusion relation, and we will use this value for r. We also have

$$\begin{aligned} \max _{\textbf{d}\in B_{1}[\textbf{0},r]}f(\textbf{A}\bar{\textbf{x}}+\textbf{d}) = \max _{\textbf{d}\in B_{1}[\textbf{0},r]}\left( -\sum _{i=1}^m \log (\textbf{a}_i^T \bar{\textbf{x}}-b_i+d_i)\right) {\mathop {\le }\limits ^{d_i\ge -\Vert \textbf{d}\Vert _1\ge -r}} -\sum _{i=1}^m \log (\textbf{a}_i^T \bar{\textbf{x}}-b_i-r). \end{aligned}$$

If in addition we know that the polytope P is bounded and contained in \(B_2[\textbf{0},R]\), then we can also find a lower bound on the optimal value of (AC) using the following obvious inequality that holds for any feasible \(\textbf{x}\):

$$\begin{aligned} -\sum _{i=1}^m \log (\textbf{a}_i^T \textbf{x}-b_i) \ge - \sum _{i=1}^m \log (\Vert \textbf{a}_i\Vert _2 R -b_i). \end{aligned}$$

Thus, the bound (5.10) in this setting implies that

$$\begin{aligned} \Vert \textbf{y}\Vert _{\infty } \le \frac{-\sum _{i=1}^m \log (\textbf{a}_i^T \bar{\textbf{x}}-b_i-r)+\sum _{i=1}^m \log (\Vert \textbf{a}_i\Vert _2 R -b_i)}{r}. \end{aligned}$$

The problem (AC) is therefore equivalent to

$$\begin{aligned} \min _{\textbf{x}} \sum _{i=1}^m \left( -\log \left( \max \left\{ (\textbf{A}\textbf{x})_i,b_i+\frac{1}{M} \right\} -b_i\right) +M \left| \max \left\{ (\textbf{A}\textbf{x})_i,b_i+\frac{1}{M} \right\} -(\textbf{A}\textbf{x})_i \right| \right) , \end{aligned}$$

as long as

$$\begin{aligned} M > \frac{-\sum _{i=1}^m \log ((\textbf{A}\bar{\textbf{x}})_i-b_i-r)+\sum _{i=1}^m \log (\Vert \textbf{a}_i\Vert R -b_i)}{r}, \end{aligned}$$

where \(r = \frac{1}{2}\min _{i \in [m]}\{ \textbf{a}_i^T \bar{\textbf{x}}-b_i\}\).
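A small numerical sketch (in Python/NumPy) of how this threshold for M can be computed from the problem data, assuming a strictly feasible point \(\bar{\textbf{x}}\) and a radius R with \(P \subseteq B_2[\textbf{0},R]\) are available (the function name is ours):

```python
import numpy as np

def analytic_center_penalty_bound(A, b, xbar, R):
    """Threshold for the exact penalty parameter M in Example 5.3,
    assuming A @ xbar > b componentwise and P is contained in B_2[0, R]."""
    slack = A @ xbar - b                          # a_i^T xbar - b_i > 0
    r = 0.5 * slack.min()                         # r = (1/2) min_i (a_i^T xbar - b_i)
    row_norms = np.linalg.norm(A, axis=1)         # ||a_i||_2
    upper = -np.sum(np.log(slack - r))            # bound on max_{||d||_1 <= r} f(A xbar + d)
    lower = -np.sum(np.log(row_norms * R - b))    # lower bound on val(AC)
    return (upper - lower) / r                    # any M strictly above this value is exact
```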