1 Introduction

A fundamental generic optimization problem that covers various classes of convex models arising in many modern applications is the well-known composite minimization problem, which consists of minimizing the sum of two nonsmooth extended-valued convex functions, one of which is composed with a linear map:

$$\begin{aligned} \text{(G) } \quad \text{ val }(G):=\min _{\textbf{x}} f(\textbf{A}\textbf{x})+w(\textbf{x}), \end{aligned}$$
(1.1)

where both \(f: \mathbb {R}^m \rightarrow (-\infty ,\infty ]\) and \(w: \mathbb {R}^n \rightarrow (-\infty ,\infty ]\) are proper closed and convex and \(\textbf{A}\in \mathbb {R}^{m \times n}\).

This model is very rich and under specific assumptions on the problem’s data, it has led to the development of fundamental primal and primal-dual optimization algorithms, see e.g., [2, 3, 6] and references therein.

Simple algorithms for solving (G) are based on primal first-order methods, whereby we suppose that only w admits a computationally tractable proximal map [13], and we obviously want to avoid the proximal computation of \(\textbf{x}\mapsto f(\textbf{A}\textbf{x})\), which in general is intractable even when f is prox-tractable (see Footnote 1). A central property required in the non-asymptotic convergence rate analysis (iteration complexity) in terms of function values of such primal methods is the Lipschitz continuity of the function f. Therefore, whenever the Lipschitz continuity of f is absent, as may occur in many applications modeled by problem (G), the use of simple primal-based methods might be impossible. Two examples of such simple algorithms, in which w is prox-tractable and the Lipschitz continuity of f is required, are:

  1. (a)

    Proximal subgradient method [15, 20] The proximal subgradient method takes the form (see Footnote 2) \(\textbf{x}^{k+1} = \textrm{prox}_{t_k w}(\textbf{x}^k-t_k \textbf{A}^T f'(\textbf{A}\textbf{x}^k))\), where \(t_k>0\) is a stepsize, and for any proper closed and convex function \(s: \mathbb {R}^n \rightarrow (-\infty ,\infty ]\),

    $$\begin{aligned} \textrm{prox}_{s}(\textbf{x}) = \displaystyle \mathop {\text{ argmin }}_{\textbf{u}} \left\{ s(\textbf{u})+\frac{1}{2}\Vert \textbf{u}-\textbf{x}\Vert ^2 \right\} \end{aligned}$$

    stands for the proximal map of s [13]. The Lipschitz continuity of f is, however, a key assumption needed for establishing a rate of convergence [2, Section 9.3].

  2. (b)

    Smoothing-based methods A common way to solve (G) is to replace f by a smooth approximation \(f_{\mu }\) (\(\mu >0\) is a smoothing parameter), where by “smooth approximation” we mean that \(f_{\mu }\) is \(\frac{\alpha }{\mu }\)-smooth (\(\alpha >0\)) and that

    $$\begin{aligned} (AS)\qquad f_{\mu }(\textbf{x}) \le f(\textbf{x}) \le f_{\mu }(\textbf{x})+\beta \mu , \qquad \text {for some parameter } \beta >0. \end{aligned}$$

    Then, an accelerated proximal gradient method is employed on the smooth problem \(\min f_{\mu }(\textbf{A}\textbf{x})+w(\textbf{x})\); see [4, 14]. The latter approach, in which the smoothing parameter is fixed in advance, can also be refined into an adaptive smoothing scheme that employs one iteration of an accelerated method on the function \(f_{\mu _k}(\textbf{A}\textbf{x})+w(\textbf{x})\), where \(\mu _k\) is a decreasing sequence that diminishes to zero as k, the dynamic iteration index, increases; see for instance [7, 21]. The existence of such smooth approximations satisfying (AS) is guaranteed when f is Lipschitz continuous. Unfortunately, in general such a guarantee does not exist.
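To illustrate the smoothing in item (b), here is a minimal sketch (in Python/NumPy) with an illustrative choice of f that is not taken from the paper: for \(f=\Vert \cdot \Vert _2\), the function \(f_{\mu }(\textbf{x})=\sqrt{\Vert \textbf{x}\Vert _2^2+\mu ^2}-\mu \) is \(\frac{1}{\mu }\)-smooth and satisfies (AS) with \(\alpha =\beta =1\).

```python
import numpy as np

def f(x):
    # the Lipschitz function being smoothed (illustrative choice): f(x) = ||x||_2
    return np.linalg.norm(x)

def f_mu(x, mu):
    # smooth approximation: (1/mu)-smooth, and f_mu <= f <= f_mu + mu,
    # so condition (AS) holds with alpha = beta = 1
    return np.sqrt(np.dot(x, x) + mu**2) - mu

def grad_f_mu(x, mu):
    # gradient of the smooth approximation
    return x / np.sqrt(np.dot(x, x) + mu**2)
```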

When both f and w admit computationally efficient proximal maps [13], one can consider tackling the composite model (G) by applying primal-dual Lagrangian-based methods, such as the popular Alternating Direction Method of Multipliers (ADMM) scheme [9] and its related variants; see for instance [5, 8, 9, 12, 19] and references therein. However, to obtain rates of convergence in terms of function values for these methods, some type of Lipschitz continuity is often required (see for example [8, Remark 3] and [19]), while improved types of convergence results can be derived only under additional assumptions, see e.g., [18].

Contribution and Outline We introduce a theoretical framework in which the restrictive Lipschitz continuity of the function f of problem (G) is not required. The derivation and the development of our results rely on a powerful fact involving the so-called Pasch-Hausdorff (PH) envelope of a function [10], which consists of the infimal convolution of the given function with a scaled norm, and which generates a Lipschitz continuous function. This is presented in Sect. 2, where we also derive a new dual formulation of the PH envelope which is a key player in our analysis. The main idea is then to replace the function f with its PH envelope, which allows us to construct an exact Lipschitz regularization of problem (G); a simple and useful property which appears to have been overlooked in the literature. We prove that as long as the PH parameter is larger than a bound on the norm of a dual optimal solution, problem (G) and its exact Lipschitz regularization counterpart are equivalent; see Sect. 3. In Sect. 4 we show how the aforementioned equivalence result can be utilized in establishing function value-based rates of convergence in terms of the original data. A bound on the norm of a dual optimal solution, as required by the equivalence result, is not always easy to derive. We address this issue in Sect. 5, where we show that given a Slater point for the general convex model (G), we can evaluate such a bound in terms of this Slater point, without actually needing to compute the dual problem. Throughout the paper, we provide examples and applications which illustrate the potential benefits of our approach.

Notation Vectors are denoted by boldface lowercase letters, e.g., \(\textbf{y}\), and matrices by boldface uppercase letters, e.g., \(\textbf{B}\). The vectors of all zeros and ones are denoted by \(\textbf{0}\) and \(\textbf{e}\) respectively. The underlying spaces are \(\mathbb {R}^n\)-spaces endowed with an inner product \(\langle \cdot , \cdot \rangle \). The closed ball with center \(\textbf{c}\in \mathbb {R}^n\) and radius \(r>0\) w.r.t. a norm \(\Vert \cdot \Vert _a\) is denoted by \(B_a[\textbf{c},r] = B_{\Vert \cdot \Vert _a}[\textbf{c},r]=\{\textbf{x}\in \mathbb {R}^n: \Vert \textbf{x}-\textbf{c}\Vert _a \le r\}\) and the corresponding open ball by \(B_a(\textbf{c},r) = \{\textbf{x}\in \mathbb {R}^n: \Vert \textbf{x}-\textbf{c}\Vert _a < r\}\). Given a matrix \(\textbf{A}\in \mathbb {R}^{m \times n}\), \(\Vert \textbf{A}\Vert \) denotes its spectral norm: \(\Vert \textbf{A}\Vert = \sqrt{\lambda _{\max }(\textbf{A}^T \textbf{A})}\). We use the standard notation \([n]\equiv \{1,2,\ldots ,n\}\) for a positive integer n. For any extended real-valued function h, the conjugate is defined as \(h^*(\textbf{y}) \equiv \max _{\textbf{x}} \left\{ \langle \textbf{x},\textbf{y}\rangle - h(\textbf{x}) \right\} \). For a given set S, the indicator function \(\delta _S(\textbf{x})\) is equal to 0 if \(\textbf{x}\in S\) and \(\infty \) otherwise. Further standard definitions or notations in convex analysis which are not explicitly mentioned here can be found in the classical book [17].

2 The Pasch-Hausdorff Lipschitz Regularization

Assume that \(\mathbb {R}^m\) is endowed with some norm \(\Vert \cdot \Vert _a\). The dual norm is denoted by \(\Vert \cdot \Vert _a^*\) (not to be confused with the Fenchel conjugate). A natural way to “transform” a function \(h: \mathbb {R}^m \rightarrow (-\infty , \infty ]\) into a Lipschitz continuous function is via the Pasch-Hausdorff (PH) envelope [1, Section 12.3] defined for a parameter \(M>0\) as

$$\begin{aligned} h^{[M]}(\textbf{x}) := h\Box (M\Vert \cdot \Vert _a) (\textbf{x}) = \min _{\textbf{z}} \{h(\textbf{z})+M\Vert \textbf{z}-\textbf{x}\Vert _a\}. \end{aligned}$$
(2.1)

It is well known [1, Proposition 12.17] that if a proper function h has an M-Lipschitz minorant (w.r.t. \(\Vert \cdot \Vert _a\)), then \(h^{[M]}\) is the largest M-Lipschitz minorant of h, and the only other case is when \(h^{[M]} \equiv -\infty \). This result does not require any convexity assumption on h.

2.1 A Dual Representation of The Pasch-Hausdorff Envelope

In our setting of problem (G), f is proper closed and convex. In this case, we will now show that the PH envelope admits a dual representation that will be essential to our analysis. This property is stated in the following lemma. For the sake of completeness, we also state and prove the elementary properties asserting that \(f^{[M]}\) is an M-Lipschitz minorant of f. Before proceeding, recall that for any set C, \(\delta _C^*(\textbf{y}) = \sigma _C(\textbf{y}):=\max \{\langle \textbf{y}, \textbf{x}\rangle : \textbf{x}\in C\}\), and \(\displaystyle \mathop {\textrm{ri}}(C)\) stands for the relative interior of C, which is nonempty whenever the set C is nonempty and convex [17, Theorem 6.2].

Lemma 2.1

(Dual representation of \(f^{[M]}\)) Let \(f:\mathbb {R}^m \rightarrow (-\infty ,\infty ]\) be a proper closed and convex function. Suppose that there exists \(\hat{\textbf{y}} \in \displaystyle \mathop {\textrm{dom}}(f^*)\) such that \(\Vert \hat{\textbf{y}}\Vert _a^* < M\) for some \(M>0\). Then

  1. (a)

    It holds that

    $$\begin{aligned} f^{[M]} = (f^*+\delta _{B_{\Vert \cdot \Vert _a^*}[\textbf{0},M]})^*; \end{aligned}$$
    (2.2)
  2. (b)

    \(f^{[M]}\) is real-valued and convex and the minimal value in (2.1) is attained;

  3. (c)

    [1, Proposition 12.17] \(f^{[M]}(\textbf{x})\le f(\textbf{x})\) for all \(\textbf{x}\);

  4. (d)

    [1, Proposition 12.17] \(f^{[M]}: \mathbb {R}^m \rightarrow \mathbb {R}\) is M-Lipschitz continuous w.r.t. the norm \(\Vert \cdot \Vert _a\).

Proof

(a+b) By [17, Theorem 16.4], if \(\displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}(f^*)) \cap B_{\Vert \cdot \Vert _a^*}(\textbf{0},M) \ne \emptyset \), then

$$\begin{aligned} (f^*+\delta _{B_{\Vert \cdot \Vert _a^*}[\textbf{0},M]})^* = f^{**} \Box \delta _{B_{\Vert \cdot \Vert _a^*}[\textbf{0},M]}^*. \end{aligned}$$

Since \(f^{**}=f\) (as f is proper closed and convex), and \(\delta _{B_{\Vert \cdot \Vert _a^*}[\textbf{0},M]}^* = \sigma _{B_{\Vert \cdot \Vert _a^*}[\textbf{0},M]} = M \Vert \cdot \Vert _a\), we obtain that \((f^*+\delta _{B_{\Vert \cdot \Vert _a^*}[\textbf{0},M]})^*= f \Box (M \Vert \cdot \Vert _a)=f^{[M]}\). The result [17, Theorem 16.4] also establishes the finiteness and attainment of the minimal value in (2.1). The convexity of \(f^{[M]}\) follows by the fact that it is a conjugate function, see [2, Theorem 4.3]. What is left is to show that \(\displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}(f^*)) \cap B_{\Vert \cdot \Vert _a^*}(\textbf{0},M) \ne \emptyset \). Indeed, since \(\displaystyle \mathop {\textrm{dom}}(f^*)\) is convex and nonempty (by the convexity and properness of f [2, Theorem 4.5]), it follows that there exists \(\tilde{\textbf{y}} \in \displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}(f^*))\). Therefore, recalling that \(\hat{\textbf{y}} \in \displaystyle \mathop {\textrm{dom}}(f^*)\), by the line segment principle, for any \(\lambda \in (0,1)\) we have that \(\hat{\textbf{y}}+\lambda (\tilde{\textbf{y}}-\hat{\textbf{y}}) \in \displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}(f^*))\). Thus, we can take \(\tilde{\lambda } \in (0,1)\) small enough for which \(\hat{\textbf{y}}+\tilde{\lambda }(\tilde{\textbf{y}}-\hat{\textbf{y}}) \in \displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}(f^*)) \cap B_{\Vert \cdot \Vert _a^*}(\textbf{0},M).\)

(c) Follows from the following elementary argument:

$$\begin{aligned} f^{[M]}(\textbf{x}) = (f \Box M \Vert \cdot \Vert _a)(\textbf{x}) =\min _{\textbf{z}} \{f(\textbf{z})+M\Vert \textbf{x}-\textbf{z}\Vert _a\} \le f(\textbf{x})+M\Vert \textbf{x}-\textbf{x}\Vert _a = f(\textbf{x}). \end{aligned}$$

(d) Note that by part (b) \(f^{[M]}\) is real-valued. Then by the triangle inequality,

$$\begin{aligned} f^{[M]}(\textbf{x}) &= \min _{\textbf{z}} \{ f(\textbf{z})+M \Vert \textbf{z}-\textbf{x}\Vert _a\} \\ &\le \min _{\textbf{z}} \{ f(\textbf{z})+M \Vert \textbf{z}-\textbf{y}\Vert _a\}+M \Vert \textbf{x}-\textbf{y}\Vert _a = f^{[M]}(\textbf{y})+M \Vert \textbf{x}-\textbf{y}\Vert _a. \end{aligned}$$

Changing the roles of \(\textbf{x}\) and \(\textbf{y}\) we also obtain that \(f^{[M]}(\textbf{y}) \le f^{[M]}(\textbf{x}) + M\Vert \textbf{x}-\textbf{y}\Vert _a\), thus establishing the desired result that \(|f^{[M]}(\textbf{x})-f^{[M]}(\textbf{y})| \le M \Vert \textbf{x}-\textbf{y}\Vert _a\) for any \(\textbf{x},\textbf{y}\in \mathbb {R}^m\). \(\square \)

2.2 Some Examples of PH Envelopes

Obviously, computing the PH envelope can be a challenging task. In this section we describe several cases in which its evaluation is tractable. In what follows, for any nonempty set C the distance function with respect to a norm \(\Vert \cdot \Vert _a\) is defined by \(d_{C,\Vert \cdot \Vert _a}(\textbf{x}) = \min _{\textbf{y}\in C} \Vert \textbf{y}-\textbf{x}\Vert _a\). If the distance function is with respect to the Euclidean norm \(\Vert \cdot \Vert = \sqrt{\langle \cdot ,\cdot \rangle }\), then we will simply write \(d_C\).

Example 2.1

(Indicator function) Suppose \(f=\delta _C\) where C is a nonempty closed and convex set. Then the PH envelope of f is given by

$$\begin{aligned} f^{[M]}(\textbf{x}) = (\delta _C \Box M \Vert \cdot \Vert _a)(\textbf{x}) = \min _{\textbf{z}\in C} M\Vert \textbf{z}-\textbf{x}\Vert _a = M d_{C,\Vert \cdot \Vert _a}(\textbf{x}).\end{aligned}$$
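In particular, whenever the projection onto C is easy to compute and \(\Vert \cdot \Vert _a = \Vert \cdot \Vert _2\), evaluating the envelope amounts to a single projection. A minimal sketch (in Python/NumPy) for the illustrative case of a box \(C=[\textbf{lo},\textbf{hi}]\); the function name and this choice of C are ours, for illustration only:

```python
import numpy as np

def ph_envelope_box_indicator(x, lo, hi, M):
    # Example 2.1 with C = [lo, hi]^m and ||.||_a = ||.||_2:
    # (delta_C)^{[M]}(x) = M * d_C(x) = M * ||x - P_C(x)||_2
    proj = np.clip(x, lo, hi)              # Euclidean projection onto the box
    return M * np.linalg.norm(x - proj)
```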

Example 2.2

(Ball-pen) Consider the so-called “ball-pen” function \(f: \mathbb {R}^n \rightarrow (-\infty ,\infty ]\) given by \(f(\textbf{x}) = -\sqrt{1-\Vert \textbf{x}\Vert _2^2}\) with \(\displaystyle \mathop {\textrm{dom}}(f) = B[\textbf{0},1]\). Here we assume that \(\Vert \cdot \Vert _a = \Vert \cdot \Vert _2\). Being an extended real-valued function, f is obviously not Lipschitz continuous. The M-Lipschitz PH envelope is given by

$$\begin{aligned} f^{[M]}(\textbf{x}) = \min _{\textbf{z}} \left\{ -\sqrt{1-\Vert \textbf{z}\Vert _2^2}+M \Vert \textbf{x}-\textbf{z}\Vert _2\right\} . \end{aligned}$$

A rather technical argument that uses the dual representation (2.2) shows that (see Appendix A)

$$\begin{aligned} f^{[M]}(\textbf{x}) = - \sqrt{1- \min \left\{ \Vert \textbf{x}\Vert _2, \frac{M}{\sqrt{M^2+1}}\right\} ^2}+ M\max \left\{ 0, \Vert \textbf{x}\Vert _2-\frac{M}{\sqrt{M^2+1}}\right\} . \end{aligned}$$
(2.3)

A one-dimensional illustration is given in Fig. 1.

Fig. 1 The one-dimensional function \(f(x)=-\sqrt{1-x^2}\) and its 1-Lipschitz PH envelope \(f^{[1]}\). The two functions coincide in the interval \([-\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}]\)
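Expression (2.3) can also be checked numerically against the defining minimization; a minimal sketch in one dimension (the grid resolution and the test points are arbitrary choices of ours):

```python
import numpy as np

def ball_pen_ph_grid(x, M, npts=200001):
    # brute-force evaluation of min_z { -sqrt(1 - z^2) + M|x - z| } over a fine grid in [-1, 1]
    z = np.linspace(-1.0, 1.0, npts)
    return np.min(-np.sqrt(1.0 - z**2) + M * np.abs(x - z))

def ball_pen_ph_closed_form(x, M):
    # formula (2.3) in one dimension
    t = M / np.sqrt(M**2 + 1.0)
    return -np.sqrt(1.0 - min(abs(x), t)**2) + M * max(0.0, abs(x) - t)

for x in (0.0, 0.5, 1.0, 2.0):
    print(x, ball_pen_ph_grid(x, M=1.0), ball_pen_ph_closed_form(x, M=1.0))
```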

Example 2.3

(Minus sum of logs) Let \(\textbf{b}\in \mathbb {R}^m\) and define \(f(\textbf{z}):= \sum _{i=1}^m f_i(z_i)\), where

$$\begin{aligned} f_i(z_i) = \left\{ \begin{array}{ll} -\log (z_i-b_i), &{} z_i>b_i,\\ \infty , &{} \text{ else. } \end{array} \right. \end{aligned}$$

We want to find the M-Lipschitz PH envelope of f w.r.t. the \(\ell _1\)-norm:

$$\begin{aligned} f^{[M]}(\textbf{z}) {=} \min _{\textbf{u}\in \mathbb {R}^m} \{ f(\textbf{u})+M \Vert \textbf{z}-\textbf{u}\Vert _1\} = \sum _{i=1}^m \min _{u_i \in \mathbb {R}} \{ -\log (u_i-b_i)+M|z_i-u_i| :u_i>b_i\}. \end{aligned}$$
(2.4)

Note that in the above we exploited the separability of the \(\ell _1\)-norm, which demonstrates that the choice of norm can be essential to the ability to compute the PH envelope. Indeed, in this case, computing an explicit expression for the PH envelope under the \(\ell _2\)-norm, for example, seems to be a difficult task. By (2.4),

$$\begin{aligned} f^{[M]}(\textbf{z}) = \sum _{i=1}^m h_{b_i}^{[M]}(z_i), \end{aligned}$$

where for any \(c,z \in \mathbb {R}\) we define

$$\begin{aligned} h_c^{[M]}(z)=\min _{u\in \mathbb {R}} \{-\log (u-c)+M |z-u| : u>c\}.\end{aligned}$$
(2.5)

Thus, computing \(f^{[M]}\) amounts to solving the one-dimensional problem (2.5). An explicit expression for \(h_c^{[M]}\) is

$$\begin{aligned} h_c^{[M]}(z) &= \left\{ \begin{array}{ll} -\log (z-c), & z>c+\frac{1}{M},\\ \log (M)+1+Mc-Mz, & \text{ else}, \end{array} \right. \\ &= -\log \left( \max \left\{ z,c+\frac{1}{M} \right\} -c\right) +M \left| \max \left\{ z,c+\frac{1}{M} \right\} -z \right| . \end{aligned}$$
(2.6)

The validity of the above expression for \(h_c^{[M]}\) is shown in Appendix B.
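As with the previous example, (2.6) can be verified numerically against the one-dimensional problem (2.5); a minimal sketch (the grid, its range and the test values are arbitrary choices of ours):

```python
import numpy as np

def h_grid(z, c, M, npts=1_000_001, span=50.0):
    # brute-force evaluation of min_{u > c} { -log(u - c) + M|z - u| } on a fine grid
    u = np.linspace(c + 1e-8, c + span, npts)
    return np.min(-np.log(u - c) + M * np.abs(z - u))

def h_closed_form(z, c, M):
    # formula (2.6)
    v = max(z, c + 1.0 / M)
    return -np.log(v - c) + M * abs(v - z)

for z in (-1.0, 0.3, 2.0):
    print(z, h_grid(z, c=0.0, M=2.0), h_closed_form(z, c=0.0, M=2.0))
```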

3 Exact Lipschitz Regularization for Model (G)

We now return to the general model (G) (equation (1.1)) under the conditions that f and w are proper closed and convex. The main idea is to replace the function f with its PH envelope, which yields the Lipschitz regularized problem

$$\begin{aligned} \text{(G}_M) \quad \text{ val }(G_M):=\min _{\textbf{x}} f^{[M]}(\textbf{A}\textbf{x})+w(\textbf{x}). \end{aligned}$$

The main question that we wish to address is

Under which conditions are problems (G) and (G\(_M\)) equivalent?

By “equivalent” we mean that the optimal sets of the two problems are identical. We will show in Theorem 3.1 below that as long as M is larger than the dual norm of some optimal solution of the dual problem, such an equivalence holds. Since duality arguments are essential in our analysis, we first recall the well-known dual problem of (G):

$$\begin{aligned} \text{(DG) } \quad \max _{\textbf{y}} \left\{ -f^*(\textbf{y})-w^*(-\textbf{A}^T \textbf{y})\right\} . \end{aligned}$$

According to [17, Corollary 31.2.1], to guarantee strong duality, it is sufficient that the constraint qualification

$$\begin{aligned} \exists \hat{\textbf{x}} \in \displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}w), \textbf{A}\hat{\textbf{x}} \in \displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}(f)) \end{aligned}$$
(3.1)

holds. We are now ready to answer the main question stated above.

Theorem 3.1

(Equivalence between (G) and (G\(_M\))) Suppose that f and w are proper closed and convex functions and that condition (3.1) holds. In addition, assume that \(\text{ val }(G)>-\infty \). Let \(\textbf{y}^*\) be an optimal solution of the dual problem (DG) and let \(M>\Vert \textbf{y}^*\Vert _a^*\). Then

  1. (a)

    Problems (G) and \((G_M)\) have the same optimal value.

  2. (b)

    If \(\textbf{x}^*\) is an optimal solution of problem \((G_M)\), then \(f^{[M]}(\textbf{A}\textbf{x}^*)=f(\textbf{A}\textbf{x}^*)\).

  3. (c)

    Problems (G) and \((G_M)\) have the same optimal sets.

Proof

  1. (a)

    By condition (3.1) and the finiteness of \(\text{ val }(G)\), it follows that \(\text{ val }(G)=\text{ val }(DG)\). Since the optimal solution \(\textbf{y}^*\) of the dual problem satisfies \(\Vert \textbf{y}^*\Vert _a^* <M\), it follows that \(\text{ val }(DG)=\text{ val }(R)\), where (R) is the problem

    $$\begin{aligned} (\text{ R}) \quad \max _{\textbf{y}} \left\{ -f^*(\textbf{y})-\delta _{B_{\Vert \cdot \Vert _a^*}[\textbf{0},M]}(\textbf{y})-w^*(-\textbf{A}^T\textbf{y})\right\} . \end{aligned}$$

    Note that by the dual representation (2.2), (R) is actually

    $$\begin{aligned} \max _{\textbf{y}} \left\{ -(f^{[M]})^*(\textbf{y})-w^*(-\textbf{A}^T\textbf{y})\right\} , \end{aligned}$$

    meaning that (R) is the dual problem to \((G_M)\). In particular, by weak duality, \(\text{ val }(G_M) \ge \text{ val }(R)\), which is finite. Moreover, by Lemma 2.1(b), \(\displaystyle \mathop {\textrm{dom}}(f^{[M]})=\mathbb {R}^m\), and thus the condition

    $$\begin{aligned} \exists \hat{\textbf{x}} \in \displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}w), \textbf{A}\hat{\textbf{x}} \in \displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}(f^{[M]})) \end{aligned}$$

    amounts to “\(\exists \hat{\textbf{x}} \in \displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}w)\)”, which trivially holds as the relative interior of a nonempty convex set is always nonempty. Consequently, strong duality between problems (R) and \((G_M)\) holds. We can finally conclude that

    $$\begin{aligned} \text{ val }(G_M) = \text{ val }(R) = \text{ val }(DG) = \text{ val }(G). \end{aligned}$$
  2. (b)

    Note the following observation that follows from part (a): for any \(N>\Vert \textbf{y}^*\Vert _a^*\), it holds that \(\text{ val }(G_N) = \text{ val }(G)\). Suppose that \(\textbf{x}^*\) is an optimal solution of \((G_M)\). Then

    $$\begin{aligned} f^{[M]}(\textbf{A}\textbf{x}^*)+w(\textbf{x}^*) = \text{ val }(G). \end{aligned}$$

    Assume by contradiction that \(f^{[M]}(\textbf{A}\textbf{x}^*) \ne f(\textbf{A}\textbf{x}^*)\). This means that there exists \(\textbf{z}\ne \textbf{A}\textbf{x}^*\) such that \(f^{[M]}(\textbf{A}\textbf{x}^*) = M \Vert \textbf{z}-\textbf{A}\textbf{x}^*\Vert _a+f(\textbf{z})\). Take \(M' \in (\Vert \textbf{y}^*\Vert _a^*, M)\); then \(f^{[M']}(\textbf{A}\textbf{x}^*) \le M' \Vert \textbf{z}-\textbf{A}\textbf{x}^*\Vert _a+f(\textbf{z})< M \Vert \textbf{z}-\textbf{A}\textbf{x}^*\Vert _a+f(\textbf{z}) = f^{[M]}(\textbf{A}\textbf{x}^*)\), and therefore

    $$\begin{aligned} \text{ val }(G_{M'}) \le f^{[M']}(\textbf{A}\textbf{x}^*) +w(\textbf{x}^*) <f^{[M]}(\textbf{A}\textbf{x}^*) +w(\textbf{x}^*)=\text{ val }(G_M), \end{aligned}$$

    which is a contradiction to the observation indicated at the beginning of the proof of this part.

  3. (c)

    Assume that \(\textbf{x}^*\) is an optimal solution of (G). Then

    $$\begin{aligned} \text{ val }(G) = f(\textbf{A}\textbf{x}^*) +w(\textbf{x}^*) {\mathop {\ge }\limits ^{(*)}} f^{[M]}(\textbf{A}\textbf{x}^*)+w(\textbf{x}^*) \ge \text{ val }(G_M), \end{aligned}$$

    where \((*)\) follows from Lemma 2.1(c). However, since by part (a) we have that \(\text{ val }(G_M)=\text{ val }(G)\), it follows that \( f^{[M]}(\textbf{A}\textbf{x}^*)+w(\textbf{x}^*) = \text{ val }(G_M)\) and hence that \(\textbf{x}^*\) is an optimal solution of \((G_M)\). In the opposite direction, assume that \(\textbf{x}^*\) is an optimal solution of problem \((G_M)\). Then by part (b), \(f^{[M]}(\textbf{A}\textbf{x}^*)=f(\textbf{A}\textbf{x}^*)\) and consequently,

    $$\begin{aligned} \text{ val }(G_M) = f^{[M]}(\textbf{A}\textbf{x}^*)+w(\textbf{x}^*) = f(\textbf{A}\textbf{x}^*)+w(\textbf{x}^*) \ge \text{ val }(G), \end{aligned}$$

    and since \(\text{ val }(G)=\text{ val }(G_M)\) (part (a)) we conclude that \(f(\textbf{A}\textbf{x}^*)+w(\textbf{x}^*) = \text{ val }(G)\), meaning that \(\textbf{x}^*\) is an optimal solution of (G).

\(\square \)

Remark 3.1

(Exact Penalty Viewpoint) Theorem 3.1 can be seen as a consequence of a result of Han and Mangasarian [11] on sufficient conditions for exact penalty functions. Specifically, we can rewrite problem (G) as

$$\begin{aligned} \text{(G-Con) } \; \min _{\textbf{x},\textbf{z}} \{ f(\textbf{z})+w(\textbf{x}): \textbf{z}= \textbf{A}\textbf{x}\}. \end{aligned}$$

and consider the penalized problem

$$\begin{aligned} (\text{ G-Pen}_{M}) \; \min _{\textbf{x},\textbf{z}} \{ f(\textbf{z})+w(\textbf{x}) +M \Vert \textbf{z}-\textbf{A}\textbf{x}\Vert _a\}. \end{aligned}$$

Fixing \(\textbf{x}\) and minimizing with respect to \(\textbf{z}\), we obtain problem \((G_M)\). Theorem 4.9 from [11] shows that indeed when M is larger than the dual norm of an optimal dual solution, problem \((\text{ G-Pen}_{M})\) has the same optimal set as problem (G-Con), which readily implies that the optimal sets of (G) and \((G_M)\) coincide. Our simple proof, which is of independent interest, reveals the benefit of the PH envelope for the exact penalty approach. Additionally, our work highlights that existing penalty approaches implicitly generate Lipschitz functions, a property crucial for convergence rate analysis.

What remains is of course the question of how to find a bound on the optimal set of the dual problem. This issue will be studied in Sect. 5.

The objective function of problem \((G_M)\) includes the Lipschitz continuous component \(f^{[M]}\). This enables the use of a basic first-order method to achieve non-asymptotic rates of convergence in terms of function values, as explained in the introduction. However, these rates of convergence will depend on the PH envelope \(f^{[M]}\). In the next section we show that in the case where f is an indicator function, rates of convergence in terms of the original data can be obtained.

4 Algorithm Iteration Complexity for a Constrained Model

We focus on the important constrained model

$$\begin{aligned} (Q) \quad \min \{w(\textbf{x}): \textbf{A}\textbf{x}\in C\}, \end{aligned}$$

where \(w: \mathbb {R}^n \rightarrow (-\infty ,\infty ]\) is proper closed and convex, \(\textbf{A}\in \mathbb {R}^{m \times n}\) and \(C \subseteq \mathbb {R}^m\) is a nonempty closed and convex set. Model (Q) fits model (G) with \(f = \delta _C\).

By Example 2.1, \(f^{[M]}(\textbf{x})=M d_{C,\Vert \cdot \Vert _a}(\textbf{x})\), and hence the M-Lipschitz regularization of problem (Q) is

$$\begin{aligned} (\text{ Q}_M) \quad \min _{\textbf{x}} \{ M d_{C, \Vert \cdot \Vert _a}(\textbf{A}\textbf{x}) + w(\textbf{x})\}. \end{aligned}$$

For example, if \(C = \{\textbf{b}\}\), meaning that problem (Q) is \(\min \{w(\textbf{x}): \textbf{A}\textbf{x}= \textbf{b}\}\), then \((Q_M)\) has the form

$$\begin{aligned} \min _{\textbf{x}} M \Vert \textbf{A}\textbf{x}-\textbf{b}\Vert _a+w(\textbf{x}). \end{aligned}$$

If \(C = \{ \textbf{z}: \textbf{z}\le \textbf{b}\}\), meaning that problem (Q) is \(\min \{ w(\textbf{x}): \textbf{A}\textbf{x}\le \textbf{b}\}\), then in the case where \(\Vert \cdot \Vert _a = \Vert \cdot \Vert _2\), \((Q_M)\) has the form

$$\begin{aligned} \min _{\textbf{x}} M \Vert [\textbf{A}\textbf{x}-\textbf{b}]_+\Vert _2+w(\textbf{x}). \end{aligned}$$
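For this last instance, a valid subgradient of the penalty term \(\textbf{x}\mapsto M \Vert [\textbf{A}\textbf{x}-\textbf{b}]_+\Vert _2\) is available in closed form, so the proximal subgradient method recalled in the introduction can be applied directly. A minimal sketch of one iteration (in Python/NumPy) with the illustrative choice \(w = \Vert \cdot \Vert _1\); the function names and this choice of w are ours, not the paper's:

```python
import numpy as np

def soft_threshold(v, t):
    # proximal map of t*||.||_1 (soft-thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def qm_prox_subgradient_step(x, A, b, M, t):
    """One proximal subgradient step on (Q_M) with C = {z : z <= b}, the Euclidean norm
    and w = ||.||_1, i.e. on the objective M*||[Ax - b]_+||_2 + ||x||_1 (stepsize t > 0)."""
    r = np.maximum(A @ x - b, 0.0)              # constraint violation [Ax - b]_+
    nrm = np.linalg.norm(r)
    g = M * (A.T @ (r / nrm)) if nrm > 0 else np.zeros_like(x)   # a subgradient of the penalty term
    return soft_threshold(x - t * g, t)
```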

Recall that thanks to Theorem 3.1, problem (Q\(_M\)) has the same optimal set as (Q) under the following assumption.

Assumption 1

  1. (a)

    w is proper closed and convex.

  2. (b)

    C is nonempty closed and convex.

  3. (c)

    \(\text{ val }(Q):=\min _{\textbf{A}\textbf{x}\in C} w(\textbf{x}) >-\infty \).

  4. (d)

    \(\exists \hat{\textbf{x}} \in \displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}w), \textbf{A}\hat{\textbf{x}} \in \displaystyle \mathop {\textrm{ri}}(C)\)

  5. (e)

    \(M>\Vert \textbf{y}^*\Vert _a^*\) for some optimal solution \(\textbf{y}^*\) of the dual problem.

Suppose now that we have an algorithm for solving problem \((\text{ Q}_M)\), and that the sequence generated by the algorithm \(\{\textbf{x}^k\}_{k \ge 0}\) satisfies the following complexity result in terms of function values:

$$\begin{aligned} M d_{C,\Vert \cdot \Vert _a}(\textbf{A}\textbf{x}^k)+w(\textbf{x}^k)-M d_{C,\Vert \cdot \Vert _a}(\textbf{A}\textbf{x}^*)-w(\textbf{x}^*)\le \alpha (k), \end{aligned}$$
(4.1)

where \(\alpha : \mathbb {R}_{++} \rightarrow \mathbb {R}_+\) satisfies

$$\begin{aligned} \alpha (t)\rightarrow 0 \text{ as } t \rightarrow \infty \end{aligned}$$
(4.2)

and \(\textbf{x}^*\) is an optimal solution of problem \((\text{ Q}_M)\). We will now show that the complexity result (4.1) can be translated to a complexity result in terms of the original problem (Q) in the sense that we get an \(\alpha (k)\)-rate of convergence in terms of the original objective function and the constraint violation.

Theorem 4.1

Suppose that Assumption 1 holds for model (Q), and assume that a sequence \(\{\textbf{x}^k\}_{k \ge 0}\) satisfies (4.1) with \(\alpha : \mathbb {R}_{++} \rightarrow \mathbb {R}_+\) satisfying (4.2) and \(\textbf{x}^*\) being an optimal solution of problem \((\text{ Q}_M)\). Then \(\textbf{x}^*\) is an optimal solution of (Q) and the following holds for all \(k \ge 0\):

$$\begin{aligned} w(\textbf{x}^k)-w(\textbf{x}^*) &\le \alpha (k), \\ d_{C,\Vert \cdot \Vert _a}(\textbf{A}\textbf{x}^k) &\le \frac{2\alpha (k)}{M-\Vert \textbf{y}^*\Vert _a^*}. \end{aligned}$$

Proof

By Theorem 3.1, \(\textbf{x}^*\) is also an optimal solution of problem (Q), and thus, in particular, \(d_{C,\Vert \cdot \Vert _a}(\textbf{A}\textbf{x}^*)=0\). Therefore, (4.1) can be rewritten as

$$\begin{aligned} M d_{C, \Vert \cdot \Vert _a}(\textbf{A}\textbf{x}^k)+w(\textbf{x}^k)-w(\textbf{x}^*)\le \alpha (k). \end{aligned}$$
(4.3)

By the nonnegativity of the distance function, it follows that \(w(\textbf{x}^k)-w(\textbf{x}^*)\le \alpha (k).\) Take \(M' =\frac{\Vert \textbf{y}^*\Vert _a^*+M}{2}\). Then (4.3) can be written as

$$\begin{aligned} (M-M')d_{C,\Vert \cdot \Vert _a}(\textbf{A}\textbf{x}^k)+ M' d_{C,\Vert \cdot \Vert _a}(\textbf{A}\textbf{x}^k)+w(\textbf{x}^k)-w(\textbf{x}^*)\le \alpha (k). \end{aligned}$$
(4.4)

By Assumption 1(e) one has \(M>\Vert \textbf{y}^*\Vert _a^*\), and hence \(M'>\Vert \textbf{y}^*\Vert _a^*\). It then follows by Theorem 3.1 that \(\textbf{x}^*\) is also a minimizer of

$$\begin{aligned} M' d_{C,\Vert \cdot \Vert _a}(\textbf{A}\textbf{x})+w(\textbf{x}), \end{aligned}$$

which implies in particular that

$$\begin{aligned} M' d_{C,\Vert \cdot \Vert _a}(\textbf{A}\textbf{x}^k)+w(\textbf{x}^k)\ge M'd_{C,\Vert \cdot \Vert _a}(\textbf{A}\textbf{x}^*)+w(\textbf{x}^*) =w(\textbf{x}^*), \end{aligned}$$

and thus, by (4.4), we conclude that

$$\begin{aligned} d_{C,\Vert \cdot \Vert _a}(\textbf{A}\textbf{x}^k) \le \frac{\alpha (k)}{M-M'}=\frac{2\alpha (k)}{M-\Vert \textbf{y}^*\Vert _a^*}. \end{aligned}$$

\(\square \)

5 Bounding the Dual Optimal Solution

The results in Sects. 3 and 4 assume that we are given a bound on the norm of a dual optimal solution. This bound is not always easy to derive. It is very well known that the boundedness of the dual optimal solution set of a given convex problem is guaranteed under the usual Slater condition, see e.g., [17]. In fact, for the classical inequality constrained convex optimization problem, it is possible to exhibit an explicit bound on the norm of dual optimal solutions. More specifically, with \(\{f_{i}\}_{i=0}^m\) convex functions on \(\mathbb {R}^d\), assume that for the convex optimization problem

$$\begin{aligned} (CC)\qquad f_*:= \min \{ f_0(\textbf{x}): f_i(\textbf{x})\le 0, \; i =1,\ldots ,m, \; \textbf{x}\in \mathbb {R}^d\}, \end{aligned}$$

there exists \(\bar{\textbf{x}} \in \mathbb {R}^d\) such that \( f_i(\bar{\textbf{x}}) <0, \; i=1,\ldots ,m\), and that \(f_* >-\infty \). Obviously, \(\bar{\textbf{x}}\) is a Slater point of (CC). Then, it is known and easy to show (see e.g., [5, Exercise 5.3.1, p. 516]) that for any dual optimal solution \(\textbf{y}^* \in \mathbb {R}^m_{+}\), one has

$$\begin{aligned} \Vert \textbf{y}^*\Vert _1 \le \frac{1}{r} \left( f_0(\bar{\textbf{x}}) - f_*\right) , \;\; \text {with}\; r:=\min _{1\le i \le m}(-f_i(\bar{\textbf{x}})). \end{aligned}$$
(5.1)

However, to the best of our knowledge, the derivation of such an explicit bound on an optimal dual solution of the general convex model (G) does not seem to have been addressed in the literature. In this section, we show that given a Slater point of the primal general problem (G), we can evaluate such a bound in terms of the Slater point without actually needing to compute the dual problem. We then illustrate the potential benefits of this theoretical result.

The model that we consider is our general model (G) (equation (1.1)) under the following assumption.

Assumption 2

  1. (a)

    \(f: \mathbb {R}^m \rightarrow (-\infty ,\infty ]\) is proper closed and convex.

  2. (b)

    \(w: \mathbb {R}^n \rightarrow (-\infty , \infty ]\) is proper closed and convex.

  3. (c)

    The optimal set of (G) is nonempty.

For the sake of the analysis in this section, we will assume that \(\displaystyle \mathop {\textrm{dom}}(f)\) has the structure

$$\begin{aligned} \displaystyle \mathop {\textrm{dom}}(f) = \{\textbf{b}\} \times C, \end{aligned}$$

where \(\textbf{b}\in \mathbb {R}^{m_1}\) and \(C \subseteq \mathbb {R}^{m_2}\) is a nonempty closed and convex set (\(m_1+m_2=m\)). We partition \(\textbf{A}\) as \(\textbf{A}= \begin{pmatrix} \textbf{A}_1 \\ \textbf{A}_2 \end{pmatrix}\), where \(\textbf{A}_1 \in \mathbb {R}^{m_1\times n}, \textbf{A}_2\in \mathbb {R}^{m_2 \times n}\). The domain of \(\textbf{x}\mapsto f(\textbf{A}\textbf{x})\) is then given by \(\{\textbf{x}: \textbf{A}_1 \textbf{x}= \textbf{b}, \textbf{A}_2 \textbf{x}\in C\}.\) We assume that \(\textbf{A}_1\) has full row rank (a mild assumption since otherwise we can remove linearly dependent rows). We make the convention that the case \(m_1=0\) corresponds to the situation where the domain of \(\textbf{x}\mapsto f(\textbf{A}\textbf{x})\) is \(\{\textbf{x}: \textbf{A}_2 \textbf{x}\in C\}\) and that the case \(m_2=0\) corresponds to the case where this domain is \(\{\textbf{x}: \textbf{A}_1 \textbf{x}= \textbf{b}\}\). The partition of a vector \(\textbf{z}\in \mathbb {R}^m\) into \(m_1\) and \(m_2\)-length vectors is given by \(\textbf{z}= (\textbf{z}_1^T,\textbf{z}_2^T)^T\), where \(\textbf{z}_1 \in \mathbb {R}^{m_1}, \textbf{z}_2 \in \mathbb {R}^{m_2}\). We will assume that \(\mathbb {R}^m\) is endowed with the norm

$$\begin{aligned} \Vert \textbf{z}\Vert _{\alpha }:=\Vert \textbf{z}\Vert _{\alpha _1,\alpha _2} = \Vert \textbf{z}_1\Vert _{\alpha _1}+\Vert \textbf{z}_2\Vert _{\alpha _2}, \end{aligned}$$

where \(\Vert \cdot \Vert _{\alpha _1}\) and \(\Vert \cdot \Vert _{\alpha _2}\) are norms on \(\mathbb {R}^{m_1}\) and \(\mathbb {R}^{m_2}\) respectively. The dual norm is (as before, \(\textbf{y}_1, \textbf{y}_2\) are the \(m_1\) and \(m_2\)-length blocks of \(\textbf{y}\))

$$\begin{aligned} \Vert \textbf{y}\Vert _{\alpha _1,\alpha _2}^* = \max \{ \Vert \textbf{y}_1\Vert _{\alpha _1}^*, \Vert \textbf{y}_2\Vert _{\alpha _2}^*\}. \end{aligned}$$

Recall that the dual of problem (G) is

$$\begin{aligned} \text{(DG) } \quad \max _{\textbf{y}} \left\{ -f^*(\textbf{y})-w^*(-\textbf{A}^T \textbf{y}) \right\} . \end{aligned}$$

The dual objective function is thus

$$\begin{aligned} q(\textbf{y}) \equiv -f^*(\textbf{y})-w^*(-\textbf{A}^T \textbf{y}) = \min _{\textbf{x}\in \mathbb {R}^n,\textbf{z}\in \mathbb {R}^m} \mathcal {L}(\textbf{x},\textbf{z};\textbf{y}), \end{aligned}$$
(5.2)

where \(\mathcal {L}(\textbf{x},\textbf{z};\textbf{y})\) is the Lagrangian function given by

$$\begin{aligned} \mathcal {L}(\textbf{x},\textbf{z};\textbf{y}) = f(\textbf{z})+w(\textbf{x})+\langle \textbf{A}\textbf{x}-\textbf{z},\textbf{y}\rangle . \end{aligned}$$

Using the partitions of \(\textbf{y}\) and \(\textbf{z}\) to \(m_1\) and \(m_2\)-length vectors \(\textbf{y}= (\textbf{y}_1^T,\textbf{y}_2^T)^T\), \(\textbf{z}= (\textbf{z}_1^T,\textbf{z}_2^T)^T\), the Lagrangian can thus be rewritten as

$$\begin{aligned} \mathcal {L}(\textbf{x},\textbf{z};\textbf{y}) = f(\textbf{z})+w(\textbf{x})+\langle \textbf{A}_1 \textbf{x}-\textbf{z}_1, \textbf{y}_1\rangle +\langle \textbf{A}_2 \textbf{x}-\textbf{z}_2, \textbf{y}_2 \rangle . \end{aligned}$$
(5.3)

Strong duality of the pair (G) and (DG) is guaranteed if we assume, in addition to Assumption 2, the following Slater condition (similar to condition (3.1)):

$$\begin{aligned} \exists \hat{\textbf{x}} \in \displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}(w)) \text{ s.t. } \textbf{A}\hat{\textbf{x}} \in \displaystyle \mathop {\textrm{ri}}(\displaystyle \mathop {\textrm{dom}}(f)). \end{aligned}$$
(5.4)

For the sake of the current analysis, we will replace the above condition with a slightly stronger condition: \(\exists \bar{\textbf{x}}: \textbf{A}_1 \bar{\textbf{x}} = \textbf{b}, \textbf{A}_2 \bar{\textbf{x}} \in \displaystyle \mathop {\textrm{int}}(C), \bar{\textbf{x}} \in \displaystyle \mathop {\textrm{int}}(\displaystyle \mathop {\textrm{dom}}(w))\). The exact assumption, in a more quantitative form, is now stated.

Assumption 3

There exist \(r>0, s>0\) and \(\bar{\textbf{x}}\) such that \(\textbf{A}_1 \bar{\textbf{x}} = \textbf{b}\), \(B_{\alpha _2}[\textbf{A}_2 \bar{\textbf{x}}, r] \subseteq C\) and \(B_2[\bar{\textbf{x}}, s] \subseteq \displaystyle \mathop {\textrm{dom}}(w)\).

As usual, if \(m_2=0\), we make the convention that Assumption 3 reduces to “there exist \(s>0\) and \(\bar{\textbf{x}}\) such that \(\textbf{A}_1 \bar{\textbf{x}} = \textbf{b}, B_2[\bar{\textbf{x}}, s] \subseteq \displaystyle \mathop {\textrm{dom}}(w)\)” and in the case where \(m_1=0\) the assumption reduces to “there exist \(r>0, s>0\) and \(\bar{\textbf{x}}\) such that \(B_{\alpha _2}[\textbf{A}_2 \bar{\textbf{x}}, r] \subseteq C, B_2[\bar{\textbf{x}}, s] \subseteq \displaystyle \mathop {\textrm{dom}}(w)\)".

We are now ready to prove the main theorem connecting an upper bound on the norm of optimal dual solutions to a given Slater point.

Theorem 5.1

(Bound on optimal dual solutions) Suppose that Assumptions 2 and 3 hold with \(\bar{\textbf{x}} \in \mathbb {R}^n, r>0\) and \(s>0\). Let \(\textbf{y}\) be an optimal solution of the dual problem (DG). Then

$$\begin{aligned} \Vert \textbf{y}_2\Vert _{\alpha _2}^* \le C_2, \end{aligned}$$

where

$$\begin{aligned} C_2:= \frac{\max _{\textbf{d}\in B_{\alpha _2}[\textbf{0},r]}f(\textbf{A}\bar{\textbf{x}}+\textbf{U}_2 \textbf{d})+w(\bar{\textbf{x}})-\text{ val }(G)}{r} \end{aligned}$$

and

$$\begin{aligned} \Vert \textbf{y}_1\Vert _{\alpha _1}^* \le \frac{sC_2 \Vert \textbf{A}_2\Vert _{2,\alpha _2} +f(\textbf{A}\bar{\textbf{x}})+\max _{\textbf{u}\in B_2[\textbf{0},s]} w(\bar{\textbf{x}}+\textbf{u}) -\text{ val }(G)}{s D_{2,\alpha _1^*}\sigma _{\min }(\textbf{A}_1)}, \end{aligned}$$

where \(\Vert \textbf{A}_2\Vert _{2,\alpha _2} = \max \{ \Vert \textbf{A}_2 \textbf{v}\Vert _{\alpha _2}: \Vert \textbf{v}\Vert _2=1\}\), \(\sigma _{\min }(\textbf{A}_1) = \sqrt{\lambda _{\min }(\textbf{A}_1 \textbf{A}_1^T)}\) is the minimal singular value of \(\textbf{A}_1\), \(D_{2,\alpha _1^*}\) is a constant satisfying \(\Vert \textbf{y}_1\Vert _2 \ge D_{2,\alpha _1^*} \Vert \textbf{y}_1\Vert _{\alpha _1}^*\) for all \(\textbf{y}_1 \in \mathbb {R}^{m_1}\), and \(\textbf{U}_2\in \mathbb {R}^{m \times m_2}\) is the submatrix of \(\textbf{I}_{m}\) comprising its last \(m_2\) columns.

Proof

By the definition of \(\textbf{U}_2\), for any \(\textbf{w}\in \mathbb {R}^{m_2}\), we have that \(\textbf{U}_2 \textbf{w}= \begin{pmatrix} \textbf{0}_{m_1} \\ \textbf{w}\end{pmatrix} \in \mathbb {R}^m\). Define \(\bar{\textbf{z}}_1 = \textbf{b}, \bar{\textbf{z}}_2 = \textbf{A}_2 \bar{\textbf{x}}\). For any \(\textbf{d}\in B_{\alpha _2}[\textbf{0},r], \textbf{u}\in B_2[\textbf{0},s]\), utilizing (5.2) and (5.3) and Assumption 3, we have

$$\begin{aligned} \text{ val }(G) &= \text{ val }(DG) =q(\textbf{y}) \le \mathcal {L}(\bar{\textbf{x}}+\textbf{u},\bar{\textbf{z}}+\textbf{U}_2 \textbf{d};\textbf{y})\\ &= f(\bar{\textbf{z}}+\textbf{U}_2 \textbf{d})+w(\bar{\textbf{x}}+\textbf{u})+\langle \textbf{A}(\bar{\textbf{x}}+ \textbf{u})-\bar{\textbf{z}}-\textbf{U}_2 \textbf{d},\textbf{y}\rangle \\ &= f(\bar{\textbf{z}}+\textbf{U}_2 \textbf{d})+w(\bar{\textbf{x}}+\textbf{u})+\langle \textbf{A}_1 \bar{\textbf{x}}+\textbf{A}_1 \textbf{u}-\bar{\textbf{z}}_1,\textbf{y}_1\rangle + \langle \textbf{A}_2 \bar{\textbf{x}}+\textbf{A}_2 \textbf{u}-\bar{\textbf{z}}_2-\textbf{d},\textbf{y}_2\rangle \\ &= f(\bar{\textbf{z}}+\textbf{U}_2 \textbf{d})+w(\bar{\textbf{x}}+\textbf{u})+\langle \textbf{A}_1 \textbf{u},\textbf{y}_1\rangle + \langle \textbf{A}_2 \textbf{u}-\textbf{d},\textbf{y}_2\rangle , \end{aligned}$$

where the last equality follows from the relations \(\textbf{A}_1 \bar{\textbf{x}} = \bar{\textbf{z}}_1\) and \(\textbf{A}_2 \bar{\textbf{x}} = \bar{\textbf{z}}_2\). Rearranging terms, we obtain that

$$\begin{aligned} \langle \textbf{d}, \textbf{y}_2 \rangle -\langle \textbf{u}, \textbf{A}_1^T \textbf{y}_1+\textbf{A}_2^T \textbf{y}_2\rangle \le f(\bar{\textbf{z}}+\textbf{U}_2 \textbf{d})+w(\bar{\textbf{x}}+\textbf{u})-\text{ val }(G). \end{aligned}$$
(5.5)

Take \(\tilde{\textbf{d}}\) with \(\Vert \tilde{\textbf{d}}\Vert _{\alpha _2}=r\) such that \(\langle \tilde{\textbf{d}}, \textbf{y}_2 \rangle = r\Vert \textbf{y}_2\Vert _{\alpha _2}^*\) (such a \(\tilde{\textbf{d}}\) exists by the definition of the dual norm). Also, define

$$\begin{aligned} \tilde{\textbf{u}} = \left\{ \begin{array}{ll} -s\frac{\textbf{A}_1^T \textbf{y}_1}{\Vert \textbf{A}_1^T \textbf{y}_1\Vert _2},&{} \textbf{A}_1^T \textbf{y}_1 \ne \textbf{0}, \\ \textbf{0}, &{} \textbf{A}_1^T \textbf{y}_1 = \textbf{0}. \end{array} \right. \end{aligned}$$

so that \(\langle \tilde{\textbf{u}}, \textbf{A}_1^T \textbf{y}_1 \rangle = -s \Vert \textbf{A}_1^T \textbf{y}_1\Vert _2\). Plugging \(\textbf{d}= \tilde{\textbf{d}}\) and \(\textbf{u}= \textbf{0}\) in (5.5) yields

$$\begin{aligned} \Vert \textbf{y}_2\Vert _{\alpha _2}^* &\le \frac{f(\bar{\textbf{z}}+\textbf{U}_2 \tilde{\textbf{d}})+w(\bar{\textbf{x}})-\text{ val }(G)}{r}\\ &\le \underbrace{\frac{\max _{\textbf{d}\in B_{\alpha _2}[\textbf{0},r]}f(\textbf{A}\bar{\textbf{x}}+\textbf{U}_2 \textbf{d})+w(\bar{\textbf{x}})-\text{ val }(G)}{r}}_{C_2}. \end{aligned}$$

Plugging \(\textbf{d}=\textbf{0}\) and \(\textbf{u}= \tilde{\textbf{u}}\) in (5.5), we obtain

$$\begin{aligned} s \Vert \textbf{A}_1^T \textbf{y}_1\Vert _2 - \langle \tilde{\textbf{u}}, \textbf{A}_2^T \textbf{y}_2\rangle \le f(\textbf{A}\bar{\textbf{x}})+w(\bar{\textbf{x}}+\tilde{\textbf{u}})-\text{ val }(G). \end{aligned}$$
(5.6)

We have by the Cauchy-Schwarz inequality that (see Footnote 3)

$$\begin{aligned}\langle \tilde{\textbf{u}}, \textbf{A}_2^T \textbf{y}_2\rangle \le \Vert \tilde{\textbf{u}}\Vert _2 \cdot \Vert \textbf{A}_2^T \textbf{y}_2\Vert _2 \le \Vert \tilde{\textbf{u}}\Vert _2 \cdot \Vert \textbf{A}_2^T\Vert _{\alpha _2^*,2} \cdot \Vert \textbf{y}_2\Vert _{\alpha _2}^* \le sC_2 \Vert \textbf{A}_2\Vert _{2,\alpha _2}, \end{aligned}$$

which combined with (5.6) yields

$$\begin{aligned} s\Vert \textbf{A}_1^T \textbf{y}_1\Vert _2 \le sC_2 \Vert \textbf{A}_2\Vert _{2,\alpha _2}+f(\textbf{A}\bar{\textbf{x}})+\max _{\textbf{u}\in B_2[\textbf{0},s]} w(\bar{\textbf{x}}+\textbf{u}) -\text{ val }(G). \end{aligned}$$

Using the fact that \(\Vert \textbf{A}_1^T \textbf{y}_1\Vert _2\ge \sqrt{\lambda _{\min } (\textbf{A}_1 \textbf{A}_1^T)}\Vert \textbf{y}_1\Vert _2\ge D_{2,\alpha _1^*}\sigma _{\min }(\textbf{A}_1) \Vert \textbf{y}_1\Vert _{\alpha _1}^*\), we finally obtain that

$$\begin{aligned} \Vert \textbf{y}_1\Vert _{{\alpha _1}}^* \le \frac{sC_2 \Vert \textbf{A}_2\Vert _{2,\alpha _2} +f(\textbf{A}\bar{\textbf{x}})+\max _{\textbf{u}\in B_2[\textbf{0},s]} w(\bar{\textbf{x}}+\textbf{u}) -\text{ val }(G)}{s D_{2,\alpha _1^*}\sigma _{\min }(\textbf{A}_1)}. \end{aligned}$$

\(\square \)

Remark 5.1

(Case \(m_1=0\) ) In the case where \(m_1=0\), in which the domain of \(\textbf{x}\mapsto f(\textbf{A}\textbf{x})\) is \(\{ \textbf{x}: \textbf{A}\textbf{x}\in C\}\), the result is that under Assumption 3 it holds that

$$\begin{aligned} \Vert \textbf{y}\Vert _{\alpha _2}^* \le \frac{\max _{\textbf{d}\in B_{\alpha _2}[\textbf{0},r]}f(\textbf{A}\bar{\textbf{x}}+\textbf{d})+w(\bar{\textbf{x}})-\text{ val }(G)}{r}. \end{aligned}$$

Remark 5.2

(Case \(m_2=0\) ) In the case where \(m_2=0\), in which the domain of \(\textbf{x}\mapsto f(\textbf{A}\textbf{x})\) is \(\{ \textbf{x}: \textbf{A}\textbf{x}= \textbf{b}\}\), the result is that under Assumption 3 it holds that

$$\begin{aligned} \Vert \textbf{y}\Vert _{\alpha _1}^* \le \frac{f(\textbf{A}\bar{\textbf{x}})+\max _{\textbf{u}\in B_2[\textbf{0},s]} w(\bar{\textbf{x}}+\textbf{u}) - \text{ val }(G)}{s D_{2,\alpha _1^*} \sigma _{\min }(\textbf{A})}. \end{aligned}$$

Application Examples We end this section with some applications illustrating the potential of our results.

Example 5.1

(Basis pursuit) Consider the so-called “basis pursuit” problem

$$\begin{aligned} \min \{ \Vert \textbf{x}\Vert _1 : \textbf{A}\textbf{x}= \textbf{b}\}\end{aligned}$$
(5.7)

that fits the general model (G) with \(w(\textbf{x}) = \Vert \textbf{x}\Vert _1\) and \(f = \delta _{\{\textbf{b}\}}\). Problem (5.7) is a well-known “convex” relaxation of a compressed sensing model. Suppose that \(\bar{\textbf{x}}\) satisfies \(\textbf{A}\bar{\textbf{x}}=\textbf{b}\) and s is an arbitrary positive scalar. If we take \(\Vert \cdot \Vert _{\alpha _1}=\Vert \cdot \Vert _2\), then according to Remark 5.2, the bound that we obtain on \(\Vert \textbf{y}\Vert _2\) is

$$\begin{aligned} \frac{\max _{\textbf{u}\in B_2[\textbf{0},s]} \Vert \bar{\textbf{x}}+\textbf{u}\Vert _1-\text{ val }(G)}{s\sigma _{\min }(\textbf{A})} &\le \frac{\Vert \bar{\textbf{x}}\Vert _1+\max _{\Vert \textbf{u}\Vert _2 \le s} \Vert \textbf{u}\Vert _1-\text{ val }(G)}{s \sigma _{\min }(\textbf{A})}\\ &= \frac{s\sqrt{n}+\Vert \bar{\textbf{x}}\Vert _1-\text{ val }(G)}{s\sigma _{\min }(\textbf{A})}. \end{aligned}$$

Taking \(s \rightarrow \infty \) (as s can be taken arbitrarily large), we obtain the bound

$$\begin{aligned}\Vert \textbf{y}\Vert _2 \le \frac{\sqrt{n}}{\sigma _{\min }(\textbf{A})}.\end{aligned}$$

Invoking Theorem 3.1, and recalling that \(f^{[\gamma ]}(\textbf{z}) = \gamma \Vert \textbf{z}-\textbf{b}\Vert _2\) (see Example 2.1), we can now deduce that problem (5.7) is equivalent to

$$\begin{aligned}\min _{\textbf{x}} \Vert \textbf{x}\Vert _1 + \gamma \Vert \textbf{A}\textbf{x}-\textbf{b}\Vert _2\end{aligned}$$

whenever \(\gamma >\frac{\sqrt{n}}{\sigma _{\min }(\textbf{A})}\).

This provides an exact penalty unconstrained reformulation of problem (5.7) (see e.g., [5, 16]), with an explicit exact penalty parameter.
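This equivalence can be sanity-checked numerically; below is a small sketch on a random instance, assuming the cvxpy package is available (the instance, the seed and the factor 1.01 are arbitrary choices of ours):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n = 10, 40
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n)

# any gamma strictly larger than sqrt(n)/sigma_min(A) is an exact penalty parameter
gamma = 1.01 * np.sqrt(n) / np.linalg.svd(A, compute_uv=False).min()

x_bp = cp.Variable(n)
val_bp = cp.Problem(cp.Minimize(cp.norm1(x_bp)), [A @ x_bp == b]).solve()

x_pen = cp.Variable(n)
val_pen = cp.Problem(cp.Minimize(cp.norm1(x_pen) + gamma * cp.norm(A @ x_pen - b, 2))).solve()

print(val_bp, val_pen)  # by Theorem 3.1 the two optimal values (and optimal sets) coincide
```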

Example 5.2

(Nonsmooth minimization over linear inequalities) Consider the problem

$$\begin{aligned} \min \{ w(\textbf{x}) : \textbf{A}\textbf{x}\le \textbf{b}\},\end{aligned}$$
(5.8)

where w is a real-valued convex function. Denote the rows of \(\textbf{A}\in \mathbb {R}^{m \times n}\) by \(\textbf{a}_1^T,\ldots ,\textbf{a}_m^T\). This problem fits the general model (G) with \(f = \delta _{C}\) where \(C = \{\textbf{z}: \textbf{z}\le \textbf{b}\}\). Let \(\bar{\textbf{x}}\) be a point satisfying \(\textbf{A}\bar{\textbf{x}}<\textbf{b}\). Obviously \(B_{\infty }[\textbf{A}\bar{\textbf{x}},r] \subseteq C\) with \(r = \min _{i \in [m]}\{b_i-\textbf{a}_i^T \bar{\textbf{x}}\}\). Then according to Remark 5.1, we have the following bound (see Footnote 4) on the \(\ell _1\)-norm of the dual optimal solution:

$$\begin{aligned} \Vert \textbf{y}\Vert _1 \le \frac{w(\bar{\textbf{x}})-\text{ val }(G)}{ \min _{i \in [m]}\{b_i-\textbf{a}_i^T \bar{\textbf{x}}\}}.\end{aligned}$$

For a given \(M>0\), the M-Lipschitz counterpart of f is

$$\begin{aligned} f^{[M]}(\textbf{z}) = \min \{M \Vert \textbf{w}-\textbf{z}\Vert _{\infty }: \textbf{w}\le \textbf{b}\}=M \max \{ \max \{z_i-b_i: i \in [m]\}, 0\}.\end{aligned}$$

Thus, by Theorem 3.1, problem (5.8) is equivalent to

$$\begin{aligned} \min \left\{ w(\textbf{x})+ \gamma \max \left\{ \max _{i \in [m]}\{\textbf{a}_i^T \textbf{x}-b_i\}, 0\right\} \right\} ,\end{aligned}$$

as long as \(\gamma >\frac{w(\bar{\textbf{x}})-\text{ val }(G)}{ \min _{i \in [m]}\{b_i-\textbf{a}_i^T \bar{\textbf{x}}\}}\). In the case where \(\textbf{b}>\textbf{0}\) and w is nonnegative, we can choose \(\bar{\textbf{x}}=\textbf{0}\) and use the fact that \(\text{ val }(G)\ge 0\) to obtain the simplified upper bound

$$\begin{aligned}\Vert \textbf{y}\Vert _1 \le \frac{w(\textbf{0})}{ \min _{i \in [m]}b_i}.\end{aligned}$$
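As in the previous example, the resulting exact penalty reformulation can be checked numerically; a small sketch, assuming cvxpy is available and using the illustrative nonnegative choice \(w(\textbf{x})=\Vert \textbf{x}-\textbf{c}\Vert _1\) (the instance and names are ours):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
m, n = 30, 8
A = rng.standard_normal((m, n))
b = rng.uniform(0.5, 2.0, size=m)     # b > 0, so xbar = 0 is a Slater point
c = rng.standard_normal(n)

# simplified bound: ||y||_1 <= w(0)/min_i b_i; any strictly larger gamma is an exact penalty parameter
gamma = 1.01 * np.sum(np.abs(c)) / b.min()

x1 = cp.Variable(n)
v1 = cp.Problem(cp.Minimize(cp.norm1(x1 - c)), [A @ x1 <= b]).solve()

x2 = cp.Variable(n)
v2 = cp.Problem(cp.Minimize(cp.norm1(x2 - c) + gamma * cp.pos(cp.max(A @ x2 - b)))).solve()

print(v1, v2)  # expected to coincide up to solver tolerance
```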

Example 5.3

(Analytic center of polytopes) Consider the problem of finding the analytic center of the set \(P = \{\textbf{x}: \textbf{A}\textbf{x}\ge \textbf{b}\}\) with \(\textbf{A}\in \mathbb {R}^{m \times n}\) and \(\textbf{b}\in \mathbb {R}^m\):

$$\begin{aligned} \text{(AC) } \quad \min _{\textbf{x}\in \mathbb {R}^n} \left\{ \sum _{i=1}^m - \log (\textbf{a}_i^T \textbf{x}-b_i): \textbf{A}\textbf{x}>\textbf{b}\right\} .\end{aligned}$$

Here \(\textbf{a}_1^T,\ldots ,\textbf{a}_m^T\) are the rows of \(\textbf{A}\). This problem fits our model (G) with \(w \equiv 0\) and \(f(\textbf{z}) = \sum _{i=1}^m f_i(z_i)\), where

$$\begin{aligned} f_i(z_i) = \left\{ \begin{array}{ll} -\log (z_i-b_i), &{} z_i>b_i,\\ \infty , &{} \text{ else. } \end{array} \right. \end{aligned}$$

By Example 2.3, we have that

$$\begin{aligned} f^{[M]}(\textbf{z}) = \sum _{i=1}^m h_{b_i}^{[M]}(z_i),\end{aligned}$$

where for any \(c,z \in \mathbb {R},\)

$$\begin{aligned} h_c^{[M]}(z) &= \left\{ \begin{array}{ll} -\log (z-c), & z>c+\frac{1}{M},\\ \log (M)+1+Mc-Mz, & \text{ else}, \end{array} \right. \\ &= -\log \left( \max \left\{ z,c+\frac{1}{M} \right\} -c\right) +M \left| \max \left\{ z,c+\frac{1}{M} \right\} -z \right| . \end{aligned}$$
(5.9)

Since the underlying norm on the primal space is the \(\ell _1\)-norm, it follows that we need to upper bound the \(\ell _{\infty }\)-norm of the dual optimal solution, and this is done using Remark 5.1:

$$\begin{aligned} \Vert \textbf{y}\Vert _{\infty } \le \frac{\max _{\textbf{d}\in B_{1}[\textbf{0},r]}f(\textbf{A}\bar{\textbf{x}}+\textbf{d})-\text{ val }(AC)}{r},\end{aligned}$$
(5.10)

where \(\bar{\textbf{x}}\) is a point satisfying \(\textbf{A}\bar{\textbf{x}}>\textbf{b}\) and \(B_1[\textbf{A}\bar{\textbf{x}}, r] \subseteq \{\textbf{z}: \textbf{z}>\textbf{b}\}\). Since \(\textbf{A}\bar{\textbf{x}}>\textbf{b}\), the choice \(r = \frac{1}{2}\min _{i \in [m]}\{ \textbf{a}_i^T \bar{\textbf{x}}-b_i\}\) implies the inclusion relation, and we will use this value for r. We also have

$$\begin{aligned} \max _{\textbf{d}\in B_{1}[\textbf{0},r]}f(\textbf{A}\bar{\textbf{x}}+\textbf{d}) = \max _{\textbf{d}\in B_{1}[\textbf{0},r]}\left( -\sum _{i=1}^m \log (\textbf{a}_i^T \bar{\textbf{x}}-b_i+d_i)\right) {\mathop {\le }\limits ^{d_i\ge -\Vert \textbf{d}\Vert _1\ge -r}} -\sum _{i=1}^m \log (\textbf{a}_i^T \bar{\textbf{x}}-b_i-r). \end{aligned}$$

If in addition we know that the polytope P is bounded and contained in \(B_2[\textbf{0},R]\), then we can also find a lower bound on the optimal value of (AC) using the following obvious inequality that holds for any feasible \(\textbf{x}\):

$$\begin{aligned} -\sum _{i=1}^m \log (\textbf{a}_i^T \textbf{x}-b_i) \ge - \sum _{i=1}^m \log (\Vert \textbf{a}_i\Vert _2 R -b_i). \end{aligned}$$

Thus, the bound (5.10) in this setting implies that

$$\begin{aligned} \Vert \textbf{y}\Vert _{\infty } \le \frac{-\sum _{i=1}^m \log (\textbf{a}_i^T \bar{\textbf{x}}-b_i-r)+\sum _{i=1}^m \log (\Vert \textbf{a}_i\Vert _2 R -b_i)}{r}. \end{aligned}$$

The problem (AC) is therefore equivalent to

$$\begin{aligned} \min _{\textbf{x}} \sum _{i=1}^m \left( -\log \left( \max \left\{ (\textbf{A}\textbf{x})_i,b_i+\frac{1}{M} \right\} -b_i\right) +M \left| \max \left\{ (\textbf{A}\textbf{x})_i,b_i+\frac{1}{M} \right\} -(\textbf{A}\textbf{x})_i \right| \right) , \end{aligned}$$

as long as

$$\begin{aligned} M > \frac{-\sum _{i=1}^m \log ((\textbf{A}\bar{\textbf{x}})_i-b_i-r)+\sum _{i=1}^m \log (\Vert \textbf{a}_i\Vert R -b_i)}{r}, \end{aligned}$$

where \(r = \frac{1}{2}\min _{i \in [m]}\{ \textbf{a}_i^T \bar{\textbf{x}}-b_i\}\).
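A small numerical sketch (in Python/NumPy) of how this threshold for M can be computed from the problem data, assuming a strictly feasible point \(\bar{\textbf{x}}\) and a radius R with \(P \subseteq B_2[\textbf{0},R]\) are available (the function name is ours):

```python
import numpy as np

def analytic_center_penalty_bound(A, b, xbar, R):
    """Threshold for the exact penalty parameter M in Example 5.3,
    assuming A @ xbar > b componentwise and P is contained in B_2[0, R]."""
    slack = A @ xbar - b                          # a_i^T xbar - b_i > 0
    r = 0.5 * slack.min()                         # r = (1/2) min_i (a_i^T xbar - b_i)
    row_norms = np.linalg.norm(A, axis=1)         # ||a_i||_2
    upper = -np.sum(np.log(slack - r))            # bound on max_{||d||_1 <= r} f(A xbar + d)
    lower = -np.sum(np.log(row_norms * R - b))    # lower bound on val(AC)
    return (upper - lower) / r                    # any M strictly above this value is exact
```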