
Exterior-Point Optimization for Sparse and Low-Rank Optimization

Journal of Optimization Theory and Applications

Abstract

Many problems of substantial current interest in machine learning, statistics, and data science can be formulated as sparse and low-rank optimization problems. In this paper, we present the nonconvex exterior-point optimization solver (NExOS)—a first-order algorithm tailored to sparse and low-rank optimization problems. We consider the problem of minimizing a convex function over a nonconvex constraint set, where the set can be decomposed as the intersection of a compact convex set and a nonconvex set involving sparse or low-rank constraints. Unlike the convex relaxation approaches, NExOS finds a locally optimal point of the original problem by solving a sequence of penalized problems with strictly decreasing penalty parameters by exploiting the nonconvex geometry. NExOS solves each penalized problem by applying a first-order algorithm, which converges linearly to a local minimum of the corresponding penalized formulation under regularity conditions. Furthermore, the local minima of the penalized problems converge to a local minimum of the original problem as the penalty parameter goes to zero. We then implement and test NExOS on many instances from a wide variety of sparse and low-rank optimization problems, empirically demonstrating that our algorithm outperforms specialized methods.


References

  1. Auslender, A.: Stability in mathematical programming with nondifferentiable data. SIAM J. Control. Optim. 22(2), 239–254 (1984)

  2. Bach, F.: Sharp analysis of low-rank kernel matrix approximations. J. Mach. Learn. Res., (2013)

  3. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces, vol. 408. Springer, Berlin (2017)

  4. Bauschke, H.H., Lal, M.K., Wang, X.: Projections onto hyperbolas or bilinear constraint sets in Hilbert spaces. J. Glob. Optim., pp. 1–12 (2022)

  5. Beck, A.: First-Order Methods in Optimization, vol. 25. SIAM, Philadelphia (2017)

  6. Bernard, F., Thibault, L., Zlateva, N.: Prox-regular sets and epigraphs in uniformly convex Banach spaces: various regularities and other properties. Trans. Am. Math. Soc. 363(4), 2211–2247 (2011)

  7. Bertsimas, D., Copenhaver, M.S., Mazumder, R.: Certifiably optimal low rank factor analysis. J. Mach. Learn. Res. 18(1), 907–959 (2017)

  8. Bertsimas, D., Cory-Wright, R.: A scalable algorithm for sparse portfolio selection. INFORMS J. Comput. 34(3), 1489–1511 (2022)

  9. Bertsimas, D., Cory-Wright, R., Lo, S., Pauphilet, J.: Optimal low-rank matrix completion: Semidefinite relaxations and eigenvector disjunctions. arXiv preprint arXiv:2305.12292 (2023)

  10. Bertsimas, D., Cory-Wright, R., Pauphilet, J.: Mixed-projection conic optimization: A new paradigm for modeling rank constraints. Oper. Res. 70(6), 3321–3344 (2022)

  11. Bertsimas, D., Digalakis Jr, V., Li, M.L., Lami, O.S.: Slowly varying regression under sparsity. Oper. Res. (2024)

  12. Bertsimas, D., Dunn, J.: Machine Learning Under a Modern Optimization Lens. Dynamic Ideas, Charlestown (2019)

  13. Bertsimas, D., King, A., Mazumder, R.: Best subset selection via a modern optimization lens. Ann. Stat. 813–852 (2016)

  14. Bertsimas, D., Parys, B.V.: Sparse hierarchical regression with polynomials. Mach. Learn., (2020)

  15. Bertsimas, D., Van Parys, B., et al.: Sparse high-dimensional regression: Exact scalable algorithms and phase transitions. Ann. Stat. 48(1), 300–323 (2020)

  16. Blanchard, J.D., Tanner, J., Wei, K.: CGIHT: Conjugate gradient iterative hard thresholding for compressed sensing and matrix completion. Inf. Inference 4(4), 289–327 (2015)

  17. Blumensath, T., Davies, M.E.: Iterative thresholding for sparse approximations. J. Fourier Anal. Appl. 14(5–6), 629–654 (2008)

  18. Blumensath, T., Davies, M.E.: Normalized iterative hard thresholding: Guaranteed stability and performance. IEEE J. Sel. Top. Signal Process. 4(2), 298–309 (2010)

  19. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends® Mach. Learn. 3(1):1–122 (2011)

  20. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)

  21. Candès, E., Recht, B.: Exact matrix completion via convex optimization. Found. Comput. Math. (2009)

  22. Candès, E., Wakin, M.B., Boyd, S.: Enhancing sparsity by reweighted l1 minimization. J. Fourier Anal. Appl. 14, 877–905 (2008)

  23. Clarke, F.H., Stern, R.J., Wolenski, P.R.: Proximal smoothness and the lower-\(\cal{C} ^2\) property. J. Convex Anal. 2(1–2), 117–144 (1995)

  24. Correa, R., Jofre, A., Thibault, L.: Characterization of lower semicontinuous convex functions. Proc. Am. Math. Soc. 116, 67–72 (1992)

  25. Diamond, S., Takapoui, R., Boyd, S.: A general system for heuristic minimization of convex functions over non-convex sets. Optim. Methods Softw. 33(1), 165–193 (2018)

  26. Dontchev, A.L., Rockafellar, R.T.: Implicit Functions and Solution Mappings, vol. 543. Springer, Berlin (2009)

  27. Dunning, I., Huchette, J., Lubin, M.: JuMP: A modeling language for mathematical optimization. SIAM Rev. 59(2), 295–320 (2017)

  28. Fazel, M., Candes, E., Recht, B., Parrilo, P.: Compressed sensing and robust recovery of low rank matrices. In: 2008 42nd Asilomar Conference on Signals, Systems and Computers, 1043–1047 (2008)

  29. Fiacco, A.V., McCormick, G.P.: Nonlinear Programming: Sequential Unconstrained Minimization Techniques. SIAM, Philadelphia (1990)

  30. Foucart, S.: Hard thresholding pursuit: an algorithm for compressive sensing. SIAM J. Numer. Anal. 49(6), 2543–2563 (2011)

  31. Friedman, J., Hastie, T., Tibshirani, R., et al.: glmnet: Lasso and elastic-net regularized generalized linear models. R Package Version 1(4), 1–24 (2009)

  32. Giselsson, P., Boyd, S.: Linear convergence and metric selection for Douglas-Rachford splitting and ADMM. IEEE Trans. Autom. Control 62(2), 532–544 (2017)

  33. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)

  34. Gress, A., Davidson, I.: A flexible framework for projecting heterogeneous data. In: CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management (2014)

  35. Hardt, M., Meka, R., Raghavendra, P., Weitz, B.: Computational limits for matrix completion. J. Mach. Learn. Res. (2014)

  36. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, NY (2001)

  37. Hastie, T., Tibshirani, R., Tibshirani, R.J.: Extended comparisons of best subset selection, forward stepwise selection, and the lasso. arXiv preprint arXiv:1707.08692 (2017)

  38. Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. Taylor &amp; Francis, New York (2015)

  39. Hazimeh, H., Mazumder, R.: Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms. Oper. Res. 68(5), 1517–1537 (2020)

  40. Jain, P., Kar, P.: Non-convex optimization for machine learning. Found. Trends® Mach. Learn. 10(3–4), 142–336 (2017)

  41. Jun, K.-S., Willett, R., Wright, S., Nowak, R.: Bilinear bandits with low-rank structure. In: International Conference on Machine Learning, pp. 3163–3172. PMLR (2019)

  42. Lee, J., Kim, S., Lebanon, G., Singer, Y., Bengio, S.: LLORMA: Local low-rank matrix approximation. J. Mach. Learn. Res. 17(1), 442–465 (2016)

  43. Li, G., Pong, T.K.: Douglas-Rachford splitting for nonconvex optimization with application to nonconvex feasibility problems. Math. Program. 159(1), 371–401 (2016)

  44. Luke, D.R.: Prox-regularity of rank constraint sets and implications for algorithms. J. Math. Imaging Vis. 47(3), 231–238 (2013)

  45. Mazumder, R., Hastie, T., Tibshirani, R.: Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11, 2287–2322 (2010)

  46. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems (2013)

  47. Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)

  48. Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends® Optim., 1(3):127–239 (2014)

  49. Poliquin, R., Rockafellar, R.T.: Prox-regular functions in variational analysis. Trans. Am. Math. Soc. 348(5), 1805–1838 (1996)

  50. Poliquin, R., Rockafellar, R.T., Thibault, L.: Local differentiability of distance functions. Trans. Am. Math. Soc. 352(11), 5231–5249 (2000)

  51. Polyak, B.T.: Introduction to Optimization. Optimization Software, Cambridge (1987)

  52. Rockafellar, R.T.: Characterizing firm nonexpansiveness of prox mappings both locally and globally. J. Nonlinear Convex Anal., 22(5) (2021)

  53. Rockafellar, R.T., Wets, R.J.-B.: Variational Analysis, vol. 317. Springer Science & Business Media, Berlin (2009)

  54. Rudin, W.: Principles of Mathematical Analysis. McGraw-Hill, New York (1986)

  55. Ryu, E.K.: Uniqueness of DRS as the 2 operator resolvent-splitting and impossibility of 3 operator resolvent-splitting. Math. Program. 182(1), 233–273 (2020)

  56. Ryu, E.K., Boyd, S.: Primer on monotone operator methods. Appl. Comput. Math. 15(1), 3–43 (2016)

  57. Ryu, E.K., Yin, W.: Large-Scale Convex Optimization: Algorithms & Analyses via Monotone Operators. Cambridge University Press, Cambridge (2022)

  58. Saunderson, J., Chandrasekaran, V., Parrilo, P., Willsky, A.S.: Diagonal and low-rank matrix decompositions, correlation matrices, and ellipsoid fitting. SIAM J. Matrix Anal. Appl. 33(4), 1395–1416 (2012)

  59. Shapiro, A.: Existence and differentiability of metric projections in Hilbert spaces. SIAM J. Optim. 4(1), 130–141 (1994)

  60. Srikumar, V., Manning, C.D.: Learning distributed representations for structured output prediction. Adv. Neural Inf. Process. Syst. 27 (2014)

  61. Stella, L., Antonello, N., Fält, M., Volodin, D., Herceg, D., Saba, E., Carlson, F.B., Kelman, T., Brown, E., TagBot, J., Sopasakis, P.: JuliaFirstOrder/ProximalOperators.jl: v0.16.1. https://doi.org/10.5281/zenodo.10048760, (2023)

  62. Takapoui, R.: The Alternating Direction Method of Multipliers for Mixed-integer Optimization Applications. PhD thesis, Stanford University (2017)

  63. Takapoui, R., Moehle, N., Boyd, S., Bemporad, A.: A simple effective heuristic for embedded mixed-integer quadratic programming. Int. J. Control 1–11 (2017)

  64. ten Berge, J.M.F.: Some recent developments in factor analysis and the search for proper communalities. In: Advances in Data Science and Classification, pp. 325–334. Springer, Berlin (1998)

  65. Themelis, A., Patrinos, P.: Douglas-Rachford splitting and ADMM for nonconvex optimization: Tight convergence results. SIAM J. Optim. 30(1), 149–181 (2020)

  66. Tillmann, A.M., Bienstock, D., Lodi, A., Schwartz, A.: Cardinality minimization, constraints, and regularization: A survey. arXiv preprint arXiv:2106.09606 (2021)

  67. Tropp, J.A.: Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Trans. Inf. Theory (2006)

  68. Udell, M., Horn, C., Zadeh, R., Boyd, S., et al.: Generalized low rank models. Found. Trends® Mach. Learn., 9(1):1–118 (2016)

  69. Vial, J.-P.: Strong and weak convexity of sets and functions. Math. Oper. Res. 8(2), 231–259 (1983)

Author information

Corresponding author

Correspondence to Bartolomeo Stellato.

Additional information

Communicated by Clément W. Royer.


Appendices

Proof and Derivation of Results in \(\S \) 1

1.1 Lemma Regarding Prox-Regularity of Intersection of Sets

Lemma 4

Consider the nonempty constraint set \(\mathcal {X}=\mathcal {C}\cap \mathcal {N}\subseteq {\textbf{E}}\), where \(\mathcal {C}\) is compact and convex, and \(\mathcal {N}\) is prox-regular at \(x\in \mathcal {X}\). Then \(\mathcal {X}\) is prox-regular at x.

Proof of Lemma 4

To prove this result, we record the following result from [6], where by \(d_{\mathcal {S}}(x)\) we denote the Euclidean distance of a point x from the set \(\mathcal {S}\), and \(\overline{\mathcal {S}}\) denotes the closure of a set \(\mathcal {S}\).

Lemma 5

(Intersection of Prox-Regular Sets [6, Corollary 7.3(a)]) Let \(\mathcal {S}_{1},\mathcal {S}_{2}\) be two closed sets in \({\textbf{E}},\) such that \(\mathcal {S}=\mathcal {S}_{1}\bigcap \mathcal {S}_{2}\ne \emptyset \) and both \(\mathcal {S}_{1},\mathcal {S}_{2}\) are prox-regular at \(x\in \mathcal {S}\). If \(\mathcal {S}\) is metrically calm at x, i.e., if there exist some \(\varsigma >0\) and some neighborhood of x denoted by \(\mathcal {B}\) such that \(d_{\mathcal {S}}(y)\le \varsigma (d_{\mathcal {S}_{1}}(y)+d_{\mathcal {S}_{2}}(y))\) for all \(y\in \mathcal {B}\), then \(\mathcal {S}\) is prox-regular at x.

By definition, projection onto \(\mathcal {N}\) is single-valued on some open ball \(B(x;a)\) with center x and radius \(a>0\) [50, Theorem 1.3]. The set \(\mathcal {C}\) is compact and convex, hence projection onto \(\mathcal {C}\) is single-valued around every point, in particular on \(B(x;a)\) [3, Theorem 3.14, Remark 3.15]. Note that for any \(y\in B(x;a)\), \(d_{\mathcal {X}}(y)=0\) if and only if both \(d_{\mathcal {C}}(y)\) and \(d_{\mathcal {N}}(y)\) are zero. Hence, for any \(y\in B(x;a)\bigcap \mathcal {X}\), the metric calmness condition is trivially satisfied. Next, recalling that the distance from a closed set is continuous [53, Example 9.6], over the compact set \(\overline{B(x;a)\setminus \mathcal {X}}\), define the function h such that \(h(y)=1\) if \(y\in \mathcal {X}\), and \(h(y)=d_{\mathcal {X}}(y)/(d_{\mathcal {C}}(y)+d_{\mathcal {N}}(y))\) otherwise. The function h is upper-semicontinuous over \(\overline{B(x;a)\setminus \mathcal {X}}\), hence it attains a maximum \(\varsigma >0\) over \(\overline{B(x;a)\setminus \mathcal {X}}\) [54, Theorem 4.16], thus satisfying the metric calmness condition on \(B(x;a)\setminus \mathcal {X}\) as well. Hence, using Lemma 5, the constraint set \(\mathcal {X}\) is prox-regular at x. \(\square \)
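For concreteness, the following sketch (our own illustration, not code from the paper) computes the projection onto one such intersection \(\mathcal {X}=\mathcal {C}\cap \mathcal {N}\), assuming \(\mathcal {C}\) is a Euclidean ball of radius M and \(\mathcal {N}\) is the set of at most k-sparse vectors; for this particular pair the projection can be computed by hard-thresholding followed by scaling into the ball, a composition we use as an assumption of the sketch.

```python
# Illustrative sketch (assumptions: C = Euclidean ball of radius M,
# N = {x : card(x) <= k}); not code from the paper.
import numpy as np

def project_intersection(x, k, M):
    """Projection onto C intersect N: keep the k largest-magnitude entries of x,
    then project the result onto the Euclidean ball of radius M."""
    w = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    w[idx] = x[idx]                              # hard-thresholding: Pi_N(x)
    norm_w = np.linalg.norm(w)
    if norm_w > M:                               # ball projection: Pi_C(w)
        w *= M / norm_w
    return w

x = np.array([3.0, -0.5, 2.0, 0.1, -4.0])
print(project_intersection(x, k=2, M=1.0))       # [0.6, 0, 0, 0, -0.8]
```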

Proofs and Derivations of the Results in \(\S \) 3

1.1 Modifying NExOS for Nonsmooth and Convex Loss Function

We now discuss how to modify NExOS when the loss function is nonsmooth and convex. The key idea is to work with a strongly convex, smooth, and arbitrarily close approximation of the loss; such smoothing techniques are very common in optimization [5, 47]. The optimization problem in this case, where the positive regularization parameter is denoted by \(\widetilde{\beta }\), is given by: \(\min _{x}\phi (x)+(\widetilde{\beta }/2)\Vert x\Vert ^{2}+\iota _\mathcal {X}(x)\), where the setup is the same as in problem (\({\mathcal {P}}\)), except that the function \(\phi :{\textbf{E}}\rightarrow {\textbf{R}}\cup \left\{ +\infty \right\} \) is lower-semicontinuous, proper (its domain is nonempty), and convex. Let \(\beta {:}{=}\widetilde{\beta }/2\). For a \(\nu \) that is arbitrarily small, define the following \(\beta \)-strongly convex and \((\nu ^{-1}+\beta )\)-smooth function: \(f {:}{=}{}^{\nu }_{}{\phi }(\cdot )+(\beta /2)\Vert \cdot \Vert ^{2}\), where \({}^{\nu }_{}{\phi }\) is the Moreau envelope of \(\phi \) with parameter \(\nu \). Following the properties of the Moreau envelope of a convex function discussed in §2, the following optimization problem acts as an arbitrarily close approximation to the nonsmooth convex problem above: \(\min _{x}f+(\beta /2)\Vert x\Vert ^{2}+\iota _\mathcal {X}(x)\), which has the same setup as problem (\({\mathcal {P}}\)).

We can compute \(\textbf{prox}_{\gamma f}(x)\) using the formulas in [5, Theorem 6.13, Theorem 6.63]. Then, we apply NExOS to \(\min _{x}f+(\beta /2)\Vert x\Vert ^{2}+\iota _\mathcal {X}(x)\) and proceed in the same manner as discussed earlier.
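To make this smoothing step concrete, here is a minimal numeric sketch (ours, assuming the hypothetical choice \(\phi =\Vert \cdot \Vert _{1}\), whose prox is soft-thresholding); it evaluates the \(\nu \)-Moreau envelope of \(\phi \) and assembles the strongly convex surrogate f. It is an illustration, not the paper's implementation.

```python
# Minimal sketch, assuming phi = ||.||_1 (prox = soft-thresholding).
import numpy as np

def prox_l1(x, t):
    """Proximal operator of t*||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def moreau_env_l1(x, nu):
    """nu-Moreau envelope of ||.||_1 and its gradient (1/nu)(x - prox)."""
    p = prox_l1(x, nu)
    val = np.abs(p).sum() + np.linalg.norm(x - p) ** 2 / (2 * nu)
    grad = (x - p) / nu
    return val, grad

# Smoothed, beta-strongly convex surrogate f = {}^nu phi + (beta/2)||x||^2.
nu, beta = 1e-3, 1e-2
x = np.random.default_rng(0).standard_normal(5)
env, genv = moreau_env_l1(x, nu)
f_val = env + 0.5 * beta * np.linalg.norm(x) ** 2
f_grad = genv + beta * x
```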

1.2 Proof of Proposition 1

1.2.1 Proof of Proposition 1(i)

We prove (i) in three steps. In the first step, we show that for any \(\mu >0\), \(f+{}^{\mu }_{}{\mathcal {I}}\) will be differentiable on some \(B({\bar{x}};r_{\text {diff}})\) with \(r_{\text {diff}}>0\). In the second step, we show that, for any \(\mu \in (0,1/\beta ],\) \(f+{}^{\mu }_{}{\mathcal {I}}\) will be strongly convex and differentiable on some \(B({\bar{x}};r_{\text {cvxdiff}})\). In the third step, we show that there exists \(\mu _{\text {max}}>0\) such that for any \(\mu \in (0,\mu _{\text {max}}]\), \(f+{}^{\mu }_{}{\mathcal {I}}\) will be strongly convex and smooth on some \(B({\bar{x}};r_{\text {max}})\) and will attain the unique local minimum \(x_{\mu }\) in this ball.

Proof of the first step

To prove the first step, we start with the following lemma regarding differentiability of \({}^{\mu }_{}{\iota }\).

Lemma 6

(Differentiability of \({}^{\mu }_{}{\iota }\)) Let \({\bar{x}}\) be a local minimum to problem (\({\mathcal {P}}\)), where Assumptions 1 and 2 hold. Then there exists some \(r_{diff }>0\) such that for any \(\mu >0\): (i) the function \({}^{\mu }_{}{\iota }\) is differentiable on \(B({\bar{x}};r_{diff })\) with derivative \(\nabla {}^{\mu }_{}{\iota }=(1/\mu )(\mathbb {I}-\mathop {\varvec{\Pi }_\mathcal {X}}),\) and (ii) the projection operator \(\mathop {\varvec{\Pi }_\mathcal {X}}\) onto \(\mathcal {X}\) is single-valued and Lipschitz continuous on \(B({\bar{x}};r_{diff })\).

Proof

From [50, Theorem 1.3(e)], there exists some \(r_{\text {diff}}>0\) such that the function \(d^{2}\) is differentiable on \(B({\bar{x}};r_{\text {diff}})\). As \({}^{\mu }_{}{\iota }=(1/2\mu )d^{2}\) from (3), it follows that for any \(\mu >0,\) \({}^{\mu }_{}{\iota }\) is differentiable on \(B({\bar{x}};r_{\text {diff}})\) which proves the first part of (i). The second part of (i) follows from the fact that \(\nabla d^{2}(x)=2\left( x-\mathop {\varvec{\Pi }_\mathcal {X}}(x)\right) \) whenever \(d^{2}\) is differentiable at x [50, page 5240]. Finally, from [50, Lemma 3.2], whenever \(d^{2}\) is differentiable at a point, projection \(\mathop {\varvec{\Pi }_\mathcal {X}}\) is single-valued and Lipschitz continuous around that point, and this proves (ii). \(\square \)

Due to the lemma above, \(f+{}^{\mu }_{}{\mathcal {I}}\) will be differentiable on \(B({\bar{x}};r_{\text {diff}})\) with \(r_{\text {diff}}>0\), as f and \((\beta /2)\Vert \cdot \Vert ^{2}\) are differentiable. Also, due to Lemma 6(ii), the projection operator \(\mathop {\varvec{\Pi }_\mathcal {X}}\) is \({\widetilde{L}}\)-Lipschitz continuous on \(B({\bar{x}};r_{\text {diff}})\) for some \({\widetilde{L}}>0\). This proves the first step.
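The gradient formula in Lemma 6(i) can also be checked numerically. The sketch below is our own; it uses the simplified stand-in \(\mathcal {X}=\{x:\mathop {{\textbf {card}}}(x)\le k\}\) (whose projection keeps the k largest-magnitude entries) rather than the paper's general \(\mathcal {X}=\mathcal {C}\cap \mathcal {N}\), and compares \((1/\mu )(\mathbb {I}-\mathop {\varvec{\Pi }_\mathcal {X}})\) against central finite differences of \((1/2\mu )d^{2}\) at a generic point.

```python
# Numerical sanity check of Lemma 6(i): the gradient of the mu-Moreau envelope
# of iota_X equals (1/mu)(x - Pi_X(x)). X = {x : card(x) <= k} is a stand-in.
import numpy as np

def project_sparse(x, k):
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def envelope_indicator(x, mu, k):
    """(1/(2*mu)) * dist(x, X)^2, the mu-Moreau envelope of iota_X."""
    return np.linalg.norm(x - project_sparse(x, k)) ** 2 / (2 * mu)

mu, k, h = 0.1, 2, 1e-6
x = np.random.default_rng(1).standard_normal(6)
analytic = (x - project_sparse(x, k)) / mu        # (1/mu)(I - Pi_X)(x)
numeric = np.zeros_like(x)
for i in range(x.size):
    e = np.zeros_like(x); e[i] = h
    numeric[i] = (envelope_indicator(x + e, mu, k)
                  - envelope_indicator(x - e, mu, k)) / (2 * h)
print(np.max(np.abs(analytic - numeric)))         # small at a generic point
```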

Proof of the second step

To prove this step, we record: (1) the notion of the general subdifferential of a function, followed by (2) the definition of prox-regularity of a function and its connection with prox-regular sets, and (3) a helper lemma regarding convexity of the Moreau envelope under prox-regularity.

Definition 3

(Fenchel, Fréchet, and general subdifferential) For any lower-semicontinuous function \(h:{\textbf {R}}^{n}\rightarrow {\textbf {R}}\cup \{\infty \}\), its Fenchel subdifferential \(\partial h\) is defined as [24, page 1]: \(u\in \partial h(x)\Leftrightarrow h(y)\ge h(x)+\left\langle u\mid y-x\right\rangle \) for all \(y \in {\textbf {R}}^{n}\). For the function h, its Fréchet subdifferential \(\partial ^{F}h\) (also known as the regular subdifferential) at a point x is defined as [24, Definition 2.5]: \(u\in \partial ^{F}h(x)\Leftrightarrow \liminf _{y\rightarrow 0}\,(h(x+y)-h(x)-\langle u\mid y\rangle )/\Vert y\Vert \ge 0\). Finally, the general subdifferential of h, denoted by \(\partial ^{G}h\), is defined as [52, Equation (2.8)]: \(u\in \partial ^{G}h(x)\Leftrightarrow u_{n}\rightarrow u,x_{n}\rightarrow x,h(x_{n})\rightarrow h(x),\) for some sequence \((x_{n},u_{n})\in \mathop {{\textbf {gra}}}\partial ^{F}h\). If h is additionally convex, then \(\partial h=\partial ^{F}h=\partial ^{G}h\) [24, Property (2.3), Property 2.6].
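As a simple one-dimensional illustration of how these notions can differ (our own example, not taken from the paper), take \(h(x)=-|x|\) on \({\textbf {R}}\); then

$$\begin{aligned} \partial h(0)=\emptyset ,\qquad \partial ^{F}h(0)=\emptyset ,\qquad \partial ^{G}h(0)=\{-1,+1\}, \end{aligned}$$

since no u satisfies \(-|y|\ge \left\langle u\mid y\right\rangle \) for all y, since \(\liminf _{y\rightarrow 0}\,(-|y|-\langle u\mid y\rangle )/\Vert y\Vert =-1-|u|<0\) for every u, and since \(\partial ^{F}h(x)=\{-\mathop {{\textbf {sign}}}(x)\}\) for \(x\ne 0\), so that taking sequences \(x_{n}\rightarrow 0\) in the definition of \(\partial ^{G}h\) yields both \(-1\) and \(+1\).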

Definition 4

(Connection between prox-regularity of a function and a set [49, Definition 1.1]) A function \(h:{\textbf{R}}^{n}\rightarrow {\textbf{R}}\cup \{\infty \}\) that is finite at \(\tilde{x}\) is prox-regular at \(\tilde{x}\) for \({\tilde{\nu }}\), where \({\tilde{\nu }}\in \partial ^{G}h(\tilde{x})\), if h is locally l.s.c. at \(\tilde{x}\) and there exist a distance \(\sigma >0\) and a parameter \(\rho >0\) such that whenever \(\Vert x'-\tilde{x}\Vert <\sigma \) and \(\Vert x-\tilde{x}\Vert <\sigma \) with \(x'\ne x\), \(\Vert h(x)-h(\tilde{x})\Vert <\sigma \), and \(\Vert \nu -{\tilde{\nu }}\Vert <\sigma \) with \(\nu \in \partial ^{G}h(x)\), we have \(h(x')>h(x)+\left\langle \nu \mid x'-x\right\rangle -(\rho /2)\Vert x'-x\Vert ^{2}\). Also, a set \(\mathcal {S}\) is prox-regular at \(\tilde{x}\) for \({\tilde{\nu }}\) if the indicator function \(\iota _{\mathcal {S}}\) is prox-regular at \(\tilde{x}\) for \({\tilde{\nu }}\in \partial ^{G}\iota _{\mathcal {S}}(\tilde{x})\) [49, Proposition 2.11]. The set \(\mathcal {S}\) is prox-regular at \(\tilde{x}\) if it is prox-regular at \(\tilde{x}\) for all \({\tilde{\nu }}\in \partial ^{G}\iota _{\mathcal {S}}(\tilde{x})\) [53, page 612].

We have the following helper lemma from [49].

Lemma 7

([49, Theorem 5.2]) Consider a function h that is lower semicontinuous at 0 with \(h(0)=0\), and for which there exists \(\rho >0\) such that \(h(x)>-(\rho /2)\Vert x\Vert ^{2}\) for any \(x\ne 0\). Let h be prox-regular at \(\tilde{x}=0\) for \({\tilde{\nu }}=0\) with respect to \(\sigma \) and \(\rho \) (\(\sigma \) and \(\rho \) as described in Definition 4), and let \(\lambda \in (0,1/\rho ).\) Then, on some neighborhood of 0, the function

$$\begin{aligned} {}^{\lambda }{h}+\frac{\rho }{2(1-\lambda \rho )}\Vert \cdot \Vert ^{2} \end{aligned}$$
(10)

is convex, where \(^{\lambda }h\) is the Moreau envelope of h with parameter \(\lambda \).

Now we prove step 2 in earnest. To prove this result, we assume \({\bar{x}}=0\). This does not cause any loss of generality, because it amounts to translating the coordinate origin to the optimal solution, and prox-regularity of a set and strong convexity of a function are invariant under such a coordinate transformation.

First, note that the indicator function of our closed constraint set \(\mathcal {X}\) is lower semicontinuous due to [53, Remark after Theorem 1.6, page 11], and as the local minimizer \({\bar{x}}\) lies in \(\mathcal {X},\) we have \(\iota _\mathcal {X}({\bar{x}})=0\). The set \(\mathcal {X}\) is prox-regular at \({\bar{x}}\) for all \(\nu \in \partial ^{G}\iota _\mathcal {X}({\bar{x}})\) per our setup, so using Definition 4, we have \(\iota _\mathcal {X}\) prox-regular at \({\bar{x}}=0\) for \(\bar{\nu }=0\in \partial ^{G}\iota _\mathcal {X}({\bar{x}})\) (because \({\bar{x}}\in \mathcal {X}\), we have \(0\in \partial ^{G}\iota _\mathcal {X}({\bar{x}})\)) with respect to some distance \(\sigma >0\) and parameter \(\rho >0\).

Note that the indicator function satisfies \(\iota _\mathcal {X}(x)=c\iota _\mathcal {X}(x)\) for any \(c>0\) due to its definition, so \(u\in \partial ^{G}\iota _\mathcal {X}(x)\Leftrightarrow cu\in c\,\partial ^{G}\iota _\mathcal {X}(x)=\partial ^{G}(c\iota _\mathcal {X})(x)=\partial ^{G}\iota _\mathcal {X}(x)\) [53, Equation 10(6)]. In our setup, we have \(\mathcal {X}\) prox-regular at \({\bar{x}}\). So, setting \(h{:}{=}\iota _\mathcal {X},\tilde{x}{:}{=}{\bar{x}}=0,\) \({\tilde{\nu }}{:}{=}\bar{\nu }=0\), and \(\nu {:}{=}u/(\beta /2\rho )\) in Definition 4, we have that \(\iota _\mathcal {X}\) is also prox-regular at \({\bar{x}}=0\) for \(\bar{\nu }=0\) with respect to distance \(\sigma \min \{1,\beta /2\rho \}\) and parameter \(\beta /2\).

Next, because the range of the indicator function is \(\{0,\infty \}\), we have \(\iota _\mathcal {X}(x)>-(\rho /2)\Vert x\Vert ^{2}\) for any \(x\ne 0\). So, all the conditions of Lemma 7 are satisfied. Hence, applying Lemma 7, we have \((1/2\mu )\left( d^{2}+\beta \mu /(2-\beta \mu )\Vert \cdot \Vert ^{2}\right) \) convex and differentiable on

$$\begin{aligned}B\left( {\bar{x}};\min \left\{ \sigma \min \{1,\beta /2\rho \},r_{\text {diff}}\right\} \right) \end{aligned}$$

for any \(\mu \in (0,2/\beta )\), where \(r_{\text {diff}}\) comes from Lemma 6. As \(r_{\text {diff}}\) in this setup does not depend on \(\mu \), the ball does not depend on \(\mu \) either. Finally, note that in our exterior-point minimization function we have \({}^{\mu }_{}{\mathcal {I}}=(1/2\mu )\left( d^{2}+\beta \mu \Vert \cdot \Vert ^{2}\right) \).

So if we take \(\mu \le \frac{1}{\beta }\), then we have \((\beta /2)\mu /\left( 1-\mu (\beta /2)\right) \le \beta \mu \), and on the ball \(B\left( {\bar{x}};\min \left\{ \sigma \min \{1,\beta /2\rho \},r_{\text {diff}}\right\} \right) \), the function \({}^{\mu }_{}{\mathcal {I}}\) will be convex and differentiable. But f is strongly convex and smooth, so \(f+{}^{\mu }_{}{\mathcal {I}}\) will be strongly convex and differentiable on \(B\left( {\bar{x}};\min \left\{ \sigma \min \{1,\beta /2\rho \},r_{\text {diff}}\right\} \right) \) for \(\mu \in (0,1/\beta ]\). This proves step 2.

Proof of the third step

As the point \({\bar{x}}\in \mathcal {X}\) is a local minimum of problem (\({\mathcal {P}}\)), from Definition 2, there is some \(r>0\) such that for all \(y\in {\overline{B}}({\bar{x}};r)\) with \(y\ne {\bar{x}},\) we have \(f({\bar{x}})+(\beta /2)\Vert {\bar{x}}\Vert ^{2}<f(y)+(\beta /2)\Vert y\Vert ^{2}+\iota _\mathcal {X}(y).\)

Then, due to the first two steps, for any \(\mu \in (0,1/\beta ]\), the function \(f+{}^{\mu }_{}{\mathcal {I}}\) will be strongly convex and differentiable on \(B\left( {\bar{x}};\min \left\{ \sigma \min \{1,\beta /2\rho \},r_{\text {diff}}\right\} \right) \). For notational convenience, denote \(r_{\text {max}}{:}{=}\min \left\{ \sigma \min \{1,\beta /2\rho \},r_{\text {diff}}\right\} ,\) which is a constant. As \(f+{}^{\mu }_{}{\mathcal {I}}\) is a global underestimator of the function \(f+(\beta /2)\Vert \cdot \Vert ^{2}+\iota _\mathcal {X}\) and approximates it with arbitrary precision as \(\mu \rightarrow 0\), the previous statement and [53, Theorem 1.25] imply that there exists some \(0<\mu _{\text {max}}\le 1/\beta \) such that for any \(\mu \in (0,\mu _{\text {max}}],\) the function \(f+{}^{\mu }_{}{\mathcal {I}}\) will achieve a local minimum \(x_{\mu }\) over \(B({\bar{x}};r_{\text {max}})\) where \(\nabla (f+{}^{\mu }_{}{\mathcal {I}})\) vanishes, i.e.,

$$\begin{aligned} \nabla (f+{}^{\mu }_{}{\mathcal {I}})(x_{\mu })&={\nabla }f(x_{\mu })+\beta x_{\mu }+ (1/\mu )\left( x_{\mu }-\mathop {\varvec{\Pi }_\mathcal {X}}\left( x_{\mu }\right) \right) =0 \end{aligned}$$
(11)
$$\begin{aligned} \Rightarrow x_{\mu }&=(1/ (\beta \mu + 1)) \left( \mathop {\varvec{\Pi }_\mathcal {X}}(x_{\mu })-\mu \nabla f(x_{\mu })\right) . \end{aligned}$$
(12)

As the right-hand side of the last equation is a singleton, this minimum must be unique. Finally, to show the smoothness of \(f+{}^{\mu }_{}{\mathcal {I}}\), note that for any \(x\in B({\bar{x}};r_{\text {max}}),\) we have

$$\begin{aligned} \nabla \left( f+{}^{\mu }_{}{\mathcal {I}}\right) (x) \overset{a)}{=}\nabla f(x)+\left( \beta + (1/\mu )\right) x- (1/\mu )\mathop {\varvec{\Pi }_\mathcal {X}}(x), \end{aligned}$$
(13)

where a) uses Lemma 6. Thus, for any \(x_{1},x_{2}\in B({\bar{x}};r_{\text {max}})\) we have \(\Vert \nabla (f+(\beta /2)\Vert \cdot \Vert ^{2}+{}^{\mu }_{}{\iota })(x_{1})-\nabla (f+(\beta /2)\Vert \cdot \Vert ^{2}+{}^{\mu }_{}{\iota })(x_{2})\Vert \le (L+\beta +(1/\mu )(1+{\widetilde{L}}))\Vert x_{1}-x_{2}\Vert \), where we have used the following: \(\nabla f\) is L-Lipschitz everywhere due to f being an \(L\)-smooth function in \({\textbf{E}}\) ([3, Theorem 18.15]), and \(\mathop {\varvec{\Pi }_\mathcal {X}}\) is \({\widetilde{L}}\)-Lipschitz continuous on \(B({\bar{x}};r_{\text {max}})\), as shown in step 1. This completes the proof of (i).

(ii): Using [53, Theorem 1.25], as \(\mu \rightarrow 0,\) we have \(x_{\mu }\rightarrow {\bar{x}},\text { and }\left( f+{}^{\mu }_{}{\mathcal {I}}\right) (x_{\mu })\rightarrow f({\bar{x}})+ (\beta /2)\Vert {\bar{x}}\Vert ^{2}.\) Note that \(x_{\mu }\) reaches \({\bar{x}}\) only in the limit, as otherwise Assumption 2 would be violated.

1.3 Proof of Proposition 2

1.3.1 Proof of Proposition 2(i)

We will need the notions of nonexpansive and firmly nonexpansive operators in this proof. An operator \(\mathbb {A}:{\textbf{E}}\rightarrow {\textbf{E}}\) is nonexpansive on some set \(\mathcal {S}\) if it is Lipschitz continuous with Lipschitz constant 1 on \(\mathcal {S}\); the operator is contractive if the Lipschitz constant is strictly smaller than 1. On the other hand, \(\mathbb {A}\) is firmly nonexpansive on \(\mathcal {S}\) if and only if its reflection operator \(2\mathbb {A}-\mathbb {I}\) is nonexpansive on \(\mathcal {S}\). A firmly nonexpansive operator is always nonexpansive [3, page 59].

We next introduce the following definition.

Definition 5

(Resolvent and reflected resolvent [3, pages 333, 336]) For a lower-semicontinuous, proper, and convex function h,  the resolvent and reflected resolvent of its subdifferential operator are defined by \(\mathbb {J}_{\gamma \partial h}=\left( \mathbb {I}+\gamma \partial h\right) ^{-1}\) and \(\mathbb {R}_{\gamma \partial h}=2\mathbb {J}_{\gamma \partial h}-\mathbb {I},\) respectively.

We prove (i) in two steps. First, we show that the reflection operator of \(\mathbb {T}_{\mu },\) defined by

$$\begin{aligned} \mathbb {R}_{\mu }=2\mathbb {T}_{\mu }-\mathbb {I}, \end{aligned}$$
(14)

is contractive on \(B({\bar{x}};r_{\text {max}})\); in the second step, using this, we show that \(\mathbb {T}_{\mu }\) is also contractive there. To that end, note that \(\mathbb {R}_{\mu }\) can be represented as:

$$\begin{aligned} \mathbb {R}_{\mu }=(2\textbf{prox}_{\gamma {}^{\mu }_{}{\mathcal {I}}}-\mathbb {I})(2\textbf{prox}_{\gamma f}-\mathbb {I}), \end{aligned}$$
(15)

which can be proven by simply using (5) and (14) on the left-hand side and by expanding the factors on the right-hand side. Now, the operator \(2\textbf{prox}_{\gamma f}-\mathbb {I}\) associated with the \(\alpha \)-strongly convex and L-smooth function f is a contraction mapping for any \(\gamma >0\) with the contraction factor \(\kappa =\max \left\{ (\gamma L-1)/(\gamma L+1),(1-\gamma \alpha )/(\gamma \alpha +1)\right\} \in (0,1)\), which follows from [32, Theorem 1]. Next, we show that \(2\textbf{prox}_{\gamma {}^{\mu }_{}{\mathcal {I}}}-\mathbb {I}\) is nonexpansive on \(B({\bar{x}};r_{max })\) for any \(\mu \in (0,\mu _{\text {max}}]\). For any \(\mu \in (0,\mu _{\text {max}}]\), define the function g as follows. We have \(g(y)={}^{\mu }_{}{\mathcal {I}}(y)\) if \(y\in B({\bar{x}};r_{max }),\) \(g(y)=\liminf _{\tilde{y}\rightarrow y}{}^{\mu }_{}{\mathcal {I}}(\tilde{y})\) if \(\Vert y-{\bar{x}}\Vert =r_{\text {max}}\), and \(g(y)=\infty \) otherwise. The function g is lower-semicontinuous, proper, and convex everywhere due to [3, Lemma 1.31 and Corollary 9.10]. As a result, for \(\mu \in (0,\mu _{\text {max}}]\), we have \(\textbf{prox}_{\gamma g}=\mathbb {J}_{\gamma \partial g}\) on \({\textbf{E}}\) and \(\textbf{prox}_{\gamma g}\) is firmly nonexpansive and single-valued everywhere, which follows from [3, Proposition 12.27, Proposition 16.34, and Example 23.3]. But, for \(y\in B({\bar{x}};r_{max }),\) we have \({}^{\mu }_{}{\mathcal {I}}(y)=g(y)\) and \(\nabla {}^{\mu }_{}{\mathcal {I}}(y)=\partial g(y)\). Thus, on \(B({\bar{x}};r_{max }),\) the operator \(\textbf{prox}_{\gamma {}^{\mu }_{}{\mathcal {I}}}=\mathbb {J}_{\gamma \nabla {}^{\mu }_{}{\mathcal {I}}}\), and it is firmly nonexpansive and single-valued for \(\mu \in (0,\mu _{\text {max}}]\). Any firmly nonexpansive operator \(\mathbb {A}\) has a nonexpansive reflection operator \(2\mathbb {A}-\mathbb {I}\) on its domain of firm nonexpansiveness [3, Proposition 4.2]. Hence, on \(B({\bar{x}};r_{max }),\) for \(\mu \in (0,\mu _{\text {max}}]\), the operator \(2\textbf{prox}_{\gamma {}^{\mu }_{}{\mathcal {I}}}-\mathbb {I}\) is nonexpansive.

Now we show that \(\mathbb {R}_{\mu }\) is contractive. For every \(x_{1},x_{2}\in B({\bar{x}};r_{max })\) and \(\mu \in (0,\mu _{\text {max}}]\), we have \(\Vert \mathbb {R}_{\mu }(x_{1})-\mathbb {R}_{\mu }(x_{2})\Vert \le \Vert (2\textbf{prox}_{\gamma f}-\mathbb {I})(x_{1})-(2\textbf{prox}_{\gamma f}-\mathbb {I})(x_{2})\Vert \le {\kappa }\Vert x_{1}-x_{2}\Vert ,\) where the first inequality uses (15) and the nonexpansiveness of \(2\textbf{prox}_{\gamma {}^{\mu }_{}{\mathcal {I}}}-\mathbb {I}\), and the last inequality uses the \(\kappa \)-contractiveness of \(2\textbf{prox}_{\gamma f}-\mathbb {I}\), thus proving that \(\mathbb {R}_{\mu }\) acts as a contractive operator on \(B({\bar{x}};r_{max })\) for \(\mu \in (0,\mu _{\text {max}}]\). Similarly, for any \(x_{1},x_{2}\in B({\bar{x}};r_{max }),\) using (\({\mathcal {A}_{\mu }}\)) and the triangle inequality we have \(\Vert \mathbb {T}_{\mu }(x_{1})-\mathbb {T}_{\mu }(x_{2})\Vert \le \left( (1+\kappa )/2\right) \Vert x_{1}-x_{2}\Vert \), and as \(\kappa '=(1+\kappa )/2\in [0,1),\) the operator \(\mathbb {T}_{\mu }\) is \(\kappa '\)-contractive on \(B({\bar{x}};r_{max })\) for \(\mu \in (0,\mu _{\text {max}}]\).
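As a small numeric illustration of these factors (with illustrative parameter values of our own choosing, not from the paper), the snippet below evaluates \(\kappa \) and \(\kappa '=(1+\kappa )/2\):

```python
# Contraction factor of 2*prox_{gamma f} - I for an alpha-strongly convex,
# L-smooth f, and the resulting factor kappa' = (1 + kappa)/2 for T_mu.
alpha, L, gamma = 1.0, 10.0, 0.3            # illustrative values (assumptions)
kappa = max((gamma * L - 1) / (gamma * L + 1),
            (1 - gamma * alpha) / (1 + gamma * alpha))
kappa_prime = (1 + kappa) / 2
print(kappa, kappa_prime)                   # approx. 0.538 and 0.769
```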

1.3.2 Proof of Proposition 2(ii)

Recalling \(\mathbb {T}_{\mu }=(1/2)\mathbb {R}_{\mu }+(1/2)\mathbb {I}\) from (14), using (15), then expanding, and finally using Lemma 1 and the triangle inequality, we have for any \(\mu ,{\tilde{\mu }}\in (0,\mu _{\text {max}}],\;x\in B({\bar{x}};r_{max }),\) and \(y=2\textbf{prox}_{\gamma f}(x)-x\):

$$\begin{aligned} \left\| \mathbb {T}_{\mu }(x)-\mathbb {T}_{{\tilde{\mu }}}(x)\right\|&\le \left\| \left( \mu /(\gamma +\mu (\beta \gamma +1))-{\tilde{\mu }}/(\gamma +{\tilde{\mu }}(\beta \gamma +1))\right) \right\| \left\| y\right\| \nonumber \\&+\left\| \left( \gamma /(\gamma +\mu (\beta \gamma +1))-\gamma /(\gamma +{\tilde{\mu }}(\beta \gamma +1))\right) \right\| \left\| \mathop {\varvec{\Pi }_\mathcal {X}}\left( y/(\beta \gamma +1)\right) \right\| . \end{aligned}$$
(16)

Now, in (16), the coefficient of \(\Vert y\Vert \) satisfies \(\Vert \mu /(\gamma +\mu (\beta \gamma +1))-{\tilde{\mu }}/(\gamma +{\tilde{\mu }}(\beta \gamma +1))\Vert \le (1/\gamma )\Vert \mu -{\tilde{\mu }}\Vert \)

and similarly the coefficient of \(\Vert \mathop {\varvec{\Pi }_\mathcal {X}}\left( y/(\beta \gamma +1)\right) \Vert \) satisfies

$$\begin{aligned}\Vert \gamma /(\gamma +\mu (\beta \gamma +1))-\gamma /(\gamma +{\tilde{\mu }}(\beta \gamma +1))\Vert&\le (\beta +(1/\gamma ))\Vert \mu -{\tilde{\mu }}\Vert .\end{aligned}$$

Putting the last two inequalities in (16), and then replacing \(y=2\textbf{prox}_{\gamma f}(x)-x\), we have for any \(x\in B({\bar{x}};r_{max })\) and for any \(\mu ,{\tilde{\mu }}\in {\textbf {R}}_{++}\),

$$\begin{aligned}&\left\| \mathbb {T}_{\mu }(x)-\mathbb {T}_{{\tilde{\mu }}}(x)\right\| \le (1/\gamma )\left\| \mu -{\tilde{\mu }}\right\| \Vert y\Vert +\left( \beta +(1/\gamma )\right) \Vert \mu -{\tilde{\mu }}\Vert \left\| \mathop {\varvec{\Pi }_\mathcal {X}}\left( y/(\beta \gamma +1)\right) \right\| \nonumber \\ =&\{(1/\gamma )\Vert 2\textbf{prox}_{\gamma f}(x)-x\Vert +(\beta +(1/\gamma ))\Vert \mathop {\varvec{\Pi }_\mathcal {X}}((2\textbf{prox}_{\gamma f}(x)-x)/(\beta \gamma +1))\Vert \}\Vert \mu -{\tilde{\mu }}\Vert . \end{aligned}$$
(17)

Now, as \(B({\bar{x}};r_{max })\) is a bounded set and \(x\in B({\bar{x}};r_{max })\), the norm of the vector \(y=2\textbf{prox}_{\gamma f}(x)-x\) can be upper-bounded over \(B({\bar{x}};r_{max })\) because \(2\textbf{prox}_{\gamma f}-\mathbb {I}\) is continuous (in fact contractive) as shown in (i). Similarly, \(\Vert \mathop {\varvec{\Pi }_\mathcal {X}}\left( (2\textbf{prox}_{\gamma f}(x)-x)/(\beta \gamma +1)\right) \Vert \) can be upper-bounded on \(B({\bar{x}};r_{max })\). Combining the last two statements, it follows that there exists some \(\ell >0\) such that

$$\begin{aligned}{} & {} \sup _{x\in B({\bar{x}};r_{max })}(1/\gamma )\Vert 2\textbf{prox}_{\gamma f}(x)-x\Vert +(\beta +1/\gamma )\left\| \mathop {\varvec{\Pi }_\mathcal {X}}\left( (2\textbf{prox}_{\gamma f}(x)-x)/(\beta \gamma +1)\right) \right\| \\{} & {} \le \ell , \end{aligned}$$

and putting the last inequality in (17), we arrive at the claim.

1.4 Proof of Proposition 3

The structure of the proof follows that of [3, Proposition 25.1(ii)]. Let \(\mu \in (0,\mu _{\text {max}}].\) Recalling Definition 5, and due to Proposition 1(i), \(x_{\mu }\in B({\bar{x}};r_{\text {max}})\) satisfies

$$\begin{aligned}&x_{\mu }=\mathop {\text {argmin}}_{B({\bar{x}};r_{max })}f(x)+{}^{\mu }_{}{\mathcal {I}}(x)=\mathop {{\textbf {zer}}}(\nabla f+\nabla {}^{\mu }_{}{\mathcal {I}})\nonumber \\&\overset{a)}{\Leftrightarrow }\ \left( \exists y\in {\textbf{E}}\right) \;x_{\mu }=\mathbb {J}_{\gamma \nabla {}^{\mu }_{}{\mathcal {I}}}\mathbb {R}_{\gamma \nabla f}(y)\text { and }x_{\mu }=\mathbb {J}_{\gamma \nabla f}(y), \end{aligned}$$
(18)

where a) uses the facts (shown in the proof of Proposition 2) that: (i) \(\mathbb {J}_{\gamma \nabla f}\) is a single-valued operator everywhere, whereas \(\mathbb {J}_{\gamma \nabla {}^{\mu }_{}{\mathcal {I}}}\) is a single-valued operator on the region of convexity \(B({\bar{x}};r_{\text {max}})\), and (ii) \(x_{\mu }=\mathbb {J}_{\gamma \nabla f}(y)\) can be expressed as \(x_{\mu }=\mathbb {J}_{\gamma \nabla f}(y)\Leftrightarrow 2x_{\mu }-y=\left( 2\mathbb {J}_{\gamma \nabla f}-\mathbb {I}\right) y=\mathbb {R}_{\gamma \nabla f}(y)\). Also, using the last expression, we can write the first term of (18) as \(\mathbb {J}_{\gamma \nabla {}^{\mu }_{}{\mathcal {I}}}\mathbb {R}_{\gamma \nabla f}(y)=x_{\mu }\Leftrightarrow y\in \mathop {{\textbf {fix}}}(\mathbb {R}_{\gamma \nabla {}^{\mu }_{}{\mathcal {I}}}\mathbb {R}_{\gamma \nabla f})\). Because, for a lower-semicontinuous, proper, and convex function, the resolvent of its subdifferential is equal to its proximal operator [3, Proposition 12.27, Proposition 16.34, and Example 23.3], we have \(\mathbb {J}_{\gamma \partial f}=\textbf{prox}_{\gamma f}\) with both being single-valued. Using the last fact along with (18) and \(y\in \mathop {{\textbf {fix}}}(\mathbb {R}_{\gamma \nabla {}^{\mu }_{}{\mathcal {I}}}\mathbb {R}_{\gamma \nabla f})\), we have \(x_{\mu }\in \textbf{prox}_{\gamma f}\left( \mathop {{\textbf {fix}}}\left( \mathbb {R}_{\gamma \nabla {}^{\mu }_{}{\mathcal {I}}}\mathbb {R}_{\gamma \partial f}\right) \right) \), but \(x_{\mu }\) is unique due to Proposition 1, so the inclusion can be replaced with equality. Thus \(x_{\mu }\) satisfies \(x_{\mu }=\textbf{prox}_{\gamma f}\left( \mathop {{\textbf {fix}}}\left( \mathbb {R}_{\gamma \nabla {}^{\mu }_{}{\mathcal {I}}}\mathbb {R}_{\gamma \partial f}\right) \right) ,\) where the sets are singletons due to Proposition 1 and the single-valuedness of \(\textbf{prox}_{\gamma f}\). Also, because \(\mathbb {T}_{\mu }\) in (5) and \(\mathbb {R}_{\mu }\) in (14) have the same fixed point set (which follows from (14)), using (15), we arrive at the claim.

1.5 Proof of Lemma 2

(i): This follows directly from the proof of Proposition 1.

(ii): From Lemma 2(i), and recalling that \(\eta '>1\), for any \(\mu \in (0,\mu _{max }]\), we have the first equation. Recalling Definition 5, and using the fact that, for a lower-semicontinuous, proper, and convex function, the resolvent of its subdifferential is equal to its proximal operator [3, Proposition 12.27, Proposition 16.34, and Example 23.3], we have \(\mathbb {J}_{\gamma \partial f}=\textbf{prox}_{\gamma f}\) with both being single-valued. So, from Proposition 3: \(x_{\mu }=\textbf{prox}_{\gamma f}(z_{\mu })=\left( \mathbb {I}+\gamma \partial f\right) ^{-1}(z_{\mu })\Leftrightarrow z_{\mu }=x_{\mu }+\gamma \nabla f(x_{\mu })\). Hence, for any \(\mu \in (0,\mu _{max }]\):

$$\begin{aligned}&\Vert z_{\mu }-{\bar{x}}\Vert =\Vert x_{\mu }+\gamma \nabla f(x_{\mu })-{\bar{x}}\Vert \le \Vert x_{\mu }-{\bar{x}}\Vert +\gamma \Vert \nabla f(x_{\mu })\Vert \\ \Leftrightarrow&r_{\text {max}}-\Vert z_{\mu }-{\bar{x}}\Vert \ge r_{\text {max}}-\Vert x_{\mu }-{\bar{x}}\Vert -\gamma \Vert \nabla f(x_{\mu })\Vert \overset{a)}{\ge }(\eta '-1)r_{\text {max}}/\eta '-\gamma \Vert \nabla f(x_{\mu })\Vert , \end{aligned}$$

where a) uses the first equation of Lemma 2(ii). Because the gradient of the strongly convex and smooth function \(f\) is bounded over the bounded set \(B({\bar{x}};r_{max })\) [51, Lemma 1, §1.4.2], for \(\gamma \) satisfying the fourth equation of Lemma 2(ii) and with \(\psi \) defined as in the third equation of Lemma 2(ii), we have the second equation of Lemma 2(ii) for any \(\mu \in (0,\mu _{max }].\) To prove the final equation of Lemma 2(ii), note that

$$\begin{aligned}&\lim _{\mu \rightarrow 0}\left( r_{max }-\Vert z_{\mu }-{\bar{x}}\Vert \right) -\psi \nonumber \\&\overset{a)}{=}\lim _{\mu \rightarrow 0}\left( r_{max }-\Vert x_{\mu }+\gamma \nabla f(x_{\mu })-{\bar{x}}\Vert \right) -(\eta '-1)r_{max }/\eta '+\gamma \max _{x\in B({\bar{x}};r_{max })}\Vert \nabla f(x)\Vert \nonumber \\&\overset{b)}{=}\left( r_{max }-\Vert {\bar{x}}+\gamma \nabla f({\bar{x}})-{\bar{x}}\Vert \right) -(\eta '-1)r_{max }/\eta '+\gamma \max _{x\in B({\bar{x}};r_{max })}\Vert \nabla f(x)\Vert \nonumber \\&=(1/\eta ')r_{max }+\gamma \left( \max _{x\in B({\bar{x}};r_{max })}\Vert \nabla f(x)\Vert -\Vert \nabla f({\bar{x}})\Vert \right) >0, \end{aligned}$$
(19)

where in a) we have used \(z_{\mu }=x_{\mu }+\gamma \nabla f(x_{\mu })\) and the third equation of Lemma 2(ii), and in b) we have used the smoothness of f along with Proposition 1(ii). Inequality (19) along with the second equation of Lemma 2(ii) implies the final equation of Lemma 2(ii).

1.6 Proof of Theorem 1

We use the following result from [26] in proving Theorem 1.

Theorem 3

(Convergence of local contraction mapping [26, pp. 313–314]) Let \(\mathbb {A}:{\textbf{E}}\rightarrow {\textbf{E}}\) be some operator. Suppose there exist \(\tilde{x}\), \(\omega \in (0,1)\), and \(r>0\) such that (a) \(\mathbb {A}\) is \(\omega \)-contractive on \(B(\tilde{x};r)\), i.e., \(\Vert \mathbb {A}(x_{1})-\mathbb {A}(x_{2})\Vert \le \omega \Vert x_{1}-x_{2}\Vert \) for all \(x_{1},x_{2}\) in \(B(\tilde{x};r)\), and (b) \(\Vert \mathbb {A}(\tilde{x})-\tilde{x}\Vert \le (1-\omega )r\). Then \(\mathbb {A}\) has a unique fixed point in \(B(\tilde{x};r)\) and the iteration scheme \(x_{n+1}=\mathbb {A}(x_{n})\) with the initialization \(x_{0}:=\tilde{x}\) linearly converges to that unique fixed point.
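A toy illustration of Theorem 3 (our own example, with arbitrarily chosen numbers): a scalar \(\omega \)-contraction started at a point \(\tilde{x}\) satisfying condition (b) converges linearly to its unique fixed point inside \(B(\tilde{x};r)\).

```python
# Scalar example of Theorem 3: A(x) = omega*x + c with omega in (0,1).
omega, c, r = 0.5, 1.0, 3.0
x_tilde = 0.0
assert abs(omega * x_tilde + c - x_tilde) <= (1 - omega) * r   # condition (b)
x = x_tilde
for _ in range(50):
    x = omega * x + c          # x_{n+1} = A(x_n), started at x_0 = x_tilde
print(x, c / (1 - omega))      # both approximately 2.0, inside B(x_tilde; r)
```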

Furthermore, recall that NExOS (Algorithm 1) can be compactly represented using (\({\mathcal {A}_{\mu }}\)) as follows. For any \(m\in \{1,2,\ldots ,N\}\) (equivalently for each \(\mu _{m}\in \{\mu _{1},\ldots ,\mu _{N}\}\)),

$$\begin{aligned} \begin{aligned}z_{\mu _{m}}^{n+1}&=\mathbb {T}_{\mu _{m}}\left( z_{\mu _{m}}^{n}\right) ,\end{aligned} \end{aligned}$$
(20)

where \(z_{\mu _{m}}^{0}\) is initialized at \(z_{\mu _{m-1}}\). From Proposition 2, for any \(\mu \in {\mathfrak {M}},\) the operator \(\mathbb {T}_{\mu }\) is a \(\kappa '\)-contraction mapping over the region of convexity \(B({\bar{x}};r_{\text {max}})\), where \(\kappa '\in (0,1)\). From Proposition 1, there will be a unique local minimum \(x_{\mu }\) of problem (\({\mathcal {P}_{\mu }}\)) over \(B({\bar{x}};r_{\text {max}})\). Suppose, instead of the exact fixed point \(z_{\mu _{m-1}}\in \mathop {{\textbf {fix}}}\mathbb {T}_{\mu _{m-1}},\) we have computed \({\widetilde{z}}\), which is an \(\epsilon \)-approximate fixed point of \(\mathbb {T}_{\mu _{m-1}}\) in \(B({\bar{x}};r_{\text {max}})\), i.e., \(\Vert {\widetilde{z}}-\mathbb {T}_{\mu _{m-1}}({\widetilde{z}})\Vert \le \epsilon \) and \(\Vert {\widetilde{z}}-z_{\mu _{m-1}}\Vert \le \epsilon \), where \(\epsilon \in [0,{\overline{\epsilon }})\). Then, we have:

$$\begin{aligned} \Vert \mathbb {T}_{\mu _{m-1}}({\widetilde{z}})-z_{\mu _{m-1}}\Vert =\Vert \mathbb {T}_{\mu _{m-1}}({\widetilde{z}})-\mathbb {T}_{\mu _{m-1}}(z_{\mu _{m-1}})\Vert \overset{a)}{\le }\kappa '\underbrace{\Vert {\widetilde{z}}-z_{\mu _{m-1}}\Vert }_{\le \epsilon }\le \epsilon , \nonumber \\ \end{aligned}$$
(21)

where a) uses the \(\kappa '\)-contractive nature of \(\mathbb {T}_{\mu _{m-1}}\) over \(B({\bar{x}};r_{\text {max}})\). Hence, using the triangle inequality,

$$\begin{aligned} \Vert {\widetilde{z}}-{\bar{x}}\Vert \overset{a)}{\le }\Vert {\widetilde{z}}-\mathbb {T}_{\mu _{m-1}}({\widetilde{z}})\Vert +\Vert \mathbb {T}_{\mu _{m-1}}({\widetilde{z}})-z_{\mu _{m-1}}\Vert +\Vert z_{\mu _{m-1}}-{\bar{x}}\Vert \overset{b)}{\le }2\epsilon +\Vert z_{\mu _{m-1}}-{\bar{x}}\Vert , \end{aligned}$$

where a) uses the triangle inequality and b) uses (21). As \(\epsilon \in [0,{\overline{\epsilon }}),\) where \({\overline{\epsilon }}\) is defined in (6), due to the second equation of Lemma 2(ii), we have \(r_{\text {max}}-\Vert {\widetilde{z}}-{\bar{x}}\Vert >\psi \).

Define \(\varDelta =\left( (1-\kappa ')\psi -\epsilon \right) /\ell ,\) which will be positive due to \(\epsilon \in [0,{\overline{\epsilon }})\) and (6). Next, select \(\theta \in (0,1)\) such that \({\overline{\varDelta }}=\theta \varDelta <\mu _{1},\) hence there exists a \(\rho \in (0,1)\) such that \({\overline{\varDelta }}=(1-\rho )\mu _{1}.\) Now reduce the penalty parameter using

$$\begin{aligned} {\mu }_{m} =\mu _{m-1}-\rho ^{m-2}{\overline{\varDelta }}=\rho \mu _{m-1}=\rho ^{m-1}\mu _{1} \end{aligned}$$
(22)

for any \(m \ge 2\). Next, we initialize the iteration scheme \(z_{\mu _{m}}^{n+1}=\mathbb {T}_{\mu _{m}}\left( z_{\mu _{m}}^{n}\right) \) at \(z_{\mu _{m}}^{0}{:}{=}{\widetilde{z}}.\) Around this initial point, let us consider the open ball \(B({\widetilde{z}};\psi )\). For any \(x\in B({\widetilde{z}};\psi )\), we have \(\Vert x-{\bar{x}}\Vert \le \Vert x-{\widetilde{z}}\Vert +\Vert {\widetilde{z}}-{\bar{x}}\Vert<\psi +\Vert {\widetilde{z}}-{\bar{x}}\Vert <r_{\text {max}},\) where the last inequality follows from \(r_{\text {max}}-\Vert {\widetilde{z}}-{\bar{x}}\Vert >\psi \). Thus we have shown that \(B({\widetilde{z}};\psi )\subseteq B({\bar{x}};r_{\text {max}})\). Hence, from Proposition 2, on \(B({\widetilde{z}};\psi )\), the Douglas–Rachford operator \(\mathbb {T}_{\mu _{m}}\) is contractive. Next, we have \(\Vert \mathbb {T}_{\mu _{m}}({\widetilde{z}})-{\widetilde{z}}\Vert \le (1-\kappa ')\psi \), because \(\Vert \mathbb {T}_{\mu _{m}}({\widetilde{z}})-{\widetilde{z}}\Vert \overset{a)}{\le }\Vert \mathbb {T}_{\mu _{m}}({\widetilde{z}})-\mathbb {T}_{\mu _{m-1}}({\widetilde{z}})\Vert +\Vert \mathbb {T}_{\mu _{m-1}}({\widetilde{z}})-{\widetilde{z}}\Vert \overset{b)}{\le }\ell \Vert \mu _{m}-\mu _{m-1}\Vert +\epsilon \overset{c)}{\le }\epsilon +\ell \varDelta \overset{d)}{\le }(1-\kappa ')\psi ,\) where a) uses the triangle inequality, b) uses Proposition 2(ii) and \(\Vert {\widetilde{z}}-\mathbb {T}_{\mu _{m-1}}({\widetilde{z}})\Vert \le \epsilon \), c) uses (22) and \(\Vert \mu _{m}-\mu _{m-1}\Vert \le {\overline{\varDelta }}\le \varDelta \), and d) uses the definition of \(\varDelta \). Thus, both conditions of Theorem 3 are satisfied, and \(z_{\mu _{m}}^{n}\) in (20) will linearly converge to the unique fixed point \(z_{\mu _{m}}\) of the operator \(\mathbb {T}_{\mu _{m}}\), and \(x_{\mu _{m}}^{n},{y_{\mu _{m}}^{n}}\) will linearly converge to \(x_{\mu _{m}}\). This completes the proof.
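To make the outer/inner structure above concrete, the following NumPy sketch (ours, written under simplifying assumptions: a least-squares loss, the plain sparsity set \(\{x:\mathop {{\textbf {card}}}(x)\le k\}\), and a closed form for \(\textbf{prox}_{\gamma {}^{\mu }_{}{\mathcal {I}}}\) read off from (16)–(17)) runs iteration (20) with the geometric penalty reduction (22) and warm-starts each inner loop at the previous iterate, mirroring \(z_{\mu _{m}}^{0}{:}{=}z_{\mu _{m-1}}\). It is a sketch under these assumptions, not the reference NExOS implementation.

```python
# Hypothetical sketch of the NExOS outer/inner structure for
# f(x) = 0.5*||Ax - b||^2 and X = {x : card(x) <= k}. The prox formula for the
# penalized envelope below is an assumption (suggested by (16)-(17)).
import numpy as np

def prox_f(z, A, b, gamma):
    """prox_{gamma f}(z) for f(x) = 0.5*||Ax - b||^2 (a linear solve)."""
    n = A.shape[1]
    return np.linalg.solve(np.eye(n) + gamma * A.T @ A, z + gamma * A.T @ b)

def project_sparse(x, k):
    """Projection onto {x : card(x) <= k}: keep the k largest-magnitude entries."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def prox_penalized_envelope(y, mu, beta, gamma, k):
    """Assumed closed form of prox_{gamma * ((1/(2 mu)) d^2 + (beta/2)||.||^2)}."""
    denom = gamma + mu * (beta * gamma + 1.0)
    return (mu * y + gamma * project_sparse(y / (beta * gamma + 1.0), k)) / denom

def nexos_sketch(A, b, k, beta=1e-2, gamma=1e-1, mu1=1.0, rho=0.5,
                 n_outer=20, n_inner=200):
    z = np.zeros(A.shape[1])                 # z_init
    mu = mu1
    for _ in range(n_outer):                 # strictly decreasing penalties
        for _ in range(n_inner):             # Douglas-Rachford iterations (20)
            x = prox_f(z, A, b, gamma)
            y = prox_penalized_envelope(2 * x - z, mu, beta, gamma, k)
            z = z + y - x                    # z^{n+1} = T_mu(z^n)
        mu *= rho                            # mu_m = rho * mu_{m-1}, cf. (22)
    return project_sparse(x, k)              # candidate k-sparse point

# tiny usage example with synthetic data
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 60))
x_true = np.zeros(60); x_true[:5] = rng.standard_normal(5)
b = A @ x_true
x_hat = nexos_sketch(A, b, k=5)
```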

1.7 Proof of Lemma 3

First, we show via induction that, for the given initialization \(z_{\text {init}}\), the iterates \(z_{\mu _{1}}^{n}\) stay in \({\overline{B}}(z_{\mu _{1}};\Vert z_{\text {init}}-z_{\mu _{1}}\Vert )\) for any \(n\in {\textbf {N}}\). The base case holds by assumption. Suppose \(z_{\mu _{1}}^{n}\in {\overline{B}}(z_{\mu _{1}};\Vert z_{\text {init}}-z_{\mu _{1}}\Vert )\). Then, \(\Vert z_{\mu _{1}}^{n+1}-z_{\mu _{1}}\Vert \overset{a)}{=}\Vert \mathbb {T}_{\mu _{1}}(z_{\mu _{1}}^{n})-\mathbb {T}_{\mu _{1}}(z_{\mu _{1}})\Vert \overset{b)}{\le }\kappa '\Vert z_{\mu _{1}}^{n}-z_{\mu _{1}}\Vert \overset{c)}{\le }\kappa '\Vert z_{\text {init}}-z_{\mu _{1}}\Vert \), where a) uses \(z_{\mu _{1}}\in \mathop {{\textbf {fix}}}\mathbb {T}_{\mu _{1}},\) b) uses Proposition 2, and c) uses \(\Vert z_{\mu _{1}}^{n}-z_{\mu _{1}}\Vert \le \Vert z_{\text {init}}-z_{\mu _{1}}\Vert \). So, the iterates \(z_{\mu _{1}}^{n}\) stay in \({\overline{B}}(z_{\mu _{1}};\Vert z_{\text {init}}-z_{\mu _{1}}\Vert )\). As \(\kappa '\in (0,1)\), this inequality also implies that \(z_{\mu _{1}}^{n}\) linearly converges to \(z_{\mu _{1}}\) with a rate of at least \(\kappa '\). Then, using reasoning similar to that in the proof of Theorem 1, \(x_{\mu _{1}}^{n}\) and \(y_{\mu _{1}}^{n}\) linearly converge to the unique local minimum \(x_{\mu _{1}}\) of problem (\({\mathcal {P}_{\mu }}\)). This completes the proof.

1.8 Proof of Theorem 2

The proof is based on the results in [43, Theorem 4] and [65, Theorem 4.3]. The function f is strongly convex and L-smooth (i.e., \(\nabla f\) is L-Lipschitz continuous); hence f is coercive, satisfying \(\liminf _{\Vert x\Vert \rightarrow \infty }f(x)=\infty \), and is bounded below [3, Corollary 11.17]. Also, \({}^{\mu }_{}{\mathcal {I}}(x)\) is jointly continuous in x and \(\mu \), hence lower-semicontinuous, and is bounded below by definition. Let the proximal parameter \(\gamma \) be smaller than or equal to 1/L. Then, due to [43, (14), (15) and Theorem 4], \(\{x_{\mu }^{n},y_{\mu }^{n},z_{\mu }^{n}\}\) (the iterates of the inner algorithm of NExOS for any penalty parameter \(\mu \)) will be bounded. This boundedness implies the existence of a cluster point of the sequence, which allows us to use [43, Theorem 4 and Theorem 1] to show that, for any \(z_{\text {init}}\), the iterates \(x_{\mu }^{n}\) and \(y_{\mu }^{n}\) subsequentially converge to a first-order stationary point \(x_{\mu }\) satisfying \(\nabla \left( f+{}^{\mu }_{}{\mathcal {I}}\right) (x_\mu )=0\). The rate \(\min _{n\le k}\Vert \nabla \left( f+{}^{\mu }_{}{\mathcal {I}}\right) (x_{\rho \mu }^{n})\Vert \le ((1-\gamma L)/(2L))\, o(1/\sqrt{k})\) is a direct application of [65, Theorem 4.3], as our setup satisfies all the conditions to apply it.


Cite this article

Das Gupta, S., Stellato, B. & Van Parys, B.P.G. Exterior-Point Optimization for Sparse and Low-Rank Optimization. J Optim Theory Appl (2024). https://doi.org/10.1007/s10957-024-02448-9
