
A Line Search Based Proximal Stochastic Gradient Algorithm with Dynamical Variance Reduction


A Correction to this article was published on 23 June 2023


Abstract

Many optimization problems arising from machine learning applications can be cast as the minimization of the sum of two functions: the first typically represents the expected risk, which in practice is replaced by the empirical risk, while the second imposes a priori information on the solution. Since in general the first term is differentiable and the second one is convex, proximal gradient methods are well suited to such problems. However, in large-scale machine learning applications the computation of the full gradient of the differentiable term can be prohibitively expensive, making these algorithms impractical. For this reason, proximal stochastic gradient methods have been extensively studied in the optimization literature over the last decades. In this paper we develop a proximal stochastic gradient algorithm based on two main ingredients: a technique that dynamically reduces the variance of the stochastic gradients along the iterations, combined with a descent condition in expectation for the objective function, which is used to set the steplength parameter at each iteration. For general objective functions, we prove that the limit points of the sequence generated by the proposed scheme are almost surely (a.s.) stationary points. For convex objective functions, we show both the a.s. convergence of the whole sequence of iterates to a minimum point and an \({\mathcal {O}}(1/k)\) convergence rate for the objective function values. The practical implementation of the proposed method requires neither the computation of the exact gradient of the empirical risk during the iterations nor the tuning of an optimal steplength value. An extensive numerical experimentation shows that the proposed approach is robust with respect to the setting of the hyperparameters and competitive with state-of-the-art methods.
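
To make the structure of such a scheme concrete, the following sketch (not the algorithm proposed in the paper, but a simplified illustration under additional assumptions) combines a mini-batch stochastic gradient, a proximal step for an \(\ell _1\) regularizer via soft-thresholding, a backtracking reduction of the steplength driven by a sampled sufficient-decrease check, and a geometrically growing mini-batch size as a crude stand-in for dynamic variance reduction. The names grad_fi, gamma, growth and all default values are illustrative choices, not quantities defined in the paper.

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of t*||.||_1 (componentwise soft-thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_sgd_sketch(grad_fi, x0, n_samples, lam=1e-2, alpha0=1.0,
                    gamma=1e-4, batch0=8, growth=1.1, max_iter=200, rng=None):
    """Illustrative proximal stochastic gradient loop.
    grad_fi(x, idx) must return the mini-batch gradient of F at x over the samples idx."""
    rng = np.random.default_rng() if rng is None else rng
    x, alpha, batch = x0.copy(), alpha0, float(batch0)
    for _ in range(max_iter):
        idx = rng.choice(n_samples, size=min(int(batch), n_samples), replace=False)
        g = grad_fi(x, idx)                                  # stochastic gradient estimate
        while True:
            y = soft_threshold(x - alpha * g, alpha * lam)   # proximal gradient trial point
            # sampled surrogate: linearized F + proximity term + regularizer difference
            h = g @ (y - x) + np.sum((y - x) ** 2) / (2 * alpha) \
                + lam * (np.abs(y).sum() - np.abs(x).sum())
            if h <= -gamma * np.sum((y - x) ** 2) or alpha < 1e-8:
                break                                        # accept the current steplength
            alpha *= 0.5                                     # otherwise backtrack
        x = y
        batch *= growth                                      # enlarge the mini-batch over time
    return x
```

Growing the mini-batch and checking a decrease condition on a sampled surrogate only roughly mimic the variance-control and line-search ideas analyzed in the paper; the actual conditions used there are formulated in expectation.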


Availability of Data and Materials

The datasets analysed during the current study are available at the links given in the paper.

Notes

  1. \(\Vert a-b\Vert ^2+\Vert b-c\Vert ^2-\Vert a-c\Vert ^2 = 2(a-b)^T(c-b), \quad \forall a,b,c \in {\mathbb {R}}^d\).
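
For completeness, this identity (standard algebra, recalled here) follows by expanding \(\Vert a-c\Vert ^2\) with \(a-c=(a-b)+(b-c)\):

$$\begin{aligned} \Vert a-c\Vert ^2 = \Vert a-b\Vert ^2+\Vert b-c\Vert ^2+2(a-b)^T(b-c) \quad \Longrightarrow \quad \Vert a-b\Vert ^2+\Vert b-c\Vert ^2-\Vert a-c\Vert ^2 = 2(a-b)^T(c-b). \end{aligned}$$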

References

  1. Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Math. Program. Ser. A 137(1), 91–129 (2013)

  2. Bertsekas, D.: Convex Optimization Theory, Chapter 6 on Convex Optimization Algorithms, pp. 251–489. Athena Scientific, Belmont (2009)

  3. Berahas, A.S., Cao, L., Scheinberg, K.: Global convergence rate analysis of a generic line search algorithm with noise. SIAM J. Optim. 31(2), 1489–1518 (2021)

  4. Bollapragada, R., Byrd, R., Nocedal, J.: Adaptive sampling strategies for stochastic optimization. SIAM J. Optim. 28(4), 3312–3343 (2018)

  5. Bonettini, S., Loris, I., Porta, F., Prato, M.: Variable metric inexact line-search based methods for nonsmooth optimization. SIAM J. Optim. 26, 891–921 (2016)

  6. Bonettini, S., Porta, F., Prato, M., Rebegoldi, S., Ruggiero, V., Zanni, L.: Recent advances in variable metric first-order methods. In: Donatelli, M., Serra-Capizzano, S. (eds.) Computational Methods for Inverse Problems in Imaging. Springer INdAM Series, vol. 36, pp. 1–31 (2019)

  7. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)

  8. Bottou, L.: Online algorithms and stochastic approximations. In: Saad, D. (ed.) Online Learning and Neural Networks. Cambridge University Press, Cambridge (1998). https://leon.bottou.org/publications/pdf/online-1998.pdf

  9. Byrd, R.H., Chin, G.M., Nocedal, J., Wu, Y.: Sample size selection in optimization methods for machine learning. Math. Program. 134(1), 128–155 (2012)

  10. Combettes, P.L., Pesquet, J.-C.: Proximal splitting methods in signal processing. In: Bauschke, H.H., Burachik, R.S., Combettes, P.L., Elser, V., Luke, D.R., Wolkowicz, H. (eds.) Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer Optimization and Its Applications, pp. 185–212. Springer, New York (2011)

  11. Combettes, P.L., Wajs, V.R.: Signal recovery by proximal forward-backward splitting. SIAM Multiscale Model. Simul. 4, 1168–1200 (2005)

  12. Duchi, J., Singer, Y.: Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res. 10, 2873–2898 (2009)

  13. Franchini, G., Ruggiero, V., Zanni, L.: Ritz-like values in steplength selections for stochastic gradient methods. Soft Comput. 24, 17573–17588 (2020)

  14. Franchini, G., Ruggiero, V., Trombini, I.: Automatic steplength selection in stochastic gradient methods. Mach. Learn. Optim. Data Sci. LOD 2021, 4124–4132 (2021)

  15. Freund, J.E.: Mathematical Statistics. Prentice-Hall, Englewood Cliffs (1962)

  16. Ghadimi, S., Lan, G.: Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013)

  17. Iusem, A.N., Jofré, A., Oliveira, R.I., Thompson, P.: Variance-based extragradient methods with line search for stochastic variational inequalities. SIAM J. Optim. 29(1), 175–206 (2019)

  18. Le, T.V., Gopee, N.: Classifying CIFAR-10 images using unsupervised feature & ensemble learning. https://trucvietle.me/files/601-report.pdf

  19. Paquette, C., Scheinberg, K.: A stochastic line search method with expected complexity analysis. SIAM J. Optim. 30(1), 349–376 (2020)

  20. Polyak, B.T.: Introduction to Optimization. Optimization Software, New York (1987)

  21. Poon, C., Liang, J., Schoenlieb, C.: Local convergence properties of SAGA/Prox-SVRG and acceleration. In: Proceedings of the 35th International Conference on Machine Learning, PMLR, vol. 80, pp. 4124–4132 (2018)

  22. Pham, N.H., Nguyen, L.M., Phan, D.T., Tran-Dinh, Q.: ProxSARAH: an efficient algorithmic framework for stochastic composite nonconvex optimization. J. Mach. Learn. Res. 21, 1–48 (2020)

  23. Poon, C., Liang, J., Schoenlieb, C.: Local convergence properties of SAGA/Prox-SVRG and acceleration. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, PMLR, Proceedings of Machine Learning Research, vol. 80, pp. 4124–4132 (2018)

  24. Rockafellar, R.T., Wets, R.J.-B.: Variational Analysis. Grundlehren der Mathematischen Wissenschaften, vol. 317. Springer, Berlin (1998)

  25. Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1), 83–112 (2017)

  26. Wang, Z., Ji, K., Zhou, Y., Liang, Y., Tarokh, V.: SpiderBoost and momentum: faster stochastic variance reduction algorithms. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems, vol. 216, pp. 2406–2416. Curran Associates Inc. (2019)

  27. Xiao, L., Zhang, T.: A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(4), 2057–2075 (2014)

  28. Yang, Z., Wang, C., Zang, Y., Li, J.: Mini-batch algorithms with Barzilai-Borwein update step. Neurocomputing 314, 177–185 (2018)


Acknowledgements

The authors thank the anonymous referees for their careful reading and useful remarks and suggestions that improved the quality of the paper.

Funding

This work has been partially supported by the INdAM research group GNCS. The publication was created with the co-financing of the European Union-FSE-REACT-EU, PON Research and Innovation 2014–2020 DM1062/2021.

Author information

Contributions

All authors contributed equally to the study conception and design. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Giorgia Franchini.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Proofs of Theorems for Sect. 2

To prove Theorems 1, 2, 3 and 4, Lemmas 1, 2 and 3 are needed. Lemma 1 recalls well-known results on the proximal operator (for the proofs, see [5, 10] and references therein), while Lemma 3 is a classical result from stochastic analysis.

Lemma 1

Let \(\alpha >0,\; x\in \text{ dom } (P),\; u\in \mathbb {R}^d\). The following statements hold true.

  a. \({\hat{y}}={\text {prox}}_{\alpha R}(x-\alpha u)\) if and only if \(\frac{1}{\alpha }(x-{\hat{y}})-u=w\) for some \(w\in \partial R({\hat{y}})\).

  b. The function \(h_{\alpha }\) is strongly convex with modulus of convexity \(\displaystyle \frac{1}{\alpha }\).

  c. \(h_\alpha (x; x) = 0\).

  d. \(h_{\alpha }(p_{\alpha }(x); x)\le 0\), and \(h_{\alpha }(p_{\alpha }(x); x) = 0\) if and only if \(p_{\alpha }(x) = x\).

  e. x is a stationary point for problem (1) if and only if \(x = p_{\alpha }(x)\), which holds if and only if \(h_{\alpha }(p_{\alpha }(x);x) = 0\).

Lemma 2

Under Assumption 1 (i), consider the sequence \(\{x^{(k)}\}\) generated by iteration (7). If \(\alpha _k>0\), the following inequality holds:

$$\begin{aligned} h_{\alpha _k}(x^{(k+1)};x^{(k)})- h_{\alpha _k}(p_{\alpha _k}(x^{(k)});x^{(k)})\le \frac{\alpha _k}{2}\Vert e_g^{(k)}\Vert ^2. \end{aligned}$$
(A1)

Proof

In view of (4), we have

$$\begin{aligned} \begin{aligned}&h_{\alpha _k}(x^{(k+1)};x^{(k)})+ {e_g^{(k)}}^T(x^{(k+1)}-x^{(k)})\\&\quad =\frac{1}{2\alpha _k}\Vert x^{(k+1)}-x^{(k)}\Vert ^2 {+} \\&\qquad + (\nabla F(x^{(k)})+e_g^{(k)})^T (x^{(k+1)}-x^{(k)}) + R(x^{(k+1)})-R(x^{(k)}) \\&\quad = \nabla F(x^{(k)})^T (x^{(k+1)}-p_{\alpha _k}(x^{(k)})) + \nabla F(x^{(k)})^T (p_{\alpha _k}(x^{(k)})-{x}^{(k)}){+}\\&\qquad + {e_g^{(k)}}^T(x^{(k+1)}-x^{(k)}) +\frac{1}{2\alpha _k}\Vert x^{(k+1)}-p_{\alpha _k}(x^{(k)})\Vert ^2 {+} \\&\qquad + \frac{1}{2\alpha _k}\Vert p_{\alpha _k}(x^{(k)})-x^{(k)}\Vert ^2 + \frac{1}{\alpha _k} (x^{(k+1)}-p_{\alpha _k}(x^{(k)}))^T(p_{\alpha _k}(x^{(k)})-x^{(k)}){+}\\&\qquad +R(x^{(k+1)})-R(x^{(k)}) +R(p_{\alpha _k}(x^{(k)}))- R(p_{\alpha _k}(x^{(k)}))\\&\quad = h_{\alpha _k}(p_{\alpha _k}(x^{(k)});x^{(k)}) +\nabla F(x^{(k)})^T (x^{(k+1)}-p_{\alpha _k}(x^{(k)})){+}\\&\qquad + {e_g^{(k)}}^T(x^{(k+1)}-x^{(k)})+ \frac{1}{2\alpha _k}\Vert x^{(k+1)}-p_{\alpha _k}(x^{(k)})\Vert ^2+ \\&\qquad +{\frac{1}{\alpha _k} (x^{(k+1)}-p_{\alpha _k}(x^{(k)}))^T(p_{\alpha _k}(x^{(k)})-x^{(k)}) } +R(x^{(k+1)})- R(p_{\alpha _k}(x^{(k)})). \end{aligned} \end{aligned}$$
(A2)

Now, from the convexity of R at \({x}^{(k+1)}\) and \(\frac{x^{(k)}-{x}^{(k+1)}}{\alpha _k}-(\nabla F(x^{(k)})+e_g^{(k)})\in \partial R({x}^{(k+1)})\) (Lemma 1 a), we obtain

$$\begin{aligned} \begin{aligned}&R({x^{(k+1)}})-R(p_{\alpha _k}(x^{(k)}))\le \frac{1}{\alpha _k} (x^{(k)}-x^{(k+1)})^T ({x}^{(k+1)}-p_{\alpha _k}(x^{(k)})){+}\\&\quad - (\nabla F(x^{(k)})+e_g^{(k)})^T (x^{(k+1)}-p_{\alpha _k}(x^{(k)})). \end{aligned} \end{aligned}$$

Including the above inequality in (A2), we obtain

$$\begin{aligned} \begin{aligned}&h_{\alpha _k}(x^{(k+1)};x^{(k)})+ {e_g^{(k)}}^T(x^{(k+1)}-x^{(k)}) \le h_{\alpha _k}(p_{\alpha _k}(x^{(k)});x^{(k)}){+}\\&\quad + {e_g^{(k)}}^T(p_{\alpha _k}(x^{(k)})-x^{(k)}) -\frac{1}{2\alpha _k}\Vert x^{(k+1)}-p_{\alpha _k}(x^{(k)})\Vert ^2. \end{aligned} \end{aligned}$$
(A3)

Then, we have

$$\begin{aligned} h_{\alpha _k}(x^{(k+1)};x^{(k)})- h_{\alpha _k}(p_{\alpha _k}(x^{(k)});x^{(k)})&\le {e_g^{(k)}}^T(p_{\alpha _k}(x^{(k)})-x^{(k+1)}) \\&\quad -\frac{1}{2\alpha _k}\Vert x^{(k+1)}-p_{\alpha _k}(x^{(k)})\Vert ^2. \end{aligned}$$

By adding and subtracting \(\frac{\alpha _k}{2}\Vert e_g^{(k)}\Vert ^2\), we obtain

$$\begin{aligned} h_{\alpha _k}(x^{(k+1)};x^{(k)})- h_{\alpha _k}(p_{\alpha _k}(x^{(k)});x^{(k)})&\le -\frac{1}{2\alpha _k}\Vert -\alpha _k e_g^{(k)}-x^{(k+1)}+p_{\alpha _k}(x^{(k)}) \Vert ^2+ \frac{\alpha _k}{2}\Vert e_g^{(k)}\Vert ^2 \\&\le \frac{\alpha _k}{2}\Vert e_g^{(k)}\Vert ^2. \end{aligned}$$

\(\square \)

Lemma 3

[20, Lemma 11] Let \(\nu _k\), \(u_k\), \(\alpha _k\), \(\beta _k\) be nonnegative random variables and let

$$\begin{aligned}&{\mathbb {E}}(\nu _{k+1}|{\mathcal {F}}_k)\le (1+\alpha _k)\nu _k-u_k+\beta _k \qquad \text{ a.s. }, \\&\sum _{k=0}^{\infty } \alpha _k< \infty \quad \text{ a.s. }, \qquad \sum _{k=0}^{\infty } \beta _k < \infty \quad \text{ a.s. }, \end{aligned}$$

where \({\mathbb {E}}(\nu _{k+1}|{\mathcal {F}}_k)\) denotes the conditional expectation given \(\nu _0,\dots ,\nu _k\), \(u_0,\dots ,u_k\), \(\alpha _0,\dots ,\alpha _k\), \(\beta _0,\dots ,\beta _k\). Then

$$\begin{aligned} \nu _k\longrightarrow \nu \quad \text{ a.s }, \qquad \sum _{k=0}^{\infty }u_k<\infty \quad \text{ a.s }, \end{aligned}$$

where \(\nu \ge 0\) is some random variable.

Proof of Theorem 1

In view of Assumption 1 (iii), \(P(x^{(k)})-P^*\) is a nonnegative random variable and, from (9), we obtain:

$$\begin{aligned} {\mathbb {E}}(P(x^{(k+1)})-P^*|{\mathcal {F}}_k)&\le (P(x^{(k)})-P^*)\\&\quad - {\gamma }{\mathbb {E}}(-h_{\alpha _k}(x^{(k+1)};x^{(k)})-{e_g^{(k)}}^T(x^{(k+1)}-x^{(k)}) |{\mathcal {F}}_k) + \eta _k. \end{aligned}$$

In view of (8) and Lemma 3, we obtain that \(P(x^{(k+1)})-P^*\longrightarrow {\overline{P}}\) a.s. and

$$\begin{aligned} \sum _{k=0}^{\infty }{\mathbb {E}}\left( -h_{\alpha _k}(x^{(k+1)};x^{(k)})-{e_g^{(k)}}^T(x^{(k+1)}-x^{(k)})|{\mathcal {F}}_k\right) <\infty \quad \text{ a.s. } \end{aligned}$$

In order to conclude the proof, we follow a strategy similar to the one employed in the proof of [23, Theorem 2.1]. Define the random variable \(w_j = \sum _{k\ge j}{\mathbb {E}}\left( -h_{\alpha _k}(x^{(k+1)};x^{(k)})-{e_g^{(k)}}^T(x^{(k+1)}-x^{(k)})|{\mathcal {F}}_k\right) \). The sequence \(\{w_j\}\) is nonincreasing and converges to 0 as \(j\rightarrow +\infty \). As a consequence, from the monotone convergence theorem, it holds that

$$\begin{aligned} \begin{aligned} 0&= {\mathbb {E}}(\lim _{j\rightarrow +\infty }w_j ) = \lim _{j\rightarrow +\infty }{\mathbb {E}}(w_j)\\ {}&= \lim _{j\rightarrow +\infty }\sum _{k\ge j}{\mathbb {E}}\left( -h_{\alpha _k}(x^{(k+1)};x^{(k)})-{e_g^{(k)}}^T(x^{(k+1)}-x^{(k)})\right) \\&=\lim _{j\rightarrow +\infty }{\mathbb {E}}\left( \sum _{k\ge j}-h_{\alpha _k}(x^{(k+1)};x^{(k)})-{e_g^{(k)}}^T(x^{(k+1)}-x^{(k)})\right) \end{aligned} \end{aligned}$$

which implies

$$\begin{aligned} {\mathbb {E}}\left( \sum _{k\ge j}-h_{\alpha _k}(x^{(k+1)};x^{(k)})-{e_g^{(k)}}^T(x^{(k+1)}-x^{(k)})\right) <+\infty \end{aligned}$$

and, hence,

$$\begin{aligned} \sum _{k}-h_{\alpha _k}(x^{(k+1)};x^{(k)})-{e_g^{(k)}}^T(x^{(k+1)}-x^{(k)})<+\infty \quad \text{ a.s. } \end{aligned}$$

Then \(h_{\alpha _k}(x^{(k+1)};x^{(k)})+{e_g^{(k)}}^T(x^{(k+1)}-x^{(k)})\rightarrow 0\) a.s. \(\square \)

Proof of Theorem 2

We suppose that there exists a subsequence of \(\{x^{(k)}\}\) that converges a.s. to \({\bar{x}}\), namely there exists \({\mathcal {K}}\subseteq {\mathbb {N}}\) such that

$$\begin{aligned} \lim _{{k\rightarrow \infty , \, k\in {\mathcal {K}}}}x^{(k)} = {\bar{x}} \text{ a.s. } \end{aligned}$$

We observe that, since \(h_{\alpha _k}\) is strongly convex with modulus of convexity \(\displaystyle \frac{1}{\alpha _{{max}}}\) and \(p_{\alpha _k}(x^{(k)})\) is its minimum point, we have

$$\begin{aligned} \frac{1}{2\alpha _{{max}}}\Vert z-p_{\alpha _k}(x^{(k)})\Vert ^2\le h_{\alpha _k}(z;x^{(k)})-h_{\alpha _k}(p_{\alpha _k}(x^{(k)});x^{(k)}), \ \ \forall z. \end{aligned}$$
(A4)

Setting \(z=x^{(k+1)}\) in the previous inequality gives

$$\begin{aligned} \frac{1}{2\alpha _{{max}}}\Vert x^{(k+1)}-p_{\alpha _k}(x^{(k)})\Vert ^2\le h_{\alpha _k}(x^{(k+1)};x^{(k)})-h_{\alpha _k}(p_{\alpha _k}(x^{(k)});x^{(k)}). \end{aligned}$$
(A5)

From the last inequality and Lemma 2, we have

$$\begin{aligned} 0\le h_{\alpha _k}(x^{(k+1)};x^{(k)})-h_{\alpha _k}(p_{\alpha _k}(x^{(k)});x^{(k)})\le \frac{\alpha _{max}}{2} \Vert e_g^{(k)}\Vert ^2 \end{aligned}$$

and, consequently, by taking the conditional expectation of both sides, we have

$$\begin{aligned} \begin{aligned} 0&\le {\mathbb {E}}\left( h_{\alpha _k}(x^{(k+1)};x^{(k)})-h_{\alpha _k}(p_{\alpha _k}(x^{(k)});x^{(k)})|{\mathcal {F}}_k\right) \\&\le \frac{\alpha _{max}}{2} {\mathbb {E}}\left( \Vert e_g^{(k)}\Vert ^2|{\mathcal {F}}_k\right) \\&\le \frac{\alpha _{max}\varepsilon _k}{2}. \end{aligned} \end{aligned}$$

In view of the law of total expectation and the hypothesis on the sequence \(\{\varepsilon _k\}\), the above inequality allows us to state that

$$\begin{aligned} \lim _{k\rightarrow \infty } {\mathbb {E}}(h_{\alpha _k}(x^{(k+1)};x^{(k)})-h_{\alpha _k}(p_{\alpha _k}(x^{(k)});x^{(k)})) = 0. \end{aligned}$$
(A6)

From (A5) and (A6) we can conclude that

$$\begin{aligned} \lim _{k\rightarrow \infty } {\mathbb {E}}(\Vert x^{(k+1)}-p_{\alpha _k}(x^{(k)})\Vert ^2) = 0. \end{aligned}$$
(A7)

Then there exists \({\mathcal {K}}'\subseteq {\mathcal {K}}\) such that \(\lim _{k\rightarrow \infty ,k\in {\mathcal {K}}'}(x^{(k+1)}-p_{\alpha _k}(x^{(k)}))=0\) a.s. By continuity of the operator \(p_{\alpha _k}(\cdot )\) with respect to all its arguments, since \(\{x^{(k)}\}_{k\in {\mathcal {K}}}\) is bounded a.s., \(\{p_{\alpha _k}(x^{(k)})\}_{k\in {\mathcal {K}}'}\) is bounded a.s. as well. Thus \(\{x^{(k+1)}\}_{k\in {\mathcal {K}}'}\) is also bounded a.s. and there exists a limit point \(\bar{{\bar{x}}}\) of \(\{x^{(k+1)}\}_{k\in {\mathcal {K}}'}\). We define \({\mathcal {K}}''\subseteq {\mathcal {K}}'\) such that \(\lim _{{k\rightarrow \infty , \, k\in {\mathcal {K}}''}}x^{(k+1)} = \bar{{\bar{x}}}\) a.s. By continuity of the operator \(p_{\alpha _k}(\cdot )\), (A7) implies that \(\bar{{\bar{x}}}=p_{\alpha _k}({\bar{x}})\) a.s.

Since \(h_{\alpha _k}(x;x^{(k)})+{e_g^{(k)}}^T(x-x^{(k)})\) is strongly convex with modulus of convexity \(\frac{1}{\alpha _{max}}\) as well and \(x^{(k+1)}\) is its minimum point, we have

$$\begin{aligned} \begin{aligned} \frac{1}{\alpha _{max}}\Vert z-x^{(k+1)}\Vert ^2&\le h_{\alpha _k}(z;x^{(k)})+{e_g^{(k)}}^T(z-x^{(k)}) - h_{\alpha _k}(x^{(k+1)};x^{(k)}){+}\\&\quad -{e_g^{(k)}}^T(x^{(k+1)}-x^{(k)}). \end{aligned} \end{aligned}$$

By setting \(z=x^{(k)}\) in the previous inequality, we obtain

$$\begin{aligned} \frac{1}{\alpha _{max}}\Vert x^{(k)}-x^{(k+1)}\Vert ^2\le - h_{\alpha _k}(x^{(k+1)};x^{(k)})-{e_g^{(k)}}^T(x^{(k+1)}-x^{(k)}). \end{aligned}$$

In view of Theorem 1 (ii), we can state that

$$\begin{aligned} \Vert x^{(k)}-x^{(k+1)}\Vert ^2 \ \longrightarrow 0 \quad \text{ a.s. } \end{aligned}$$

Thus we proved that \({\bar{x}}=\bar{{\bar{x}}}=p_{\alpha _k}({\bar{x}})\) a.s. and by Lemma 1 e., we have that \({\bar{x}}\) is a stationary point a.s. \(\square \)

Proof of Theorem 3

Let \(x^*\in X^*\). Since \(\displaystyle \frac{x^{(k)}-x^{(k+1)}}{\alpha _k}-g^{(k)} \in \partial R(x^{(k+1)}), \) it holds that

$$\begin{aligned} R(y)\ge R(x^{(k+1)})+\frac{1}{\alpha _k}(x^{(k)}-x^{(k+1)}-\alpha _kg^{(k)})^T(y-x^{(k+1)}), \quad \forall y\in {\mathbb {R}}^d. \end{aligned}$$

It follows that, \(\forall y\in {\mathbb {R}}^d\),

$$\begin{aligned} \begin{aligned} \alpha _kR(y)&\ge \alpha _kR(x^{(k+1)})+(x^{(k)}-x^{(k+1)}-\alpha _kg^{(k)})^T(y-x^{(k+1)})\\&=\alpha _kR(x^{(k+1)})+(x^{(k)}-x^{(k+1)})^T(y-x^{(k+1)})-\alpha _k{g^{(k)}}^T(y-x^{(k+1)}), \end{aligned} \end{aligned}$$

and, hence, the following inequality holds

$$\begin{aligned} (x^{(k+1)}-x^{(k)})^T(y-x^{(k+1)})\ge \alpha _k\left( R(x^{(k+1)})-R(y)+{g^{(k)}}^T(x^{(k+1)}-y)\right) . \end{aligned}$$
(A8)

For \(y=x^*\) the previous inequality gives

$$\begin{aligned} \begin{aligned}&(x^{(k+1)}-x^{(k)})^T(x^*-x^{(k)}+x^{(k)}-x^{(k+1)})\ge \\&\quad {\ge }\alpha _k\left( R(x^{(k+1)})-R(x^*)+{g^{(k)}}^T(x^{(k+1)}-x^{(k)}+x^{(k)}-x^*)\right) . \end{aligned} \end{aligned}$$

As a consequence, we obtain the following relations:

$$\begin{aligned} \begin{aligned}&(x^{(k+1)}-x^{(k)})^T(x^*-x^{(k)})\ge \alpha _k\left( R(x^{(k+1)})-R(x^*)+{g^{(k)}}^T(x^{(k)}-x^*)\right) {+}\\&\qquad -(x^{(k+1)}-x^{(k)})^T(x^{(k)}-x^{(k+1)})+\alpha _k{g^{(k)}}^T(x^{(k+1)}-x^{(k)})\\&\quad =\alpha _k\left( R(x^{(k+1)})-R(x^*)+({\nabla F(x^{(k)})+e_g^{(k)}})^T(x^{(k)}-x^*)\right) {+}\\&\qquad +(x^{(k+1)}-x^{(k)})^T(x^{(k+1)}-x^{(k)})+\alpha _k({\nabla F(x^{(k)})+e_g^{(k)}})^T(x^{(k+1)}-x^{(k)})\\&\quad \ge \alpha _k\left( R(x^{(k+1)})-R(x^*)+F(x^{(k)})-F(x^*)\right) +\alpha _k{e_g^{(k)}}^T(x^{(k)}-x^*){+}\\&\qquad +\Vert x^{(k+1)}-x^{(k)}\Vert ^2 + \alpha _k(\nabla F(x^{(k)})+e_g^{(k)})^T(x^{(k+1)}-x^{(k)})\\&\quad =\alpha _k\left( R(x^{(k+1)}) + R(x^{(k)}) - R(x^{(k)}) + F(x^{(k)})-P(x^*)\right) {+}\\&\qquad + \Vert x^{(k+1)}-x^{(k)}\Vert ^2 +\alpha _k {e_g^{(k)}}^T(x^{(k)}-x^*){+}\\&\qquad +\alpha _k(\nabla F(x^{(k)})+e_g^{(k)})^T(x^{(k+1)}-x^{(k)})\\&\quad = \alpha _k\left( R(x^{(k+1)}) - R(x^{(k)}) + P(x^{(k)})-P(x^*)\right) + \Vert x^{(k+1)}-x^{(k)}\Vert ^2{+}\\&\qquad +\alpha _k {e_g^{(k)}}^T(x^{(k)}-x^*) +\alpha _k(\nabla F(x^{(k)})+e_g^{(k)})^T(x^{(k+1)}-x^{(k)})\\&\quad \ge \alpha _k\left( R(x^{(k+1)}) - R(x^{(k)})\right) + \Vert x^{(k+1)}-x^{(k)}\Vert ^2 +\alpha _k {e_g^{(k)}}^T(x^{(k)}-x^*) {+}\\&\qquad +\alpha _k(\nabla F(x^{(k)})+e_g^{(k)})^T(x^{(k+1)}-x^{(k)}), \end{aligned} \end{aligned}$$
(A9)

where the second inequality follows from the convexity of F and the last inequality follows from the fact that \(P(x^{(k)})-P(x^*)\ge 0\). From a basic property of the Euclidean norm (see Footnote 1), we can write

$$\begin{aligned} \begin{aligned} \Vert x^{(k+1)}-x^*\Vert ^2&= \Vert x^{(k+1)}-x^{(k)}\Vert ^2+\Vert x^{(k)}-x^*\Vert ^2-2(x^{(k+1)}-x^{(k)})^T(x^*-x^{(k)})\\&{\mathop {\le }\limits ^{(A9)}} \Vert x^{(k+1)}-x^{(k)}\Vert ^2+\Vert x^{(k)}-x^*\Vert ^2-2\alpha _k\left( R(x^{(k+1)}) - R(x^{(k)})\right) {+}\\&\quad - 2\Vert x^{(k+1)}-x^{(k)}\Vert ^2{+}\\&\quad -2\alpha _k{e_g^{(k)}}^T(x^{(k)}-x^*) -2\alpha _k(\nabla F(x^{(k)})+e_g^{(k)})^T(x^{(k+1)}-x^{(k)})\\&=\Vert x^{(k)}-x^*\Vert ^2-\Vert x^{(k+1)}-x^{(k)}\Vert ^2-2\alpha _k\left( R(x^{(k+1)}) - R(x^{(k)})\right) {+}\\&\quad -2\alpha _k\nabla F(x^{(k)})^T(x^{(k+1)}-x^{(k)}) -2\alpha _k{e_g^{(k)}}^T(x^{(k)}-x^*){+}\\&\quad -2\alpha _k {e_g^{(k)}}^T(x^{(k+1)}-x^{(k)})\\&=\Vert x^{(k)}-x^*\Vert ^2-2\alpha _k{e_g^{(k)}}^T(x^{(k)}-x^*) -2\alpha _k {e_g^{(k)}}^T(x^{(k+1)}-x^{(k)}){+}\\&\quad -2\alpha _k\left( R(x^{(k+1)}) - R(x^{(k)}) + \nabla F(x^{(k)})^T(x^{(k+1)}-x^{(k)})\right. {+}\\&\quad \left. + \frac{1}{2\alpha _k}\Vert x^{(k+1)}-x^{(k)}\Vert ^2\right) \\&=\Vert x^{(k)}-x^*\Vert ^2-2\alpha _k\left( h_{\alpha _k}(x^{(k+1)};x^{(k)})+{e_g^{(k)}}^T(x^{(k+1)}-x^{(k)})\right) {+}\\&\quad -2\alpha _k{e^{(k)}_g}^T(x^{(k)}-x^*)\\ {}&\le \Vert x^{(k)}-x^*\Vert ^2-2\alpha _{max}\left( h_{\alpha _k}(x^{(k+1)};x^{(k)})+{e_g^{(k)}}^T(x^{(k+1)}-x^{(k)})\right) {+}\\&\quad -2\alpha _{k}{e^{(k)}_g}^T(x^{(k)}-x^*). \end{aligned} \end{aligned}$$

Taking the conditional expectation with respect to the \(\sigma \)-algebra \({\mathcal {F}}_k\), we obtain

$$\begin{aligned} \begin{aligned} {\mathbb {E}}\left( \Vert x^{(k+1)}-x^*\Vert ^2 |{\mathcal {F}}_k\right)&\le \Vert x^{(k)}-x^*\Vert ^2-2\alpha _{max}{\mathbb {E}}\left( h_{\alpha _k}(x^{(k+1)};x^{(k)})|{\mathcal {F}}_k\right) {+}\\&\quad -2\alpha _{max}{\mathbb {E}}\left( {e_g^{(k)}}^T(x^{(k+1)}-x^{(k)})|{\mathcal {F}}_k\right) {+}\\&\quad - 2{\mathbb {E}}\left( \alpha _{k}{e_g^{(k)}}^T(x^{(k)}-x^*)|{\mathcal {F}}_k\right) . \end{aligned} \end{aligned}$$
(A10)

Since \(\alpha _k\) is \({\mathcal {F}}_{k+1}\)-measurable, with \({\mathcal {F}}_k\subset {\mathcal {F}}_{k+1}\), in view of the tower property we obtain \({\mathbb {E}}\left( \alpha _k{e_g^{(k)}}^T(x^{(k)}-x^*)|{\mathcal {F}}_k\right) =0\) and we can rewrite (A10) as

$$\begin{aligned} \begin{aligned} {\mathbb {E}}\left( \Vert x^{(k+1)}-x^*\Vert ^2 |{\mathcal {F}}_k\right)&\le \Vert x^{(k)}-x^*\Vert ^2+2\alpha _{max}{\mathbb {E}}\left( -h_{\alpha _k}(x^{(k+1)};x^{(k)})|{\mathcal {F}}_k\right) {+}\\&\quad +2\alpha _{max}{\mathbb {E}}\left( -{e_g^{(k)}}^T(x^{(k+1)}-x^{(k)})|{\mathcal {F}}_k\right) . \end{aligned} \end{aligned}$$
(A11)

By combining (A11) and part i) of Theorem 1 together with Lemma 3, we can state that the sequence \(\{\Vert x^{(k)}-x^*\Vert \}_{k\in {\mathbb {N}}}\) converges a.s.

Next we prove the almost sure convergence of the sequence \(\{x^{(k)}\}\) by following a strategy similar to the one employed in [21, Theorem 2.1]. Let \(\{x_i^*\}_i\) be a countable subset of the relative interior \(\text {ri}(X^*)\) that is dense in \(X^*\). From the almost sure convergence of \(\Vert x^{(k)}-x^*\Vert \), \(x^*\in X^*\), we have that for each i, the probability \(\text {Prob}(\{\Vert x^{(k)}-x_i^*\Vert \} \ \text{ is } \text{ not } \text{ convergent})=0\). Therefore, we observe that

$$\begin{aligned} \begin{aligned}&\text {Prob}(\forall i\ \exists b_i\ \text{ s.t. } \lim _{k\rightarrow +\infty }\Vert x^{(k)}-x_{i}^* \Vert =b_i )=1-\text {Prob}(\{\Vert x^{(k)}-x_{i} ^*\Vert \} \ \text{ is } \text{ not } \text{ convergent})\\&\quad \ge 1- \sum _i \text {Prob}(\{\Vert x^{(k)}-x_i^*\Vert \} \ \text{ is } \text{ not } \text{ convergent})=1, \end{aligned} \end{aligned}$$

where the inequality follows from the union bound, i.e. for each i, \(\{\Vert x^{(k)}-x_i^*\Vert \}\) is a convergent sequence a.s. For a contradiction, suppose that there exist subsequences \(\{u_{k_j}\}_{k_j}\) and \(\{v_{k_j}\}_{k_j}\) of \(\{x^{(k)}\}\) converging to limit points \(u^*\) and \(v^*\) respectively, with \(\Vert u^*-v^*\Vert =r>0\). By Theorem 2, \(u^*\) and \(v^*\) are stationary; in particular, since P is convex, they are minimum points, i.e. \(u^*,v^*\in X^*\). Since \(\{x^*_i\}_i\) is dense in \(X^*\), for all \(\epsilon >0\) there exist \(x^*_{i_1}\) and \(x^*_{i_2}\) such that \(\Vert x^*_{i_1}-u^*\Vert <\epsilon \) and \(\Vert x^*_{i_2}-v^*\Vert <\epsilon \). Therefore, for all \(k_j\) sufficiently large,

$$\begin{aligned} \Vert u_{k_j}-x^*_{i_1}\Vert \le \Vert u_{k_j}-u^*\Vert +\Vert u^*-x^*_{i_1}\Vert <\Vert u_{k_j}-u^*\Vert +\epsilon . \end{aligned}$$

On the other hand, for sufficiently large j, we have

$$\begin{aligned} \Vert v_{k_j}-x_{i_1}^*\Vert \ge \Vert v^*-u^*\Vert -\Vert u^*-x^*_{i_1}\Vert -\Vert v_{k_j}-v^*\Vert>r-\epsilon -\Vert v_{k_j}-v^*\Vert >r-2\epsilon . \end{aligned}$$

This contradicts the fact that \(\{\Vert x^{(k)}-x^*_{i_1}\Vert \}\) is convergent. Therefore, we must have \(u^*=v^*\); hence there exists \({\bar{x}}\in X^*\) such that \(x^{(k)}\longrightarrow {\bar{x}}\). \(\square \)

Proof of Theorem 4

If we do not neglect the term \(P(x^{(k)}) - P(x^*)\) in (A9) and in all the subsequent inequalities, instead of (A11) we obtain

$$\begin{aligned} \begin{aligned} {\mathbb {E}}\left( \Vert x^{(k+1)}-x^*\Vert ^2 |{\mathcal {F}}_k\right)&\le \Vert x^{(k)}-x^*\Vert ^2+ \\&\quad + 2\alpha _{max}{\mathbb {E}}\left( -h_{\alpha _k}(x^{(k+1)};x^{(k)}) -{e_g^{(k)}}^T(x^{(k+1)}-x^{(k)})|{\mathcal {F}}_k\right) {+}\\&\quad -2\alpha _{min} {\mathbb {E}}\left( P(x^{(k)})- P(x^*) |{\mathcal {F}}_k\right) . \end{aligned} \end{aligned}$$
(A12)

Summing the previous inequality from 0 to K and taking the total expectation, we obtain

$$\begin{aligned} \begin{aligned}&\sum _{k=0}^K {\mathbb {E}}\left( P(x^{(k)})- P(x^*)\right) \le \frac{1}{2 \alpha _{min}}\left( \Vert x^{(0)}-x^*\Vert ^2 - {\mathbb {E}}(\Vert x^{(K+1)}-x^*\Vert ^2)\right) + \\&\quad +\frac{\alpha _{max}}{\alpha _{min}} {\mathbb {E}}\left( \sum _{k=0}^K {\mathbb {E}}\left( -h_{\alpha _k}(x^{(k+1)};x^{(k)})-{e_g^{(k)}}^T(x^{(k+1)}-x^{(k)})|{\mathcal {F}}_k\right) \right) . \end{aligned} \end{aligned}$$

By neglecting the term \(- {\mathbb {E}}(\Vert x^{(K+1)}-x^*\Vert ^2)\) and bounding the second term by \(S\) (Theorem 1 i)), i.e.,

$$\begin{aligned} \sum _{k=0}^K {\mathbb {E}}\left( -h_{\alpha _k}(x^{(k+1)};x^{(k)})-{e_g^{(k)}}^T(x^{(k+1)}-x^{(k)})|{\mathcal {F}}_k\right) \le S, \end{aligned}$$

we obtain

$$\begin{aligned} \sum _{k=0}^K {\mathbb {E}}\left( P(x^{(k)})- P(x^*)\right) \le \frac{1}{2 \alpha _{min}} \Vert x^{(0)}-x^*\Vert ^2+ \frac{\alpha _{max}}{\alpha _{min}} S. \end{aligned}$$
(A13)

Setting \({\overline{x}}^{(K)}= \frac{1}{K+1} \sum _{k=0}^K x^{(k)}\), from Jensen's inequality we observe that \( {\mathbb {E}}(P({\overline{x}}^{(K)}))\le \frac{1}{K+1} \sum _{k=0}^K {{\mathbb {E}}(P(x^{(k)}))}\). Thus, by dividing (A13) by \(K+1\), we can write

$$\begin{aligned} {\mathbb {E}}\left( P({\overline{x}}^{(K)})- P(x^*)\right) \le \frac{1}{K+1} \left( \frac{1}{2 \alpha _{min}} \Vert x^{(0)}-x^*\Vert ^2+ \frac{\alpha _{max}}{\alpha _{min}} S\right) . \end{aligned}$$
(A14)

Thus, we obtain the \({\mathcal {O}}(1/K)\) ergodic convergence rate of \({\mathbb {E}}\left( P({\overline{x}}^{(K)})- P(x^*)\right) \).

Now, we assume \(\sum _{k=0}^\infty k \eta _k=\Sigma \). In (A13) the term \(\sum _{k=0}^K {\mathbb {E}}\left( P(x^{(k)})- P(x^*)\right) \) is equal to \({\mathbb {E}} \left( \sum _{k=0}^K P(x^{(k)}) \right) -(K+1) P(x^*) \). We observe that, since \(0\le P(x^{(0)})-P(x^*)\), we can write

$$\begin{aligned} {\mathbb {E}} \left( \sum _{k=1}^K P(x^{(k)})\right) - K P(x^*)&\le {\mathbb {E}} \left( \sum _{k=0}^K P(x^{(k)}) \right) -(K+1) P(x^*) \\&\le \frac{1}{2 \alpha _{min}} \Vert x^{(0)}-x^*\Vert ^2+ \frac{\alpha _{max}}{\alpha _{min}} S. \end{aligned}$$

Now we determine a lower bound for \({\mathbb {E}} \left( \sum _{k=1}^K P(x^{(k)})\right) \). From the inequality (8), we have that \({\mathbb {E}}\left( P(x^{(k)})-P(x^{(k+1)})|{\mathcal {F}}_k\right) +\eta _k\ge 0\) and, hence, by considering the total expectation we obtain \({\mathbb {E}}\left( P(x^{(k)})-P(x^{(k+1)})\right) +{\mathbb {E}}(\eta _k)\ge 0\). Thus, we have

$$\begin{aligned} 0&\le \sum _{k=1}^K k {\mathbb {E}}\left( P(x^{(k)})-P(x^{(k+1)})\right) +\sum _{k=1}^K k{\mathbb {E}}(\eta _k) \nonumber \\&= \sum _{k=1}^K {\mathbb {E}}( P(x^{(k)})) -K {\mathbb {E}}(P(x^{(K+1)}))+ {\mathbb {E}}\left( \sum _{k=1}^K k\eta _k\right) . \end{aligned}$$
(A15)

Then, we can write

$$\begin{aligned} K {\mathbb {E}}( P(x^{(K+1)}))-\Sigma \le \sum _{k=1}^K {\mathbb {E}}\left( P(x^{(k)})\right) . \end{aligned}$$
(A16)

Consequently, we can conclude that

$$\begin{aligned} {\mathbb {E}}(P(x^{(K+1)})-P(x^*)) \le \frac{1}{K} \left( \frac{1}{2 \alpha _{min}} \Vert x^{(0)}-x^*\Vert ^2+ \frac{\alpha _{max}}{\alpha _{min}} S +\Sigma \right) . \end{aligned}$$

\(\square \)

Appendix B Hyperparameter Settings for Hybrid Methods

For the Prox-SVRG method we use the hyperparameter setting proposed in [27], i.e., \({\overline{N}}=1\) and \(m=2N\), where \(m\) is the number of Prox-SVRG inner iterations; this means that a full gradient has to be computed every two epochs. As for the fixed steplength \({\overline{\alpha }}\), we tried the values suggested in the experimental part of [27], i.e., \({\overline{\alpha }}\in \{ \frac{1}{{\hat{L}}},\frac{0.1}{{\hat{L}}},\frac{0.01}{{\hat{L}}}\}\), where \({\hat{L}}\) is an approximation of the Lipschitz constant \(L\) of \(\nabla F\). In Table 8 we report the best steplength values obtained for all the test problems.

Table 8 Best tuned values of the steplength for Prox-SVRG for the considered test problems

For the Prox-SARAH method we use the hyperparameter setting specified in [22] where, borrowing the notation of that paper, \(q=2+0.01+(\frac{1}{100})\) and \(C=\frac{q^2}{(q^2+8){\hat{L}}^2\gamma ^2}\); the values for the other hyperparameters are shown in Table 9.

Table 9 Settings of Prox-SARAH [22]

For the Prox-Spider-boost method we use the hyperparameter setting specified in [26]; the values for the hyperparameters are shown in Table 10.

Table 10 Settings of Prox-Spider-boost [26]
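
The rules quoted above can be collected in a small helper (a sketch added for convenience; \(\hat{L}\), \(\gamma \) and the sample size \(N\) are problem-dependent inputs, the example values below are hypothetical, and the tuned values of Tables 8, 9 and 10 are not reproduced here):

```python
def prox_svrg_settings(N, L_hat):
    # N_bar = 1, m = 2N (full gradient every two epochs), candidate steplengths from [27]
    return {"N_bar": 1,
            "m": 2 * N,
            "steplength_candidates": [1.0 / L_hat, 0.1 / L_hat, 0.01 / L_hat]}

def prox_sarah_constants(L_hat, gamma):
    # q and C exactly as stated in the text for Prox-SARAH [22]
    q = 2 + 0.01 + (1 / 100)
    C = q ** 2 / ((q ** 2 + 8) * L_hat ** 2 * gamma ** 2)
    return {"q": q, "C": C}

print(prox_svrg_settings(N=10_000, L_hat=4.0))
print(prox_sarah_constants(L_hat=4.0, gamma=0.5))
```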


Cite this article

Franchini, G., Porta, F., Ruggiero, V. et al. A Line Search Based Proximal Stochastic Gradient Algorithm with Dynamical Variance Reduction. J Sci Comput 94, 23 (2023). https://doi.org/10.1007/s10915-022-02084-3
