Abstract
Many optimization problems arising in machine learning can be cast as the minimization of the sum of two functions: the first typically represents the expected risk, which in practice is replaced by the empirical risk, while the second imposes a priori information on the solution. Since in general the first term is differentiable and the second is convex, proximal gradient methods are very well suited to such optimization problems. However, in large-scale machine learning applications the computation of the full gradient of the differentiable term can be prohibitively expensive, making these algorithms unsuitable. For this reason, proximal stochastic gradient methods have been extensively studied in the optimization literature over the last decades. In this paper we develop a proximal stochastic gradient algorithm based on two main ingredients: a technique to dynamically reduce the variance of the stochastic gradients along the iterative process, combined with a descent condition in expectation for the objective function, aimed at fixing the value of the steplength parameter at each iteration. For general objective functionals, the a.s. convergence of the limit points of the sequence generated by the proposed scheme to stationary points is proved. For convex objective functionals, both the a.s. convergence of the whole sequence of iterates to a minimum point and an \({\mathcal {O}}(1/k)\) convergence rate for the objective function values are shown. The practical implementation of the proposed method requires neither the computation of the exact gradient of the empirical risk during the iterations nor the tuning of an optimal steplength. An extensive numerical experimentation highlights that the proposed approach is robust with respect to the setting of the hyperparameters and competitive with state-of-the-art methods.
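As a rough illustration of the class of methods discussed in the abstract, a plain proximal stochastic gradient iteration with an \(\ell _1\) regularizer (whose proximal operator is soft-thresholding) can be sketched as follows. This is a generic toy sketch, not the algorithm proposed in the paper: the quadratic objective, the noise model, the steplength and the regularization weight are all arbitrary placeholders.

```python
import numpy as np

def prox_l1(v, t):
    """Proximal operator of t * ||.||_1, i.e. componentwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_sgd_step(x, stoch_grad, alpha, reg_weight):
    """One proximal stochastic gradient step: x+ = prox_{alpha R}(x - alpha g)."""
    return prox_l1(x - alpha * stoch_grad, alpha * reg_weight)

# Toy problem: F(x) = 0.5 * ||x - b||^2 plus an l1 regularizer with weight 0.1.
rng = np.random.default_rng(0)
b = np.array([1.0, -2.0, 0.05])
x = np.zeros(3)
for _ in range(200):
    g = (x - b) + 0.01 * rng.standard_normal(3)  # noisy gradient of F
    x = prox_sgd_step(x, g, alpha=0.1, reg_weight=0.1)
```

On this toy problem the iterates approach the soft-thresholded point \((0.9,-1.9,0)\), which is the exact minimizer of the regularized objective.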
Availability of Data and Materials
The datasets analysed during the current study are available in links given in the paper.
Change history
23 June 2023
A Correction to this paper has been published: https://doi.org/10.1007/s10915-023-02267-6
Notes
\(\Vert a-b\Vert ^2+\Vert b-c\Vert ^2-\Vert a-c\Vert ^2 = 2(a-b)^T(c-b), \quad \forall a,b,c \in {\mathbb {R}}^d\).
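This identity, used via Footnote 1 in the proof of Theorem 3, is easy to verify numerically; a minimal sketch with arbitrary random vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, c = rng.standard_normal((3, 5))  # three arbitrary vectors in R^5

# ||a-b||^2 + ||b-c||^2 - ||a-c||^2 should equal 2 (a-b)^T (c-b)
lhs = np.dot(a - b, a - b) + np.dot(b - c, b - c) - np.dot(a - c, a - c)
rhs = 2.0 * np.dot(a - b, c - b)
assert np.isclose(lhs, rhs)
```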
References
Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Math. Program. Ser. A 137(1), 91–129 (2013)
Bertsekas, D.: Convex Optimization Theory, Chapter 6 on Convex Optimization Algorithms, pp. 251–489. Athena Scientific, Belmont (2009)
Berahas, A.S., Cao, L., Scheinberg, K.: Global convergence rate analysis of a generic line search algorithm with noise. SIAM J. Optim. 31(2), 1489–1518 (2021)
Bollapragada, R., Byrd, R., Nocedal, J.: Adaptive sampling strategies for stochastic optimization. SIAM J. Optim. 28(4), 3312–3343 (2018)
Bonettini, S., Loris, I., Porta, F., Prato, M.: Variable metric inexact line-search based methods for nonsmooth optimization. SIAM J. Optim. 26, 891–921 (2016)
Bonettini, S., Porta, F., Prato, M., Rebegoldi, S., Ruggiero, V., Zanni, L.: Recent advances in variable metric first-order methods. In: Donatelli, M., Serra-Capizzano, S. (eds.) Computational Methods for Inverse Problems in Imaging. Springer INDAM Series, vol. 36, pp. 1–31 (2019)
Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)
Bottou, L.: Online algorithms and stochastic approximations. In: Saad, D. (ed.) Online Learning and Neural Networks. Cambridge University Press, Cambridge (1998). https://leon.bottou.org/publications/pdf/online-1998.pdf
Byrd, R.H., Chin, G.M., Nocedal, J., Wu, Y.: Sample size selection in optimization methods for machine learning. Math. Program. 134(1), 128–155 (2012)
Combettes, P.L., Pesquet, J.-C.: Proximal splitting methods in signal processing. In: Bauschke, H.H., Burachik, R.S., Combettes, P.L., Elser, V., Luke, D.R., Wolkowicz, H. (eds.) Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer Optimization and Its Applications, pp. 185–212. Springer, New York (2011)
Combettes, P.L., Wajs, V.R.: Signal recovery by proximal forward-backward splitting. SIAM Multiscale Model. Simul. 4, 1168–1200 (2005)
Duchi, J., Singer, Y.: Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res. 10, 2873–2898 (2009)
Franchini, G., Ruggiero, V., Zanni, L.: Ritz-like values in steplength selections for stochastic gradient methods. Soft. Comput. 24, 17573–17588 (2020)
Franchini, G., Ruggiero, V., Trombini, I.: Automatic steplength selection in stochastic gradient methods. Mach. Learn. Optim. Data Sci. LOD 2021, 4124–4132 (2021)
Freund, J.E.: Mathematical Statistics. Prentice-Hall, Englewood Cliffs (1962)
Ghadimi, S., Lan, G.: Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013)
Iusem, A.N., Jofrè, A., Oliveira, R.I., Thompson, P.: Variance-based extragradient methods with line search for stochastic variational inequalities. SIAM J. Optim. 29(1), 175–206 (2019)
Le, T.V., Gopee, N.: Classifying CIFAR-10 images using unsupervised feature & ensemble learning. https://trucvietle.me/files/601-report.pdf
Paquette, C., Scheinberg, K.: A stochastic line search method with expected complexity analysis. SIAM J. Optim. 30(1), 349–376 (2020)
Polyak, B.T.: Introduction to Optimization. Optimization Software, New York (1987)
Poon, C., Liang, J., Schoenlieb, C.: Local convergence properties of SAGA/Prox-SVRG and acceleration. In: Proceedings of the 35th International Conference on Machine Learning, PMLR, vol. 80, pp. 4124–4132 (2018)
Pham, N.H., Nguyen, L.M., Phan, D.T., Tran-Dinh, Q.: ProxSARAH: an efficient algorithmic framework for stochastic composite nonconvex optimization. J. Mach. Learn. Res. 21, 1–48 (2020)
Poon, C., Liang, J., Schoenlieb, C.: Local convergence properties of SAGA/Prox-SVRG and acceleration. In: Dy, J., Krause, A (eds.) Proceedings of the 35th International Conference on Machine Learning, PMLR, Proceedings of Machine Learning Research, vol. 80, pp. 4124–4132 (2018)
Rockafellar, R.T., Wets, R.J.-B.: Variational Analysis. Grundlehren der Mathematischen Wissenschaften, vol. 317. Springer, Berlin (1998)
Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1), 83–112 (2017)
Wang, Z., Ji, K., Zhou, Y., Liang, Y., Tarokh, V.: SpiderBoost and momentum: faster stochastic variance reduction algorithms. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems, vol. 216, pp. 2406–2416. Curran Associates Inc. (2019)
Xiao, L., Zhang, T.: A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(4), 2057–2075 (2014)
Yang, Z., Wang, C., Zang, Y., Li, J.: Mini-batch algorithms with Barzilai-Borwein update step. Neurocomputing 314, 177–185 (2018)
Acknowledgements
The authors thank the anonymous referees for their careful reading and useful remarks and suggestions that improved the quality of the paper.
Funding
This work has been partially supported by the INdAM research group GNCS. The publication was created with the co-financing of the European Union-FSE-REACT-EU, PON Research and Innovation 2014–2020 DM1062/2021.
Author information
Contributions
All authors contributed equally to the study conception and design. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Proofs of Theorems for Sect. 2
To prove Theorems 1, 2, 3 and 4, Lemmas 1, 2 and 3 are needed. Lemma 1 recalls well-known results on the proximal operator (for the proof, see [5, 10] and references therein), while Lemma 3 is a classical result from stochastic analysis.
Lemma 1
Let \(\alpha >0,\; x\in \text{ dom } (P),\; u\in \mathbb {R}^d\). The following statements hold true.
a. \({\hat{y}}={\text {prox}}_{\alpha R}(x-\alpha u)\) if and only if \(\frac{1}{\alpha }(x-{\hat{y}})-u=w\), \(w\in \partial R({\hat{y}})\).
b. The function \(h_{\alpha }\) is strongly convex with modulus of convexity \(\displaystyle \frac{1}{\alpha }\).
c. \(h_\alpha (x; x) = 0\).
d. \(h_{\alpha }(p_{\alpha }(x); x)\le 0\) and \(h_{\alpha }(p_{\alpha }(x); x) = 0\) if and only if \(p_{\alpha }(x) = x\).
e. x is a stationary point for problem (1) if and only if \(x = p_{\alpha }(x)\) if and only if \(h_{\alpha }(p_{\alpha }(x);x) = 0\).
Lemma 2
Under Assumption 1 (i), consider the sequence \(\{x^{(k)}\}\) generated by the iteration (7). If \(\alpha _k>0\), the following inequality holds:
Proof
In view of (4), we have
Now, from the convexity of R at \({x}^{(k+1)}\) and \(\frac{x^{(k)}-{x}^{(k+1)}}{\alpha _k}-(\nabla F(x^{(k)})+e_g^{(k)})\in \partial R({x}^{(k+1)})\) (Lemma 1 a), we obtain
Including the above inequality in (A2), we obtain
Then, we have
By adding and subtracting \(\frac{\alpha _k}{2}\Vert e_g^{(k)}\Vert ^2\), we obtain
\(\square \)
Lemma 3
[20, Lemma 11] Let \(\nu _k\), \(u_k\), \(\alpha _k\), \(\beta _k\) be nonnegative random variables such that, almost surely,
\({\mathbb {E}}(\nu _{k+1}|{\mathcal {F}}_k)\le (1+\alpha _k)\nu _k-u_k+\beta _k, \qquad \sum _{k=0}^{\infty }\alpha _k<\infty , \qquad \sum _{k=0}^{\infty }\beta _k<\infty ,\)
where \({\mathbb {E}}(\nu _{k+1}|{\mathcal {F}}_k)\) denotes the conditional expectation for the given \(\nu _0,\dots ,\nu _k\), \(u_0,\dots ,u_k\), \(\alpha _0,\dots ,\alpha _k\), \(\beta _0,\dots ,\beta _k\). Then
\(\nu _k\rightarrow \nu \ \text{ a.s. } \quad \text{ and } \quad \sum _{k=0}^{\infty }u_k<\infty \ \text{ a.s. },\)
where \(\nu \ge 0\) is some random variable.
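A toy deterministic sequence satisfying the recursion \(\nu _{k+1}\le (1+\alpha _k)\nu _k-u_k+\beta _k\) with summable \(\alpha _k,\beta _k\) illustrates the conclusion of Lemma 3, namely the convergence of \(\nu _k\) and the summability of \(u_k\); the specific sequences below are arbitrary choices:

```python
import numpy as np

# Arbitrary summable perturbation sequences a_k, b_k and a nonnegative
# "descent" term u_k; the recursion mimics the hypothesis of Lemma 3.
K = 2000
a = 1.0 / np.arange(1, K + 1) ** 2
b = 1.0 / np.arange(1, K + 1) ** 2
nu = np.empty(K + 1)
nu[0] = 1.0
u_total = 0.0
for k in range(K):
    u_k = min(nu[k], 1.0 / (k + 1) ** 1.5)  # u_k <= nu_k keeps nu nonnegative
    nu[k + 1] = (1 + a[k]) * nu[k] - u_k + b[k]
    u_total += u_k
# nu settles down to a limit and the u_k accumulate to a finite total,
# as the lemma predicts.
```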
Proof of Theorem 1
In view of Assumption 1 (iii), \(P(x^{(k)})-P^*\) is a nonnegative random variable and, from (9), we obtain:
In view of (8) and Lemma 3, we obtain that \(P(x^{(k+1)})-P^*\longrightarrow {\overline{P}}\) a.s. and
In order to conclude the proof, we follow a strategy similar to the one employed in the proof of [23, Theorem 2.1]. Define a new random variable \(w_j = \sum _{k\ge j}{\mathbb {E}}\left( -h_{\alpha _k}(x^{(k+1)};x^{(k)})-{e_g^{(k)}}^T(x^{(k+1)}-x^{(k)})|{\mathcal {F}}_k\right) \). The sequence \(\{w_j\}\) is nonincreasing and converges to 0 as \(j\rightarrow +\infty \). As a consequence, from the monotone convergence theorem, it holds that
which implies
and, hence,
Then \(h_{\alpha _k}(x^{(k+1)};x^{(k)})+{e_g^{(k)}}^T(x^{(k+1)}-x^{(k)})\rightarrow 0\) a.s. \(\square \)
Proof of Theorem 2
We suppose that there exists a subsequence of \(\{x^{(k)}\}\) that converges a.s. to \({\bar{x}}\), namely there exists \({\mathcal {K}}\subseteq {\mathbb {N}}\) such that
We observe that, since \(h_{\alpha _k}\) is strongly convex with modulus of convexity \(\displaystyle \frac{1}{\alpha _{{max}}}\) and \(p_{\alpha _k}(x^{(k)})\) is its minimum point, we have
Setting \(z=x^{(k+1)}\) in the previous inequality gives
From the last inequality and Lemma 2, we have
and, consequently, by considering the conditional expectation in both members, we have
In view of the law of total expectation and the hypothesis on the sequence \(\{\varepsilon _k\}\), the above inequality allows us to state that
From (A5) and (A6) we can conclude that
Then there exists \({\mathcal {K}}'\subseteq {\mathcal {K}}\) such that \(\lim _{k\rightarrow \infty ,k\in {\mathcal {K}}'}(x^{(k+1)}-p_{\alpha _k}(x^{(k)}))=0\) a.s. By continuity of the operator \(p_{\alpha _k}(\cdot )\) with respect to all its arguments, since \(\{x^{(k)}\}_{k\in {\mathcal {K}}}\) is bounded a.s., \(\{p_{\alpha _k}(x^{(k)})\}_{k\in {\mathcal {K}}'}\) is bounded a.s. as well. Thus \(\{x^{(k+1)}\}_{k\in {\mathcal {K}}'}\) is also bounded a.s. and there exists a limit point \(\bar{{\bar{x}}}\) of \(\{x^{(k+1)}\}_{k\in {\mathcal {K}}'}\). We define \({\mathcal {K}}''\subseteq {\mathcal {K}}'\) such that \(\lim _{{k\rightarrow \infty , \, k\in {\mathcal {K}}''}}x^{(k+1)} = \bar{{\bar{x}}}\) a.s. By continuity of the operator \(p_{\alpha _k}(\cdot )\), (A7) implies that \(\bar{{\bar{x}}}=p_{\alpha _k}({\bar{x}})\) a.s.
Since \(h_{\alpha _k}(x;x^{(k)})+{e_g^{(k)}}^T(x-x^{(k)})\) is strongly convex with modulus of convexity \(\frac{1}{\alpha _{max}}\) as well and \(x^{(k+1)}\) is its minimum point, we have
By setting \(z=x^{(k)}\) in the previous inequality, we obtain
In view of Theorem 1 (ii), we can state that
Thus we proved that \({\bar{x}}=\bar{{\bar{x}}}=p_{\alpha _k}({\bar{x}})\) a.s. and by Lemma 1 e., we have that \({\bar{x}}\) is a stationary point a.s. \(\square \)
Proof of Theorem 3
Let \(x^*\in X^*\). Since \(\displaystyle \frac{x^{(k)}-x^{(k+1)}}{\alpha _k}-g^{(k)} \in \partial R(x^{(k+1)}), \) it holds that
It follows that, \(\forall y\in {\mathbb {R}}^d\),
and, hence, the following inequality holds
For \(y=x^*\) the previous inequality gives
As a consequence, we obtain the following relations:
where the second inequality follows from the convexity of F and the last inequality follows from the fact that \(P(x^{(k)})-P(x^*)\ge 0\). From a basic property of the Euclidean norm (see Footnote 1) we can write
Taking the conditional expectation with respect to the \(\sigma \)-algebra \({\mathcal {F}}_k\), we obtain
Since \(\alpha _k\in {\mathcal {F}}_{k+1}\) where \({\mathcal {F}}_k\subset {\mathcal {F}}_{k+1}\), in view of the tower property we obtain \({\mathbb {E}}\left( \alpha _k{e_g^{(k)}}^T(x^{(k)}-x^*)|{\mathcal {F}}_k\right) =0\) and we rewrite (A10) as
By combining (A11) and part i) of Theorem 1 together with Lemma 3, we can state that the sequence \(\{\Vert x^{(k)}-x^*\Vert \}_{k\in {\mathbb {N}}}\) converges a.s.
Next we prove the almost sure convergence of the sequence \(\{x^{(k)}\}\) by following a strategy similar to the one employed in [21, Theorem 2.1]. Let \(\{x_i^*\}_i\) be a countable subset of the relative interior \(\text {ri}(X^*)\) that is dense in \(X^*\). From the almost sure convergence of \(\Vert x^{(k)}-x^*\Vert \) for every \(x^*\in X^*\), we have that, for each i, \(\text {Prob}(\{\Vert x^{(k)}-x_i^*\Vert \} \ \text{ is } \text{ not } \text{ convergent})=0\). Therefore, we observe that
where the inequality follows from the union bound, i.e. for each i, \(\{\Vert x^{(k)}-x_i^*\Vert \}\) is a convergent sequence a.s. For a contradiction, suppose that there are convergent subsequences \(\{u_{k_j}\}_{k_j}\) and \(\{v_{k_j}\}_{k_j}\) of \(\{x^{(k)}\}\) which converge to their limiting points \(u^*\) and \(v^*\) respectively, with \(\Vert u^*-v^*\Vert =r>0\). By Theorem 2, \(u^*\) and \(v^*\) are stationary; in particular, since P is convex, they are minimum points, i.e. \(u^*,v^*\in X^*\). Since \(\{x^*_i\}_i\) is dense in \(X^*\), for any \(\epsilon >0\) there exist \(x^*_{i_1}\) and \(x^*_{i_2}\) such that \(\Vert x^*_{i_1}-u^*\Vert <\epsilon \) and \(\Vert x^*_{i_2}-v^*\Vert <\epsilon \). Therefore, for all \(k_j\) sufficiently large,
On the other hand, for sufficiently large j, we have
This contradicts the fact that \(\{\Vert x^{(k)}-x^*_{i_1}\Vert \}\) is convergent. Therefore, we must have \(u^*=v^*\); hence there exists \({\bar{x}}\in X^*\) such that \(x^{(k)}\longrightarrow {\bar{x}}\). \(\square \)
Proof of Theorem 4
If we do not neglect the term \(P(x^{(k)}) - P(x^*)\) in (A9) and in all the subsequent inequalities, instead of (A11) we obtain
Summing the previous inequality from 0 to K and taking the total expectation, we obtain
By neglecting the term \(- {\mathbb {E}}(\Vert x^{(K+1)}-x^*\Vert ^2)\) and bounding the second term by S (Theorem 1 (i))
we obtain
Setting \({\overline{x}}^{(K)}= \frac{1}{K+1} \sum _{k=0}^K x^{(k)}\), by Jensen’s inequality we observe that \( {\mathbb {E}}(P({\overline{x}}^{(K)}))\le \frac{1}{K+1} \sum _{k=0}^K {{\mathbb {E}}(P(x^{(k)}))}\). Thus, by dividing (A13) by \(K+1\), we can write
Thus, we obtain the \({\mathcal {O}}(1/K)\) ergodic convergence rate of \({\mathbb {E}}\left( P({\overline{x}}^{(K)})- P(x^*)\right) \).
Now, we assume \(\sum _{k=0}^\infty k \eta _k=\Sigma \). In (A13) the term \(\sum _{k=0}^K {\mathbb {E}}\left( P(x^{(k)})- P(x^*)\right) \) is equal to \({\mathbb {E}} \left( \sum _{k=0}^K P(x^{(k)}) \right) -(K+1) P(x^*) \). We observe that, since \(0\le P(x^{(0)})-P(x^*)\), we can write
Now we determine a lower bound for \({\mathbb {E}} \left( \sum _{k=1}^K P(x^{(k)})\right) \). From the inequality (8), we have that \({\mathbb {E}}\left( P(x^{(k)})-P(x^{(k+1)})|{\mathcal {F}}_k\right) +\eta _k\ge 0\) and, hence, by considering the total expectation we obtain \({\mathbb {E}}\left( P(x^{(k)})-P(x^{(k+1)})\right) +{\mathbb {E}}(\eta _k)\ge 0\). Thus, we have
Then, we can write
Consequently, we can conclude that
\(\square \)
Appendix B Hyperparameter Settings for Hybrid Methods
For the Prox-SVRG method we use the hyperparameter setting proposed in [27], i.e., \({\overline{N}}=1\), \(m=2N\), where m is the number of Prox-SVRG inner iterations. This means that a full gradient has to be computed every two epochs. As for the fixed steplength \({\overline{\alpha }}\), we tried the values suggested in the experimental part of [27], i.e., \(\alpha =\{ \frac{1}{{\hat{L}}},\frac{0.1}{{\hat{L}}},\frac{0.01}{{\hat{L}}}\}\), where \({\hat{L}}\) is an approximation of the Lipschitz constant L of \(\nabla F\). In Table we report the best steplength values obtained for all the test problems.
For the Prox-SARAH method we use the hyperparameter setting specified in [22] where, by borrowing the notation of the referred paper, \(q=2+0.01+(\frac{1}{100})\), \(C=\frac{q^2}{(q^2+8){\hat{L}}^2\gamma ^2}\) and the values for the other hyperparameters are shown in Table .
For the Prox-Spider-boost method we use the hyperparameter setting specified in [26] and the values for the hyperparameters are shown in Table .
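The settings stated above can be collected in a small helper for reproducibility. Here \({\hat{L}}\) and N are problem-dependent inputs, and the value of \(\gamma \) below is a placeholder: its actual value is fixed in the cited papers, not here.

```python
def hybrid_method_settings(L_hat, N, gamma=0.95):
    """Collect the hyperparameter choices described in this appendix.

    L_hat approximates the Lipschitz constant of grad F, N is the number
    of samples; gamma is a placeholder value (set in the cited papers).
    """
    prox_svrg = {
        "N_bar": 1,
        "m": 2 * N,  # inner iterations: a full gradient every two epochs
        "alpha_grid": [1.0 / L_hat, 0.1 / L_hat, 0.01 / L_hat],
    }
    q = 2 + 0.01 + 1.0 / 100  # as specified for Prox-SARAH above
    prox_sarah = {
        "q": q,
        "C": q**2 / ((q**2 + 8) * L_hat**2 * gamma**2),
    }
    return prox_svrg, prox_sarah

# Example with illustrative placeholder values of L_hat and N:
svrg, sarah = hybrid_method_settings(L_hat=10.0, N=1000)
```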
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Franchini, G., Porta, F., Ruggiero, V. et al. A Line Search Based Proximal Stochastic Gradient Algorithm with Dynamical Variance Reduction. J Sci Comput 94, 23 (2023). https://doi.org/10.1007/s10915-022-02084-3