A new large-scale learning algorithm for generalized additive models

Abstract

Additive models play an important role in machine learning due to the flexibility and interpretability of their prediction functions. However, training large-scale additive models is challenging, and scaling them up has remained an open problem. To address this problem, in this paper we propose a new doubly stochastic optimization algorithm for solving generalized additive models (DSGAM). We first propose a generalized formulation of additive models that does not require the orthogonality hypothesis on the basis functions. We then propose a wrapper algorithm to optimize the generalized additive models. Importantly, we introduce a doubly stochastic gradient algorithm (DSG) to solve an inner subproblem in the wrapper algorithm, which scales well in sample size and dimensionality simultaneously. Finally, we prove a fast convergence rate for our DSGAM algorithm. Experimental results on various large-scale benchmark datasets not only confirm the fast convergence of our DSGAM algorithm, but also show a large reduction in computational time compared with existing algorithms, while retaining similar generalization performance.
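For intuition, the core "doubly stochastic" idea can be sketched in a few lines: each update samples both a data point and a block of coordinates, so neither a full pass over the data nor a full-dimensional gradient is ever needed. The following Python sketch is illustrative only and not the authors' implementation; the helper `grad_fi` is an assumption:

```python
import numpy as np

def doubly_stochastic_step(theta, grad_fi, n_samples, n_blocks, gamma, rng):
    """One update that is stochastic in BOTH samples and coordinates.

    `grad_fi(theta, i)` is assumed to return the gradient of the i-th
    per-sample loss at `theta` (a hypothetical helper).
    """
    i = rng.integers(n_samples)                      # sample a data point
    blocks = np.array_split(np.arange(theta.size), n_blocks)
    J = blocks[rng.integers(n_blocks)]               # sample a coordinate block
    g = grad_fi(theta, i)
    theta[J] -= gamma * g[J]                         # update only the block J
    return theta
```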

Availability of data and material

Datasets are available at the UCI and LIBSVM repositories, and at http://automl.chalearn.org/data/.

Notes

  1. Generally speaking, problem (3) approaches the original constrained problem (2) as the penalty parameter \(\nu\) goes to infinity. In practice, formulation (3) yields a good solution when the penalty parameter \(\nu\) is set to a finite value, which is confirmed by our experimental results.
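As a one-dimensional toy illustration of this penalty argument (not from the paper): minimizing \((x-2)^2\) subject to \(x=0\) via the penalized objective \((x-2)^2 + \nu x^2\) gives the closed-form minimizer \(x(\nu ) = 2/(1+\nu )\), which approaches the constrained solution \(x^*=0\) as \(\nu\) grows, while a finite \(\nu\) already gives a good approximation:

```python
# Penalized minimizer of (x - 2)^2 + nu * x^2 is x(nu) = 2 / (1 + nu),
# which tends to the constrained solution x* = 0 as nu -> infinity.
for nu in [1, 10, 100, 1000]:
    x_nu = 2.0 / (1.0 + nu)
    print(f"nu = {nu:4d}  ->  x(nu) = {x_nu:.4f}")
```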

References

  • Allen-Zhu, Z., & Li, Y. (2016). LazySVD: Even faster SVD decomposition yet without agonizing pain. In: Advances in Neural Information Processing Systems, pp. 974–982.

  • Aravkin, A. Y., Kambadur, A., Lozano, A. C., & Luss, R. (2014). Sparse quantile Huber regression for efficient and robust estimation. arXiv preprint arXiv:1402.4624

  • Beck, A. (2015). On the convergence of alternating minimization for convex programming with applications to iteratively reweighted least squares and decomposition schemes. SIAM Journal on Optimization, 25(1), 185–209.


  • Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010, Springer, pp. 177–186.

  • Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.


  • Callebaut, D. (1965). Generalization of the Cauchy–Schwarz inequality. Journal of Mathematical Analysis and Applications, 12(3), 491–494.


  • Chen, H., Wang, X., Deng, C., & Huang, H. (2017). Group sparse additive machine. In: Advances in Neural Information Processing Systems, pp. 197–207.

  • Chouldechova, A., & Hastie, T. (2015). Generalized additive model selection. arXiv preprint arXiv:1506.03850

  • Christmann, A., & Zhou, D. X. (2016). Learning rates for the risk of kernel-based quantile regression estimators in additive models. Analysis and Applications, 14(03), 449–477.


  • Dominici, F., McDermott, A., Zeger, S. L., & Samet, J. M. (2002). On the use of generalized additive models in time-series studies of air pollution and health. American Journal of Epidemiology, 156(3), 193–203.


  • Fang, C., Li, C. J., Lin, Z., & Zhang, T. (2018). Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In: Advances in Neural Information Processing Systems, pp. 687–697.

  • Fercoq, O., & Bianchi, P. (2019). A coordinate-descent primal-dual algorithm with large step size and possibly nonseparable functions. SIAM Journal on Optimization, 29(1), 100–134.


  • Gu, B., Huo, Z., & Huang, H. (2018). Asynchronous doubly stochastic group regularized learning. In: International Conference on Artificial Intelligence and Statistics (AISTATS 2018).

  • Gu, B., Xin, M., Huo, Z., & Huang, H. (2018). Asynchronous doubly stochastic sparse kernel learning. In: Thirty-Second AAAI Conference on Artificial Intelligence.

  • Hastie, T., & Tibshirani, R. (1986). Generalized additive models. Statistical Science, 1, 297–318.


  • Hastie, T., & Tibshirani, R. (1995). Generalized additive models for medical research. Statistical Methods in Medical Research, 4(3), 187–196.


  • Horn, R. A., & Johnson, C. R. (2012). Matrix analysis. Cambridge University Press.


  • Kandasamy, K., & Yu, Y. (2016). Additive approximations in high dimensional nonparametric regression via the SALSA. In: International Conference on Machine Learning.

  • Lin, Y., & Zhang, H. H. (2006). Component selection and smoothing in multivariate nonparametric regression. Annals of Statistics, 34(5), 2272–2297.


  • Liu, J., & Wright, S. J. (2015). Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM Journal on Optimization, 25(1), 351–376.


  • Meng, Q., Chen, W., Yu, J., Wang, T., Ma, Z., & Liu, T. Y. (2017). Asynchronous stochastic proximal optimization algorithms with variance reduction. In: AAAI, pp. 2329–2335.

  • Ouyang, H., He, N., Tran, L., & Gray, A. (2013). Stochastic alternating direction method of multipliers. In: International Conference on Machine Learning, pp. 80–88.

  • Park, H., Petkova, E., Tarpey, T., & Ogden, R. T. (2022). A sparse additive model for treatment effect-modifier selection. Biostatistics, 23(2), 412–429.


  • Piegl, L., & Tiller, W. (1995). B-spline basis functions. In: The NURBS Book, Springer, pp. 47–79.

  • Rahimi, A., & Recht, B. (2008). Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems, pp. 1177–1184.

  • Ravikumar, P., Liu, H., Lafferty, J., & Wasserman, L. (2007). SpAM: Sparse additive models. In: Proceedings of the 20th International Conference on Neural Information Processing Systems, Curran Associates Inc., pp. 1201–1208.

  • Shamir, O. (2015). A stochastic PCA and SVD algorithm with an exponential convergence rate. In: Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 144–152.

  • Wood, S. N., Goude, Y., & Shaw, S. (2015). Generalized additive models for large data sets. Journal of the Royal Statistical Society: Series C (Applied Statistics), 64(1), 139–155.


  • Wood, S. N., Li, Z., Shaddick, G., & Augustin, N. H. (2017). Generalized additive models for Gigadata: Modeling the UK black smoke network daily data. Journal of the American Statistical Association, 112(519), 1199–1210.


  • Yin, J., Chen, X., & Xing, E. P. (2012). Group sparse additive models. In: Proceedings of the International Conference on Machine Learning, NIH Public Access, p. 871.

  • Yuan, G. X., Ho, C. H., & Lin, C. J. (2012). An improved GLMNET for L1-regularized logistic regression. Journal of Machine Learning Research, 13, 1999–2030.


  • Zhao, T., Yu, M., Wang, Y., Arora, R., & Liu, H. (2014). Accelerated mini-batch randomized block coordinate descent method. In: Advances in Neural Information Processing Systems.

  • Zhong, W., & Kwok, J. (2014). Fast stochastic alternating direction method of multipliers. In: International Conference on Machine Learning, pp. 46–54.


Funding

Bin Gu was partially supported by the National Natural Science Foundation of China (No. 62076138) and the Six Talent Peaks Project of Jiangsu Province (No. XYDXX-042).

Author information


Contributions

BG contributed to the conception and the design of the method. CZ contributed to running the experiments and revising the paper. ZH contributed to running the experiments and writing the paper. HH contributed to providing feedback and guidance. All authors contributed to the results analysis and to the manuscript writing and revision.

Corresponding author

Correspondence to Heng Huang.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Ethical approval

Not Applicable.

Consent to participate

Not Applicable.

Consent for publication

Not Applicable.

Code availability

The code will be made publicly available after publication, upon agreement of all parties.

Additional information

Editor: Pradeep Ravikumar.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Convergence analysis of Theorem 1

Before providing the theoretical analysis, we give the definitions of \({\overline{\varvec{\theta }}}_{[t+1]}^{[s]}\) and \(F(\varvec{\theta })\) used in the analysis as follows.

\({\overline{\varvec{\theta }}}_{[t+1]}^{[s]}\): \({\overline{\varvec{\theta }}}_{[t+1]}^{[s]}\) is defined as:

$$\begin{aligned} {\overline{\varvec{\theta }}}_{[t+1]}^{[s]} {\mathop {=}\limits ^\textrm{def}} \varvec{\theta }_{[t]}^{[s]} -\gamma \varvec{v}^{[s]}_{[t]} \end{aligned}$$
(12)

\(F(\varvec{\theta })\): \(F(\varvec{\theta })\) is defined as:

$$\begin{aligned} F(\varvec{\theta })= \frac{1}{l} \sum _{i=1}^l {F_i(\varvec{\theta })} \end{aligned}$$
(13)

Based on (12), it is easy to verify that \(({\overline{\varvec{\theta }}}_{[t+1]}^{[s]})_{J(t)} =({\varvec{\theta }}_{[t+1]}^{[s]})_{J(t)}\). Thus, we have \(\mathbb {E}_{J(t)} ({\varvec{\theta }}_{[t+1]}^{[s]} - {\varvec{\theta }}_{[t]}^{[s]}) = \frac{1}{k} \left( {\overline{\varvec{\theta }}}_{[t+1]}^{[s]} - {\varvec{\theta }}_{[t]}^{[s]} \right)\), which means that \({\overline{\varvec{\theta }}}_{[t+1]}^{[s]} - {\varvec{\theta }}_{[t]}^{[s]}\) captures the expectation of \({\varvec{\theta }}_{[t+1]}^{[s]} - {\varvec{\theta }}_{[t]}^{[s]}\). This identity can be verified numerically, as in the sketch below.
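In the following sketch (illustrative names only; `v` stands in for the gradient estimate \(\varvec{v}_{[t]}^{[s]}\)), enumerating all k blocks of a fixed partition computes the expectation over \(J(t)\) exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, gamma = 12, 4, 0.1
theta = rng.normal(size=d)
v = rng.normal(size=d)                    # plays the role of v_t^s
theta_bar_next = theta - gamma * v        # full virtual step, Eq. (12)

blocks = np.array_split(np.arange(d), k)  # a fixed k-partition of coordinates
updates = []
for J in blocks:                          # enumerating blocks = exact E_{J(t)}
    theta_next = theta.copy()
    theta_next[J] = theta_bar_next[J]     # only block J is actually updated
    updates.append(theta_next - theta)
assert np.allclose(np.mean(updates, axis=0), (theta_bar_next - theta) / k)
```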

We first establish an inequality in Lemma 1. Based on Lemma 1, we prove that \(\mathbb {E}\Vert \varvec{\theta }_{[t-1]}^{[s]} - {\overline{\varvec{\theta }}}_{[t]}^{[s]} \Vert ^2 \le \rho \mathbb {E}\Vert \varvec{\theta }_{[t]}^{[s]} - {\overline{\varvec{\theta }}}_{[t+1]}^{[s]}\Vert ^2\) (Lemma 2), where \(\rho >1\) is a user-defined parameter. Then, we prove the monotonicity of the expectation of the objective, \(\mathbb {E} F(\varvec{\theta }_{[t+1]}^{[s]}) \le \mathbb {E} F(\varvec{\theta }_{[t]}^{[s]})\) (Lemma 3). Note that the analyses only consider the case \(|\mathcal {B}|=1\), without loss of generality; the case \(|\mathcal {B}|>1\) can be proved similarly.

Lemma 1

In each iteration of DSG, we have the following inequality.

$$\begin{aligned} \left\langle (\varvec{v}^{[s]}_{[t]})_{J(t)}, (\Delta ^{s}_{t})_{J(t)}\right\rangle \le - \frac{1}{\gamma } \Vert (\Delta ^{s}_{t})_{J(t)} \Vert ^2 \end{aligned}$$
(14)

Proof

The updating rule of line 7 in Algorithm 2 is equivalent to solving the following problem.

$$\begin{aligned} \varvec{\theta }_{[t+1]}^{[s]} = \arg \min _{\varvec{\theta }} P(\varvec{\theta }) = \arg \min _{\varvec{\theta }}{} & {} \left\langle (\varvec{v}^{[s]}_{[t]})_{J(t)}, (\varvec{\theta } - \varvec{\theta }_{[t]}^{[s]})_{J(t)}\right\rangle + \frac{1}{2\gamma } \left\| (\varvec{\theta } - \varvec{\theta }_{[t]}^{[s]})_{J(t)} \right\| ^2\nonumber \\ s.t.{} & {} \varvec{\theta }_{\setminus J(t)} = (\varvec{\theta }_{[t]}^{[s]})_{\setminus J(t)} \end{aligned}$$
(15)

Substituting \(\varvec{\theta }_{[t+1]}^{[s]}=\left[ \begin{array}{c} \left( \varvec{\theta }_{[t]}^{[s]}\right) _{J(t)} - \gamma \left( \varvec{v}^{[s]}_{[t]} \right) _{J(t)} \\ (\varvec{\theta }_{[t]}^{[s]})_{\setminus J(t)} \end{array} \right]\) into (15), we have that \(P(\varvec{\theta }_{[t+1]}^{[s]}) = - \frac{\gamma }{2} \left\| \varvec{v}^{[s]}_{[t]} \right\| ^2\). Thus, we have that

$$\begin{aligned} \left\langle (\varvec{v}^{[s]}_{[t]})_{J(t)}, (\varvec{\theta } - \varvec{\theta }_{[t]}^{[s]})_{J(t)}\right\rangle + \frac{1}{2\gamma } \left\| (\varvec{\theta } - \varvec{\theta }_{[t]}^{[s]})_{J(t)} \right\| ^2 \ge - \frac{\gamma }{2} \left\| \varvec{v}^{[s]}_{[t]} \right\| ^2 \end{aligned}$$
(16)

Setting \(\varvec{\theta }=\varvec{\theta }_{[t+1]}^{[s]}\) in (16) yields the conclusion. This completes the proof. \(\square\)
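Numerically, (14) holds with equality for the block gradient step, since \((\Delta ^{s}_{t})_{J(t)} = -\gamma (\varvec{v}^{[s]}_{[t]})_{J(t)}\). A minimal check with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)
d, gamma = 8, 0.05
v = rng.normal(size=d)                   # gradient estimate on all coordinates
J = np.array([1, 4, 6])                  # the sampled coordinate block J(t)
delta_J = -gamma * v[J]                  # block step from line 7 of Algorithm 2
lhs = v[J] @ delta_J                     # <(v)_J, (Delta)_J>
rhs = -np.linalg.norm(delta_J) ** 2 / gamma
assert np.isclose(lhs, rhs)              # (14) holds (with equality here)
```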

Lemma 2

The size of the partition of \(\{1,...,d \times p \}\) is k. Let \(\rho > 1\) be a constant, and define the quantity \(\theta = \frac{\rho ^{\frac{1}{2}} - \rho ^{\frac{m}{2}}}{1-\rho ^{\frac{1}{2}}}\). Suppose the step length \(\gamma >0\) satisfies \(\gamma \le \frac{k^{1/2}(1-\rho ^{-1})-2}{4 L_{nor} \left( 1+ \theta \right) }\). Then, under Assumptions 1 and 2, we have

$$\begin{aligned} \mathbb {E}\Vert \varvec{\theta }_{[t-1]}^{[s]} - {\overline{\varvec{\theta }}}_{[t]}^{[s]} \Vert ^2 \le \rho \mathbb {E}\Vert \varvec{\theta }_{[t]}^{[s]} - {\overline{\varvec{\theta }}}_{[t+1]}^{[s]}\Vert ^2 \end{aligned}$$
(17)

Proof

According to (A.8) in Liu and Wright (2015), we have

$$\begin{aligned}{} & {} \Vert \varvec{\theta }_{[t-1]}^{[s]} -{\overline{\varvec{\theta }}}_{[t]}^{[s]}\Vert ^2 - \Vert \varvec{\theta }_{[t]}^{[s]} -{\overline{\varvec{\theta }}}_{[t+1]}^{[s]}\Vert ^2\nonumber \\{} & {} \quad \le 2 \Vert \varvec{\theta }_{[t-1]}^{[s]} -{\overline{\varvec{\theta }}}_{[t]}^{[s]} \Vert \Vert \varvec{\theta }_{[t]}^{[s]} -{\overline{\varvec{\theta }}}_{[t+1]}^{[s]} - \varvec{\theta }_{[t-1]}^{[s]} + {\overline{\varvec{\theta }}}_{[t]}^{[s]} \Vert \end{aligned}$$
(18)

The second factor on the right-hand side of (18) is bounded as follows when \(\mathcal {B}=\{i_t\}\), i.e., when a single sample and a single coordinate block are selected.

$$\begin{aligned}{} & {} \Vert \varvec{\theta }_{[t]}^{[s]} -{\overline{\varvec{\theta }}}_{[t+1]}^{[s]} - \varvec{\theta }_{[t-1]}^{[s]} + {\overline{\varvec{\theta }}}_{[t]}^{[s]} \Vert \nonumber \\{} & {} \quad = \left\| \varvec{\theta }_{[t]}^{[s]} - \varvec{\theta }_{[t]}^{[s]} + \gamma \varvec{{v}}_{[t]}^{[s]} - \varvec{\theta }_{[t-1]}^{[s]} + \varvec{\theta }_{[t-1]}^{[s]} - \gamma \varvec{v}_{[t-1]}^{[s]} \right\| \nonumber \\{} & {} \quad \le \Vert \varvec{\theta }_{[t]}^{[s]} - \varvec{\theta }_{[t-1]}^{[s]} \Vert + \gamma \left\| \varvec{{v}}_{[t]}^{[s]} - \varvec{v}_{[t-1]}^{[s]} \right\| =\Vert \varvec{\theta }_{[t]}^{[s]} - \varvec{\theta }_{[t-1]}^{[s]} \Vert \nonumber \\{} & {} \quad + \gamma \left\| \nabla F_{i_t}(\varvec{\theta }_{[t]}^{[s]})- \nabla F_{i_t}(\varvec{\theta }^{[s-1]}) \right. \nonumber \\{} & {} \left. \quad + \varvec{\mu }^{[s-1]} - \nabla F_{i_{t-1}}(\varvec{\theta }_{[t-1]}^{[s]})+ \nabla F_{i_{t-1}}(\varvec{\theta }^{[s-1]}) - \varvec{\mu }^{[s-1]} \right\| \nonumber \\{} & {} \quad = \Vert \varvec{\theta }_{[t]}^{[s]} - \varvec{\theta }_{[t-1]}^{[s]} \Vert + \gamma \left\| \nabla F_{i_t}(\varvec{\theta }_{[t]}^{[s]}) \right. \nonumber \\{} & {} \left. \quad - \nabla F_{i_t}(\varvec{\theta }^{[s-1]}) - \nabla F_{i_{t-1}}(\varvec{\theta }_{[t-1]}^{[s]}) + \nabla F_{i_{t-1}}(\varvec{\theta }^{[s-1]}) \right\| \nonumber \\{} & {} \quad \le \Vert \varvec{\theta }_{[t]}^{[s]} - \varvec{\theta }_{[t-1]}^{[s]} \Vert \nonumber \\{} & {} \quad + \gamma \left\| \nabla F_{i_t}(\varvec{\theta }_{[t]}^{[s]}) -\nabla F_{i_t}(\varvec{\theta }^{[s-1]}) \right\| + \gamma \left\| \nabla F_{i_{t-1}}(\varvec{\theta }_{[t-1]}^{[s]}) - \nabla F_{i_{t-1}}(\varvec{\theta }^{[s-1]}) \right\| \nonumber \\{} & {} \quad \le \Vert \varvec{\theta }_{[t]}^{[s]} - \varvec{\theta }_{[t-1]}^{[s]} \Vert + {\gamma L_{nor}} \left( \Vert \varvec{\theta }_{[t]}^{[s]} - \varvec{\theta }^{[s-1]} \Vert + \Vert \varvec{\theta }_{[t-1]}^{[s]} - \varvec{\theta }^{[s-1]} \Vert \right) \nonumber \\{} & {} \quad \le \Vert \varvec{\theta }_{[t]}^{[s]} - \varvec{\theta }_{[t-1]}^{[s]} \Vert + 2 \gamma L_{nor}\sum _{t' = 0 }^{t-1} \Vert \Delta ^s_{t'} \Vert \end{aligned}$$
(19)

where the first and second inequalities use the triangle inequality \(\Vert a_1+a_2 \Vert \le \Vert a_1 \Vert +\Vert a_2 \Vert\), the third inequality uses Assumption 1, and the final inequality comes from \(\Vert \varvec{\theta }_{[t]}^{[s]} - \varvec{\theta }^{[s-1]} \Vert =\Vert \sum _{t' = 0 }^{t-1} \Delta ^s_{t'} \Vert \le \sum _{t' = 0 }^{t-1}\Vert \Delta ^s_{t'} \Vert\).

If \(t=1\), according to (19), we have

$$\begin{aligned} \Vert \varvec{\theta }_{[1]}^{[s]} -\overline{\varvec{\theta }}_{[2]}^{[s]} - \varvec{\theta }_{[0]}^{[s]} + \overline{\varvec{\theta }}_{[1]}^{[s]} \Vert \le \Vert \varvec{\theta }_{[1]}^{[s]} - \varvec{\theta }_{[0]}^{[s]} \Vert + 2\gamma L_{nor} \Vert \Delta ^s_{0} \Vert \end{aligned}$$
(20)

Substituting (20) into (18) and taking expectations, we have

$$\begin{aligned}{} & {} \mathbb {E} \Vert \varvec{\theta }_{[0]}^{[s]} -\overline{\varvec{\theta }}_{[1]}^{[s]}\Vert ^2 - \mathbb {E} \Vert \varvec{\theta }_{[1]}^{[s]} -\overline{\varvec{\theta }}_{[2]}^{[s]}\Vert ^2 \le 2 \mathbb {E} \left( \Vert \varvec{\theta }_{[0]}^{[s]} -\overline{\varvec{\theta }}_{[1]}^{[s]} \Vert \Vert \varvec{\theta }_{[1]}^{[s]} -\overline{\varvec{\theta }}_{[2]}^{[s]} - \varvec{\theta }_{[0]}^{[s]} + \overline{\varvec{\theta }}_{[1]}^{[s]} \Vert \right) \nonumber \\{} & {} \quad \le 2 \mathbb {E} \left( \Vert \varvec{\theta }_{[0]}^{[s]} -\overline{\varvec{\theta }}_{[1]}^{[s]} \Vert \Vert \varvec{\theta }_{[1]}^{[s]} - \varvec{\theta }_{[0]}^{[s]} \Vert \right) + 4\gamma L_{nor} \mathbb {E} \left( \Vert \varvec{\theta }_{[0]}^{[s]} -\overline{\varvec{\theta }}_{[1]}^{[s]} \Vert \Vert \Delta ^s_{0} \Vert \right) \nonumber \\{} & {} \quad \le 2 k^{-\frac{1}{2}} \mathbb {E} \left( \Vert \varvec{\theta }_{[0]}^{[s]} -\overline{\varvec{\theta }}_{[1]}^{[s]} \Vert ^2 \right) + 4\gamma L_{nor} \mathbb {E} \left( \Vert \varvec{\theta }_{[0]}^{[s]} -\overline{\varvec{\theta }}_{[1]}^{[s]} \Vert \Vert \Delta ^s_{0} \Vert \right) \end{aligned}$$
(21)

where the last inequality uses (A.13) in Liu and Wright (2015). Further, we bound \(\mathbb {E} \left( \Vert \varvec{\theta }_{[t]}^{[s]} -{\overline{\varvec{\theta }}}_{[t+1]}^{[s]} \Vert \Vert \Delta ^s_{t} \Vert \right)\) from above as

$$\begin{aligned}{} & {} \mathbb {E} \left( \Vert \varvec{\theta }_{[t]}^{[s]} -{\overline{\varvec{\theta }}}_{[t+1]}^{[s]} \Vert \Vert \Delta ^s_{t} \Vert \right) \le \frac{1}{2}\mathbb {E} \left( k^{-\frac{1}{2}} \Vert \varvec{\theta }_{[t]}^{[s]} -{\overline{\varvec{\theta }}}_{[t+1]}^{[s]} \Vert ^2 + k^{ \frac{1}{2}} \Vert \Delta ^s_{t} \Vert ^2 \right) \nonumber \\{} & {} \quad = \frac{1}{2} \mathbb {E} \left( k^{-\frac{1}{2}} \Vert \varvec{\theta }_{[t]}^{[s]} -{\overline{\varvec{\theta }}}_{[t+1]}^{[s]} \Vert ^2 + k^{\frac{1}{2}} \mathbb {E}_{J(t)} \Vert \Delta ^s_{t} \Vert ^2 \right) \nonumber \\{} & {} \quad = \frac{1}{2} \mathbb {E} \left( k^{-\frac{1}{2}} \Vert \varvec{\theta }_{[t]}^{[s]} -{\overline{\varvec{\theta }}}_{[t+1]}^{[s]} \Vert ^2 + k^{- \frac{1}{2}}\mathbb {E} \Vert \varvec{\theta }_{[t]}^{[s]} -{\overline{\varvec{\theta }}}_{[t+1]}^{[s]} \Vert ^2 \right) \nonumber \\{} & {} \quad = k^{- \frac{1}{2}}\mathbb {E} \Vert \varvec{\theta }_{[t]}^{[s]} -{\overline{\varvec{\theta }}}_{[t+1]}^{[s]} \Vert ^2 \end{aligned}$$
(22)

Substituting (22) into (21), we have

$$\begin{aligned} \mathbb {E} \Vert \varvec{\theta }_{[0]}^{[s]} -\overline{\varvec{\theta }}_{[1]}^{[s]}\Vert ^2 - \mathbb {E} \Vert \varvec{\theta }_{[1]}^{[s]} -\overline{\varvec{\theta }}_{[2]}^{[s]}\Vert ^2 \le k^{-\frac{1}{2}} \left( 2 + 4\gamma L_{nor} \right) \mathbb {E} \left( \Vert \varvec{\theta }_{[0]}^{[s]} -\overline{\varvec{\theta }}_{[1]}^{[s]} \Vert ^2 \right) \end{aligned}$$
(23)

which implies that

$$\begin{aligned} \mathbb {E} \Vert \varvec{\theta }_{[0]}^{[s]} -\overline{\varvec{\theta }}_{[1]}^{[s]}\Vert ^2 \le \left( 1 - \frac{ 2 + 4\gamma L_{nor} }{\sqrt{k}} \right) ^{-1} \mathbb {E} \Vert \varvec{\theta }_{[1]}^{[s]} -\overline{\varvec{\theta }}_{[2]}^{[s]}\Vert ^2 \le \rho \mathbb {E} \Vert \varvec{\theta }_{[1]}^{[s]} -\overline{\varvec{\theta }}_{[2]}^{[s]}\Vert ^2 \end{aligned}$$
(24)

where the last inequality follows from the fact \(\rho ^{-1} \le 1 - \frac{ 2 + 4\gamma L_{nor} }{\sqrt{k}} \Leftrightarrow \gamma \le \frac{k^{1/2}(1-\rho ^{-1})-2}{4 L_{nor} }\). Thus, we have (17) for \(t=1\).

Next, we consider the cases for \(t>1\). For \(t' \le t-1\) and any \(\beta >0\), we have

$$\begin{aligned}{} & {} \mathbb {E} \left( \Vert \varvec{\theta }_{[t]}^{[s]} -{\overline{\varvec{\theta }}}_{[t+1]}^{[s]} \Vert \Vert \Delta ^s_{t'} \Vert \right) \le \frac{1}{2}\mathbb {E} \left( k^{-\frac{1}{2}} \beta ^{-1} \Vert \varvec{\theta }_{[t]}^{[s]} -{\overline{\varvec{\theta }}}_{[t+1]}^{[s]} \Vert ^2 + k^{ \frac{1}{2}} \beta \Vert \Delta ^s_{t'} \Vert ^2 \right) \nonumber \\{} & {} \quad = \frac{1}{2} \mathbb {E} \left( k^{-\frac{1}{2}} \beta ^{-1} \Vert \varvec{\theta }_{[t]}^{[s]} -{\overline{\varvec{\theta }}}_{[t+1]}^{[s]} \Vert ^2 + k^{ \frac{1}{2}} \beta \mathbb {E}_{J(t)} \Vert \Delta ^s_{t'} \Vert ^2 \right) \nonumber \\{} & {} \quad = \frac{1}{2} \mathbb {E} \left( k^{-\frac{1}{2}} \beta ^{-1} \Vert \varvec{\theta }_{[t]}^{[s]} -{\overline{\varvec{\theta }}}_{[t+1]}^{[s]} \Vert ^2 + k^{-\frac{1}{2}} \beta \mathbb {E} \Vert \varvec{\theta }_{[t']}^{[s]} -{\overline{\varvec{\theta }}}_{[t'+1]}^{[s]} \Vert ^2 \right) \nonumber \\{} & {} \quad \le \frac{1}{2} \mathbb {E} \left( k^{-\frac{1}{2}} \beta ^{-1} \Vert \varvec{\theta }_{[t]}^{[s]} -{\overline{\varvec{\theta }}}_{[t+1]}^{[s]} \Vert ^2 + k^{-\frac{1}{2}} \rho ^{t-t'} \beta \mathbb {E} \Vert \varvec{\theta }_{[t]}^{[s]} -{\overline{\varvec{\theta }}}_{[t+1]}^{[s]} \Vert ^2 \right) \nonumber \\{} & {} \quad {\mathop {\le }\limits ^{\beta =\rho ^{\frac{t'-t}{2}} }} k^{-\frac{1}{2}} \rho ^{\frac{t-t'}{2}} \mathbb {E} \Vert \varvec{\theta }_{[t]}^{[s]} -{\overline{\varvec{\theta }}}_{[t+1]}^{[s]} \Vert ^2 \end{aligned}$$
(25)

We assume that (17) holds for all \(t' <t\). By substituting (19) into (18) and taking expectations on both sides of (18), we obtain

$$\begin{aligned}{} & {} \mathbb {E} \left( \Vert \varvec{\theta }_{[t-1]}^{[s]} -{\overline{\varvec{\theta }}}_{[t]}^{[s]}\Vert ^2 - \Vert \varvec{\theta }_{[t]}^{[s]} -{\overline{\varvec{\theta }}}_{[t+1]}^{[s]}\Vert ^2 \right) \nonumber \\{} & {} \quad \le 2 \mathbb {E} \left( \Vert \varvec{\theta }_{[t-1]}^{[s]} -{\overline{\varvec{\theta }}}_{[t]}^{[s]} \Vert \Vert \varvec{\theta }_{[t]}^{[s]} -{\overline{\varvec{\theta }}}_{[t+1]}^{[s]} - \varvec{\theta }_{[t-1]}^{[s]} + {\overline{\varvec{\theta }}}_{[t]}^{[s]} \Vert \right) \nonumber \\{} & {} \quad \le 2 \mathbb {E} \left( \Vert \varvec{\theta }_{[t-1]}^{[s]} -{\overline{\varvec{\theta }}}_{[t]}^{[s]} \Vert \left( \Vert \varvec{\theta }_{[t]}^{[s]} - \varvec{\theta }_{[t-1]}^{[s]} \Vert + 2 \gamma L_{nor}\sum _{t' = 0 }^{t-1} \Vert \Delta ^s_{t'} \Vert \right) \right) \nonumber \\{} & {} \quad = 2 \mathbb {E} \left( \Vert \varvec{\theta }_{[t-1]}^{[s]} -{\overline{\varvec{\theta }}}_{[t]}^{[s]} \Vert \Vert \varvec{\theta }_{[t-1]}^{[s]} -{\overline{\varvec{\theta }}}_{[t]}^{[s]} \Vert \right) + 4 \gamma \mathbb {E} \left( L_{nor} \sum _{t' = 0 }^{t-1} \Vert \varvec{\theta }_{[t-1]}^{[s]} -{\overline{\varvec{\theta }}}_{[t]}^{[s]} \Vert \Vert \Delta ^s_{t'} \Vert \right) \nonumber \\{} & {} \quad \le 2 k^{-1/2} \mathbb {E}\left( \Vert \varvec{\theta }_{[t-1]}^{[s]} -{\overline{\varvec{\theta }}}_{[t]}^{[s]} \Vert ^2 \right) + 4 \gamma k^{-1/2} \mathbb {E}\left( \Vert \varvec{\theta }_{[t-1]}^{[s]} -{\overline{\varvec{\theta }}}_{[t]}^{[s]} \Vert ^2 \right) L_{nor}\sum _{t' = 0 }^{t-1}\rho ^{\frac{t-1-t'}{2}}\nonumber \\{} & {} \quad \le k^{-1/2} \mathbb {E}\left( \Vert \varvec{\theta }_{[t-1]}^{[s]} -{\overline{\varvec{\theta }}}_{[t]}^{[s]} \Vert ^2 \right) \left( 2 + 4 \gamma L_{nor} \left( 1+ \frac{\rho ^{\frac{1}{2}} - \rho ^{\frac{m}{2}}}{1-\rho ^{\frac{1}{2}}} \right) \right) \nonumber \\{} & {} \quad = k^{-1/2} \mathbb {E}\left( \Vert \varvec{\theta }_{[t-1]}^{[s]} -{\overline{\varvec{\theta }}}_{[t]}^{[s]} \Vert ^2 \right) \cdot \left( 2 + 4 \gamma L_{nor} \left( 1+ \theta \right) \right) \end{aligned}$$
(26)

where the third inequality uses (25). Based on (26), we have that

$$\begin{aligned}{} & {} \mathbb {E} \left( \Vert \varvec{\theta }_{[t-1]}^{[s]} -{\overline{\varvec{\theta }}}_{[t]}^{[s]}\Vert ^2 \right) \nonumber \\{} & {} \quad \le \left( 1- k^{-1/2} \left( 2+4 \gamma L_{nor} \left( 1+ \theta \right) \right) \right) ^{-1} \cdot \mathbb {E} \left( \Vert \varvec{\theta }_{[t]}^{[s]} -{\overline{\varvec{\theta }}}_{[t+1]}^{[s]}\Vert ^2 \right) \nonumber \\{} & {} \quad \le \rho \mathbb {E} \left( \Vert \varvec{\theta }_{[t]}^{[s]} -{\overline{\varvec{\theta }}}_{[t+1]}^{[s]}\Vert ^2 \right) \end{aligned}$$
(27)

where the last inequality follows from

$$\begin{aligned} \rho ^{-1} \le 1- k^{-1/2} \left( 2+4 \gamma L_{nor} \left( 1+ \theta \right) \right) \Leftrightarrow \gamma \le \frac{k^{1/2}(1-\rho ^{-1})-2}{4 L_{nor} \left( 1+ \theta \right) } \end{aligned}$$
(28)

This completes the proof. \(\square\)

Lemma 3

Let \(\rho > 1\) be a constant, let k be the size of the partition of \(\{1,...,d \times p \}\), and define the quantity \(\theta = \frac{\rho ^{\frac{1}{2}} - \rho ^{\frac{m}{2}}}{1-\rho ^{\frac{1}{2}}}\). Suppose the step length \(\gamma\) satisfies \(\gamma \le \min \left\{ \frac{1}{\frac{L_{\max }}{2} +\frac{2 L_{\max } \theta }{k^{1/2}}}, \frac{k^{1/2}(1-\rho ^{-1})-2}{4 L_{nor} \left( 1+ \theta \right) } \right\}\). Under Assumptions 1 and 2, the expectation of the objective function \(\mathbb {E} F(\varvec{\theta }_{[t]}^{[s]})\) is monotonically decreasing, i.e., \(\mathbb {E} F(\varvec{\theta }_{[t+1]}^{[s]}) \le \mathbb {E} F(\varvec{\theta }_{[t]}^{[s]})\).

Proof

Taking the expectation of \(F(\varvec{\theta }_{[t+1]}^{[s]})\) with respect to J(t), we have that

$$\begin{aligned}{} & {} \mathbb {E}_{J(t)} F(\varvec{\theta }_{[t+1]}^{[s]}) = \mathbb {E}_{J(t)} F(\varvec{\theta }_{[t]}^{[s]} + \Delta _t^s) \nonumber \\{} & {} \quad \le \mathbb {E}_{J(t)} \left( F(\varvec{\theta }_{[t]}^{[s]} ) + \left\langle \nabla _{J(t)} F(\varvec{\theta }_{[t]}^{[s]}), ( \Delta _t^s)_{J(t)}\right\rangle + \frac{L_{\max }}{2} \left\| ( \Delta _t^s)_{J(t)} \right\| ^2 \right) \nonumber \\{} & {} \quad = F(\varvec{\theta }_{[t]}^{[s]} ) + \mathbb {E}_{J(t)} \left( \left\langle \nabla _{J(t)} F(\varvec{\theta }_{[t]}^{[s]}), (\Delta _t^s)_{J(t)}\right\rangle \right. \nonumber \\{} & {} \left. \quad + \frac{L_{\max }}{2} \left\| ( \Delta _t^s)_{J(t)} \right\| ^2 \right) \nonumber \\{} & {} \quad = F(\varvec{\theta }_{[t]}^{[s]} ) + \mathbb {E}_{J(t)} \left( \left\langle (\varvec{v}^s_t)_{J(t)}, ( \Delta _t^s)_{J(t)}\right\rangle + \frac{L_{\max }}{2} \left\| ( \Delta _t^s)_{J(t)} \right\| ^2 \right. \nonumber \\{} & {} \left. \quad +\left\langle \nabla _{J(t)} F(\varvec{\theta }_{[t]}^{[s]}) -(\varvec{v}^s_t)_{J(t)}, ( \Delta _t^s)_{J(t)}\right\rangle \right) \nonumber \\{} & {} \quad \le F(\varvec{\theta }_{[t]}^{[s]}) + \mathbb {E}_{J(t)} \left( -\frac{1}{\gamma } \left\| ( \Delta _t^s)_{J(t)} \right\| ^2 + \frac{L_{\max }}{2} \left\| ( \Delta _t^s)_{J(t)} \right\| ^2 \right. \nonumber \\{} & {} \quad \left. +\left\langle \nabla _{J(t)} F(\varvec{\theta }_{[t]}^{[s]}) -(\varvec{v}^s_t)_{J(t)}, ( \Delta _t^s)_{J(t)}\right\rangle \right) \nonumber \\{} & {} \quad = F(\varvec{\theta }_{[t]}^{[s]}) + \mathbb {E}_{J(t)} \left( \left( \frac{L_{\max }}{2} -\frac{1}{\gamma } \right) \left\| ( \Delta _t^s)_{J(t)} \right\| ^2 \right) \nonumber \\{} & {} \quad +\mathbb {E}_{J(t)} \left\langle \nabla _{J(t)} F(\varvec{\theta }_{[t]}^{[s]}) -(\varvec{v}^s_t)_{J(t)}, ( \Delta _t^s)_{J(t)}\right\rangle \nonumber \\{} & {} \quad = F(\varvec{\theta }_{[t]}^{[s]}) + \frac{1}{k}\left( \frac{L_{\max }}{2} -\frac{ 1}{\gamma } \right) \Vert \overline{\varvec{\theta }}_{[t+1]}^{[s]} -\varvec{\theta }_{[t]}^{[s]} \Vert ^2\nonumber \\{} & {} \quad +\mathbb {E}_{J(t)} \left\langle \nabla _{J(t)} F(\varvec{\theta }_{[t]}^{[s]}) -(\varvec{v}^s_t)_{J(t)}, ( \Delta _t^s)_{J(t)}\right\rangle \end{aligned}$$
(29)

where the first inequality uses (6), and the second inequality uses (14) in Lemma 1.

Considering the expectation of the last term on the right-hand side of (29), we have

$$\begin{aligned}{} & {} \mathbb {E} \left\langle \nabla _{J(t)} F(\varvec{\theta }_{[t]}^{[s]}) -(\varvec{v}^s_t)_{J(t)}, (\Delta _t^s)_{J(t)}\right\rangle \nonumber \\{} & {} \quad = \mathbb {E} \left\langle \frac{1}{|\mathcal {B}'|}\sum _{i\in \mathcal {B}'} \nabla _{J(t)} F_i(\varvec{\theta }_{[t]}^{[s]}) -(\varvec{v}^s_t)_{J(t)}, (\Delta _t^s)_{J(t)}\right\rangle \nonumber \\{} & {} \quad = \mathbb {E} \left\langle \frac{1}{|\mathcal {B}'|}\sum _{i\in \mathcal {B}'} \nabla _{J(t)} F_i(\varvec{\theta }_{[t]}^{[s]}) - \left( \nabla F_{i_{t}}(\varvec{\theta }_{[t]}^{[s]}) \right. \right. \nonumber \\{} & {} \quad \left. \left. - \nabla _{J(t)} F_{i_{t}}(\varvec{\theta }^{[s-1]}) + \varvec{\mu }^{[s-1]} \right) _{J(t)}, ( \Delta _t^s)_{J(t)}\right\rangle \nonumber \\{} & {} \quad = \mathbb {E} \frac{1}{|\mathcal {B}'|}\sum _{i\in \mathcal {B}'} \left\langle \nabla _{J(t)} F_i(\varvec{\theta }_{[t]}^{[s]}) - \nabla _{J(t)} F_i(\varvec{\theta }^{[s-1]}), ( \Delta _t^s)_{J(t)} \right\rangle \nonumber \\{} & {} \quad + \mathbb {E} \left\langle \nabla _{J(t)} F_{i_{t}}(\varvec{\theta }_{[t]}^{[s]}) - \nabla _{J(t)} F_{i_{t}}(\varvec{\theta }^{[s-1]}), (\Delta _t^s)_{J(t)} \right\rangle \nonumber \\{} & {} \quad \le \mathbb {E} \frac{1}{|\mathcal {B}'|}\sum _{i\in \mathcal {B}'} \left( \left\| \nabla _{J(t)} F_i(\varvec{\theta }_{[t]}^{[s]}) - \nabla _{J(t)} F_i(\varvec{\theta }^{[s-1]})\right\| \left\| \Delta _t^s \right\| \right) \nonumber \\{} & {} \quad + \mathbb {E} \left( \left\| \nabla _{J(t)} F_{i_{t}}(\varvec{\theta }_{[t]}^{[s]}) - \nabla _{J(t)} F_{i_{t}}(\varvec{\theta }^{[s-1]})\right\| \left\| \Delta _t^s \right\| \right) \nonumber \\{} & {} \quad \le \frac{2L_{\max } }{k}\mathbb {E} \Vert \varvec{\theta }_{[t]}^{[s]} - \varvec{\theta }^{[s-1]} \Vert \Vert \overline{\varvec{\theta }}_{[t+1]}^{[s]} -\varvec{\theta }_{[t]}^{[s]} \Vert \nonumber \\{} & {} \quad \le \frac{2L_{\max } }{k}\mathbb {E} \sum _{t'=0}^{t-1} \Vert \Delta _{t'}^s \Vert \Vert \overline{\varvec{\theta }}_{[t+1]}^{[s]} -\varvec{\theta }_{[t]}^{[s]} \Vert \nonumber \\{} & {} \quad \le 2L_{\max }\sum _{t'=0}^{t-1} \frac{\rho ^{\frac{t-t'}{2}}}{k^{3/2}} \mathbb {E} \Vert \overline{\varvec{\theta }}_{[t+1]}^{[s]} -\varvec{\theta }_{[t]}^{[s]} \Vert ^2\nonumber \\{} & {} \quad \le 2L_{\max } k^{-3/2} \frac{\rho ^{\frac{1}{2}} - \rho ^{\frac{m}{2}}}{1-\rho ^{\frac{1}{2}}} \cdot \mathbb {E} \Vert \overline{\varvec{\theta }}_{[t+1]}^{[s]} -\varvec{\theta }_{[t]}^{[s]} \Vert ^2\nonumber \\{} & {} \quad = 2L_{\max } k^{-3/2} \theta \mathbb {E} \Vert \overline{\varvec{\theta }}_{[t+1]}^{[s]} -\varvec{\theta }_{[t]}^{[s]} \Vert ^2 \end{aligned}$$
(30)

where the first inequality uses the Cauchy–Schwarz inequality (Callebaut, 1965), the second inequality uses Assumption 1, the third inequality uses \(\Vert \sum _{i=1}^n a_i \Vert \le \sum _{i=1}^n \Vert a_i \Vert\), and the fourth inequality uses (25).

By taking expectations on both sides of (29) and substituting (30), we have

$$\begin{aligned}{} & {} \mathbb {E} F(\varvec{\theta }_{[t+1]}^{[s]})\nonumber \\{} & {} \quad \le F(\varvec{\theta }_{[t]}^{[s]}) + \frac{1}{k}\left( \frac{L_{\max }}{2} -\frac{ 1}{\gamma } \right) \Vert \overline{\varvec{\theta }}_{[t+1]}^{[s]} -\varvec{\theta }_{[t]}^{[s]} \Vert ^2 +\mathbb {E}_{J(t)} \left\langle \nabla _{J(t)} F(\varvec{\theta }_{[t]}^{[s]}) -(\varvec{v}^s_t)_{J(t)}, ( \Delta _t^s)_{J(t)}\right\rangle \nonumber \\{} & {} \quad \le \mathbb {E} F(\varvec{\theta }_{[t]}^{[s]})- \frac{1}{k} \cdot \left( \frac{ 1}{\gamma }- \frac{L_{\max }}{2} -\frac{2 L_{\max } \theta }{k^{1/2}} \right) \mathbb {E} \Vert \overline{\varvec{\theta }}_{[t+1]}^{[s]} -\varvec{\theta }_{[t]}^{[s]} \Vert ^2 \end{aligned}$$
(31)

where \(\frac{ 1}{\gamma }- \frac{L_{\max }}{2} -\frac{2 L_{\max } \theta }{k^{1/2}}\ge 0\) because \(\gamma ^{-1} \ge \frac{L_{\max }}{2} +\frac{2 L_{\max } \theta }{k^{1/2}}\). This completes the proof. \(\square\)
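Throughout these lemmas, \(\varvec{v}_{[t]}^{[s]} = \nabla F_{i_t}(\varvec{\theta }_{[t]}^{[s]}) - \nabla F_{i_t}(\varvec{\theta }^{[s-1]}) + \varvec{\mu }^{[s-1]}\) is the variance-reduced gradient estimate built from a snapshot \(\varvec{\theta }^{[s-1]}\) and its full gradient \(\varvec{\mu }^{[s-1]}\). A schematic Python sketch of one such stage, under the same assumptions as before (`grad_fi` is a hypothetical per-sample gradient oracle, not the authors' code):

```python
import numpy as np

def dsg_epoch(theta, grad_fi, n, k, gamma, m, rng):
    """One stage s: take a snapshot, then run m variance-reduced block updates."""
    snapshot = theta.copy()
    mu = np.mean([grad_fi(snapshot, i) for i in range(n)], axis=0)  # full grad
    blocks = np.array_split(np.arange(theta.size), k)
    for _ in range(m):
        i = rng.integers(n)                                # sampled example i_t
        J = blocks[rng.integers(k)]                        # sampled block J(t)
        v = grad_fi(theta, i) - grad_fi(snapshot, i) + mu  # estimate v_t^s
        theta[J] -= gamma * v[J]                           # update only block J
    return theta
```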

Now, we provide the proof of Theorem 1 as follows.

Proof

We have that

$$\begin{aligned}{} & {} \Vert \varvec{\theta }_{[t+1]}^{[s]} - \varvec{\theta }^* \Vert ^2 = \Vert \varvec{\theta }_{[t]}^{[s]} + \Delta _t^s - \varvec{\theta }^* \Vert ^2 \nonumber \\= & {} \Vert \varvec{\theta }_{[t]}^{[s]} - \varvec{\theta }^* \Vert ^2 - \Vert \Delta _t^s \Vert ^2 - 2 \langle \left( \varvec{\theta }^* -\varvec{\theta }_{[t]}^{[s]} - \Delta _t^s \right) _{J(t)},( \Delta _t^s)_{J(t)} \rangle \nonumber \\= & {} \Vert \varvec{\theta }_{[t]}^{[s]} - \varvec{\theta }^* \Vert ^2 - \Vert \Delta _t^s \Vert ^2 - 2 \langle \left( \varvec{\theta }^* -\varvec{\theta }_{[t]}^{[s]} - \Delta _t^s \right) _{J(t)}, -\gamma (\varvec{v}_t^s)_{J(t)} \rangle \nonumber \\= & {} \Vert \varvec{\theta }_{[t]}^{[s]} - \varvec{\theta }^* \Vert ^2 - \Vert \Delta _t^s \Vert ^2 + 2 \gamma \underbrace{ \left( \langle \left( \varvec{\theta }^* -\varvec{\theta }_{[t]}^{[s]} \right) _{J(t)},( {v}_t^s)_{J(t)} \rangle \right) }_{T_1} + 2 \gamma \underbrace{ \left( \langle \left( \Delta _{t}^s \right) _{J(t)},( {v}_t^s)_{J(t)} \rangle \right) }_{T_2} \end{aligned}$$
(32)

For the expectation of \(T_1\), we have

$$\begin{aligned} \mathbb {E}(T_1)= & {} \mathbb {E} { \left( \langle \left( \varvec{\theta }^* -\varvec{\theta }_{[t]}^{[s]} \right) _{J(t)},( {v}_t^s)_{J(t)} \rangle \right) }\nonumber \\= & {} \frac{1}{k}\mathbb {E} \langle \varvec{\theta }^* -\varvec{\theta }_{[t]}^{[s]}, {v}_t^s \rangle \nonumber \\= & {} \frac{1}{k}\mathbb {E} \langle \varvec{\theta }^* -\varvec{\theta }_{[t]}^{[s]}, \nabla F_{i_{t}}(\varvec{\theta }_{[t]}^{[s]}) - \nabla F_{i_{t}}(\varvec{\theta }^{[s-1]}) + \varvec{\mu }^{[s-1]} \rangle \nonumber \\= & {} \frac{1}{k}\mathbb {E} \langle \varvec{\theta }^* -\varvec{\theta }_{[t]}^{[s]}, \nabla F_{i_{t}}(\varvec{\theta }_{[t]}^{[s]}) \rangle + \frac{1}{k} \langle \mathbb {E} ( \varvec{\theta }^* -\varvec{\theta }_{[t]}^{[s]} ), \mathbb {E} (- \nabla F_{i_{t}}(\varvec{\theta }^{[s-1]}) + \varvec{\mu }^{[s-1]} ) \rangle \nonumber \\= & {} \frac{1}{k}\mathbb {E} \langle \varvec{\theta }^* -\varvec{\theta }_{[t]}^{[s]}, \nabla F_{i_{t}}(\varvec{\theta }_{[t]}^{[s]}) \rangle \nonumber \\\le & {} \frac{1}{k}\mathbb {E} \left( F_{i_t}( \varvec{\theta }^*)- F_{i_{t}}(\varvec{\theta }_{[t]}^{[s]}) \right) \nonumber \\= & {} \frac{1}{k}\mathbb {E} \left( F( \varvec{\theta }^*)- F(\varvec{\theta }_{[t]}^{[s]}) \right) \end{aligned}$$
(33)

where the first inequality uses the convexity of \(F_i\). For the expectation of \(T_2\), we have

$$\begin{aligned}{} & {} \mathbb {E}(T_2) =\mathbb {E} \langle \left( \Delta _{t}^s \right) _{J(t)},( {v}_t^s)_{J(t)} \rangle \end{aligned}$$
(34)
$$\begin{aligned}{} & {} \quad = \mathbb {E} \langle \left( \Delta _{t} \right) _{J(t)}, \left( \nabla F_{i_{t}}(\varvec{\theta }_{[t]}^{[s]}) - \nabla F_{i_{t}}(\varvec{\theta }^{[s-1]}) + \varvec{\mu }^{[s-1]} \right) _{J(t)} \rangle \nonumber \\{} & {} \quad = \mathbb {E} \langle \left( \Delta _{t} \right) _{J(t)}, \left( \nabla F_{i_{t}}(\varvec{\theta }_{[t]}^{[s]}) - \nabla F_{i_{t}}(\varvec{\theta }^{[s-1]}) \right) _{J(t)} \rangle +\mathbb {E} \langle \left( \Delta _{t} \right) _{J(t)}, (\varvec{\mu }^{[s-1]})_{J(t)} \rangle \nonumber \\{} & {} \quad \le \frac{1}{k} \mathbb {E} \left( \Vert \overline{\varvec{\theta }}_{[t+1]}^{[s]} -\varvec{\theta }_{[t]}^{[s]} \Vert \left\| \nabla F_{i_{t}}(\varvec{\theta }_{[t]}^{[s]}) - \nabla F_{i_{t}}(\varvec{\theta }^{[s-1]}) \right\| \right) + \mathbb {E} \langle \left( \Delta _{t} \right) _{J(t)}, (\varvec{\mu }^{[s-1]})_{J(t)} \rangle \nonumber \\{} & {} \quad \le \frac{L_{res }}{k} \mathbb {E} \left( \Vert \overline{\varvec{\theta }}_{[t+1]}^{[s]} -\varvec{\theta }_{[t]}^{[s]} \Vert \Vert \varvec{\theta }_{[t]}^{[s]} - \varvec{\theta }^{[s-1]} \Vert \right) + \mathbb {E} \langle \left( \Delta _{t} \right) _{J(t)}, (\varvec{\mu }^{[s-1]})_{J(t)} \rangle \nonumber \\{} & {} \quad \le \frac{L_{res }}{k} \mathbb {E} \left( \sum _{t'=0}^{t-1} \Vert \overline{\varvec{\theta }}_{[t+1]}^{[s]} -\varvec{\theta }_{[t]}^{[s]} \Vert \Vert \Delta ^{s}_{t'} \Vert \right) + \mathbb {E} \langle \left( \Delta _{t} \right) _{J(t)}, (\varvec{\mu }^{[s-1]})_{J(t)} \rangle \nonumber \\{} & {} \quad \le \frac{L_{res}}{k^{3/2}} \sum _{t'=0}^{t-1} \rho ^{(t-t')/2} \mathbb {E}(\Vert \varvec{\theta }_{[t]}^{[s]}-{\overline{\varvec{\theta }}}_{[t+1]}^{[s]} \Vert ^2)+ \mathbb {E} \langle \left( \Delta _{t} \right) _{J(t)}, (\varvec{\mu }^{[s-1]})_{J(t)} \rangle \nonumber \\{} & {} \quad \le \frac{ L_{res} \theta }{k^{3/2}} \mathbb {E}(\Vert \varvec{\theta }_{[t]}^{[s]}-{\overline{\varvec{\theta }}}_{[t+1]}^{[s]} \Vert ^2)+ \mathbb {E} \langle \left( \Delta _{t} \right) _{J(t)}, (\varvec{\mu }^{[s-1]})_{J(t)} \rangle \end{aligned}$$
(35)

where the second inequality uses Assumption 1 and the fourth inequality uses (25). By substituting the upper bounds (33) and (35) into (32), we have

$$\begin{aligned} \mathbb {E}\Vert \varvec{\theta }_{[t+1]}^{[s]} - \varvec{\theta }^* \Vert ^2\le & {} \mathbb {E} \Vert \varvec{\theta }_{[t]}^{[s]} - \varvec{\theta }^* \Vert ^2 - \frac{1}{k} \mathbb {E}(\Vert \varvec{\theta }_{[t]}^{[s]}-{\overline{\varvec{\theta }}}_{[t+1]}^{[s]} \Vert ^2)\nonumber \\{} & {} + \frac{2 \gamma }{k} \mathbb {E} \left( F( \varvec{\theta }^*)- F(\varvec{\theta }_{[t]}^{[s]}) \right) + 2 \gamma \left( \frac{ L_{res} \theta }{k^{3/2}} \mathbb {E}(\Vert \varvec{\theta }_{[t]}^{[s]}\right. \nonumber \\{} & {} \left. -{\overline{\varvec{\theta }}}_{[t+1]}^{[s]} \Vert ^2) + \mathbb {E} \langle \left( \Delta _{t} \right) _{J(t)}, (\varvec{\mu }^{[s-1]})_{J(t)} \rangle \right) \nonumber \\= & {} \mathbb {E} \Vert \varvec{\theta }_{[t]}^{[s]} - \varvec{\theta }^* \Vert ^2 + \frac{2 \gamma }{k} \mathbb {E} \left( F( \varvec{\theta }^*)- F(\varvec{\theta }_{[t]}^{[s]}) \right) \nonumber \\{} & {} - \frac{1}{k} \left( 1- \frac{2 L_{res} \theta \gamma }{k^{1/2} }\right) \mathbb {E}(\Vert \varvec{\theta }_{[t]}^{[s]}-{\overline{\varvec{\theta }}}_{[t+1]}^{[s]} \Vert ^2) + 2 \gamma \mathbb {E} \langle \left( \Delta _{t} \right) _{J(t)}, (\varvec{\mu }^{[s-1]})_{J(t)} \rangle \end{aligned}$$
(36)

We consider a fixed stage \(s+1\) such that \(\varvec{\theta }_{[0]}^{[s+1]} = \varvec{\theta }_{[m]}^{[s]}\). By summing the inequality (36) over \(t = 0,\cdots ,m-1\), we obtain

$$\begin{aligned} \mathbb {E}\Vert \varvec{\theta }^{[s+1]} - \varvec{\theta }^* \Vert ^2\le & {} \mathbb {E}\Vert \varvec{\theta }^{[s]} - \varvec{\theta }^* \Vert ^2+ \sum _{t'=0}^{m-1}\frac{2 \gamma }{k} \mathbb {E} \left( F( \varvec{\theta }^*)\right. \nonumber \\{} & {} \left. - F(\varvec{\theta }^{[s+1]}_{[t']}) \right) - \sum _{t'=0}^{m-1} \frac{1}{k} \left( 1- \frac{2 L_{res} \theta \gamma }{k^{1/2} }\right) \cdot \mathbb {E}(\Vert \varvec{\theta }^{[s+1]}_{[t']}\nonumber \\{} & {} -\overline{\varvec{\theta }}_{[t'+1]}^{[s+1]} \Vert ^2) + 2 \gamma \sum _{t'=0}^{m-1} \mathbb {E} \left\langle \left( \Delta _{t'} \right) _{J(t')}, (\varvec{\mu }^{[s-1]})_{J(t')} \right\rangle \nonumber \\= & {} \mathbb {E}\Vert \varvec{\theta }^{[s]} - \varvec{\theta }^* \Vert ^2 + \sum _{t'=0}^{m-1}\frac{2 \gamma }{k} \mathbb {E} \left( F( \varvec{\theta }^*)- F(\varvec{\theta }^{[s+1]}_{[t']}) \right) \nonumber \\{} & {} - \sum _{t'=0}^{m-1} \frac{1}{k} \left( 1- \frac{2 L_{res} \theta \gamma }{k^{1/2} }\right) \cdot \mathbb {E}(\Vert \varvec{\theta }^{[s+1]}_{[t']}-\overline{\varvec{\theta }}_{[t'+1]}^{[s+1]} \Vert ^2) \nonumber \\{} & {} + 2 \gamma \sum _{t'=0}^{m-1} \mathbb {E} \left\langle \varvec{\theta }^{[s+1]}_{[t']}-\varvec{\theta }^{[s+1]}_{[t'+1]}, \nabla F (\varvec{\theta }^{[s-1]}) \right\rangle \nonumber \\\le & {} \mathbb {E}\Vert \varvec{\theta }^{[s]} - \varvec{\theta }^* \Vert ^2 + \sum _{t'=0}^{m-1}\frac{2 \gamma }{k} \mathbb {E} \left( F( \varvec{\theta }^*)- F(\varvec{\theta }^{[s+1]}_{[t']}) \right) \nonumber \\{} & {} - \sum _{t'=0}^{m-1} \frac{1}{k} \left( 1- \frac{2 L_{res} \theta \gamma }{k^{1/2} }\right) \cdot \mathbb {E}(\Vert \varvec{\theta }^{[s+1]}_{[t']}-\overline{\varvec{\theta }}_{[t'+1]}^{[s+1]} \Vert ^2)\nonumber \\{} & {} + 2 \gamma \mathbb {E} \left( F(\varvec{\theta }^{[s]} ) - F(\varvec{\theta }^{[s+1]}) + \frac{L_{res}}{2k} \sum _{t'=0}^{m-1} \Vert \varvec{\theta }^{[s+1]}_{[t']}-\overline{\varvec{\theta }}_{[t'+1]}^{[s+1]} \Vert ^2 \right) \nonumber \\\le & {} \mathbb {E}\Vert \varvec{\theta }^{[s]} - \varvec{\theta }^* \Vert ^2 + \frac{2 \gamma }{k}\sum _{t'=0}^{m-1} \left( F( \varvec{\theta }^*) - \mathbb {E}F(\varvec{\theta }^{[s+1]}_{[t']}) \right) \nonumber \\{} & {} + 2 \gamma \left( \mathbb {E}F(\varvec{\theta }^{[s]}) - \mathbb {E}F(\varvec{\theta }^{[s+1]}) \right) \nonumber \\{} & {} - \sum _{t'=0}^{m-1} \frac{1}{k} \left( 1- L_{res} \gamma - \frac{2 L_{res} \theta \gamma }{k^{1/2} }\right) \cdot \mathbb {E}(\Vert \varvec{\theta }^{[s+1]}_{[t']}-\overline{\varvec{\theta }}_{[t'+1]}^{[s+1]} \Vert ^2)\nonumber \\\le & {} \mathbb {E}\Vert \varvec{\theta }^{[s]} - \varvec{\theta }^* \Vert ^2 + \frac{2 \gamma }{k}\sum _{t'=0}^{m-1} \left( F( \varvec{\theta }^*) - \mathbb {E}F(\varvec{\theta }^{[s+1]}_{[t']}) \right) \nonumber \\{} & {} + 2 \gamma \left( \mathbb {E}F(\varvec{\theta }^{[s]}) - \mathbb {E}F(\varvec{\theta }^{[s+1]}) \right) \end{aligned}$$
(37)

where the second inequality uses (3), the final inequality comes from \(1- L_{res} \gamma - \frac{2 L_{res} \theta \gamma }{k^{1/2} } \ge 0\). Define \(\mathcal {F}(\varvec{\theta }^{[s]}) = \mathbb {E} \Vert \varvec{\theta }^{[s]} - \varvec{\theta }^* \Vert ^2 + 2 \gamma \mathbb {E} \left( F( \varvec{\theta }^{[s]}) - F( \varvec{\theta }^*) \right)\). According to (37), we have

$$\begin{aligned} \mathcal {F}(\varvec{\theta }^{[s+1]})\le & {} \mathcal {F}(\varvec{\theta }^{[s]}) - \frac{2 \gamma }{k}\sum _{t'=0}^{m-1} \mathbb {E} \left( F( \varvec{\theta }^{[s+1]}_{t'}) - F( \varvec{\theta }^*) \right) \nonumber \\\le & {} \mathcal {F}(\varvec{\theta }^{[s]}) - \frac{2 m \gamma }{ k}\mathbb {E} \left( F( \varvec{\theta }^{[s+1]}) - F( \varvec{\theta }^*) \right) \end{aligned}$$
(38)

where the second inequality comes from the monotonicity of \(\mathbb {E} F(\varvec{\theta }_{[t]}^{[s]})\). According to (38), we have

$$\begin{aligned} \mathcal {F}(\varvec{\theta }^{[S]}) \le \mathcal {F}(\varvec{\theta }^{[0]}) - \frac{2 m \gamma S}{ k}\mathbb {E} \left( F( \varvec{\theta }^{[S]}) - F( \varvec{\theta }^*) \right) \end{aligned}$$
(39)

Since \(\mathcal {F}(\varvec{\theta }^{[S]}) \ge 0\), rearranging (39) gives \(\mathbb {E} \left( F( \varvec{\theta }^{[S]}) - F( \varvec{\theta }^*) \right) \le \frac{k \, \mathcal {F}(\varvec{\theta }^{[0]})}{2 m \gamma S}\), which is the claimed sublinear convergence rate. This completes the proof. \(\square\)

Appendix B: Convergence analysis of Theorem 2

Lemma 4

The function \(\nu \sum _{j=1}^p\left\| \varvec{\tilde{\beta }}^j - \varvec{\beta }^j \right\| ^2\), viewed as a function of the parameter \(\varvec{\beta }\) in (5), has normal Lipschitz constant \(2\nu\) in the sense of Definition 1.

Proof

First, we have that

$$\begin{aligned} \nu \Vert \varvec{\beta } \Vert ^2 - \nu \sum _{j=1}^p\left\| \varvec{\tilde{\beta }}^j - \varvec{\beta }^j \right\| ^2= & {} \nu \Vert \varvec{\beta } \Vert ^2 - \nu \Vert \varvec{\beta } \Vert ^2 + 2\nu \sum _{j=1}^p \langle \tilde{\varvec{\beta }}^j, \varvec{\beta }^j \rangle -\nu \Vert \varvec{\tilde{\beta }} \Vert ^2\nonumber \\= & {} 2\nu \sum _{j=1}^p \langle \varvec{\tilde{\beta }}^j, \varvec{\beta }^j \rangle -\nu \Vert \varvec{\tilde{\beta }} \Vert ^2 \end{aligned}$$
(40)

It is easy to verify that (40) is a convex function w.r.t. the parameter \(\varvec{\beta }\). Thus, according to the convexity, we have

$$\begin{aligned} \nu \Vert \varvec{\beta } \Vert ^2 - \nu \sum _{j=1}^p\left\| \varvec{\tilde{\beta }}^j - \varvec{\beta }^j \right\| ^2 \ge \nu \Vert \varvec{\beta }' \Vert ^2 - \nu \sum _{j=1}^p\left\| \varvec{\tilde{\beta }}^j - \varvec{\beta }'^j \right\| ^2 + \left\langle 4\nu \varvec{\beta }'- 2\nu \varvec{\tilde{\beta }}, \varvec{\beta } - \varvec{\beta }' \right\rangle \end{aligned}$$
(41)

Based on (41), we have that

$$\begin{aligned}{} & {} - \nu \sum _{j=1}^p\left\| \varvec{\tilde{\beta }}^j - \varvec{\beta }^j \right\| ^2 \ge - \nu \Vert \varvec{\beta } \Vert ^2 - \nu \Vert \varvec{\beta }' \Vert ^2 + \left\langle 2\nu \varvec{\beta }', \varvec{\beta } \right\rangle \nonumber \\{} & {} - \nu \sum _{j=1}^p\left\| \varvec{\tilde{\beta }}^j - \varvec{\beta }'^j \right\| ^2 + \left\langle 2\nu \varvec{\beta }'- 2\nu \varvec{\tilde{\beta }}, \varvec{\beta } -\varvec{\beta }' \right\rangle \nonumber \\= & {} - \nu \sum _{j=1}^p\left\| \varvec{\tilde{\beta }}^j - \varvec{\beta }'^j \right\| ^2 + \left\langle 2\nu \varvec{\beta }'- 2\nu \varvec{\tilde{\beta }}, \varvec{\beta } -\varvec{\beta }' \right\rangle -\frac{2 \nu }{2}\Vert \varvec{\beta } - \varvec{\beta }' \Vert ^2 \end{aligned}$$
(42)

According to (42), we have that the function \(\nu \sum _{j=1}^p\left\| \varvec{\tilde{\beta }}^j - \varvec{\beta }^j \right\| ^2\) with the parameter \(\varvec{\beta }\) in (5) has the normal Lipschitz constant \(2\nu\). This completes the proof. \(\square\)
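Numerically, the gradient of \(g(\varvec{\beta }) = \nu \sum _{j=1}^p\Vert \varvec{\tilde{\beta }}^j - \varvec{\beta }^j \Vert ^2\) is \(\nabla g(\varvec{\beta }) = 2\nu (\varvec{\beta } - \varvec{\tilde{\beta }})\), so the Lipschitz constant \(2\nu\) can be checked directly (a small sketch with illustrative values):

```python
import numpy as np

rng = np.random.default_rng(2)
nu, p = 3.0, 5
beta_tilde = rng.normal(size=p)

def grad(beta):
    return 2 * nu * (beta - beta_tilde)   # gradient of the penalty term

b1, b2 = rng.normal(size=p), rng.normal(size=p)
lhs = np.linalg.norm(grad(b1) - grad(b2))
assert np.isclose(lhs, 2 * nu * np.linalg.norm(b1 - b2))  # constant is 2*nu
```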

Lemma 5

Let \(\varvec{\bar{\theta }}^{[1]}\) be the point for (6) produced by the batch gradient descent algorithm after the first iteration with learning rate \(\frac{1}{L_{nor}}\). Assume \(\mathbb {E} \bar{\mathcal {F}}(\varvec{\theta }^{[S]};\varvec{\beta }) \le \bar{\mathcal {F}}(\varvec{\bar{\theta }}^{[1]};\varvec{\beta })\) for each call of the DSG algorithm. Then, for the DSGAM algorithm, we have

$$\begin{aligned} {\mathcal {F}}(\varvec{\theta }^{[t]},\varvec{\beta }^{[t+1]}) - {\mathbb {E}} {\mathcal {F}}(\varvec{\theta }^{[t+1]},\varvec{\beta }^{[t+1]}) \ge \frac{1}{2L_{nor} } \left\| \nabla F(\varvec{\theta }^{[t]}) \right\| ^2 \end{aligned}$$
(43)

where \(F(\varvec{\theta }^{[t]}) = \frac{1}{l}\sum _{i=1}^l F_i(\varvec{\theta }^{[t]})\).

Proof

Using the smoothness of \(\mathcal {F}(\varvec{\theta },\varvec{\beta })\) w.r.t. the parameter \(\varvec{\theta }\) (Assumption 1) and the update \(\varvec{\theta }^{[t+1]} = \varvec{\theta }^{[t]} - \frac{1}{L_{nor}} \nabla F(\varvec{\theta }^{[t]})\), we have that

$$\begin{aligned}{} & {} \mathbb {E} \mathcal {F}(\varvec{\theta }^{[t+1]},\varvec{\beta }^{[t+1]}) \le \mathcal {F}(\varvec{\bar{\theta }}^{[1]},\varvec{\beta }^{[t+1]})\nonumber \\{} & {} \quad \le \mathcal {F}(\varvec{\theta }^{[t]},\varvec{\beta }^{[t+1]}) + \left\langle \nabla F(\varvec{\theta }^{[t]}), \varvec{\theta }^{[t+1]} - \varvec{\theta }^{[t]} \right\rangle + \frac{L_{nor}}{2} \left\| \varvec{\theta }^{[t+1]} - \varvec{\theta }^{[t]} \right\| ^2\nonumber \\{} & {} \quad = \mathcal {F}(\varvec{\theta }^{[t]},\varvec{\beta }^{[t+1]}) - \frac{1}{L_{nor}} \left\| \nabla F(\varvec{\theta }^{[t]}) \right\| ^2+ \frac{1}{2L_{nor}} \left\| \nabla F(\varvec{\theta }^{[t]}) \right\| ^2\nonumber \\{} & {} \quad = \mathcal {F}(\varvec{\theta }^{[t]},\varvec{\beta }^{[t+1]}) - \frac{1}{2 L_{nor}} \left\| \nabla F(\varvec{\theta }^{[t]}) \right\| ^2 \end{aligned}$$
(44)

where the first inequality uses the assumption \(\mathbb {E} \bar{\mathcal {F}}(\varvec{\theta }^{[S]};\varvec{\beta }) \le \bar{\mathcal {F}}(\varvec{\bar{\theta }}^{[1]};\varvec{\beta })\), the second inequality uses Assumption 1, and the first equality uses \(\varvec{\theta }^{[t+1]} = \varvec{\theta }^{[t]} - \frac{1}{L_{nor}} \nabla F(\varvec{\theta }^{[t]})\). This completes the proof. \(\square\)

Now, we provide the proof of Theorem 2 as follows.

Proof

According to Lemma 3.3 in Beck (2015), we have that

$$\begin{aligned}{} & {} \mathbb {E} \mathcal {F}(\varvec{\theta }^{[t+1]},\varvec{\beta }^{[t+1]}) - \mathcal {F}(\varvec{\theta }^{*},\varvec{\beta }^{*}) \le \left\| \nabla F(\varvec{\theta }^{[t]}) \right\| \left( \left\| \varvec{\theta }^{[t]} -\varvec{\theta }^{*} \right\| + \left\| \varvec{\beta }^{[t+1]} -\varvec{\beta }^{*} \right\| \right) \nonumber \\{} & {} \quad \le \left\| \nabla F(\varvec{\theta }^{[t]}) \right\| \left( \left\| \varvec{\theta }^{[0]} -\varvec{\theta }^{*} \right\| + \left\| \varvec{\beta }^{[0]} -\varvec{\beta }^{*} \right\| \right) =\left\| \nabla F(\varvec{\theta }^{[t]}) \right\| R_2 \end{aligned}$$
(45)

where the second inequality uses the fact \(\left\| \varvec{\theta }^{[t]} -\varvec{\theta }^{*} \right\| + \left\| \varvec{\beta }^{[t+1]} -\varvec{\beta }^{*} \right\| \le \left\| \varvec{\theta }^{[0]} -\varvec{\theta }^{*} \right\| + \left\| \varvec{\beta }^{[0]} -\varvec{\beta }^{*} \right\|\). According to Lemma 5, we have that

$$\begin{aligned}{} & {} \mathcal {F}(\varvec{\theta }^{[t]},\varvec{\beta }^{[t]}) - \mathbb {E} \mathcal {F}(\varvec{\theta }^{[t+1]},\varvec{\beta }^{[t+1]})\nonumber \\{} & {} \quad \ge \mathcal {F}(\varvec{\theta }^{[t]},\varvec{\beta }^{[t+1]}) - \mathbb {E} \mathcal {F}(\varvec{\theta }^{[t+1]},\varvec{\beta }^{[t+1]})\nonumber \\{} & {} \quad \ge \frac{1}{2 L_{nor} } \left\| \nabla F(\varvec{\theta }^{[t]}) \right\| ^2\nonumber \\{} & {} \quad \ge \frac{ \left( \mathbb {E} \mathcal {F}(\varvec{\theta }^{[t+1]},\varvec{\beta }^{[t+1]}) - \mathcal {F}(\varvec{\theta }^{*},\varvec{\beta }^{*}) \right) ^2 }{2 L_{nor} R_2^2} \end{aligned}$$
(46)

Similarly, considering line 3 of our DSGAM algorithm and applying Lemma 4, we have that

$$\begin{aligned} \mathcal {F}(\varvec{\theta }^{[t]},\varvec{\beta }^{[t]}) - \mathbb {E} \mathcal {F}(\varvec{\theta }^{[t+1]},\varvec{\beta }^{[t+1]}) \ge \frac{ \left( \mathbb {E} \mathcal {F}(\varvec{\theta }^{[t+1]},\varvec{\beta }^{[t+1]}) - \mathcal {F}(\varvec{\theta }^{*},\varvec{\beta }^{*}) \right) ^2 }{4 \nu R_2^2} \end{aligned}$$
(47)

This inequality is proved in Beck (2015).

Combining (46) and (47), we have that

$$\begin{aligned} \mathcal {F}(\varvec{\theta }^{[t]},\varvec{\beta }^{[t]}) - \mathbb {E} \mathcal {F}(\varvec{\theta }^{[t+1]},\varvec{\beta }^{[t+1]}) \ge \frac{ \left( \mathbb {E} \mathcal {F}(\varvec{\theta }^{[t+1]},\varvec{\beta }^{[t+1]}) - \mathcal {F}(\varvec{\theta }^{*},\varvec{\beta }^{*}) \right) ^2 }{2\min \{L_{nor}, 2\nu \} R_2^2} \end{aligned}$$
(48)

According to Lemma 3.6 in Beck (2015) and (48), we have that

$$\begin{aligned} \mathbb {E} \mathcal {F}(\varvec{\theta }^{[T]},\varvec{\beta }^{[T]}) - \mathcal {F}(\varvec{\theta }^{*},\varvec{\beta }^{*}) \le \max \left\{ \left( \frac{1}{2} \right) ^{\frac{T-1}{2}} R_1, \frac{8 \min \{L_{nor}, 2\nu \} R_2}{T-1} \right\} \end{aligned}$$
(49)

This completes the proof. \(\square\)
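Theorem 2 analyzes the wrapper as an alternating minimization over \(\varvec{\beta }\) and \(\varvec{\theta }\) in the spirit of Beck (2015). Schematically (an illustrative sketch, with hypothetical `solve_beta` standing for the line-3 update and `dsg_solve` for the inner DSG call):

```python
def dsgam_wrapper(theta, beta, solve_beta, dsg_solve, T):
    """Alternate between the beta subproblem and the inner DSG solver."""
    for t in range(T):
        beta = solve_beta(theta)         # minimize F(theta, .) over beta
        theta = dsg_solve(theta, beta)   # refine theta with DSG, beta fixed
    return theta, beta
```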

Cite this article

Gu, B., Zhang, C., Huo, Z. et al. A new large-scale learning algorithm for generalized additive models. Mach Learn 112, 3077–3104 (2023). https://doi.org/10.1007/s10994-023-06339-4
