Laplacian smoothing gradient descent

Abstract

We propose a class of very simple modifications of gradient descent and stochastic gradient descent leveraging Laplacian smoothing. We show that when applied to a large variety of machine learning problems, ranging from logistic regression to deep neural nets, the proposed surrogates can dramatically reduce the variance, allow larger step sizes, and improve the generalization accuracy. The methods only involve multiplying the usual (stochastic) gradient by the inverse of a positive definite matrix (which can be computed efficiently by FFT) with a low condition number coming from a one-dimensional discrete Laplacian or its high-order generalizations. Applied to any vector, e.g., a gradient vector, Laplacian smoothing preserves the mean while increasing the smallest component and decreasing the largest component. Moreover, we show that optimization algorithms with these surrogates converge uniformly in the discrete Sobolev \(H_\sigma ^p\) sense and reduce the optimality gap for convex optimization problems. The code is available at: https://github.com/BaoWangMath/LaplacianSmoothing-GradientDescent.
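For concreteness, the following minimal sketch (ours, not from the article; assuming NumPy, periodic boundary conditions, and the first-order smoothing n = 1) shows how a flattened gradient can be multiplied by the inverse of the smoothing matrix via the FFT; all names are illustrative.

import numpy as np

def laplacian_smooth(grad, sigma=1.0):
    # Return (I + sigma * L)^{-1} grad, where L = 2I - B - B^T is the 1-D periodic
    # discrete Laplacian; the circulant system is diagonalized by the FFT with
    # eigenvalues 1 + 2*sigma*(1 - cos(2*pi*k/m)) = 1 + 4*sigma*sin^2(pi*k/m).
    m = grad.shape[0]
    denom = 1.0 + 2.0 * sigma * (1.0 - np.cos(2.0 * np.pi * np.arange(m) / m))
    return np.real(np.fft.ifft(np.fft.fft(grad) / denom))

# One (stochastic) gradient-descent step with the smoothed surrogate:
# w = w - lr * laplacian_smooth(stochastic_gradient, sigma)

Since the k = 0 eigenvalue equals 1, the smoothed vector has the same mean as the original one, consistent with the mean-preservation property stated above.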

Notes

  1. Here, the condition number is the ratio of the largest and smallest eigenvalues of the Hessian of the strongly convex objective function.

References

  1. Tao, T.: 254A, Notes 1: Concentration of measure (2010). https://terrytao.wordpress.com/2010/01/03/254a-notes-1-concentration-of-measure/

  2. Abadi, M., Agarwal, A., et al.: TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems (2016). arXiv preprint arXiv:1603.04467

  3. Allen-Zhu, Z.: Katyusha: The first direct acceleration of stochastic gradient methods. J. Mach. Learn. Res. 18, 1–51 (2018)

  4. Arjovsky, M., Bottou, L.: Towards Principled Methods for Training Generative Adversarial Networks (2017). arXiv preprint arXiv:1701.04862

  5. Bertsekas, D.P.: Nonlinear Programming. Athena Scientific, Belmont (1999)

  6. Bhatia, R.: Matrix Analysis. Springer (1997)

  7. Bottou, L.: Stochastic gradient descent tricks. In: Neural Networks: Tricks of the Trade, Reloaded. Lecture Notes in Computer Science, vol. 7700 (2012)

  8. Bottou, L., Curtis, E.F., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)

  9. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: OpenAI Gym (2016). arXiv preprint arXiv:1606.01540

  10. Chaudhari, P., Oberman, A., Osher, S., Soatto, S., Guillame, C.: Deep Relaxation: Partial Differential Equations for Optimizing Deep Neural Networks (2017). arXiv preprint arXiv:1704.04932

  11. Defazio, A., Bach, F.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems (2014)

  12. Dozat, T.: Incorporating Nesterov momentum into Adam. In: 4th International Conference on Learning Representations Workshop (ICLR 2016) (2016)

  13. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)

  14. Evans, L.C.: Partial Differential Equations (2010)

  15. Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.C., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)

  16. Hardt, M., Recht, B., Singer, Y.: Train faster, generalize better: stability of stochastic gradient descent. In: 33rd International Conference on Machine Learning (ICML 2016) (2016)

  17. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778 (2016)

  18. Jastrzebski, S., Kenton, Z., Ballas, N., Fischer, A., Bengio, Y., Storkey, A.: DNN's sharpest directions along the SGD trajectory (2018). arXiv preprint arXiv:1807.05031

  19. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems (2013)

  20. Jung, M., Chung, G., Sundaramoorthi, G., Vese, L., Yuille, A.: Sobolev gradients and joint variational image segmentation, denoising, and deblurring. In: Computational Imaging VII, volume 7246, pp. 72460I. International Society for Optics and Photonics (2009)

  21. Kingma, D., Ba, J.: Adam: a method for stochastic optimization (2014). arXiv preprint arXiv:1412.6980

  22. Krizhevsky, A.: Learning Multiple Layers of Features from Tiny Images (2009)

  23. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)

  24. Lei, L., Ju, C., Chen, J., Jordan, M.: Nonconvex finite-sum optimization via SCSG methods. In: Advances in Neural Information Processing Systems (2017)

  25. Li, F., et al.: Cs231n: Convolutional Neural Networks for Visual Recognition (2018)

  26. Li, H., Xu, Z., Taylor, G., Goldstein, T.: Visualizing the Loss Landscape of Neural Nets (2017). arXiv preprint arXiv:1712.09913

  27. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN (2017). arXiv preprint arXiv:1701.07875

  28. Mandt, S., Hoffman, M., Blei, D.: Stochastic gradient descent as approximate Bayesian inference. J. Mach. Learn. Res. 18, 1–35 (2017)

  29. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015)

  30. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)

  31. Nesterov, Y.: A method for solving the convex programming problem with convergence rate \(O(1/k^2)\). Dokl. Akad. Nauk SSSR 269, 543–547 (1983)

  32. Nesterov, Y.: Introductory lectures on convex programming volume i: Basic course. Lecture Notes (1998)

  33. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch (2017)

  34. Qian, N.: On the momentum term in gradient descent learning algorithms. Neural Networks 12(1), 145–151 (1999)

  35. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)

  36. Reddi, S., Kale, S., Kumar, S.: On the convergence of Adam and beyond. In: 6th International Conference on Learning Representations (ICLR 2018) (2018)

  37. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)

  38. Schmidhuber, J.: Deep learning in neural networks: an overview. arXiv preprint arXiv:1404.7828 (2014)

  39. Senior, A., Heigold, G., Ranzato, M., Yang, K.: An empirical study of learning rates in deep neural networks for speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing (2013)

  40. Shamir, O., Zhang, T.: Stochastic gradient descent for non-smooth optimization: convergence results and optimal averaging schemes. In: 30th International Conference on Machine Learning (ICML 2013) (2013)

  41. Shapiro, A., Wardi, Y.: Convergence analysis of gradient descent stochastic algorithms. J. Optim. Theory Appl. 91(2), 439–454 (1996)

  42. Silver, D., et al.: Mastering the game of go with deep neural networks and tree search. Nature 529, 484–489 (2016)

  43. Sutton, R.: Two problems with backpropagation and other steepest-descent learning procedures for networks. In: Proc. 8th Annual Conf. Cognitive Science Society (1986)

  44. Tieleman, T., Hinton, G.: Lecture 6.5, RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks Mach. Learn. 4(2), 26–31 (2012)

  45. Wang, B., Gu, Q., Boedihardjo, M., Barekat, F., Osher, S.: Privacy-preserving ERM by Laplacian smoothing stochastic gradient descent. UCLA Computational and Applied Mathematics Reports, pp. 19–24 (2019)

  46. Welling, M., Teh, Y.: Bayesian learning via stochastic gradient Langevin dynamics. In: 28th International Conference on Machine Learning (ICML 2011) (2011)

  47. Wu, Y., He, K.: Group normalization. In: European Conference on Computer Vision (2018)

  48. Zeiler, M.: Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012)

Funding

This material is based on research sponsored by NSF grants DMS-1924935, DMS-1952339, DMS-2110145, DMS-2152762, DMS-2208361, DOE grant DE-SC0021142, and ONR grant N00014-18-1-2527 and the ONR MURI grant N00014-20-1-2787.

Author information

Corresponding author

Correspondence to Bao Wang.

Appendix

Proof of Theorem 1

In this part, we give a proof of Theorem 1.

Lemma 2

[1] Let \(t, u > 0\), \({\varvec{v}}\) be an m-dimensional standard normal random vector, and let \(F:\mathbb {R}^m \rightarrow \mathbb {R}\) be a function such that \(\Vert F({\varvec{x}}) - F({\varvec{y}})\Vert \le \Vert {\varvec{x}}- {\varvec{y}}\Vert \) for all \({\varvec{x}}\), \({\varvec{y}}\in \mathbb {R}^m\). Then

$$\begin{aligned} \mathbb {P}\left( F({\varvec{v}}) \ge \mathbb {E}F({\varvec{v}}) + u \right) \le \exp {\left( -tu+\frac{1}{2}\left( \frac{\pi t}{2}\right) ^2\right) }. \end{aligned}$$
(22)

Taking \(t=\frac{4u}{\pi ^2}\) in Lemma 2, we obtain

Lemma 3

Let \(u > 0\), \({\varvec{v}}\) be an m-dimensional standard normal random vector, and let \(F:\mathbb {R}^m \rightarrow \mathbb {R}\) be a function such that \(\Vert F({\varvec{x}}) - F({\varvec{y}})\Vert \le \Vert {\varvec{x}}- {\varvec{y}}\Vert \) for all \({\varvec{x}}\), \({\varvec{y}}\in \mathbb {R}^m\). Then

$$\begin{aligned} \mathbb {P}\left( F({\varvec{v}}) \ge \mathbb {E}F({\varvec{v}}) + u \right) \le \exp {\left( -\frac{2}{\pi ^2}u^2 \right) }. \end{aligned}$$
(23)

Lemma 4

Let \({\varvec{v}}\) be an m-dimensional standard normal random vector. Let \(1\le p\le \infty \). Let \(0< u <\mathbb {E}\Vert {\varvec{v}}\Vert _{\ell _{p}}\). Let \({\varvec{T}}\in \mathbb {R}^{m\times m}\) be such that \(\Vert {\varvec{T}}{\varvec{x}}\Vert _{\ell _{p}}\le \Vert {\varvec{x}}\Vert _{\ell _{p}}\) for all \({\varvec{x}}\in \mathbb {R}^{m}\). Then

$$\begin{aligned} \mathbb {P}\left( \Vert {\varvec{T}}{\varvec{v}}\Vert _{\ell _{p}}\ge \frac{\mathbb {E}\Vert {\varvec{T}}{\varvec{v}}\Vert _{\ell _{p}}+u}{\mathbb {E}\Vert {\varvec{v}}\Vert _{\ell _{p}}-u} \Vert {\varvec{v}}\Vert _{\ell _{p}}\right) \le 2\exp {\left( -\frac{2}{\pi ^{2}}u^{2}\right) }. \end{aligned}$$

Proof

By Lemma 3,

$$\begin{aligned} \mathbb {P}(\Vert {\varvec{T}}{\varvec{v}}\Vert _{\ell _{p}}\ge \mathbb {E}\Vert {\varvec{T}}{\varvec{v}}\Vert _{\ell _{p}}+u)\le e^{-\frac{2}{\pi ^{2}}u^{2}} \end{aligned}$$

and

$$\begin{aligned} \mathbb {P}(-\Vert {\varvec{v}}\Vert _{\ell _{p}}\ge -\mathbb {E}\Vert {\varvec{v}}\Vert _{\ell _{p}}+u)\le e^{-\frac{2}{\pi ^{2}}u^{2}}. \end{aligned}$$

The second inequality gives

$$\begin{aligned} \mathbb {P}(\Vert {\varvec{v}}\Vert _{\ell _{p}}\le \mathbb {E}\Vert {\varvec{v}}\Vert _{\ell _{p}}-u)\le e^{-\frac{2}{\pi ^{2}}u^{2}}. \end{aligned}$$

Therefore,

$$\begin{aligned}&\mathbb {P}\left( \Vert {\varvec{T}}{\varvec{v}}\Vert _{\ell _{p}}\ge \frac{\mathbb {E}\Vert {\varvec{T}}{\varvec{v}}\Vert _{\ell _{p}}+u}{ \mathbb {E}\Vert {\varvec{v}}\Vert _{\ell _{p}}-u}\Vert {\varvec{v}}\Vert _{\ell _{p}}\right) \\&\quad \le \mathbb {P}(\Vert {\varvec{T}}{\varvec{v}}\Vert _{\ell _{p}}\ge \mathbb {E}\Vert {\varvec{T}}{\varvec{v}}\Vert _{\ell _{p}}+u)+ \mathbb {P}(\Vert {\varvec{v}}\Vert _{\ell _{p}}\le \mathbb {E}\Vert {\varvec{v}}\Vert _{\ell _{p}}-u)\le 2e^{-\frac{2}{\pi ^{2}}u^{2}}. \end{aligned}$$

\(\square \)

Lemma 5

Let \(1\le p\le 2\). Let \({\varvec{T}}\in \mathbb {R}^{m\times m}\). Let \({\varvec{v}}\) be an m-dimensional standard normal random vector. Then

$$\begin{aligned} \mathbb {E}\Vert {\varvec{T}}{\varvec{v}}\Vert _{\ell _{p}}\le m^{\frac{1}{p}-\frac{1}{2}}(\mathrm {Trace}\,{\varvec{T}}^{*}{\varvec{T}})^{\frac{1}{2}}\left( \mathbb {E}|{\varvec{v}}_{1}|^{p}\right) ^{\frac{1}{p}}, \end{aligned}$$

where \({\varvec{v}}_1\) is the first coordinate of \({\varvec{v}}\).

Proof

We write \({\varvec{T}}=({\varvec{T}}_{i,j})_{1\le i,j\le m}\). Then

$$\begin{aligned} \mathbb {E}\Vert {\varvec{T}}{\varvec{v}}\Vert _{\ell _{p}}= & {} \mathbb {E}\left( \sum _{i=1}^{m}\left| \sum _{j=1}^{m}{\varvec{T}}_{i,j}{\varvec{v}}_{j}\right| ^{p}\right) ^{\frac{1}{p}}\le \left( \sum _{i=1}^{m}\mathbb {E}\left| \sum _{j=1}^{m}{\varvec{T}}_{i,j}{\varvec{v}}_{j}\right| ^{p}\right) ^{\frac{1}{p}}\\= & {} \left( \sum _{i=1}^{m}\left( \sum _{j=1}^{m}{\varvec{T}}_{i,j}^{2}\right) ^{\frac{p}{2}}\mathbb {E}|{\varvec{v}}_{1}|^{p}\right) ^{\frac{1}{p}}\le \left( m^{1-\frac{p}{2}}\left( \sum _{1\le i,j\le m}{\varvec{T}}_{i,j}^{2}\right) ^{\frac{p}{2}}\mathbb {E}|{\varvec{v}}_{1}|^{p}\right) ^{\frac{1}{p}}\\= & {} m^{\frac{1}{p}-\frac{1}{2}}\left( \mathrm {Trace}\,{\varvec{T}}^{*}{\varvec{T}}\right) ^{\frac{1}{2}}\left( \mathbb {E}|{\varvec{v}}_{1}|^{p}\right) ^{\frac{1}{p}}, \end{aligned}$$

where the second equality follows from the assumption that \({\varvec{v}}\) is an m-dimensional standard normal random vector. \(\square \)
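A quick Monte Carlo sanity check of Lemma 5 (our illustration, not part of the proof; assuming NumPy, with illustrative variable names):

import numpy as np

rng = np.random.default_rng(0)
m, p, n_samples = 8, 1.5, 200_000

T = rng.standard_normal((m, m))              # an arbitrary m-by-m matrix
v = rng.standard_normal((n_samples, m))      # rows are i.i.d. standard normal vectors

lhs = ((np.abs(v @ T.T) ** p).sum(axis=1) ** (1 / p)).mean()   # Monte Carlo E||Tv||_p
moment_p = np.mean(np.abs(v[:, 0]) ** p)                       # Monte Carlo E|v_1|^p
rhs = m ** (1 / p - 1 / 2) * np.sqrt(np.trace(T.T @ T)) * moment_p ** (1 / p)

print(lhs, rhs, lhs <= rhs)   # expected: lhs <= rhs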

Lemma 6

Let \({\varvec{v}}\) be an m-dimensional standard normal random vector. Then

$$\begin{aligned} \mathbb {E}\Vert {\varvec{v}}\Vert _{\ell _{2}}\ge \sqrt{m}-\pi . \end{aligned}$$

Proof

By Lemma 3,

$$\begin{aligned} \mathbb {P}(\Vert {\varvec{v}}\Vert _{\ell _{2}}\ge \mathbb {E}\Vert {\varvec{v}}\Vert _{\ell _{2}}+u)\le e^{-\frac{2}{\pi ^{2}}u^{2}} \end{aligned}$$

and

$$\begin{aligned} \mathbb {P}(-\Vert {\varvec{v}}\Vert _{\ell _{2}}\ge -\mathbb {E}\Vert {\varvec{v}}\Vert _{\ell _{2}}+u)\le e^{-\frac{2}{\pi ^{2}}u^{2}}. \end{aligned}$$

Thus,

$$\begin{aligned} \mathbb {P}(|\Vert {\varvec{v}}\Vert _{\ell _{2}}-\mathbb {E}\Vert {\varvec{v}}\Vert _{\ell _{2}}|\ge u)\le 2e^{-\frac{2}{\pi ^{2}}u^{2}}. \end{aligned}$$

Consider the random variable \(W=\Vert {\varvec{v}}\Vert _{\ell _{2}}\). We have

$$\begin{aligned} \mathbb {E}|W-\mathbb {E}W|^{2}=\int _{0}^{\infty }\mathbb {P}(|W-\mathbb {E}W|\ge \sqrt{u})\,du\le \int _{0}^{\infty }2e^{-\frac{2}{\pi ^{2}}u}\,du=\pi ^{2}. \end{aligned}$$

Since \(\mathbb {E}|W-\mathbb {E}W|^2 = \mathbb {E}W^2 - (\mathbb {E}W)^2\), we have

$$\begin{aligned} \mathbb {E}W\ge (\mathbb {E}W^{2})^{\frac{1}{2}}-(\mathbb {E}|W-\mathbb {E}W|^{2})^{\frac{1}{2}}\ge \sqrt{m}-\pi . \end{aligned}$$

\(\square \)
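The bound of Lemma 6 is far from tight, since \(\mathbb {E}\Vert {\varvec{v}}\Vert _{\ell _{2}}\approx \sqrt{m}\) for large m; a small Monte Carlo check (our illustration, assuming NumPy):

import numpy as np

rng = np.random.default_rng(0)
m, n_samples = 100, 100_000
v = rng.standard_normal((n_samples, m))
# Monte Carlo estimate of E||v||_2 versus the lower bound sqrt(m) - pi.
print(np.linalg.norm(v, axis=1).mean(), np.sqrt(m) - np.pi)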

Lemma 7

Let \(0<\epsilon <1-\frac{\pi }{\sqrt{m}}\). Let \(\sigma >0\). Let

$$\begin{aligned} \beta =\frac{1}{m}\sum _{i=1}^{m}\frac{1}{|1+2\sigma -\sigma z_{i}-\sigma \overline{z_{i}}| }, \end{aligned}$$

where \(z_{1},\ldots ,z_{m}\) are the \(m\)-th roots of unity. Let \({\varvec{B}}\) be the circular shift operator on \(\mathbb {R}^{m}\). Let \({\varvec{v}}\) be an m-dimensional standard normal random vector. Then

$$\begin{aligned} \mathbb {P}\left( \Vert ((1+2\sigma ){\varvec{I}}-\sigma {\varvec{B}}-\sigma {\varvec{B}}^{*})^{-1/2}{\varvec{v}}\Vert _{\ell _{2}}\ge \frac{\sqrt{\beta }+\epsilon }{1-\frac{\pi }{\sqrt{m}}-\epsilon } \Vert {\varvec{v}}\Vert _{\ell _{2}}\right) \le 2e^{-\frac{2}{\pi ^{2}}m\epsilon ^{2}}. \end{aligned}$$

Proof

Let \({\varvec{T}}=((1+2\sigma ){\varvec{I}}-\sigma {\varvec{B}}-\sigma {\varvec{B}}^{*})^{-1/2}\). Taking \(u=\sqrt{m}\epsilon \) in Lemma 4, we have

$$\begin{aligned} \mathbb {P}\left( \Vert {\varvec{T}}{\varvec{v}}\Vert _{\ell _{2}}\ge \frac{\mathbb {E}\Vert {\varvec{T}}{\varvec{v}}\Vert _{\ell _{2}}+\sqrt{m}\epsilon }{\mathbb {E}\Vert {\varvec{v}}\Vert _{\ell _{2}}-\sqrt{m}\epsilon } \Vert {\varvec{v}}\Vert _{\ell _{2}}\right) \le 2e^{-\frac{2}{\pi ^{2}}m\epsilon ^{2}}. \end{aligned}$$

By Lemma 5 with \(p=2\), \(\mathbb {E}\Vert {\varvec{T}}{\varvec{v}}\Vert _{\ell _{2}}\le (\mathrm {Trace}\,{\varvec{T}}^{*}{\varvec{T}})^{\frac{1}{2}}\). It is easy to show that \(\mathrm {Trace}\,{\varvec{T}}^{*}{\varvec{T}}=m\beta \), so \(\mathbb {E}\Vert {\varvec{T}}{\varvec{v}}\Vert _{\ell _{2}}\le \sqrt{m\beta }\). Also, by Lemma 6, \(\mathbb {E}\Vert {\varvec{v}}\Vert _{\ell _{2}}\ge \sqrt{m}-\pi \). Therefore,

$$\begin{aligned} \mathbb {P}\left( \Vert ((1+2\sigma ){\varvec{I}}-\sigma {\varvec{B}}-\sigma {\varvec{B}}^{*})^{-1/2}{\varvec{v}}\Vert _{\ell _{2}}\ge \frac{\sqrt{\beta }+\epsilon }{1-\frac{\pi }{\sqrt{m}}-\epsilon } \Vert {\varvec{v}}\Vert _{\ell _{2}}\right) \le 2e^{-\frac{2}{\pi ^{2}}m\epsilon ^{2}}. \end{aligned}$$

\(\square \)
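The identity \(\mathrm {Trace}\,{\varvec{T}}^{*}{\varvec{T}}=m\beta \) used in the proof is easy to check numerically; a small sketch (our illustration, assuming NumPy):

import numpy as np

m, sigma = 64, 1.0

# Circulant matrix (1 + 2*sigma) I - sigma B - sigma B^T, with B the circular shift.
B = np.roll(np.eye(m), 1, axis=1)
A = (1 + 2 * sigma) * np.eye(m) - sigma * B - sigma * B.T

# Trace of T^* T = A^{-1} versus m * beta computed from the m-th roots of unity.
z = np.exp(2j * np.pi * np.arange(m) / m)
beta = np.mean(1.0 / np.abs(1 + 2 * sigma - sigma * z - sigma * np.conj(z)))
print(np.trace(np.linalg.inv(A)), m * beta)   # the two values agree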

Proof of Theorem 1

Theorem 1 follows from Lemma 7 by substituting \({\varvec{v}}/\Vert {\varvec{v}}\Vert _{\ell _2}\) for \({\varvec{v}}\), using homogeneity, and a direct calculation. \(\square \)

Proof of Theorem 2

In this part, we give a proof of Theorem 2.

Lemma 8

[6] Let \(\prec _{w}\) denote weak majorization, and denote the eigenvalues of a Hermitian matrix \({\varvec{X}}\) by \(\lambda _1({\varvec{X}})\ge \ldots \ge \lambda _m({\varvec{X}})\). For any two Hermitian positive definite matrices \({\varvec{A}}\) and \({\varvec{B}}\), we have

$$\begin{aligned} (\lambda _1({\varvec{A}}{\varvec{B}}),\cdots ,\lambda _m({\varvec{A}}{\varvec{B}})) \prec _w (\lambda _1({\varvec{A}})\lambda _1({\varvec{B}}),\cdots ,\lambda _m({\varvec{A}})\lambda _m({\varvec{B}})). \end{aligned}$$

In particular,

$$\begin{aligned} \sum _{j=1}^{m} \lambda _j({\varvec{A}}{\varvec{B}}) \le \sum _{j=1}^{m}\lambda _j({\varvec{A}})\lambda _j({\varvec{B}}). \end{aligned}$$
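The trace inequality of Lemma 8 can be illustrated numerically for random Hermitian (here, real symmetric) positive definite matrices; a sketch (our illustration, assuming NumPy, with illustrative names):

import numpy as np

rng = np.random.default_rng(0)
m = 6
M1, M2 = rng.standard_normal((m, m)), rng.standard_normal((m, m))
A = M1 @ M1.T + 0.1 * np.eye(m)   # symmetric positive definite
B = M2 @ M2.T + 0.1 * np.eye(m)   # symmetric positive definite

lam_AB = np.sort(np.linalg.eigvals(A @ B).real)[::-1]   # eigenvalues of AB (real and positive)
lam_A = np.sort(np.linalg.eigvalsh(A))[::-1]
lam_B = np.sort(np.linalg.eigvalsh(B))[::-1]

# Weak majorization: every partial sum of the ordered eigenvalues of AB is
# dominated by the corresponding partial sum of products lambda_j(A) * lambda_j(B).
ok = all(lam_AB[:k].sum() <= (lam_A * lam_B)[:k].sum() + 1e-9 for k in range(1, m + 1))
print(ok)   # expected: True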

Proof of Theorem 2

Let \(\lambda _1\ge \ldots \ge \lambda _m\) denote the eigenvalues of \(\Sigma \). The eigenvalues of \(({\varvec{A}}_\sigma ^{n})^{-2}\) are given by \(\{[1+4^n\sigma \sin ^{2n}(\pi j/m)]^{-2}\}_{j=0}^{m-1}\), which we denote by \(1=\alpha _1\ge \ldots \ge \alpha _m\ge (1+4^n\sigma )^{-2}\). We have

$$\begin{aligned} \sum _{j=1}^{m}\mathrm {Var}[{\varvec{n}}_j] = {{\,\mathrm{trace}\,}}(\Sigma ) = \sum _{j=1}^{m} \lambda _j. \end{aligned}$$
(24)

On the other hand we also have

$$\begin{aligned} \sum _{j=1}^{m}\mathrm {Var}[({\varvec{A}}_\sigma ^n)^{-1} {\varvec{n}}_j] = {{\,\mathrm{trace}\,}}(({\varvec{A}}_\sigma ^n)^{-1}\Sigma ({\varvec{A}}_\sigma ^n)^{-1}) = {{\,\mathrm{trace}\,}}(({\varvec{A}}_\sigma ^n)^{-2}\Sigma ) \le \sum _{j=1}^{m} \alpha _j \lambda _j, \end{aligned}$$
(25)

where the last inequality is by Lemma 8. Now,

$$\begin{aligned}&\sum _{j=1}^{m} \lambda _j - \sum _{j=1}^{m} \alpha _j \lambda _j = \sum _{j=1}^{m} (1-\alpha _j)\lambda _j\ge \lambda _m (m - \sum _{j=1}^{m} \alpha _j) \\&\quad = \frac{\lambda _1}{\kappa } (m - \sum _{j=1}^{m} \alpha _j)\ge \frac{\sum _{j=1}^{m}\lambda _j}{m\kappa } (m - \sum _{j=1}^{m} \alpha _j) \end{aligned}$$

Rearranging and simplifying the above implies that

$$\begin{aligned} \sum _{j=1}^{m}\alpha _j\lambda _j \le (\sum _{j=1}^{m} \lambda _j)(1-\frac{1}{\kappa }+\frac{ \sum _{j=1}^{m} \alpha _j}{m\kappa }). \end{aligned}$$

Substituting (24) and (25) in the above inequality yields (11). \(\square \)

Proof of Lemma 1

To prove Lemma 1, we first introduce the following lemma.

Lemma 9

For \(0\le \theta \le 2\pi \), suppose

$$\begin{aligned} F(\theta ) = \frac{1}{1+2\sigma (1-\cos (\theta ))} \end{aligned}$$

has inverse discrete-time Fourier transform \(f[k]\). Then, for every integer k,

$$\begin{aligned} f[k] = \frac{\alpha ^{|k|}}{\sqrt{4\sigma +1}} \end{aligned}$$

where

$$\begin{aligned} \alpha = \frac{2\sigma +1 - \sqrt{4\sigma +1}}{2\sigma } \end{aligned}$$

Proof

By definition,

$$\begin{aligned} f[k] = \frac{1}{2\pi } \int _{0}^{2\pi } F(\theta ) e^{ik\theta } \,d\theta = \frac{1}{2\pi } \int _{0}^{2\pi } \frac{e^{ik\theta }}{1+2\sigma (1-\cos (\theta ))} \,d\theta . \end{aligned}$$
(26)

We compute (26) using the residue theorem. First, note that because \(F(\theta )\) is real valued and even, \(f[k]=f[-k]\); therefore, it suffices to compute (26) for nonnegative k. Set \(z=e^{i\theta }\). Observe that \(\cos (\theta )=\frac{1}{2}(z+1/z)\) and \(dz=iz\, d\theta \). Substituting in (26) and simplifying yields

$$\begin{aligned} f[k] = \frac{-1}{2\pi i \sigma }\oint \frac{z^k}{(z-\alpha _{-})(z-\alpha _{+}) } \,dz, \end{aligned}$$
(27)

where the integral is taken around the unit circle, and \(\alpha _{\pm }= \frac{2\sigma +1 \pm \sqrt{4\sigma +1}}{2\sigma }\) are the roots of the quadratic \(-\sigma z^2 +(2\sigma +1)z -\sigma \). Note that \(\alpha _{-}\) lies within the unit circle, whereas \(\alpha _{+}\) lies outside it. Therefore, because k is nonnegative, \(\alpha _{-}\) is the only singularity of the integrand in (27) within the unit circle. A straightforward application of the residue theorem then yields

$$\begin{aligned} f[k] = \frac{- \alpha _{-}^{k}}{\sigma (\alpha _{-}-\alpha _{+})} = \frac{\alpha ^{k}}{\sqrt{4\sigma +1}}. \end{aligned}$$

This completes the proof. \(\square \)
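Lemma 9 is easy to verify numerically; a short sketch (our illustration, assuming NumPy), approximating the integral in (26) by a uniform Riemann sum over one period:

import numpy as np

sigma, k = 0.7, 3
theta = np.linspace(0.0, 2.0 * np.pi, 4096, endpoint=False)

# (1/(2*pi)) * integral of e^{i k theta} / (1 + 2 sigma (1 - cos theta)) d theta,
# approximated by the mean of the integrand over a uniform grid.
f_k = (np.exp(1j * k * theta) / (1 + 2 * sigma * (1 - np.cos(theta)))).mean().real

alpha = (2 * sigma + 1 - np.sqrt(4 * sigma + 1)) / (2 * sigma)
print(f_k, alpha ** abs(k) / np.sqrt(4 * sigma + 1))   # the two values agree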

Proof of Lemma 1

First, observe that we can rewrite \(\beta \) as

$$\begin{aligned} \frac{1}{d}\sum _{j=0}^{d-1} \frac{1}{1+2\sigma (1-\cos (\frac{2\pi j}{d}))}. \end{aligned}$$
(28)

It remains to show that the above summation is equal to \(\frac{1+\alpha ^d}{(1-\alpha ^d)\sqrt{4\sigma +1}}\). This follows from Lemma 9 and standard sampling results in Fourier analysis (i.e., sampling \(\theta \) at the points \(\{2\pi j/d\}_{j=0}^{d-1}\)). Nevertheless, we provide the details here for completeness. Observe that the inverse discrete-time Fourier transform of

$$\begin{aligned} G(\theta ) = \sum _{j=0}^{d-1}\delta \bigg (\theta -\frac{2\pi j }{d}\bigg ) \end{aligned}$$

is given by

$$\begin{aligned} g[k] = {\left\{ \begin{array}{ll} d/2\pi \qquad &{}\text {if }d\text { divides }k,\\ 0 \qquad &{}\text {otherwise.} \end{array}\right. } \end{aligned}$$

Furthermore, let

$$\begin{aligned} F(\theta ) = \frac{1}{1+2\sigma (1-\cos (\theta ))}, \end{aligned}$$

and use f[k] to denote its inverse discrete-time Fourier transform. Now,

$$\begin{aligned}&\frac{1}{d}\sum _{j=0}^{d-1} \frac{1}{1+2\sigma (1-\cos (\frac{2\pi j}{d}))} = \frac{1}{d} \int _0^{2\pi } F(\theta )G(\theta )\,d\theta \\&\quad = \frac{2\pi }{d} {{\,\mathrm{\mathrm {DTFT}}\,}}^{-1}[F\cdot G][0] = \frac{2\pi }{d} ({{\,\mathrm{\mathrm {DTFT}}\,}}^{-1}[F] * {{\,\mathrm{\mathrm {DTFT}}\,}}^{-1}[G])[0] \\&\quad = \frac{2\pi }{d} \sum _{r=-\infty }^{\infty } f[-r]g[r] = \frac{2\pi }{d} \sum _{\ell =-\infty }^{\infty } f[-\ell d] \frac{d}{2\pi } = \sum _{\ell =-\infty }^{\infty } f[-\ell d]. \end{aligned}$$

The proof is completed by substituting the result of Lemma 9 in the above sum and simplifying. \(\square \)
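Likewise, the closed form for \(\beta \) can be checked directly against the finite sum (28); a small sketch (our illustration, assuming NumPy):

import numpy as np

sigma, d = 0.7, 10
j = np.arange(d)
beta_sum = np.mean(1.0 / (1 + 2 * sigma * (1 - np.cos(2 * np.pi * j / d))))   # the sum (28)

alpha = (2 * sigma + 1 - np.sqrt(4 * sigma + 1)) / (2 * sigma)
beta_closed = (1 + alpha ** d) / ((1 - alpha ** d) * np.sqrt(4 * sigma + 1))
print(beta_sum, beta_closed)   # the two values agree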

About this article

Cite this article

Osher, S., Wang, B., Yin, P. et al. Laplacian smoothing gradient descent. Res Math Sci 9, 55 (2022). https://doi.org/10.1007/s40687-022-00351-1

Keywords

  • Laplacian smoothing
  • Machine learning
  • Optimization