Blended coarse gradient descent for full quantization of deep neural networks


Abstract

Quantized deep neural networks (QDNNs) are attractive due to their much lower memory storage and faster inference speed than their full-precision counterparts. To maintain the same performance level, especially at low bit-widths, QDNNs must be retrained. Their training involves piecewise constant activation functions and discrete weights; hence, mathematical challenges arise. We introduce the notion of coarse gradient and propose the blended coarse gradient descent (BCGD) algorithm for training fully quantized neural networks. A coarse gradient is generally not the gradient of any function but an artificial ascent direction. The weight update of BCGD goes by coarse gradient correction of a weighted average of the full-precision weights and their quantization (the so-called blending), which yields sufficient descent in the objective value and thus accelerates the training. Our experiments demonstrate that this simple blending technique is very effective for quantization at extremely low bit-widths such as binarization. In full quantization of ResNet-18 for the ImageNet classification task, BCGD gives 64.36% top-1 accuracy with binary weights across all layers and 4-bit adaptive activation. If the weights in the first and last layers are kept in full precision, this number increases to 65.46%. As theoretical justification, we provide a convergence analysis of coarse gradient descent for a two-linear-layer neural network model with Gaussian input data and prove that the expected coarse gradient correlates positively with the underlying true gradient.
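
The blending described above amounts to a two-line update: the full-precision weights are blended with their quantization, corrected by a coarse gradient, and then re-quantized (compare the rewriting of update (11) in the proof of Proposition 1 in the Appendix). A minimal NumPy sketch, with a hypothetical 1-bit quantizer and generic names of ours for the quantizer and the coarse-gradient oracle:

```python
import numpy as np

def binarize(w_f):
    """Hypothetical 1-bit quantizer: sign times a per-tensor scale (one common choice)."""
    return np.mean(np.abs(w_f)) * np.sign(w_f)

def bcgd_step(w_f, w_q, coarse_grad, eta=0.01, rho=1e-4, quantizer=binarize):
    """One blended coarse gradient descent step (sketch).

    w_f         : full-precision weights kept during training
    w_q         : current quantized weights
    coarse_grad : coarse gradient evaluated at the quantized weights w_q
    rho         : blending parameter; rho = 0 gives a BinaryConnect-type update [8]
    """
    # blend the float weights with their quantization and take a coarse-gradient step
    w_f_new = (1.0 - rho) * w_f + rho * w_q - eta * coarse_grad
    # project the blended step back onto the set of quantized weights
    w_q_new = quantizer(w_f_new)
    return w_f_new, w_q_new
```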


Notes

  1. This assumption is valid for the population loss function; we refer readers to Lemma 2 in Sect. 5.

  2. We redefine the second term as \(\mathbf {0}\) in the case \(\theta (\mathbf {w},\mathbf {w}^*) = \pi \).

  3. As in Lemma 3, we redefine \(\mathbb {E}\left[ \mathbf {z}1_{\{\mathbf {z}^\top \mathbf {w}>0, \, \mathbf {z}^\top {\tilde{\mathbf {w}}} >0\}}\right] =\mathbf {0}\) in the case \(\theta (\mathbf {w},{\tilde{\mathbf {w}}}) = \pi \).

References

  1. Bengio, Y., Léonard, N., Courville, A.: Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432 (2013)

  2. Bertsekas, D.P.: Nonlinear Programming. Athena Scientific, Belmont (1999)

  3. Brutzkus, A., Globerson, A.: Globally optimal gradient descent for a ConvNet with Gaussian inputs. arXiv preprint arXiv:1702.07966 (2017)

  4. Cai, Z., He, X., Sun, J., Vasconcelos, N.: Deep learning with low precision by half-wave Gaussian quantization. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)

  5. Carreira-Perpinán, M.: Model compression as constrained optimization, with application to neural nets. Part I: general framework. arXiv preprint arXiv:1707.01209 (2017)

  6. Choi, J., Wang, Z., Venkataramani, S., Chuang, P.I.J., Srinivasan, V., Gopalakrishnan, K.: PACT: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085 (2018)

  7. Combettes, P.L., Pesquet, J.C.: Stochastic approximations and perturbations in forward–backward splitting for monotone operators. Pure Appl. Funct. Anal. 1, 13–37 (2016)

  8. Courbariaux, M., Bengio, Y., David, J.: BinaryConnect: training deep neural networks with binary weights during propagations. In: Conference on Neural Information Processing Systems (NIPS), pp. 3123–3131 (2015)

  9. Deng, J., Dong, W., Socher, R., Li, L., Li, K., Li, F.: ImageNet: a large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255 (2009)

  10. Du, S.S., Lee, J.D., Tian, Y., Poczos, B., Singh, A.: Gradient descent learns one-hidden-layer CNN: don’t be afraid of spurious local minimum. arXiv preprint arXiv:1712.00779 (2018)

  11. Freund, Y., Schapire, R.E.: Large margin classification using the perceptron algorithm. Mach. Learn. 37(3), 277–296 (1999)

  12. Gilbert, J.C., Nocedal, J.: Global convergence properties of conjugate gradient methods for optimization. SIAM J. Optim. 2(1), 21–42 (1992)

  13. He, J., Li, L., Xu, J., Zheng, C.: ReLU deep neural networks and linear finite elements. arXiv preprint arXiv:1807.03973 (2018)

  14. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385 (2015)

  15. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: IEEE International Conference on Computer Vision (ICCV) (2015)

  16. Hinton, G.: Neural networks for machine learning, coursera. Coursera, video lectures (2012)

  17. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks: training neural networks with weights and activations constrained to +1 or \(-1\). arXiv preprint arXiv:1602.02830 (2016)

  18. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Quantized neural networks: training neural networks with low precision weights and activations. J. Mach. Learn. Res. 18, 1–30 (2018)

  19. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)

  20. Krizhevsky, A.: Learning multiple layers of features from tiny images. Technical report (2009)

  21. Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)

  22. Li, F., Zhang, B., Liu, B.: Ternary weight networks. arXiv preprint arXiv:1605.04711 (2016)

  23. Li, H., De, S., Xu, Z., Studer, C., Samet, H., Goldstein, T.: Training quantized nets: a deeper understanding. In: NIPS, pp. 5813–5823 (2017)

  24. Li, Y., Yuan, Y.: Convergence analysis of two-layer neural networks with ReLU activation. In: Advances in Neural Information Processing Systems, pp. 597–607 (2017)

  25. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28, 129–137 (1982)

  26. Park, E., Ahn, J., Yoo, S.: Weighted-entropy-based quantization for deep neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5456–5464 (2017)

  27. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch. Technical report (2017)

  28. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-Net: ImageNet classification using binary convolutional neural networks. In: European Conference on Computer Vision (ECCV) (2016)

  29. Rosenblatt, F.: The Perceptron, a Perceiving and Recognizing Automaton Project Para. Cornell Aeronautical Laboratory, Buffalo (1957)

  30. Rosenblatt, F.: Principles of Neurodynamics. Spartan Books, Washington (1962)

  31. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2015)

  32. Tian, Y.: An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. arXiv preprint arXiv:1703.00560 (2017)

  33. Wang, B., Luo, X., Li, Z., Zhu, W., Shi, Z., Osher, S.J.: Deep neural nets with interpolating function as output activation. arXiv preprint arXiv:1802.00168 (2018)

  34. Widrow, B., Lehr, M.A.: 30 years of adaptive neural networks: perceptron, madaline, and backpropagation. Proc. IEEE 78(9), 1415–1442 (1990)

  35. Yin, P., Zhang, S., Lyu, J., Osher, S., Qi, Y., Xin, J.: BinaryRelax: a relaxation approach for training deep neural networks with quantized weights. SIAM J. Imaging Sci. (to appear). arXiv preprint arXiv:1801.06313 (2018)

  36. Yin, P., Zhang, S., Qi, Y., Xin, J.: Quantization and training of low bit-width convolutional neural networks for object detection. J. Comput. Math. (to appear). arXiv preprint arXiv:1612.06052 (2018)

  37. Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y.: DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160 (2016)


Appendix

A. Additional preliminaries

Lemma 6

Let \(\mathbf {z}\) be a Gaussian random vector with entries i.i.d. sampled from \({\mathcal {N}}(0,1)\). Given nonzero vectors \(\mathbf {w}\) and \({\tilde{\mathbf {w}}}\) with angle \(\theta \), we have

$$\begin{aligned} \mathbb {E}\left[ 1_{\{\mathbf {z}^\top \mathbf {w}> 0\}}\right] = \frac{1}{2}, \; \mathbb {E}\left[ 1_{\{\mathbf {z}^\top \mathbf {w}> 0, \, \mathbf {z}^\top {\tilde{\mathbf {w}}}> 0\}}\right] = \frac{\pi -\theta }{2\pi }, \end{aligned}$$

and (see footnote 3)

$$\begin{aligned} \mathbb {E}\left[ \mathbf {z}1_{\{\mathbf {z}^\top \mathbf {w}>0\} } \right] = \frac{1}{\sqrt{2\pi }} \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert }, \; \mathbb {E}\left[ \mathbf {z}1_{\{\mathbf {z}^\top \mathbf {w}>0, \, \mathbf {z}^\top {\tilde{\mathbf {w}}} >0\}}\right] = \frac{\cos (\theta /2)}{\sqrt{2\pi }} \frac{\frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } + \frac{{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert } }{\left\| \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } + \frac{{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert } \right\| }. \end{aligned}$$
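
These identities can be sanity-checked by Monte Carlo sampling. A minimal NumPy sketch (sample size and test vectors are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([1.0, 0.0, 0.0])
w_tilde = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)      # angle theta = pi/4 with w
theta = np.arccos(w @ w_tilde / (np.linalg.norm(w) * np.linalg.norm(w_tilde)))

z = rng.standard_normal((10**6, 3))                     # rows are i.i.d. N(0, I) samples
ind_w = z @ w > 0
ind_both = ind_w & (z @ w_tilde > 0)

print(ind_w.mean(), 0.5)                                # first identity
print(ind_both.mean(), (np.pi - theta) / (2 * np.pi))   # second identity
print(z[ind_w].sum(0) / len(z),                         # third identity
      w / (np.sqrt(2 * np.pi) * np.linalg.norm(w)))
u = w / np.linalg.norm(w) + w_tilde / np.linalg.norm(w_tilde)
print(z[ind_both].sum(0) / len(z),                      # fourth identity
      np.cos(theta / 2) / np.sqrt(2 * np.pi) * u / np.linalg.norm(u))
```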

Proof

The third identity was proved in Lemma A.1 of [10]. To show the first one, since the Gaussian distribution is rotation-invariant, we may assume without loss of generality that \(\mathbf {w}= [w_1,0,\mathbf {0}^\top ]^\top \) with \(w_1> 0\); then \(\mathbb {E}\left[ 1_{\{\mathbf {z}^\top \mathbf {w}> 0\}}\right] = \mathbb {P}(z_1>0) = \frac{1}{2}\).

We further assume \({\tilde{\mathbf {w}}}= [{\tilde{w}}_1,{\tilde{w}}_2,\mathbf {0}^\top ]^\top \). It is easy to see

$$\begin{aligned} \mathbb {E}\left[ 1_{\{\mathbf {z}^\top \mathbf {w}> 0, \, \mathbf {z}^\top {\tilde{\mathbf {w}}}>0\}}\right] = \mathbb {P}(\mathbf {z}^\top \mathbf {w}> 0, \, \mathbf {z}^\top {\tilde{\mathbf {w}}}>0) = \frac{\pi - \theta }{2\pi }, \end{aligned}$$

which is the probability that \(\mathbf {z}\) forms an acute angle with both \(\mathbf {w}\) and \({\tilde{\mathbf {w}}}\).

To prove the last identity, we use the polar representation of the 2-D Gaussian vector \((z_1,z_2)\), where r is the radius and \(\phi \) is the angle, with \(\mathrm {d}\mathbb {P}_r = r \exp (-r^2/2)\mathrm {d}r\) and \(\mathrm {d}\mathbb {P}_\phi = \frac{1}{2\pi }\mathrm {d}\phi \). Then, \(\mathbb {E}\left[ z_i 1_{\{\mathbf {z}^\top \mathbf {w}>0, \, \mathbf {z}^\top {\tilde{\mathbf {w}}} >0\}}\right] = 0\) for \(i\ge 3\). Moreover,

$$\begin{aligned} \mathbb {E}\left[ z_1 1_{\{\mathbf {z}^\top \mathbf {w}>0, \, \mathbf {z}^\top {\tilde{\mathbf {w}}} >0\}}\right] = \frac{1}{2\pi }\int _{0}^\infty r^2\exp \left( -\frac{r^2}{2}\right) \mathrm {d}r \int _{-\frac{\pi }{2}+\theta }^{\frac{\pi }{2}} \cos (\phi ) \mathrm {d}\phi = \frac{1+\cos (\theta )}{2\sqrt{2\pi }} \end{aligned}$$

and

$$\begin{aligned} \mathbb {E}\left[ z_2 1_{\{\mathbf {z}^\top \mathbf {w}>0, \, \mathbf {z}^\top {\tilde{\mathbf {w}}} >0\}}\right] = \frac{1}{2\pi }\int _{0}^\infty r^2\exp \left( -\frac{r^2}{2}\right) \mathrm {d}r \int _{-\frac{\pi }{2}+\theta }^{\frac{\pi }{2}} \sin (\phi ) \mathrm {d}\phi = \frac{\sin (\theta )}{2\sqrt{2\pi }}. \end{aligned}$$
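
Here the elementary integrals evaluate to

$$\begin{aligned} \int _{0}^\infty r^2\exp \left( -\frac{r^2}{2}\right) \mathrm {d}r = \sqrt{\frac{\pi }{2}}, \quad \int _{-\frac{\pi }{2}+\theta }^{\frac{\pi }{2}} \cos (\phi ) \mathrm {d}\phi = 1+\cos (\theta ), \quad \int _{-\frac{\pi }{2}+\theta }^{\frac{\pi }{2}} \sin (\phi ) \mathrm {d}\phi = \sin (\theta ), \end{aligned}$$

and \(\frac{1}{2\pi }\sqrt{\frac{\pi }{2}} = \frac{1}{2\sqrt{2\pi }}\), which gives the two right-hand sides above.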

Therefore,

$$\begin{aligned} \mathbb {E}\left[ \mathbf {z}1_{\{\mathbf {z}^\top \mathbf {w}>0, \, \mathbf {z}^\top {\tilde{\mathbf {w}}} >0\}}\right] = \frac{\cos (\theta /2)}{\sqrt{2\pi }}[\cos (\theta /2), \sin (\theta /2),\mathbf {0}^\top ]^\top = \frac{\cos (\theta /2)}{\sqrt{2\pi }} \frac{\frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } + \frac{{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert } }{\left\| \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } + \frac{{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert } \right\| }, \end{aligned}$$

where the last equality holds because \(\frac{\mathbf {w}}{\Vert \mathbf {w}\Vert }\) and \(\frac{{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert }\) are two unit-normed vectors with angle \(\theta \). \(\square \)

Lemma 7

For any nonzero vectors \(\mathbf {w}\) and \({\tilde{\mathbf {w}}}\) with \(\Vert {\tilde{\mathbf {w}}}\Vert \ge \Vert \mathbf {w}\Vert = c>0\), we have

  1. \(|\theta (\mathbf {w},\mathbf {w}^*)-\theta ({\tilde{\mathbf {w}}},\mathbf {w}^*)|\le \frac{\pi }{2c}\Vert \mathbf {w}- {\tilde{\mathbf {w}}}\Vert \).

  2. \(\left\| \frac{1}{\Vert \mathbf {w}\Vert } \frac{\Big (\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\Big )\mathbf {w}^*}{\Big \Vert \Big (\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\Big )\mathbf {w}^*\Big \Vert } - \frac{1}{\Vert {\tilde{\mathbf {w}}}\Vert } \frac{\Big (\mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\Big )\mathbf {w}^*}{\Big \Vert \Big (\mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\Big )\mathbf {w}^*\Big \Vert } \right\| \le \frac{1}{c^2}\Vert \mathbf {w}- {\tilde{\mathbf {w}}}\Vert \).

Proof

1. Since, by the Cauchy–Schwarz inequality,

$$\begin{aligned} \left\langle {\tilde{\mathbf {w}}}, \mathbf {w}- \frac{c{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert }\right\rangle = {\tilde{\mathbf {w}}}^\top \mathbf {w}- c\Vert {\tilde{\mathbf {w}}}\Vert \le 0, \end{aligned}$$

we have

$$\begin{aligned} \Vert {\tilde{\mathbf {w}}}- \mathbf {w}\Vert ^2 =&\; \left\| \left( 1-\frac{c}{\Vert {\tilde{\mathbf {w}}}\Vert } \right) {\tilde{\mathbf {w}}}- \left( \mathbf {w}-\frac{c{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert } \right) \right\| ^2 \ge \left\| \left( 1-\frac{c}{\Vert {\tilde{\mathbf {w}}}\Vert } \right) {\tilde{\mathbf {w}}}\right\| ^2 + \left\| \mathbf {w}-\frac{c{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert } \right\| ^2 \nonumber \\ \ge&\; \left\| \mathbf {w}-\frac{c{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert } \right\| ^2 = c^2 \left\| \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } - \frac{{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert }\right\| ^2. \end{aligned}$$
(24)

Therefore,

$$\begin{aligned} \; |\theta (\mathbf {w},\mathbf {w}^*)-\theta ({\tilde{\mathbf {w}}},\mathbf {w}^*)|&\le \theta (\mathbf {w},{\tilde{\mathbf {w}}}) = \theta \left( \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert },\frac{{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert }\right) \\&\le \; \pi \sin \left( \frac{\theta \left( \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert },\frac{{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert }\right) }{2}\right) = \frac{\pi }{2}\left\| \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } - \frac{{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert }\right\| \le \frac{\pi }{2c}\Vert \mathbf {w}- {\tilde{\mathbf {w}}}\Vert , \end{aligned}$$

where we used the fact \(\sin (x)\ge \frac{2x}{\pi }\) for \(x\in [0,\frac{\pi }{2}]\) and the estimate in (24).

2. Since \(\Big (\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\Big )\mathbf {w}^*\) is the projection of \(\mathbf {w}^*\) onto the complement space of \(\mathbf {w}\) and likewise for \(\Big (\mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\Big )\mathbf {w}^*\), the angle between \(\Big (\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\Big )\mathbf {w}^*\) and \(\Big (\mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\Big )\mathbf {w}^*\) is equal to the angle between \(\mathbf {w}\) and \({\tilde{\mathbf {w}}}\). Therefore,

$$\begin{aligned} \left\langle \frac{\left( \mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\right) \mathbf {w}^*}{\left\| \left( \mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\right) \mathbf {w}^*\right\| } , \frac{\left( \mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\right) \mathbf {w}^*}{\left\| \left( \mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\right) \mathbf {w}^*\right\| } \right\rangle = \left\langle \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } , \frac{{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert } \right\rangle , \end{aligned}$$

and thus

$$\begin{aligned} \left\| \frac{1}{\Vert \mathbf {w}\Vert } \frac{\left( \mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\right) \mathbf {w}^*}{\left\| \left( \mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\right) \mathbf {w}^*\right\| } - \frac{1}{\Vert {\tilde{\mathbf {w}}}\Vert } \frac{\left( \mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\right) \mathbf {w}^*}{\left\| \left( \mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\right) \mathbf {w}^*\right\| } \right\|&= \left\| \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert ^2} - \frac{{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert ^2} \right\| \\&= \frac{\Vert \mathbf {w}- {\tilde{\mathbf {w}}}\Vert }{\Vert \mathbf {w}\Vert \Vert {\tilde{\mathbf {w}}}\Vert }\le \frac{1}{c^2}\Vert \mathbf {w}- {\tilde{\mathbf {w}}}\Vert . \end{aligned}$$

The second equality above holds because

$$\begin{aligned} \left\| \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert ^2} - \frac{{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert ^2} \right\| ^2 = \frac{1}{\Vert \mathbf {w}\Vert ^2} + \frac{1}{\Vert {\tilde{\mathbf {w}}}\Vert ^2} - \frac{2\langle \mathbf {w}, {\tilde{\mathbf {w}}}\rangle }{\Vert \mathbf {w}\Vert ^2 \Vert {\tilde{\mathbf {w}}}\Vert ^2} = \frac{\Vert \mathbf {w}- {\tilde{\mathbf {w}}}\Vert ^2}{\Vert \mathbf {w}\Vert ^2 \Vert {\tilde{\mathbf {w}}}\Vert ^2}. \end{aligned}$$

\(\square \)

B. Proofs

Proof of Proposition 1

We rewrite the update (11) as

$$\begin{aligned} \mathbf {w}^{t+1} = \arg \min _{\mathbf {w}\in \mathcal {Q}} \; \langle \mathbf {w}, \nabla f(\mathbf {w}^t) \rangle + \frac{1-\rho }{2\eta } \Vert \mathbf {w}-\mathbf {w}_f^t\Vert ^2 + \frac{\rho }{2\eta } \Vert \mathbf {w}-\mathbf {w}^t\Vert ^2 . \end{aligned}$$
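
Completing the square, the objective on the right-hand side equals \(\frac{1}{2\eta }\left\| \mathbf {w}- \left( (1-\rho )\mathbf {w}_f^t + \rho \mathbf {w}^t - \eta \nabla f(\mathbf {w}^t)\right) \right\| ^2\) up to an additive constant independent of \(\mathbf {w}\), so \(\mathbf {w}^{t+1}\) is a point of \(\mathcal {Q}\) closest to the blended gradient step \((1-\rho )\mathbf {w}_f^t + \rho \mathbf {w}^t - \eta \nabla f(\mathbf {w}^t)\).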

Since \(\mathbf {w}^t, \, \mathbf {w}^{t+1} \in \mathcal {Q}\), we have

$$\begin{aligned}&\langle \mathbf {w}^{t+1}, \nabla f(\mathbf {w}^t) \rangle + \frac{1-\rho }{2\eta } \Vert \mathbf {w}^{t+1}-\mathbf {w}_f^t\Vert ^2 + \frac{\rho }{2\eta } \Vert \mathbf {w}^{t+1}-\mathbf {w}^t\Vert ^2 \\&\quad \le \langle \mathbf {w}^t, \nabla f(\mathbf {w}^t) \rangle + \frac{1-\rho }{2\eta } \Vert \mathbf {w}^t-\mathbf {w}_f^t\Vert ^2, \end{aligned}$$

or equivalently,

$$\begin{aligned}&\langle \mathbf {w}^{t+1}-\mathbf {w}^t, \nabla f(\mathbf {w}^t) \rangle + \frac{1-\rho }{2\eta } \left( \left\| \mathbf {w}^{t+1}-\mathbf {w}_f^t \right\| ^2- \left\| \mathbf {w}^t-\mathbf {w}_f^t \right\| ^2 \right) \nonumber \\&\quad + \frac{\rho }{2\eta }\Vert \mathbf {w}^{t+1}-\mathbf {w}^t\Vert ^2 \le 0. \end{aligned}$$
(25)

On the other hand, since f has L-Lipschitz gradient, the descent lemma [2] gives

$$\begin{aligned} f(\mathbf {w}^{t+1})\le f(\mathbf {w}^t) + \langle \nabla f(\mathbf {w}^t), \mathbf {w}^{t+1}-\mathbf {w}^t \rangle + \frac{L}{2}\Vert \mathbf {w}^{t+1}-\mathbf {w}^t\Vert ^2. \end{aligned}$$
(26)
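
Substituting the bound on \(\langle \nabla f(\mathbf {w}^t), \mathbf {w}^{t+1}-\mathbf {w}^t \rangle \) from (25) into (26) yields

$$\begin{aligned} f(\mathbf {w}^{t+1})\le f(\mathbf {w}^t) + \frac{1-\rho }{2\eta }\left( \Vert \mathbf {w}^{t}-\mathbf {w}_f^t\Vert ^2 - \Vert \mathbf {w}^{t+1}-\mathbf {w}_f^t\Vert ^2 \right) - \left( \frac{\rho }{2\eta } - \frac{L}{2}\right) \Vert \mathbf {w}^{t+1}-\mathbf {w}^t\Vert ^2, \end{aligned}$$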

which completes the proof. \(\square \)

Proof of Lemma 1

We first evaluate \(\mathbb {E}_\mathbf {Z}\left[ \sigma (\mathbf {Z}\mathbf {w})\sigma (\mathbf {Z}\mathbf {w})^\top \right] \), \(\mathbb {E}_\mathbf {Z}\left[ \sigma (\mathbf {Z}\mathbf {w})\sigma (\mathbf {Z}\mathbf {w}^*)^\top \right] \) and \(\mathbb {E}_\mathbf {Z}\left[ \sigma (\mathbf {Z}\mathbf {w}^*)\sigma (\mathbf {Z}\mathbf {w}^*)^\top \right] \). Let \(\mathbf {Z}_i^\top \) be the i-th row vector of \(\mathbf {Z}\). Since \(\mathbf {w}\ne \mathbf {0}\), using Lemma 6, we have

$$\begin{aligned} \mathbb {E}_\mathbf {Z}\left[ \sigma (\mathbf {Z}\mathbf {w})\sigma (\mathbf {Z}\mathbf {w})^\top \right] _{ii} = \mathbb {E}\left[ \sigma (\mathbf {Z}_i^\top \mathbf {w})\sigma (\mathbf {Z}_i^\top \mathbf {w})\right] = \mathbb {E}\left[ 1_{\{\mathbf {Z}_i^\top \mathbf {w}> 0\}}\right] = \frac{1}{2}, \end{aligned}$$

and for \(i\ne j\),

$$\begin{aligned} \mathbb {E}_\mathbf {Z}\left[ \sigma (\mathbf {Z}\mathbf {w})\sigma (\mathbf {Z}\mathbf {w})^\top \right] _{ij} = \mathbb {E}\left[ \sigma (\mathbf {Z}_i^\top \mathbf {w})\sigma (\mathbf {Z}_j^\top \mathbf {w})\right] = \mathbb {E}\left[ 1_{\{\mathbf {Z}_i^\top \mathbf {w}> 0\}}\right] \mathbb {E}\left[ 1_{\{\mathbf {Z}_j^\top \mathbf {w}> 0\}}\right] = \frac{1}{4}. \end{aligned}$$

Therefore, \(\mathbb {E}_\mathbf {Z}\left[ \sigma (\mathbf {Z}\mathbf {w})\sigma (\mathbf {Z}\mathbf {w})^\top \right] =\mathbb {E}_\mathbf {Z}\left[ \sigma (\mathbf {Z}\mathbf {w}^*)\sigma (\mathbf {Z}\mathbf {w}^*)^\top \right] = \frac{1}{4}\left( \mathbf {I}+ \mathbf {1}\mathbf {1}^\top \right) \). Furthermore,

$$\begin{aligned} \mathbb {E}_\mathbf {Z}\left[ \sigma (\mathbf {Z}\mathbf {w})\sigma (\mathbf {Z}\mathbf {w}^*)^\top \right] _{ii} = \mathbb {E}\left[ 1_{\{\mathbf {Z}_i^\top \mathbf {w}> 0, \mathbf {Z}_i^\top \mathbf {w}^*> 0\}}\right] = \frac{\pi -\theta (\mathbf {w},\mathbf {w}^*)}{2\pi }, \end{aligned}$$

and \(\mathbb {E}_\mathbf {Z}\left[ \sigma (\mathbf {Z}\mathbf {w})\sigma (\mathbf {Z}\mathbf {w}^*)^\top \right] _{ij}=\frac{1}{4}\) for \(i\ne j\). So,

$$\begin{aligned} \mathbb {E}_\mathbf {Z}\left[ \sigma (\mathbf {Z}\mathbf {w})\sigma (\mathbf {Z}\mathbf {w}^*)^\top \right] = \frac{1}{4}\left( \left( 1-\frac{2\theta (\mathbf {w},\mathbf {w}^*)}{\pi }\right) \mathbf {I}+ \mathbf {1}\mathbf {1}^\top \right) . \end{aligned}$$

We thus have proved (14) by noticing that

$$\begin{aligned} f(\mathbf {v},\mathbf {w}) =&\; \frac{1}{2}\left( \mathbf {v}^\top \mathbb {E}_\mathbf {Z}[\sigma (\mathbf {Z}\mathbf {w})\sigma (\mathbf {Z}\mathbf {w})^\top ]\mathbf {v}- 2\mathbf {v}^\top \mathbb {E}_\mathbf {Z}[\sigma (\mathbf {Z}\mathbf {w})\sigma (\mathbf {Z}\mathbf {w}^*)^\top ]\mathbf {v}^* \right. \\&\left. +(\mathbf {v}^*)^\top \mathbb {E}_\mathbf {Z}[\sigma (\mathbf {Z}\mathbf {w}^*)\sigma (\mathbf {Z}\mathbf {w}^*)^\top ]\mathbf {v}^*\right) . \end{aligned}$$

Next, since (15) is trivial, we only show (16). Since \(\theta (\mathbf {w},\mathbf {w}^*) = \arccos \left( \frac{\mathbf {w}^\top \mathbf {w}^*}{\Vert \mathbf {w}\Vert }\right) \) is differentiable w.r.t. \(\mathbf {w}\) whenever \(\theta (\mathbf {w},\mathbf {w}^*)\in (0,\pi )\), we have

$$\begin{aligned} \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v},\mathbf {w})= & {} \frac{\mathbf {v}^\top \mathbf {v}^*}{2\pi }\frac{\partial \theta }{\partial \mathbf {w}}(\mathbf {w},\mathbf {w}^*) = -\frac{\mathbf {v}^\top \mathbf {v}^*}{2\pi }\frac{\Vert \mathbf {w}\Vert ^2\mathbf {w}^* - (\mathbf {w}^\top \mathbf {w}^*)\mathbf {w}}{\Vert \mathbf {w}\Vert ^3\sqrt{1-\frac{(\mathbf {w}^\top \mathbf {w}^*)^2}{\Vert \mathbf {w}\Vert ^2}}} \\= & {} -\frac{\mathbf {v}^\top \mathbf {v}^*}{2\pi \Vert \mathbf {w}\Vert }\frac{\Big (\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\Big )\mathbf {w}^*}{\Big \Vert \Big (\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\Big )\mathbf {w}^*\Big \Vert }. \end{aligned}$$

\(\square \)
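
The matrix identities derived in this proof can be sanity-checked by sampling. A minimal NumPy sketch (test vectors are arbitrary choices of ours, with \(\sigma (x)=1_{\{x>0\}}\) and \(\Vert \mathbf {w}^*\Vert =1\)):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 4
w = rng.standard_normal(n)
w_star = rng.standard_normal(n); w_star /= np.linalg.norm(w_star)
theta = np.arccos(np.clip(w @ w_star / np.linalg.norm(w), -1.0, 1.0))

acc, N = np.zeros((m, m)), 200000
for _ in range(N):
    Z = rng.standard_normal((m, n))
    s, s_star = (Z @ w > 0).astype(float), (Z @ w_star > 0).astype(float)
    acc += np.outer(s, s_star)                 # sigma(Zw) sigma(Zw*)^T for one sample
print(acc / N)                                 # Monte Carlo estimate
print(0.25 * ((1 - 2 * theta / np.pi) * np.eye(m) + np.ones((m, m))))
```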

Proof of Proposition 2

Suppose \(\mathbf {v}^\top \mathbf {v}^*=0\) and \(\frac{\partial f}{\partial \mathbf {v}}(\mathbf {v},\mathbf {w}) = \mathbf {0}\), then by Lemma 1,

$$\begin{aligned} 0 = \mathbf {v}^\top \mathbf {v}^* = (\mathbf {v}^*)^\top (\mathbf {I}+ \mathbf {1}\mathbf {1}^\top )^{-1}\left( \left( 1- \frac{2}{\pi }\theta (\mathbf {w},\mathbf {w}^*)\right) \mathbf {I}+ \mathbf {1}\mathbf {1}^\top \right) \mathbf {v}^*. \end{aligned}$$
(27)

From (27), it follows that

$$\begin{aligned} \frac{2}{\pi }\theta (\mathbf {w},\mathbf {w}^*) (\mathbf {v}^*)^\top (\mathbf {I}+ \mathbf {1}\mathbf {1}^\top )^{-1} \mathbf {v}^*= (\mathbf {v}^*)^\top (\mathbf {I}+ \mathbf {1}\mathbf {1}^\top )^{-1}\left( \mathbf {I}+ \mathbf {1}\mathbf {1}^\top \right) \mathbf {v}^* = \Vert \mathbf {v}^*\Vert ^2. \end{aligned}$$
(28)

On the other hand, from (27) it also follows that

$$\begin{aligned} \left( \frac{2}{\pi }\theta (\mathbf {w},\mathbf {w}^*)-1\right) (\mathbf {v}^*)^\top (\mathbf {I}+ \mathbf {1}\mathbf {1}^\top )^{-1} \mathbf {v}^* = (\mathbf {v}^*)^\top (\mathbf {I}+ \mathbf {1}\mathbf {1}^\top )^{-1} \mathbf {1}(\mathbf {1}^\top \mathbf {v}^*) = \frac{(\mathbf {1}^\top \mathbf {v}^*)^2}{m+1}, \end{aligned}$$

where \(\mathbf {I}\) is an m-by-m identity matrix, and we used \((\mathbf {I}+ \mathbf {1}\mathbf {1}^\top ) \mathbf {1} = (m+1)\mathbf {1}\). Taking the difference of the two equalities above gives

$$\begin{aligned} (\mathbf {v}^*)^\top (\mathbf {I}+ \mathbf {1}\mathbf {1}^\top )^{-1}\mathbf {v}^* = \Vert \mathbf {v}^*\Vert ^2 - \frac{(\mathbf {1}^\top \mathbf {v}^*)^2}{m+1}. \end{aligned}$$

By (28), we have \(\theta (\mathbf {w},\mathbf {w}^*) = \frac{\pi }{2}\frac{(m+1)\Vert \mathbf {v}^*\Vert ^2}{(m+1)\Vert \mathbf {v}^*\Vert ^2 - (\mathbf {1}^\top \mathbf {v}^*)^2}\), which requires

$$\begin{aligned} \frac{\pi }{2}\frac{(m+1)\Vert \mathbf {v}^*\Vert ^2}{(m+1)\Vert \mathbf {v}^*\Vert ^2 - (\mathbf {1}^\top \mathbf {v}^*)^2}<\pi , \; \text{ or } \text{ equivalently, } \; (\mathbf {1}^\top \mathbf {v}^*)^2 < \frac{m+1}{2}\Vert \mathbf {v}^*\Vert ^2. \end{aligned}$$

Otherwise, \(\frac{\partial f}{\partial \mathbf {v}}(\mathbf {v},\mathbf {w})\) and \(\frac{\partial f}{\partial \mathbf {w}}(\mathbf {v},\mathbf {w})\) do not vanish simultaneously, and there is no critical point. \(\square \)
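
For instance, with \(m=2\) and \(\mathbf {v}^*=(1,-1)^\top \) we have \(\mathbf {1}^\top \mathbf {v}^*=0\), so the condition \((\mathbf {1}^\top \mathbf {v}^*)^2 < \frac{m+1}{2}\Vert \mathbf {v}^*\Vert ^2\) holds and the corresponding angle is \(\theta (\mathbf {w},\mathbf {w}^*) = \frac{\pi }{2}\); with \(\mathbf {v}^*=(1,1)^\top \) instead, \((\mathbf {1}^\top \mathbf {v}^*)^2 = 4 > \frac{m+1}{2}\Vert \mathbf {v}^*\Vert ^2 = 3\), and no such critical point exists.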

Proof of Lemma 2

It is easy to check that \(\Vert \mathbf {I}+ \mathbf {1}\mathbf {1}^\top \Vert = m+1\). Invoking part 1 of Lemma 7 gives

$$\begin{aligned} \left\| \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v},\mathbf {w}) - \frac{\partial f}{\partial \mathbf {v}}({\tilde{\mathbf {v}}},{\tilde{\mathbf {w}}}) \right\| =&\; \frac{1}{4}\left\| \big (\mathbf {I}+ \mathbf {1}\mathbf {1}^\top \big )(\mathbf {v}-{\tilde{\mathbf {v}}}) + \frac{2}{\pi }(\theta (\mathbf {w},\mathbf {w}^*) - \theta ({\tilde{\mathbf {w}}},\mathbf {w}^*) )\mathbf {v}^* \right\| \\ \le&\; \frac{1}{4}\left( (m+1)\Vert \mathbf {v}- {\tilde{\mathbf {v}}}\Vert + \frac{2\Vert \mathbf {v}^*\Vert }{\pi } |\theta (\mathbf {w},\mathbf {w}^*) - \theta ({\tilde{\mathbf {w}}},\mathbf {w}^*)|\right) \\ \le&\; \frac{1}{4}\left( (m+1)\Vert \mathbf {v}- {\tilde{\mathbf {v}}}\Vert + \frac{\Vert \mathbf {v}^*\Vert }{c} \left\| \mathbf {w}- {\tilde{\mathbf {w}}}\right\| \right) \\ \le&\; \frac{1}{4}\left( m+1 + \frac{\Vert \mathbf {v}^*\Vert }{c} \right) \Vert (\mathbf {v}, \mathbf {w}) - ({\tilde{\mathbf {v}}}, {\tilde{\mathbf {w}}})\Vert . \end{aligned}$$

Using part 2 of Lemma 7, we further have

$$\begin{aligned} \left\| \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v},\mathbf {w}) - \frac{\partial f}{\partial \mathbf {w}}({\tilde{\mathbf {v}}},{\tilde{\mathbf {w}}}) \right\| =&\; \left\| \frac{\mathbf {v}^\top \mathbf {v}^*}{2\pi \Vert \mathbf {w}\Vert } \frac{\Big (\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\Big )\mathbf {w}^*}{\Big \Vert \Big (\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\Big )\mathbf {w}^*\Big \Vert } - \frac{{\tilde{\mathbf {v}}}^\top \mathbf {v}^*}{2\pi \Vert {\tilde{\mathbf {w}}}\Vert } \frac{\Big (\mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\Big )\mathbf {w}^*}{\Big \Vert \Big (\mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\Big )\mathbf {w}^*\Big \Vert } \right\| \\ \le&\; \left\| \frac{\mathbf {v}^\top \mathbf {v}^*}{2\pi \Vert \mathbf {w}\Vert } \frac{\Big (\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\Big )\mathbf {w}^*}{\Big \Vert \Big (\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\Big )\mathbf {w}^*\Big \Vert } - \frac{\mathbf {v}^\top \mathbf {v}^*}{2\pi \Vert {\tilde{\mathbf {w}}}\Vert } \frac{\Big (\mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\Big )\mathbf {w}^*}{\Big \Vert \Big (\mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\Big )\mathbf {w}^*\Big \Vert } \right\| \\&\; + \left\| \frac{\mathbf {v}^\top \mathbf {v}^*}{2\pi \Vert {\tilde{\mathbf {w}}}\Vert } \frac{\Big (\mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\Big )\mathbf {w}^*}{\Big \Vert \Big (\mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\Big )\mathbf {w}^*\Big \Vert } - \frac{{\tilde{\mathbf {v}}}^\top \mathbf {v}^*}{2\pi \Vert {\tilde{\mathbf {w}}}\Vert } \frac{\Big (\mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\Big )\mathbf {w}^*}{\Big \Vert \Big (\mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\Big )\mathbf {w}^*\Big \Vert } \right\| \\ \le&\; \frac{|\mathbf {v}^\top \mathbf {v}^*|}{2 \pi c^2 }\Vert \mathbf {w}-{\tilde{\mathbf {w}}}\Vert + \frac{\Vert \mathbf {v}^*\Vert }{2\pi c}\Vert \mathbf {v}-{\tilde{\mathbf {v}}}\Vert \\ \le&\; \frac{(C+c)\Vert \mathbf {v}^*\Vert }{2\pi c^2}\Vert (\mathbf {v}, \mathbf {w}) - ({\tilde{\mathbf {v}}}, {\tilde{\mathbf {w}}})\Vert . \end{aligned}$$

Combining the two inequalities above validates the claim. \(\square \)

Proof of Lemma 3

Equation (22) is true because \(\frac{\partial \ell }{\partial \mathbf {v}}(\mathbf {v},\mathbf {w};\mathbf {Z})\) is linear in \(\mathbf {v}\). To show (23), by (20) and the fact that \(\mu ^{\prime } = \sigma \), we have

$$\begin{aligned} \mathbb {E}_\mathbf {Z}\left[ \mathbf {g}(\mathbf {v},\mathbf {w};\mathbf {Z})\right] =&\; \mathbb {E}_\mathbf {Z}\left[ \left( \sum _{i=1}^m v_i \sigma (\mathbf {Z}^\top _i\mathbf {w}) - \sum _{i=1}^m v^*_i\sigma (\mathbf {Z}^\top _i\mathbf {w}^*) \right) \left( \sum _{i=1}^m \mathbf {Z}_i v_i \sigma (\mathbf {Z}^\top _i\mathbf {w}) \right) \right] \\ =&\; \mathbb {E}_\mathbf {Z}\left[ \left( \sum _{i=1}^m v_i 1_{\{\mathbf {Z}^\top _i\mathbf {w}>0\}} - \sum _{i=1}^m v^*_i1_{\{\mathbf {Z}^\top _i\mathbf {w}^*>0\}} \right) \left( \sum _{i=1}^m 1_{\{\mathbf {Z}^\top _i\mathbf {w}>0\}} v_i\mathbf {Z}_i \right) \right] . \end{aligned}$$

Invoking Lemma 6 and using the independence of the rows \(\mathbf {Z}_i\) of \(\mathbf {Z}\) (for \(i\ne j\), the expectation factors into a product of single-row expectations), we have

$$\begin{aligned} \mathbb {E}\left[ \mathbf {Z}_i 1_{\{\mathbf {Z}_i^\top \mathbf {w}>0, \mathbf {Z}_j^\top \mathbf {w}>0\}}\right] = {\left\{ \begin{array}{ll} \frac{1}{\sqrt{2\pi }} \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } &{} \text{ if } i=j, \\ \frac{1}{2\sqrt{2\pi }} \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } &{} \text{ if } i\ne j, \end{array}\right. } \end{aligned}$$
(29)

and

$$\begin{aligned} \mathbb {E}\left[ \mathbf {Z}_i 1_{\{\mathbf {Z}_i^\top \mathbf {w}>0, \mathbf {Z}_j^\top \mathbf {w}^* >0\}}\right] = {\left\{ \begin{array}{ll} \frac{\cos (\theta (\mathbf {w},\mathbf {w}^*)/2)}{\sqrt{2\pi }} \frac{\frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } + \mathbf {w}^* }{\left\| \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } + \mathbf {w}^* \right\| } &{} \text{ if } i=j, \\ \frac{1}{2\sqrt{2\pi }} \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } &{} \text{ if } i\ne j. \end{array}\right. } \end{aligned}$$
(30)

Therefore,

$$\begin{aligned} \mathbb {E}_\mathbf {Z}\left[ \mathbf {g}(\mathbf {v},\mathbf {w};\mathbf {Z})\right] =&\;\sum _{i=1}^m v_i^2 \mathbb {E}\left[ \mathbf {Z}_i 1_{\{\mathbf {Z}_i^\top \mathbf {w}>0\}}\right] + \sum _{i=1}^m \sum _{\overset{j=1}{j\ne i}}^m v_i v_j \mathbb {E}\left[ \mathbf {Z}_i 1_{\{\mathbf {Z}_i^\top \mathbf {w}>0, \mathbf {Z}_j^\top \mathbf {w}>0\}}\right] \\&\; - \sum _{i=1}^m v_i v_i^* \mathbb {E}\left[ \mathbf {Z}_i 1_{\{\mathbf {Z}_i^\top \mathbf {w}>0, \mathbf {Z}_i^\top \mathbf {w}^*>0\}}\right] \\&- \sum _{i=1}^m \sum _{\overset{j=1}{j\ne i}}^m v_i v_j^* \mathbb {E}\left[ \mathbf {Z}_i 1_{\{\mathbf {Z}_i^\top \mathbf {w}>0, \mathbf {Z}_j^\top \mathbf {w}^*>0\}}\right] \\ =&\; \frac{1}{2\sqrt{2\pi }}\left( \Vert \mathbf {v}\Vert ^2 + (\mathbf {1}^\top \mathbf {v})^2 \right) \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert }\\&-\cos \left( \frac{\theta (\mathbf {w},\mathbf {w}^*)}{2}\right) \frac{\mathbf {v}^\top \mathbf {v}^*}{\sqrt{2\pi }} \frac{\frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } + \mathbf {w}^* }{\left\| \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } + \mathbf {w}^* \right\| } \\&\; - \frac{1}{2\sqrt{2\pi }}\left( (\mathbf {1}^\top \mathbf {v})(\mathbf {1}^\top \mathbf {v}^*) - \mathbf {v}^\top \mathbf {v}^* \right) \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert }, \end{aligned}$$

which is exactly (23). \(\square \)
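
The closed-form expectation can be compared against direct sampling of \(\mathbf {g}(\mathbf {v},\mathbf {w};\mathbf {Z})\) as written at the beginning of this proof. A minimal NumPy sketch (test values are arbitrary choices of ours, with \(\sigma (x)=1_{\{x>0\}}\) and \(\Vert \mathbf {w}^*\Vert =1\)):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 5
v, v_star = rng.standard_normal(m), rng.standard_normal(m)
w = rng.standard_normal(n)
w_star = rng.standard_normal(n); w_star /= np.linalg.norm(w_star)

def g_sample(Z):
    """One sample of g(v, w; Z) with sigma(x) = 1_{x > 0}."""
    act, act_star = (Z @ w > 0).astype(float), (Z @ w_star > 0).astype(float)
    return (v @ act - v_star @ act_star) * (Z.T @ (v * act))

estimate = np.mean([g_sample(rng.standard_normal((m, n))) for _ in range(200000)], axis=0)

# closed form from the last display of the proof (recall ||w*|| = 1)
theta = np.arccos(np.clip(w @ w_star / np.linalg.norm(w), -1.0, 1.0))
u = w / np.linalg.norm(w) + w_star
closed = ((np.linalg.norm(v)**2 + v.sum()**2) / (2 * np.sqrt(2 * np.pi)) * w / np.linalg.norm(w)
          - np.cos(theta / 2) * (v @ v_star) / np.sqrt(2 * np.pi) * u / np.linalg.norm(u)
          - (v.sum() * v_star.sum() - v @ v_star) / (2 * np.sqrt(2 * np.pi)) * w / np.linalg.norm(w))
print(estimate)
print(closed)
```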

Proof of Lemma 4

Notice that \((\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2})\mathbf {w}= \mathbf {0}\) and \(\Vert \mathbf {w}^*\Vert = 1\). If \(\theta (\mathbf {w},\mathbf {w}^*)\ne 0, \pi \), then we have

$$\begin{aligned}&\; \left\langle \mathbb {E}_\mathbf {Z}\Big [\mathbf {g}(\mathbf {v},\mathbf {w}; \mathbf {Z})\Big ], \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v},\mathbf {w}) \right\rangle \\&\quad = \; \cos \left( \frac{\theta (\mathbf {w},\mathbf {w}^*)}{2}\right) \frac{(\mathbf {v}^{\top }\mathbf {v}^*)^2}{(\sqrt{2\pi })^3} \left\langle \frac{1}{\Vert \mathbf {w}\Vert } \frac{\Big (\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\Big )\mathbf {w}^*}{\Big \Vert \Big (\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\Big )\mathbf {w}^*\Big \Vert } , \frac{\mathbf {w}^*}{\left\| \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } + \mathbf {w}^*\right\| } \right\rangle \\&\quad = \; \cos \left( \frac{\theta (\mathbf {w},\mathbf {w}^*)}{2}\right) \frac{(\mathbf {v}^{\top }\mathbf {v}^*)^2}{(\sqrt{2\pi })^3} \frac{\Vert \mathbf {w}\Vert ^2 - (\mathbf {w}^\top \mathbf {w}^*)^2}{\Vert \Vert \mathbf {w}\Vert ^2\mathbf {w}^* - \mathbf {w}(\mathbf {w}^\top \mathbf {w}^*)\Vert \, \Vert \mathbf {w}+\Vert \mathbf {w}\Vert \mathbf {w}^*\Vert } \\&\quad = \; \cos \left( \frac{\theta (\mathbf {w},\mathbf {w}^*)}{2}\right) \frac{(\mathbf {v}^{\top }\mathbf {v}^*)^2}{(\sqrt{2\pi })^3} \frac{\Vert \mathbf {w}\Vert ^2 - (\mathbf {w}^\top \mathbf {w}^*)^2}{\sqrt{\Vert \mathbf {w}\Vert ^4 -\Vert \mathbf {w}\Vert ^2(\mathbf {w}^\top \mathbf {w}^*)^2} \sqrt{2(\Vert \mathbf {w}\Vert ^2+ \Vert \mathbf {w}\Vert (\mathbf {w}^\top \mathbf {w}^*))}} \\&\quad = \; \cos \left( \frac{\theta (\mathbf {w},\mathbf {w}^*)}{2}\right) \frac{(\mathbf {v}^{\top }\mathbf {v}^*)^2}{4(\sqrt{\pi \Vert \mathbf {w}\Vert })^3} \frac{\Vert \mathbf {w}\Vert ^2 - (\mathbf {w}^\top \mathbf {w}^*)^2}{\sqrt{\Vert \mathbf {w}\Vert ^2 -(\mathbf {w}^\top \mathbf {w}^*)^2} \sqrt{\Vert \mathbf {w}\Vert + (\mathbf {w}^\top \mathbf {w}^*)}} \\&\quad = \; \cos \left( \frac{\theta (\mathbf {w},\mathbf {w}^*)}{2}\right) \frac{(\mathbf {v}^{\top }\mathbf {v}^*)^2\sqrt{1-\frac{\mathbf {w}^\top \mathbf {w}^*}{\Vert \mathbf {w}\Vert }}}{4(\sqrt{\pi })^3\Vert \mathbf {w}\Vert }\\&\quad = \; \cos \left( \frac{\theta (\mathbf {w},\mathbf {w}^*)}{2}\right) \frac{(\mathbf {v}^{\top }\mathbf {v}^*)^2\sqrt{1 - \cos (\theta (\mathbf {w},\mathbf {w}^*))}}{4(\sqrt{\pi })^3\Vert \mathbf {w}\Vert } \\&\quad = \; \frac{\sin \left( \theta (\mathbf {w},\mathbf {w}^*)\right) }{2(\sqrt{2\pi })^3\Vert \mathbf {w}\Vert }(\mathbf {v}^{\top }\mathbf {v}^*)^2. \end{aligned}$$

\(\square \)

Proof of Lemma 5

Denote \(\theta := \theta (\mathbf {w},\mathbf {w}^*)\). By Lemma 1, we have

$$\begin{aligned} \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v},\mathbf {w}) = \frac{1}{4}\big (\mathbf {I}+ \mathbf {1}\mathbf {1}^\top \big ) \mathbf {v}- \frac{1}{4}\left( \left( 1-\frac{2\theta }{\pi } \right) \mathbf {I}+ \mathbf {1}\mathbf {1}^\top \right) \mathbf {v}^*. \end{aligned}$$

Since \(\Vert \mathbf {w}\Vert =1\), Lemma 3 gives

$$\begin{aligned} \mathbb {E}_\mathbf {Z}\Big [\mathbf {g}(\mathbf {v},\mathbf {w}; \mathbf {Z})\Big ] = \frac{h(\mathbf {v},\mathbf {v}^*)}{2\sqrt{2\pi }}\mathbf {w}- \cos \left( \frac{\theta }{2}\right) \frac{\mathbf {v}^\top \mathbf {v}^*}{\sqrt{2\pi }}\frac{\mathbf {w}+ \mathbf {w}^*}{\left\| \mathbf {w}+ \mathbf {w}^*\right\| }, \end{aligned}$$
(31)

where

$$\begin{aligned} h(\mathbf {v},\mathbf {v}^*) =&\; \Vert \mathbf {v}\Vert ^2+ (\mathbf {1}^\top \mathbf {v})^2 - (\mathbf {1}^\top \mathbf {v})(\mathbf {1}^\top \mathbf {v}^*) + \mathbf {v}^\top \mathbf {v}^* \nonumber \\ =&\; \mathbf {v}^\top \left( \mathbf {I}+ \mathbf {1}\mathbf {1}^\top \right) \mathbf {v}- \mathbf {v}^\top (\mathbf {1}\mathbf {1}^\top - \mathbf {I})\mathbf {v}^* \nonumber \\ =&\; \mathbf {v}^\top \left( \mathbf {I}+ \mathbf {1}\mathbf {1}^\top \right) \mathbf {v}- \mathbf {v}^\top \left( \mathbf {1}\mathbf {1}^\top + \left( 1-\frac{2\theta }{\pi }\right) \mathbf {I}\right) \mathbf {v}^* + 2\left( 1 - \frac{\theta }{\pi }\right) \mathbf {v}^\top \mathbf {v}^* \nonumber \\ =&\; 4 \mathbf {v}^\top \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v},\mathbf {w}) + 2\left( 1 - \frac{\theta }{\pi }\right) \mathbf {v}^\top \mathbf {v}^*, \end{aligned}$$
(32)

and by Lemma 4,

$$\begin{aligned} \left\langle \mathbb {E}_\mathbf {Z}\Big [\mathbf {g}(\mathbf {v},\mathbf {w}; \mathbf {Z})\Big ], \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v},\mathbf {w}) \right\rangle = \frac{\sin \left( \theta \right) }{2(\sqrt{2\pi })^3}(\mathbf {v}^\top \mathbf {v}^*)^2. \end{aligned}$$

Hence, for some A depending only on C, we have

$$\begin{aligned}&\; \left\| \mathbb {E}_\mathbf {Z}\Big [\mathbf {g}(\mathbf {v},\mathbf {w}; \mathbf {Z})\Big ] \right\| ^2 \\&\quad = \; \left\| \frac{2 \mathbf {v}^\top \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v},\mathbf {w})}{\sqrt{2\pi }} \mathbf {w}+ \cos \left( \frac{\theta }{2}\right) \frac{\mathbf {v}^\top \mathbf {v}^*}{\sqrt{2\pi }}\left( \mathbf {w}- \frac{\mathbf {w}+ \mathbf {w}^*}{\left\| \mathbf {w}+ \mathbf {w}^*\right\| } \right) \right. \\&\qquad \left. + \left( 1-\frac{\theta }{\pi }-\cos \left( \frac{\theta }{2}\right) \right) \frac{\mathbf {v}^\top \mathbf {v}^*}{\sqrt{2\pi }}\mathbf {w}\right\| ^2 \\&\quad \le \; \frac{6C^2}{\pi } \left\| \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v},\mathbf {w})\right\| ^2 + \cos ^2\left( \frac{\theta }{2}\right) \frac{3(\mathbf {v}^\top \mathbf {v}^*)^2}{2\pi }\left\| \mathbf {w}- \frac{\mathbf {w}+ \mathbf {w}^*}{\left\| \mathbf {w}+ \mathbf {w}^*\right\| } \right\| ^2 \\&\qquad \; + \left( 1-\frac{\theta }{\pi }-\cos \left( \frac{\theta }{2}\right) \right) ^2 \frac{3(\mathbf {v}^\top \mathbf {v}^*)^2}{2\pi } \\&\quad \le \; \frac{6C^2}{\pi }\left\| \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v},\mathbf {w})\right\| ^2 + \cos ^2\left( \frac{\theta }{2}\right) \frac{3\theta ^2}{8\pi } (\mathbf {v}^\top \mathbf {v}^*)^2\\&\qquad +\left( 1-\frac{\theta }{\pi }-\cos \left( \frac{\theta }{2}\right) \right) ^2 \frac{3(\mathbf {v}^\top \mathbf {v}^*)^2}{2\pi } \\&\quad \le \; \frac{6C^2}{\pi }\left\| \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v},\mathbf {w})\right\| ^2 + \frac{3\pi }{8}\cos ^2\left( \frac{\theta }{2}\right) \sin ^2\left( \frac{\theta }{2}\right) (\mathbf {v}^\top \mathbf {v}^*)^2 + \frac{3\sin (\theta )}{2\pi }(\mathbf {v}^\top \mathbf {v}^*)^2 \\&\quad \le \; A\left( \left\| \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v},\mathbf {w})\right\| ^2 + \left\langle \mathbb {E}_\mathbf {Z}\Big [\mathbf {g}(\mathbf {v},\mathbf {w}; \mathbf {Z})\Big ], \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v},\mathbf {w}) \right\rangle \right) , \end{aligned}$$

where the equality is due to (31) and (32), the first inequality is due to Cauchy-Schwarz inequality, the second inequality holds because the angle between \(\mathbf {w}\) and \(\frac{\mathbf {w}+ \mathbf {w}^*}{\left\| \mathbf {w}+ \mathbf {w}^*\right\| }\) is \(\frac{\theta }{2}\) and \(\left\| \mathbf {w}- \frac{\mathbf {w}+ \mathbf {w}^*}{\left\| \mathbf {w}+ \mathbf {w}^*\right\| } \right\| \le \frac{\theta }{2}\), whereas the third inequality is due to \(\sin (x)\ge \frac{2x}{\pi }\), \(\cos (x)\ge 1-\frac{2x}{\pi }\), and

$$\begin{aligned} \left( 1-\frac{2x}{\pi } -\cos (x)\right) ^2\le & {} \left( \cos (x) - 1+ \frac{2x}{\pi } \right) \left( \cos (x) + 1 - \frac{2x}{\pi }\right) \\\le & {} \sin (x)(2\cos (x)) = \sin (2x), \end{aligned}$$

for all \(x\in [0,\frac{\pi }{2}]\). \(\square \)

Proof of Theorem 1

To leverage Lemmas 2 and 5, we need the boundedness of \(\{\mathbf {v}^t\}\). By the coerciveness of f w.r.t. \(\mathbf {v}\), there exists \(C_0>0\) such that \(\Vert \mathbf {v}\Vert \le C_0\) for any \(\mathbf {v}\in \{\mathbf {v}\in \mathbb {R}^m: f(\mathbf {v},\mathbf {w})\le f(\mathbf {v}^0,\mathbf {w}^0) \text{ for } \text{ some } \mathbf {w}\}\). In particular, \(\Vert \mathbf {v}^0\Vert \le C_0\). Arguing by induction, suppose we already have \(f(\mathbf {v}^{t},\mathbf {w}^{t})\le f(\mathbf {v}^0,\mathbf {w}^0)\) and \(\Vert \mathbf {v}^t\Vert \le C_0\). If \(\mathbf {w}^t = \pm \mathbf {w}^*\), then \(\mathbf {w}^{t+1} = \mathbf {w}^{t+2} = \cdots = \pm \mathbf {w}^*\), and the original problem reduces to a quadratic program in \(\mathbf {v}\), so \(\{\mathbf {v}^t\}\) converges to \(\mathbf {v}^*\) or \((\mathbf {I}+ \mathbf {1}\mathbf {1}^\top )^{-1}(\mathbf {1}\mathbf {1}^\top - \mathbf {I})\mathbf {v}^*\) for a suitable step size \(\eta \). In either case, both \(\left\| \mathbb {E}_\mathbf {Z}\Big [\frac{\partial \ell }{\partial \mathbf {v}}(\mathbf {v}^t,\mathbf {w}^t; \mathbf {Z})\Big ]\right\| \) and \(\left\| \mathbb {E}_\mathbf {Z}\Big [\mathbf {g}(\mathbf {v}^t,\mathbf {w}^t; \mathbf {Z})\Big ]\right\| \) converge to 0. Otherwise, if \(\mathbf {w}^t \ne \pm \mathbf {w}^*\), we define, for \(a\in [0,1]\),

$$\begin{aligned} \mathbf {v}^t(a) := \mathbf {v}^t + a(\mathbf {v}^{t+1} - \mathbf {v}^t) = \mathbf {v}^t - a \eta \mathbb {E}_\mathbf {Z}\left[ \frac{\partial \ell }{\partial \mathbf {v}}(\mathbf {v}^t,\mathbf {w}^t;\mathbf {Z})\right] \end{aligned}$$

and

$$\begin{aligned} \mathbf {w}^t(a) := \mathbf {w}^t + a(\mathbf {w}^{t+1/2} - \mathbf {w}^t) = \mathbf {w}^t - a\eta \mathbb {E}_\mathbf {Z}\left[ \mathbf {g}(\mathbf {v}^t,\mathbf {w}^t; \mathbf {Z})\right] , \end{aligned}$$

which satisfy

$$\begin{aligned} \mathbf {v}^t(0) = \mathbf {v}^t, \; \mathbf {v}^t(1) = \mathbf {v}^{t+1}, \; \mathbf {w}^t(0) = \mathbf {w}^t, \; \mathbf {w}^t(1) = \mathbf {w}^{t+1/2}. \end{aligned}$$

Let us fix \(0<c<1\) and \(C\ge C_0\). By the expressions of \(\mathbb {E}_\mathbf {Z}\left[ \frac{\partial \ell }{\partial \mathbf {v}}(\mathbf {v}^t,\mathbf {w}^t;\mathbf {Z})\right] \) and \(\mathbb {E}_\mathbf {Z}\left[ \mathbf {g}(\mathbf {v}^t,\mathbf {w}^t; \mathbf {Z})\right] \) given in Lemma 3, and since \(\Vert \mathbf {w}^t\Vert =1\), for sufficiently small \({\tilde{\eta }}\) depending on \(C_0\) and any step size \(\eta \le {\tilde{\eta }}\), it holds that \(\Vert \mathbf {v}^t(a)\Vert \le C\) and \(\Vert \mathbf {w}^t(a)\Vert \ge c\) for all \(a\in [0,1]\). There is possibly a point \(a_0\in [0,1]\) at which \(\theta (\mathbf {w}^t(a_0),\mathbf {w}^*) = 0\) or \(\pi \), so that \(\frac{\partial f}{\partial \mathbf {w}}(\mathbf {v}^t(a_0),\mathbf {w}^t(a_0))\) does not exist. Away from such a point, \(\left\| \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v}^t(a),\mathbf {w}^t(a)) \right\| \) is uniformly bounded for all \(a\in [0,1]\setminus \{a_0\}\), which makes it integrable over the interval [0, 1]. Then, we have

$$\begin{aligned} f(\mathbf {v}^{t+1}, \mathbf {w}^{t+1})&= \; f(\mathbf {v}^{t+1}, \mathbf {w}^{t+1/2}) = f(\mathbf {v}^t+ (\mathbf {v}^{t+1} -\mathbf {v}^t), \mathbf {w}^t+ (\mathbf {w}^{t+1/2}-\mathbf {w}^t)) \nonumber \\&= \; f(\mathbf {v}^t, \mathbf {w}^t) + \int _{0}^1 \left\langle \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v}^t(a),\mathbf {w}^t(a)) , \mathbf {v}^{t+1} -\mathbf {v}^t \right\rangle \mathrm {d}a \nonumber \\&\quad + \int _{0}^1 \left\langle \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v}^t(a),\mathbf {w}^t(a)), \mathbf {w}^{t+1/2} - \mathbf {w}^t \right\rangle \mathrm {d}a \nonumber \\&= \; f(\mathbf {v}^{t}, \mathbf {w}^{t}) + \left\langle \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v}^t,\mathbf {w}^t) , \mathbf {v}^{t+1} -\mathbf {v}^t \right\rangle + \left\langle \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v}^t,\mathbf {w}^t) , \mathbf {w}^{t+1/2} -\mathbf {w}^t \right\rangle \nonumber \\&\quad + \int _{0}^1 \left\langle \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v}^t(a),\mathbf {w}^t(a)) - \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v}^t,\mathbf {w}^t) , \mathbf {v}^{t+1} -\mathbf {v}^t \right\rangle \mathrm {d}a \nonumber \\&\quad + \int _{0}^1 \left\langle \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v}^t(a),\mathbf {w}^t(a)) - \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v}^t,\mathbf {w}^t) , \mathbf {w}^{t+1/2} - \mathbf {w}^t \right\rangle \mathrm {d}a \nonumber \\&\le \; f(\mathbf {v}^{t}, \mathbf {w}^{t}) -\left( \eta -\frac{L\eta ^2}{2}\right) \left\| \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v}^t,\mathbf {w}^t) \right\| ^2 \nonumber \\&\quad - \eta \left\langle \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v}^t,\mathbf {w}^t), \mathbb {E}_\mathbf {Z}\Big [\mathbf {g}(\mathbf {v}^t,\mathbf {w}^t; \mathbf {Z})\Big ] \right\rangle \nonumber \\&\quad + \frac{L\eta ^2}{2} \left\| \mathbb {E}_\mathbf {Z}\Big [\mathbf {g}(\mathbf {v}^t,\mathbf {w}^t; \mathbf {Z})\Big ] \right\| ^2 \nonumber \\&\le \; f(\mathbf {v}^{t}, \mathbf {w}^{t}) -\left( \eta -(1+A)\frac{L\eta ^2}{2}\right) \left\| \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v}^t,\mathbf {w}^t) \right\| ^2 \nonumber \\&\quad - \left( \eta -\frac{AL\eta ^2}{2}\right) \left\langle \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v}^t,\mathbf {w}^t), \mathbb {E}_\mathbf {Z}\Big [\mathbf {g}(\mathbf {v}^t,\mathbf {w}^t; \mathbf {Z})\Big ] \right\rangle . \end{aligned}$$
(33)

The third equality is due to the fundamental theorem of calculus. In the first inequality, we invoked Lemma 2 for \((\mathbf {v}^t, \mathbf {w}^t)\) and \((\mathbf {v}^t(a), \mathbf {w}^t(a))\) with \(a\in [0,1]\setminus \{a_0\}\). In the last inequality, we used Lemma 5. So when \(\eta < \eta _0:= \min \left\{ \frac{2}{(1+A)L}, {\tilde{\eta }}\right\} \), we have \(f(\mathbf {v}^{t+1},\mathbf {w}^{t+1})\le f(\mathbf {v}^0,\mathbf {w}^0)\) and thus \(\Vert \mathbf {v}^{t+1}\Vert \le C_0\).

Summing up the inequality (33) over t from 0 to \(\infty \) and using \(f\ge 0\), we have

$$\begin{aligned}&\; \eta \sum _{t=0}^\infty \left[ \left( 1 -(1+A)\frac{L\eta }{2}\right) \left\| \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v}^t,\mathbf {w}^t) \right\| ^2 + \left( 1 -\frac{AL\eta }{2}\right) \left\langle \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v}^t,\mathbf {w}^t), \mathbb {E}_\mathbf {Z}\Big [\mathbf {g}(\mathbf {v}^t,\mathbf {w}^t; \mathbf {Z})\Big ] \right\rangle \right] \\&\quad \le \; f(\mathbf {v}^0,\mathbf {w}^0)<\infty . \end{aligned}$$

Hence,

$$\begin{aligned} \lim _{t\rightarrow \infty }\left\| \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v}^t,\mathbf {w}^t)\right\| = 0 \end{aligned}$$

and

$$\begin{aligned} \lim _{t\rightarrow \infty } \left\langle \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v}^t,\mathbf {w}^t), \mathbb {E}_\mathbf {Z}\Big [\mathbf {g}(\mathbf {v}^t,\mathbf {w}^t; \mathbf {Z})\Big ] \right\rangle = 0. \end{aligned}$$

Invoking Lemma 5 again, we further have

$$\begin{aligned} \lim _{t\rightarrow \infty }\left\| \mathbb {E}_\mathbf {Z}\Big [\mathbf {g}(\mathbf {v}^t,\mathbf {w}^t; \mathbf {Z})\Big ]\right\| = 0, \end{aligned}$$

which completes the proof. \(\square \)
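
The statement can be illustrated by iterating the population updates with the closed-form expressions from Lemmas 1 and 3 and monitoring the two limits established above. A minimal NumPy sketch (test values and step size are arbitrary choices of ours; renormalizing the half step to unit length is our reading of the identity \(f(\mathbf {v}^{t+1},\mathbf {w}^{t+1})=f(\mathbf {v}^{t+1},\mathbf {w}^{t+1/2})\) together with the standing assumption \(\Vert \mathbf {w}^t\Vert =1\)):

```python
import numpy as np

def grad_v(v, w, v_star, w_star, m):
    """Population partial gradient df/dv from Lemma 1 (with ||w|| = ||w*|| = 1)."""
    theta = np.arccos(np.clip(w @ w_star, -1.0, 1.0))
    ones = np.ones((m, m))
    return 0.25 * ((np.eye(m) + ones) @ v
                   - ((1 - 2 * theta / np.pi) * np.eye(m) + ones) @ v_star)

def exp_g(v, w, v_star, w_star):
    """Expected coarse gradient E_Z[g(v, w; Z)] from Lemma 3 (with ||w|| = ||w*|| = 1)."""
    theta = np.arccos(np.clip(w @ w_star, -1.0, 1.0))
    u = w + w_star
    h = np.linalg.norm(v)**2 + v.sum()**2 - v.sum() * v_star.sum() + v @ v_star
    return (h / (2 * np.sqrt(2 * np.pi)) * w
            - np.cos(theta / 2) * (v @ v_star) / np.sqrt(2 * np.pi) * u / np.linalg.norm(u))

rng = np.random.default_rng(2)
m, n, eta = 4, 5, 0.1
v_star = rng.standard_normal(m)
w_star = rng.standard_normal(n); w_star /= np.linalg.norm(w_star)
v = rng.standard_normal(m)
w = rng.standard_normal(n); w /= np.linalg.norm(w)

for t in range(5000):
    v_new = v - eta * grad_v(v, w, v_star, w_star, m)
    w_half = w - eta * exp_g(v, w, v_star, w_star)
    v, w = v_new, w_half / np.linalg.norm(w_half)        # renormalize the half step
print(np.linalg.norm(grad_v(v, w, v_star, w_star, m)),   # quantities driven to zero in Theorem 1
      np.linalg.norm(exp_g(v, w, v_star, w_star)))
```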


Cite this article

Yin, P., Zhang, S., Lyu, J. et al. Blended coarse gradient descent for full quantization of deep neural networks. Res Math Sci 6, 14 (2019). https://doi.org/10.1007/s40687-018-0177-6
