Blended coarse gradient descent for full quantization of deep neural networks


Abstract

Quantized deep neural networks (QDNNs) are attractive due to their much lower memory storage and faster inference speed than their full-precision counterparts. To maintain the same performance level, especially at low bit-widths, QDNNs must be retrained. Their training involves piecewise constant activation functions and discrete weights; hence, mathematical challenges arise. We introduce the notion of coarse gradient and propose the blended coarse gradient descent (BCGD) algorithm for training fully quantized neural networks. A coarse gradient is generally not the gradient of any function but an artificial ascent direction. The weight update of BCGD goes by coarse gradient correction of a weighted average of the full-precision weights and their quantization (the so-called blending), which yields sufficient descent in the objective value and thus accelerates the training. Our experiments demonstrate that this simple blending technique is very effective for quantization at extremely low bit-widths such as binarization. In full quantization of ResNet-18 for the ImageNet classification task, BCGD gives 64.36% top-1 accuracy with binary weights across all layers and 4-bit adaptive activation. If the weights in the first and last layers are kept in full precision, this number increases to 65.46%. As theoretical justification, we provide a convergence analysis of coarse gradient descent for a two-linear-layer neural network model with Gaussian input data and prove that the expected coarse gradient correlates positively with the underlying true gradient.
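
The blending described above amounts to a two-line update: the full-precision weights are blended with their quantization, corrected by a coarse gradient, and then re-quantized (compare the rewriting of update (11) in the proof of Proposition 1 in the Appendix). A minimal NumPy sketch, with a hypothetical 1-bit quantizer and generic names of ours for the quantizer and the coarse-gradient oracle:

```python
import numpy as np

def binarize(w_f):
    """Hypothetical 1-bit quantizer: sign times a per-tensor scale (one common choice)."""
    return np.mean(np.abs(w_f)) * np.sign(w_f)

def bcgd_step(w_f, w_q, coarse_grad, eta=0.01, rho=1e-4, quantizer=binarize):
    """One blended coarse gradient descent step (sketch).

    w_f         : full-precision weights kept during training
    w_q         : current quantized weights
    coarse_grad : coarse gradient evaluated at the quantized weights w_q
    rho         : blending parameter; rho = 0 gives a BinaryConnect-type update [8]
    """
    # blend the float weights with their quantization and take a coarse-gradient step
    w_f_new = (1.0 - rho) * w_f + rho * w_q - eta * coarse_grad
    # project the blended step back onto the set of quantized weights
    w_q_new = quantizer(w_f_new)
    return w_f_new, w_q_new
```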


Notes

  1. This assumption is valid for the population loss function; we refer readers to Lemma 2 in Sect. 5.

  2. We redefine the second term as \(\mathbf {0}\) in the case \(\theta (\mathbf {w},\mathbf {w}^*) = \pi \).

  3. As in Lemma 3, we redefine \(\mathbb {E}\left[ \mathbf {z}1_{\{\mathbf {z}^\top \mathbf {w}>0, \, \mathbf {z}^\top {\tilde{\mathbf {w}}} >0\}}\right] =\mathbf {0}\) in the case \(\theta (\mathbf {w},{\tilde{\mathbf {w}}}) = \pi \).

References

  1. Bengio, Y., Léonard, N., Courville, A.: Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432 (2013)

  2. Bertsekas, D.P.: Nonlinear Programming. Athena Scientific, Belmont (1999)

  3. Brutzkus, A., Globerson, A.: Globally optimal gradient descent for a ConvNet with Gaussian inputs. arXiv preprint arXiv:1702.07966 (2017)

  4. Cai, Z., He, X., Sun, J., Vasconcelos, N.: Deep learning with low precision by half-wave Gaussian quantization. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)

  5. Carreira-Perpinán, M.: Model compression as constrained optimization, with application to neural nets. Part I: general framework. arXiv preprint arXiv:1707.01209 (2017)

  6. Choi, J., Wang, Z., Venkataramani, S., Chuang, P.I.J., Srinivasan, V., Gopalakrishnan, K.: PACT: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085 (2018)

  7. Combettes, P.L., Pesquet, J.C.: Stochastic approximations and perturbations in forward–backward splitting for monotone operators. Pure Appl. Funct. Anal. 1, 13–37 (2016)

  8. Courbariaux, M., Bengio, Y., David, J.: BinaryConnect: training deep neural networks with binary weights during propagations. In: Conference on Neural Information Processing Systems (NIPS), pp. 3123–3131 (2015)

  9. Deng, J., Dong, W., Socher, R., Li, L., Li, K., Li, F.: ImageNet: a large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255 (2009)

  10. Du, S.S., Lee, J.D., Tian, Y., Poczos, B., Singh, A.: Gradient descent learns one-hidden-layer CNN: don’t be afraid of spurious local minimum. arXiv preprint arXiv:1712.00779 (2018)

  11. Freund, Y., Schapire, R.E.: Large margin classification using the perceptron algorithm. Mach. Learn. 37(3), 277–296 (1999)

  12. Gilbert, J.C., Nocedal, J.: Global convergence properties of conjugate gradient methods for optimization. SIAM J. Optim. 2(1), 21–42 (1992)

  13. He, J., Li, L., Xu, J., Zheng, C.: ReLU deep neural networks and linear finite elements. arXiv preprint arXiv:1807.03973 (2018)

  14. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385 (2015)

  15. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: IEEE International Conference on Computer Vision (ICCV) (2015)

  16. Hinton, G.: Neural networks for machine learning, coursera. Coursera, video lectures (2012)

  17. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks: training neural networks with weights and activations constrained to +1 or \(-1\). arXiv preprint arXiv:1602.02830 (2016)

  18. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Quantized neural networks: training neural networks with low precision weights and activations. J. Mach. Learn. Res. 18, 1–30 (2018)

  19. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)

  20. Krizhevsky, A.: Learning multiple layers of features from tiny images. Technical report (2009)

  21. Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)

  22. Li, F., Zhang, B., Liu, B.: Ternary weight networks. arXiv preprint arXiv:1605.04711 (2016)

  23. Li, H., De, S., Xu, Z., Studer, C., Samet, H., Goldstein, T.: Training quantized nets: a deeper understanding. In: NIPS, pp. 5813–5823 (2017)

  24. Li, Y., Yuan, Y.: Convergence analysis of two-layer neural networks with ReLU activation. In: Advances in Neural Information Processing Systems, pp. 597–607 (2017)

  25. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28, 129–137 (1982)

  26. Park, E., Ahn, J., Yoo, S.: Weighted-entropy-based quantization for deep neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5456–5464 (2017)

  27. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch. Technical report (2017)

  28. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-Net: ImageNet classification using binary convolutional neural networks. In: European Conference on Computer Vision (ECCV) (2016)

  29. Rosenblatt, F.: The Perceptron, a Perceiving and Recognizing Automaton Project Para. Cornell Aeronautical Laboratory, Buffalo (1957)

  30. Rosenblatt, F.: Principles of Neurodynamics. Spartan Books, Washington (1962)

  31. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2015)

  32. Tian, Y.: An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. arXiv preprint arXiv:1703.00560 (2017)

  33. Wang, B., Luo, X., Li, Z., Zhu, W., Shi, Z., Osher, S.J.: Deep neural nets with interpolating function as output activation. arXiv preprint arXiv:1802.00168 (2018)

  34. Widrow, B., Lehr, M.A.: 30 years of adaptive neural networks: perceptron, madaline, and backpropagation. Proc. IEEE 78(9), 1415–1442 (1990)

  35. Yin, P., Zhang, S., Lyu, J., Osher, S., Qi, Y., Xin, J.: BinaryRelax: a relaxation approach for training deep neural networks with quantized weights. SIAM J. Imaging Sci. (to appear). arXiv preprint arXiv:1801.06313 (2018)

  36. Yin, P., Zhang, S., Qi, Y., Xin, J.: Quantization and training of low bit-width convolutional neural networks for object detection. J. Comput. Math. (to appear). arXiv preprint arXiv:1612.06052 (2018)

  37. Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y.: DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160 (2016)


Appendix

A. Additional preliminaries

Lemma 6

Let \(\mathbf {z}\) be a Gaussian random vector with entries i.i.d. sampled from \({\mathcal {N}}(0,1)\). Given nonzero vectors \(\mathbf {w}\) and \({\tilde{\mathbf {w}}}\) with angle \(\theta \), we have

$$\begin{aligned} \mathbb {E}\left[ 1_{\{\mathbf {z}^\top \mathbf {w}> 0\}}\right] = \frac{1}{2}, \; \mathbb {E}\left[ 1_{\{\mathbf {z}^\top \mathbf {w}> 0, \, \mathbf {z}^\top {\tilde{\mathbf {w}}}> 0\}}\right] = \frac{\pi -\theta }{2\pi }, \end{aligned}$$

and (see footnote 3)

$$\begin{aligned} \mathbb {E}\left[ \mathbf {z}1_{\{\mathbf {z}^\top \mathbf {w}>0\} } \right] = \frac{1}{\sqrt{2\pi }} \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert }, \; \mathbb {E}\left[ \mathbf {z}1_{\{\mathbf {z}^\top \mathbf {w}>0, \, \mathbf {z}^\top {\tilde{\mathbf {w}}} >0\}}\right] = \frac{\cos (\theta /2)}{\sqrt{2\pi }} \frac{\frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } + \frac{{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert } }{\left\| \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } + \frac{{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert } \right\| }. \end{aligned}$$
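
These identities can be sanity-checked by Monte Carlo sampling. A minimal NumPy sketch (sample size and test vectors are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([1.0, 0.0, 0.0])
w_tilde = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)      # angle theta = pi/4 with w
theta = np.arccos(w @ w_tilde / (np.linalg.norm(w) * np.linalg.norm(w_tilde)))

z = rng.standard_normal((10**6, 3))                     # rows are i.i.d. N(0, I) samples
ind_w = z @ w > 0
ind_both = ind_w & (z @ w_tilde > 0)

print(ind_w.mean(), 0.5)                                # first identity
print(ind_both.mean(), (np.pi - theta) / (2 * np.pi))   # second identity
print(z[ind_w].sum(0) / len(z),                         # third identity
      w / (np.sqrt(2 * np.pi) * np.linalg.norm(w)))
u = w / np.linalg.norm(w) + w_tilde / np.linalg.norm(w_tilde)
print(z[ind_both].sum(0) / len(z),                      # fourth identity
      np.cos(theta / 2) / np.sqrt(2 * np.pi) * u / np.linalg.norm(u))
```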

Proof

The third identity was proved in Lemma A.1 of [10]. To show the first one, since the Gaussian distribution is rotation-invariant, we may assume without loss of generality that \(\mathbf {w}= [w_1,0,\mathbf {0}^\top ]^\top \) with \(w_1> 0\); then \(\mathbb {E}\left[ 1_{\{\mathbf {z}^\top \mathbf {w}> 0\}}\right] = \mathbb {P}(z_1>0) = \frac{1}{2}\).

We further assume \({\tilde{\mathbf {w}}}= [{\tilde{w}}_1,{\tilde{w}}_2,\mathbf {0}^\top ]^\top \). It is easy to see

$$\begin{aligned} \mathbb {E}\left[ 1_{\{\mathbf {z}^\top \mathbf {w}> 0, \, \mathbf {z}^\top {\tilde{\mathbf {w}}}>0\}}\right] = \mathbb {P}(\mathbf {z}^\top \mathbf {w}> 0, \, \mathbf {z}^\top {\tilde{\mathbf {w}}}>0) = \frac{\pi - \theta }{2\pi }, \end{aligned}$$

which is the probability that \(\mathbf {z}\) forms an acute angle with both \(\mathbf {w}\) and \({\tilde{\mathbf {w}}}\).

To prove the last identity, we use the polar representation of the 2-D Gaussian vector \((z_1,z_2)\), where r is the radius and \(\phi \) is the angle, with \(\mathrm {d}\mathbb {P}_r = r \exp (-r^2/2)\mathrm {d}r\) and \(\mathrm {d}\mathbb {P}_\phi = \frac{1}{2\pi }\mathrm {d}\phi \). Then, \(\mathbb {E}\left[ z_i 1_{\{\mathbf {z}^\top \mathbf {w}>0, \, \mathbf {z}^\top {\tilde{\mathbf {w}}} >0\}}\right] = 0\) for \(i\ge 3\). Moreover,

$$\begin{aligned} \mathbb {E}\left[ z_1 1_{\{\mathbf {z}^\top \mathbf {w}>0, \, \mathbf {z}^\top {\tilde{\mathbf {w}}} >0\}}\right] = \frac{1}{2\pi }\int _{0}^\infty r^2\exp \left( -\frac{r^2}{2}\right) \mathrm {d}r \int _{-\frac{\pi }{2}+\theta }^{\frac{\pi }{2}} \cos (\phi ) \mathrm {d}\phi = \frac{1+\cos (\theta )}{2\sqrt{2\pi }} \end{aligned}$$

and

$$\begin{aligned} \mathbb {E}\left[ z_2 1_{\{\mathbf {z}^\top \mathbf {w}>0, \, \mathbf {z}^\top {\tilde{\mathbf {w}}} >0\}}\right] = \frac{1}{2\pi }\int _{0}^\infty r^2\exp \left( -\frac{r^2}{2}\right) \mathrm {d}r \int _{-\frac{\pi }{2}+\theta }^{\frac{\pi }{2}} \sin (\phi ) \mathrm {d}\phi = \frac{\sin (\theta )}{2\sqrt{2\pi }}. \end{aligned}$$
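
Here the elementary integrals evaluate to

$$\begin{aligned} \int _{0}^\infty r^2\exp \left( -\frac{r^2}{2}\right) \mathrm {d}r = \sqrt{\frac{\pi }{2}}, \quad \int _{-\frac{\pi }{2}+\theta }^{\frac{\pi }{2}} \cos (\phi ) \mathrm {d}\phi = 1+\cos (\theta ), \quad \int _{-\frac{\pi }{2}+\theta }^{\frac{\pi }{2}} \sin (\phi ) \mathrm {d}\phi = \sin (\theta ), \end{aligned}$$

and \(\frac{1}{2\pi }\sqrt{\frac{\pi }{2}} = \frac{1}{2\sqrt{2\pi }}\), which gives the two right-hand sides above.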

Therefore,

$$\begin{aligned} \mathbb {E}\left[ \mathbf {z}1_{\{\mathbf {z}^\top \mathbf {w}>0, \, \mathbf {z}^\top {\tilde{\mathbf {w}}} >0\}}\right] = \frac{\cos (\theta /2)}{\sqrt{2\pi }}[\cos (\theta /2), \sin (\theta /2),\mathbf {0}^\top ]^\top = \frac{\cos (\theta /2)}{\sqrt{2\pi }} \frac{\frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } + \frac{{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert } }{\left\| \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } + \frac{{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert } \right\| }, \end{aligned}$$

where the last equality holds because \(\frac{\mathbf {w}}{\Vert \mathbf {w}\Vert }\) and \(\frac{{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert }\) are two unit-normed vectors with angle \(\theta \). \(\square \)

Lemma 7

For any nonzero vectors \(\mathbf {w}\) and \({\tilde{\mathbf {w}}}\) with \(\Vert {\tilde{\mathbf {w}}}\Vert \ge \Vert \mathbf {w}\Vert = c>0\), we have

  1. \(|\theta (\mathbf {w},\mathbf {w}^*)-\theta ({\tilde{\mathbf {w}}},\mathbf {w}^*)|\le \frac{\pi }{2c}\Vert \mathbf {w}- {\tilde{\mathbf {w}}}\Vert \).

  2. \(\left\| \frac{1}{\Vert \mathbf {w}\Vert } \frac{\Big (\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\Big )\mathbf {w}^*}{\Big \Vert \Big (\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\Big )\mathbf {w}^*\Big \Vert } - \frac{1}{\Vert {\tilde{\mathbf {w}}}\Vert } \frac{\Big (\mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\Big )\mathbf {w}^*}{\Big \Vert \Big (\mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\Big )\mathbf {w}^*\Big \Vert } \right\| \le \frac{1}{c^2}\Vert \mathbf {w}- {\tilde{\mathbf {w}}}\Vert \).

Proof

1. Since, by the Cauchy–Schwarz inequality,

$$\begin{aligned} \left\langle {\tilde{\mathbf {w}}}, \mathbf {w}- \frac{c{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert }\right\rangle = {\tilde{\mathbf {w}}}^\top \mathbf {w}- c\Vert {\tilde{\mathbf {w}}}\Vert \le 0, \end{aligned}$$

we have

$$\begin{aligned} \Vert {\tilde{\mathbf {w}}}- \mathbf {w}\Vert ^2 =&\; \left\| \left( 1-\frac{c}{\Vert {\tilde{\mathbf {w}}}\Vert } \right) {\tilde{\mathbf {w}}}- \left( \mathbf {w}-\frac{c{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert } \right) \right\| ^2 \ge \left\| \left( 1-\frac{c}{\Vert {\tilde{\mathbf {w}}}\Vert } \right) {\tilde{\mathbf {w}}}\right\| ^2 + \left\| \mathbf {w}-\frac{c{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert } \right\| ^2 \nonumber \\ \ge&\; \left\| \mathbf {w}-\frac{c{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert } \right\| ^2 = c^2 \left\| \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } - \frac{{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert }\right\| ^2. \end{aligned}$$
(24)

Therefore,

$$\begin{aligned} \; |\theta (\mathbf {w},\mathbf {w}^*)-\theta ({\tilde{\mathbf {w}}},\mathbf {w}^*)|&\le \theta (\mathbf {w},{\tilde{\mathbf {w}}}) = \theta \left( \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert },\frac{{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert }\right) \\&\le \; \pi \sin \left( \frac{\theta \left( \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert },\frac{{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert }\right) }{2}\right) = \frac{\pi }{2}\left\| \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } - \frac{{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert }\right\| \le \frac{\pi }{2c}\Vert \mathbf {w}- {\tilde{\mathbf {w}}}\Vert , \end{aligned}$$

where we used the fact \(\sin (x)\ge \frac{2x}{\pi }\) for \(x\in [0,\frac{\pi }{2}]\) and the estimate in (24).

2. Since \(\Big (\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\Big )\mathbf {w}^*\) is the projection of \(\mathbf {w}^*\) onto the complement space of \(\mathbf {w}\) and likewise for \(\Big (\mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\Big )\mathbf {w}^*\), the angle between \(\Big (\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\Big )\mathbf {w}^*\) and \(\Big (\mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\Big )\mathbf {w}^*\) is equal to the angle between \(\mathbf {w}\) and \({\tilde{\mathbf {w}}}\). Therefore,

$$\begin{aligned} \left\langle \frac{\left( \mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\right) \mathbf {w}^*}{\left\| \left( \mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\right) \mathbf {w}^*\right\| } , \frac{\left( \mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\right) \mathbf {w}^*}{\left\| \left( \mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\right) \mathbf {w}^*\right\| } \right\rangle = \left\langle \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } , \frac{{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert } \right\rangle , \end{aligned}$$

and thus

$$\begin{aligned} \left\| \frac{1}{\Vert \mathbf {w}\Vert } \frac{\left( \mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\right) \mathbf {w}^*}{\left\| \left( \mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\right) \mathbf {w}^*\right\| } - \frac{1}{\Vert {\tilde{\mathbf {w}}}\Vert } \frac{\left( \mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\right) \mathbf {w}^*}{\left\| \left( \mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\right) \mathbf {w}^*\right\| } \right\|&= \left\| \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert ^2} - \frac{{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert ^2} \right\| \\&= \frac{\Vert \mathbf {w}- {\tilde{\mathbf {w}}}\Vert }{\Vert \mathbf {w}\Vert \Vert {\tilde{\mathbf {w}}}\Vert }\le \frac{1}{c^2}\Vert \mathbf {w}- {\tilde{\mathbf {w}}}\Vert . \end{aligned}$$

The second equality above holds because

$$\begin{aligned} \left\| \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert ^2} - \frac{{\tilde{\mathbf {w}}}}{\Vert {\tilde{\mathbf {w}}}\Vert ^2} \right\| ^2 = \frac{1}{\Vert \mathbf {w}\Vert ^2} + \frac{1}{\Vert {\tilde{\mathbf {w}}}\Vert ^2} - \frac{2\langle \mathbf {w}, {\tilde{\mathbf {w}}}\rangle }{\Vert \mathbf {w}\Vert ^2 \Vert {\tilde{\mathbf {w}}}\Vert ^2} = \frac{\Vert \mathbf {w}- {\tilde{\mathbf {w}}}\Vert ^2}{\Vert \mathbf {w}\Vert ^2 \Vert {\tilde{\mathbf {w}}}\Vert ^2}. \end{aligned}$$

\(\square \)

B. Proofs

Proof of Proposition 1

We rewrite the update (11) as

$$\begin{aligned} \mathbf {w}^{t+1} = \arg \min _{\mathbf {w}\in \mathcal {Q}} \; \langle \mathbf {w}, \nabla f(\mathbf {w}^t) \rangle + \frac{1-\rho }{2\eta } \Vert \mathbf {w}-\mathbf {w}_f^t\Vert ^2 + \frac{\rho }{2\eta } \Vert \mathbf {w}-\mathbf {w}^t\Vert ^2 . \end{aligned}$$
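
Completing the square, the objective on the right-hand side equals \(\frac{1}{2\eta }\left\| \mathbf {w}- \left( (1-\rho )\mathbf {w}_f^t + \rho \mathbf {w}^t - \eta \nabla f(\mathbf {w}^t)\right) \right\| ^2\) up to an additive constant independent of \(\mathbf {w}\), so \(\mathbf {w}^{t+1}\) is a point of \(\mathcal {Q}\) closest to the blended gradient step \((1-\rho )\mathbf {w}_f^t + \rho \mathbf {w}^t - \eta \nabla f(\mathbf {w}^t)\).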

Since \(\mathbf {w}^t, \, \mathbf {w}^{t+1} \in \mathcal {Q}\), we have

$$\begin{aligned}&\langle \mathbf {w}^{t+1}, \nabla f(\mathbf {w}^t) \rangle + \frac{1-\rho }{2\eta } \Vert \mathbf {w}^{t+1}-\mathbf {w}_f^t\Vert ^2 + \frac{\rho }{2\eta } \Vert \mathbf {w}^{t+1}-\mathbf {w}^t\Vert ^2 \\&\quad \le \langle \mathbf {w}^t, \nabla f(\mathbf {w}^t) \rangle + \frac{1-\rho }{2\eta } \Vert \mathbf {w}^t-\mathbf {w}_f^t\Vert ^2, \end{aligned}$$

or equivalently,

$$\begin{aligned}&\langle \mathbf {w}^{t+1}-\mathbf {w}^t, \nabla f(\mathbf {w}^t) \rangle + \frac{1-\rho }{2\eta } \left( \left\| \mathbf {w}^{t+1}-\mathbf {w}_f^t \right\| ^2- \left\| \mathbf {w}^t-\mathbf {w}_f^t \right\| ^2 \right) \nonumber \\&\quad + \frac{\rho }{2\eta }\Vert \mathbf {w}^{t+1}-\mathbf {w}^t\Vert ^2 \le 0. \end{aligned}$$
(25)

On the other hand, since f has L-Lipschitz gradient, the descent lemma [2] gives

$$\begin{aligned} f(\mathbf {w}^{t+1})\le f(\mathbf {w}^t) + \langle \nabla f(\mathbf {w}^t), \mathbf {w}^{t+1}-\mathbf {w}^t \rangle + \frac{L}{2}\Vert \mathbf {w}^{t+1}-\mathbf {w}^t\Vert ^2. \end{aligned}$$
(26)
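
Substituting the bound on \(\langle \nabla f(\mathbf {w}^t), \mathbf {w}^{t+1}-\mathbf {w}^t \rangle \) from (25) into (26) yields

$$\begin{aligned} f(\mathbf {w}^{t+1})\le f(\mathbf {w}^t) + \frac{1-\rho }{2\eta }\left( \Vert \mathbf {w}^{t}-\mathbf {w}_f^t\Vert ^2 - \Vert \mathbf {w}^{t+1}-\mathbf {w}_f^t\Vert ^2 \right) - \left( \frac{\rho }{2\eta } - \frac{L}{2}\right) \Vert \mathbf {w}^{t+1}-\mathbf {w}^t\Vert ^2, \end{aligned}$$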

which completes the proof. \(\square \)

Proof of Lemma 1

We first evaluate \(\mathbb {E}_\mathbf {Z}\left[ \sigma (\mathbf {Z}\mathbf {w})\sigma (\mathbf {Z}\mathbf {w})^\top \right] \), \(\mathbb {E}_\mathbf {Z}\left[ \sigma (\mathbf {Z}\mathbf {w})\sigma (\mathbf {Z}\mathbf {w}^*)^\top \right] \) and \(\mathbb {E}_\mathbf {Z}\left[ \sigma (\mathbf {Z}\mathbf {w}^*)\sigma (\mathbf {Z}\mathbf {w}^*)^\top \right] \). Let \(\mathbf {Z}_i^\top \) be the i-th row vector of \(\mathbf {Z}\). Since \(\mathbf {w}\ne \mathbf {0}\), using Lemma 6, we have

$$\begin{aligned} \mathbb {E}_\mathbf {Z}\left[ \sigma (\mathbf {Z}\mathbf {w})\sigma (\mathbf {Z}\mathbf {w})^\top \right] _{ii} = \mathbb {E}\left[ \sigma (\mathbf {Z}_i^\top \mathbf {w})\sigma (\mathbf {Z}_i^\top \mathbf {w})\right] = \mathbb {E}\left[ 1_{\{\mathbf {Z}_i^\top \mathbf {w}> 0\}}\right] = \frac{1}{2}, \end{aligned}$$

and for \(i\ne j\),

$$\begin{aligned} \mathbb {E}_\mathbf {Z}\left[ \sigma (\mathbf {Z}\mathbf {w})\sigma (\mathbf {Z}\mathbf {w})^\top \right] _{ij} = \mathbb {E}\left[ \sigma (\mathbf {Z}_i^\top \mathbf {w})\sigma (\mathbf {Z}_j^\top \mathbf {w})\right] = \mathbb {E}\left[ 1_{\{\mathbf {Z}_i^\top \mathbf {w}> 0\}}\right] \mathbb {E}\left[ 1_{\{\mathbf {Z}_j^\top \mathbf {w}> 0\}}\right] = \frac{1}{4}. \end{aligned}$$

Therefore, \(\mathbb {E}_\mathbf {Z}\left[ \sigma (\mathbf {Z}\mathbf {w})\sigma (\mathbf {Z}\mathbf {w})^\top \right] =\mathbb {E}_\mathbf {Z}\left[ \sigma (\mathbf {Z}\mathbf {w}^*)\sigma (\mathbf {Z}\mathbf {w}^*)^\top \right] = \frac{1}{4}\left( \mathbf {I}+ \mathbf {1}\mathbf {1}^\top \right) \). Furthermore,

$$\begin{aligned} \mathbb {E}_\mathbf {Z}\left[ \sigma (\mathbf {Z}\mathbf {w})\sigma (\mathbf {Z}\mathbf {w}^*)^\top \right] _{ii} = \mathbb {E}\left[ 1_{\{\mathbf {Z}_i^\top \mathbf {w}> 0, \mathbf {Z}_i^\top \mathbf {w}^*> 0\}}\right] = \frac{\pi -\theta (\mathbf {w},\mathbf {w}^*)}{2\pi }, \end{aligned}$$

and \(\mathbb {E}_\mathbf {Z}\left[ \sigma (\mathbf {Z}\mathbf {w})\sigma (\mathbf {Z}\mathbf {w}^*)^\top \right] _{ij}=\frac{1}{4}\) for \(i\ne j\). So,

$$\begin{aligned} \mathbb {E}_\mathbf {Z}\left[ \sigma (\mathbf {Z}\mathbf {w})\sigma (\mathbf {Z}\mathbf {w}^*)^\top \right] = \frac{1}{4}\left( \left( 1-\frac{2\theta (\mathbf {w},\mathbf {w}^*)}{\pi }\right) \mathbf {I}+ \mathbf {1}\mathbf {1}^\top \right) . \end{aligned}$$

We thus have proved (14) by noticing that

$$\begin{aligned} f(\mathbf {v},\mathbf {w}) =&\; \frac{1}{2}\left( \mathbf {v}^\top \mathbb {E}_\mathbf {Z}[\sigma (\mathbf {Z}\mathbf {w})\sigma (\mathbf {Z}\mathbf {w})^\top ]\mathbf {v}- 2\mathbf {v}^\top \mathbb {E}_\mathbf {Z}[\sigma (\mathbf {Z}\mathbf {w})\sigma (\mathbf {Z}\mathbf {w}^*)^\top ]\mathbf {v}^* \right. \\&\left. +(\mathbf {v}^*)^\top \mathbb {E}_\mathbf {Z}[\sigma (\mathbf {Z}\mathbf {w}^*)\sigma (\mathbf {Z}\mathbf {w}^*)^\top ]\mathbf {v}^*\right) . \end{aligned}$$

Next, since (15) is trivial, we only show (16). Since \(\theta (\mathbf {w},\mathbf {w}^*) = \arccos \left( \frac{\mathbf {w}^\top \mathbf {w}^*}{\Vert \mathbf {w}\Vert }\right) \) is differentiable w.r.t. \(\mathbf {w}\) whenever \(\theta (\mathbf {w},\mathbf {w}^*)\in (0,\pi )\), we have

$$\begin{aligned} \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v},\mathbf {w})= & {} \frac{\mathbf {v}^\top \mathbf {v}^*}{2\pi }\frac{\partial \theta }{\partial \mathbf {w}}(\mathbf {w},\mathbf {w}^*) = -\frac{\mathbf {v}^\top \mathbf {v}^*}{2\pi }\frac{\Vert \mathbf {w}\Vert ^2\mathbf {w}^* - (\mathbf {w}^\top \mathbf {w}^*)\mathbf {w}}{\Vert \mathbf {w}\Vert ^3\sqrt{1-\frac{(\mathbf {w}^\top \mathbf {w}^*)^2}{\Vert \mathbf {w}\Vert ^2}}} \\= & {} -\frac{\mathbf {v}^\top \mathbf {v}^*}{2\pi \Vert \mathbf {w}\Vert }\frac{\Big (\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\Big )\mathbf {w}^*}{\Big \Vert \Big (\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\Big )\mathbf {w}^*\Big \Vert }. \end{aligned}$$

\(\square \)
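
The matrix identities derived in this proof can be sanity-checked by sampling. A minimal NumPy sketch (test vectors are arbitrary choices of ours, with \(\sigma (x)=1_{\{x>0\}}\) and \(\Vert \mathbf {w}^*\Vert =1\)):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 4
w = rng.standard_normal(n)
w_star = rng.standard_normal(n); w_star /= np.linalg.norm(w_star)
theta = np.arccos(np.clip(w @ w_star / np.linalg.norm(w), -1.0, 1.0))

acc, N = np.zeros((m, m)), 200000
for _ in range(N):
    Z = rng.standard_normal((m, n))
    s, s_star = (Z @ w > 0).astype(float), (Z @ w_star > 0).astype(float)
    acc += np.outer(s, s_star)                 # sigma(Zw) sigma(Zw*)^T for one sample
print(acc / N)                                 # Monte Carlo estimate
print(0.25 * ((1 - 2 * theta / np.pi) * np.eye(m) + np.ones((m, m))))
```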

Proof of Proposition 2

Suppose \(\mathbf {v}^\top \mathbf {v}^*=0\) and \(\frac{\partial f}{\partial \mathbf {v}}(\mathbf {v},\mathbf {w}) = \mathbf {0}\), then by Lemma 1,

$$\begin{aligned} 0 = \mathbf {v}^\top \mathbf {v}^* = (\mathbf {v}^*)^\top (\mathbf {I}+ \mathbf {1}\mathbf {1}^\top )^{-1}\left( \left( 1- \frac{2}{\pi }\theta (\mathbf {w},\mathbf {w}^*)\right) \mathbf {I}+ \mathbf {1}\mathbf {1}^\top \right) \mathbf {v}^*. \end{aligned}$$
(27)

From (27), it follows that

$$\begin{aligned} \frac{2}{\pi }\theta (\mathbf {w},\mathbf {w}^*) (\mathbf {v}^*)^\top (\mathbf {I}+ \mathbf {1}\mathbf {1}^\top )^{-1} \mathbf {v}^*= (\mathbf {v}^*)^\top (\mathbf {I}+ \mathbf {1}\mathbf {1}^\top )^{-1}\left( \mathbf {I}+ \mathbf {1}\mathbf {1}^\top \right) \mathbf {v}^* = \Vert \mathbf {v}^*\Vert ^2. \end{aligned}$$
(28)

On the other hand, from (27) it also follows that

$$\begin{aligned} \left( \frac{2}{\pi }\theta (\mathbf {w},\mathbf {w}^*)-1\right) (\mathbf {v}^*)^\top (\mathbf {I}+ \mathbf {1}\mathbf {1}^\top )^{-1} \mathbf {v}^* = (\mathbf {v}^*)^\top (\mathbf {I}+ \mathbf {1}\mathbf {1}^\top )^{-1} \mathbf {1}(\mathbf {1}^\top \mathbf {v}^*) = \frac{(\mathbf {1}^\top \mathbf {v}^*)^2}{m+1}, \end{aligned}$$

where \(\mathbf {I}\) is an m-by-m identity matrix, and we used \((\mathbf {I}+ \mathbf {1}\mathbf {1}^\top ) \mathbf {1} = (m+1)\mathbf {1}\). Taking the difference of the two equalities above gives

$$\begin{aligned} (\mathbf {v}^*)^\top (\mathbf {I}+ \mathbf {1}\mathbf {1}^\top )^{-1}\mathbf {v}^* = \Vert \mathbf {v}^*\Vert ^2 - \frac{(\mathbf {1}^\top \mathbf {v}^*)^2}{m+1}. \end{aligned}$$

By (28), we have \(\theta (\mathbf {w},\mathbf {w}^*) = \frac{\pi }{2}\frac{(m+1)\Vert \mathbf {v}^*\Vert ^2}{(m+1)\Vert \mathbf {v}^*\Vert ^2 - (\mathbf {1}^\top \mathbf {v}^*)^2}\), which requires

$$\begin{aligned} \frac{\pi }{2}\frac{(m+1)\Vert \mathbf {v}^*\Vert ^2}{(m+1)\Vert \mathbf {v}^*\Vert ^2 - (\mathbf {1}^\top \mathbf {v}^*)^2}<\pi , \; \text{ or } \text{ equivalently, } \; (\mathbf {1}^\top \mathbf {v}^*)^2 < \frac{m+1}{2}\Vert \mathbf {v}^*\Vert ^2. \end{aligned}$$

Otherwise, \(\frac{\partial f}{\partial \mathbf {v}}(\mathbf {v},\mathbf {w})\) and \(\frac{\partial f}{\partial \mathbf {w}}(\mathbf {v},\mathbf {w})\) do not vanish simultaneously, and there is no critical point. \(\square \)
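
For instance, with \(m=2\) and \(\mathbf {v}^*=(1,-1)^\top \) we have \(\mathbf {1}^\top \mathbf {v}^*=0\), so the condition \((\mathbf {1}^\top \mathbf {v}^*)^2 < \frac{m+1}{2}\Vert \mathbf {v}^*\Vert ^2\) holds and the corresponding angle is \(\theta (\mathbf {w},\mathbf {w}^*) = \frac{\pi }{2}\); with \(\mathbf {v}^*=(1,1)^\top \) instead, \((\mathbf {1}^\top \mathbf {v}^*)^2 = 4 > \frac{m+1}{2}\Vert \mathbf {v}^*\Vert ^2 = 3\), and no such critical point exists.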

Proof of Lemma 2

It is easy to check that \(\Vert \mathbf {I}+ \mathbf {1}\mathbf {1}^\top \Vert = m+1\). Invoking part 1 of Lemma 7 gives

$$\begin{aligned} \left\| \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v},\mathbf {w}) - \frac{\partial f}{\partial \mathbf {v}}({\tilde{\mathbf {v}}},{\tilde{\mathbf {w}}}) \right\| =&\; \frac{1}{4}\left\| \big (\mathbf {I}+ \mathbf {1}\mathbf {1}^\top \big )(\mathbf {v}-{\tilde{\mathbf {v}}}) + \frac{2}{\pi }(\theta (\mathbf {w},\mathbf {w}^*) - \theta ({\tilde{\mathbf {w}}},\mathbf {w}^*) )\mathbf {v}^* \right\| \\ \le&\; \frac{1}{4}\left( (m+1)\Vert \mathbf {v}- {\tilde{\mathbf {v}}}\Vert + \frac{2\Vert \mathbf {v}^*\Vert }{\pi } |\theta (\mathbf {w},\mathbf {w}^*) - \theta ({\tilde{\mathbf {w}}},\mathbf {w}^*)|\right) \\ \le&\; \frac{1}{4}\left( (m+1)\Vert \mathbf {v}- {\tilde{\mathbf {v}}}\Vert + \frac{\Vert \mathbf {v}^*\Vert }{c} \left\| \mathbf {w}- {\tilde{\mathbf {w}}}\right\| \right) \\ \le&\; \frac{1}{4}\left( m+1 + \frac{\Vert \mathbf {v}^*\Vert }{c} \right) \Vert (\mathbf {v}, \mathbf {w}) - ({\tilde{\mathbf {v}}}, {\tilde{\mathbf {w}}})\Vert . \end{aligned}$$

Using part 2 of Lemma 7, we further have

$$\begin{aligned} \left\| \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v},\mathbf {w}) - \frac{\partial f}{\partial \mathbf {w}}({\tilde{\mathbf {v}}},{\tilde{\mathbf {w}}}) \right\| =&\; \left\| \frac{\mathbf {v}^\top \mathbf {v}^*}{2\pi \Vert \mathbf {w}\Vert } \frac{\Big (\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\Big )\mathbf {w}^*}{\Big \Vert \Big (\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\Big )\mathbf {w}^*\Big \Vert } - \frac{{\tilde{\mathbf {v}}}^\top \mathbf {v}^*}{2\pi \Vert {\tilde{\mathbf {w}}}\Vert } \frac{\Big (\mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\Big )\mathbf {w}^*}{\Big \Vert \Big (\mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\Big )\mathbf {w}^*\Big \Vert } \right\| \\ \le&\; \left\| \frac{\mathbf {v}^\top \mathbf {v}^*}{2\pi \Vert \mathbf {w}\Vert } \frac{\Big (\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\Big )\mathbf {w}^*}{\Big \Vert \Big (\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\Big )\mathbf {w}^*\Big \Vert } - \frac{\mathbf {v}^\top \mathbf {v}^*}{2\pi \Vert {\tilde{\mathbf {w}}}\Vert } \frac{\Big (\mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\Big )\mathbf {w}^*}{\Big \Vert \Big (\mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\Big )\mathbf {w}^*\Big \Vert } \right\| \\&\; + \left\| \frac{\mathbf {v}^\top \mathbf {v}^*}{2\pi \Vert {\tilde{\mathbf {w}}}\Vert } \frac{\Big (\mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\Big )\mathbf {w}^*}{\Big \Vert \Big (\mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\Big )\mathbf {w}^*\Big \Vert } - \frac{{\tilde{\mathbf {v}}}^\top \mathbf {v}^*}{2\pi \Vert {\tilde{\mathbf {w}}}\Vert } \frac{\Big (\mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\Big )\mathbf {w}^*}{\Big \Vert \Big (\mathbf {I}- \frac{{\tilde{\mathbf {w}}}{\tilde{\mathbf {w}}}^\top }{\Vert {\tilde{\mathbf {w}}}\Vert ^2}\Big )\mathbf {w}^*\Big \Vert } \right\| \\ \le&\; \frac{|\mathbf {v}^\top \mathbf {v}^*|}{2 \pi c^2 }\Vert \mathbf {w}-{\tilde{\mathbf {w}}}\Vert + \frac{\Vert \mathbf {v}^*\Vert }{2\pi c}\Vert \mathbf {v}-{\tilde{\mathbf {v}}}\Vert \\ \le&\; \frac{(C+c)\Vert \mathbf {v}^*\Vert }{2\pi c^2}\Vert (\mathbf {v}, \mathbf {w}) - ({\tilde{\mathbf {v}}}, {\tilde{\mathbf {w}}})\Vert . \end{aligned}$$

Combining the two inequalities above validates the claim. \(\square \)

Proof of Lemma 3

Equation (22) is true because \(\frac{\partial \ell }{\partial \mathbf {v}}(\mathbf {v},\mathbf {w};\mathbf {Z})\) is linear in \(\mathbf {v}\). To show (23), by (20) and the fact that \(\mu ^{\prime } = \sigma \), we have

$$\begin{aligned} \mathbb {E}_\mathbf {Z}\left[ \mathbf {g}(\mathbf {v},\mathbf {w};\mathbf {Z})\right] =&\; \mathbb {E}_\mathbf {Z}\left[ \left( \sum _{i=1}^m v_i \sigma (\mathbf {Z}^\top _i\mathbf {w}) - \sum _{i=1}^m v^*_i\sigma (\mathbf {Z}^\top _i\mathbf {w}^*) \right) \left( \sum _{i=1}^m \mathbf {Z}_i v_i \sigma (\mathbf {Z}^\top _i\mathbf {w}) \right) \right] \\ =&\; \mathbb {E}_\mathbf {Z}\left[ \left( \sum _{i=1}^m v_i 1_{\{\mathbf {Z}^\top _i\mathbf {w}>0\}} - \sum _{i=1}^m v^*_i1_{\{\mathbf {Z}^\top _i\mathbf {w}^*>0\}} \right) \left( \sum _{i=1}^m 1_{\{\mathbf {Z}^\top _i\mathbf {w}>0\}} v_i\mathbf {Z}_i \right) \right] . \end{aligned}$$

Invoking Lemma 6 and using the independence of the rows \(\mathbf {Z}_i\) of \(\mathbf {Z}\) (for \(i\ne j\), the expectation factors into a product of single-row expectations), we have

$$\begin{aligned} \mathbb {E}\left[ \mathbf {Z}_i 1_{\{\mathbf {Z}_i^\top \mathbf {w}>0, \mathbf {Z}_j^\top \mathbf {w}>0\}}\right] = {\left\{ \begin{array}{ll} \frac{1}{\sqrt{2\pi }} \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } &{} \text{ if } i=j, \\ \frac{1}{2\sqrt{2\pi }} \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } &{} \text{ if } i\ne j, \end{array}\right. } \end{aligned}$$
(29)

and

$$\begin{aligned} \mathbb {E}\left[ \mathbf {Z}_i 1_{\{\mathbf {Z}_i^\top \mathbf {w}>0, \mathbf {Z}_j^\top \mathbf {w}^* >0\}}\right] = {\left\{ \begin{array}{ll} \frac{\cos (\theta (\mathbf {w},\mathbf {w}^*)/2)}{\sqrt{2\pi }} \frac{\frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } + \mathbf {w}^* }{\left\| \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } + \mathbf {w}^* \right\| } &{} \text{ if } i=j, \\ \frac{1}{2\sqrt{2\pi }} \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } &{} \text{ if } i\ne j. \end{array}\right. } \end{aligned}$$
(30)

Therefore,

$$\begin{aligned} \mathbb {E}_\mathbf {Z}\left[ \mathbf {g}(\mathbf {v},\mathbf {w};\mathbf {Z})\right] =&\;\sum _{i=1}^m v_i^2 \mathbb {E}\left[ \mathbf {Z}_i 1_{\{\mathbf {Z}_i^\top \mathbf {w}>0\}}\right] + \sum _{i=1}^m \sum _{\overset{j=1}{j\ne i}}^m v_i v_j \mathbb {E}\left[ \mathbf {Z}_i 1_{\{\mathbf {Z}_i^\top \mathbf {w}>0, \mathbf {Z}_j^\top \mathbf {w}>0\}}\right] \\&\; - \sum _{i=1}^m v_i v_i^* \mathbb {E}\left[ \mathbf {Z}_i 1_{\{\mathbf {Z}_i^\top \mathbf {w}>0, \mathbf {Z}_i^\top \mathbf {w}^*>0\}}\right] \\&- \sum _{i=1}^m \sum _{\overset{j=1}{j\ne i}}^m v_i v_j^* \mathbb {E}\left[ \mathbf {Z}_i 1_{\{\mathbf {Z}_i^\top \mathbf {w}>0, \mathbf {Z}_j^\top \mathbf {w}^*>0\}}\right] \\ =&\; \frac{1}{2\sqrt{2\pi }}\left( \Vert \mathbf {v}\Vert ^2 + (\mathbf {1}^\top \mathbf {v})^2 \right) \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert }\\&-\cos \left( \frac{\theta (\mathbf {w},\mathbf {w}^*)}{2}\right) \frac{\mathbf {v}^\top \mathbf {v}^*}{\sqrt{2\pi }} \frac{\frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } + \mathbf {w}^* }{\left\| \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } + \mathbf {w}^* \right\| } \\&\; - \frac{1}{2\sqrt{2\pi }}\left( (\mathbf {1}^\top \mathbf {v})(\mathbf {1}^\top \mathbf {v}^*) - \mathbf {v}^\top \mathbf {v}^* \right) \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert }, \end{aligned}$$

which is exactly (23). \(\square \)
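
The closed-form expectation can be compared against direct sampling of \(\mathbf {g}(\mathbf {v},\mathbf {w};\mathbf {Z})\) as written at the beginning of this proof. A minimal NumPy sketch (test values are arbitrary choices of ours, with \(\sigma (x)=1_{\{x>0\}}\) and \(\Vert \mathbf {w}^*\Vert =1\)):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 5
v, v_star = rng.standard_normal(m), rng.standard_normal(m)
w = rng.standard_normal(n)
w_star = rng.standard_normal(n); w_star /= np.linalg.norm(w_star)

def g_sample(Z):
    """One sample of g(v, w; Z) with sigma(x) = 1_{x > 0}."""
    act, act_star = (Z @ w > 0).astype(float), (Z @ w_star > 0).astype(float)
    return (v @ act - v_star @ act_star) * (Z.T @ (v * act))

estimate = np.mean([g_sample(rng.standard_normal((m, n))) for _ in range(200000)], axis=0)

# closed form from the last display of the proof (recall ||w*|| = 1)
theta = np.arccos(np.clip(w @ w_star / np.linalg.norm(w), -1.0, 1.0))
u = w / np.linalg.norm(w) + w_star
closed = ((np.linalg.norm(v)**2 + v.sum()**2) / (2 * np.sqrt(2 * np.pi)) * w / np.linalg.norm(w)
          - np.cos(theta / 2) * (v @ v_star) / np.sqrt(2 * np.pi) * u / np.linalg.norm(u)
          - (v.sum() * v_star.sum() - v @ v_star) / (2 * np.sqrt(2 * np.pi)) * w / np.linalg.norm(w))
print(estimate)
print(closed)
```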

Proof of Lemma 4

Notice that \((\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2})\mathbf {w}= \mathbf {0}\) and \(\Vert \mathbf {w}^*\Vert = 1\). If \(\theta (\mathbf {w},\mathbf {w}^*)\ne 0, \pi \), then we have

$$\begin{aligned}&\; \left\langle \mathbb {E}_\mathbf {Z}\Big [\mathbf {g}(\mathbf {v},\mathbf {w}; \mathbf {Z})\Big ], \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v},\mathbf {w}) \right\rangle \\&\quad = \; \cos \left( \frac{\theta (\mathbf {w},\mathbf {w}^*)}{2}\right) \frac{(\mathbf {v}^{\top }\mathbf {v}^*)^2}{(\sqrt{2\pi })^3} \left\langle \frac{1}{\Vert \mathbf {w}\Vert } \frac{\Big (\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\Big )\mathbf {w}^*}{\Big \Vert \Big (\mathbf {I}- \frac{\mathbf {w}\mathbf {w}^\top }{\Vert \mathbf {w}\Vert ^2}\Big )\mathbf {w}^*\Big \Vert } , \frac{\mathbf {w}^*}{\left\| \frac{\mathbf {w}}{\Vert \mathbf {w}\Vert } + \mathbf {w}^*\right\| } \right\rangle \\&\quad = \; \cos \left( \frac{\theta (\mathbf {w},\mathbf {w}^*)}{2}\right) \frac{(\mathbf {v}^{\top }\mathbf {v}^*)^2}{(\sqrt{2\pi })^3} \frac{\Vert \mathbf {w}\Vert ^2 - (\mathbf {w}^\top \mathbf {w}^*)^2}{\Vert \Vert \mathbf {w}\Vert ^2\mathbf {w}^* - \mathbf {w}(\mathbf {w}^\top \mathbf {w}^*)\Vert \, \Vert \mathbf {w}+\Vert \mathbf {w}\Vert \mathbf {w}^*\Vert } \\&\quad = \; \cos \left( \frac{\theta (\mathbf {w},\mathbf {w}^*)}{2}\right) \frac{(\mathbf {v}^{\top }\mathbf {v}^*)^2}{(\sqrt{2\pi })^3} \frac{\Vert \mathbf {w}\Vert ^2 - (\mathbf {w}^\top \mathbf {w}^*)^2}{\sqrt{\Vert \mathbf {w}\Vert ^4 -\Vert \mathbf {w}\Vert ^2(\mathbf {w}^\top \mathbf {w}^*)^2} \sqrt{2(\Vert \mathbf {w}\Vert ^2+ \Vert \mathbf {w}\Vert (\mathbf {w}^\top \mathbf {w}^*))}} \\&\quad = \; \cos \left( \frac{\theta (\mathbf {w},\mathbf {w}^*)}{2}\right) \frac{(\mathbf {v}^{\top }\mathbf {v}^*)^2}{4(\sqrt{\pi \Vert \mathbf {w}\Vert })^3} \frac{\Vert \mathbf {w}\Vert ^2 - (\mathbf {w}^\top \mathbf {w}^*)^2}{\sqrt{\Vert \mathbf {w}\Vert ^2 -(\mathbf {w}^\top \mathbf {w}^*)^2} \sqrt{\Vert \mathbf {w}\Vert + (\mathbf {w}^\top \mathbf {w}^*)}} \\&\quad = \; \cos \left( \frac{\theta (\mathbf {w},\mathbf {w}^*)}{2}\right) \frac{(\mathbf {v}^{\top }\mathbf {v}^*)^2\sqrt{1-\frac{\mathbf {w}^\top \mathbf {w}^*}{\Vert \mathbf {w}\Vert }}}{4(\sqrt{\pi })^3\Vert \mathbf {w}\Vert }\\&\quad = \; \cos \left( \frac{\theta (\mathbf {w},\mathbf {w}^*)}{2}\right) \frac{(\mathbf {v}^{\top }\mathbf {v}^*)^2\sqrt{1 - \cos (\theta (\mathbf {w},\mathbf {w}^*))}}{4(\sqrt{\pi })^3\Vert \mathbf {w}\Vert } \\&\quad = \; \frac{\sin \left( \theta (\mathbf {w},\mathbf {w}^*)\right) }{2(\sqrt{2\pi })^3\Vert \mathbf {w}\Vert }(\mathbf {v}^{\top }\mathbf {v}^*)^2. \end{aligned}$$

\(\square \)

Proof of Lemma 5

Denote \(\theta := \theta (\mathbf {w},\mathbf {w}^*)\). By Lemma 1, we have

$$\begin{aligned} \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v},\mathbf {w}) = \frac{1}{4}\big (\mathbf {I}+ \mathbf {1}\mathbf {1}^\top \big ) \mathbf {v}- \frac{1}{4}\left( \left( 1-\frac{2\theta }{\pi } \right) \mathbf {I}+ \mathbf {1}\mathbf {1}^\top \right) \mathbf {v}^*. \end{aligned}$$

Since \(\Vert \mathbf {w}\Vert =1\), Lemma 3 gives

$$\begin{aligned} \mathbb {E}_\mathbf {Z}\Big [\mathbf {g}(\mathbf {v},\mathbf {w}; \mathbf {Z})\Big ] = \frac{h(\mathbf {v},\mathbf {v}^*)}{2\sqrt{2\pi }}\mathbf {w}- \cos \left( \frac{\theta }{2}\right) \frac{\mathbf {v}^\top \mathbf {v}^*}{\sqrt{2\pi }}\frac{\mathbf {w}+ \mathbf {w}^*}{\left\| \mathbf {w}+ \mathbf {w}^*\right\| }, \end{aligned}$$
(31)

where

$$\begin{aligned} h(\mathbf {v},\mathbf {v}^*) =&\; \Vert \mathbf {v}\Vert ^2+ (\mathbf {1}^\top \mathbf {v})^2 - (\mathbf {1}^\top \mathbf {v})(\mathbf {1}^\top \mathbf {v}^*) + \mathbf {v}^\top \mathbf {v}^* \nonumber \\ =&\; \mathbf {v}^\top \left( \mathbf {I}+ \mathbf {1}\mathbf {1}^\top \right) \mathbf {v}- \mathbf {v}^\top (\mathbf {1}\mathbf {1}^\top - \mathbf {I})\mathbf {v}^* \nonumber \\ =&\; \mathbf {v}^\top \left( \mathbf {I}+ \mathbf {1}\mathbf {1}^\top \right) \mathbf {v}- \mathbf {v}^\top \left( \mathbf {1}\mathbf {1}^\top + \left( 1-\frac{2\theta }{\pi }\right) \mathbf {I}\right) \mathbf {v}^* + 2\left( 1 - \frac{\theta }{\pi }\right) \mathbf {v}^\top \mathbf {v}^* \nonumber \\ =&\; 4 \mathbf {v}^\top \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v},\mathbf {w}) + 2\left( 1 - \frac{\theta }{\pi }\right) \mathbf {v}^\top \mathbf {v}^*, \end{aligned}$$
(32)

and by Lemma 4,

$$\begin{aligned} \left\langle \mathbb {E}_\mathbf {Z}\Big [\mathbf {g}(\mathbf {v},\mathbf {w}; \mathbf {Z})\Big ], \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v},\mathbf {w}) \right\rangle = \frac{\sin \left( \theta \right) }{2(\sqrt{2\pi })^3}(\mathbf {v}^\top \mathbf {v}^*)^2. \end{aligned}$$

Hence, for some A depending only on C, we have

$$\begin{aligned}&\; \left\| \mathbb {E}_\mathbf {Z}\Big [\mathbf {g}(\mathbf {v},\mathbf {w}; \mathbf {Z})\Big ] \right\| ^2 \\&\quad = \; \left\| \frac{2 \mathbf {v}^\top \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v},\mathbf {w})}{\sqrt{2\pi }} \mathbf {w}+ \cos \left( \frac{\theta }{2}\right) \frac{\mathbf {v}^\top \mathbf {v}^*}{\sqrt{2\pi }}\left( \mathbf {w}- \frac{\mathbf {w}+ \mathbf {w}^*}{\left\| \mathbf {w}+ \mathbf {w}^*\right\| } \right) \right. \\&\qquad \left. + \left( 1-\frac{\theta }{\pi }-\cos \left( \frac{\theta }{2}\right) \right) \frac{\mathbf {v}^\top \mathbf {v}^*}{\sqrt{2\pi }}\mathbf {w}\right\| ^2 \\&\quad \le \; \frac{6C^2}{\pi } \left\| \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v},\mathbf {w})\right\| ^2 + \cos ^2\left( \frac{\theta }{2}\right) \frac{3(\mathbf {v}^\top \mathbf {v}^*)^2}{2\pi }\left\| \mathbf {w}- \frac{\mathbf {w}+ \mathbf {w}^*}{\left\| \mathbf {w}+ \mathbf {w}^*\right\| } \right\| ^2 \\&\qquad \; + \left( 1-\frac{\theta }{\pi }-\cos \left( \frac{\theta }{2}\right) \right) ^2 \frac{3(\mathbf {v}^\top \mathbf {v}^*)^2}{2\pi } \\&\quad \le \; \frac{6C^2}{\pi }\left\| \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v},\mathbf {w})\right\| ^2 + \cos ^2\left( \frac{\theta }{2}\right) \frac{3\theta ^2}{8\pi } (\mathbf {v}^\top \mathbf {v}^*)^2\\&\qquad +\left( 1-\frac{\theta }{\pi }-\cos \left( \frac{\theta }{2}\right) \right) ^2 \frac{3(\mathbf {v}^\top \mathbf {v}^*)^2}{2\pi } \\&\quad \le \; \frac{6C^2}{\pi }\left\| \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v},\mathbf {w})\right\| ^2 + \frac{3\pi }{8}\cos ^2\left( \frac{\theta }{2}\right) \sin ^2\left( \frac{\theta }{2}\right) (\mathbf {v}^\top \mathbf {v}^*)^2 + \frac{3\sin (\theta )}{2\pi }(\mathbf {v}^\top \mathbf {v}^*)^2 \\&\quad \le \; A\left( \left\| \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v},\mathbf {w})\right\| ^2 + \left\langle \mathbb {E}_\mathbf {Z}\Big [\mathbf {g}(\mathbf {v},\mathbf {w}; \mathbf {Z})\Big ], \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v},\mathbf {w}) \right\rangle \right) , \end{aligned}$$

where the equality is due to (31) and (32), the first inequality is due to Cauchy-Schwarz inequality, the second inequality holds because the angle between \(\mathbf {w}\) and \(\frac{\mathbf {w}+ \mathbf {w}^*}{\left\| \mathbf {w}+ \mathbf {w}^*\right\| }\) is \(\frac{\theta }{2}\) and \(\left\| \mathbf {w}- \frac{\mathbf {w}+ \mathbf {w}^*}{\left\| \mathbf {w}+ \mathbf {w}^*\right\| } \right\| \le \frac{\theta }{2}\), whereas the third inequality is due to \(\sin (x)\ge \frac{2x}{\pi }\), \(\cos (x)\ge 1-\frac{2x}{\pi }\), and

$$\begin{aligned} \left( 1-\frac{2x}{\pi } -\cos (x)\right) ^2\le & {} \left( \cos (x) - 1+ \frac{2x}{\pi } \right) \left( \cos (x) + 1 - \frac{2x}{\pi }\right) \\\le & {} \sin (x)(2\cos (x)) = \sin (2x), \end{aligned}$$

for all \(x\in [0,\frac{\pi }{2}]\). \(\square \)

Proof of Theorem 1

To leverage Lemmas 2 and 5, we need the boundedness of \(\{\mathbf {v}^t\}\). By the coerciveness of f w.r.t. \(\mathbf {v}\), there exists \(C_0>0\) such that \(\Vert \mathbf {v}\Vert \le C_0\) for any \(\mathbf {v}\in \{\mathbf {v}\in \mathbb {R}^m: f(\mathbf {v},\mathbf {w})\le f(\mathbf {v}^0,\mathbf {w}^0) \text{ for } \text{ some } \mathbf {w}\}\). In particular, \(\Vert \mathbf {v}^0\Vert \le C_0\). Arguing by induction, suppose we already have \(f(\mathbf {v}^{t},\mathbf {w}^{t})\le f(\mathbf {v}^0,\mathbf {w}^0)\) and \(\Vert \mathbf {v}^t\Vert \le C_0\). If \(\mathbf {w}^t = \pm \mathbf {w}^*\), then \(\mathbf {w}^{t+1} = \mathbf {w}^{t+2} = \cdots = \pm \mathbf {w}^*\), and the original problem reduces to a quadratic program in \(\mathbf {v}\), so \(\{\mathbf {v}^t\}\) converges to \(\mathbf {v}^*\) or \((\mathbf {I}+ \mathbf {1}\mathbf {1}^\top )^{-1}(\mathbf {1}\mathbf {1}^\top - \mathbf {I})\mathbf {v}^*\) for a suitable step size \(\eta \). In either case, both \(\left\| \mathbb {E}_\mathbf {Z}\Big [\frac{\partial \ell }{\partial \mathbf {v}}(\mathbf {v}^t,\mathbf {w}^t; \mathbf {Z})\Big ]\right\| \) and \(\left\| \mathbb {E}_\mathbf {Z}\Big [\mathbf {g}(\mathbf {v}^t,\mathbf {w}^t; \mathbf {Z})\Big ]\right\| \) converge to 0. Otherwise, if \(\mathbf {w}^t \ne \pm \mathbf {w}^*\), we define, for \(a\in [0,1]\),

$$\begin{aligned} \mathbf {v}^t(a) := \mathbf {v}^t + a(\mathbf {v}^{t+1} - \mathbf {v}^t) = \mathbf {v}^t - a \eta \mathbb {E}_\mathbf {Z}\left[ \frac{\partial \ell }{\partial \mathbf {v}}(\mathbf {v}^t,\mathbf {w}^t;\mathbf {Z})\right] \end{aligned}$$

and

$$\begin{aligned} \mathbf {w}^t(a) := \mathbf {w}^t + a(\mathbf {w}^{t+1/2} - \mathbf {w}^t) = \mathbf {w}^t - a\eta \mathbb {E}_\mathbf {Z}\left[ \mathbf {g}(\mathbf {v}^t,\mathbf {w}^t; \mathbf {Z})\right] , \end{aligned}$$

which satisfy

$$\begin{aligned} \mathbf {v}^t(0) = \mathbf {v}^t, \; \mathbf {v}^t(1) = \mathbf {v}^{t+1}, \; \mathbf {w}^t(0) = \mathbf {w}^t, \; \mathbf {w}^t(1) = \mathbf {w}^{t+1/2}. \end{aligned}$$

Let us fix \(0<c<1\) and \(C\ge C_0\). By the expressions of \(\mathbb {E}_\mathbf {Z}\left[ \frac{\partial \ell }{\partial \mathbf {v}}(\mathbf {v}^t,\mathbf {w}^t;\mathbf {Z})\right] \) and \(\mathbb {E}_\mathbf {Z}\left[ \mathbf {g}(\mathbf {v}^t,\mathbf {w}^t; \mathbf {Z})\right] \) given in Lemma 3, and since \(\Vert \mathbf {w}^t\Vert =1\), for sufficiently small \({\tilde{\eta }}\) depending on \(C_0\) and any step size \(\eta \le {\tilde{\eta }}\), it holds that \(\Vert \mathbf {v}^t(a)\Vert \le C\) and \(\Vert \mathbf {w}^t(a)\Vert \ge c\) for all \(a\in [0,1]\). There is possibly a point \(a_0\in [0,1]\) at which \(\theta (\mathbf {w}^t(a_0),\mathbf {w}^*) = 0\) or \(\pi \), so that \(\frac{\partial f}{\partial \mathbf {w}}(\mathbf {v}^t(a_0),\mathbf {w}^t(a_0))\) does not exist. Away from such a point, \(\left\| \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v}^t(a),\mathbf {w}^t(a)) \right\| \) is uniformly bounded for all \(a\in [0,1]\setminus \{a_0\}\), which makes it integrable over the interval [0, 1]. Then, we have

$$\begin{aligned} f(\mathbf {v}^{t+1}, \mathbf {w}^{t+1})&= \; f(\mathbf {v}^{t+1}, \mathbf {w}^{t+1/2}) = f(\mathbf {v}^t+ (\mathbf {v}^{t+1} -\mathbf {v}^t), \mathbf {w}^t+ (\mathbf {w}^{t+1/2}-\mathbf {w}^t)) \nonumber \\&= \; f(\mathbf {v}^t, \mathbf {w}^t) + \int _{0}^1 \left\langle \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v}^t(a),\mathbf {w}^t(a)) , \mathbf {v}^{t+1} -\mathbf {v}^t \right\rangle \mathrm {d}a \nonumber \\&\quad + \int _{0}^1 \left\langle \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v}^t(a),\mathbf {w}^t(a)), \mathbf {w}^{t+1/2} - \mathbf {w}^t \right\rangle \mathrm {d}a \nonumber \\&= \; f(\mathbf {v}^{t}, \mathbf {w}^{t}) + \left\langle \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v}^t,\mathbf {w}^t) , \mathbf {v}^{t+1} -\mathbf {v}^t \right\rangle + \left\langle \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v}^t,\mathbf {w}^t) , \mathbf {w}^{t+1/2} -\mathbf {w}^t \right\rangle \nonumber \\&\quad + \int _{0}^1 \left\langle \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v}^t(a),\mathbf {w}^t(a)) - \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v}^t,\mathbf {w}^t) , \mathbf {v}^{t+1} -\mathbf {v}^t \right\rangle \mathrm {d}a \nonumber \\&\quad + \int _{0}^1 \left\langle \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v}^t(a),\mathbf {w}^t(a)) - \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v}^t,\mathbf {w}^t) , \mathbf {w}^{t+1/2} - \mathbf {w}^t \right\rangle \mathrm {d}a \nonumber \\&\le \; f(\mathbf {v}^{t}, \mathbf {w}^{t}) -\left( \eta -\frac{L\eta ^2}{2}\right) \left\| \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v}^t,\mathbf {w}^t) \right\| ^2 \nonumber \\&\quad - \eta \left\langle \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v}^t,\mathbf {w}^t), \mathbb {E}_\mathbf {Z}\Big [\mathbf {g}(\mathbf {v}^t,\mathbf {w}^t; \mathbf {Z})\Big ] \right\rangle \nonumber \\&\quad + \frac{L\eta ^2}{2} \left\| \mathbb {E}_\mathbf {Z}\Big [\mathbf {g}(\mathbf {v}^t,\mathbf {w}^t; \mathbf {Z})\Big ] \right\| ^2 \nonumber \\&\le \; f(\mathbf {v}^{t}, \mathbf {w}^{t}) -\left( \eta -(1+A)\frac{L\eta ^2}{2}\right) \left\| \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v}^t,\mathbf {w}^t) \right\| ^2 \nonumber \\&\quad - \left( \eta -\frac{AL\eta ^2}{2}\right) \left\langle \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v}^t,\mathbf {w}^t), \mathbb {E}_\mathbf {Z}\Big [\mathbf {g}(\mathbf {v}^t,\mathbf {w}^t; \mathbf {Z})\Big ] \right\rangle . \end{aligned}$$
(33)

The third equality is due to the fundamental theorem of calculus. In the first inequality, we invoked Lemma 2 for \((\mathbf {v}^t, \mathbf {w}^t)\) and \((\mathbf {v}^t(a), \mathbf {w}^t(a))\) with \(a\in [0,1]\setminus \{a_0\}\). In the last inequality, we used Lemma 5. So when \(\eta < \eta _0:= \min \left\{ \frac{2}{(1+A)L}, {\tilde{\eta }}\right\} \), we have \(f(\mathbf {v}^{t+1},\mathbf {w}^{t+1})\le f(\mathbf {v}^0,\mathbf {w}^0)\) and thus \(\Vert \mathbf {v}^{t+1}\Vert \le C_0\).

Summing up the inequality (33) over t from 0 to \(\infty \) and using \(f\ge 0\), we have

$$\begin{aligned}&\; \eta \sum _{t=0}^\infty \left[ \left( 1 -(1+A)\frac{L\eta }{2}\right) \left\| \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v}^t,\mathbf {w}^t) \right\| ^2 + \left( 1 -\frac{AL\eta }{2}\right) \left\langle \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v}^t,\mathbf {w}^t), \mathbb {E}_\mathbf {Z}\Big [\mathbf {g}(\mathbf {v}^t,\mathbf {w}^t; \mathbf {Z})\Big ] \right\rangle \right] \\&\quad \le \; f(\mathbf {v}^0,\mathbf {w}^0)<\infty . \end{aligned}$$

Hence,

$$\begin{aligned} \lim _{t\rightarrow \infty }\left\| \frac{\partial f}{\partial \mathbf {v}}(\mathbf {v}^t,\mathbf {w}^t)\right\| = 0 \end{aligned}$$

and

$$\begin{aligned} \lim _{t\rightarrow \infty } \left\langle \frac{\partial f}{\partial \mathbf {w}}(\mathbf {v}^t,\mathbf {w}^t), \mathbb {E}_\mathbf {Z}\Big [\mathbf {g}(\mathbf {v}^t,\mathbf {w}^t; \mathbf {Z})\Big ] \right\rangle = 0. \end{aligned}$$

Invoking Lemma 5 again, we further have

$$\begin{aligned} \lim _{t\rightarrow \infty }\left\| \mathbb {E}_\mathbf {Z}\Big [\mathbf {g}(\mathbf {v}^t,\mathbf {w}^t; \mathbf {Z})\Big ]\right\| = 0, \end{aligned}$$

which completes the proof. \(\square \)
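
The statement can be illustrated by iterating the population updates with the closed-form expressions from Lemmas 1 and 3 and monitoring the two limits established above. A minimal NumPy sketch (test values and step size are arbitrary choices of ours; renormalizing the half step to unit length is our reading of the identity \(f(\mathbf {v}^{t+1},\mathbf {w}^{t+1})=f(\mathbf {v}^{t+1},\mathbf {w}^{t+1/2})\) together with the standing assumption \(\Vert \mathbf {w}^t\Vert =1\)):

```python
import numpy as np

def grad_v(v, w, v_star, w_star, m):
    """Population partial gradient df/dv from Lemma 1 (with ||w|| = ||w*|| = 1)."""
    theta = np.arccos(np.clip(w @ w_star, -1.0, 1.0))
    ones = np.ones((m, m))
    return 0.25 * ((np.eye(m) + ones) @ v
                   - ((1 - 2 * theta / np.pi) * np.eye(m) + ones) @ v_star)

def exp_g(v, w, v_star, w_star):
    """Expected coarse gradient E_Z[g(v, w; Z)] from Lemma 3 (with ||w|| = ||w*|| = 1)."""
    theta = np.arccos(np.clip(w @ w_star, -1.0, 1.0))
    u = w + w_star
    h = np.linalg.norm(v)**2 + v.sum()**2 - v.sum() * v_star.sum() + v @ v_star
    return (h / (2 * np.sqrt(2 * np.pi)) * w
            - np.cos(theta / 2) * (v @ v_star) / np.sqrt(2 * np.pi) * u / np.linalg.norm(u))

rng = np.random.default_rng(2)
m, n, eta = 4, 5, 0.1
v_star = rng.standard_normal(m)
w_star = rng.standard_normal(n); w_star /= np.linalg.norm(w_star)
v = rng.standard_normal(m)
w = rng.standard_normal(n); w /= np.linalg.norm(w)

for t in range(5000):
    v_new = v - eta * grad_v(v, w, v_star, w_star, m)
    w_half = w - eta * exp_g(v, w, v_star, w_star)
    v, w = v_new, w_half / np.linalg.norm(w_half)        # renormalize the half step
print(np.linalg.norm(grad_v(v, w, v_star, w_star, m)),   # quantities driven to zero in Theorem 1
      np.linalg.norm(exp_g(v, w, v_star, w_star)))
```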


Cite this article

Yin, P., Zhang, S., Lyu, J. et al. Blended coarse gradient descent for full quantization of deep neural networks. Res Math Sci 6, 14 (2019). https://doi.org/10.1007/s40687-018-0177-6
