Reintroducing Straight-Through Estimators as Principled Methods for Stochastic Binary Networks

  • Conference paper
  • Included in the conference series: Pattern Recognition (DAGM GCPR 2021)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 13024)

Abstract

Training neural networks with binary weights and activations is a challenging problem due to the lack of gradients and the difficulty of optimizing over discrete weights. Many successful experimental results have been achieved with empirical straight-through (ST) approaches, which propose a variety of ad-hoc rules for propagating gradients through non-differentiable activations and for updating discrete weights. At the same time, ST methods can be derived rigorously as estimators in the stochastic binary network (SBN) model with Bernoulli weights. We advance these derivations to a more complete and systematic study. We analyze properties and estimation accuracy, obtain different forms of correct ST estimators for activations and weights, explain existing empirical approaches and their shortcomings, and explain how latent weights arise from the mirror descent method when optimizing over probabilities. This allows us to reintroduce ST methods, long known empirically, as sound approximations, to apply them with clarity, and to develop further improvements.
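To make the generic ST construction concrete, below is a minimal PyTorch sketch (illustrative, not the paper's code): a stochastic binary activation whose forward pass samples a ±1 Bernoulli variable and whose backward pass differentiates the expectation E[b] = 2σ(a) − 1 instead of the non-differentiable sample. This is the kind of derived, rather than ad-hoc, ST rule the abstract refers to; the class name and demo are assumptions for illustration.

```python
import torch


class StochasticBinaryST(torch.autograd.Function):
    """Stochastic binary activation with a derived straight-through gradient.

    Forward: sample b in {-1, +1} with P(b = +1) = sigmoid(a), as in a
    stochastic binary network. Backward: differentiate the Bernoulli mean
    E[b] = 2 * sigmoid(a) - 1 instead of the non-differentiable sample.
    """

    @staticmethod
    def forward(ctx, a):
        p = torch.sigmoid(a)
        ctx.save_for_backward(p)
        return 2.0 * torch.bernoulli(p) - 1.0  # b in {-1, +1}

    @staticmethod
    def backward(ctx, grad_output):
        (p,) = ctx.saved_tensors
        # d E[b] / d a = 2 * sigmoid'(a) = 2 * p * (1 - p)
        return grad_output * 2.0 * p * (1.0 - p)


# Usage: gradients flow through the sampled binary activations.
a = torch.randn(4, requires_grad=True)
StochasticBinaryST.apply(a).sum().backward()
print(a.grad)  # equals 2 * p * (1 - p) elementwise
```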

We gratefully acknowledge support by Czech OP VVV project “Research Center for Informatics (CZ.02.1.01/0.0/0.0/16019/0000765)”.


Notes

  1. The conditions allow applying the Leibniz integral rule to exchange derivative and integral. Other conditions may suffice, e.g., when using weak derivatives [17].
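For reference, the exchange in question has the generic form below (here f and p_θ are illustrative placeholders, not notation taken from the paper); the stated conditions guarantee the second equality:

```latex
% Derivative-integral exchange via the Leibniz integral rule
\frac{\partial}{\partial \theta}\,\mathbb{E}_{x \sim p_\theta}[f(x)]
  = \frac{\partial}{\partial \theta} \int f(x)\, p_\theta(x)\, \mathrm{d}x
  = \int f(x)\, \frac{\partial p_\theta(x)}{\partial \theta}\, \mathrm{d}x .
```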

References

  1. Ajanthan, T., Gupta, K., Torr, P.H., Hartley, R., Dokania, P.K.: Mirror descent view for neural network quantization. arXiv preprint arXiv:1910.08237 (2019)

  2. Alizadeh, M., Fernandez-Marques, J., Lane, N.D., Gal, Y.: An empirical study of binary neural networks’ optimisation. In: ICLR (2019)

  3. Azizan, N., Lale, S., Hassibi, B.: A study of generalization of stochastic mirror descent algorithms on overparameterized nonlinear models. In: ICASSP, pp. 3132–3136 (2020)

  4. Bai, Y., Wang, Y.-X., Liberty, E.: ProxQuant: quantized neural networks via proximal operators. In: ICLR (2019)

  5. Bengio, Y., Léonard, N., Courville, A.: Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432 (2013)

  6. Bethge, J., Yang, H., Bornstein, M., Meinel, C.: Back to simplicity: how to train accurate BNNs from scratch? arXiv preprint arXiv:1906.08637 (2019)

  7. Boros, E., Hammer, P.: Pseudo-Boolean optimization. Discret. Appl. Math. 123(1–3), 155–225 (2002)

  8. Bulat, A., Tzimiropoulos, G.: Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources. In: ICCV, October 2017

  9. Bulat, A., Tzimiropoulos, G., Kossaifi, J., Pantic, M.: Improved training of binary networks for human pose estimation and image recognition. arXiv (2019)

  10. Bulat, A., Martinez, B., Tzimiropoulos, G.: BATS: binary architecture search (2020)

  11. Bulat, A., Martinez, B., Tzimiropoulos, G.: High-capacity expert binary networks. In: ICLR (2021)

  12. Chaidaroon, S., Fang, Y.: Variational deep semantic hashing for text documents. In: SIGIR Conference on Research and Development in Information Retrieval, pp. 75–84 (2017)

  13. Cheng, P., Liu, C., Li, C., Shen, D., Henao, R., Carin, L.: Straight-through estimator as projected Wasserstein gradient flow. arXiv preprint arXiv:1910.02176 (2019)

  14. Cong, Y., Zhao, M., Bai, K., Carin, L.: GO gradient for expectation-based objectives. In: ICLR (2019)

  15. Courbariaux, M., Bengio, Y., David, J.-P.: BinaryConnect: training deep neural networks with binary weights during propagations. In: NeurIPS, pp. 3123–3131 (2015)

  16. Dadaneh, S.Z., Boluki, S., Yin, M., Zhou, M., Qian, X.: Pairwise supervised hashing with Bernoulli variational auto-encoder and self-control gradient estimator. arXiv preprint arXiv:2005.10477 (2020)

  17. Dai, B., Guo, R., Kumar, S., He, N., Song, L.: Stochastic generative hashing. In: ICML 2017, pp. 913–922 (2017)

  18. Esser, S.K., et al.: Convolutional networks for fast, energy-efficient neuromorphic computing. Proc. Natl. Acad. Sci. 113(41), 11441–11446 (2016)

  19. Gong, R., et al.: Differentiable soft quantization: bridging full-precision and low-bit neural networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019

  20. Grathwohl, W., Choi, D., Wu, Y., Roeder, G., Duvenaud, D.: Backpropagation through the void: optimizing control variates for black-box gradient estimation. In: ICLR (2018)

  21. Graves, A.: Practical variational inference for neural networks. In: NeurIPS, pp. 2348–2356 (2011)

  22. Gregor, K., Danihelka, I., Mnih, A., Blundell, C., Wierstra, D.: Deep autoregressive networks. In: ICML (2014)

  23. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: ICCV, pp. 1026–1034 (2015)

  24. Helwegen, K., Widdicombe, J., Geiger, L., Liu, Z., Cheng, K.-T., Nusselder, R.: Latent weights do not exist: rethinking binarized neural network optimization. In: NeurIPS, pp. 7531–7542 (2019)

  25. Hinton, G.: Lecture 15D - Semantic hashing: 3:05–3:35 (2012). https://www.cs.toronto.edu/~hinton/coursera/lecture15/lec15d.mp4

  26. Horowitz, M.: Computing’s energy problem (and what we can do about it). In: International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14 (2014)

  27. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks. In: NeurIPS, pp. 4107–4115 (2016)

  28. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: ICML, vol. 37, pp. 448–456 (2015)

  29. Jang, E., Gu, S., Poole, B.: Categorical reparameterization with Gumbel-Softmax. In: ICLR (2017)

  30. Khan, M.E., Rue, H.: Learning algorithms from Bayesian principles. Draft v. 0.7, August 2020

  31. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)

  32. Krizhevsky, A., Hinton, G.E.: Using very deep autoencoders for content-based image retrieval. In: ESANN (2011)

  33. Lin, W., Khan, M.E., Schmidt, M.: Fast and simple natural-gradient variational inference with mixture of exponential-family approximations. In: ICML, vol. 97, June 2019

  34. Liu, Z., Wu, B., Luo, W., Yang, X., Liu, W., Cheng, K.-T.: Bi-real net: enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In: ECCV, pp. 722–737 (2018)

  35. Livochka, A., Shekhovtsov, A.: Initialization and transfer learning of stochastic binary networks from real-valued ones. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2021)

  36. Martínez, B., Yang, J., Bulat, A., Tzimiropoulos, G.: Training binary neural networks with real-to-binary convolutions. In: ICLR (2020)

  37. Meng, X., Bachmann, R., Khan, M.E.: Training binary neural networks using the Bayesian learning rule. In: ICML (2020)

  38. Nanculef, R., Mena, F.A., Macaluso, A., Lodi, S., Sartori, C.: Self-supervised Bernoulli autoencoders for semi-supervised hashing. arXiv preprint arXiv:2007.08799 (2020)

  39. Nemirovsky, A.S., Yudin, D.B.: Problem complexity and method efficiency in optimization (1983)

  40. Owen, A.B.: Monte Carlo theory, methods and examples (2013)

  41. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: NeurIPS, pp. 8024–8035 (2019)

  42. Pervez, A., Cohen, T., Gavves, E.: Low bias low variance gradient estimates for Boolean stochastic networks. In: ICML, vol. 119, pp. 7632–7640, 13–18 July 2020

  43. Peters, J.W., Welling, M.: Probabilistic binary neural networks. arXiv preprint arXiv:1809.03368 (2018)

  44. Raiko, T., Berglund, M., Alain, G., Dinh, L.: Techniques for learning binary stochastic feedforward neural networks. In: ICLR (2015)

  45. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-Net: ImageNet classification using binary convolutional neural networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 525–542. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_32

  46. Roth, W., Schindler, G., Fröning, H., Pernkopf, F.: Training discrete-valued neural networks with sign activations using weight distributions. In: European Conference on Machine Learning (ECML) (2019)

  47. Shekhovtsov, A.: Bias-variance tradeoffs in single-sample binary gradient estimators. In: GCPR (2021)

  48. Shekhovtsov, A., Yanush, V., Flach, B.: Path sample-analytic gradient estimators for stochastic binary networks. In: NeurIPS (2020)

  49. Shen, D., et al.: NASH: toward end-to-end neural architecture for generative semantic hashing. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), Volume 1: Long Papers, pp. 2041–2050 (2018)

  50. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. JMLR 15, 1929–1958 (2014)

  51. Sun, Z., Yao, A.: Weights having stable signs are important: finding primary subnetworks and kernels to compress binary weight networks (2021)

  52. Tang, W., Hua, G., Wang, L.: How to train a compact binary neural network with high accuracy? In: AAAI (2017)

  53. Titsias, M.K., Lázaro-Gredilla, M.: Local expectation gradients for black box variational inference. In: NeurIPS, pp. 2638–2646 (2015)

  54. Tokui, S., Sato, I.: Evaluating the variance of likelihood-ratio gradient estimators. In: ICML, pp. 3414–3423 (2017)

  55. Tucker, G., Mnih, A., Maddison, C.J., Lawson, J., Sohl-Dickstein, J.: REBAR: low-variance, unbiased gradient estimates for discrete latent variable models. In: NeurIPS (2017)

  56. Xiang, X., Qian, Y., Yu, K.: Binary deep neural networks for speech recognition. In: INTERSPEECH (2017)

  57. Yin, M., Zhou, M.: ARM: augment-REINFORCE-merge gradient for stochastic binary networks. In: ICLR (2019)

  58. Yin, P., Lyu, J., Zhang, S., Osher, S., Qi, Y., Xin, J.: Understanding straight-through estimator in training activation quantized neural nets. arXiv preprint arXiv:1903.05662 (2019)

  59. Zhang, S., He, N.: On the convergence rate of stochastic mirror descent for nonsmooth nonconvex optimization. arXiv, Optimization and Control (2018)

  60. Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y.: DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160 (2016)

Author information

Corresponding author: Alexander Shekhovtsov.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 504 KB)

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Shekhovtsov, A., Yanush, V. (2021). Reintroducing Straight-Through Estimators as Principled Methods for Stochastic Binary Networks. In: Bauckhage, C., Gall, J., Schwing, A. (eds.) Pattern Recognition. DAGM GCPR 2021. Lecture Notes in Computer Science, vol. 13024. Springer, Cham. https://doi.org/10.1007/978-3-030-92659-5_7

  • DOI: https://doi.org/10.1007/978-3-030-92659-5_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-92658-8

  • Online ISBN: 978-3-030-92659-5

  • eBook Packages: Computer Science (R0)
