Error-Aware Conversion from ANN to SNN via Post-training Parameter Calibration

Abstract

Spiking Neural Networks (SNNs), which originate from the neural behavior observed in biology, have been recognized as one of the next-generation neural networks. Conventionally, an SNN can be obtained by converting a pre-trained Artificial Neural Network (ANN), replacing the non-linear activation with spiking neurons while keeping the parameters unchanged. In this work, we argue that simply copying and pasting the weights of an ANN into an SNN inevitably results in activation mismatch, especially for ANNs that are trained with batch normalization (BN) layers. To tackle the activation mismatch issue, we first provide a theoretical analysis by decomposing the local layer-wise conversion error, and then quantitatively measure how this error propagates through the layers using a second-order analysis. Motivated by the theoretical results, we propose a set of layer-wise parameter calibration algorithms that adjust the parameters to minimize the activation mismatch. To further remove the dependency on data, we propose a privacy-preserving conversion regime that distills synthetic data from the source ANN and uses it to calibrate the SNN. Extensive experiments with the proposed algorithms are performed on modern architectures and large-scale tasks, including ImageNet classification and MS COCO detection. We demonstrate that our method handles the SNN conversion and effectively preserves high accuracy even with only 32 time steps. For example, our calibration algorithms improve accuracy by up to 63% over baselines when converting MobileNet.

Notes

  1. https://ec.europa.eu/info/law/law-topic/data-protection_en.

  2. https://pypi.org/project/pytorchcv/.

References

  • Akopyan, F., Sawada, J., Cassidy, A., Alvarez-Icaza, R., Arthur, J., Merolla, P., Imam, N., Nakamura, Y., Datta, P., Nam, G. J., Taba, B., Beakes, M. P., Brezzo, B., Kuang, J. B., Manohar, R., Risk, W. P., Jackson, B. L., & Modha, D. S. (2015). Truenorth: Design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip. IEEE Transactions on Computer-aided Design of Integrated Circuits and Systems, 34(10), 1537–1557.

  • Barbi, M., Chillemi, S., Di Garbo, A., & Reale, L. (2003). Stochastic resonance in a sinusoidally forced LIF model with noisy threshold. Biosystems, 71(1–2), 23–28.

  • Bengio, Y., Léonard, N., & Courville, A. (2013). Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.

  • Bi, G., & Poo, M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. Journal of Neuroscience, 18(24), 10464–10472.

  • Botev, A., Ritter, H., & Barber, D. (2017). Practical gauss-newton optimisation for deep learning. In International conference on machine learning (pp. 557–565). PMLR.

  • Bu, T., Fang, W., Ding, J., Dai, P., Yu, Z., & Huang, T. (2021). Optimal ann-snn conversion for high-accuracy and ultra-low-latency spiking neural networks. In International conference on learning representations.

  • Bu, T., Ding, J., Yu, Z., & Huang, T. (2022). Optimized potential initialization for low-latency spiking neural networks. In Proceedings of the AAAI conference on artificial intelligence (pp. 11–20).

  • Chen, H., Wang, Y., Xu, C., Yang, Z., Liu, C., Shi, B., Xu, C., Xu, C., & Tian, Q. (2019). Data-free learning of student networks. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 3514–3522).

  • Chowdhury, S. S., Rathi, N., & Roy, K. (2021). One timestep is all you need: Training spiking neural networks with ultra low latency. arXiv preprint arXiv:2110.05929.

  • Christensen, D. V., Dittmann, R., Linares-Barranco, B., Sebastian, A., Le Gallo, M., Redaelli, A., Slesazeck, S., Mikolajick, T., Spiga, S., Menzel, S., Valov, I., Milano, G., Ricciardi, C., Liang, S.-J., Miao, F., Lanza, M., Quill, T. J., Keene, S. T., Salleo, A., & Pryds, N. (2022). 2022 roadmap on neuromorphic computing and engineering. Neuromorphic Computing and Engineering, 2(2), 022501.

  • Cox, D. D., & Dean, T. (2014). Neural networks and neuroscience-inspired computer vision. Current Biology, 24(18), R921–R929.

  • Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., & Le, Q. V. (2019). Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 113–123).

  • Davies, M., Srinivasa, N., Lin, T. H., Chinya, G., Cao, Y., Choday, S. H., Dimou, G., Joshi, P., Imam, N., Jain, S., Liao, Y., Lin, C.-K., Lines, A., Liu, R., Mathaikutty, D., McCoy, S., Paul, A., Tse, J., Venkataramanan, G., & Wang, H. (2018). Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro, 38(1), 82–99.

  • Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (pp. 248–255). IEEE.

  • Deng, L., Wu, Y., Hu, X., Liang, L., Ding, Y., Li, G., Zhao, G., Li, P., & Xie, Y. (2020). Rethinking the performance comparison between SNNS and ANNS. Neural Networks, 121, 294–307.

  • Deng, S., & Gu, S. (2021). Optimal conversion of conventional artificial neural networks to spiking neural networks. In International conference on learning representations. https://openreview.net/forum?id=FZ1oTwcXchK.

  • Deng, S., Li, Y., Zhang, S., & Gu, S. (2022). Temporal efficient training of spiking neural network via gradient re-weighting. In International conference on learning representations. https://openreview.net/forum?id=_XNtisL32jv.

  • DeVries, T., & Taylor, G. W. (2017). Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.

  • Diehl, P. U., Neil, D., Binas, J., Cook, M., Liu, S. C., & Pfeiffer, M. (2015). Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing. In 2015 International joint conference on neural networks (IJCNN) (pp. 1–8). IEEE.

  • Diehl, P. U., Zarrella, G., Cassidy, A., Pedroni, B. U., & Neftci, E. (2016). Conversion of artificial recurrent neural networks to spiking neural networks for low-power neuromorphic hardware. In 2016 IEEE international conference on rebooting computing (ICRC) (pp. 1–8). IEEE.

  • Ding, J., Yu, Z., Tian, Y., & Huang, T. (2021). Optimal ann-snn conversion for fast and accurate inference in deep spiking neural networks. In Z. H. Zhou (Ed.), Proceedings of the thirtieth international joint conference on artificial intelligence, IJCAI-21 (pp. 2328–2336). International Joint Conferences on Artificial Intelligence Organization. https://doi.org/10.24963/ijcai.2021/321 (main track).

  • Dong, X., Chen, S., & Pan, S. (2017a). Learning to prune deep neural networks via layer-wise optimal brain surgeon. Advances in Neural Information Processing Systems,30.

  • Dong, X., Chen, S., & Pan, S. (2017b). Learning to prune deep neural networks via layer-wise optimal brain surgeon. Advances in Neural Information Processing Systems.

  • Dong, Z., Yao, Z., Gholami, A., Mahoney, M. W., & Keutzer, K. (2019). Hawq: Hessian aware quantization of neural networks with mixed-precision. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 293–302).

  • Fang, W., Yu, Z., Chen, Y., Masquelier, T., Huang, T., & Tian, Y. (2021). Incorporating learnable membrane time constant to enhance learning of spiking neural networks. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 2661–2671).

  • Furber, S. B., Galluppi, F., Temple, S., et al. (2014). The spinnaker project. Proceedings of the IEEE, 102(5), 652–665.

  • Gu, P., Xiao, R., Pan, G., & Tang, H. (2019). STCA: Spatio-temporal credit assignment with delayed feedback in deep spiking neural networks. In IJCAI (Vol. 15, pp. 1366–1372).

  • Han, B., & Roy, K. (2020). Deep spiking neural network: Energy efficiency through time based coding. In European conference on computer vision.

  • Han, B., Srinivasan, G., & Roy, K. (2020). Rmp-snn: Residual membrane potential neuron for enabling deeper high-accuracy and low-latency spiking neural network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 13558–13567).

  • Hassibi, B., & Stork, D. G. (1993). Second order derivatives for network pruning: Optimal brain surgeon. Advances in Neural Information Processing Systems, 5, 164–171.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE conference on computer vision and pattern recognition.

  • He, T., Zhang, Z., Zhang, H., Zhang, Z., Xie, J., & Li, M. (2019). Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 558–567).

  • Hebb, D. O. (2005). The organization of behavior: A neuropsychological theory. Psychology Press.

  • Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of Physiology, 117(4), 500–544.

  • Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.

  • Iakymchuk, T., Rosado-Muñoz, A., Guerrero-Martínez, J. F., Bataller-Mompeán, M., & Francés-Villora, J. V. (2015). Simplified spiking neural network architecture and stdp learning algorithm applied to image classification. EURASIP Journal on Image and Video Processing, 1, 1–11.

  • Ikegawa, S. I., Saiin, R., Sawada, Y., & Natori, N. (2022). Rethinking the role of normalization and residual blocks for spiking neural networks. Sensors, 22(8), 2876.

  • Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (pp. 448–456). PMLR.

  • Iyer, L. R., & Chua, Y. (2020). Classifying neuromorphic datasets with tempotron and spike timing dependent plasticity. In 2020 international joint conference on neural networks (IJCNN) (pp. 1–8). IEEE.

  • Izhikevich, E. M. (2003). Simple model of spiking neurons. IEEE Transactions on Neural Networks, 14(6), 1569–1572.

  • Kheradpisheh, S. R., Ganjtabesh, M., Thorpe, S. J., & Masquelier, T. (2018). STDP-based spiking deep convolutional neural networks for object recognition. Neural Networks, 99, 56–67.

  • Kim, S., Park, S., Na, B., & Yoon, S. (2020). Spiking-yolo: Spiking neural network for energy-efficient object detection. In Proceedings of the AAAI conference on artificial intelligence (Vol. 34, No. 07, pp. 11270–11277).

  • Kim, Y., & Panda, P. (2021). Revisiting batch normalization for training low-latency deep spiking neural networks from scratch. Frontiers in Neuroscience, 15(773), 954.

  • Kim, Y., Li, Y., Park, H., Venkatesha, Y., & Panda, P. (2022). Neural architecture search for spiking neural networks. In European conference on computer vision (pp. 36–56). Cham: Springer Nature Switzerland.

  • Kingma, D.P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

  • Krizhevsky, A., Nair, V., & Hinton, G. (2010). CIFAR-10 (Canadian Institute for Advanced Research). https://www.cs.toronto.edu/~kriz/cifar.html.

  • Lee, C., Panda, P., Srinivasan, G., & Roy, K. (2018). Training deep spiking convolutional neural networks with stdp-based unsupervised pre-training followed by supervised fine-tuning. Frontiers in Neuroscience, 12, 373945.

  • Lee, C., Sarwar, S. S., Panda, P., Srinivasan, G., & Roy, K. (2020). Enabling spike-based backpropagation for training deep neural network architectures. Frontiers in Neuroscience, 14, 497482.

  • Lee, J. H., Delbruck, T., & Pfeiffer, M. (2016). Training deep spiking neural networks using backpropagation. Frontiers in Neuroscience, 10, 508.

  • Li, S. L., & Li, J. P. (2019). Research on learning algorithm of spiking neural network. In 2019 16th international computer conference on wavelet active media technology and information processing (pp. 45–48). IEEE.

  • Li, T., Sahu, A. K., Talwalkar, A., & Smith, V. (2020). Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3), 50–60.

  • Li, Y., & Zeng, Y. (2022). Efficient and accurate conversion of spiking neural network with burst spikes. arXiv preprint arXiv:2204.13271.

  • Li, Y., Dong, X., & Wang, W. (2020b). Additive powers-of-two quantization: An efficient non-uniform discretization for neural networks. In International conference on learning representations. https://openreview.net/forum?id=BkgXT24tDS.

  • Li, Y., Deng, S., Dong, X., Gong, R., & Gu, S. (2021). A free lunch from ANN: Towards efficient, accurate spiking neural networks calibration. In International conference on machine learning (pp. 6316–6325). PMLR.

  • Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., & Gu, S. (2021b). Brecq: Pushing the limit of post-training quantization by block reconstruction. In International conference on learning representations. https://openreview.net/forum?id=POWv6hDd9XH.

  • Li, Y., Guo, Y., Zhang, S., Deng, S., Hai, Y., & Gu, S. (2021). Differentiable spike: Rethinking gradient-descent for training spiking neural networks. Advances in Neural Information Processing Systems, 34, 23426–23439.

  • Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V 13 (pp. 740–755). Springer International Publishing.

  • Lin, T. Y., Dollar, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2117–2125).

  • Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollar, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision (pp. 2980–2988).

  • Liu, Y. H., & Wang, X. J. (2001). Spike-frequency adaptation of a generalized leaky integrate-and-fire model neuron. Journal of Computational Neuroscience, 10(1), 25–45.

  • Liu, Z., Wu, Z., Gan, C., Zhu, L., & Han, S. (2020). Datamix: Efficient privacy-preserving edge-cloud inference. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16 (pp. 578–595). Springer International Publishing.

  • Lobov, S. A., Mikhaylov, A. N., Shamshin, M., Makarov, V. A., & Kazantsev, V. B. (2020). Spatial properties of STDP in a self-learning spiking neural network enable controlling a mobile robot. Frontiers in Neuroscience, 14, 491341.

  • Loshchilov, I., & Hutter, F. (2016). Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.

  • Meng, Q., Xiao, M., Yan, S., Wang, Y., Lin, Z., & Luo, Z. Q. (2022). Training high-performance low-latency spiking neural networks by differentiation on spike representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12444–12453).

  • Miquel, J. R., Tolu, S., Scholler, F. E., & Galeazzi, R. (2021). Retinanet object detector based on analog-to-spiking neural network conversion. In 2021 8th International Conference on Soft Computing & Machine Intelligence (ISCMI) (pp. 201–205).

  • Mordvintsev, A., Olah, C., & Tyka, M. (2015). Inceptionism: Going deeper into neural networks. https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html.

  • Neftci, E. O., Mostafa, H., & Zenke, F. (2019). Surrogate gradient learning in spiking neural networks: Bringing the power of gradient-based optimization to spiking neural networks. IEEE Signal Processing Magazine, 36(6), 51–63.

  • Pei, J., Deng, L., Song, S., Zhao, M., Zhang, Y., Wu, S., Wang, Y., Wu, Y., Yang, Z., Ma, C., Li, G., Han, W., Li, H., Wu, H., Zhao, R., Xie, Y., & Shi, L. P. (2019). Towards artificial general intelligence with hybrid Tianjic chip architecture. Nature, 572(7767), 106–111.

  • Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., & Dollar, P. (2020). Designing network design spaces. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10428–10436).

  • Rastegari, M., Ordonez, V., Redmon, J., & Farhadi, A. (2016). Xnor-net: Imagenet classification using binary convolutional neural networks. In European conference on computer vision (pp. 525–542). Cham: Springer International Publishing.

  • Rathi, N., & Roy, K. (2021). Diet-snn: A low-latency spiking neural network with direct input encoding and leakage and threshold optimization. IEEE Transactions on Neural Networks and Learning Systems, 34(6), 3174–3182.

  • Rathi, N., Srinivasan, G., Panda, P., & Roy, K. (2019). Enabling deep spiking neural networks with hybrid conversion and spike timing dependent backpropagation. In International conference on learning representations.

  • Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28.

  • Roy, D., Chakraborty, I., & Roy, K. (2019). Scaling deep spiking neural networks with binary stochastic activations. In 2019 IEEE International Conference on Cognitive Computing (ICCC) (pp. 50–58). IEEE.

  • Roy, K., Jaiswal, A., & Panda, P. (2019). Towards spike-based machine intelligence with neuromorphic computing. Nature, 575(7784), 607–617.

  • Rueckauer, B., Lungu, I. A., Hu, Y., & Pfeiffer, M. (2016). Theory and tools for the conversion of analog to spiking convolutional neural networks. arXiv preprint arXiv:1612.04052.

  • Rueckauer, B., Lungu, I. A., Hu, Y., Pfeiffer, M., & Liu, S. C. (2017). Conversion of continuous-valued deep networks to efficient event-driven networks for image classification. Frontiers in Neuroscience, 11, 294078.

  • Santurkar, S., Tsipras, D., Ilyas, A., & Madry, A. (2018). How does batch normalization help optimization?. Advances in Neural Information Processing Systems, 31.

  • Sengupta, A., Ye, Y., Wang, R., Liu, C., & Roy, K. (2019). Going deeper in spiking neural networks: VGG and residual architectures. Frontiers in Neuroscience, 13, 425055.

  • Shrestha, S. B., & Orchard, G. (2018). Slayer: Spike layer error reassignment in time. Advances in Neural Information Processing Systems, 31, 1412–1421.

  • Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., & Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489.

  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

  • Suetake, K., Ikegawa, S. I., Saiin, R., & Sawada, Y. (2023). S3NN: Time step reduction of spiking surrogate gradients for training energy efficient single-step spiking neural networks. Neural Networks, 159, 208–219.

  • Sze, V., Chen, Y. H., Yang, T. J., & Emer, J. S. (2017). Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE, 105(12), 2295–2329.

  • Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2818–2826).

  • Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., & Le, Q. V. (2019). Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2820–2828).

  • Tavanaei, A., Ghodrati, M., Kheradpisheh, S. R., Masquelier, T., & Maida, A. (2019). Deep learning in spiking neural networks. Neural Networks, 111, 47–63.

  • Theis, L., Korshunova, I., Tejani, A., & Huszar, F. (2018). Faster gaze prediction with dense networks and fisher pruning. arXiv preprint arXiv:1801.05787.

  • Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., Junhyuk, O., Horgan, D., Kroiss, M., Danihelka, I., Huang, A., Sifre, L., Cai, T., Agapiou, J. P., Jaderberg, M., & Silver, D. (2019). Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782), 350–354.

  • Wang, Y., Zhang, M., Chen, Y., & Qu, H. (2022). Signed neuron with memory: Towards simple, accurate and high-efficient ANN-SNN conversion. In International joint conference on artificial intelligence (pp. 2501–2508).

  • Wu, J., Chua, Y., Zhang, M., Li, G., Li, H., & Tan, K. C. (2021). A tandem learning rule for effective training and rapid inference of deep spiking neural networks. IEEE Transactions on Neural Networks and Learning Systems, 34(1), 446–460.

  • Wu, J., Xu, C., Han, X., Zhou, D., Zhang, M., Li, H., & Tan, K. C. (2021). Progressive tandem learning for pattern recognition with deep spiking neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11), 7824–7840.

  • Wu, Y., Deng, L., Li, G., & Shi, L. (2018). Spatio-temporal backpropagation for training high-performance spiking neural networks. Frontiers in Neuroscience, 12, 331.

  • Wu, Y., Zhao, R., Zhu, J., Chen, F., Xu, M., Li, G., Song, S., Deng, L., Wang, G., Zheng, H., Pei, J., Zhang, Y., Zhao, M., & Shi, L. (2022). Brain-inspired global-local learning incorporated with neuromorphic computing. Nature Communications, 13(1), 1–14.

  • Xiao, M., Meng, Q., Zhang, Z., Wang, Y., & Lin, Z. (2021). Training feedback spiking neural networks by implicit differentiation on the equilibrium state. Advances in Neural Information Processing Systems, 34, 14516–14528.

  • Xiao, M., Meng, Q., Zhang, Z., He, D., & Lin, Z. (2022). Online training through time for spiking neural networks. Advances in Neural Information Processing Systems, 35, 20717-20730.

  • Yin, H., Molchanov, P., Alvarez, J. M., Li, Z., Mallya, A., Hoiem, D., Jha, N. K. & Kautz, J. (2020). Dreaming to distill: Data-free knowledge transfer via deepinversion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8715–8724).

  • Zheng, H., Wu, Y., Deng, L., Hu, Y., & Li, G. (2021). Going deeper with directly-trained larger spiking neural networks. In Proceedings of the AAAI conference on artificial intelligence (Vol. 35, No. 12, pp. 11062–11070).

Acknowledgements

This project is supported by NSFC Key Program 62236009, Shenzhen Fundamental Research Program (General Program) JCYJ20210324140807019, NSFC General Program 61876032, and Key Laboratory of Data Intelligence and Cognitive Computing, Longhua District, Shenzhen. Yuhang Li is supported by the Baidu PhD Fellowship Program and completed this work during his prior research assistantship at UESTC. The authors would like to thank Youngeun Kim for helpful feedback on the manuscript.

Author information

Corresponding author

Correspondence to Shi Gu.

Additional information

Communicated by Yasuyuki Matsushita.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Major Proof

In order to prove Theorem 4.1, we need to introduce two lemmas.

Lemma A.1

The error term \(\textbf{e}_r^{(n)}\) can be computed from the error in the former layer as

$$\begin{aligned} \textbf{e}_r^{(n)} = {\textbf{B}^{(n)}\textbf{W}^{(n-1)}\textbf{e}^{(n-1)}}, \end{aligned}$$
(A1)

where \(\textbf{B}^{(n)}\) is a diagonal matrix whose diagonal entries are the derivatives of the ReLU activation.

Proof

Without loss of generality, we show that the equation holds for any two consecutive layers \(\ell \) and \(\ell +1\). Recall that the error term \(\textbf{e}_r^{(\ell +1)}\) is the error induced by the difference in input activations. Let \(f(\textbf{a})=\textrm{ReLU}(\textbf{Wa})\) and \(\textbf{e}_r^{(\ell +1)}=f(\textbf{x}^{(\ell )}) - f(\bar{\textbf{s}}^{(\ell )})\). We can expand \(\textbf{e}_r^{(\ell +1)}\) using the Taylor series, given by

$$\begin{aligned} \textbf{e}_r^{(\ell +1)} = \textbf{e}^{(\ell )}\nabla _{\textbf{x}^{(\ell )}}f + \frac{1}{2}\textbf{e}^{(\ell ),\top } \nabla ^2_{\textbf{x}^{(\ell )}}f\textbf{e}^{(\ell )} + {\textrm{O}(||\textbf{e}^{(\ell )}||^3)}, \end{aligned}$$
(2)

where \(\textbf{e}^{(\ell )}\) is the difference between \(\textbf{x}^{(\ell )}\) and \(\bar{\textbf{s}}^{(\ell )}\). The first-order derivative can be written as

$$\begin{aligned} \nabla _{\textbf{x}^{(\ell )}}f = {\textbf{B}^{(\ell +1)}\textbf{W}^{(\ell )}}, \end{aligned}$$
(3)

where \(\textbf{B}^{(\ell +1)}\in \{0, 1\}^{c_{\ell +1}\times c_{\ell +1}}\) is a diagonal matrix whose diagonal entries are the derivatives of the ReLU function, and \(c_{\ell +1}\) denotes the number of neurons in layer \(\ell +1\). To compute the second-order derivative, we differentiate the above equation with respect to \(\textbf{x}^{(\ell )}\). First, the matrix \(\textbf{B}^{(\ell +1)}\) is not a function of \(\textbf{x}^{(\ell )}\) since it consists only of constants. Second, the weight matrix \(\textbf{W}^{(\ell )}\) is also not a function of \(\textbf{x}^{(\ell )}\). Therefore, we can safely drop the second-order and higher-order terms in Eq. (2), and Eq. (A1) holds. \(\square \)
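
Because the ReLU network is piecewise linear, Lemma A.1 can also be checked numerically. Below is a minimal PyTorch sketch, not part of the original paper, in which the layer sizes and tensors are illustrative; it compares the exact output difference \(f(\textbf{x}^{(\ell )}) - f(\bar{\textbf{s}}^{(\ell )})\) with the first-order prediction \(\textbf{B}^{(\ell +1)}\textbf{W}^{(\ell )}\textbf{e}^{(\ell )}\) for a small perturbation.

```python
import torch

torch.manual_seed(0)

# Illustrative layer sizes (not taken from the paper).
c_in, c_out = 16, 8
W = torch.randn(c_out, c_in)

x = torch.randn(c_in)            # ANN activation x^(l)
e = 1e-3 * torch.randn(c_in)     # small error e^(l) = x^(l) - s_bar^(l)
s_bar = x - e                    # averaged SNN spike s_bar^(l)

def f(a):
    """One ANN layer: f(a) = ReLU(W a)."""
    return torch.relu(W @ a)

# Exact output error e_r^(l+1) = f(x^(l)) - f(s_bar^(l)).
e_r_exact = f(x) - f(s_bar)

# First-order prediction B^(l+1) W^(l) e^(l), where B is the diagonal
# matrix of ReLU derivatives evaluated at the pre-activation W x^(l).
B = torch.diag((W @ x > 0).float())
e_r_linear = B @ W @ e

# The two agree exactly unless the perturbation flips a pre-activation sign.
print((e_r_exact - e_r_linear).abs().max())
```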

Lemma A.2

(Botev et al., 2017, Eq. (8)) The Hessian matrix of the activation in layer \(\ell \) can be computed recursively as

$$\begin{aligned} \textbf{H}^{(\ell )} = {\textbf{W}^{(\ell ), \top }\textbf{B}^{(\ell +1), \top }\textbf{H}^{(\ell +1)}\textbf{B}^{(\ell +1)}\textbf{W}^{(\ell )}}. \end{aligned}$$
(4)

Proof

Let \(\textbf{H}_{a, b}^{(\ell )}\) denote the \((a, b)\)-th entry of the Hessian of the loss \(L\) with respect to the \(\ell \)-th layer activation. We can calculate it as

$$\begin{aligned} \textbf{H}_{a, b}^{(\ell )}&= \frac{\partial ^2L}{\partial \textbf{x}_{b}^{(\ell )}\partial \textbf{x}_{a}^{(\ell )}}=\frac{\partial }{\partial \textbf{x}_{b}^{(\ell )}}\left( \sum _i \frac{\partial L}{\partial \textbf{x}_{i}^{(\ell +1)}}\frac{\partial \textbf{x}_{i}^{(\ell +1)}}{\partial \textbf{x}_{a}^{(\ell )}} \right) \end{aligned}$$
(5)
$$\begin{aligned}&= \sum _i \frac{\partial }{\partial \textbf{x}_{b}^{(\ell )}} \left( \frac{\partial L}{\partial \textbf{x}_{i}^{(\ell +1)}} \frac{\partial \textbf{x}_{i}^{(\ell +1)}}{\partial \textbf{z}_{i}^{(\ell )}} \frac{\partial \textbf{z}_{i}^{(\ell )}}{\partial \textbf{x}_{a}^{(\ell )}}\right) \end{aligned}$$
(6)
$$\begin{aligned}&= \sum _i \textbf{W}_{i,a}^{(\ell )} \frac{\partial }{\partial \textbf{x}_{b}^{(\ell )}} \left( \frac{\partial L}{\partial \textbf{x}_{i}^{(\ell +1)}} \frac{\partial \textbf{x}_{i}^{(\ell +1)}}{\partial \textbf{z}_{i}^{(\ell )}} \right) . \end{aligned}$$
(7)

Note that the term \(\frac{\partial \textbf{x}_{i}^{(\ell +1)}}{\partial \textbf{z}_{i}^{(\ell )}}\) is the derivative of the ReLU function, which is a constant (either 0 or 1) and therefore has no gradient with respect to \(\textbf{x}_{b}^{(\ell )}\). Differentiating further, we have

$$\begin{aligned} \textbf{H}_{a, b}^{(\ell )}&= \sum _i \textbf{W}_{i,a}^{(\ell )} \left( \frac{\partial \textbf{x}_{i}^{(\ell +1)}}{\partial \textbf{z}_{i}^{(\ell )}} \frac{\partial ^2L}{\partial \textbf{x}_{b}^{(\ell )}\partial \textbf{x}_{i}^{(\ell +1)}}\right) \end{aligned}$$
(8)
$$\begin{aligned}&= \sum _i \textbf{W}_{i,a}^{(\ell )} \left[ \frac{\partial \textbf{x}_{i}^{(\ell +1)}}{\partial \textbf{z}_{i}^{(\ell )}} \left( \sum _j \frac{\partial ^2L}{\partial \textbf{x}_{j}^{(\ell +1)}\partial \textbf{x}_{i}^{(\ell +1)}}\frac{\partial \textbf{x}_{j}^{(\ell +1)}}{\partial \textbf{x}_{b}^{(\ell )}}\right) \right] \end{aligned}$$
(9)
$$\begin{aligned}&= \sum _i \textbf{W}_{i,a}^{(\ell )} \left[ \frac{\partial \textbf{x}_{i}^{(\ell +1)}}{\partial \textbf{z}_{i}^{(\ell )}} \left( \sum _j \textbf{W}_{j,b}^{(\ell )} \frac{\partial ^2L}{\partial \textbf{x}_{j}^{(\ell +1)}\partial \textbf{x}_{i}^{(\ell +1)}} \frac{\partial \textbf{x}_{j}^{(\ell +1)}}{\partial \textbf{z}_{j}^{(\ell )}} \right) \right] \end{aligned}$$
(10)
$$\begin{aligned}&= \sum _{i, j} \textbf{W}_{i,a}^{(\ell )} \frac{\partial \textbf{x}_{i}^{(\ell +1)}}{\partial \textbf{z}_{i}^{(\ell )}} \frac{\partial ^2L}{\partial \textbf{x}_{j}^{(\ell +1)}\partial \textbf{x}_{i}^{(\ell +1)}} \frac{\partial \textbf{x}_{j}^{(\ell +1)}}{\partial \textbf{z}_{j}^{(\ell )}} \textbf{W}_{j,b}^{(\ell )}. \end{aligned}$$
(11)

Rewriting this in matrix form yields Eq. (4), and thus the lemma holds. \(\square \)
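
Lemma A.2 can likewise be sanity-checked with automatic differentiation. The sketch below is illustrative and not from the paper; the quadratic loss and layer sizes are assumptions chosen only to make the check concrete. It compares the autograd Hessian with respect to \(\textbf{x}^{(\ell )}\) against the recursion \(\textbf{W}^{(\ell ),\top }\textbf{B}^{(\ell +1),\top }\textbf{H}^{(\ell +1)}\textbf{B}^{(\ell +1)}\textbf{W}^{(\ell )}\).

```python
import torch
from torch.autograd.functional import hessian

torch.manual_seed(0)
c_in, c_out = 6, 4
W = torch.randn(c_out, c_in)
target = torch.randn(c_out)
x = torch.randn(c_in)

def loss_from_x(a):
    """Loss as a function of x^(l): L = 0.5 * ||ReLU(W a) - target||^2."""
    return 0.5 * ((torch.relu(W @ a) - target) ** 2).sum()

def loss_from_next(h):
    """Same loss viewed as a function of x^(l+1) = h."""
    return 0.5 * ((h - target) ** 2).sum()

# Autograd Hessians with respect to x^(l) and x^(l+1).
H_l = hessian(loss_from_x, x)
H_next = hessian(loss_from_next, torch.relu(W @ x))

# Recursion of Lemma A.2 (the second derivative of ReLU is zero a.e.):
# H^(l) = W^(l),T  B^(l+1),T  H^(l+1)  B^(l+1)  W^(l).
B = torch.diag((W @ x > 0).float())
H_l_rec = W.T @ B.T @ H_next @ B @ W

print((H_l - H_l_rec).abs().max())  # ~0 up to floating-point error
```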

Now, we can prove our theorem with the above two lemmas.

Proof

According to Lemma A.1, the error in the last layer can be written as

$$\begin{aligned} \textbf{e}^{(n)} = {\textbf{B}^{(n)}\textbf{W}^{(n-1)}\textbf{e}^{(n-1)}} + \textbf{e}_c^{(n)}. \end{aligned}$$
(12)

Applying this equation to the second-order objective, we have

$$\begin{aligned} \textbf{e}^{(n), \top }\textbf{H}^{(n)}\textbf{e}^{(n)}&= {\textbf{e}^{(n\!-\!1), \top }\textbf{W}^{(n\!-\!1), \top }\textbf{B}^{(n), \top } \textbf{H}^{(n)}\textbf{B}^{(n)}\textbf{W}^{(n\!-\!1)}\textbf{e}^{(n\!-\!1)} }\nonumber \\&\quad + {2\textbf{e}^{(n{-}1), \top }\textbf{W}^{(n{-}1), \top } \textbf{B}^{(n), \top }\textbf{H}^{(n)}\textbf{e}_c^{(n)}} \nonumber \\&\quad + \textbf{e}_c^{(n), \top }\textbf{H}^{(n)}\textbf{e}_c^{(n)}. \end{aligned}$$
(13)

According to Lemma A.2, the Hessian of activation can be derived recursively as

$$\begin{aligned} \textbf{H}^{(n-1)} = {\textbf{W}^{(n-1), \top }\textbf{B}^{(n), \top }\textbf{H}^{(n)}\textbf{B}^{(n)}\textbf{W}^{(n-1)}}. \end{aligned}$$
(14)

Thus, the first term in Eq. (13) can be rewritten as \(\textbf{e}^{(n-1), \top }\textbf{H}^{(n-1)}\textbf{e}^{(n-1)}\). The second term can be upper bounded using the Cauchy-Schwarz inequality \(x^\top A y \le \sqrt{(x^\top A x)(y^\top A y)}\) and the inequality of arithmetic and geometric means \(\sqrt{(x^\top A x)(y^\top A y)} \le \frac{1}{2}(x^\top A x+y^\top A y)\), taking \(A = \textbf{H}^{(n)}\). Therefore, the second term is upper bounded by

$$\begin{aligned} {2\textbf{e}^{(n-1), \top }}&{\textbf{W}^{(n-1), \top } \textbf{B}^{(n), \top }\textbf{H}^{(n)}\textbf{e}_c^{(n)}} \le \textbf{e}_c^{(n), \top }\textbf{H}^{(n)}\textbf{e}_c^{(n)} \nonumber \\ {}&+ {\textbf{e}^{(n-1), \top }\textbf{W}^{(n-1), \top }\textbf{B}^{(n), \top } \textbf{H}^{(n)}\textbf{B}^{(n)}\textbf{W}^{(n-1)}\textbf{e}^{(n-1)} }. \end{aligned}$$
(15)

Substituting Eqs. (14) and (15) into Eq. (13), and then unrolling the resulting bound recursively down to the first layer, we obtain

$$\begin{aligned} \textbf{e}^{(n), \top }\textbf{H}^{(n)}\textbf{e}^{(n)}&\le 2\textbf{e}^{(n-1), \top }\textbf{H}^{(n-1)}\textbf{e}^{(n-1)} + 2\textbf{e}_c^{(n), \top }\textbf{H}^{(n)}\textbf{e}_c^{(n)} \nonumber \\&\le \sum _{\ell =1}^n 2^{n-\ell +1} \textbf{e}_c^{(\ell ), \top }\textbf{H}^{(\ell )}\textbf{e}_c^{(\ell )}. \end{aligned}$$
(16)

\(\square \)

Appendix B Generalization to Under/Over-Fire Error

Here, we show that our analysis of error propagation and the calibration algorithm generalize to under-fire and over-fire errors. Recall that when \(\textbf{v}^{(\ell )}(T)\) does not lie in \(\left[ 0, V^{(\ell )}_{th}\right) \), under-fire or over-fire occurs and we cannot use a closed-form expression for \(\bar{\textbf{s}}^{(\ell +1)}\). We therefore use the function \(\textrm{AvgIF}\) to denote the averaged spike output of the IF neuron:

$$\begin{aligned} \bar{\textbf{s}}^{(\ell +1)} = \textrm{AvgIF}(\textbf{W}^{(\ell )}\bar{\textbf{s}}^{(\ell )}). \end{aligned}$$
(B.17)

Here, \(\textrm{AvgIF}(\textbf{W}^{(\ell )}\bar{\textbf{s}}^{(\ell )})\) returns \(\frac{m}{T}V_{th}^{(\ell )}\), where \(m\) is the number of output spikes. Note that with under/over-fire error, \(m\) cannot be computed analytically. Similarly, we define \(\textbf{e}_c^{(\ell +1)}=\textrm{ReLU}(\textbf{W}^{(\ell )}\bar{\textbf{s}}^{(\ell )})-\textrm{AvgIF}(\textbf{W}^{(\ell )}\bar{\textbf{s}}^{(\ell )})\) as the local conversion error, which contains the clipping, flooring, over-fire, and under-fire errors. Then, we can rewrite Eq. (13) as

$$\begin{aligned} \textbf{e}^{(n)} = \textbf{x}^{(n)}-\bar{\textbf{s}}^{(n)}&= \textrm{ReLU}(\textbf{W}^{(n-1)}\textbf{x}^{(n-1)}) \nonumber \\ {}&\quad - \textrm{AvgIF}(\textbf{W}^{(n-1)}\bar{\textbf{s}}^{(n-1)}) \nonumber \\&= \textrm{ReLU}(\textbf{W}^{(n-1)}\textbf{x}^{(n-1)}) \nonumber \\ {}&\quad - \textrm{ReLU}(\textbf{W}^{(n-1)}\bar{\textbf{s}}^{(n-1)}) \nonumber \\&\quad + \textrm{ReLU}(\textbf{W}^{(n-1)}\bar{\textbf{s}}^{(n-1)}) \nonumber \\ {}&\quad - \textrm{AvgIF}(\textbf{W}^{(n-1)}\bar{\textbf{s}}^{(n-1)}) \end{aligned}$$
(B.18)
$$\begin{aligned}&= \textbf{e}_r^{(n)} + \textbf{e}_c^{(n)}. \end{aligned}$$
(B.19)

Here, the accumulated error \(\textbf{e}_r^{(n)}\) is the same as in Eq. (13) and can be eliminated using the second-order analysis. As for \(\textbf{e}_c\), it can be addressed by our parameter calibration algorithm since our method is data-driven: during calibration, the actual \(\textbf{e}_c\) in each layer is computed from the difference between \(\bar{\textbf{s}}^{(\ell +1)}\) and \(\textbf{x}^{(\ell +1)}\). Therefore, all errors can be reduced by calibration.
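
For intuition, the following sketch simulates \(\textrm{AvgIF}\) with a reset-by-subtraction IF neuron driven by a time-varying input and measures the local conversion error against the ReLU of the time-averaged input. The soft-reset neuron model, zero initial potential, and tensor shapes are illustrative assumptions rather than the exact simulation used in the paper.

```python
import torch

def avg_if(z_t: torch.Tensor, v_th: float) -> torch.Tensor:
    """Simulate IF neurons with reset-by-subtraction driven by a per-step input
    current z_t of shape (T, C), and return the averaged spike (m / T) * v_th."""
    T = z_t.shape[0]
    v = torch.zeros_like(z_t[0])        # assumed zero initial membrane potential
    spike_count = torch.zeros_like(z_t[0])
    for t in range(T):
        v = v + z_t[t]                  # integrate this step's input
        fired = (v >= v_th).float()     # at most one spike per step
        spike_count += fired
        v = v - fired * v_th            # reset by subtraction
    return spike_count / T * v_th

T, v_th, C = 32, 1.0, 5
torch.manual_seed(0)

# Time-varying input (e.g., weighted spikes from the previous layer); its mean
# over time plays the role of W s_bar in Eq. (B.17).
z_t = 0.5 + 0.5 * torch.randn(T, C)
z_mean = z_t.mean(dim=0)

s_bar = avg_if(z_t, v_th)
relu = torch.relu(z_mean)
print("local conversion error e_c:", relu - s_bar)  # includes under/over-fire effects
```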

Appendix C Experimental Details

C.1 ImageNet Pre-training

The ImageNet dataset (Deng et al., 2009) contains 1.2 M training images and 50 k validation images. For training pre-processing, we randomly crop and resize the training images to 224 \(\times \) 224. We additionally apply ColorJitter with brightness = 0.2, contrast = 0.2, saturation = 0.2, and hue = 0.1. Test images are center-cropped to the same size. For all architectures we tested, the max pooling layers are replaced with average pooling layers. ResNet-34 uses a deep-stem layer (i.e., three 3 \(\times \) 3 conv. layers replacing the original 7 \(\times \) 7 first conv. layer), as described in He et al. (2019). We use Stochastic Gradient Descent with a momentum of 0.9 as the optimizer. The learning rate is set to 0.1 and decayed with a cosine schedule (Loshchilov & Hutter, 2016). Weight decay is set to \(10^{-4}\), and the networks are optimized for 120 epochs. We also apply label smoothing (Szegedy et al., 2016) with a factor of 0.1 and an EMA update with a decay rate of 0.999. For MobileNet, we use the pre-trained model from pytorchcv (see Footnote 2).
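
For reference, a minimal PyTorch sketch of the pre-training recipe above (augmentation, SGD with momentum, cosine decay, label smoothing). The tiny random stand-in dataset, the plain torchvision ResNet-34 (no deep-stem or pooling replacement), and the omission of the EMA update are simplifications for illustration, not the authors' released training code.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models, transforms

# Augmentation described above; in the real setup this is passed to the ImageNet dataset.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.ToTensor(),
])

# Stand-in for the ImageNet loader: a few random tensors so the sketch runs end-to-end.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 1000, (8,))
train_loader = DataLoader(TensorDataset(images, labels), batch_size=4)

model = models.resnet34(num_classes=1000)  # plain ResNet-34; deep-stem/pooling swap omitted
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # label smoothing factor 0.1 (PyTorch >= 1.10)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=120)  # cosine decay over 120 epochs

num_epochs = 1  # 120 in the actual setup
for epoch in range(num_epochs):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()
```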

C.2 CIFAR Pre-training

The CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2010) each contain 50 k training images and 10 k validation images. We pad the training images by 4 pixels and randomly crop them to 32 \(\times \) 32. Other data augmentations include (1) random horizontal flip, (2) Cutout (DeVries & Taylor, 2017), and (3) AutoAugment (Cubuk et al., 2019). For ResNet-20, we follow prior works (Han et al., 2020; Han & Roy, 2020), which modify the official network structure proposed in He et al. (2016), to make a fair comparison; the modified ResNet-20 contains 3 stages with 4\(\times \) more channels. For VGG-16 without BN layers, we add Dropout with a drop rate of 0.25 to regularize the network. For models with BN layers, we use Stochastic Gradient Descent with a momentum of 0.9 as the optimizer; the learning rate is set to 0.1 and decayed with a cosine schedule (Loshchilov & Hutter, 2016), weight decay is set to \(5\times 10^{-4}\), and the networks are optimized for 300 epochs. For networks without BN layers, we set the weight decay to \(10^{-4}\) and the learning rate to 0.005.
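
As an illustration of the CIFAR augmentation pipeline above, the sketch below composes the pad-and-crop, horizontal flip, AutoAugment, and Cutout transforms with torchvision; the Cutout implementation and its patch length are assumptions for illustration, since the exact settings are not restated here.

```python
import torch
from torchvision import transforms

class Cutout:
    """Cutout (DeVries & Taylor, 2017): zero out a random square patch of a CHW tensor."""
    def __init__(self, length=8):   # patch length is illustrative
        self.length = length

    def __call__(self, img):
        _, h, w = img.shape
        y = torch.randint(h, (1,)).item()
        x = torch.randint(w, (1,)).item()
        y1, y2 = max(0, y - self.length // 2), min(h, y + self.length // 2)
        x1, x2 = max(0, x - self.length // 2), min(w, x + self.length // 2)
        img[:, y1:y2, x1:x2] = 0.0
        return img

# CIFAR training augmentation described above: pad-4 random crop to 32x32,
# random horizontal flip, AutoAugment (CIFAR-10 policy), then Cutout on the tensor.
cifar_train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.AutoAugment(transforms.AutoAugmentPolicy.CIFAR10),
    transforms.ToTensor(),
    Cutout(length=8),
])
# Usage: torchvision.datasets.CIFAR10(root, train=True, transform=cifar_train_transform)
```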

C.3 COCO Pre-training

The COCO dataset (Lin et al., 2014) contains 121,408 images with 883 k object annotations across 80 classes. The median image resolution is 640 \(\times \) 480. The ResNet-50 backbone is pre-trained on the ImageNet dataset. We train RetinaNet and Faster R-CNN for 12 epochs using SGD with a learning rate of 0.000625, decayed by a factor of 10 at the 8th and 11th epochs. The momentum is 0.9, and we use linear learning-rate warmup in the first epoch. The batch size per GPU is 2, and we use 8 GPUs in total for pre-training. The weight decay is set to \(10^{-4}\). To stabilize training, we freeze the BN layers (since they perform poorly with small-batch training) as well as the stem layer and the first stage of the backbone.
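
The sketch below illustrates the backbone freezing and optimization schedule described above for a torchvision ResNet-50; the module names (conv1, bn1, layer1) and the MultiStepLR schedule are assumptions about how one might reproduce the setup, not the authors' detection code.

```python
import torch
from torch import nn
from torchvision.models import resnet50

def freeze_bn_and_stem(backbone: nn.Module) -> None:
    """Freeze all BatchNorm layers (statistics and affine parameters) plus the
    stem and first residual stage of a ResNet backbone, as described above."""
    for m in backbone.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()                        # keep running statistics fixed
            for p in m.parameters():
                p.requires_grad = False     # stop gradient on affine parameters
    # Freeze the stem (conv1 + bn1) and the first residual stage (layer1).
    for name in ["conv1", "bn1", "layer1"]:
        for p in getattr(backbone, name).parameters():
            p.requires_grad = False
    # Note: in a real loop, re-apply .eval() to BN after each call to model.train().

backbone = resnet50()
freeze_bn_and_stem(backbone)

# SGD setup described above: lr 0.000625, momentum 0.9, weight decay 1e-4,
# decayed by 10x after epochs 8 and 11 (12 epochs total).
params = [p for p in backbone.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.000625, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[8, 11], gamma=0.1)
```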

C.4 COCO Conversion Details

For calibration, we use 128 training images taken from the MS COCO dataset. The image resolution is set to 800 (with a maximum size of 1333). Note that in object detection the image resolution varies from image to image, so we cannot set an initial membrane potential along the spatial dimensions (which would require a fixed resolution). Instead, similar to bias calibration, we compute the mean value reduced over each channel and set the initial membrane potential to \(T\times \mu (\textbf{e}^{(i)})\). The learning hyper-parameters are kept the same as in the ImageNet experiments.
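
A small sketch of the channel-wise initialization described above: the per-channel mean of the layer error is scaled by \(T\) to form the initial membrane potential. The function and tensor names are hypothetical, and stacking activations into a single tensor assumes a fixed resolution, which in practice would be handled image-by-image.

```python
import torch

def channelwise_initial_potential(ann_act: torch.Tensor,
                                  snn_avg_spike: torch.Tensor,
                                  T: int) -> torch.Tensor:
    """Set the initial membrane potential per channel as T * mu(e), where
    e = ann_act - snn_avg_spike is the layer-wise error on calibration data.

    ann_act, snn_avg_spike: (N, C, H, W) activations collected on the 128
    calibration images (names are illustrative, not the authors' API).
    Returns a (C,) tensor, one initial potential value per channel.
    """
    e = ann_act - snn_avg_spike
    mu = e.mean(dim=(0, 2, 3))   # channel-wise mean over batch and spatial dims
    return T * mu

# Example with random stand-in activations (fixed resolution for simplicity).
ann_act = torch.rand(128, 64, 100, 152)
snn_avg = torch.rand(128, 64, 100, 152)
v_init = channelwise_initial_potential(ann_act, snn_avg, T=32)
print(v_init.shape)  # torch.Size([64])
```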

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, Y., Deng, S., Dong, X. et al. Error-Aware Conversion from ANN to SNN via Post-training Parameter Calibration. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02046-2
