Abstract
Video data is often repetitive; for example, the contents of adjacent frames are usually strongly correlated. Such redundancy occurs at multiple levels of complexity, from low-level pixel values to textures and high-level semantics. We propose Event Neural Networks (EvNets), which leverage this redundancy to achieve considerable computation savings during video inference. A defining characteristic of EvNets is that each neuron has state variables that provide it with long-term memory, which allows low-cost, high-accuracy inference even in the presence of significant camera motion. We show that it is possible to transform a wide range of neural networks into EvNets without re-training. We demonstrate our method on state-of-the-art architectures for both high- and low-level visual processing, including pose recognition, object detection, optical flow, and image enhancement. We observe roughly an order-of-magnitude reduction in computational costs compared to conventional networks, with minimal reductions in model accuracy.
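To make the mechanism in the abstract concrete, below is a minimal sketch (our illustration, not the authors' released code) of an event-driven linear layer: each neuron keeps long-term memory of the last input values at which its output was computed, plus the cached output, and recomputes incrementally only for inputs whose accumulated change crosses a threshold. The class name `EventLinear`, the `threshold` value, and the omission of nonlinearities are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

class EventLinear:
    """Sketch of an event-driven linear layer for video inference.

    Each neuron keeps state: the input values at which its output was
    last computed, and the cached output itself. On a new frame, only
    inputs whose accumulated change exceeds a threshold fire an "event";
    the output is then updated incrementally from those deltas alone.
    """

    def __init__(self, weight, bias, threshold=1e-2):
        self.weight = np.asarray(weight, dtype=float)  # (out_dim, in_dim), e.g. pretrained
        self.bias = np.asarray(bias, dtype=float)
        self.threshold = threshold    # event-firing threshold (illustrative value)
        self.last_input = None        # memory: inputs at last recomputation
        self.last_output = None       # memory: cached output

    def __call__(self, x):
        x = np.asarray(x, dtype=float)
        if self.last_input is None:
            # First frame: one dense pass initializes the memory.
            self.last_input = x.copy()
            self.last_output = self.weight @ x + self.bias
            return self.last_output
        delta = x - self.last_input
        fired = np.abs(delta) > self.threshold
        if fired.any():
            # Incremental update touches only columns whose inputs fired,
            # so cost scales with the number of events, not the layer size.
            self.last_output = self.last_output + self.weight[:, fired] @ delta[fired]
            self.last_input[fired] = x[fired]
        return self.last_output
```

In this toy form, the per-frame cost tracks how much the video actually changes: static regions trigger almost no multiplications, while sub-threshold residuals remain in the stored state rather than being discarded, which is what allows accuracy to be preserved under long sequences and camera motion. Extending the idea past a single linear layer (in particular, through nonlinearities) requires the paper's full construction.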
Acknowledgments
This research was supported by NSF CAREER Award 1943149.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Dutson, M., Li, Y., Gupta, M. (2022). Event Neural Networks. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13671. Springer, Cham. https://doi.org/10.1007/978-3-031-20083-0_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20082-3
Online ISBN: 978-3-031-20083-0
eBook Packages: Computer Science, Computer Science (R0)