Abstract
Recognizing target objects with an event-based camera has drawn increasing attention in recent years. Existing works usually convert event streams into representations such as point clouds, voxels, or images, and learn feature representations with various deep neural networks. Their final results may be limited by two factors: reliance on a single modal representation and the design of the network architecture. To address these challenges, this paper proposes a novel dual-stream framework for event representation, extraction, and fusion. The framework simultaneously models two common representations: event images and event voxels. Using Transformer and structured Graph Neural Network (GNN) architectures, spatial information and three-dimensional stereo information are learned separately. A bottleneck Transformer is then introduced to fuse the two streams. Extensive experiments demonstrate that the proposed framework achieves state-of-the-art performance on two widely used event-based classification datasets. The source code of this work is available at: https://github.com/Event-AHU/EFV_event_classification.
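To make the described architecture concrete, below is a minimal PyTorch sketch of the dual-stream idea: a Transformer encoder over event-image tokens, a toy message-passing layer standing in for the structured GNN over event voxels, and a small set of shared bottleneck tokens through which the two streams exchange information. All module names, dimensions, and the simplified graph layer are illustrative assumptions rather than the paper's exact design; the authors' actual implementation is in the repository linked above.

```python
# Illustrative sketch only: names, dimensions, and the simplified graph
# layer are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn


class SimpleGraphLayer(nn.Module):
    """Toy message passing: mixes each voxel node with its neighbors
    via a dense adjacency matrix (stand-in for the structured GNN branch)."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (B, N, D) node features; adj: (B, N, N) row-normalized adjacency
        return torch.relu(self.linear(adj @ x))


class DualStreamBottleneckFusion(nn.Module):
    def __init__(self, dim=128, num_bottleneck=4, num_classes=10):
        super().__init__()
        # Spatial stream: Transformer encoder over event-image patch tokens.
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Stereo stream: graph layer over event-voxel nodes.
        self.voxel_encoder = SimpleGraphLayer(dim)
        # Learnable bottleneck tokens shared by both streams.
        self.bottleneck = nn.Parameter(torch.randn(1, num_bottleneck, dim))
        self.fusion = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                                 batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, img_tokens, voxel_feats, adj):
        B = img_tokens.size(0)
        img = self.image_encoder(img_tokens)        # (B, Ni, D)
        vox = self.voxel_encoder(voxel_feats, adj)  # (B, Nv, D)
        btl = self.bottleneck.expand(B, -1, -1)     # (B, K, D)
        # In the spirit of attention-bottleneck fusion: a faithful version
        # would mask direct image<->voxel attention so all cross-modal
        # exchange flows through the few bottleneck tokens; the mask is
        # omitted here for brevity.
        fused = self.fusion(torch.cat([img, btl, vox], dim=1))
        return self.head(fused.mean(dim=1))         # (B, num_classes)


# Usage with random tensors standing in for real event data.
model = DualStreamBottleneckFusion()
img_tokens = torch.randn(2, 196, 128)   # event-image patch tokens
voxel_feats = torch.randn(2, 64, 128)   # event-voxel node features
adj = torch.softmax(torch.randn(2, 64, 64), dim=-1)  # toy adjacency
logits = model(img_tokens, voxel_feats, adj)
print(logits.shape)  # torch.Size([2, 10])
```

The bottleneck tokens act as a narrow channel between the two modalities, which is what regularizes the fusion compared with full cross-attention; the classifier head here simply mean-pools all tokens for brevity.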
Acknowledgement
This work was supported by the National Natural Science Foundation of China (No. 62102205).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Yuan, C., et al. (2024). Learning Bottleneck Transformer for Event Image-Voxel Feature Fusion Based Classification. In: Liu, Q., et al. (eds.) Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol. 14425. Springer, Singapore. https://doi.org/10.1007/978-981-99-8429-9_1
DOI: https://doi.org/10.1007/978-981-99-8429-9_1
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8428-2
Online ISBN: 978-981-99-8429-9