Abstract
Most existing trackers based on RGB/grayscale frames may fail in challenging scenarios (e.g., motion blur and high dynamic range) where conventional sensors become unreliable. Event-based cameras, as bio-inspired sensors, encode brightness changes with high temporal resolution and high dynamic range, and thus hold considerable potential for tracking under degraded conditions. Nevertheless, events lack the fine-grained texture cues provided by RGB/grayscale frames. This complementarity encourages us to fuse visual cues from the frame and event domains for robust object tracking under various challenging conditions. In this paper, we propose a novel event feature extractor that captures spatiotemporal features with motion cues from event-based data by boosting interactions and distinguishing alterations between states at different moments. Furthermore, we develop an effective feature integrator that adaptively fuses the strengths of both domains by balancing their contributions. The proposed module is a plug-in that can be easily applied to off-the-shelf frame-based trackers. We extensively validate its effectiveness on eight trackers extended by our approach across three datasets: EED, VisEvent, and our collected frame-event dataset FE141. Experimental results also show that event-based data are a powerful cue for tracking.
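To make the two ideas in the abstract concrete, the sketch below illustrates (1) a common hand-crafted event representation (time-binned accumulation of events into a grid) and (2) a toy adaptive fusion that balances frame and event contributions with per-domain gates. Both functions (`events_to_voxel_grid`, `adaptive_fuse`) are illustrative stand-ins of our own naming, not the paper's learned extractor or integrator; the binning and gating schemes are assumptions chosen for simplicity.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate events, given as rows (t, x, y, polarity), into a
    time-binned grid. This is one common hand-crafted event representation;
    the paper's actual extractor is learned, so this is only a stand-in."""
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    t = events[:, 0]
    # Normalize timestamps to [0, num_bins) so each event lands in a bin.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1e-6)
    bins = t_norm.astype(int)
    xs = events[:, 1].astype(int)
    ys = events[:, 2].astype(int)
    pol = np.where(events[:, 3] > 0, 1.0, -1.0)  # signed polarity
    # Unbuffered accumulation: repeated (bin, y, x) indices all contribute.
    np.add.at(grid, (bins, ys, xs), pol)
    return grid

def adaptive_fuse(frame_feat, event_feat):
    """Fuse frame and event feature maps with scalar gates derived from each
    domain's global response -- a toy version of 'balancing contributions'."""
    s_f = np.abs(frame_feat).mean()
    s_e = np.abs(event_feat).mean()
    w = np.exp(np.array([s_f, s_e]))
    w = w / w.sum()  # softmax over the two domains
    return w[0] * frame_feat + w[1] * event_feat
```

When the event stream carries little signal (e.g., a static scene), its gate shrinks and the fused feature leans on the frame domain, and vice versa under motion blur; the actual integrator in the paper learns this balance rather than computing it from global magnitudes.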
Acknowledgements
This work was supported in part by the National Key Research and Development Program of China (2022ZD0210500), the National Natural Science Foundation of China under Grants 62332019/61972067, and the Distinguished Young Scholars Funding of Dalian (No. 2022RJ01).
Additional information
Communicated by Boxin Shi.
About this article
Cite this article
Zhang, J., Dong, B., Fu, Y. et al. A Universal Event-Based Plug-In Module for Visual Object Tracking in Degraded Conditions. Int J Comput Vis 132, 1857–1879 (2024). https://doi.org/10.1007/s11263-023-01959-8