Abstract
Event cameras, a new generation of bio-inspired sensors characterized by high dynamic range and high temporal resolution, provide a competitive new modality for multi-modal tracking. However, recent work on RGB-Event (RGBE) tracking focuses heavily on exploiting complementary information while neglecting to enhance modality-shared information and the global relations within and across modalities. In this paper, we propose an end-to-end fully attentional tracker, the Swin Transformer Event Frame Tracker (SwinEFT), to fully exploit both modality-specific and modality-shared information. Specifically, we first adopt a simple but effective event representation that narrows the domain gap and yields a clearer tracking target. By deploying a shifted-window attention mechanism, our tracker better leverages global relations and thus locates more accurate bounding boxes. In addition, to enhance modality-shared information, we design a Swin Decoder that introduces shifted-window cross-attention for information interaction between modalities. Extensive experiments on two realistic RGBE tracking datasets demonstrate the outstanding performance and robustness of SwinEFT against state-of-the-art methods under various challenging scenarios.
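The abstract mentions converting the asynchronous event stream into a frame-like representation so it can be processed alongside RGB frames. As a minimal sketch only (the paper's exact representation is not specified here), one common approach is to accumulate positive- and negative-polarity events into separate channels and normalize the counts; the function name and event layout below are illustrative assumptions, not the authors' API:

```python
import numpy as np

def events_to_frame(events, height, width):
    """Accumulate a stream of events into a 2-channel frame.

    `events` is an (N, 4) array of (x, y, timestamp, polarity),
    with polarity in {-1, +1}. Positive and negative events are
    counted into separate channels, then the counts are scaled
    to [0, 1] so the frame can be fed to an image backbone.
    """
    frame = np.zeros((2, height, width), dtype=np.float32)
    for x, y, _, p in events:
        channel = 0 if p > 0 else 1  # channel 0: positive, 1: negative
        frame[channel, int(y), int(x)] += 1.0
    peak = frame.max()
    if peak > 0:
        frame /= peak  # normalize event counts to [0, 1]
    return frame
```

Such a dense representation lets a shared Swin-style backbone consume the RGB frame and the event frame with the same patch-embedding pipeline, which is one way to narrow the domain gap the abstract refers to.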
Funding
This work was funded by Open and Innovation Fund of Hubei Three Gorges Laboratory, grant number SK215002.
Ethics declarations
Conflict of interest
All authors declare that there are no conflicts of interest.
Cite this article
Zeng, Z., Li, X., Fan, C. et al. SwinEFT: a robust and powerful Swin Transformer based Event Frame Tracker. Appl Intell 53, 23564–23581 (2023). https://doi.org/10.1007/s10489-023-04763-6