
SwinEFT: a robust and powerful Swin Transformer based Event Frame Tracker


Abstract

Event cameras, a new generation of bio-inspired sensors with high dynamic range and high temporal resolution, provide a competitive new modality for multi-modal tracking. However, recent work on RGB-Event (RGBE) tracking focuses heavily on exploiting complementary information while neglecting to enhance modality-shared information and the global relations within and across modalities. In this paper, we propose an end-to-end, fully attention-based tracker named the Swin Transformer Event Frame Tracker (SwinEFT) to fully explore both modality-specific and modality-shared information. Specifically, we first adopt a simple but effective event representation that narrows the domain gap between modalities and yields a clearer tracking target. By employing a shifted-window attention mechanism, the tracker better exploits global relations and therefore locates more accurate bounding boxes. In addition, to enhance modality-shared information, we design a Swin Decoder that introduces shifted-window cross-attention for information interaction between modalities. Extensive experiments on two realistic RGBE tracking datasets demonstrate the outstanding performance and robustness of SwinEFT compared with state-of-the-art methods under various challenging scenarios.
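
The abstract describes the event representation only as "simple but effective". As a rough illustration, and purely as an assumption rather than the authors' actual method, the sketch below shows one common way to collapse an asynchronous event stream into a dense two-channel event frame (per-polarity counts) that a shifted-window attention backbone could consume alongside the RGB frame; the function name, array layout, and normalisation are all illustrative choices.

```python
import numpy as np

def events_to_frame(events, height, width):
    """Accumulate an asynchronous event stream into a two-channel event frame.

    `events` is assumed to be an (N, 4) array of (x, y, timestamp, polarity)
    with polarity in {-1, +1}. This is an illustrative sketch, not the
    representation used in the paper.
    """
    frame = np.zeros((2, height, width), dtype=np.float32)
    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    pol = events[:, 3]

    # Count positive events in channel 0 and negative events in channel 1.
    np.add.at(frame[0], (y[pol > 0], x[pol > 0]), 1.0)
    np.add.at(frame[1], (y[pol < 0], x[pol < 0]), 1.0)

    # Normalise so the frame is insensitive to the absolute event rate.
    frame /= max(frame.max(), 1.0)
    return frame


# Example: 1000 random events on a 346 x 260 sensor (DAVIS346-like resolution).
rng = np.random.default_rng(0)
events = np.stack([
    rng.integers(0, 346, 1000),     # x coordinates
    rng.integers(0, 260, 1000),     # y coordinates
    np.sort(rng.random(1000)),      # timestamps
    rng.choice([-1.0, 1.0], 1000),  # polarities
], axis=1)
event_frame = events_to_frame(events, height=260, width=346)
print(event_frame.shape)  # (2, 260, 346)
```

In SwinEFT, such an event frame and the corresponding RGB frame would each be processed with shifted-window attention, and the proposed Swin Decoder would exchange information between them through shifted-window cross-attention; the exact representation and fusion details are given in the full paper.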


Data availability

All data generated or analysed during this study are included in these published articles [10, 11].

References

  1. Javed S, Danelljan M, Shahbaz Khan F, Khan MH, Felsberg M, Matas J (2022) Visual object tracking with discriminative filters and siamese networks: a survey and outlook. IEEE Trans Pattern Anal Mach Intell

  2. Huang L, Zhao X, Huang K (2019) Got-10k: a large high-diversity benchmark for generic object tracking in the wild. IEEE Trans Pattern Anal Mach Intell 43(5):1562–1577

  3. Xiao Y, Yang M, Li C, Liu L, Tang J (2022) Attribute-based progressive fusion network for RGBT tracking. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp 2831–2838

  4. Gao Y, Li C, Zhu Y, Tang J, He T, Wang F (2019) Deep adaptive fusion network for high performance RGBT tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops

  5. Lu A, Li C, Yan Y, Tang J, Luo B (2021) RGBT tracking via multi-adapter network with hierarchical divergence loss. IEEE Trans Image Process 30:5613–5625

  6. Zhao P, Liu Q, Wang W, Guo Q (2021) TSDM: tracking by SIAMRPN++ with a depth-refiner and a mask-generator. In: 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, pp 670–676

  7. Yan S, Yang J, Käpylä J, Zheng F, Leonardis A, Kämäräinen J-K (2021) Depthtrack: unveiling the power of RGBD tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp 10725–10733

  8. Kumar A, Walia GS, Sharma K (2020) Recent trends in multicue based visual tracking: a review. Expert Syst Appl 162:113711

  9. Gallego G, Delbrück T, Orchard G, Bartolozzi C, Taba B, Censi A, Leutenegger S, Davison AJ, Conradt J, Daniilidis K et al (2020) Event-based vision: a survey. IEEE Trans Pattern Anal Mach Intell 44(1):154–180

  10. Zhang J, Yang X, Fu Y, Wei X, Yin B, Dong B (2021) Object tracking by jointly exploiting frame and event domain. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp 13043–13052

  11. Wang X, Li J, Zhu L, Zhang Z, Chen Z, Li X, Wang Y, Tian Y, Wu F (2021) Visevent: reliable object tracking via collaboration of frame and event flows. Preprint at http://arxiv.org/abs/2108.05015

  12. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst 30

  13. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S et al (2020) An image is worth 16x16 words: transformers for image recognition at scale. In: International Conference on Learning Representations

  14. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp 10012–10022

  15. Chen X, Yan B, Zhu J, Wang D, Yang X, Lu H (2021) Transformer tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp 8126–8135

  16. Wang N, Zhou W, Wang J, Li H (2021) Transformer meets tracker: exploiting temporal context for robust visual tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp 1571–1580

  17. Lin L, Fan H, Zhang Z, Xu Y, Ling H (2022) Swintrack: a simple and strong baseline for transformer tracking. In: Advances in Neural Information Processing Systems

  18. Mayer C, Danelljan M, Bhat G, Paul M, Paudel DP, Yu F, Van Gool L (2022) Transforming model prediction for tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp 8731–8740

  19. Ye B, Chang H, Ma B, Shan S, Chen X (2022) Joint feature learning and relation modeling for tracking: a one-stream framework. In: European Conference on Computer Vision. Springer, pp 341–357

  20. Zhao C, Liu H, Nan S, Yan Y (2022) TFTN: a transformer-based fusion tracking framework of hyperspectral and RGB. IEEE Trans Geosci Remote Sens 60:1–15

  21. Feng M, Su J (2022) Learning reliable modal weight with transformer for robust RGBT tracking. Knowl-Based Syst 108945

  22. Li C, Cheng H, Hu S, Liu X, Tang J, Lin L (2016) Learning collaborative sparse representation for grayscale-thermal tracking. IEEE Trans Image Process 25(12):5743–5756

  23. Lan X, Ye M, Zhang S, Zhou H, Yuen PC (2020) Modality-correlation-aware sparse representation for RGB-infrared object tracking. Pattern Recogn Lett 130:12–20

  24. Qin X, Mei Y, Liu J, Li C (2021) Multimodal cross-layer bilinear pooling for RGBT tracking. IEEE Trans Multimedia 24:567–580

  25. Zhang P, Zhao J, Bo C, Wang D, Lu H, Yang X (2021) Jointly modeling motion and appearance cues for robust RGB-T tracking. IEEE Trans Image Process 30:3335–3347

  26. Tu Z, Lin C, Zhao W, Li C, Tang J (2021) M5L: multi-modal multi-margin metric learning for RGBT tracking. IEEE Trans Image Process 31:85–98

  27. Yu H, Li X, Fan C, Zou L, Wu Y (2023) MSDA: multi-scale domain adaptation dehazing network. Appl Intell 53(2):2147–2160

  28. Li X, Fan C, Zhao C, Zou L, Tian S (2022) NIRN: self-supervised noisy image reconstruction network for real-world image denoising. Appl Intell 1–18

  29. Li X, Yu H, Zhao C, Fan C, Zou L (2023) DADRNet: cross-domain image dehazing via domain adaptation and disentangled representation. Neurocomputing 126242

  30. Gehrig D, Rebecq H, Gallego G, Scaramuzza D (2020) EKLT: asynchronous photometric feature tracking using events and frames. Int J Comput Vision 128(3):601–618

  31. Huang J, Wang S, Guo M, Chen S (2018) Event-guided structured output tracking of fast-moving objects using a celex sensor. IEEE Trans Circuits Syst Video Technol 28(9):2413–2417

  32. Yang Z, Wu Y, Wang G, Yang Y, Li G, Deng L, Zhu J, Shi L (2019) DashNet: a hybrid artificial and spiking neural network for high-speed object tracking. Preprint at http://arxiv.org/abs/1909.12942

  33. Rebecq H, Horstschaefer T, Scaramuzza D (2017) Real-time visual-inertial odometry for event cameras using keyframe-based nonlinear optimization. In: Proceedings of the British Machine Vision Conference (BMVC). pp 16–1

  34. Maqueda AI, Loquercio A, Gallego G, García N, Scaramuzza D (2018) Event-based vision meets deep learning on steering prediction for self-driving cars. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp 5419–5427

  35. Zhu AZ, Yuan L (2018) EV-flownet: self-supervised optical flow estimation for event-based cameras. In: Robotics: Science and Systems. pp 1–9

  36. Benosman R, Clercq C, Lagorce X, Ieng S-H, Bartolozzi C (2013) Event-based visual flow. IEEE Trans Neural Netw Learn Syst 25(2):407–417

  37. Zhu AZ, Yuan L, Chaney K, Daniilidis K (2019) Unsupervised event-based learning of optical flow, depth, and egomotion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp 989–997

  38. Sironi A, Brambilla M, Bourdis N, Lagorce X, Benosman R (2018) Hats: histograms of averaged time surfaces for robust event-based object classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp 1731–1740

  39. Zhou T, Ruan S, Vera P, Canu S (2022) A tri-attention fusion guided multi-modal segmentation network. Pattern Recogn 124:108417

  40. Zhang H, Wang Y, Dayoub F, Sunderhauf N (2021) Varifocalnet: an iou-aware dense object detector. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp 8514–8523

  41. Rezatofighi H, Tsoi N, Gwak J, Sadeghian A, Reid I, Savarese S (2019) Generalized intersection over union: a metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp 658–666

  42. Danelljan M, Bhat G, Shahbaz Khan F, Felsberg M (2019) Atom: accurate tracking by overlap maximization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp 4660–4669

  43. Bhat G, Danelljan M, Van Gool L, Timofte R (2019) Learning discriminative model prediction for tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp 6182–6191

  44. Xu Y, Wang Z, Li Z, Yuan Y, Yu G (2020) SIAMFC++: towards robust and accurate visual tracking with target estimation guidelines. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp 12549–12556

  45. Yan B, Peng H, Fu J, Wang D, Lu H (2021) Learning spatio-temporal transformer for visual tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp 10448–10457

  46. Loshchilov I, Hutter F (2017) Decoupled weight decay regularization. Preprint at http://arxiv.org/abs/1711.05101

  47. Lagorce X, Orchard G, Galluppi F, Shi BE, Benosman RB (2016) Hots: a hierarchy of event-based time-surfaces for pattern recognition. IEEE Trans Pattern Anal Mach Intell 39(7):1346–1359

  48. Chen H, Suter D, Wu Q, Wang H (2020) End-to-end learning of object motion estimation from retinal events for event-based object tracking. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp 10534–10541

  49. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp 770–778

Funding

This work was funded by the Open and Innovation Fund of Hubei Three Gorges Laboratory (grant number SK215002).

Author information

Corresponding author

Correspondence to Lian Zou.

Ethics declarations

Conflict of interest

All authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Cite this article

Zeng, Z., Li, X., Fan, C. et al. SwinEFT: a robust and powerful Swin Transformer based Event Frame Tracker. Appl Intell 53, 23564–23581 (2023). https://doi.org/10.1007/s10489-023-04763-6
