
Repformer: a robust shared-encoder dual-pipeline transformer for visual tracking

Original Article · Neural Computing and Applications

Abstract

Siamese-based trackers have achieved outstanding tracking performance. However, in complex scenarios these trackers struggle to adequately integrate valuable target feature information, which results in poor tracking performance. In this paper, a novel shared-encoder dual-pipeline Transformer architecture is proposed to achieve robust visual tracking. The proposed method integrates several main components built on a hybrid attention mechanism: a shared encoder, functionally complementary feature enhancement pipelines, and a pipeline feature fusion head. The shared encoder processes template features and provides useful target feature information to the feature enhancement pipelines. The feature enhancement pipelines are responsible for enhancing feature information, establishing feature dependencies between the template and the search region, and fully exploiting global information. To further correlate the global information, the pipeline feature fusion head integrates the feature information from the two feature enhancement pipelines. Finally, we propose a robust Siamese-based Repformer tracker, which incorporates a concise tracking prediction network to obtain efficient tracking representations. Experiments show that our tracking method surpasses numerous state-of-the-art trackers on multiple tracking benchmarks, with a running speed of 57.3 fps.
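The data flow described above can be summarized in a short sketch. The following is a minimal, hypothetical PyTorch rendering of the stated pipeline (shared encoder → two complementary feature enhancement pipelines → fusion head → prediction network); all module names, dimensions, and the use of nn.MultiheadAttention are illustrative assumptions, not the authors' released implementation or the exact hybrid attention mechanism of Repformer.

```python
# Hypothetical sketch of the shared-encoder dual-pipeline design described in
# the abstract. Module names, feature dimensions, and attention choices are
# assumptions made for illustration only.
import torch
import torch.nn as nn


class SharedEncoder(nn.Module):
    """Self-attention encoder that refines template features once and shares
    them with both enhancement pipelines."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, template):              # template: (B, N_t, C)
        out, _ = self.attn(template, template, template)
        return self.norm(template + out)


class EnhancementPipeline(nn.Module):
    """Cross-attention pipeline that propagates template information into the
    search-region features and enhances them with a feed-forward block."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, search, template):      # search: (B, N_s, C)
        out, _ = self.cross(search, template, template)
        x = self.norm1(search + out)
        return self.norm2(x + self.ffn(x))


class RepformerSketch(nn.Module):
    """Shared encoder feeds two complementary pipelines; a fusion head merges
    their outputs before a simple prediction head."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.encoder = SharedEncoder(dim, heads)
        self.pipeline_a = EnhancementPipeline(dim, heads)
        self.pipeline_b = EnhancementPipeline(dim, heads)
        self.fusion = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 4)          # e.g. per-token box regression

    def forward(self, template, search):
        t = self.encoder(template)
        a = self.pipeline_a(search, t)
        b = self.pipeline_b(search, t)
        fused, _ = self.fusion(a, b, b)        # correlate the two pipelines
        return self.head(fused)


if __name__ == "__main__":
    z = torch.randn(2, 64, 256)                # template tokens
    x = torch.randn(2, 256, 256)               # search-region tokens
    print(RepformerSketch()(z, x).shape)       # torch.Size([2, 256, 4])
```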


Data availability

All data generated or analysed during this study are included in this published article.


Acknowledgements

We are very grateful to the editors and anonymous reviewers for their constructive comments and suggestions, which helped improve our manuscript. This work was supported by the Natural Science Foundation of Heilongjiang Province of China under Grant No. F201123, the National Natural Science Foundation of China under Grants 52171332 and 52075530, the Green Intelligent Inland Ship Innovation Programme under Grant MC-202002-C01, and the Development Project of Ship Situational Intelligent Awareness System under Grant MC-201920-X01.

Author information


Contributions

Conceptualization: FG, JL, CC, QZ, and ZJ; Methodology: FG and JL; Formal analysis and investigation: FG, JL, CC, QZ, and ZJ; Writing—original draft preparation: FG; Writing—review and editing: FG, JL, CC, QZ, and ZJ; Funding acquisition: JL, CC, QZ, and ZJ; Resources: JL, CC, QZ, and ZJ; Supervision: JL and ZJ.

Corresponding authors

Correspondence to Jun Lu or Zhaojie Ju.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Gu, F., Lu, J., Cai, C. et al. Repformer: a robust shared-encoder dual-pipeline transformer for visual tracking. Neural Comput & Applic 35, 20581–20603 (2023). https://doi.org/10.1007/s00521-023-08824-2
