
Relation-aware Siamese region proposal network for visual object tracking

Published in: Multimedia Tools and Applications

Abstract

The backbone networks used in Siamese trackers, such as AlexNet and VGGNet, are relatively shallow, yielding features that are insufficient for the tracking task. This paper therefore focuses on extracting more discriminative features to improve the performance of Siamese trackers. Through comprehensive experimental validation, this goal is achieved with a simple yet effective framework referred to as the relation-aware Siamese region proposal network (Ra-SiamRPN). First, the deep backbone network ResNet-50 is adopted to extract both low-level detail features and high-level semantic features of an image. We then propose a feature fusion module (FFM) that combines the low-level detail features with the high-level semantic features effectively. Furthermore, we propose a relation reasoning module (RRM) that performs global relation reasoning over multiple disjoint regions; the RRM generates discriminative information to enhance the features produced by ResNet-50. Extensive experiments are conducted on the OTB2015, VOT2016, VOT2018, UAV123, and LaSOT datasets. The results indicate that Ra-SiamRPN achieves performance competitive with current state-of-the-art algorithms while running in real time. Notably, on the large-scale LaSOT dataset, the success score and normalized precision score of Ra-SiamRPN are 0.495 and 0.576, respectively, outperforming the second-best tracker, MDNet, by 24.7% and 25.2%.
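The full definitions of the FFM and RRM are in the paywalled article body, so the following is only a minimal NumPy sketch of the general pipeline the abstract describes: upsampling a coarse high-level semantic map to the resolution of a low-level detail map, fusing the two, and computing a sliding-window cross-correlation response between a template and the fused search features (as in Siamese trackers). The function names `fuse_features` and `cross_correlate`, the equal channel counts, and the simple weighted-sum fusion are illustrative assumptions, not the paper's actual modules.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour upsampling of a (C, H, W) feature map by a factor of 2.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse_features(low, high, w_low=0.5, w_high=0.5):
    # Hypothetical fusion step: bring the coarse semantic map up to the
    # detail map's resolution, then take a weighted sum of the two maps.
    return w_low * low + w_high * upsample2x(high)

def cross_correlate(template, search):
    # Sliding-window cross-correlation between a (C, k, k) template and a
    # (C, H, W) search feature map, producing a (H-k+1, W-k+1) response map.
    c, k, _ = template.shape
    _, H, W = search.shape
    out = np.empty((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(template * search[:, i:i + k, j:j + k])
    return out

rng = np.random.default_rng(0)
low = rng.standard_normal((8, 16, 16))   # low-level detail features
high = rng.standard_normal((8, 8, 8))    # high-level semantic features
fused = fuse_features(low, high)
resp = cross_correlate(rng.standard_normal((8, 4, 4)), fused)
print(fused.shape, resp.shape)           # (8, 16, 16) (13, 13)
```

The peak of the response map indicates the most likely target location; a real tracker would feed these features through learned convolutional heads (and, here, the RRM) rather than correlate raw backbone activations.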




Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Grant nos. 61971421 and 62071470).

Author information


Corresponding author

Correspondence to Shibin Zhou.



About this article


Cite this article

Zhu, J., Zhang, G., Zhou, S. et al. Relation-aware Siamese region proposal network for visual object tracking. Multimed Tools Appl 80, 15469–15485 (2021). https://doi.org/10.1007/s11042-021-10574-z

