
Hierarchical attentive Siamese network for real-time visual tracking

  • Extreme Learning Machine and Deep Learning Networks
  • Published in Neural Computing and Applications

Abstract

Visual tracking is a fundamental and highly useful component in many computer vision tasks. Recently, Siamese networks trained off-line in an end-to-end manner have achieved great success in visual tracking, with strong performance in both speed and accuracy. However, Siamese trackers usually represent the target with visual features from only the last convolutional layer, ignoring the fact that features from different layers characterize the target with different representational capabilities; this can degrade tracking performance under severe deformation and occlusion. In this paper, we present a novel hierarchical attentive Siamese (HASiam) network for high-performance visual tracking, which exploits several kinds of attention mechanisms to effectively fuse a series of attentional features from different layers. More specifically, we combine a deeper network with a shallow one to take full advantage of the features from different layers, and apply spatial and channel-wise attention on different layers to better capture visual attention at multiple levels of semantic abstraction, which helps enhance the discriminative capacity of the model. Furthermore, the top-layer feature maps have low resolution, which may hurt localization accuracy if each feature is treated independently. To address this issue, a non-local attention module is also applied to the top layer, forcing the network to attend to the structural dependencies among features at all locations during off-line training. The proposed HASiam is trained off-line in an end-to-end manner and requires no online updating of the network parameters during tracking. Extensive evaluations show that HASiam achieves favorable results, with AUC scores of 64.6% and 62.8% on OTB2013 and OTB100, respectively, and an EAO score of 0.227 on the VOT2017 real-time experiment, while running at 60 fps. With its high accuracy and real-time speed, our tracker can be applied to numerous vision applications such as visual surveillance, robotics and augmented reality.
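To make the abstract's architecture more concrete, below is a minimal, illustrative PyTorch sketch of the three attention mechanisms it names: channel-wise attention (squeeze-and-excitation style), spatial attention, and a non-local block, followed by the cross-correlation step Siamese trackers use to produce a response map. This is a reconstruction under assumed shapes and hyperparameters (module names, channel counts, reduction ratio, kernel size), not the authors' implementation; in the paper, spatial and channel attention are applied on multiple layers and the non-local block only on the top layer, whereas here they are chained on a single feature map purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """Channel-wise attention: re-weight channels (squeeze-and-excitation style).
    The reduction ratio of 16 is an assumed default, not taken from the paper."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))    # global average pool -> (b, c)
        return x * w.view(b, c, 1, 1)      # broadcast channel weights


class SpatialAttention(nn.Module):
    """Spatial attention: re-weight feature-map locations."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)   # (b, 1, h, w) mean over channels
        mx, _ = x.max(dim=1, keepdim=True)  # (b, 1, h, w) max over channels
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w


class NonLocalBlock(nn.Module):
    """Non-local attention: every position attends to every other position,
    capturing the structural dependencies the abstract describes."""

    def __init__(self, channels):
        super().__init__()
        inter = channels // 2
        self.theta = nn.Conv2d(channels, inter, 1)
        self.phi = nn.Conv2d(channels, inter, 1)
        self.g = nn.Conv2d(channels, inter, 1)
        self.out = nn.Conv2d(inter, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)  # (b, hw, c')
        k = self.phi(x).flatten(2)                    # (b, c', hw)
        v = self.g(x).flatten(2).transpose(1, 2)      # (b, hw, c')
        attn = F.softmax(q @ k, dim=-1)               # (b, hw, hw) affinities
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                        # residual connection


def response_map(z_feat, x_feat):
    """Slide exemplar features z over search features x; PyTorch's conv2d is
    a cross-correlation, so this yields a similarity score map directly."""
    return F.conv2d(x_feat, z_feat)


# Toy usage with assumed feature sizes: both branches share the same weights,
# as in a Siamese tracker, so one attention stack serves both inputs.
attend = nn.Sequential(ChannelAttention(256), SpatialAttention(), NonLocalBlock(256))
z = attend(torch.randn(1, 256, 6, 6))     # exemplar (target template) branch
x = attend(torch.randn(1, 256, 22, 22))   # search-region branch
print(response_map(z, x).shape)           # torch.Size([1, 1, 17, 17])
```

The peak of the response map indicates the most likely target location in the search region; because everything above is a fixed feed-forward pass, such a tracker needs no online parameter updates, consistent with the abstract's claim of real-time speed.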

Acknowledgements

This work was supported in part by the Natural Science Foundation of China under Grant nos. 61872189, 61876088, in part by the Natural Science Foundation of Jiangsu Province under Grant no. BK20170040, in part by Six Talent Peaks Project in Jiangsu Province under Grant nos. XYDXX-015, XYDXX-045, and in part by the Postgraduate Research and Practice Innovation Program of Jiangsu Province under Grant SJCX19_0311.

Author information

Corresponding author

Correspondence to Kaihua Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Yang, K., Song, H., Zhang, K. et al. Hierarchical attentive Siamese network for real-time visual tracking. Neural Comput & Applic 32, 14335–14346 (2020). https://doi.org/10.1007/s00521-019-04238-1
