Hierarchical attentive Siamese network for real-time visual tracking

Yang, Kang; Song, Huihui; Zhang, Kaihua; Liu, Qingshan

doi:10.1007/s00521-019-04238-1

Extreme Learning Machine and Deep Learning Networks
Published: 21 May 2019

Volume 32, pages 14335–14346, (2020)

Neural Computing and Applications

Kang Yang¹, Huihui Song¹, Kaihua Zhang¹ (ORCID: orcid.org/0000-0002-1613-3401) & Qingshan Liu¹


Abstract

Visual tracking is a fundamental component of many computer vision tasks. Recently, Siamese networks trained off-line in an end-to-end manner have demonstrated great success in visual tracking, achieving high performance in both speed and accuracy. However, Siamese trackers usually represent the target using features from only the last convolutional layers, ignoring the fact that features from different layers have different representation capabilities, and this may degrade tracking performance in the presence of severe deformation and occlusion. In this paper, we present a novel hierarchical attentive Siamese (HASiam) network for high-performance visual tracking, which exploits different kinds of attention mechanisms to effectively fuse a series of attentional features from different layers. More specifically, we combine a deeper network with a shallow one to take full advantage of the features from different layers, and we apply spatial and channel-wise attention on different layers to better capture visual attention on multi-level semantic abstractions, which helps to enhance the discriminative capacity of the model. Furthermore, the top-layer feature maps have low resolution, which may affect localization accuracy if each feature is treated independently. To address this issue, a non-local attention module is also adopted on the top layer to force the network to pay more attention to the structural dependency of features at all locations during off-line training. The proposed HASiam is trained off-line in an end-to-end manner and requires no online updating of the network parameters during tracking. Extensive evaluations demonstrate that HASiam achieves favorable results, with AUC scores of \(64.6\%\) and \(62.8\%\) on OTB2013 and OTB100, respectively, and an EAO score of 0.227 on the VOT2017 real-time experiment, while running at 60 fps. With its high accuracy and real-time speed, our tracker can be applied to numerous vision applications such as visual surveillance systems, robotics and augmented reality.
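To make the three kinds of attention named in the abstract concrete, the following is a minimal NumPy sketch under stated assumptions. It is not the HASiam implementation: the abstract does not specify the exact form of any module, so the sketch uses a squeeze-and-excitation style channel gate, a simple sigmoid spatial gate, a parameter-free non-local block, and a plain cross-correlation of exemplar and search features as commonly done in Siamese trackers. All function names, the reduction ratio `r`, the feature sizes and the random weights are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def channel_attention(feat, w1, w2):
    """Assumed squeeze-and-excitation style gate. feat has shape (C, H, W)."""
    squeezed = feat.mean(axis=(1, 2))        # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)  # bottleneck + ReLU -> (C // r,)
    gate = sigmoid(w2 @ hidden)              # per-channel weight in (0, 1)
    return feat * gate[:, None, None]


def spatial_attention(feat):
    """Assumed spatial gate: sigmoid of the channel-averaged response."""
    gate = sigmoid(feat.mean(axis=0, keepdims=True))  # (1, H, W)
    return feat * gate


def non_local_attention(feat):
    """Parameter-free non-local block: each location aggregates all others."""
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)                         # (C, N)
    affinity = softmax(x.T @ x / np.sqrt(c), axis=-1)  # (N, N), rows sum to 1
    out = x @ affinity.T                               # out[:, i] = sum_j affinity[i, j] * x[:, j]
    return feat + out.reshape(c, h, w)                 # residual connection


def cross_correlate(exemplar, search):
    """Slide the exemplar features over the search features (valid mode)."""
    _, eh, ew = exemplar.shape
    _, sh, sw = search.shape
    response = np.empty((sh - eh + 1, sw - ew + 1))
    for i in range(response.shape[0]):
        for j in range(response.shape[1]):
            response[i, j] = np.sum(exemplar * search[:, i:i + eh, j:j + ew])
    return response


# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
C, r = 8, 2
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1


def embed(feat):
    # Both Siamese branches share the same attention modules and weights.
    return non_local_attention(spatial_attention(channel_attention(feat, w1, w2)))


exemplar = embed(rng.standard_normal((C, 6, 6)))
search = embed(rng.standard_normal((C, 22, 22)))
response = cross_correlate(exemplar, search)
peak = np.unravel_index(np.argmax(response), response.shape)
```

In this sketch the peak of `response` marks the offset in the search features where the exemplar matches best. Because nothing is updated after the weights are fixed, it also mirrors the abstract's point that tracking requires no online parameter updates.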

Acknowledgements

This work was supported in part by the Natural Science Foundation of China under Grant nos. 61872189 and 61876088, in part by the Natural Science Foundation of Jiangsu Province under Grant no. BK20170040, in part by the Six Talent Peaks Project in Jiangsu Province under Grant nos. XYDXX-015 and XYDXX-045, and in part by the Postgraduate Research and Practice Innovation Program of Jiangsu Province under Grant no. SJCX19_0311.

Author information

Authors and Affiliations

Jiangsu Key Laboratory of Big Data Analysis Technology (B-DAT), Nanjing University of Information Science and Technology, Nanjing, China
Kang Yang, Huihui Song, Kaihua Zhang & Qingshan Liu

Corresponding author

Correspondence to Kaihua Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Yang, K., Song, H., Zhang, K. et al. Hierarchical attentive Siamese network for real-time visual tracking. Neural Comput & Applic 32, 14335–14346 (2020). https://doi.org/10.1007/s00521-019-04238-1

Received: 17 December 2018
Accepted: 09 May 2019
Published: 21 May 2019
Issue Date: September 2020
DOI: https://doi.org/10.1007/s00521-019-04238-1
