Abstract
The currently popular two-stream, two-stage tracking framework extracts the template and search region features separately and then performs relation modeling; the extracted features therefore lack awareness of the target and offer limited target-background discriminability. To tackle this issue, we propose a novel one-stream tracking (OSTrack) framework that unifies feature learning and relation modeling by bridging the template-search image pairs with bidirectional information flows. In this way, discriminative target-oriented features can be dynamically extracted through mutual guidance. Since no extra heavy relation modeling module is needed and the implementation is highly parallelized, the proposed tracker runs at a fast speed. To further improve inference efficiency, we propose an in-network candidate early elimination module based on the strong similarity prior computed within the one-stream framework. As a unified framework, OSTrack achieves state-of-the-art performance on multiple benchmarks; in particular, it shows impressive results on the one-shot tracking benchmark GOT-10k, i.e., achieving 73.7% AO and improving the previous best result (SwinTrack) by 4.3%. Besides, our method maintains a good performance-speed trade-off and converges faster. The code and models are available at https://github.com/botaoye/OSTrack.
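To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch, not the authors' released implementation: (1) joint feature learning and relation modeling, realized by running self-attention over the concatenated template and search tokens so information flows in both directions within a single stream, and (2) early elimination of search-region candidate tokens that receive little attention from the template tokens. The module name `JointAttentionBlock`, the toy dimensions, the scoring rule, and the keep ratio are all illustrative assumptions; see the linked repository for the actual method.

```python
import torch
import torch.nn as nn


class JointAttentionBlock(nn.Module):
    """One Transformer encoder block applied to [template; search] tokens.

    A sketch of one-stream tracking: feature extraction and relation
    modeling happen in the same self-attention operation.
    """

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x, n_template, keep_ratio=0.7):
        # Self-attention over template + search tokens: bidirectional
        # information flow makes search features target-aware while the
        # template features adapt to the current search region.
        h = self.norm1(x)
        out, attn = self.attn(h, h, h, need_weights=True,
                              average_attn_weights=True)
        x = x + out
        x = x + self.mlp(self.norm2(x))

        # Candidate early elimination (illustrative scoring rule): rank each
        # search token by the average attention the template tokens pay to
        # it, then keep only the top fraction for later layers.
        tmpl, search = x[:, :n_template], x[:, n_template:]
        scores = attn[:, :n_template, n_template:].mean(dim=1)  # (B, N_search)
        n_keep = max(1, int(keep_ratio * search.size(1)))
        idx = scores.topk(n_keep, dim=1).indices                # (B, n_keep)
        search = search.gather(
            1, idx.unsqueeze(-1).expand(-1, -1, search.size(-1))
        )
        return torch.cat([tmpl, search], dim=1)


# Toy usage: 64 template tokens + 256 search tokens, embedding dim 256.
x = torch.randn(2, 64 + 256, 256)
block = JointAttentionBlock()
y = block(x, n_template=64)
print(y.shape)  # torch.Size([2, 243, 256]): 256 search tokens pruned to 179
```

Because pruning happens inside the network rather than as a post-processing step, later blocks attend over fewer tokens, which is where the inference speed-up comes from.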
References
Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., Torr, P.H.S.: Fully-convolutional Siamese networks for object tracking. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9914, pp. 850–865. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-48881-3_56
Bhat, G., Danelljan, M., Van Gool, L., Timofte, R.: Learning discriminative model prediction for tracking. In: ICCV, pp. 6182–6191 (2019)
Bolme, D.S., Beveridge, J.R., Draper, B.A., Lui, Y.M.: Visual object tracking using adaptive correlation filters. In: CVPR, pp. 2544–2550 (2010)
Chen, X., Yan, B., Zhu, J., Wang, D., Yang, X., Lu, H.: Transformer tracking. In: CVPR, pp. 8126–8135 (2021)
Dai, K., Zhang, Y., Wang, D., Li, J., Lu, H., Yang, X.: High-performance long-term tracking with meta-updater. In: CVPR, pp. 6298–6307 (2020)
Danelljan, M., Bhat, G., Khan, F.S., Felsberg, M.: ATOM: accurate tracking by overlap maximization. In: CVPR, pp. 4660–4669 (2019)
Danelljan, M., Bhat, G., Khan, F.S., Felsberg, M.: ECO: efficient convolution operators for tracking. In: CVPR, pp. 6638–6646 (2017)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR, pp. 248–255 (2009)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL, pp. 4171–4186 (2019)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: ICLR (2021)
Fan, H., et al.: LaSOT: a high-quality large-scale single object tracking benchmark. IJCV 129(2), 439–461 (2021)
Fan, H., et al.: LaSOT: a high-quality benchmark for large-scale single object tracking. In: CVPR, pp. 5374–5383 (2019)
Fu, Z., Liu, Q., Fu, Z., Wang, Y.: STMTrack: template-free visual tracking with space-time memory networks. In: CVPR, pp. 13774–13783 (2021)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377 (2021)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
Henriques, J.F., Caseiro, R., Martins, P., Batista, J.: High-speed tracking with kernelized correlation filters. TPAMI 37(3), 583–596 (2014)
Huang, L., Zhao, X., Huang, K.: GOT-10k: a large high-diversity benchmark for generic object tracking in the wild. TPAMI 43(5), 1562–1577 (2019)
Jung, I., Son, J., Baek, M., Han, B.: Real-time MDNet. In: ECCV, pp. 83–98 (2018)
Kiani Galoogahi, H., Fagg, A., Huang, C., Ramanan, D., Lucey, S.: Need for speed: a benchmark for higher frame rate object tracking. In: ICCV, pp. 1125–1134 (2017)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NeurIPS, vol. 25 (2012)
Law, H., Deng, J.: CornerNet: detecting objects as paired keypoints. In: ECCV, pp. 734–750 (2018)
Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., Yan, J.: SiamRPN++: evolution of Siamese visual tracking with very deep networks. In: CVPR, pp. 4282–4291 (2019)
Li, B., Yan, J., Wu, W., Zhu, Z., Hu, X.: High performance visual tracking with Siamese region proposal network. In: CVPR, pp. 8971–8980 (2018)
Liang, Y., Ge, C., Tong, Z., Song, Y., Wang, J., Xie, P.: EViT: expediting vision transformers via token reorganizations. In: ICLR (2022)
Lin, L., Fan, H., Xu, Y., Ling, H.: SwinTrack: a simple and strong baseline for transformer tracking. arXiv preprint arXiv:2112.00995 (2021)
Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: ECCV, pp. 740–755 (2014)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: ICCV, pp. 10012–10022 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
Mayer, C., Danelljan, M., Paudel, D.P., Van Gool, L.: Learning target candidate association to keep track of what not to track. In: ICCV, pp. 13444–13454 (2021)
Mueller, M., Smith, N., Ghanem, B.: A benchmark and simulator for UAV tracking. In: ECCV, pp. 445–461 (2016)
Muller, M., Bibi, A., Giancola, S., Alsubaihi, S., Ghanem, B.: TrackingNet: a large-scale dataset and benchmark for object tracking in the wild. In: ECCV, pp. 300–317 (2018)
Nam, H., Han, B.: Learning multi-domain convolutional neural networks for visual tracking. In: CVPR, pp. 4293–4302 (2016)
Rao, Y., Zhao, W., Liu, B., Lu, J., Zhou, J., Hsieh, C.J.: DynamicViT: efficient vision transformers with dynamic token sparsification. In: NeurIPS, vol. 34 (2021)
Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Generalized intersection over union: a metric and a loss for bounding box regression. In: CVPR, pp. 658–666 (2019)
Ridnik, T., Ben-Baruch, E., Noy, A., Zelnik-Manor, L.: ImageNet-21k pretraining for the masses. arXiv preprint arXiv:2104.10972 (2021)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML, pp. 10347–10357 (2021)
Vaswani, A., et al.: Attention is all you need. In: NeurIPS, pp. 5998–6008 (2017)
Voigtlaender, P., Luiten, J., Torr, P.H., Leibe, B.: Siam R-CNN: visual tracking by re-detection. In: CVPR, pp. 6578–6588 (2020)
Wang, G., Luo, C., Sun, X., Xiong, Z., Zeng, W.: Tracking by instance detection: a meta-learning approach. In: CVPR, pp. 6288–6297 (2020)
Wang, N., Zhou, W., Wang, J., Li, H.: Transformer meets tracker: exploiting temporal context for robust visual tracking. In: CVPR, pp. 1571–1580 (2021)
Wang, X., et al.: Towards more flexible and accurate object tracking with natural language: algorithms and benchmark. In: CVPR, pp. 13763–13773 (2021)
Yan, B., Peng, H., Fu, J., Wang, D., Lu, H.: Learning spatio-temporal transformer for visual tracking. In: ICCV, pp. 10448–10457 (2021)
Yin, H., Vahdat, A., Alvarez, J., Mallya, A., Kautz, J., Molchanov, P.: AdaViT: adaptive tokens for efficient vision transformer. arXiv preprint arXiv:2112.07658 (2021)
Zhang, Z., Liu, Y., Wang, X., Li, B., Hu, W.: Learn to match: automatic matching network design for visual tracking. In: ICCV, pp. 13339–13348 (2021)
Zhang, Z., Peng, H., Fu, J., Li, B., Hu, W.: Ocean: object-aware anchor-free tracking. In: ECCV, pp. 771–787 (2020)
Acknowledgment
This work is partially supported by the Natural Science Foundation of China (NSFC) under Grants 61976203 and 61876171. We thank Zhipeng Zhang for his helpful suggestions.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ye, B., Chang, H., Ma, B., Shan, S., Chen, X. (2022). Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13682. Springer, Cham. https://doi.org/10.1007/978-3-031-20047-2_20
DOI: https://doi.org/10.1007/978-3-031-20047-2_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20046-5
Online ISBN: 978-3-031-20047-2
eBook Packages: Computer Science, Computer Science (R0)