
Towards Sequence-Level Training for Visual Tracking

  • Conference paper

Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13682)


Abstract

Despite the extensive adoption of machine learning for visual object tracking, recent learning-based approaches have largely overlooked that visual tracking is by nature a sequence-level task; they rely heavily on frame-level training, which inevitably induces inconsistency between training and testing in terms of both data distributions and task objectives. This work introduces a sequence-level training strategy for visual tracking based on reinforcement learning and discusses how a sequence-level design of data sampling, learning objectives, and data augmentation can improve the accuracy and robustness of tracking algorithms. Our experiments on standard benchmarks including LaSOT, TrackingNet, and GOT-10k demonstrate that four representative tracking models, SiamRPN++, SiamAttn, TransT, and TrDiMP, consistently improve when trained with the proposed methods, without any architectural modification.
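A sequence-level objective of this kind is typically optimized with REINFORCE-style policy gradients, rewarding a whole predicted trajectory rather than individual frames. The following Python sketch illustrates the general idea with a mean-IoU sequence reward and a self-critical baseline in the spirit of Rennie et al. (ref. 28); the function names and box format are illustrative assumptions, not the authors' actual implementation.

```python
# Hedged sketch of sequence-level reward computation for tracking.
# Box format (x1, y1, x2, y2) and all names are hypothetical choices
# for illustration, not taken from the paper's code.

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def sequence_reward(pred_boxes, gt_boxes):
    """Sequence-level reward: mean IoU over the entire trajectory,
    unlike frame-level losses that score each frame independently."""
    assert len(pred_boxes) == len(gt_boxes) and gt_boxes
    return sum(iou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / len(gt_boxes)

def reinforce_advantage(sampled_boxes, greedy_boxes, gt_boxes):
    """Self-critical baseline: reward of a stochastically sampled trajectory
    minus the reward of the greedy (argmax) trajectory. A positive value
    means the sampled actions should be reinforced."""
    return (sequence_reward(sampled_boxes, gt_boxes)
            - sequence_reward(greedy_boxes, gt_boxes))
```

In a full training loop, the advantage would weight the negative sum of log-probabilities of the sampled per-frame actions, so gradients flow through the whole sequence prediction rather than one frame at a time.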

M. Kim and S. Lee contributed equally to this work.


Notes

  1. Both trackers A and B adopt SiamRPN++ as their network architecture, but the backbone of tracker A is frozen in the early training stages.

References

  1. Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., Torr, P.H.S.: Fully-convolutional Siamese networks for object tracking. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9914, pp. 850–865. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-48881-3_56

  2. Bhat, G., Danelljan, M., Gool, L.V., Timofte, R.: Learning discriminative model prediction for tracking. In: ICCV (2019)

  3. Chen, B., Wang, D., Li, P., Wang, S., Lu, H.: Real-time ‘Actor-Critic’ tracking. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 328–345. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_20

  4. Chen, X., Yan, B., Zhu, J., Wang, D., Yang, X., Lu, H.: Transformer tracking. In: CVPR (2021)

  5. Chen, Z., Zhong, B., Li, G., Zhang, S., Ji, R.: Siamese box adaptive network for visual tracking. In: CVPR (2020)

  6. Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In: ICCV (2017)

  7. Danelljan, M., Bhat, G., Khan, F.S., Felsberg, M.: ATOM: accurate tracking by overlap maximization. In: CVPR (2019)

  8. Danelljan, M., Gool, L.V., Timofte, R.: Probabilistic regression for visual tracking. In: CVPR (2020)

  9. Dong, X., Shen, J., Wang, W., Liu, Y., Shao, L., Porikli, F.: Hyperparameter optimization for tracking with continuous deep Q-learning. In: CVPR (2018)

  10. Fan, H., et al.: LaSOT: a high-quality benchmark for large-scale single object tracking. In: CVPR (2019)

  11. Girshick, R.: Fast R-CNN. In: ICCV (2015)

  12. Henschel, R., Zou, Y., Rosenhahn, B.: Multiple people tracking using body and joint detections. In: CVPR (2019)

  13. Hester, T., et al.: Deep Q-learning from demonstrations. In: AAAI (2018)

  14. Hu, H.N., et al.: Joint monocular 3D vehicle detection and tracking. In: ICCV (2019)

  15. Huang, C., Lucey, S., Ramanan, D.: Learning policies for adaptive tracking with deep feature cascades. In: ICCV (2017)

  16. Huang, L., Zhao, X., Huang, K.: GOT-10k: a large high-diversity benchmark for generic object tracking in the wild. TPAMI 43, 1562–1577 (2019)

  17. Supancic III, J.S., Ramanan, D.: Tracking as online decision-making: learning a policy from streaming videos with reinforcement learning. In: ICCV (2017)

  18. Javed, S., Danelljan, M., Khan, F.S., Khan, M.H., Felsberg, M., Matas, J.: Visual object tracking with discriminative filters and Siamese networks: a survey and outlook. arXiv preprint arXiv:2112.02838 (2021)

  19. Jung, I., Son, J., Baek, M., Han, B.: Real-time MDNet. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11208, pp. 89–104. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01225-0_6

  20. Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., Yan, J.: SiamRPN++: evolution of Siamese visual tracking with very deep networks. In: CVPR (2019)

  21. Li, B., Yan, J., Wu, W., Zhu, Z., Hu, X.: High performance visual tracking with Siamese region proposal network. In: CVPR (2018)

  22. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

  23. Marvasti-Zadeh, S.M., Cheng, L., Ghanei-Yakhdan, H., Kasaei, S.: Deep learning for visual tracking: a comprehensive survey. IEEE Transactions on Intelligent Transportation Systems (2021)

  24. Müller, M., Bibi, A., Giancola, S., Alsubaihi, S., Ghanem, B.: TrackingNet: a large-scale dataset and benchmark for object tracking in the wild. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11205, pp. 310–327. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01246-5_19

  25. Nam, H., Han, B.: Learning multi-domain convolutional neural networks for visual tracking. In: CVPR (2016)

  26. Real, E., Shlens, J., Mazzocchi, S., Pan, X., Vanhoucke, V.: YouTube-BoundingBoxes: a large high-precision human-annotated data set for object detection in video. In: CVPR (2017)

  27. Ren, L., Yuan, X., Lu, J., Yang, M., Zhou, J.: Deep reinforcement learning with iterative shift for visual tracking. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11213, pp. 697–713. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01240-3_42

  28. Rennie, S.J., Marcheret, E., Mroueh, Y., Ross, J., Goel, V.: Self-critical sequence training for image captioning. In: CVPR (2017)

  29. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. IJCV 115, 211–252 (2015)

  30. Smeulders, A.W., Chu, D.M., Cucchiara, R., Calderara, S., Dehghan, A., Shah, M.: Visual tracking: an experimental survey. TPAMI 36, 1444–1468 (2013)

  31. Wang, G., Luo, C., Sun, X., Xiong, Z., Zeng, W.: Tracking by instance detection: a meta-learning approach. In: CVPR (2020)

  32. Wang, N., Zhou, W., Wang, J., Li, H.: Transformer meets tracker: exploiting temporal context for robust visual tracking. In: CVPR (2021)

  33. Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8, 229–256 (1992)

  34. Wu, Y., Lim, J., Yang, M.H.: Online object tracking: a benchmark. In: CVPR (2013)

  35. Xu, N., et al.: YouTube-VOS: sequence-to-sequence video object segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11209, pp. 603–619. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01228-1_36

  36. Xu, Y., Wang, Z., Li, Z., Yuan, Y., Yu, G.: SiamFC++: towards robust and accurate visual tracking with target estimation guidelines. In: CVPR (2020)

  37. Yan, B., Peng, H., Fu, J., Wang, D., Lu, H.: Learning spatio-temporal transformer for visual tracking. In: ICCV (2021)

  38. Yan, B., Zhang, X., Wang, D., Lu, H., Yang, X.: Alpha-Refine: boosting tracking performance by precise bounding box estimation. In: CVPR (2021)

  39. Yu, Y., Xiong, Y., Huang, W., Scott, M.R.: Deformable Siamese attention networks for visual object tracking. In: CVPR (2020)

  40. Yun, S., Choi, J., Yoo, Y., Yun, K., Young Choi, J.: Action-decision networks for visual tracking with deep reinforcement learning. In: CVPR (2017)

  41. Zhang, D., Zheng, Z., Jia, R., Li, M.: Visual tracking via hierarchical deep reinforcement learning. In: AAAI (2021)

  42. Zhang, W., et al.: Online decision based visual tracking via reinforcement learning. In: NeurIPS (2020)

  43. Zhang, Z., Peng, H., Fu, J., Li, B., Hu, W.: Ocean: object-aware anchor-free tracking. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12366, pp. 771–787. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58589-1_46

  44. Zhang, Z., Gonzalez-Garcia, A., van de Weijer, J., Danelljan, M., Khan, F.S.: Learning the model update for Siamese trackers. In: ICCV (2019)

  45. Zhu, Z., et al.: Distractor-aware Siamese networks for visual object tracking. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11213, pp. 103–119. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01240-3_7


Acknowledgement

This work was supported by Samsung Advanced Institute of Technology (Neural Processing Research Center), the NRF grants (No. 2021M3E5D2A01023887, No. 2022R1A2C3012210) and the IITP grants (No. 2021-0-01343, No. 2022-0-00959) funded by the Korea government (MSIT).

Author information


Corresponding author

Correspondence to Minsu Cho.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 4025 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Kim, M., Lee, S., Ok, J., Han, B., Cho, M. (2022). Towards Sequence-Level Training for Visual Tracking. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13682. Springer, Cham. https://doi.org/10.1007/978-3-031-20047-2_31

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-20047-2_31


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20046-5

  • Online ISBN: 978-3-031-20047-2

  • eBook Packages: Computer Science; Computer Science (R0)
