
Unsupervised Deep Representation Learning for Real-Time Tracking

International Journal of Computer Vision

Abstract

Deep learning models have continuously advanced visual tracking, but they are typically trained with supervised learning on expensive labeled data. To reduce the manual annotation workload and learn to track arbitrary objects, we propose an unsupervised learning method for visual tracking. Our method is motivated by the observation that a robust tracker should be effective in bidirectional tracking: it should localize a target object forward through successive frames and backtrace it to its initial position in the first frame. Accordingly, during training we measure the consistency between forward and backward trajectories to learn a robust tracker from scratch using only unlabeled videos. We build our framework on a Siamese correlation filter network, and propose a multi-frame validation scheme and a cost-sensitive loss to facilitate unsupervised learning. Without bells and whistles, the proposed unsupervised tracker matches the baseline accuracy of classic fully supervised trackers while running at real-time speed. Furthermore, our unsupervised framework shows potential for leveraging more unlabeled or weakly labeled data to further improve tracking accuracy.
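The bidirectional idea above can be sketched as a cycle-consistency check: track the target forward through a clip, then track backward from the final prediction, and penalize the drift between the back-traced endpoint and the starting position. The minimal sketch below illustrates this on point trajectories; the paper's actual loss operates on correlation-filter response maps, so the function and variable names here are ours, for illustration only.

```python
import numpy as np

def consistency_loss(forward_traj, backward_traj):
    """Squared drift between the backward-traced endpoint and the
    target's starting position in the first frame. A perfectly
    consistent tracker retraces its steps and incurs zero loss."""
    start = np.asarray(forward_traj[0], dtype=float)   # position in frame 1
    end = np.asarray(backward_traj[-1], dtype=float)   # back-traced position
    return float(np.sum((end - start) ** 2))

# A drift-free tracker retraces its forward trajectory exactly:
fwd = [(10.0, 20.0), (12.0, 21.0), (15.0, 23.0)]   # frame 1 -> frame 3
bwd = fwd[::-1]                                     # frame 3 -> frame 1
assert consistency_loss(fwd, bwd) == 0.0

# A tracker that drifts by (1, 1) on the backward pass is penalized:
bwd_drift = [(x + 1.0, y + 1.0) for (x, y) in bwd]
assert abs(consistency_loss(fwd, bwd_drift) - 2.0) < 1e-9  # 1^2 + 1^2
```

Because the loss needs only the first-frame starting position, which is freely available (it is where tracking was initialized), the objective can be minimized on unlabeled videos.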




Notes

  1. In this paper, we do not distinguish between the terms unsupervised and self-supervised, as both refer to learning without ground-truth annotations.


Acknowledgements

This work was supported in part by NSFC under Contract No. 61836011 (to Dr. Houqiang Li), and in part by NSFC under Contracts No. 61822208 and 61632019 and the Youth Innovation Promotion Association CAS (No. 2018497) (to Dr. Wengang Zhou). Dr. Chao Ma was supported by NSFC under Contract No. 60906119 and the Shanghai Pujiang Program.

Author information

Corresponding authors

Correspondence to Wengang Zhou or Houqiang Li.

Additional information

Communicated by Mei Chen, Cha Zhang and Katsushi Ikeuchi.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Wang, N., Zhou, W., Song, Y. et al. Unsupervised Deep Representation Learning for Real-Time Tracking. Int J Comput Vis 129, 400–418 (2021). https://doi.org/10.1007/s11263-020-01357-4

