
Unsupervised Deep Representation Learning for Real-Time Tracking

International Journal of Computer Vision

Abstract

Deep learning models have continuously advanced visual tracking, but they are typically trained with supervised learning on expensive labeled data. To reduce the manual annotation workload and learn to track arbitrary objects, we propose an unsupervised learning method for visual tracking. Our method is motivated by the observation that a robust tracker should be effective in bidirectional tracking: it should localize a target object forward through successive frames and backtrace it to its initial position in the first frame. Accordingly, during training we measure the consistency between forward and backward trajectories to learn a robust tracker from scratch using only unlabeled videos. We build our framework on a Siamese correlation filter network, and propose a multi-frame validation scheme and a cost-sensitive loss to facilitate unsupervised learning. Without bells and whistles, the proposed unsupervised tracker matches the baseline accuracy of classic fully supervised trackers while running at real-time speed. Furthermore, our unsupervised framework shows potential for leveraging more unlabeled or weakly labeled data to further improve tracking accuracy.
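The bidirectional idea above can be sketched as a cycle-consistency check: track the target forward through a clip, then track backward from the final prediction, and penalize the drift between the back-traced endpoint and the starting position. The minimal sketch below illustrates this on point trajectories; the paper's actual loss operates on correlation-filter response maps, so the function and variable names here are ours, for illustration only.

```python
import numpy as np

def consistency_loss(forward_traj, backward_traj):
    """Squared drift between the backward-traced endpoint and the
    target's starting position in the first frame. A perfectly
    consistent tracker retraces its steps and incurs zero loss."""
    start = np.asarray(forward_traj[0], dtype=float)   # position in frame 1
    end = np.asarray(backward_traj[-1], dtype=float)   # back-traced position
    return float(np.sum((end - start) ** 2))

# A drift-free tracker retraces its forward trajectory exactly:
fwd = [(10.0, 20.0), (12.0, 21.0), (15.0, 23.0)]   # frame 1 -> frame 3
bwd = fwd[::-1]                                     # frame 3 -> frame 1
assert consistency_loss(fwd, bwd) == 0.0

# A tracker that drifts by (1, 1) on the backward pass is penalized:
bwd_drift = [(x + 1.0, y + 1.0) for (x, y) in bwd]
assert abs(consistency_loss(fwd, bwd_drift) - 2.0) < 1e-9  # 1^2 + 1^2
```

Because the loss needs only the first-frame starting position, which is freely available (it is where tracking was initialized), the objective can be minimized on unlabeled videos.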




Notes

  1. In this paper, we do not distinguish between the terms unsupervised and self-supervised, as both refer to learning without ground-truth annotations.


Acknowledgements

This work was supported in part by NSFC under Contract No. 61836011 (to Dr. Houqiang Li), and in part by NSFC under Contracts No. 61822208 and 61632019 and the Youth Innovation Promotion Association CAS (No. 2018497) (to Dr. Wengang Zhou). Dr. Chao Ma was supported by NSFC under Contract No. 60906119 and the Shanghai Pujiang Program.

Author information

Corresponding authors

Correspondence to Wengang Zhou or Houqiang Li.

Additional information

Communicated by Mei Chen, Cha Zhang and Katsushi Ikeuchi.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Wang, N., Zhou, W., Song, Y. et al. Unsupervised Deep Representation Learning for Real-Time Tracking. Int J Comput Vis 129, 400–418 (2021). https://doi.org/10.1007/s11263-020-01357-4

