
Learning Adaptive Spatio-Temporal Inference Transformer for Coarse-to-Fine Animal Visual Tracking: Algorithm and Benchmark

Published in: International Journal of Computer Vision (2024)

Abstract

Advanced general visual object tracking models have developed rapidly, driven by access to large annotated datasets and progressively stronger network architectures. However, a general tracker often suffers from domain shift when applied directly to a specific testing scenario. In this paper, we address the animal tracking problem by proposing a spatio-temporal inference module and a coarse-to-fine tracking strategy. Non-rigid deformation is a typical challenge in tracking animals, so we design a novel transformer-based inference structure in which the changing animal state is transmitted across consecutive frames. By explicitly transmitting appearance variations, this spatio-temporal module enables adaptive target learning and boosts animal tracking performance compared with fixed template-matching approaches. Moreover, considering the altered contours of animals across frames, we perform coarse-to-fine tracking to obtain a fine-grained animal bounding box with a dedicated distribution-aware regression module: the coarse tracking phase distinguishes the target from potential distractors in the background, while the fine-grained tracking phase accurately regresses the final animal bounding box. To facilitate animal tracking evaluation, we captured and annotated 145 video sequences spanning 20 categories at the zoo, forming a new test set for animal tracking coined ZOO145. We also collected AnimalSOT, a dataset of 162 video sequences drawn from existing tracking test benchmarks. Experimental results on the animal tracking datasets MoCA, ZOO145, and AnimalSOT demonstrate the merit of the proposed approach against advanced general tracking approaches, providing a baseline for future animal tracking studies.
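The core mechanism described above, a transformer that transmits a changing target state across consecutive frames, can be illustrated with a short sketch. The module below is a minimal, assumption-laden illustration of that idea rather than the authors' implementation; all names, dimensions, and the single-token state design are assumptions. A target-state token queries the current frame's search-region features through cross-attention and is updated residually, so the effective template adapts as the animal deforms.

```python
# Minimal sketch of spatio-temporal state transmission (illustrative only,
# not the paper's code). A target-state token is carried across frames and
# refined by cross-attention against each frame's features.
import torch
import torch.nn as nn


class SpatioTemporalInference(nn.Module):
    """Propagates and refines a target-state embedding frame by frame."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Cross-attention: the target state queries the frame features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, state: torch.Tensor, frame_feats: torch.Tensor) -> torch.Tensor:
        # state:       (B, 1, C)  target embedding carried from frame t-1
        # frame_feats: (B, HW, C) flattened search-region features at frame t
        attended, _ = self.cross_attn(state, frame_feats, frame_feats)
        state = self.norm1(state + attended)          # residual state update
        return self.norm2(state + self.mlp(state))    # feed-forward refinement


# Usage: carry the state through a clip, one frame at a time.
module = SpatioTemporalInference(dim=256)
state = torch.randn(2, 1, 256)                  # initialised from the first-frame template
for t in range(5):
    frame_feats = torch.randn(2, 16 * 16, 256)  # stand-in for backbone features of frame t
    state = module(state, frame_feats)          # state adapts as appearance changes
```

Carrying `state` through the loop is what distinguishes this adaptive scheme from fixed template matching: the representation used to locate the animal at frame t already reflects its appearance in frame t-1.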
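Likewise, the distribution-aware regression module of the fine-grained phase suggests a head that predicts a discrete probability distribution over each bounding-box side and decodes it as an expectation, in the spirit of generalized-focal-loss-style heads. The bin count, head layout, and names below are illustrative assumptions, not the paper's specification.

```python
# Minimal sketch of distribution-aware box regression (illustrative only).
# Each of the four box sides (left, top, right, bottom) is predicted as a
# distribution over n_bins discrete offsets and decoded as its expectation.
import torch
import torch.nn as nn


class DistributionAwareBoxHead(nn.Module):
    def __init__(self, dim: int = 256, n_bins: int = 16):
        super().__init__()
        self.n_bins = n_bins
        # One logit vector per box side.
        self.head = nn.Linear(dim, 4 * n_bins)
        # Fixed bin centres 0, 1, ..., n_bins - 1 (in feature-grid units).
        self.register_buffer("bins", torch.arange(n_bins, dtype=torch.float32))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C) pooled target features -> (B, 4) side offsets
        logits = self.head(feats).view(-1, 4, self.n_bins)
        probs = logits.softmax(dim=-1)            # distribution over offsets
        return (probs * self.bins).sum(dim=-1)    # expectation per side
```

Decoding each side as an expectation keeps the head differentiable end to end, while the predicted distribution can express uncertainty about the ambiguous, deforming contours the abstract highlights.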





Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (62020106012, U1836218, 62106089).

Author information

Corresponding author

Correspondence to Tianyang Xu.

Additional information

Communicated by Hyun Soo Park.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Xu, T., Kang, Z., Zhu, X. et al. Learning Adaptive Spatio-Temporal Inference Transformer for Coarse-to-Fine Animal Visual Tracking: Algorithm and Benchmark. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02008-8



  • DOI: https://doi.org/10.1007/s11263-024-02008-8
