
Non-aligned RGBT Tracking via Joint Temporal-Iterated Homography Estimation and Multimodal Transformer Fusion

Chapter in: Computer and Information Science and Engineering

Part of the book series: Studies in Computational Intelligence (SCI, volume 1156)

Abstract

Existing RGBT tracking methods have limited practical applicability because they rely on spatially aligned videos, which typically require either elaborate platform design or high-cost manual alignment. To address this issue, we propose a Non-Aligned RGBT Tracker (NAT) that effectively utilizes both weakly-aligned and non-aligned data, and can be trained and tested on both types of data. Our method consists of two key components: the temporal-iterated homography estimation module and the multimodal transformer fusion module. The temporal-iterated homography estimation module learns a transformation using temporal knowledge. Exploiting the continuity of homography changes in multimodal video sequences, this module uses an iterated prediction scheme guided by the transformation parameters predicted in previous frames. This enables stable, accurate, and robust homography estimation in weakly-aligned and non-aligned scenarios without pre-alignment, making the approach practical for deployment. The multimodal transformer fusion module captures complementary information from the two modalities by exploiting the powerful global modeling capability of the transformer. The entire framework can be trained end-to-end and is evaluated on both weakly-aligned and non-aligned RGBT datasets, and the results show that NAT outperforms state-of-the-art methods on five RGBT tracking benchmarks. Our approach broadens the practical applicability of RGBT tracking research.
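The abstract describes two architectural components: an iterated homography estimator whose prediction for each frame is initialised from the previous frame's transformation parameters, and a transformer-based fusion of the RGB and thermal modalities. The following is a minimal PyTorch sketch of that structure, not the authors' implementation; the class names, feature dimensions, iteration count, and the `warp_by_offsets` placeholder are all illustrative assumptions.

```python
# Minimal sketch (assumed PyTorch); names, dimensions, and the warping stub are
# illustrative and not taken from the paper.
import torch
import torch.nn as nn


def warp_by_offsets(feat, offsets):
    # Placeholder for warping thermal features with the homography implied by the
    # 4-point offsets (e.g. DLT + grid_sample); identity keeps the sketch runnable.
    return feat


class TemporalIteratedHomography(nn.Module):
    """Regresses 4-corner offsets between RGB and thermal features, refined over a
    few iterations and initialised from the previous frame's estimate."""

    def __init__(self, feat_dim=256, num_iters=3):
        super().__init__()
        self.num_iters = num_iters
        self.head = nn.Sequential(
            nn.Conv2d(2 * feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(feat_dim, 8),          # 8 = x/y displacements of 4 corners
        )

    def forward(self, feat_rgb, feat_t, prev_offsets=None):
        # Temporal guidance: start from the previous frame's offsets, else zero motion.
        offsets = prev_offsets if prev_offsets is not None \
            else feat_rgb.new_zeros(feat_rgb.size(0), 8)
        for _ in range(self.num_iters):
            warped_t = warp_by_offsets(feat_t, offsets)
            offsets = offsets + self.head(torch.cat([feat_rgb, warped_t], dim=1))
        return offsets


class MultimodalTransformerFusion(nn.Module):
    """Fuses RGB and thermal feature tokens with a standard transformer encoder so
    self-attention can model global, cross-modal interactions."""

    def __init__(self, feat_dim=256, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, tokens_rgb, tokens_t):
        fused = self.encoder(torch.cat([tokens_rgb, tokens_t], dim=1))
        n = tokens_rgb.size(1)
        return fused[:, :n], fused[:, n:]   # split fused tokens back per modality
```

In a tracking loop, the offsets returned for frame t would be fed back as prev_offsets for frame t+1, which is the temporal-iterated behaviour the abstract refers to.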

This work was completed during a visiting study at the Institute of Automation, Chinese Academy of Sciences.



Acknowledgements

This work was supported by the Major Project for New Generation of AI (No. 2018AAA0100400), the National Natural Science Foundation of China (No. 62376004), the Natural Science Foundation of Anhui Province (No. 2208085J18), the Natural Science Foundation of Anhui Higher Education Institution (No. 2022AH040014), the University Synergy Innovation Program of Anhui Province (No. GXXT-2020-051, GXXT-2022-033), and the Anhui Provincial Colleges Science Foundation for Distinguished Young Scholars (No. 2022AH020093).

Author information


Corresponding author

Correspondence to Chenglong Li.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter


Cite this chapter

Liu, L., Li, C., Zheng, A., Tang, J., Xiang, Y. (2024). Non-aligned RGBT Tracking via Joint Temporal-Iterated Homography Estimation and Multimodal Transformer Fusion. In: Lee, R. (eds) Computer and Information Science and Engineering. Studies in Computational Intelligence, vol 1156. Springer, Cham. https://doi.org/10.1007/978-3-031-57037-7_2

