
Non-aligned RGBT Tracking via Joint Temporal-Iterated Homography Estimation and Multimodal Transformer Fusion

Chapter in: Computer and Information Science and Engineering

Part of the book series: Studies in Computational Intelligence (SCI, volume 1156)

Abstract

Existing RGBT tracking methods have limited practical applicability because they rely on spatially aligned videos, which typically require either elaborate platform design or high-cost manual alignment. To address this issue, we propose a Non-Aligned RGBT Tracker (NAT) that effectively utilizes both weakly-aligned and non-aligned data, and can be trained and tested on both types of data. Our method consists of two key components: the temporal-iterated homography estimation module and the multimodal transformer fusion module. The temporal-iterated homography estimation module learns a transformation using temporal knowledge. Exploiting the continuity of homography changes in multimodal video sequences, this module uses an iterated prediction scheme guided by the transformation parameters predicted in previous frames. This enables stable, accurate, and robust homography estimation in weakly-aligned and non-aligned scenarios without pre-alignment, making the approach practical for deployment. The multimodal transformer fusion module captures complementary information from the two modalities by exploiting the powerful global modeling capability of the transformer. The entire framework can be trained end-to-end and is evaluated on both weakly-aligned and non-aligned RGBT datasets, and the results show that NAT outperforms state-of-the-art methods on five RGBT tracking benchmarks. Our approach broadens the practical applicability of RGBT tracking research.
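The abstract describes two architectural components: an iterated homography estimator whose prediction for each frame is initialised from the previous frame's transformation parameters, and a transformer-based fusion of the RGB and thermal modalities. The following is a minimal PyTorch sketch of that structure, not the authors' implementation; the class names, feature dimensions, iteration count, and the `warp_by_offsets` placeholder are all illustrative assumptions.

```python
# Minimal sketch (assumed PyTorch); names, dimensions, and the warping stub are
# illustrative and not taken from the paper.
import torch
import torch.nn as nn


def warp_by_offsets(feat, offsets):
    # Placeholder for warping thermal features with the homography implied by the
    # 4-point offsets (e.g. DLT + grid_sample); identity keeps the sketch runnable.
    return feat


class TemporalIteratedHomography(nn.Module):
    """Regresses 4-corner offsets between RGB and thermal features, refined over a
    few iterations and initialised from the previous frame's estimate."""

    def __init__(self, feat_dim=256, num_iters=3):
        super().__init__()
        self.num_iters = num_iters
        self.head = nn.Sequential(
            nn.Conv2d(2 * feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(feat_dim, 8),          # 8 = x/y displacements of 4 corners
        )

    def forward(self, feat_rgb, feat_t, prev_offsets=None):
        # Temporal guidance: start from the previous frame's offsets, else zero motion.
        offsets = prev_offsets if prev_offsets is not None \
            else feat_rgb.new_zeros(feat_rgb.size(0), 8)
        for _ in range(self.num_iters):
            warped_t = warp_by_offsets(feat_t, offsets)
            offsets = offsets + self.head(torch.cat([feat_rgb, warped_t], dim=1))
        return offsets


class MultimodalTransformerFusion(nn.Module):
    """Fuses RGB and thermal feature tokens with a standard transformer encoder so
    self-attention can model global, cross-modal interactions."""

    def __init__(self, feat_dim=256, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, tokens_rgb, tokens_t):
        fused = self.encoder(torch.cat([tokens_rgb, tokens_t], dim=1))
        n = tokens_rgb.size(1)
        return fused[:, :n], fused[:, n:]   # split fused tokens back per modality
```

In a tracking loop, the offsets returned for frame t would be fed back as prev_offsets for frame t+1, which is the temporal-iterated behaviour the abstract refers to.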

This work was completed during a visiting study at the Institute of Automation, Chinese Academy of Sciences.



Acknowledgements

This work was supported by the Major Project for New Generation of AI (No. 2018AAA0100400), the National Natural Science Foundation of China (No. 62376004), the Natural Science Foundation of Anhui Province (No. 2208085J18), the Natural Science Foundation of Anhui Higher Education Institution (No. 2022AH040014), the University Synergy Innovation Program of Anhui Province (No. GXXT-2020-051, GXXT-2022-033), and the Anhui Provincial Colleges Science Foundation for Distinguished Young Scholars (No. 2022AH020093).

Author information


Corresponding author

Correspondence to Chenglong Li.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter


Cite this chapter

Liu, L., Li, C., Zheng, A., Tang, J., Xiang, Y. (2024). Non-aligned RGBT Tracking via Joint Temporal-Iterated Homography Estimation and Multimodal Transformer Fusion. In: Lee, R. (eds) Computer and Information Science and Engineering. Studies in Computational Intelligence, vol 1156. Springer, Cham. https://doi.org/10.1007/978-3-031-57037-7_2

