Abstract
Existing RGBT tracking methods are limited in practical applications by their reliance on spatially aligned videos, which require either elaborate platform design or costly manual alignment. To address this issue, we propose a Non-Aligned RGBT Tracker (NAT) that effectively utilizes both weakly-aligned and non-aligned data, enabling it to be trained and tested on both types of data. Our method consists of two key components: a temporal-iterated homography estimation module and a multimodal transformer fusion module. The temporal-iterated homography estimation module learns a transformation using temporal knowledge. Exploiting the continuity of homography changes across multimodal video sequences, it iteratively predicts each frame's transformation under the guidance of the parameters predicted for previous frames. This enables stable, accurate, and robust homography estimation in weakly-aligned and non-aligned scenarios without pre-alignment, making the method practical to deploy. The multimodal transformer fusion module captures complementary information from the two modalities by exploiting the powerful global modeling capability of the transformer. The entire framework is trained end-to-end and evaluated on both weakly-aligned and non-aligned RGBT datasets; the results show that NAT outperforms state-of-the-art methods on five RGBT tracking benchmarks. Our approach broadens the practical applicability of RGBT tracking research.
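The temporal-iterated idea described above can be sketched in a few lines: the homography for frame t is initialized from the previous frame's estimate and composed with a per-frame residual update, exploiting the continuity of homography changes across a video. This is a minimal illustrative sketch, not the authors' implementation; the residual updates (`deltas`) stand in for whatever the estimation network would predict.

```python
def mat_mul(a, b):
    """3x3 matrix product over plain Python lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply_homography(h, pt):
    """Map a 2-D point through homography h (with projective division)."""
    x, y = pt
    xs = h[0][0] * x + h[0][1] * y + h[0][2]
    ys = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (xs / w, ys / w)

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def track_homographies(deltas):
    """Iterated prediction: each frame's homography starts from the
    previous frame's estimate and is refined by composing a predicted
    residual update, rather than being estimated from scratch."""
    h_prev, estimates = IDENTITY, []
    for delta in deltas:
        h_t = mat_mul(delta, h_prev)  # refine the carried-over estimate
        estimates.append(h_t)
        h_prev = h_t                  # temporal guidance for the next frame
    return estimates
```

As a usage example, two successive unit x-translations compose into a shift of two: with `shift = [[1, 0, 1], [0, 1, 0], [0, 0, 1]]`, `track_homographies([shift, shift])[-1]` maps the origin to (2.0, 0.0). The design choice mirrors the module's motivation: because inter-frame motion is small, initializing from the previous estimate keeps each per-frame prediction a small correction, which is what makes the estimation stable on non-aligned video.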
This work was completed during a visiting study at the Institute of Automation, Chinese Academy of Sciences.
Acknowledgements
This work was supported by the Major Project for New Generation of AI (No. 2018AAA0100400), the National Natural Science Foundation of China (No. 62376004), the Natural Science Foundation of Anhui Province (No. 2208085J18), the Natural Science Foundation of Anhui Higher Education Institution (No. 2022AH040014), the University Synergy Innovation Program of Anhui Province (No. GXXT-2020-051, GXXT-2022-033), and the Anhui Provincial Colleges Science Foundation for Distinguished Young Scholars (No. 2022AH020093).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Liu, L., Li, C., Zheng, A., Tang, J., Xiang, Y. (2024). Non-aligned RGBT Tracking via Joint Temporal-Iterated Homography Estimation and Multimodal Transformer Fusion. In: Lee, R. (eds) Computer and Information Science and Engineering. Studies in Computational Intelligence, vol 1156. Springer, Cham. https://doi.org/10.1007/978-3-031-57037-7_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-57036-0
Online ISBN: 978-3-031-57037-7
eBook Packages: Intelligent Technologies and Robotics (R0)