Abstract
This paper studies the problem of fusing infrared and visible images to improve the quality of the target image. Traditional image fusion algorithms usually rely on convolutional neural networks (CNNs) for feature extraction and fusion, and can therefore exploit only local information. Some recent approaches combine CNNs and Transformers to capture long-range dependencies, but the global contextual information in the images still cannot be fully exploited. To better capture global information, we propose a novel multi-scale fusion transformer (MFT) for fusing infrared and visible images. In the encoder of our MFT, a multi-head pooling attention module extracts both local features and long-range dependencies from the input image. A novel dual-branch fusion module is then designed to simultaneously exploit the global contextual and infrared-visible complementary information during fusion. Experimental results show that the proposed method effectively improves the subjective visual quality of the fused infrared-visible image and outperforms many recent, competitive counterparts on most objective evaluation criteria.
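As a rough illustration of the encoder's key ingredient, the sketch below implements a multi-head pooling attention layer in PyTorch, in the spirit of multiscale vision transformers: the key and value tokens are spatially pooled before attention, so each query attends over a shortened, coarser token sequence that summarizes a wider image region. This is a minimal sketch under assumed design choices (average pooling applied to keys/values only, queries kept at full resolution); the class and parameter names are illustrative, not the authors' implementation.

```python
# Minimal sketch of multi-head pooling attention (assumed design, not the
# paper's code). Keys/values are pooled spatially before attention.
import torch
import torch.nn as nn

class MultiHeadPoolingAttention(nn.Module):
    def __init__(self, dim, num_heads=8, pool_stride=2):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        # Strided pooling shortens the key/value token sequence, enlarging
        # the effective receptive field at reduced attention cost.
        self.pool = nn.AvgPool2d(kernel_size=pool_stride, stride=pool_stride)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, h, w):
        # x: (B, N, C) tokens of an h x w feature map (N = h * w;
        # h, w assumed divisible by pool_stride)
        b, n, c = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def pool_tokens(t):
            t = t.transpose(1, 2).reshape(b, c, h, w)
            t = self.pool(t)                      # (B, C, h', w')
            return t.flatten(2).transpose(1, 2)   # (B, N', C)

        k, v = pool_tokens(k), pool_tokens(v)
        # Split heads: (B, heads, tokens, head_dim)
        q = q.reshape(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.reshape(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.reshape(b, -1, self.num_heads, self.head_dim).transpose(1, 2)

        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, N')
        out = attn.softmax(dim=-1) @ v                 # (B, heads, N, head_dim)
        out = out.transpose(1, 2).reshape(b, n, c)
        return self.proj(out)
```

Because only the keys and values are pooled, the output keeps the input's token count, so such a layer could drop into an encoder without changing feature-map resolution; pooling the queries as well would instead downsample the output, which is another common choice in multi-scale designs.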
Acknowledgement
This work is supported by the National Natural Science Foundation of China (Nos. 62262026 and 62276195), a project of the Jiangxi Education Department (No. GJJ211111), and the Fundamental Research Funds for the Central Universities (No. 2042023kf1033).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, C.M., Yuan, C., Luo, Y., Zhou, X. (2023). MFT: Multi-scale Fusion Transformer for Infrared and Visible Image Fusion. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol. 14259. Springer, Cham. https://doi.org/10.1007/978-3-031-44223-0_39
DOI: https://doi.org/10.1007/978-3-031-44223-0_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44222-3
Online ISBN: 978-3-031-44223-0
eBook Packages: Computer Science, Computer Science (R0)