Abstract
This paper studies the problem of fusing infrared and visible images to improve the quality of the target image. Traditional image fusion algorithms usually rely on convolutional neural networks (CNNs) for feature extraction and fusion, and can therefore exploit only local information. Some recent approaches combine CNNs and Transformers to capture long-range dependencies, but the global contextual information in the images still cannot be fully exploited. To better capture global information, we propose a novel multi-scale fusion transformer (MFT) for fusing infrared and visible images. In the encoder of our MFT, a multi-head pooling attention module extracts both local features and long-range dependencies from the input image. A novel dual-branch fusion module is then designed to simultaneously exploit the global contextual and infrared-visible complementary information during fusion. Experimental results show that the proposed method effectively improves the subjective visual quality of the fused infrared-visible image and outperforms many recent, competitive counterparts on most objective evaluation criteria.
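As a rough illustration of the encoder's key ingredient, the sketch below implements a multi-head pooling attention layer in PyTorch, in the spirit of multiscale vision transformers: the key and value tokens are spatially pooled before attention, so each query attends over a shortened, coarser token sequence that summarizes a wider image region. This is a minimal sketch under assumed design choices (average pooling applied to keys/values only, queries kept at full resolution); the class and parameter names are illustrative, not the authors' implementation.

```python
# Minimal sketch of multi-head pooling attention (assumed design, not the
# paper's code). Keys/values are pooled spatially before attention.
import torch
import torch.nn as nn

class MultiHeadPoolingAttention(nn.Module):
    def __init__(self, dim, num_heads=8, pool_stride=2):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        # Strided pooling shortens the key/value token sequence, enlarging
        # the effective receptive field at reduced attention cost.
        self.pool = nn.AvgPool2d(kernel_size=pool_stride, stride=pool_stride)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, h, w):
        # x: (B, N, C) tokens of an h x w feature map (N = h * w;
        # h, w assumed divisible by pool_stride)
        b, n, c = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def pool_tokens(t):
            t = t.transpose(1, 2).reshape(b, c, h, w)
            t = self.pool(t)                      # (B, C, h', w')
            return t.flatten(2).transpose(1, 2)   # (B, N', C)

        k, v = pool_tokens(k), pool_tokens(v)
        # Split heads: (B, heads, tokens, head_dim)
        q = q.reshape(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.reshape(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.reshape(b, -1, self.num_heads, self.head_dim).transpose(1, 2)

        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, N')
        out = attn.softmax(dim=-1) @ v                 # (B, heads, N, head_dim)
        out = out.transpose(1, 2).reshape(b, n, c)
        return self.proj(out)
```

Because only the keys and values are pooled, the output keeps the input's token count, so such a layer could drop into an encoder without changing feature-map resolution; pooling the queries as well would instead downsample the output, which is another common choice in multi-scale designs.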
Acknowledgement
This work is supported by the National Natural Science Foundation of China (Nos. 62262026 and 62276195), a project of the Jiangxi Education Department (No. GJJ211111), and the Fundamental Research Funds for the Central Universities (No. 2042023kf1033).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, C.M., Yuan, C., Luo, Y., Zhou, X. (2023). MFT: Multi-scale Fusion Transformer for Infrared and Visible Image Fusion. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol. 14259. Springer, Cham. https://doi.org/10.1007/978-3-031-44223-0_39
DOI: https://doi.org/10.1007/978-3-031-44223-0_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44222-3
Online ISBN: 978-3-031-44223-0
eBook Packages: Computer Science, Computer Science (R0)