
A Deep Learning Framework for Infrared and Visible Image Fusion Without Strict Registration

Published in: International Journal of Computer Vision

Abstract

In recent years, significant progress has been made in infrared and visible image fusion; however, existing methods typically assume that the source images have been rigorously registered or aligned before fusion. The modality difference between infrared and visible images makes automatic strict alignment highly challenging, which in turn degrades the quality of the subsequent fusion procedure. To address this problem, this paper proposes a deep learning framework for fusing misaligned infrared and visible images, freeing the fusion algorithm from strict registration. Technically, we design a CNN-Transformer Hierarchical Interactive Embedding (CTHIE) module, which combines the respective advantages of convolutional neural networks (CNNs) and Transformers, to extract features from the source images. In addition, by characterizing the correlation between features extracted from the misaligned source images, a Dynamic Re-aggregation Feature Representation (DRFR) module aligns the features with a self-attention-based feature re-aggregation scheme. Finally, to effectively utilize features from different levels of the network, a Fully Perceptual Forward Fusion (FPFF) module, which interactively transmits multi-modal features, is introduced to fuse the features and reconstruct the fused image. Experimental results on both synthetic and real-world data demonstrate the effectiveness of the proposed method, verifying the feasibility of directly fusing infrared and visible images without strict registration.
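To make the data flow concrete, the following is a minimal PyTorch sketch of the three-stage pipeline the abstract describes (feature extraction with CTHIE, feature-level alignment with DRFR, fusion with FPFF). Only the module names and their roles come from the abstract; every internal detail here — layer widths, the single-scale design, the specific attention configuration, and class names such as MisalignedFusionNet — is an illustrative assumption, not the authors' implementation.

# Minimal, illustrative sketch of the CTHIE -> DRFR -> FPFF data flow.
# All internals below are assumptions for illustration only.
import torch
import torch.nn as nn

class CTHIE(nn.Module):
    """Toy CNN + self-attention feature extractor (stand-in for the
    CNN-Transformer Hierarchical Interactive Embedding module)."""
    def __init__(self, dim=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, x):
        f = self.cnn(x)                               # local CNN features
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)         # (B, HW, C) tokens
        g, _ = self.attn(tokens, tokens, tokens)      # global context
        g = g.transpose(1, 2).reshape(b, c, h, w)
        return f + g                                  # simple CNN/Transformer interaction

class DRFR(nn.Module):
    """Toy cross-attention re-aggregation: visible features are re-aggregated
    onto the infrared feature grid, approximating feature-level alignment
    without explicit registration (stand-in for the DRFR module)."""
    def __init__(self, dim=32):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, f_ir, f_vis):
        b, c, h, w = f_ir.shape
        q = f_ir.flatten(2).transpose(1, 2)           # queries from infrared features
        kv = f_vis.flatten(2).transpose(1, 2)         # keys/values from visible features
        aligned, _ = self.cross_attn(q, kv, kv)
        return aligned.transpose(1, 2).reshape(b, c, h, w)

class FPFF(nn.Module):
    """Toy fusion head (stand-in for Fully Perceptual Forward Fusion)."""
    def __init__(self, dim=32):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, 1, 3, padding=1),
        )

    def forward(self, f_ir, f_vis_aligned):
        return self.fuse(torch.cat([f_ir, f_vis_aligned], dim=1))

class MisalignedFusionNet(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.encoder = CTHIE(dim)
        self.align = DRFR(dim)
        self.fusion = FPFF(dim)

    def forward(self, ir, vis):
        f_ir, f_vis = self.encoder(ir), self.encoder(vis)   # shared feature extractor
        f_vis_aligned = self.align(f_ir, f_vis)              # feature-level alignment
        return self.fusion(f_ir, f_vis_aligned)              # fused image

if __name__ == "__main__":
    ir = torch.rand(1, 1, 64, 64)     # infrared patch
    vis = torch.rand(1, 1, 64, 64)    # misaligned visible patch
    print(MisalignedFusionNet()(ir, vis).shape)  # torch.Size([1, 1, 64, 64])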


Data Availability

The datasets used in this study are available from the VOT2020-RGBT dataset (https://www.votchallenge.net/vot2020/dataset.html), the KAIST dataset (https://github.com/SoonminHwang/rgbt-ped-detection/blob/master/data/README.md), and the CVC-14 dataset (http://adas.cvc.uab.es/elektra/enigma-portfolio/cvc-14-visible-fir-day-night-pedestrian-sequence-dataset/).

Notes

  1. https://www.flir.ca/oem/adas/adas-dataset-form/

  2. https://figshare.com/articles/TN-Image-Fusion-Dataset/1008029


Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 62161015, 62176081 and U23A20294), and Yunnan Fundamental Research Projects (No. 202301AV070004).

Author information

Corresponding author

Correspondence to Yu Liu.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Communicated by Ondra Chum.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, H., Liu, J., Zhang, Y. et al. A Deep Learning Framework for Infrared and Visible Image Fusion Without Strict Registration. Int J Comput Vis 132, 1625–1644 (2024). https://doi.org/10.1007/s11263-023-01948-x

