FusionVAE: A Deep Hierarchical Variational Autoencoder for RGB Image Fusion

  • Conference paper
  • Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

Sensor fusion can significantly improve the performance of many computer vision tasks. However, traditional fusion approaches are either not data-driven, so they can neither exploit prior knowledge nor discover regularities in a given dataset, or they are restricted to a single application. We overcome this shortcoming by presenting FusionVAE, a novel deep hierarchical variational autoencoder that can serve as a basis for many fusion tasks. Our approach is able to generate diverse image samples that are conditioned on multiple noisy, occluded, or only partially visible input images. We derive and optimize a variational lower bound for the conditional log-likelihood of FusionVAE. To assess the fusion capabilities of our model thoroughly, we created three novel datasets for image fusion based on popular computer vision datasets. Our experiments show that FusionVAE learns a representation of aggregated information that is relevant to fusion tasks. The results demonstrate that our approach outperforms traditional methods significantly. Furthermore, we present the advantages and disadvantages of different design choices.
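For orientation, the variational lower bound mentioned above typically takes the form of a conditional ELBO. The paper derives its own bound; the following generic form (notation ours: target image y, set of input images X, latent groups z_1, ..., z_L) shows only the structure from which such hierarchical bounds are usually built, not the paper's exact objective:

```latex
% Single-latent conditional ELBO (generic form, not the paper's exact bound):
\log p_\theta(y \mid X) \;\ge\;
    \mathbb{E}_{q_\phi(z \mid y, X)}\big[\log p_\theta(y \mid z, X)\big]
    - D_{\mathrm{KL}}\big(q_\phi(z \mid y, X) \,\|\, p_\theta(z \mid X)\big)

% Hierarchical extension with latent groups z_1, ..., z_L, where the KL
% term decomposes per group (ladder-/NVAE-style):
\log p_\theta(y \mid X) \;\ge\;
    \mathbb{E}_{q_\phi(z_{1:L} \mid y, X)}\big[\log p_\theta(y \mid z_{1:L}, X)\big]
    - \sum_{l=1}^{L} \mathbb{E}_{q_\phi(z_{<l} \mid y, X)}
        D_{\mathrm{KL}}\big(q_\phi(z_l \mid z_{<l}, y, X) \,\|\, p_\theta(z_l \mid z_{<l}, X)\big)
```

The abstract also states that the model conditions on a variable number of input images and learns an aggregated representation. The sketch below shows one common way to build such a permutation-invariant condition code by mean-pooling per-image features (Deep-Sets style); the class name, layer sizes, and layout are illustrative assumptions, not FusionVAE's actual architecture:

```python
import torch
import torch.nn as nn

class SetConditionedEncoder(nn.Module):
    """Aggregates a variable number of input images into a single
    permutation-invariant condition code. Illustrative sketch only,
    not the architecture used in the paper."""

    def __init__(self, cond_dim: int = 128):
        super().__init__()
        # Shared per-image feature extractor; the layout is a placeholder.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, cond_dim),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, n_inputs, 3, H, W); n_inputs may differ between calls.
        b, n, c, h, w = images.shape
        feats = self.backbone(images.view(b * n, c, h, w)).view(b, n, -1)
        # Mean over the set axis makes the code invariant to the order
        # and the number of input images.
        return feats.mean(dim=1)

# Usage: two samples, each fusing three 64x64 RGB inputs -> code of shape (2, 128).
code = SetConditionedEncoder()(torch.randn(2, 3, 3, 64, 64))
```

Mean pooling keeps the condition code well defined for any number of inputs, which matches the fusion setting where the number of available views varies per example.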



Author information

Corresponding author

Correspondence to Fabian Duffhauss.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2822 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Duffhauss, F., Vien, N.A., Ziesche, H., Neumann, G. (2022). FusionVAE: A Deep Hierarchical Variational Autoencoder for RGB Image Fusion. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13699. Springer, Cham. https://doi.org/10.1007/978-3-031-19842-7_39

  • DOI: https://doi.org/10.1007/978-3-031-19842-7_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19841-0

  • Online ISBN: 978-3-031-19842-7

  • eBook Packages: Computer Science, Computer Science (R0)
