DSFormer: Leveraging Transformer with Cross-Modal Attention for Temporal Consistency in Low-Light Video Enhancement

  • Conference paper

Advanced Intelligent Computing Technology and Applications (ICIC 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14872)

Abstract

Recent advances in deep learning have significantly impacted low-light video enhancement, sparking great interest in the field. However, while these techniques have proven effective for enhancing individual static images, they suffer from temporal instability when applied to videos, leading to artifacts and flickering. This challenge is further compounded by the difficulty of obtaining dynamic low-light/normal-light video pairs in real-world scenarios. Our proposed solution tackles these issues by integrating a cross-attention mechanism with optical flow: optical flow is used to infer motion for individual frames, mitigating the temporal inconsistencies that often arise when training on static images. We also develop a Transformer model (DSFormer) that leverages both spatial and channel features to enhance visual quality and temporal stability in videos. Additionally, we design a novel dual-path feed-forward network (DPFN) that improves our method's ability to capture and preserve local contextual information, which is crucial for low-light enhancement. Extensive comparative and ablation studies demonstrate that our approach delivers high luminance and strong temporal consistency in the enhanced sequences.
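
This page does not spell out how the optical-flow term enters training, but methods in this area commonly use a flow-warping consistency objective: the enhanced frame at time t is warped into the coordinates of frame t+1 and compared against it. The PyTorch sketch below illustrates only that generic objective; the function names and the choice of an L1 penalty are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def backward_warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp `frame` (N, C, H, W) by `flow` (N, 2, H, W).

    flow[:, 0] / flow[:, 1] hold horizontal / vertical displacements in
    pixels, pointing from the target frame's pixels back into `frame`.
    """
    _, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device, dtype=frame.dtype),
        torch.arange(w, device=frame.device, dtype=frame.dtype),
        indexing="ij",
    )
    x = xs.unsqueeze(0) + flow[:, 0]  # (N, H, W) sampling columns
    y = ys.unsqueeze(0) + flow[:, 1]  # (N, H, W) sampling rows
    # grid_sample expects sampling coordinates normalized to [-1, 1].
    grid = torch.stack((2 * x / (w - 1) - 1, 2 * y / (h - 1) - 1), dim=-1)
    return F.grid_sample(frame, grid, align_corners=True)


def temporal_consistency_loss(enh_t, enh_t1, flow_t1_to_t):
    """L1 disagreement between frame t+1 and frame t warped into its view."""
    return F.l1_loss(backward_warp(enh_t, flow_t1_to_t), enh_t1)
```

In practice such a term is typically paired with an occlusion mask (e.g. derived from a forward-backward flow check) so that pixels without a valid correspondence are not penalized.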

Acknowledgments

This work was sponsored by the National Natural Science Foundation of China (NSFC) (62272342, 62020106004) and the Tianjin Natural Science Foundation (23JCJQJC00070).

Author information

Corresponding author

Correspondence to Fan Shi.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Xu, J., Mei, S., Chen, Z., Zhang, D., Shi, F., Zhao, M. (2024). DSFormer: Leveraging Transformer with Cross-Modal Attention for Temporal Consistency in Low-Light Video Enhancement. In: Huang, DS., Pan, Y., Zhang, Q. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol 14872. Springer, Singapore. https://doi.org/10.1007/978-981-97-5612-4_3

  • DOI: https://doi.org/10.1007/978-981-97-5612-4_3

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-5611-7

  • Online ISBN: 978-981-97-5612-4

  • eBook Packages: Computer Science, Computer Science (R0)
