1 Introduction

Images and videos acquired under low-light conditions suffer from various visual quality degradations such as poor visibility, intensive noise, and temporal flickering, which not only harm the viewing experience but also degrade the performance of downstream computer vision tasks in fields such as autonomous driving (Rashed et al., 2019) and surveillance (Ai & Kwon, 2020). Meanwhile, enhancing low-light videos into visually pleasing, high-quality ones is a highly ill-posed problem. In particular, the signal captured in the dark contains little structural information but intensive noise, resulting in a low signal-to-noise ratio. To address this issue, software-based enhancement algorithms present a promising alternative to costly hardware solutions. Such algorithms must enhance the visibility of individual frames while simultaneously preserving temporal consistency across consecutive frames.

In order to realize the aforementioned objectives, challenges arise in two aspects: high-quality training data acquisition and methodology design. Regarding dataset construction, many recent efforts have been devoted to capturing datasets in low-light conditions (Chen et al., 2019; Jiang & Zheng, 2019; Li et al., 2021; Wang et al., 2021). However, it remains challenging to obtain a variety of real captured paired videos (low/normal-light videos), even with the help of specifically designed optical systems (Wang et al., 2021). Because collecting such data is labor-intensive, existing datasets often suffer from limited scale, monotonous captured contexts (e.g., background and lighting conditions), and oversimplified motion patterns. The first two limitations restrict the model's capacity to handle complex scenes and to generalize to unseen scenarios. The third limitation weakens the capacity to model intensive and large motions in real-world scenarios, which results in flicker artifacts caused by temporal inconsistency among successive frames. As a result, open issues remain at the data end, such as building a large-scale paired dataset with rich contexts and diverse motion dynamics, as well as developing the corresponding training mechanism.

In addition to efforts on datasets, the other route targets methodology design. Previous efforts have been devoted to single-image enhancement methods that improve the human visual experience, including traditional low-light enhancement methods (Jobson et al., 1997; Fu et al., 2016b; Guo et al., 2016; Wang et al., 2013) and deep learning approaches (Lore et al., 2017; Ren et al., 2019; Yang et al., 2020; Guo et al., 2020; Zhao et al., 2021). Unlike the single-image enhancement task, both intra-frame spatial fidelity and inter-frame temporal consistency are critical for achieving the desired visual quality of dynamic video sequences. One straightforward idea is to apply low-light image enhancement independently to each frame of a dynamic low-light video. However, this strategy solely pursues higher fidelity at the frame level, typically quantified frame by frame using image quality measures such as the Peak Signal-to-Noise Ratio (PSNR) or the Structural SIMilarity (SSIM) index, without considering the consistency among frames. Hence, the enhanced video may exhibit undesired visual artifacts (e.g., temporal flickering) since inter-frame information has not been fully taken into consideration, degrading the visual quality of the enhanced video. Even worse, these problems become more apparent in scenarios involving significant camera or object movement. How to effectively incorporate temporal correlation, in addition to the well-established methods for modeling spatial fidelity, remains an open issue. That is to say, harnessing temporal information for low-light video enhancement holds great potential.

To address the above challenges, namely data scarcity, potential conflicts between spatial/temporal constraints, and insufficient exploration of corresponding training mechanisms, we make systematic efforts in three aspects: dataset construction, model design, and training mechanism.

1) Dataset Construction. Firstly, we create a new Diverse Statistics LOw-Light (DS-LOL) video dataset comprising 90 videos (12,420 frames), which covers different noise levels, diverse lighting conditions, diversified scenes, and complex motion patterns in real-world scenarios. We are dedicated to modeling the signal distribution in low-light conditions, which facilitates simulating a series of visual quality degradations (i.e., underexposure, moonlight, back-lit, and extremely dark). Furthermore, we also account for the non-linear in-camera signal processing (i.e., demosaicing, white balance, and gamma correction) in realistic noise modeling and synthesis.

2) Model Design. A video-specific enhancement method is developed to leverage both the spatial and temporal correlations within the video frames. We develop a novel approach that addresses the temporal flickering issue by imposing constraints at both the feature and pixel levels. At the feature level, the design is specifically crafted to seamlessly connect features across different frame indexes and maintain their continuity, enabling the full utilization of temporal correlations. At the pixel level, we adopt a cycle-consistent constraint to penalize temporal misalignment. This constraint naturally encourages temporal coherence, alleviating the temporal flickering issue and improving the overall quality of the enhanced video.

3) Training Mechanism. One issue revealed by previous methods is the contradiction (Lai et al., 2018) between spatial fidelity and temporal consistency, which can lead to degraded visual quality. To address this issue, we propose a simple yet effective Spatial-Temporal Compatible Learning (STCL) optimization technique that exploits spatial and temporal clues from real captured low-light videos and synthesized videos, respectively, achieving a better tradeoff between spatial fidelity and temporal consistency. Our dedicated endeavours in dataset construction, model design, and training mechanism yield visually pleasing enhanced videos that improve harmoniously in both spatial and temporal aspects, ensuring compatibility between them. Extensive experiments on various datasets demonstrate that our method outperforms state-of-the-art techniques, including both single-image and video-based approaches. In particular, superior performance in terms of frame-wise spatial quality and temporal consistency has been achieved. In summary, we make the following contributions:

  • We construct the DS-LOL video dataset by simulating low-light videos with various exposure levels, noise levels, and motion patterns. This synthesis strategy significantly enriches the diversity of background scenes and motion patterns, surpassing previous methods that rely on capturing real paired data. It not only provides a more extensive data resource for nighttime scenes but also facilitates the advancement of joint spatial-temporal video enhancement techniques.

  • We develop a novel Temporally Consistent Enhancer Network (TCE-Net), an effective low-light video enhancement model that can adaptively enhance diverse nighttime scenes by making use of temporal correlation and consistency.

  • We propose the STCL training strategy, built upon the designed data and model, to facilitate the effective learning of valuable information from different dataset resources. By adaptively constructing specific training loss functions on different datasets, we can improve the video quality from the perspectives of spatial fidelity and temporal consistency simultaneously.

  • We apply a cycle-consistency temporal constraint to suppress temporal inconsistency. The consistency constraint with optical flow alignment is applied only in the training stage, leading to a better model that still operates efficiently at test time. Extensive experimental results on several low-light video datasets demonstrate that the proposed method achieves superior enhancement results in illumination estimation, noise removal, and temporal consistency.

The paper is organized as follows. In Sect. 2, a comprehensive literature review is provided. Sect. 3 presents the construction of the proposed low-light video dataset. Sect. 4 introduces the proposed TCE-Net, which utilizes dynamic spatial/temporal clues within different frame indexes to improve temporal coherence significantly. In Sect. 5, experimental configurations and results are presented. Finally, concluding remarks are given in Sect. 6.

2 Related Works

2.1 Low-Light Image Enhancement

Images acquired under low-light conditions often suffer from poor visibility. Numerous low-light enhancement methods have been developed to brighten dark regions and suppress annoying hidden artifacts to improve image quality. Pioneering research works fall into two categories: Histogram Equalization (HE) (Abdullah-Al-Wadud et al., 2007; Coltuc et al., 2006; Stark, 2000; Arici et al., 2009) and Retinex theory-based approaches (Fu et al., 2016b; Guo et al., 2016; Jobson et al., 1997; Wang et al., 2013). For example, Dynamic Histogram Equalization (DHE) (Abdullah-Al-Wadud et al., 2007) was proposed to eliminate undesirable artifacts. Moreover, Guo et al. (2016) imposed a structure-aware prior to guide the final illumination layer, and Fu et al. (2016a) developed a fusion strategy to blend different techniques to produce an adjusted illumination layer. Despite their impressive ability to adjust the illumination, these methods do not make use of any prior knowledge from large-scale image datasets. Leveraging the powerful capability of both deep networks and large-scale data, the Low-Light Net (LLNet) (Lore et al., 2017) attempted to achieve image enhancement and noise removal simultaneously with a deep auto-encoder. Ren et al. (2019) developed a sophisticated hybrid architecture to enhance low-light images based on holistic estimation with edge information. Yang et al. (2020) designed a semi-supervised network, termed the Deep Recursive Band Network (DRBN), to extract coarse-to-fine representations and recompose these representations towards fitting perceptually pleasing images. Zhu et al. (2022) developed a Guidance Enhanced Multi-Scale Context (GEMSC) network to enhance low-light images guided by a learnable map. Guo and Hu (2023) recently proposed a method inspired by the divide-and-rule principle that decouples the entanglement of noise and color distortion, a major obstacle in low-light image enhancement. To remove the restriction of paired data, Jiang et al. (2021) addressed the enhancement problem based upon adversarial learning, and Guo et al. (2020) carefully designed a zero-reference optimization to perform image enhancement. A more systematic review of low-light image enhancement algorithms can be found in Liu et al. (2021a).

Previous works have constructed representative datasets to facilitate further enhancement research, including the LOw-Light (LOL) Real dataset (Wei et al., 2018), the See-in-the-Dark (SID) dataset (Chen et al., 2018), the DeepUPE dataset (Wang et al., 2019), the MIT-Adobe FiveK dataset (Bychkovsky et al., 2011), etc. The LOL Real dataset consists of 689 low/normal-light image pairs of real-world scenarios in which the ISO settings and exposure values were manually adjusted. The DeepUPE dataset and the MIT-Adobe FiveK dataset provide expert-retouched references that enable models to learn rich and diverse luminance information in different contexts. It is worth mentioning that the MIT-Adobe FiveK dataset is composed of raw sensor data, and the reference of Expert C is often recommended (Ni et al., 2020). The SID dataset, captured under extremely low-light conditions (less than 0.1 lux), benefits from a deep-learning-based RAW-to-RGB mapping that outperforms traditional camera processing pipelines. The LOL Synthesized dataset (Yang et al., 2021) is generated by adjusting Adobe Lightroom interface parameters, such that the synthesized illumination distribution approximately matches that of typical public datasets such as MEF (Ma et al., 2015), NPE (Wang et al., 2013), LIME (Guo et al., 2016), and DICM (Lee et al., 2013). There are also synthetic datasets (Lv et al., 2021; Zhou et al., 2023) that support the development of low-light scenario studies.

2.2 Low-Light Video Enhancement

Analogous to single-image enhancement, video enhancement has attracted increasing attention. Dong et al. (2011) proposed an efficient video enhancement method by applying a dehazing algorithm. Kim et al. (2015) considered video enhancement and denoising jointly to achieve visually pleasing results. In Liu et al. (2016); Wang et al. (2014), the Retinex theory has also been utilized for video enhancement. The Multi-Branch Low-Light Enhancement Network (MBLLEN) (Lv et al., 2018) was carefully designed to incorporate rich features and applies to both low-light image and video tasks. Triantafyllidou et al. (2020) addressed RAW-to-RGB mapping by leveraging Generative Adversarial Networks (GANs). Zhang et al. (2021a) developed a novel strategy to learn with a temporal consistency constraint, with the core idea of synthesizing short video sequences using optical flow.

As for the datasets, Jiang and Zheng (2019) built the See-Moving-Objects-in-the-Dark (SMOID) dataset by utilizing a co-axis optical system to capture low/normal-light video pairs simultaneously. The Dark Raw Video (DRV) dataset (Chen et al., 2019) was collected in the raw domain, and a Siamese network was trained on static videos and generalized to handle dynamic videos. The latest SDSD dataset (Wang et al., 2021), standing for Seeing Dynamic Scenes in the Dark, was constructed via a mechatronic alignment system. The challenging LoLi-Phone dataset (Li et al., 2021) contains 120 videos (55,148 video frames) captured by 18 mobile cameras in natural scenes without ground truth, and it is used to test generalization ability.

3 Dataset Construction

Fig. 1

The pipeline for building the proposed DS-LOL video dataset. We manually select high-quality videos from public datasets to perform low-light simulations using the Adobe Lightroom interface in Sec. 3.2 and then inject Gaussian-Poisson mixed noise to synthesize low-light video in Sec. 3.3

3.1 Motivation

We first present the motivation that leads to the construction of the dataset, thereby establishing a clear distinction between our work and the existing low-light video datasets. Existing datasets, including DRV, SMOID, and SDSD, exhibit inherent shortcomings regarding temporal dynamics for training well-established low-light video enhancement models. For instance, the DRV dataset contains extremely dark static videos with corresponding ground truth, which is insufficient to provide powerful pixel-level constraints for enhancing dynamic videos. The absence of dynamic scenes inherently hampers the application of temporal consistency constraints, thereby inevitably resulting in temporal flickering. While the SMOID and SDSD datasets are acquired using specifically designed systems (a co-axis optical system and a mechatronic system, respectively), capturing paired videos with these devices requires professional shooting skills, which makes data collection difficult. In particular, the SDSD dataset exhibits restricted degrees of freedom (velocity, position, rotation, orientation, etc.) due to the limitations imposed by the length of the electric slide rail in the mechatronic system. Consequently, the dataset predominantly comprises plain motion patterns. These limitations of existing datasets result in unsatisfactory diversity in motion and lighting conditions, leading to unsatisfactory temporal consistency in the enhanced results and poor adaptation to varying light intensities. To address these limitations, we create a synthetic low-light dataset containing various low-light conditions, noise levels, and diverse motion patterns. The pipeline used to create the DS-LOL dataset is shown in Fig. 1. The construction of this dataset fills the gap in terms of diverse backgrounds and motion patterns. The obtained dataset, abundant in dynamic scenes, enables effective model training for better temporal modeling, generating temporally more consistent results.

Fig. 2

We compare our synthesized DS-LOL dataset with the SDSD dataset, which includes both indoor and outdoor scenes, in terms of the level of diversity in spatiotemporal characteristics illustrated in the Spatial Information (SI) and Temporal Information (TI) space. The SI and TI range for the indoor scenes in the SDSD dataset are [23, 120] and [0, 112], respectively. The outdoor scenes in the SDSD dataset have an SI range of [28, 125] and a TI range of [2, 95]. In comparison, the synthesized DS-LOL dataset has a wider range of spatiotemporal characteristics, with an SI range of [0, 170] and a TI range of [0, 113]

3.2 Low-Light Video Synthesis

Our method for generating realistic dark video frames differs from previous approaches, such as the single-image simulation strategy used in Wei et al. (2018); Yang et al. (2021). We consider 15 consecutive frames as a video clip and align its distribution to the dark illumination distribution captured from a real-world low-light video. Through the utilization of multiple frames, we are able to synthesize video clips with diverse illumination levels. This strategy prevents the model from overfitting to a particular low-light condition and improves its adaptability to various lighting scenarios. To synthesize the low-light videos, we adjust the lighting parameters of 15 consecutive frames using Adobe Lightroom. We select four cofactors, \(\mathcal {C}_1, \mathcal {C}_2, \mathcal {C}_3\) and \(\mathcal {C}_4\), to adjust the lighting parameters, where \(\mathcal {C}_4 = \mathcal {C}_1^2 \). Table 1 shows our parameter configuration, which is highly representative and facilitates the synthesis of effective videos for model training. To simulate a variety of low-light scenarios, we apply varying degrees of darkening to the normal-light frames by sampling \(\mathcal {C}_1, \mathcal {C}_2, \mathcal {C}_3\) as uniformly distributed random variables. To ensure consistency within each video clip, we consider all frames of each testing video as a unified clip and apply a consistent operation to generate a coherent illumination distribution in the dark, which helps alleviate temporal flickering during the inference stage. Besides, our physics-based lighting simulation has two advantages. First, it is more general and flexible compared to sensor-specific lighting distributions that are often difficult to characterize in real-world low-light scenarios. Second, the non-linear enhancement curve based on commercial software maintains color consistency while changing the intensity order. In summary, our approach allows for the generation of synthetic low-light videos with various lighting conditions and noise levels, which can better train video enhancement models to handle different low-light scenarios.
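To make the per-clip sampling concrete, the following is a minimal NumPy sketch of the darkening step. Since the Adobe Lightroom adjustments themselves cannot be reproduced programmatically here, a simple exposure scaling plus a gamma-style tone curve is used as a stand-in, and all sampling ranges are illustrative placeholders rather than the values of Table 1; only the per-clip sharing of the cofactors and the relation \(\mathcal {C}_4 = \mathcal {C}_1^2\) follow the text.

```python
import numpy as np

def darken_clip(frames, rng=np.random.default_rng()):
    """Darken all frames of one 15-frame clip with a single set of cofactors.

    frames: float32 array of shape (T, H, W, 3) with values in [0, 1].
    The exposure/gamma operations below are stand-ins for the Lightroom
    adjustments, and the sampling ranges are placeholders (not Table 1).
    """
    c1 = rng.uniform(0.05, 0.5)   # exposure attenuation (placeholder range)
    c2 = rng.uniform(1.5, 3.0)    # tone-curve exponent (placeholder range)
    c3 = rng.uniform(0.7, 1.0)    # contrast attenuation (placeholder range)
    c4 = c1 ** 2                  # dependent cofactor C4 = C1^2 (Sec. 3.2)

    dark = np.clip(frames * c1, 0.0, 1.0) ** c2        # darken, then bend the curve
    dark = c3 * dark + (1.0 - c3) * dark.mean()        # mild contrast flattening over the clip
    # c4 is returned for completeness; its Lightroom role is not reproduced here.
    return np.clip(dark, 0.0, 1.0), (c1, c2, c3, c4)
```

Because a single set of cofactors is shared by all frames of a clip, the synthesized illumination stays temporally coherent, which is the property the text relies on to avoid flickering.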

Table 1 Parameters of generating diverse dark illumination

3.3 Gaussian–Poisson Mixed Noise Modeling

Our primary consideration in simulating a practical imaging system is noise modeling in low-light conditions, which significantly affects subsequently developed denoising methods. Most conventional works (Simoncelli & Adelson, 1996; Rudin et al., 1992) employ a noise model that is physically unrealistic for modern digital cameras, where noise is modelled in a signal-independent manner, e.g., as a zero-mean additive Gaussian distribution. We take inspiration from existing work (Foi et al., 2008; Hasinoff, 2014; Brooks et al., 2019; Guo et al., 2019) to adopt a more appropriate alternative for noisy real-world photographs. Camera sensor noise mainly comes from Poisson noise (also called photon shot noise) and read noise. The former primarily originates from the discrete nature of electrons and can be modelled by a Poisson process. The latter is caused by voltage fluctuations in the signal-processing electronics. Read noise can be approximated by a Gaussian random variable with zero mean and unit variance. Additionally, in-camera signal processing (demosaicing, tone mapping, gamma compression, etc.) makes the noise distribution more complex. As such, we follow the workflow of Brooks et al. (2019) and leverage the inverse in-camera signal processing pipeline to “unprocess” the video frames to simulate raw data records. Given the varying noise characteristics of camera sensors, incorporating diverse noise levels is critical for simulating imaging systems accurately. Herein, we obtain the observed image intensity o under low-light conditions,

$$\begin{aligned} o \sim \mathcal {N}\left( \mu =o, \sigma ^{2}=\lambda _{\text{ read } }+\lambda _{\text{ shot } } o\right) , \end{aligned}$$
(1)

where \(\sigma \) denotes the signal-dependent standard deviation. Instead of processing video frames individually, we follow the above low-light video synthesis strategy, treating 15 consecutive frames as one video clip. In other words, the noise intensity within the same video clip remains approximately constant. The parameters \(\lambda _{\text{ read } }\) and \(\lambda _{\text{ shot } }\) approximately follow a pre-defined log-linear relationship (Foi et al., 2008; Brooks et al., 2019), enabling the comprehensive simulation of noise across various camera sensors and acquisition settings. To choose a noise level for each video clip, we randomly generate a factor \(\mathcal {C}_5 \in \{0, 1, 2,\ldots ,19\}\) that controls the noise level, sampled as follows,

$$\begin{aligned} \begin{aligned}&\log \left( \lambda _{\text{ shot } }\right) \sim \mathcal {U}(a, b), \\&\log \left( \lambda _{\text{ read } }\right) \sim \mathcal {N}\left( \mu =2.18 \log \left( \lambda _{\text{ shot } }\right) +1.2, \sigma =0.26\right) , \\&a = \log \left( 10^{-6} + 6\times 10^{-6} \, \mathcal {C}_5 \right) , \\&b = \log \left( 7\times 10^{-6} + 6\times 10^{-6} \, (\mathcal {C}_5 + 1)\right) . \\ \end{aligned} \end{aligned}$$
(2)
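The sampling of the noise parameters and the injection of the mixed noise can be sketched as follows. This is a minimal illustration of Eqs. (1) and (2) only: the unprocessing/reprocessing of the in-camera pipeline (Brooks et al., 2019) is omitted, natural logarithms are assumed, and the helper names are ours.

```python
import numpy as np

def sample_noise_params(c5, rng=np.random.default_rng()):
    """Sample one (lambda_shot, lambda_read) pair per clip, following Eq. (2)."""
    a = np.log(1e-6 + 6e-6 * c5)                         # lower bound of the C5-th bin
    b = np.log(7e-6 + 6e-6 * (c5 + 1))                   # upper bound of the C5-th bin
    log_shot = rng.uniform(a, b)
    log_read = rng.normal(2.18 * log_shot + 1.2, 0.26)   # log-linear relationship
    return np.exp(log_shot), np.exp(log_read)

def add_mixed_noise(clip, lam_shot, lam_read, rng=np.random.default_rng()):
    """Inject signal-dependent Gaussian-Poisson mixed noise (Eq. 1) into a clip.

    clip: float32 array in [0, 1]; in the full pipeline this would be the
    "unprocessed" raw-like signal. The same (lam_shot, lam_read) pair is reused
    for every frame so the noise level stays constant within the clip.
    """
    sigma = np.sqrt(lam_read + lam_shot * clip)
    return np.clip(clip + rng.normal(size=clip.shape) * sigma, 0.0, 1.0)

# Usage (illustrative): c5 = np.random.randint(20)
# noisy_clip = add_mixed_noise(dark_clip, *sample_noise_params(c5))
```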

3.4 Video Dataset Characteristics

To improve the performance of enhancement algorithms, it is critical to build a large-scale, diverse dataset for model training. Such a dataset should contain a variety of contents and motions that faithfully mimic real-world low-light scenes. Therefore, we manually select 90 source video clips from publicly available datasets (Pinson, 2013; Wang et al., 2017). These video clips contain a total of 12,420 frames and cover various scenes, each with a duration of 5 s and a spatial resolution of 960 \(\times \) 540. The frame rate of the videos is either 30 fps or 24 fps. We carefully choose the source videos to avoid any redundancy in terms of content. The dataset we construct, called DS-LOL, includes diverse categories, such as Animal, Plant, Vehicle, Cityscapes, Human, Building, and others. The careful selection of these categories ensures that the dataset contains a wide range of real-life scenes with diverse objects and backgrounds. Figure 2 shows the Spatial Information (SI) and Temporal Information (TI), computed as defined in the ITU-T Recommendation (Installations & Line, 1999), which represent the spatiotemporal characteristics of the source videos. We then compare these measurements with those of the existing SDSD dataset. In comparison, the synthesized DS-LOL dataset has a wider range of spatiotemporal characteristics. By providing such a comprehensive dataset, our work establishes a solid foundation for evaluating the performance of low-light video enhancement algorithms and inspires new directions. The diverse contexts included make the enhancement task more challenging, thereby preventing the enhancement models from suffering from domain shift.
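For reference, the following is a minimal sketch of the SI/TI computation used for Fig. 2, following the ITU-T P.910 definitions (SI and TI are the temporal maxima of the per-frame spatial and temporal standard deviations). The exact implementation used in the paper is not specified, so details such as the Sobel implementation are assumptions.

```python
import numpy as np
from scipy.ndimage import sobel

def si_ti(luma):
    """Compute (SI, TI) for one video given its luminance frames of shape (T, H, W).

    Per ITU-T Rec. P.910: SI is the maximum over time of the spatial standard
    deviation of the Sobel-filtered frame; TI is the maximum over time of the
    standard deviation of the difference between successive frames.
    Frames should be floating point to avoid overflow in the difference.
    """
    si = max(np.std(np.hypot(sobel(f, axis=0), sobel(f, axis=1))) for f in luma)
    ti = max(np.std(luma[i] - luma[i - 1]) for i in range(1, len(luma)))
    return si, ti
```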

Fig. 3

Illustration of the proposed TCE-Net, including the input sequences \(\mathbf {O^{r}}\) and \(\mathbf {O^{s}}\), three core modules (Multi-frame Extraction module \(\textbf{G}\), Recurrent Refinement module \(\mathbf {R^{2}}\) and Up-sampling Reconstruction module \(\textbf{U}\)) as well as the training objectives. Herein, TCE-Net is coupled with the proposed STCL strategy, which constrains the model in spatial fidelity on the real data while enforcing temporal consistency on the synthetic data (DS-LOL dataset), such that both spatial and temporal clues can be jointly exploited

4 The Proposed Method

Our method comprises the network design and the development of a learning strategy. First, we design a network architecture that is capable of making full use of the temporal correlation within a video while maintaining its temporal consistency. Subsequently, we develop a spatial-temporal compatible learning strategy. By applying adaptive losses on different datasets, we pursue spatial fidelity and temporal consistency jointly.

4.1 Architecture: Temporally Consistent Enhancer Network (TCE-Net)

Our proposed TCE-Net jointly learns to enhance the low-light video and maintain temporal consistency. The network adopts a U-Net (Ronneberger et al., 2015) architecture. It consists of three modules: an encoder for multi-frame spatial-temporal feature extraction \(\textbf{G}\), a recurrent refinement module \(\mathbf {R^{2}}\), and a decoder \(\textbf{U}\) for the final reconstruction, as shown in Fig. 3.

Table 2 The architecture of multi-frame feature extraction module \(\textbf{G}\)

Multi-Frame Feature Extraction Module \(\textbf{G}\). To leverage temporal correlation, we extract the dynamic information from multiple frames. The background texture signal along the temporal dimension is intrinsically correlated, whereas the noise captured under low-light conditions is random. Therefore, the occluded contents or missing details of the current frame can be compensated from adjacent frames by exploiting the temporal redundancy under low-light conditions. In detail, we utilize multiple frames as input to enhance the low-light video frames progressively. The module \(\textbf{G}\), which consists of cascaded 3D and 2D convolutions arranged in an alternating manner, is used to extract global intra-frame and inter-frame features. The detailed architecture is listed in Table 2. The progressive architecture and the alternating arrangement strategy facilitate the implicit alignment of the features among adjacent frames. On the encoder side, six consecutive frames selected from a real captured low-light video training set, \(O^{r} = \left\{ o_{t-3}^{r}, o_{t-2}^{r}, o_{t-1}^{r}, o_{t}^{r}, o_{t+1}^{r}, o_{t+2}^{r} \right\} \), are split into two parts, where \(O_{t-1}^{r} = \left\{ o_{t-3}^{r}, o_{t-2}^{r}, o_{t-1}^{r}, o_{t}^{r}, o_{t+1}^{r}\right\} \) and \(O_{t}^{r} = \left\{ o_{t-2}^{r}, o_{t-1}^{r}, o_{t}^{r}, o_{t+1}^{r}, o_{t+2}^{r} \right\} \). Besides, another six consecutive frames selected from the synthetic DS-LOL training set follow the same split strategy to produce \(\left\{ O_{t}^{s}, O_{t-1}^{s}\right\} \). The purpose of splitting the data into two sets is to obtain two consecutive frame results, to which the temporal consistency constraint can be further applied. For the sake of simplicity, we only describe the real captured video frames \(\left\{ O_{t}^{r}, O_{t-1}^{r}\right\} \) in the following steps of TCE-Net. We encourage our encoder to effectively utilize both frame-wise and cross-frame information in a correlated and mutually beneficial manner. Specifically, we expect the spatial features of the same object or background to exhibit similarity, while the temporal features among adjacent frames should capture the temporal dynamics. Hence, the intermediate features \({E_{t}^{r}}\) and \({E_{t-1}^{r}} \in \mathbb {R}^{H \times W \times C} \) are extracted using the shared-weight module \(\textbf{G}\),

$$\begin{aligned} {E_{t-1}^{r}} = \textbf{G}\left( {O_{t-1}^{r}}\right) ,\, {E_{t}^{r}} = \textbf{G}\left( {O_{t}^{r}}\right) . \end{aligned}$$
(3)
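The following sketch illustrates how a weight-shared extractor of this kind can be applied to the two overlapping five-frame windows. The exact configuration of \(\textbf{G}\) is given in Table 2; the channel widths, kernel sizes, and depth below are assumptions chosen only to convey the alternating 3D/2D structure and the shared-weight usage of Eq. (3).

```python
import torch
import torch.nn as nn

class MultiFrameExtractor(nn.Module):
    """Illustrative stand-in for the multi-frame extraction module G.

    A 5-frame window (B, 3, 5, H, W) is reduced to one feature map
    (B, C, H/4, W/4); 3D convolutions fuse information across frames while
    2D convolutions refine intra-frame structure. Channel widths, kernel
    sizes, and depth are assumptions, not the configuration of Table 2.
    """
    def __init__(self, channels=64):
        super().__init__()
        self.temporal = nn.Sequential(
            nn.Conv3d(3, channels, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            # A temporal kernel of 5 collapses the 5-frame window to one slice.
            nn.Conv3d(channels, channels, kernel_size=(5, 3, 3),
                      stride=(1, 2, 2), padding=(0, 1, 1)),
            nn.ReLU(inplace=True),
        )
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, window):                   # window: (B, 3, 5, H, W)
        feat = self.temporal(window).squeeze(2)  # (B, C, H/4, W/4)
        return self.spatial(feat)

# Shared weights, as in Eq. (3):
# G = MultiFrameExtractor(); E_prev, E_curr = G(o_prev_window), G(o_curr_window)
```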
Fig. 4

Illustration of the two steps of the designed Feature Correlation Alignment (FCA), which ensures the temporal information is well aligned. Both \(E_{t}\) and \(E_{t-1}\) contain similar objects or background information and reflect the temporal dynamics of the video sequence. Concatenating contextual information \(F_{t}\) and \(F_{t-1}\) allows us to obtain the spatial-attention weights \(s_{1}, s_{2}, s_{3}, s_{4}\) and channel-attention weights \(c_{1}, c_{2}, c_{3}, c_{4}\), which can highlight significant information within the spatial path and frequency path, respectively

Recurrent Refinement Module \(\mathbf {R^{2}}\). It consists of three key components: the convolutional LSTM (convLSTM) structure (Shi et al., 2015), the Atrous Spatial Pyramid Pooling (ASPP) operation (Chen et al., 2017), and the proposed Feature Correlation Alignment (FCA) mechanism. The first two components are widely used for implicit joint spatial-temporal modeling. The last is our proposed mechanism that effectively aligns the features from different frames based on their correlation for explicit temporal modeling, complementing the first two. By integrating these components, our Recurrent Refinement Module \(\mathbf {R^{2}}\) can effectively capture the spatial-temporal representation of the input frames and propagate feature-level information across different frame indexes. The convLSTM structure aggregates the information from different frames in an effective way and propagates the information of the video along the temporal dimension. It is capable of extracting and maintaining a compact representation of several successive frames, which benefits temporal consistency. The ASPP operation enlarges the receptive field with hierarchical features captured from multi-scale contexts. This introduces richer contexts that facilitate the suppression of noise and the restoration of missing details. Although ConvLSTM and ASPP can perform implicit feature alignment, the absence of explicit alignment still leads to misalignment of local texture details, especially when large motions are involved. To address this, the FCA mechanism is proposed to leverage the spatial-temporal feature correlation to align the features across different frames effectively.

The intuitive idea of the proposed FCA is that the objects and backgrounds in consecutive video frames are highly correlated, while motion causes location shifts. Therefore, the high-level features of adjacent frames are also highly correlated and can thus be aligned along the temporal dimension based on their potential correspondence. As such, the first step of FCA calculates the feature correlation to derive the location matching of spatial correspondences along the temporal dimension. In this way, the correlation can be estimated more robustly thanks to the compensation of motion information across multiple frames, which differs from previous work (Lu et al., 2019). As shown in Fig. 4, we reshape the high-level features \({E_{t}^{r}}\) and \({E_{t-1}^{r}} \in \mathbb {R}^{H \times W \times C} \) to the size of \(\mathbb {R}^{H W \times C}\), and compute the affinity matrix \({M}^{r}\) as follows,

$$\begin{aligned} {M}^{r}={E}_{t-1}^{r\top } * {W} * {E}_{t}^{r}, \end{aligned}$$
(4)

where W is a trainable weight matrix, and \(*\) stands for matrix multiplication. If adjacent features are well aligned, we can exchange their temporal order by swapping one frame's high-level features with the other's, as follows,

$$\begin{aligned} \begin{array}{l} {F}_{t}^{r}=\textbf{Softmax}({M}^{r}) * {{E}}_{t-1}^{r\top }, \\ {F}_{t-1}^{r}=\textbf{Softmax}\left( {M}^{r\top }\right) * {{E}}_{t}^{r}, \end{array} \end{aligned}$$
(5)

We reshape \({F}_{t}^{r}\) and \({F}_{t-1}^{r}\) back to the original shape \(\mathbb {R}^{H \times W \times C}\) to further integrate contexts from the features of adjacent frames. The fused feature map is obtained by concatenating \({F_{t}}\) and \({F_{t-1}}\), denoted by the operation \(\textbf{Cat}\). Spatial gated maps and channel-wise gated vectors are then utilized to reveal significant information in the spatial and frequency paths, respectively, as follows,

$$\begin{aligned} \begin{array}{l} s_{j} =\textbf{Sigmoid}\left( \textbf{Conv}_{j}\left( \textbf{Cat}({F_{t}^{r}}, {F_{t-1}^{r}})\right) \right) , \quad j=1,2,3,4, \\ c_{j} =\textbf{Sigmoid}\left( \textbf{FC}_{j}\left( \textbf{GAP}(\textbf{Cat}({F_{t}^{r}}, {F_{t-1}^{r}}))\right) \right) , \quad j=1,2,3,4, \end{array} \end{aligned}$$
(6)

where \(\textbf{Conv}_{j}\), \(\textbf{FC}_{j}\) and \(\textbf{GAP}\) denote the j-th convolution layer, the j-th fully connected layer, and the global average pooling operation, respectively. The learned \(s_{j}\) and \(c_{j}\) are multiplied with the spatial and frequency paths in the second step of the FCA operation.

Subsequently, we aim to better capture the underlying temporal dynamics based on the fused information from both the spatial and frequency feature paths. Compared with conventional CNNs, the discrete wavelet transform (DWT) (Liu et al., 2018) can exploit frequency and location characteristics (Daubechies, 1990) to extract features that are more aware of their contexts. The learned spatial gated maps \(s_{1}, s_{4}\) and channel-wise gated vectors \(c_{1}, c_{4}\) are utilized to further unveil the significant information in the frequency path. Then, the original feature can be accurately restored by the inverse wavelet transform (IWT). We adopt operations on the feature maps in the spatial path similar to those in the frequency path. After the fusion of the spatial and frequency information, the enhanced features are treated as a gated map to preserve detailed texture, producing \(T_{t}^{r}\) and \(T_{t-1}^{r}\). In this way, the network is better aware of the spatial and temporal context and is therefore capable of achieving temporal consistency and suppressing flicker.
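A minimal PyTorch-style sketch of the first FCA step (Eqs. (4)-(6)) is given below. The feature shapes, the direction of the softmax normalization, and the reduction of the four gates to a single spatial/channel pair are assumptions for illustration; the DWT/IWT frequency path of the second FCA step is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureCorrelationAlignment(nn.Module):
    """Sketch of the correlation-based alignment (Eqs. 4-5) plus one gate of Eq. (6)."""
    def __init__(self, channels):
        super().__init__()
        self.W = nn.Parameter(torch.eye(channels))             # trainable weight matrix of Eq. (4)
        self.spatial_gate = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.channel_gate = nn.Linear(2 * channels, channels)

    def forward(self, e_prev, e_curr):                         # each: (B, C, H, W)
        b, c, h, w = e_curr.shape
        ep = e_prev.flatten(2).transpose(1, 2)                 # (B, HW, C)
        ec = e_curr.flatten(2).transpose(1, 2)                 # (B, HW, C)

        m = (ep @ self.W) @ ec.transpose(1, 2)                 # affinity matrix, (B, HW, HW), Eq. (4)
        f_curr = F.softmax(m.transpose(1, 2), dim=-1) @ ep     # current locations gather from E_{t-1}
        f_prev = F.softmax(m, dim=-1) @ ec                     # previous locations gather from E_t
        f_curr = f_curr.transpose(1, 2).reshape(b, c, h, w)
        f_prev = f_prev.transpose(1, 2).reshape(b, c, h, w)

        fused = torch.cat([f_curr, f_prev], dim=1)             # Cat(F_t, F_{t-1})
        s = torch.sigmoid(self.spatial_gate(fused))            # spatial gate (one of s_1..s_4)
        cvec = torch.sigmoid(self.channel_gate(fused.mean(dim=(2, 3))))  # channel gate (one of c_1..c_4)
        gate = s * cvec.view(b, c, 1, 1)
        return f_curr * gate, f_prev * gate
```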

Up-sampling Reconstruction Module \(\textbf{U}\). The decoder module plays a crucial role in generating high-quality videos in our TCE-Net framework; it operates on high-dimensional features at 1/4 of the input resolution. To ensure that the final output is of full resolution, the aligned features are upsampled twice via vanilla convolution layers. This step is essential for preserving the details and structure of the input video, as well as for ensuring that the output is visually appealing and free of artifacts. Finally, the last convolution layer reconstructs the features into a standard RGB image format. Throughout this process, the high-dimensional features are transformed into pixels.

$$\begin{aligned} {x}_{t-1}^{r} = \textbf{U}\left( {T_{t-1}^{r}} \right) ,\, {x}_{t}^{r} = \textbf{U}\left( {T_{t}^{r}} \right) . \end{aligned}$$
(7)

4.2 Training Dynamics: Spatial-Temporal Compatible Learning

Motivation. Achieving good performance in both the spatial and temporal domains poses great challenges, as the two objectives may conflict, as demonstrated in Lai et al. (2018); Zhang et al. (2020). The challenge becomes even more pronounced for low-light videos. Although methods that handle single-frame images can achieve high spatial fidelity, the capture conditions of low-light video datasets limit the availability of temporal dynamics, thus hindering the establishment of effective temporal constraints. The previous section addresses the data issue. Herein, we strive to overcome the limitations caused by the conflict between the spatial and temporal domains through a comprehensive utilization of diverse datasets. To achieve a better balance between spatial and temporal fidelity, our proposed STCL strategy is designed to take advantage of spatial clues and temporal clues from real captured and synthesized low-light videos, respectively.

Our compatibility learning solution decouples the quality objectives from the spatial and temporal perspectives and assigns different kinds of datasets, i.e., synthetic and real datasets, on which the respective constraints are built for model training. We train our TCE-Net on real captured low-light video datasets to focus on the intrinsic spatial data distribution of real low-light videos. Real low-light video datasets capture the noise and illumination characteristics of real scenarios. However, due to acquisition limitations, they usually lack rich temporal dynamics and scene diversity. If we enforce the temporal consistency constraint for TCE-Net on the real captured videos, the model tends to show degraded quality due to the intrinsic spatial-temporal conflict. Therefore, we choose to constrain the model in spatial fidelity on the real data while constraining the temporal consistency on the synthetic data (DS-LOL dataset). In this way, the synthesized DS-LOL dataset is capable of injecting knowledge of a broader range of diverse scenes and temporal dynamics for more effective exploitation of temporal distribution clues, enabling TCE-Net to understand richer temporal information and improve its ability to generate high-quality videos with consistent motion trajectories.

In summary, by obtaining spatial and temporal clues from the real and synthetic data, respectively, our strategy provides a unified, robust, and perception-friendly training approach that effectively enhances the quality of low-light videos by exploiting the training advantages of the two kinds of datasets.

Fig. 5

The first column shows the warped results. The second column visualizes the estimated mask results computed using Equation (14). The third column visualizes the flow maps based on the fine-tuned RAFT model

Loss Function. We elaborate on the details of the loss design, which follows the idea of spatial-temporal compatible learning.

1) Spatial Constraint. This constraint/loss is applied to constrain the model on the real low-light video dataset. The overall spatial training loss \(\mathcal {L}_{s}\) consists of three terms: the edge constraint \(\mathcal {L}_{edge}\), the Charbonnier penalty \(\mathcal {L}_{char}\) (Charbonnier et al., 1994), and the perceptual similarity constraint \(\mathcal {L}_{perceptual}\); a code sketch combining these terms is given after the list below. We optimize the proposed model with the following loss function,

$$\begin{aligned} \mathcal {L}_{s} = \mathcal {L}_{char} + \mathcal {L}_{perceptual } + \lambda _{edge} \mathcal {L}_{edge}, \end{aligned}$$
(8)

where \(\lambda _{edge}\) is set to 0.05 empirically.

  • Charbonnier constraint. The Charbonnier penalty function between the enhanced frame \({x}_i^{r}\) and the ground truth \(y_i^{r}\) is adopted for its global smoothness and robustness to annoying noise outliers:

    $$\begin{aligned} \mathcal {L}_{char}= \sum _{i=t, t-1} \sqrt{\left( y_i^{r} -{x}_i^{r} \right) ^2+\epsilon ^2}, \end{aligned}$$
    (9)

    where \(\epsilon \) is set to \(1 \times 10^{-3}\) empirically.

  • Perceptual constraint. The perceptual constraint has been extensively used to improve enhancement quality; we utilize the perceptual loss (Johnson et al., 2016) to measure the consistency between features extracted from a pre-trained VGG-19 (Simonyan & Zisserman, 2014) network,

    $$\begin{aligned} \mathcal {L}_{perceptual}= \frac{1}{N^{l}}\sum _{l}\sum _{i=t, t-1}\left\| \Phi ^{l}\left( y_i^{r}\right) -\Phi ^{l}\left( x_i^{r}\right) \right\| _{1}, \end{aligned}$$
    (10)

    where \(\Phi ^{l}\) indicates the l-th activation layer within the VGG-19 network and \(N^{l}\) is the number of selected layers.

  • Edge constraint. The edge constraint encourages sharp structural details and further improves the enhancement quality; it is defined as:

    $$\begin{aligned} \mathcal {L}_{edge}=\sum _{i=t, t-1} \sqrt{\left\| \delta x_i^{r} -\delta y_i^{r}\right\| ^2+\epsilon ^2}, \end{aligned}$$
    (11)

    where \(\delta \) indicates the Laplacian operator.
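The sketch below combines the three spatial terms of Eq. (8). It is a simplified illustration, not the exact training code: a single VGG-19 layer stands in for the layer set of Eq. (10), the ImageNet normalization of the VGG input is omitted, and the losses are averaged rather than summed.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

_vgg = vgg19(weights="IMAGENET1K_V1").features[:36].eval()     # truncated at an assumed layer
for p in _vgg.parameters():
    p.requires_grad_(False)

_LAPLACIAN = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)

def charbonnier(x, y, eps=1e-3):                               # Eq. (9), averaged instead of summed
    return torch.sqrt((x - y) ** 2 + eps ** 2).mean()

def edge_term(x, y, eps=1e-3):                                 # Eq. (11) on the luminance channel
    k = _LAPLACIAN.to(x.device)
    lap = lambda t: F.conv2d(t.mean(dim=1, keepdim=True), k, padding=1)
    return torch.sqrt((lap(x) - lap(y)) ** 2 + eps ** 2).mean()

def perceptual_term(x, y):                                     # Eq. (10), single VGG layer for brevity
    return F.l1_loss(_vgg(x), _vgg(y))                         # (ImageNet normalization omitted)

def spatial_loss(preds, gts, lambda_edge=0.05):
    """L_s of Eq. (8), accumulated over the two output frames i = t, t-1."""
    total = 0.0
    for x, y in zip(preds, gts):                               # preds/gts: two (B, 3, H, W) tensors each
        total = total + charbonnier(x, y) + perceptual_term(x, y) + lambda_edge * edge_term(x, y)
    return total
```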

2) Temporal Constraint. This constraint/loss is applied to constrain the model on the synthetic low-light video dataset. We estimate the optical flow with RAFT (Teed & Deng, 2020) owing to its state-of-the-art accuracy and high efficiency. Given the enhanced frames \({x}_{t}^{s}, {x}_{t-1}^{s}\) of the synthetic low-light training videos, the flow prediction is denoted as follows,

$$\begin{aligned} \begin{array}{l} {C}_{t \rightarrow t-1} = \textbf{RAFT} \left( {x}_{t}^{s}, {x}_{t-1}^{s}\right) , \\ {C}_{t-1 \rightarrow t} = \textbf{RAFT} \left( {x}_{t-1}^{s}, {x}_{t}^{s}\right) , \end{array} \end{aligned}$$
(12)

where the subscript of \({C}_{t \rightarrow t-1}\) denotes the flow from frame t to frame \(t-1\), and vice versa. Instead of warping in only one direction, we warp in a cyclic manner to avoid information loss,

$$\begin{aligned} \begin{array}{l} \hat{x}_{t}^{s} =\textbf{warp} \left( x_{t}^{s}, {C}_{t-1 \rightarrow t}\right) , \\ \hat{x}_{t-1}^{s} =\textbf{warp} \left( x_{t-1}^{s}, {C}_{t \rightarrow t-1}\right) . \end{array} \end{aligned}$$
(13)
Fig. 6

Representative video frames from the SDSD (Wang et al., 2021) dataset, enhanced using different enhancement methods. Zooming in on the details allows for better visual quality comparison. Our method achieves a satisfying result, alleviating the intensive noise, over-exposure and under-exposure that are apparent in the other methods

Fig. 7

Representative video frames from the SDSD (Wang et al., 2021) dataset, enhanced using different enhancement methods. Zooming in on the details allows for better visual quality comparison. Our method achieves a satisfying result, alleviating the intensive noise, inconsistent color and under-exposure that are apparent in the other methods

Fig. 8

Representative enhanced video frames from the SDSD (Wang et al., 2021) dataset, enhanced using different enhancement methods. The first column shows the enhanced results of different enhancement methods. The second column visualizes the warped result based on the fine-tuned RAFT model. The third column visualizes the per-pixel warping error between the enhanced result and the warped result

Table 3 Quantitative comparison on the outdoor scenes within SDSD dataset
Table 4 Quantitative comparison on the indoor scenes within SDSD dataset
Table 5 Quantitative comparison on the DRV dataset, including static and dynamic scenes
Table 6 Quantitative comparison on the proposed DS-LOL dataset

Our unique design uses the cycle form to warp adjacent frames, i.e., warping from the current frame to the previous frame and from the previous frame to the current frame through optical flow. The critical step in alleviating temporal flickering is to make our TCE-Net aware of motion information along the temporal dimension; notably, this constraint is imposed only during training. The cycle warping error between the two adjacent frames is defined as follows,

$$\begin{aligned} \mathcal {L}_{temporal} = M_{t \rightarrow t-1}\left\| {x}_{t}^{s}-\hat{x}_{t-1}^{s}\right\| _{1} + M_{t-1 \rightarrow t}\left\| {x}_{t-1}^{s}-\hat{x}_{t}^{s}\right\| _{1}, \end{aligned}$$
(14)

where \(\hat{x}_{t-1}^{s}\) and \(\hat{x}_{t}^{s}\) are the frames warped by the optical flows \(C_{t \rightarrow t-1}\) and \(C_{t-1 \rightarrow t}\), respectively. \(M_{t \rightarrow t-1}\) and \(M_{t-1 \rightarrow t}\) are visibility masks, with \(M_{t \rightarrow t-1} =\exp \left( -50 \left\| x_{t}^{s}-\hat{x}_{t-1}^{s}\right\| _{2}^{2}\right) \) (Lai et al., 2018). We visualize the estimated masks and optical flow maps in Fig. 5.
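A minimal sketch of the cycle-consistent temporal loss (Eqs. (12)-(14)) is given below, assuming the RAFT flows (in pixel units) have already been predicted; the backward-warping helper and the combination with the spatial loss via \(\lambda _{st}=0.1\) (Table 10) are illustrative rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def backward_warp(img, flow):
    """Warp img (B, C, H, W) with a dense pixel-space flow field (B, 2, H, W)."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=img.device),
                            torch.arange(w, device=img.device), indexing="ij")
    pos = torch.stack((xs, ys)).float() + flow                  # absolute sampling positions
    grid = torch.stack((2 * pos[:, 0] / (w - 1) - 1,            # normalize x to [-1, 1]
                        2 * pos[:, 1] / (h - 1) - 1), dim=-1)   # normalize y to [-1, 1]
    return F.grid_sample(img, grid, align_corners=True)

def temporal_cycle_loss(x_t, x_tm1, flow_t_to_tm1, flow_tm1_to_t):
    """Cycle-consistent temporal loss of Eq. (14); the flows come from RAFT (Eq. 12)."""
    x_tm1_hat = backward_warp(x_tm1, flow_t_to_tm1)   # \hat{x}_{t-1}, aligned to frame t (Eq. 13)
    x_t_hat = backward_warp(x_t, flow_tm1_to_t)       # \hat{x}_t, aligned to frame t-1 (Eq. 13)
    m_fwd = torch.exp(-50 * (x_t - x_tm1_hat).pow(2).sum(1, keepdim=True))    # visibility masks
    m_bwd = torch.exp(-50 * (x_tm1 - x_t_hat).pow(2).sum(1, keepdim=True))
    return (m_fwd * (x_t - x_tm1_hat).abs()).mean() + (m_bwd * (x_tm1 - x_t_hat).abs()).mean()

# STCL step (schematic): the spatial loss uses the real batch only and the temporal
# loss uses the synthetic DS-LOL batch only, weighted by lambda_st = 0.1 (Table 10):
#   loss = spatial_loss(real_preds, real_gts) + 0.1 * temporal_cycle_loss(x_t, x_tm1, f_fw, f_bw)
```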

5 Experiments

5.1 Experimental Settings

Datasets. We demonstrate the superiority of our method through experiments on real captured low-light videos as well as on the synthetic DS-LOL dataset. This comprehensive evaluation enables us to fairly compare the performance of our method with other existing enhancement methods. The DS-LOL dataset consists of 70 training videos and 20 testing videos. The chosen real captured low-light video datasets are SDSD (Wang et al., 2021) and DRV (Chen et al., 2019). The SDSD dataset consists of 150 videos in total: 70 videos depicting indoor scenes and 80 videos depicting outdoor scenes. Each video has 100 \(\sim \) 300 frames. The indoor subset comprises 58 training videos and 12 testing videos, while the outdoor subset consists of 67 training videos and 13 testing videos. The DRV dataset includes static and dynamic scenes captured in the RAW domain, with ground truth available only for the static scenes. We employ pre-processing techniques to convert the RAW data to the sRGB space based on the default rawpy ISP, as the RAW-to-RGB task is not our primary consideration. Out of the 200 videos in the DRV dataset, 129 static videos are designated for training, 49 static videos for testing, and 22 dynamic sequences for testing. For faster training, we resize the original SDSD and DRV videos to a resolution of 960 \(\times \) 512, following the approach taken in Wang et al. (2021).
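As a reference for this pre-processing step, a minimal sketch of the default rawpy conversion and resizing is given below; the interpolation method and file handling are assumptions, since only the use of the default rawpy ISP is stated in the text.

```python
import rawpy
import cv2

def drv_raw_to_srgb(raw_path, size=(960, 512)):
    """Convert one DRV RAW frame to sRGB with rawpy's default post-processing,
    then resize to 960 x 512 (interpolation choice is an assumption)."""
    with rawpy.imread(raw_path) as raw:
        rgb = raw.postprocess()                 # default ISP: demosaic, white balance, gamma
    return cv2.resize(rgb, size, interpolation=cv2.INTER_AREA)
```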

Implementation Details. The network is implemented on one Nvidia GeForce RTX 3090 GPU under the PyTorch framework and is trained for a total of 300 epochs with a batch size of 4. During training, we use the classic Adam (Kingma & Ba, 2014) optimizer. The learning rate is initially fixed to \(1 \times 10^{-4}\) for the first 150 epochs and is then linearly decayed to zero over the next 150 epochs. All training videos are randomly sampled and cropped into \( 256\times 256\times 5\) cubes, augmented by flipping and rotation operations. Furthermore, pixel values are normalized to [0, 1]. We adopt PSNR, SSIM (Wang et al., 2004), and the Feature SIMilarity Index (FSIM) (Zhang et al., 2011) as comparison criteria to evaluate the spatial quality of the enhanced video frames, computed with MATLAB (R2018b) implementations. Higher values of PSNR, SSIM, and FSIM denote better spatial quality. Following Zhang et al. (2021a), we adopt the warping error \(\mathcal {L}_{warp}\) (Lai et al., 2018) and the Mean Absolute Brightness Differences (MABD) (Jiang & Zheng, 2019) to evaluate temporal consistency. Lower values of \(\mathcal {L}_{warp}\) and MABD denote better temporal stability. The reported results are averaged over all frames of the testing videos.

Baselines. We compare the proposed network with single-image enhancement methods and video-based enhancement methods. The single-image enhancement methods include Bio-Inspired Multi-Exposure Fusion (BIMEF) (Ying et al., 2017), Low-Light Image Enhancement via Illumination Map Estimation (LIME) (Guo et al., 2016), Multiscale Retinex (MR) (Jobson et al., 1997), Dong et al. (2011), Multiple Fusion (MF) (Fu et al., 2016a), Naturalness Preserved Enhancement (NPE) (Wang et al., 2013), Simultaneous Reflectance and Illumination Estimation (SRIE) (Fu et al., 2016b), Zero-Reference Deep Curve Estimation (Zero-DCE) (Guo et al., 2020), the Structure-Aware Lightweight Transformer (STAR) (Zhang et al., 2021b), EnlightenGAN (Jiang et al., 2021), Retinex-inspired Unrolling with Architecture Search (RUAS) (Liu et al., 2021b), and the recent general U-shaped Transformer for image restoration, Uformer (Wang et al., 2022). The video-based enhancement methods include SDSDNet (Wang et al., 2021), StableLLVE (Zhang et al., 2021a), DRVNet (Chen et al., 2019), the Multi-Branch Low-Light Enhancement Network (MBLLEN) (Lv et al., 2018), and Semantic-Guided Zero-Shot Learning (SGZSL) (Zheng & Gupta, 2022). We adopt the authors' original codes to produce the results. Deep learning-based methods are retrained from scratch with a unified training data configuration, except SDSDNet (Wang et al., 2021). For SDSDNet, we directly evaluate the testing videos using the official pre-trained model, since its training-testing split is the same as ours.

5.2 Comparison to the State-of-the-Arts

Quantitative Evaluation. We quantitatively compare the performance of the enhancement methods on different real captured low-light video datasets. Our method yields promising results, as evidenced by quality measures including PSNR, SSIM, FSIM, \(\mathcal {L}_{warp}\), and MABD, which provide further insight into the effectiveness of our design. The results for the outdoor and indoor scenes are presented in Tables 3 and 4, respectively. It is important to note that the constructed synthetic data, DS-LOL, is not utilized for training the other enhancement methods. To provide a fairer and more comprehensive evaluation, we present two results of our method: one without the additional DS-LOL dataset (referred to as “Ours*”), and another that includes the DS-LOL dataset (referred to as “Ours”). Compared to the second-best performing method, our method achieves a 1.58 dB PSNR gain, a 0.0066 SSIM gain, and a 0.0058 FSIM loss on outdoor scenes, and a 1.75 dB PSNR gain, a 0.0142 SSIM gain, and a 0.0066 FSIM gain on indoor scenes. As for the temporal quality measures, our warping error \(\mathcal {L}_{warp}\) and MABD results outperform those of the other methods on both outdoor and indoor scenes, which demonstrates that our recovered results alleviate temporal flickering and keep the video smooth.

As illustrated in Table 5, all methods are evaluated on the DRV dataset. Note that, owing to the existence of ground truth, the static videos of the DRV dataset can be evaluated with the PSNR, SSIM, and FSIM measures. By contrast, the dynamic videos can only be evaluated using \(\mathcal {L}_{warp}\) and MABD due to the unavailability of ground truth. The performance gain over the second-place method is 0.9 dB in PSNR, 0.0426 in SSIM, and 0.001 in FSIM. As for the temporal quality measures, our method ranks second and first in terms of \(\mathcal {L}_{warp}\) and MABD, respectively. The spatial and temporal quality results show that our model design simultaneously improves the perceptual quality of the video and its temporal consistency. Besides, on the real captured low-light video datasets, most competing results, even those of the Uformer transformer architecture, are slightly lower than ours, suggesting that our method adapts well to challenging lighting conditions.

We further compare performance on our synthesized dataset to verify the effectiveness of our method. The comparison results are presented in Table 6. We observe that our method outperforms the other enhancement models in terms of PSNR, SSIM, FSIM, and \(\mathcal {L}_{warp}\), highlighting the effectiveness of the proposed framework on the synthesized dataset. Compared to the second-place results, our method demonstrates a performance gain of 1.88 dB in PSNR, 0.0122 in SSIM, and 0.012 in FSIM. These evaluation results reflect the performance of our method not only in real-world scenarios but also on the synthesized dataset.

Qualitative Evaluation. We qualitatively compare the different enhancement results in Figs. 6, 7 and 8, covering single-image and video-based enhancement methods, and make several observations. The results show that our method outperforms the other SOTA methods in different low-light scenarios, even under extremely low-light conditions. Comparatively, our method is more successful in recovering the illumination, maintaining the main structural details, and removing annoying noise. Besides, we observe from the video frame results that most methods can recover the illumination to some extent. However, severe artifacts, i.e., obstinate noise, color casting, and unsatisfactory structural details, still remain. Specifically, NPE, Dong et al.'s method, LIME, MF, and MR fail to remove intensive noise. Moreover, EnlightenGAN, SRIE, and BIMEF tend to introduce under-exposed regions, and SDSDNet introduces abnormal lighting. In Fig. 8, we present qualitative comparisons of the temporal inconsistency between our proposed method and the other methods. The figures show that our method produces superior enhancement results with lower per-pixel warping errors. By contrast, the inferior temporal consistency observed in other enhancement models is often caused by factors such as intensive noise or inconsistent exposure. As such, other enhancement methods can lead to inconsistent frame-to-frame variations, resulting in noticeable temporal flickering during video playback. Our results are much more visually pleasing than those of the other methods and provide effective results with normal-looking lighting, rich details, and less noise.

Model Complexity. As shown in Table 7, we compare the model complexity in terms of model size and inference time for a spatial input video frame size of \(960 \times 512\). Our method utilizes 3D convolution, which is computationally more intensive than 2D convolution. As a result, our method has a higher number of model parameters, specifically 9.06 M, and a longer inference time of 1706.59 ms when processing a single \(960 \times 512\) video frame. However, this is still acceptable compared with the SOTA EnlightenGAN (114.35 M) and Uformer (50.88 M).

Table 7 Quantitative comparison on parameters and inference time
Table 8 Ablation studies on loss functions within SDSD dataset
Table 9 Ablation studies on temporal loss function within SDSD dataset
Table 10 Ablation studies on the weight \(\lambda _{st}\) within SDSD dataset

5.3 Ablation Study

We perform ablation experiments on the SDSD dataset to verify the effectiveness of the temporal loss settings, the spatial loss settings, and the different network components. We report the PSNR, SSIM, FSIM, \(\mathcal {L}_{warp}\) and MABD values in Tables 8, 9 and 11. The result of our ablation study to determine the optimal weight \(\lambda _{st}\) is presented in Table 10. We also investigate the domain gap between the indoor and outdoor scenes within the SDSD dataset in Table 12.

Fig. 9

Visual quality comparison on the loss function. The red box regions have been zoomed in for visualization. We calculate the error maps (the third row) in gray-scale space, where the caxis is set from 0 to 0.02 in MATLAB R2018b

Fig. 10

Visual quality comparison on the network components. The red box regions have been zoomed in for visualization. We calculate the error maps (the third row) in gray-scale space, where the caxis is set from 0 to 0.02 in MATLAB R2018b

Ablation Study on Spatial Loss Functions. To evaluate the effectiveness of the spatial loss selection, we perform the ablation study on the spatial training losses shown in Table 8. It is observed that our final configuration produces better results in terms of PSNR, SSIM, \(\mathcal {L}_{warp}\) and MABD. The results demonstrate that the Charbonnier loss leads to superior pixel-level and structure-level results that are more resistant to noise outliers in low-light conditions. The absence of the perceptual loss leads to a significant decrease in both PSNR and SSIM values, highlighting its importance in the training scheme. Similarly, the absence of the edge loss also results in a decline in performance. In addition, we visualize the results in Fig. 9; our results generate fewer errors.

Ablation Study on Temporal Loss Function. To assess the impact of the temporal loss, we conduct an ablation study as shown in Table 9. The results indicate that removing the temporal cycle constraint (as defined in Equation (14)) leads to lower spatial and temporal quality measures. To verify the effectiveness of the STCL training strategy, we conduct experiments to compare the results obtained by adding the temporal loss on real captured low-light videos (“On Real”), adding the temporal loss on synthetic videos (“Ours”), and adding the temporal loss on both real captured and synthetic videos (“Both”). Based on our findings, we conclude that incorporating the temporal loss only on synthetic videos yields the best results in both the spatial and temporal domains. Table 10 shows the results of our ablation study to determine the optimal weight for the temporal loss term, denoted by \(\lambda _{st}\). We find that setting \(\lambda _{st}\) to 0.1 yields promising spatial and temporal quality results compared to the other weight choices. Although a value of 10 for \(\lambda _{st}\) results in a lower temporal warping error, the PSNR, SSIM and FSIM values tend to be lower.

Fig. 11

Illustration of feature distribution within indoor, outdoor, and DS-LOL. The abscissa and ordinate values have been normalized

Table 11 Ablation studies on network module within SDSD dataset

Ablation Study on Network Architecture. In Table 11, we verify the performance of the different components. The ConvLSTM operation improves the quantitative results significantly in both the spatial and temporal aspects. We attribute this to ConvLSTM's ability to mine the correlation and spatial information between adjacent video frames. Our observations indicate that the results obtained without the ASPP component are inferior to those of our final network, demonstrating the effectiveness of ASPP in capturing multi-scale representations. Similarly, the final version benefits from the proposed FCA component, resulting in higher PSNR and SSIM and lower \(\mathcal {L}_{warp}\) and MABD. Based on this performance verification, we select the combination of ConvLSTM, ASPP, and FCA to form the Recurrent Refinement Module \(\mathbf {R^{2}}\), preserving the spatial-temporal information. Furthermore, we provide the results of our ablation study on network components in Fig. 10, which demonstrate that our final results exhibit fewer errors (the third row) when compared to the ground truth. This highlights the effectiveness of the proposed approach and provides additional evidence to support its use in low-light video enhancement.

Ablation Study on Different Domains. Furthermore, we investigate the domain gap by comparing the results obtained from training on indoor and outdoor scenes separately (“Ours”) versus training on both scene types together (“Entire”) in Table 12. In the SDSD dataset, the indoor and outdoor scenes are captured by the same camera system. Our findings indicate that training on indoor and outdoor scenes separately results in better spatial alignment and overall performance improvements. The t-SNE result in Fig. 11 provides further evidence that there is a significant domain gap between different scenes. This is because training separately allows for a more focused and targeted learning process that better captures the specific spatial characteristics of each domain.

6 Conclusions

To address the temporal inconsistency issue in low-light video enhancement, we have made a comprehensive effort from the data-centric, model-centric, and training-mechanism aspects. We have created a high-quality synthetic DS-LOL video dataset, specifically designed to enrich nighttime scenes and support future research. This dataset includes a diverse range of scenes and provides rich temporal information, which allows for the development of models that can accurately capture temporal dynamics. We have developed a new temporally consistent video enhancement framework to learn the mapping function between low-light and normal-light video frames, where temporal inconsistency is addressed in two ways: we exploit the similarity of high-level features along the temporal dimension, and we add constraints between adjacent output video frames. Furthermore, we have designed the STCL strategy to resolve the contradiction between spatial fidelity and temporal consistency, with the core idea of exploiting spatial and temporal cues from the real captured and synthetic DS-LOL datasets, respectively. Our framework achieves illumination enhancement and noise reduction from the spatial perspective while maintaining consistency along the temporal dimension. Experimental results on two real captured low-light datasets show that the proposed method outperforms several state-of-the-art methods by a large margin.

Table 12 Ablation studies about the scene gap within the SDSD dataset