1 Introduction

Depth completion [1] aims to predict dense depth maps from sparse ones and the corresponding color images. It is an essential task in computer vision and has been widely used in various applications, such as augmented reality [2, 3], 3D scene reconstruction [4, 5], and self-driving [6, 7]. In the past few years, plenty of image guided methods [7–10] have been proposed for depth completion under daytime conditions, e.g., the well-known KITTI benchmark [11]. However, very few approaches have focused on the more challenging nighttime scenarios. Nighttime depth-aware self-driving is especially important but difficult. As shown in Fig. 1, existing state-of-the-art image guided depth completion methods [4, 7, 9] perform well under daytime conditions but struggle in challenging nighttime scenarios. This is because the sparse depths from light detection and ranging (LiDAR) are illumination-invariant while color images are highly affected by visibility and illumination variations. Therefore, we identify the key challenge of nighttime depth completion as guidance from color images, which suffer greatly from poor visibility and complex illumination.

Figure 1

Visual comparison of different image guided depth completion approaches in daytime and nighttime scenarios

Poor visibility

A possible solution is to leverage existing low-light image enhancement techniques [12–14] to improve the visibility of nighttime color images. Since there are no paired clear images available as supervisory signals, self-supervised methods [12, 13, 15] for nighttime depth perception are preferred. However, they cannot generate very reasonable illumination maps, resulting in enhanced color images that are too untrustworthy for safe self-driving. For example, Fig. 2 shows that the state-of-the-art model [13] suffers from severe color cast. To address this issue, we propose recurrent inter-convolution differencing (RICD), which explicitly and gradually estimates global illumination to improve poor visibility by using continuous differencing between two convolutions with different kernels. Treating the small-kernel-convolution feature as the center of the large-kernel-convolution feature is a new perspective. Moreover, convolution subtraction [16] is useful for modeling uncertainty [17] where the target pixels are difficult to predict accurately. In nighttime images, we can easily observe many areas with underexposure, overexposure, and terminator (the junction area between light and dark) effects due to varying illumination, resulting in more frequent and higher uncertainty than in daytime. Based on these priors, we transform the uncertainty in nighttime scenarios into relative light intensity by applying continuous convolution differencing. Such differencing features that capture explicit light intensity information are essential for predicting valid illumination. Consequently, RICD contributes to robust visibility enhancement of nighttime color images with more naturalistic visual effects, as shown in Fig. 2.

Figure 2

Visual comparison of self-calibrated illumination (SCI) [13] and our recurrent inter-convolution differencing (RICD) enhancement. NIQE: natural image quality evaluator, a completely blind no-reference metric

Complex illumination

Even after applying RICD enhancement, the distribution of relative light intensity in nighttime color images is still much more complex than that in daytime conditions. For instance, there are many terminator areas with varying illumination, which are difficult for standard convolutions to handle. Fortunately, inspired by the local binary pattern operator [18], which is robust to illumination variations, a series of central convolution differencing algorithms [19–22] have been devised to address these challenging scenarios. Nevertheless, their differencing centers are typically fixed, leading to restricted applicability, especially for self-driving where safety is paramount. For example, if the center contains noise or lies on a terminator area, these algorithms would introduce additional negative reference information, thus resulting in unsatisfactory illumination robustness. To address this issue, we propose illumination affinitive intra-convolution differencing (IAICD), which learns reasonable differencing centers within a single convolution. On the one hand, IAICD can reduce the latent impact of noise and predict an adaptive differencing center based on the surrounding neighbors. On the other hand, the estimated illumination map in the RICD module is used to adaptively measure the contribution of each neighbor. As a result, IAICD can cope with complex illumination in challenging nighttime scenarios.

Finally, considering that nighttime depth estimation [23–25] is a highly relevant task, we further evaluate our model on it. In short, our contributions are as follows:

1) To the best of our knowledge, this is the first work to extend the traditional depth completion task to the challenging conditions of nighttime environments, thereby enhancing the safety of self-driving applications.

2) We identify the key challenge of nighttime depth completion as the guidance from color images, where the visibility is low and illumination is complex. To tackle these issues, RICD and IAICD with learnable differencing centers are proposed.

3) We build two benchmark datasets for the nighttime depth completion task. Extensive experiments show that our method achieves state-of-the-art results.

2 Related work

2.1 Monocular depth perception at night

Monocular depth perception mainly consists of depth estimation [26] and completion [11]. To date, various depth estimation methods have been developed in both supervised [27, 28] and self-supervised [26, 29] manners for daytime scenarios. Recently, some depth estimation approaches [23–25, 30, 31] have focused on nighttime conditions. Specifically, ADDS [24] proposes a domain-separated network to address large day-night domain shift and illumination variations. RNW [23] introduces prior regularization and consistent image enhancement for stable training and brightness consistency, respectively. Furthermore, to handle the challenges in underexposed and overexposed regions, STEPS [25] presents a new method that jointly learns a nighttime image enhancer and a depth estimator with an uncertain pixel masking strategy and a bridge-shaped curve. For depth completion, the majority of related works are applied in daytime scenarios, employing either supervised [4, 7, 9, 10, 32–37] or self-supervised [6, 38] methods. For example, RigNet [9] and RigNet++ [39] explore a repetitive design in the image guided network branch to acquire clear guidance and structure. CFormer [7] couples convolution and vision transformer to leverage their local and global contexts. LRRU [35] presents a large-to-small dynamical kernel scope to capture long-to-short dependencies. However, there are hardly any depth completion approaches that focus on the much more challenging nighttime environment, which is a vital component of self-driving. Therefore, we attempt to develop a basic framework for the nighttime depth completion task to complement self-driving applications.

2.2 Differencing convolution

Vanilla convolution is commonly used to extract basic visual features in deep learning networks, but it is not very effective when processing scenes with varying illumination. Inspired by the local binary pattern [18] that is robust to illumination variations, CDC [19] first introduces central differencing convolution to aggregate both intensity and gradient information. After that, many modified operators have been presented for various vision tasks [20–22, 40]. For example, C-CDC [41] extends CDC into dual-cross central differencing convolution via horizontal, vertical, and diagonal decomposition for face anti-spoofing. Furthermore, PDC [21] proposes pixel differencing convolution to enhance gradient information for edge detection. Besides, SDN [22] introduces semantic similarity to build semantic differencing convolution for semantic segmentation. However, their fixed differencing centers are usually not robust or reasonable enough if they contain noise or lie on terminator areas. Different from these methods, our goal is to design learnable differencing centers that are affinitive to the neighbor and illumination distributions for safe self-driving at night.

2.3 Low-light image enhancement

Low-light images captured in dark environments often suffer from severe noise, low brightness, low contrast, and color deviation [14, 15]. Thus, plenty of supervised [42, 43] and self-supervised [12, 13, 15] methods have been presented to restore the details. For example, the well-known histogram equalization (HE) [44] is a classic algorithm that strengthens the global contrast. However, the accuracy of HE-based approaches degrades as the background noise contrast increases. As an alternative, Retinex-based methods [42, 45] perform better, under the assumption that a low-light image can be decomposed into illumination and reflectance. However, this task usually lacks paired ground truth annotations, since there may be multiple low-light and high-light images for the same scene, making it difficult to determine which reference image is the best. Moreover, strictly aligned image pairs are also difficult to obtain [46]. Thus, self-supervised approaches are preferred for real-world applications. Recently, SCI [13] proposes a self-supervised illumination estimation framework with extremely lightweight parameters. Based on these previous studies, we present a recurrent differencing strategy between paired convolutions to predict more reasonable illumination.

3 Method

Our work is specifically designed for nighttime depth completion, as well as depth estimation. We introduce the recurrent inter-convolution differencing and the illumination affinitive intra-convolution differencing in Sect. 3.2 and Sect. 3.3, respectively. Their detailed designs are depicted in Fig. 3. Then we elaborate on our framework (Fig. 4) in Sect. 3.4. Finally, the loss is defined in Sect. 3.5.

Figure 3

Comparison of vanilla convolution, central differencing convolution (CDC) [19], our RICD, and our illumination affinitive intra-convolution differencing (IAICD). \(\boldsymbol {\omega}_{6}, \boldsymbol {\omega}_{7}, \ldots , \boldsymbol {\omega}_{18}\) denote a \(3\times 3\) convolution kernel

Figure 4

The proposed learnable differencing center network (LDCNet). The low-light RGB x is first fed into the RICD to predict the credible image \(\boldsymbol {x}'\) and reasonable illumination m. Then, the IAICD is used to alleviate the negative impact of varying illumination. d/o: sparse/dense depth, \(\Phi _{\mathrm{c}}\)/\(\Phi _{\mathrm{d}}\): subnetwork

3.1 Prior knowledge

Self-calibrated illumination (SCI)

Based on the Retinex theory [47], the relation between the low-light image x and enhanced image \(\boldsymbol {x}'\) is formulated as:

$$ \boldsymbol {x}'=\boldsymbol {x} \oslash \boldsymbol {m}, $$
(1)

where \(\boldsymbol {m}\in (0, 1]\) is the estimated illumination map, and ⊘ denotes pixel-wise division. On this basis, a self-calibrated module and an illumination estimator are designed in the lightweight self-supervised SCI [13]. Given the low-light input x, this estimator predicts the illumination map m via several \(3\times 3\) convolutions. The enhancement loss includes fidelity and smoothness terms, defined as:

$$ \begin{aligned} &{\mathcal{L}_{\mathrm{f}}}= \frac{1}{n}\sum_{i=1}^{n}{{ ( \boldsymbol {m}_{i}-\boldsymbol {x}_{i} )}^{2}}, \\ &{\mathcal{L}_{\mathrm{s}}}=\frac{1}{n}\sum _{i=1}^{n}{\sum_{j\in \mathcal{N} ( i )}{{{ \mathcal{G}}_{i,j}} \vert {{\boldsymbol {m}}_{i}}-{{ \boldsymbol {m}}_{j}} \vert }}, \end{aligned} $$
(2)

where \(\mathcal{G}_{i,j}\) is the weight of a Gaussian kernel, and \(\mathcal{N} ( i )\) is a window centered at i with \(5\times 5\) adjacent pixels. \(\mathcal{L}_{\mathrm{f}}\) measures the similarity of m and x while \(\mathcal{L}_{\mathrm{s}}\) regularizes the consistency of m itself. n denotes the number of valid pixels.
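For concreteness, the two enhancement terms in Eq. (2) can be sketched in PyTorch as below. This is a minimal sketch under our own assumptions: the Gaussian weights \(\mathcal{G}_{i,j}\) are modeled as a fixed \(5\times 5\) spatial kernel, and tensors follow the usual (B, C, H, W) convention; the official SCI implementation may differ.

```python
import torch
import torch.nn.functional as F

def fidelity_loss(m, x):
    """L_f in Eq. (2): mean squared difference between illumination m and input x."""
    return F.mse_loss(m, x)

def smoothness_loss(m, sigma=1.0):
    """L_s in Eq. (2): Gaussian-weighted local consistency of the illumination map m.

    m is a (B, C, H, W) illumination map; N(i) is the 5x5 neighborhood around pixel i.
    The Gaussian weights are a fixed spatial kernel here (an assumption).
    """
    b, c, h, w = m.shape
    coords = torch.arange(5, dtype=m.dtype, device=m.device) - 2
    g1d = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g2d = torch.outer(g1d, g1d)
    g2d = g2d / g2d.sum()                                      # normalized 5x5 Gaussian kernel
    patches = F.unfold(m, kernel_size=5, padding=2)            # (B, C*25, H*W) neighborhoods
    patches = patches.view(b, c, 25, h * w)
    center = m.view(b, c, 1, h * w)
    diff = (patches - center).abs()                            # |m_i - m_j| for each neighbor j
    return (g2d.view(1, 1, 25, 1) * diff).sum(dim=2).mean()
```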

Vanilla convolution

We denote the frequently used 2D spatial convolution as vanilla convolution. For simplicity, we describe the convolution operator in 2D while ignoring the channel dimension. Given the input x, the new output feature y produced by vanilla convolution is represented as:

$$ \boldsymbol {y}_{p_{0}}=\sum_{p_{n}\in \mathcal{R}}{ \boldsymbol {\omega}_{p_{n}} \cdot \boldsymbol {x}_{p_{0} + p_{n}}}, $$
(3)

where \(\mathcal{R}\) is the local receptive field region sampled from x, \(p_{0}\) denotes the current location, \(p_{n}\) enumerates the locations in \(\mathcal{R}\), and \(\boldsymbol {\omega}_{p_{n}}\) is the corresponding convolution weight. Figure 3(a) shows a \(3\times 3\) kernel case.

Central differencing convolution

Based on the vanilla convolution, CDCN [19] designs central differencing convolution (CDC), where every pixel of x in \(\mathcal{R}\) subtracts its center pixel \(\boldsymbol {x}_{p_{0}}\):

$$ \boldsymbol {y}_{p_{0}}=\sum_{p_{n}\in \mathcal{R}}{ \boldsymbol {\omega}_{p_{n}} \cdot ( \boldsymbol {x}_{p_{0} + p_{n}} - \boldsymbol {x}_{p_{0}} )}. $$
(4)

Figure 3(b) illustrates the process of the above equation. Furthermore, combining Eqs. (3) and (4) yields a trade-off between the contributions of vanilla convolution and CDC, which is defined as:

$$ \begin{aligned} \boldsymbol {y}_{p_{0}}&=\theta \cdot \sum_{p_{n}\in \mathcal{R}}{\boldsymbol {\omega}_{p_{n}} \cdot ( \boldsymbol {x}_{p_{0} + p_{n}} - \boldsymbol {x}_{p_{0}} )} \\ &\quad{} + ( 1 - \theta ) \cdot \sum_{p_{n}\in \mathcal{R}}{ \boldsymbol {\omega}_{p_{n}} \cdot \boldsymbol {x}_{p_{0} + p_{n}}} \\ &= \sum_{p_{n}\in \mathcal{R}}{\boldsymbol {\omega}_{p_{n}} \cdot \boldsymbol {x}_{p_{0} + p_{n}}} - \theta \cdot \boldsymbol {x}_{p_{0}} \cdot \sum_{p_{n}\in \mathcal{R}}{\boldsymbol {\omega}_{p_{n}}}, \end{aligned} $$
(5)

where the first term is the vanilla convolution, the second is the central differencing term, and θ is a coefficient that trades off their contributions.
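A common way to implement Eq. (5) is to reuse a vanilla convolution and subtract the center term computed with the spatially summed kernel, so no explicit per-pixel differencing is needed. The sketch below follows this reformulation; the module name and the default θ are illustrative choices on our part.

```python
import torch.nn as nn
import torch.nn.functional as F

class CentralDifferenceConv2d(nn.Module):
    """Central differencing convolution, Eq. (5): conv(x) - theta * x_{p0} * sum(w)."""

    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1, theta=0.7):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding, bias=False)
        self.theta = theta

    def forward(self, x):
        out_vanilla = self.conv(x)
        if self.theta == 0:
            return out_vanilla                          # degenerates to vanilla convolution
        # Sum the kernel weights over the spatial dimensions and apply them as a 1x1 conv,
        # which is exactly the central differencing term x_{p0} * sum(w) per output channel.
        kernel_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)
        out_center = F.conv2d(x, kernel_sum)
        return out_vanilla - self.theta * out_center
```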

3.2 Recurrent inter-convolution differencing

Existing low-light image enhancement methods cannot restore very reasonable outputs in the more challenging nighttime self-driving scenarios. For example, in Fig. 2, SCI [13] suffers from severe color cast. To address this issue, we propose RICD, shown in Fig. 3(c). First, convolution subtraction [16] between two vanilla convolutions with different kernels is employed in RICD to highlight the uncertainty of differently lit areas. Then, the uncertainty is converted into illumination via recurrent convolution differencing. Suppose that \(\mathcal{R}\) is the larger local receptive field region and \(\mathcal{\bar{R}}\) is the smaller one, where \(\bar{p}_{n}\) enumerates the locations in \(\mathcal{\bar{R}}\). \(\mathcal{R}\) and \(\mathcal{\bar{R}}\) share the same current location \(p_{0}\). As a result, one step of RICD is defined as:

$$ \boldsymbol {y}_{p_{0}}=\sum_{p_{n}\in \mathcal{R}}{ \boldsymbol {\omega}_{p_{n}} \cdot \boldsymbol {x}_{p_{0} + p_{n}}} - \sum _{{\bar{p}_{n}}\in \mathcal{\bar{R}}}{\boldsymbol {\omega}_{ \bar{p}_{n}} \cdot \boldsymbol {x}_{p_{0} + \bar{p}_{n}}}. $$
(6)

One novel aspect of RICD is that it converts the uncertainty distribution into an illumination estimate. Besides, it introduces a new perspective that regards the feature of the smaller-kernel convolution as the center of the feature of the larger-kernel convolution. The differencing center is thus dynamically learned from its local environment. These characteristics contribute to valid illumination prediction. Consequently, according to Eq. (1), RICD can restore robustly enhanced images.
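A minimal PyTorch sketch of RICD is given below, assuming a \(5\times 5\)/\(3\times 3\) kernel pair (RICD-ii) and three recurrent steps as selected in Sect. 4.6. The input/output projections, the activation, and the way the illumination map is squashed into (0, 1] are our assumptions; only the large-minus-small convolution differencing of Eq. (6) and the division of Eq. (1) come from the text.

```python
import torch
import torch.nn as nn

class RICDStep(nn.Module):
    """One RICD step, Eq. (6): large-kernel conv minus small-kernel conv at the same center."""

    def __init__(self, channels, k_large=5, k_small=3):
        super().__init__()
        self.conv_large = nn.Conv2d(channels, channels, k_large, padding=k_large // 2)
        self.conv_small = nn.Conv2d(channels, channels, k_small, padding=k_small // 2)

    def forward(self, x):
        return self.conv_large(x) - self.conv_small(x)

class RICD(nn.Module):
    """Recurrent inter-convolution differencing (a sketch; head/tail convs are assumptions)."""

    def __init__(self, channels=16, steps=3):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        self.steps = nn.ModuleList([RICDStep(channels) for _ in range(steps)])
        self.tail = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, x):
        f = self.head(x)
        for step in self.steps:                           # recurrent differencing
            f = torch.relu(step(f))
        m = torch.sigmoid(self.tail(f)).clamp(min=1e-3)   # illumination map m in (0, 1]
        x_enh = x / m                                     # Eq. (1): x' = x ./ m
        return x_enh, m
```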

3.3 Illumination affinitive intra-convolution differencing

Although RICD enhances the visibility of nighttime images, the relative light intensity caused by varying illumination is still much more complex than that in daytime images. To handle this problem, we present IAICD, shown in Fig. 3(d). Different from CDC [19], whose center is fixed, IAICD first aggregates its differencing center adaptively from all neighboring pixels. After yielding the differencing matrix between the neighbors and the center, IAICD reweights the matrix via \(\mathcal{M}\), a channel-wise (c) normalization of the illumination map m, i.e., \({{\mathcal{M}}^{c}}={{{\boldsymbol {m}}^{c}}}/{\sum_{v=1}^{c}{ \vert {{\boldsymbol {m}}^{v}} \vert }}\). The output is then given by

$$ \boldsymbol {y}_{p_{0}}=\sum_{p_{n}\in \mathcal{R}}{ \boldsymbol {\omega}_{p_{n}} \cdot \biggl( \boldsymbol {x}_{p_{0} + p_{n}} - \sum_{p_{n}\in \mathcal{R}}{\mathcal{M}_{p_{n}} \cdot \boldsymbol {x}_{p_{n}}} \biggr)}. $$
(7)

Compared with CDC, the differencing center predicted by IAICD is robust. On the one hand, if the center \(\boldsymbol {x}_{p_{0}}\) contains noise, CDC would introduce abnormal differencing information, whereas IAICD can ignore \(\boldsymbol {x}_{p_{0}}\) or reduce its negative effect by assigning it a very small weight. On the other hand, if the center \(\boldsymbol {x}_{p_{0}}\) lies in a terminator area, the fixed \(\boldsymbol {x}_{p_{0}}\) is no longer appropriate as the differencing center, because its light intensity differs significantly from that of its neighbors. As an alternative, we integrate the corresponding illumination map to adjust the weight of each neighboring pixel.

When the illumination map is an all-ones matrix and the weights of \(p_{n}\) (\(n\neq 0\)) are zero, IAICD degenerates into CDC. Hence, IAICD in Eq. (7) is a generalized version of CDC in Eq. (4).
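The following sketch illustrates one way to realize Eq. (7): the learnable center is an illumination-weighted aggregation of the local neighborhood, and the differencing is then folded into a vanilla convolution plus a summed-kernel term, analogous to Eq. (5). We assume the illumination map has already been resized and projected to the same shape as the feature map; the extra per-neighborhood normalization is also our assumption.

```python
import torch.nn as nn
import torch.nn.functional as F

class IAICD(nn.Module):
    """Illumination affinitive intra-convolution differencing, Eq. (7) (a sketch)."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        self.conv = nn.Conv2d(channels, channels, kernel_size, padding=kernel_size // 2)

    def forward(self, x, m):
        # x: (B, C, H, W) features; m: illumination map, assumed to match x in shape.
        b, c, h, w = x.shape
        k2 = self.k * self.k
        M = m / (m.abs().sum(dim=1, keepdim=True) + 1e-6)            # channel-wise normalization -> M
        x_nb = F.unfold(x, self.k, padding=self.k // 2).view(b, c, k2, h, w)
        M_nb = F.unfold(M, self.k, padding=self.k // 2).view(b, c, k2, h, w)
        M_nb = M_nb / (M_nb.abs().sum(dim=2, keepdim=True) + 1e-6)   # normalize over neighbors (assumption)
        center = (M_nb * x_nb).sum(dim=2)                            # learnable differencing center
        out_vanilla = self.conv(x)
        kernel_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)
        out_center = F.conv2d(center, kernel_sum)                    # center term, as in Eq. (5)
        return out_vanilla - out_center
```

In this sketch, when the weights assign one to the center pixel and zero to the other neighbors, `center` reduces to \(\boldsymbol{x}_{p_{0}}\) and the module reduces to CDC, matching the degeneration discussed above.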

3.4 Learnable differencing center network

Architecture

Our learnable differencing center network (LDCNet) is illustrated in Fig. 4. Overall, LDCNet consists of an image guidance branch and a depth prediction branch. In the image guidance branch, the low-light image x is first fed into the RICD, generating the enhanced image \(\boldsymbol {x}'\) and the illumination map m. Next, a simple UNet-like subnetwork \(\Phi _{\mathrm{c}}\), composed of five layers with resolutions of 1/1, 1/2, 1/4, 1/8, and 1/16, is used to encode \(\boldsymbol {x}'\). The features of each layer, together with m, are then fed into the IAICD. In the depth prediction branch, the sparse depth d is encoded by a similar subnetwork \(\Phi _{\mathrm{d}}\). Meanwhile, the output of IAICD is leveraged at each resolution to guide the dense depth prediction in \(\Phi _{\mathrm{d}}\), yielding the final depth output o.
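A structural sketch of this data flow is shown below. The internals of \(\Phi _{\mathrm{c}}\) and \(\Phi _{\mathrm{d}}\) and the exact guidance injection are not specified here, so the callables passed to the constructor are placeholders and the bilinear resizing of m is our assumption.

```python
import torch.nn as nn
import torch.nn.functional as F

class LDCNet(nn.Module):
    """High-level data flow of Fig. 4 (x: low-light RGB, d: sparse depth, o: dense depth)."""

    def __init__(self, ricd, iaicd_blocks, phi_c, phi_d):
        super().__init__()
        self.ricd = ricd              # RICD: visibility enhancement + illumination estimation
        self.iaicd = iaicd_blocks     # one IAICD block per encoder resolution (ModuleList)
        self.phi_c = phi_c            # image guidance subnetwork Phi_c (UNet-like)
        self.phi_d = phi_d            # depth prediction subnetwork Phi_d (UNet-like)

    def forward(self, x, d):
        x_enh, m = self.ricd(x)                  # enhanced image x' and illumination map m
        color_feats = self.phi_c(x_enh)          # multi-scale color features (1/1 ... 1/16)
        guidance = []
        for blk, f in zip(self.iaicd, color_feats):
            m_s = F.interpolate(m, size=f.shape[-2:], mode='bilinear', align_corners=False)
            guidance.append(blk(f, m_s))         # IAICD guidance at each resolution
        o = self.phi_d(d, guidance)              # guided dense depth prediction
        return o, x_enh, m
```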

3.5 Loss function

Following previous depth completion methods [9, 35, 36, 48], we employ the \(\mathcal{L}_{\mathrm{2}}\) loss to supervise the output o with the ground truth depth D:

$$ {\mathcal{L}_{\mathrm{2}}}=\frac{1}{n}\sum _{i=1}^{n}{{ ( \boldsymbol {D}_{i}- \boldsymbol {o}_{i} )}^{2}}. $$
(8)

Finally, we jointly train the low-light image enhancement subnetwork and depth prediction subnetwork by combining Eqs. (2) and (8), obtaining the total loss function:

$$ \mathcal{L}_{\mathrm{total}}=\mathcal{L}_{\mathrm{2}} + \alpha \mathcal{L}_{\mathrm{f}} + \beta \mathcal{L}_{\mathrm{s}}, $$
(9)

where α and β are hyper-parameters, which are set to 0.15 and 0.30 as the defaults, respectively.
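Combining Eq. (8) and Eq. (9), and reusing the fidelity_loss and smoothness_loss sketches from Sect. 3.1, the training objective can be written as follows. Masking the \(\mathcal{L}_{\mathrm{2}}\) term to valid (non-zero) ground truth pixels is our assumption, following common depth completion practice.

```python
def total_loss(o, D, m, x, alpha=0.15, beta=0.30):
    """Eq. (9): L2 depth term plus the enhancement fidelity and smoothness terms."""
    valid = D > 0                                  # supervise only annotated depth pixels
    l2 = ((D[valid] - o[valid]) ** 2).mean()       # Eq. (8)
    lf = fidelity_loss(m, x)                       # Eq. (2), fidelity term
    ls = smoothness_loss(m)                        # Eq. (2), smoothness term
    return l2 + alpha * lf + beta * ls
```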

4 Experiments

In this section, we first introduce the two generated datasets in Sect. 4.1, metrics in Sect. 4.2, and implementation details in Sect. 4.3. Then, we present the quantitative and qualitative comparison with state-of-the-art methods in Sect. 4.4 and Sect. 4.5. Finally, extensive ablation studies in Sect. 4.6 are conducted to validate the effectiveness of each module.

4.1 Datasets

RobotCar-Night-DC

Oxford RobotCar [49] is a large-scale dataset that captures various weather and traffic conditions along a route in central Oxford. We create RobotCar-Night-DC from the 2014-12-16-18-44-24 sequences by using the left color images of the front stereo-camera. To generate sparse and ground truth depth, we employ the official toolbox to process the data from the front laser and sensors. Following the KITTI benchmark [11], we use the current frame for sparse depth generation and multiple frames for ground truth depth creation. The densities of the valid pixels of sparse depth and ground truth depth are approximately 4% and 16%, respectively. We crop and resize these data to \(576\times 320\) to remove the car-hood and enable efficient training. As a result, the RobotCar-Night-DC dataset contains \(10{,}290\) RGB-depth (RGB-D) pairs for training and 411 for testing.

CARLA-Night-DC

CARLA-EPE [25] is a synthetic dataset for the nighttime depth estimation task, generated with the CARLA simulator [50] and the EPE network [51]. The ground truth depth in CARLA-EPE is almost fully dense, which is unrealistic for LiDAR-based self-driving systems where the depth density is around 7% [9]. Hence, based on this synthetic dataset, we create CARLA-Night-DC for the proposed nighttime depth completion task by transferring the sparse LiDAR pattern of KITTI [11] to CARLA-EPE. CARLA-Night-DC is composed of 7532 RGB-D pairs in total, of which 7000 are for training and 532 for testing.
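One simple way to transfer a KITTI-style LiDAR pattern onto the nearly dense CARLA-EPE depth is to mask it with the valid-pixel mask of a real KITTI sparse frame, as sketched below. The function name and this exact masking procedure are illustrative assumptions and may differ from how CARLA-Night-DC was actually built.

```python
import numpy as np

def sparsify_with_kitti_pattern(dense_depth, kitti_sparse_depth):
    """Keep the dense depth only where the KITTI sparse frame has valid (non-zero) returns.

    dense_depth:        (H, W) float array from CARLA-EPE.
    kitti_sparse_depth: (H, W) float array from a real KITTI frame, aligned to (H, W).
    """
    mask = kitti_sparse_depth > 0                # KITTI LiDAR valid-pixel pattern
    return np.where(mask, dense_depth, 0.0).astype(np.float32)
```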

4.2 Metrics

Given ground truth D and output o, we use the following metrics: mean absolute error (MAE), root mean square error (RMSE), inverse MAE (iMAE), inverse RMSE (iRMSE), square relative error (Sq Rel), absolute relative error (Abs Rel), and root mean square logarithmic error (RMSE log). Table 1 summarizes their mathematical expressions.

Table 1 Metric definition
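For reference, these metrics follow their standard definitions and can be computed over the valid ground truth pixels as sketched below; unit conversions (e.g., reporting iMAE/iRMSE in 1/km as on KITTI) are omitted and left as an assumption.

```python
import torch

def depth_metrics(o, D, eps=1e-6):
    """Standard depth metrics over valid ground-truth pixels (KITTI-style conventions)."""
    valid = D > 0
    o, D = o[valid].clamp(min=eps), D[valid]
    abs_diff = (o - D).abs()
    return {
        "MAE":      abs_diff.mean(),
        "RMSE":     ((o - D) ** 2).mean().sqrt(),
        "iMAE":     (1.0 / o - 1.0 / D).abs().mean(),
        "iRMSE":    ((1.0 / o - 1.0 / D) ** 2).mean().sqrt(),
        "Abs Rel":  (abs_diff / D).mean(),
        "Sq Rel":   ((o - D) ** 2 / D).mean(),
        "RMSE log": ((o.log() - D.log()) ** 2).mean().sqrt(),
    }
```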

4.3 Implementation details

LDCNet is implemented with the PyTorch library on a single RTX 3090 GPU. We train it for 20 epochs using the Adam optimizer, with momentum \(\beta _{1}=0.900\), \(\beta _{2}=0.999\), and a weight decay of \(1.0 \times {10}^{-6}\). The initial learning rate is \(1.0 \times {10}^{-3}\), which is halved every 5 epochs. We use synchronized cross-GPU batch normalization [52] with a batch size of 12. The evaluation metrics are consistent with those of RNW [23] and KITTI [11]. The RMSE is measured in meters.
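These settings translate directly into a PyTorch training loop, sketched below; `model` and `train_loader` are placeholders for an LDCNet instance (Sect. 3.4) and a dataloader yielding (RGB, sparse depth, ground truth) batches of size 12, and the scheduler choice (StepLR with gamma 0.5 every 5 epochs) is inferred from the description above.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1.0e-3,
                             betas=(0.900, 0.999), weight_decay=1.0e-6)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)  # halve every 5 epochs

for epoch in range(20):
    for x, d, D in train_loader:           # x: low-light RGB, d: sparse depth, D: ground truth
        o, x_enh, m = model(x, d)
        loss = total_loss(o, D, m, x)      # Eq. (9)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```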

4.4 Nighttime depth estimation

We compare LDCNet with nighttime state-of-the-art methods, including MD2 [53], DeFeatNet [31], ADFA [30], ADDS [24], RNW [23], WSGD [54], and STEPS [25]. Based on STEPS, we embed our RICD and IAICD into its image enhancement branch and depth estimation branch, respectively. From Table 2 we can observe that LDCNet achieves the lowest errors and the highest accuracy in nearly all cases. On RobotCar, LDCNet is superior to the second best STEPS in all aspects. Furthermore, LDCNet surpasses the well-known MD2 by large margins. For example, the RMSE of MD2 is reduced from 12.771 m to 6.725 m, an improvement of almost 47%, while the accuracy \(\delta _{1}\) increases by 22.9%. On the CARLA-EPE dataset, LDCNet also outperforms the other three approaches. In addition, comparing these methods across RobotCar and CARLA-EPE, we observe that they perform worse on CARLA-EPE. This can be attributed to the darker color images and the larger depth ranges of CARLA-EPE. Finally, the visual results in Fig. 5 show that LDCNet can predict more accurate depth with more complete and sharper edges, further verifying the superiority and effectiveness of LDCNet.

Figure 5

Visual comparison of nighttime depth estimation on the RobotCar-Night-DC dataset. The methods include MD2 [53], RNW [23], STEPS [25], and our LDCNet

Table 2 Results of nighttime depth estimation on the RobotCar [49] and CARLA-EPE [25] benchmarks. LDCNet: the learnable differencing center network. ↑/↓ indicates that higher/lower scores are better. Bold indicates the best result and underline indicates the second best result

4.5 Nighttime depth completion

For a fair comparison, we retrain existing state-of-the-art daytime depth completion approaches in nighttime scenarios, including pNCNN [55], FusionNet [56], NCNN [32], S2D [3], NLSPN [4], GuideNet [48], RigNet [9], and CFormer [7]. The quantitative results are reported in Table 3. Overall, we discover that LDCNet achieves the best performance on the two nighttime depth perception benchmarks. Specifically, on the RobotCar-Night-DC dataset, the results of LDCNet are superior or competitive. For instance, LDCNet reduces the MAE by 28.7% compared with the third best RigNet. Compared with CFormer, which requires 5 days of training on a single 3090 GPU, LDCNet still achieves slightly better results with a 20-hour training cost. On the CARLA-Night-DC dataset, the challenging darker environment and greater distances result in poor performance of these methods. For example, the RMSE is at least 6 m greater than that on RobotCar-Night-DC. Additionally, we notice that NCNN, pNCNN, NLSPN, FusionNet, and CFormer, all of which estimate confidence maps to reweight depth, suffer from large RMSE and MAE values. We attribute this to the very low-light color images, which make it rather difficult to predict credible confidence distributions, resulting in unstable depth refinement. Finally, from Fig. 6 we discover that LDCNet succeeds in recovering object depth more accurately, such as the cars, bus shelters, and buildings in the foreground, and the trees, light poles, and billboards in the background.

Figure 6

Visual comparison of nighttime depth completion on the CARLA-Night-DC dataset. The methods include GuideNet [48], CFormer [7], and our LDCNet

Table 3 Results of nighttime depth completion on RobotCar-Night-DC and CARLA-Night-DC. Note that all methods in this table are retrained from scratch

4.6 Ablation study

For efficient ablation on RobotCar-Night-DC, we halve the size of the two subnetworks in LDCNet by setting the stride of the first-layer convolution to 2. Results are presented in Table 4, Table 5, Fig. 7, and Fig. 8.

Figure 7

Ablation on RICD and IAICD. ‘N-agg’: neighboring aggregation; ‘I-wei’: illumination affinitive weighting

Figure 8

Intermediate feature comparison of vanilla convolution and our method. \(k_{1}\): the kernel size of the convolution

Table 4 Ablation on components of LDCNet. RICD: recurrent inter-convolution differencing; IAICD: illumination affinitive intra-convolution differencing
Table 5 Ablation on diverse-kernel RICD. \(k_{1}\) and \(k_{2}\) denote the kernel sizes of the two convolutions

LDCNet

As listed in Table 4, the baseline LDCNet-i first removes the RICD and IAICD modules. Then, as an alternative to IAICD, LDCNet-i incorporates the guidance module proposed in GuideNet [48]. When implementing our RICD design (LDCNet-ii), we discover that the two evaluation metrics are consistently improved, i.e., the RMSE is reduced by 104 mm and the MAE is reduced by 129 mm. Similarly, the individual IAICD (LDCNet-iii) contributes to a larger performance improvement, reducing the RMSE and MAE by 117 mm and 148 mm, respectively. Finally, to combine the best of both worlds, LDCNet-iv embeds the RICD and IAICD simultaneously into the baseline. As a result, LDCNet-iv performs much better than LDCNet-i, significantly exceeding it by 137 mm in RMSE and 181 mm in MAE.

RICD

The basic unit of RICD is the differencing between two convolutions with different kernels. Consequently, we ablate the kernel sizes in Table 5. Based on LDCNet-i, RICD-i, RICD-ii, and RICD-iii conduct \((k+2)\times (k+2)\) and \(k\times k\) convolution differencing. As the kernel size increases, the two evaluation metrics decrease gradually. For example, the MAE of \(k_{2}=5\) is 109 mm lower than that of \(k_{2}=1\). This is due to the learnable differencing center design, which regards the small-kernel-convolution feature as the center of the large-kernel-convolution feature. Such differencing convolutions with larger local receptive fields can predict reliable illumination distributions by aggregating the surrounding light information. Furthermore, RICD-iv increases the kernel size gap from 2 to 4. On the one hand, the \(1\times 1\) convolution of RICD-i is clearly not suitable as the differencing center because it cannot leverage ambient information; thus, RICD-iv performs better than RICD-i despite the larger gap. On the other hand, with the larger size gap, the larger-kernel convolution introduces redundant light references over long distances, while the smaller-kernel convolution can only map the light in local regions. Therefore, RICD-iv performs worse than RICD-ii and RICD-iii, which have smaller size gaps. In addition, based on RICD-ii, Fig. 7(a) shows the ablation of RICD with different numbers of recurrent steps. We observe that RICD performs better as the number of steps increases. As depicted in Fig. 8, RICD can strengthen the representation of relative light intensity, contributing to more precise illumination. Finally, we select RICD-ii and step-3 as the defaults.

IAICD

Different from central differencing convolution (CDC) [19] with its fixed center, IAICD first aggregates all neighboring pixels and then employs the illumination affinitive weights to produce its learnable center. Figure 7(b) shows that both of these strategies contribute to consistent improvements over vanilla convolution and CDC. Furthermore, to evaluate the robustness of IAICD, we introduce Gaussian noise into the raw color images. IAICD still performs better than CDC and achieves performance very close to that obtained with the raw color images. All of this evidence demonstrates the effectiveness and robustness of IAICD.
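The noise robustness test simply perturbs the raw color images with additive Gaussian noise before they enter the network, e.g., as sketched below; the noise level sigma is illustrative since it is not specified here.

```python
import torch

def add_gaussian_noise(x, sigma=0.05):
    """Additive Gaussian noise for the robustness test; sigma is an illustrative choice."""
    noise = torch.randn_like(x) * sigma
    return (x + noise).clamp(0.0, 1.0)     # keep the perturbed image in the valid [0, 1] range
```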

4.7 Generalization

Here we further evaluate the generalization capabilities of our LDCNet for both daytime depth completion [11] and low-light image enhancement [42] tasks.

Table 6 reports the comparison results on the KITTI depth completion dataset [11], which is collected during the daytime. We can observe that the performance of current state-of-the-art methods [7, 9, 10, 48] is very similar. For example, the RMSE, which is the ranking metric, is around 730 mm for all of them. Although our LDCNet is specifically designed for nighttime scenarios, it still achieves competitive performance on the daytime benchmark.

Table 6 Results on the KITTI depth completion benchmark

Based on self-supervised SCI [13], which is trained for 600 epochs, we replace its illumination estimation module with our RICD block. According to Table 7, RICD consistently improves the baseline in both the no-reference NIQE [60] and DE [59] metrics and the full-reference PSNR and SSIM metrics. Furthermore, Fig. 9 again demonstrates the superiority of our method, i.e., higher quality with a lower training cost.

Figure 9

Visual comparison on difficult test split. 100/600: 100/600 training epochs; GT: ground truth

Table 7 Comparison on the difficult test split of SCI [13]. NIQE: natural image quality evaluator; DE [59]: discrete entropy, a completely blind no-reference metric; PSNR: peak signal-to-noise ratio; SSIM: structural similarity

5 Conclusion

In this paper, we extended the conventional depth completion task to nighttime environments to complement safe self-driving. We identified the key challenge as the guidance from color images with low visibility and complex illumination. As a result, we proposed RICD and IAICD to improve the poor visibility and reduce the negative influence of the varying illumination, respectively. RICD predicts explicit global illumination to enhance visibility, where treating the small-kernel-convolution feature as the center of the large-kernel-convolution feature is a new perspective. IAICD alleviates the impact of local relative light intensity variations, in which the differencing center is learned dynamically from the neighboring pixels and the illumination map of RICD, making the center robust and illumination affinitive. Finally, extensive experiments on depth perception datasets have verified the effectiveness of LDCNet.

Limitation

We believe our LDCNet is a general approach that could benefit network-based models in various other vision tasks. However, the current version is only evaluated on four tasks, i.e., nighttime depth estimation, nighttime depth completion, daytime depth completion, and low-light image enhancement. In the near future, we will extend it to more tasks, e.g., nighttime semantic segmentation, nighttime flow estimation, etc. Additionally, LDCNet has latent value for important domains such as self-driving and 3D scene reconstruction in low-light environments.