Learnable Differencing Center for Nighttime Depth Perception

Depth completion is the task of recovering dense depth maps from sparse ones, usually with the help of color images. Existing image-guided methods perform well on daytime depth perception self-driving benchmarks, but struggle in nighttime scenarios with poor visibility and complex illumination. To address these challenges, we propose a simple yet effective framework called LDCNet. Our key idea is to use Recurrent Inter-Convolution Differencing (RICD) and Illumination-Affinitive Intra-Convolution Differencing (IAICD) to enhance the nighttime color images and reduce the negative effects of the varying illumination, respectively. RICD explicitly estimates global illumination by differencing two convolutions with different kernels, treating the small-kernel-convolution feature as the center of the large-kernel-convolution feature in a new perspective. IAICD softly alleviates local relative light intensity by differencing a single convolution, where the center is dynamically aggregated based on neighboring pixels and the estimated illumination map in RICD. On both nighttime depth completion and depth estimation tasks, extensive experiments demonstrate the effectiveness of our LDCNet, reaching the state of the art.


I. INTRODUCTION
Depth completion [1] aims to predict dense depth maps from sparse ones and the corresponding color images. It is an essential task in computer vision and has been widely used in various applications, such as augmented reality [2], [3], 3D scene reconstruction [4], [5], and self-driving [6], [7]. In the past few years, plenty of image-guided methods [7]-[10] have been proposed for depth completion under daytime conditions, e.g., the well-known KITTI benchmark [11]. However, very few approaches focus on the more challenging nighttime scenarios, even though nighttime depth-aware self-driving is especially important yet difficult. As shown in Fig. 1, existing state-of-the-art image-guided depth completion methods [4], [7], [9] perform well in daytime conditions but struggle in challenging nighttime scenarios. This is because the sparse depths from LiDAR are illumination-invariant while the color images are highly affected by visibility and illumination variations. Therefore, we identify the key challenge of nighttime depth completion as the guidance from color images, which heavily suffer from poor visibility and complex illumination.
For poor visibility: A possible solution is to leverage existing low-light image enhancement techniques [12]-[14] to improve the visibility of nighttime color images. Since there are no paired clear images available as supervisory signals, self-supervised methods [12], [13], [15] for nighttime depth perception are preferred. However, we find that they cannot generate very reasonable illumination maps, resulting in untrustworthy enhanced color images for safe self-driving. For example, Fig. 2 shows that the state-of-the-art model [13] suffers from serious color cast. To tackle this issue, we propose Recurrent Inter-Convolution Differencing (RICD), which explicitly and gradually estimates global illumination to improve the poor visibility, using continuous differencing between two convolutions with different kernels. It takes a new perspective that treats the small-kernel-convolution feature as the center of the large-kernel-convolution feature. Moreover, we recognize that convolution subtraction [16] is useful for modeling uncertainty [17] where the target pixels are difficult to predict accurately. In nighttime images, we can easily observe many areas with underexposure, overexposure, and terminator (the junction area between light and dark) effects due to varying illumination, resulting in more and higher uncertainty than usual. Based on these priors, we transform the uncertainty in nighttime scenarios into relative light intensity by applying continuous convolution differencing. Such differencing features, which capture explicit light intensity information, are essential for predicting valid illumination. Consequently, RICD contributes to robust visibility enhancement of nighttime color images with a more naturalistic visual effect, as shown in Fig. 2.

Zhiqiang Yan, Jun Li, and Jian Yang are with Nanjing University of Science and Technology, China (yanzq, junli, csjyang@njust.edu.cn). Yupeng Zheng is with the Chinese Academy of Sciences, China (zhengyupeng2022@ia.ac.cn). Chongyi Li is with Nankai University, China (lichongyi25@gmail.com).
For complex illumination: However, even after applying RICD enhancement, the distribution of relative light intensity in nighttime color images is still much more complex than in daytime conditions. For instance, there are many terminator areas with varying illumination, which are difficult for standard convolutions to handle. Fortunately, inspired by the Local Binary Pattern operator [18], which is robust to illumination variation, a series of central convolution differencing algorithms [19]-[22] have been devised to address such challenging scenarios. Nevertheless, their differencing centers are typically fixed, leading to restricted applicability, especially for self-driving where safety is incredibly important. For example, when the center contains noise or lies on a terminator, these algorithms would introduce negative reference information, thus resulting in unsatisfactory illumination robustness. To tackle this issue, we propose Illumination-Affinitive Intra-Convolution Differencing (IAICD), which learns a reasonable differencing center within a single convolution. For one thing, IAICD can reduce the latent impact of noise and predict an adaptive differencing center based on the surrounding neighbors. For another, the illumination map estimated by RICD is involved to adaptively measure the contribution of each neighbor. As a result, IAICD can cope with the complex illumination in challenging nighttime scenarios. Finally, considering that nighttime depth estimation [24]-[26] is a highly relevant task, we further evaluate our model on it. In summary, our main contributions are as follows:
• For the first time, we extend the conventional depth completion task to challenging nighttime environments to complement safe self-driving applications.
• We identify the key challenge of nighttime depth completion as the guidance from color images, where visibility is low and illumination is complex. Thus we propose RICD and IAICD with learnable differencing centers, which are well suited to nighttime scenarios.
• We build two benchmark datasets for the nighttime depth completion task. Extensive experiments indicate the effectiveness of our method, reaching the state of the art.

II. RELATED WORK
Monocular depth perception at night. Monocular depth perception mainly consists of depth estimation [27] and completion [11]. Up to now, numerous depth estimation methods have been developed in both supervised [28], [29] and self-supervised [27], [30] ways for daytime scenarios. Recently, some depth estimation approaches [24]-[26], [31], [32] focus on nighttime conditions. Specifically, ADDS [25] proposes a domain-separated network to tackle the large day-night domain shift and illumination variation. RNW [24] introduces prior regularization and consistent image enhancement for stable training and brightness consistency, respectively. Further, to handle the challenges in underexposed and overexposed regions, STEPS [26] presents a new method that jointly learns a nighttime image enhancer and a depth estimator with an uncertain pixel masking strategy and a bridge-shaped curve. For depth completion, the majority of related works are applied in daytime scenarios, employing either supervised [4], [7], [9], [10], [33] or self-supervised [6], [34] manners. For example, RigNet [9] explores a repetitive design in the image-guided network branch to acquire clear guidance and structure. CFormer [7] couples convolution and vision transformer to leverage their local and global contexts. However, most of these methods perform poorly at night. Considering that the challenging nighttime environment is a vital component of self-driving, we attempt to develop a basic framework for the nighttime depth completion task to complement self-driving applications.
Differencing convolution. Vanilla convolution is commonly used to extract basic visual features in deep learning networks, but it is not very effective when processing scenes with varying illumination. Inspired by the Local Binary Pattern [18], which is robust to illumination variation, CDC [19] first introduces central differencing convolution to aggregate both intensity and gradient information. After that, many modified operators have been presented for various vision tasks [20]-[22], [35]. For example, C-CDC [36] extends CDC into dual-cross central differencing convolution via horizontal, vertical, and diagonal decomposition for face anti-spoofing. Further, PDC [21] proposes pixel differencing convolution to enhance gradient information for edge detection. Besides, SDN [22] introduces semantic similarity to build semantic differencing convolution for semantic segmentation. However, their fixed differencing centers are neither robust nor reasonable enough when they contain noise or lie on a terminator. Differently, our goal is to design learnable differencing centers that adapt to the neighboring pixels and the illumination distribution for safe self-driving at night.
Low-light image enhancement. Low-light images often suffer from severe noise, low brightness, low contrast, and color deviation [14], [15]. Thus, plenty of supervised [37], [38] and self-supervised [12], [13], [15] methods have been presented to restore the details. For example, the well-known histogram equalization (HE) [39] is a classic algorithm that strengthens the global contrast. However, the accuracy of HE-based approaches degrades as the background noise increases. As an alternative, Retinex-based methods [37], [40] perform better, under the assumption that a low-light image can be decomposed into illumination and reflectance. Recently, SCI [13] designs a self-supervised cascaded illumination estimation framework with extremely lightweight parameters. Nevertheless, these methods suffer from either low accuracy or poor robustness in challenging driving scenarios. Different from them, we employ recurrent differencing between paired convolutions to predict reasonable illumination.

III. METHOD

A. Prior Knowledge
Self-calibrated illumination (SCI). According to the Retinex theory [41], the relation between the low-light image x and the enhanced image x′ is formulated as:

x′ = x ⊘ m,  (1)

where m ∈ (0, 1] is the estimated illumination map, and ⊘ denotes pixel-wise division. On this basis, the lightweight self-supervised SCI [13] designs a self-calibrated module and an illumination estimator. Given the low-light input x, this estimator predicts the illumination map m via several 3 × 3 convolutions. The enhancement loss includes fidelity and smoothness terms, defined as:

L_f = (1/n) Σ_i (m_i − x_i)²,  L_s = Σ_i Σ_{j∈N(i)} G_{i,j} |m_i − m_j|,  (2)

where G_{i,j} is the weight of a Gaussian kernel, N(i) is a window centered at i with 5 × 5 adjacent pixels, and n denotes the number of valid pixels. L_f measures the similarity of m and x while L_s regularizes the consistency of m itself.

Vanilla convolution. We denote the frequently used 2D spatial convolution as vanilla convolution. For simplicity, we describe the convolution operator in 2D while ignoring the channel dimension. Given the input x, the output feature y produced by vanilla convolution is represented as:

y(p_0) = Σ_{p_n∈R} ω_{p_n} · x(p_0 + p_n),  (3)

where R is the local receptive field region sampled from x, ω_{p_n} is the convolution weight, p_0 is the current location, and p_n enumerates the locations in R. Fig. 3(a) shows a 3 × 3 kernel case.
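As a concrete illustration of Eq. 1, the pixel-wise division can be sketched in a few lines of NumPy (a minimal sketch of the Retinex relation only; the clipping floor `eps` and the [0, 1] value ranges are our assumptions, not part of SCI):

```python
import numpy as np

def retinex_enhance(x, m, eps=1e-3):
    """Eq. 1: x' = x / m (pixel-wise division).

    x is the low-light image in [0, 1]; m is the estimated illumination
    map in (0, 1]. Clipping m away from zero guards the division.
    """
    m = np.clip(m, eps, 1.0)
    return np.clip(x / m, 0.0, 1.0)

# A pixel under illumination 0.25 is brightened fourfold.
x = np.array([[0.1, 0.2]])
m = np.array([[0.25, 0.5]])
x_enh = retinex_enhance(x, m)  # both pixels recover to 0.4
```

The darker the estimated illumination at a pixel, the stronger the amplification, which is why an unreliable illumination map directly corrupts the enhanced image.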
Central differencing convolution. Based on vanilla convolution, [19] designs central differencing convolution (CDC), where every pixel of x in R subtracts the center pixel x(p_0):

y(p_0) = Σ_{p_n∈R} ω_{p_n} · (x(p_0 + p_n) − x(p_0)).  (4)

Fig. 3(b) illustrates the process of the above equation. Further, by combining Eqs. 3 and 4, one obtains a trade-off between vanilla convolution and CDC, defined as:

y(p_0) = (1 − θ) Σ_{p_n∈R} ω_{p_n} · x(p_0 + p_n) + θ Σ_{p_n∈R} ω_{p_n} · (x(p_0 + p_n) − x(p_0)),  (5)

where the first term is the vanilla convolution, the second is the central differencing term, and θ ∈ [0, 1] balances the two.
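The relation between Eqs. 3 and 4 can be checked numerically: expanding Eq. 4 shows that CDC equals the vanilla response minus x(p_0) times the kernel sum. The NumPy sketch below (our illustration, not the authors' code) verifies this identity at a single location:

```python
import numpy as np

def vanilla_conv_at(x, w, i, j):
    # Eq. 3: y(p0) = sum_{pn in R} w(pn) * x(p0 + pn), 3x3 field.
    patch = x[i-1:i+2, j-1:j+2]
    return float(np.sum(w * patch))

def cdc_at(x, w, i, j):
    # Eq. 4: every sampled pixel subtracts the fixed center x(p0).
    patch = x[i-1:i+2, j-1:j+2]
    return float(np.sum(w * (patch - x[i, j])))

rng = np.random.default_rng(0)
x = rng.random((5, 5))
w = rng.random((3, 3))

v = vanilla_conv_at(x, w, 2, 2)
c = cdc_at(x, w, 2, 2)
# Identity: CDC = vanilla - x(p0) * sum(w)
```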

B. Recurrent Inter-Convolution Differencing
Existing low-light image enhancement methods cannot restore very reasonable output in the more challenging nighttime self-driving scenarios. For example, in Fig. 2, SCI [13] suffers from serious color cast. To tackle this issue, in Fig. 3(c) we propose recurrent inter-convolution differencing (RICD). RICD first employs convolution subtraction [16] between two vanilla convolutions with different kernels to highlight the uncertainty of differently lit areas. Then it converts the uncertainty into illumination via recurrent convolution differencing. Suppose that R̄ is the larger local receptive field region while R̃ is the smaller one, and that R̄ and R̃ share the same current location p_0. One step of RICD can then be formulated as:

y(p_0) = Σ_{p_n∈R̄} ω̄_{p_n} · x(p_0 + p_n) − Σ_{p_m∈R̃} ω̃_{p_m} · x(p_0 + p_m).  (6)

One novel aspect of RICD is that it converts the uncertainty distribution into an illumination estimate. Besides, it introduces a new perspective that identifies the feature of the smaller-kernel convolution as the center of the feature of the larger-kernel convolution. The differencing center is thus dynamically learned from its local environment. These characteristics contribute to valid illumination prediction. Consequently, according to Eq. 1, RICD can restore robustly enhanced images.
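One RICD step can be sketched as the difference of two convolution responses sharing the same center location (a toy NumPy sketch with hand-set averaging kernels; in LDCNet both kernels are learned):

```python
import numpy as np

def conv2d_at(x, w, i, j):
    """Correlate an odd-sized kernel w with x, centered at (i, j)."""
    r = w.shape[0] // 2
    patch = x[i-r:i+r+1, j-r:j+r+1]
    return float(np.sum(w * patch))

def ricd_step(x, w_large, w_small, i, j):
    # One RICD step (Eq. 6): the small-kernel response serves as the
    # differencing center of the large-kernel response at the same p0.
    return conv2d_at(x, w_large, i, j) - conv2d_at(x, w_small, i, j)

# With normalized averaging kernels, the difference measures how the 3x3
# mean brightness deviates from the wider 5x5 mean, i.e. relative light
# intensity around p0; it vanishes on uniformly lit regions.
x = np.ones((7, 7))
x[3, 3] = 5.0  # a bright spot, e.g. a small light source
w5 = np.full((5, 5), 1 / 25)
w3 = np.full((3, 3), 1 / 9)
d = ricd_step(x, w5, w3, 3, 3)  # negative: center brighter than surround
```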

C. Illumination-Affinitive Intra-Convolution Differencing
Although RICD enhances the visibility of nighttime images, the relative light intensity caused by varying illumination is still much more complex than in daytime images. To handle this, we propose IAICD, which learns a reasonable differencing center within a single convolution from the neighboring pixels and the illumination map m estimated by RICD:

y(p_0) = Σ_{p_n∈R} ω_{p_n} · (x(p_0 + p_n) − x_c(p_0)),  x_c(p_0) = Σ_{p_n∈R} a_{p_n} · m(p_0 + p_n) · x(p_0 + p_n),  (7)

where a_{p_n} is the learnable aggregation weight of the neighbor at p_n and x_c(p_0) is the learned differencing center.

Compared with CDC, the differencing center predicted by IAICD is robust. For one thing, when the center x(p_0) contains noise, CDC would introduce abnormal differencing information, whilst IAICD can ignore x(p_0) or reduce its negative effect by assigning it a very small weight. For another, when x(p_0) lies in a terminator area, the fixed x(p_0) is no longer appropriate as the differencing center, because its light intensity differs significantly from its neighbors'. As an alternative, we integrate the corresponding illumination map to adjust the weight of each neighboring pixel.

When the illumination map is an all-ones matrix and the weight of every neighbor p_n (n ≠ 0) equals zero, IAICD degenerates into CDC. That is to say, IAICD in Eq. 7 is a generalized version of CDC in Eq. 4.
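The degeneracy to CDC described above can be checked with a small NumPy sketch (the exact form of the center aggregation, learnable weights times illumination times pixel values, is our assumption for illustration):

```python
import numpy as np

def cdc_at(x, w, i, j):
    # CDC (Eq. 4): fixed differencing center x(p0).
    patch = x[i-1:i+2, j-1:j+2]
    return float(np.sum(w * (patch - x[i, j])))

def iaicd_at(x, w, a, m, i, j):
    # IAICD sketch: the center is aggregated from the 3x3 neighbors,
    # weighted by learnable coefficients a and the illumination map m.
    patch = x[i-1:i+2, j-1:j+2]
    illum = m[i-1:i+2, j-1:j+2]
    center = float(np.sum(a * illum * patch))  # learned center
    return float(np.sum(w * (patch - center)))

rng = np.random.default_rng(1)
x = rng.random((5, 5))
w = rng.random((3, 3))

# Degenerate setting: all-ones illumination and zero weight on every
# neighbor p_n (n != 0) collapse the learned center to x(p0).
a = np.zeros((3, 3)); a[1, 1] = 1.0
m = np.ones((5, 5))
```

With these settings, `iaicd_at` and `cdc_at` return the same value at every location, matching the degeneracy claim.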

D. Learnable Differencing Center Network
Architecture. The pipeline of our learnable differencing center network (LDCNet) is illustrated in Fig. 4. Overall, LDCNet consists of an image guidance branch and a depth prediction branch. In the image guidance branch, the low-light image x is first fed into RICD, generating the enhanced image x′ and the illumination map m. Next, a simple UNet-like subnetwork Φ_c, composed of five layers with resolutions 1/1, 1/2, 1/4, 1/8, and 1/16, encodes x′. The features of each layer, together with m, are then fed into IAICD. In the depth prediction branch, the sparse depth d is encoded by a similar subnetwork Φ_d. Meanwhile, the output features of IAICD are leveraged at each resolution to guide the dense depth prediction in Φ_d, yielding the final depth output o.
Loss Function. Following previous depth completion methods [9], [42], we employ an L2 loss to supervise the output o with the ground-truth depth D:

L_d = (1/n) Σ_i 1(D_i > 0) · (o_i − D_i)²,  (8)

where 1(·) indicates valid ground-truth pixels. Finally, we jointly train the low-light image enhancement subnetwork and the depth prediction subnetwork by combining Eqs. 2 and 8, obtaining the total loss function:

L = L_d + α L_f + β L_s,  (9)

where α and β are hyper-parameters, set to 0.15 and 0.3 by default, respectively.
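A minimal sketch of the training objective, assuming a masked L2 depth term combined with the fidelity and smoothness terms of Eq. 2 (the precise combination is inferred from the text; α = 0.15 and β = 0.3 as stated):

```python
import numpy as np

def masked_l2(o, D):
    # Eq. 8: L2 loss computed only on valid ground-truth pixels (D > 0).
    mask = D > 0
    n = max(int(mask.sum()), 1)
    return float(np.sum((o[mask] - D[mask]) ** 2)) / n

def total_loss(l_d, l_f, l_s, alpha=0.15, beta=0.3):
    # Eq. 9: depth loss plus weighted enhancement terms from Eq. 2.
    return l_d + alpha * l_f + beta * l_s

o = np.array([[2.0, 3.0], [4.0, 0.5]])
D = np.array([[2.5, 0.0], [4.0, 1.0]])  # zeros mark missing ground truth
l_d = masked_l2(o, D)  # averages over the three valid pixels only
```

Masking matters because the ground-truth depth is sparse; pixels without LiDAR returns must not contribute to the loss.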
IV. EXPERIMENTS

A. Datasets and Implementation Details
RobotCar-Night-DC. Oxford RobotCar [43] is a large-scale dataset that captures various weather and traffic conditions along a route in central Oxford. We create RobotCar-Night-DC from the 2014-12-16-18-44-24 sequences by using the left color images of the front stereo camera (Bumblebee XB3). To generate sparse and groundtruth depth maps, we employ the official toolbox to process the data from the front LMS laser and INS sensors. Following the KITTI benchmark [11], we use the current frame for sparse depth generation and multiple frames for groundtruth depth creation. The densities of valid pixels in the sparse and groundtruth depth are about 4% and 16%, respectively. We then crop and resize these data to 576 × 320 to remove the car hood and enable efficient training. As a result, the RobotCar-Night-DC dataset contains 10,290 RGB-D pairs for training and 411 for testing.
CARLA-Night-DC. CARLA-EPE [26] is a synthetic dataset for the nighttime depth estimation task, generated by the CARLA simulator [44] and the EPE network [45]. The groundtruth depth in CARLA-EPE is almost fully dense, which is unrealistic for LiDAR-based self-driving systems where the depth density is around 7% [9]. Hence, based on the synthetic dataset we create CARLA-Night-DC for the proposed nighttime depth completion task, by transferring the sparse LiDAR pattern of KITTI [11] to CARLA-EPE. In total, CARLA-Night-DC is composed of 7,532 RGB-D pairs, of which 7,000 are for training and 532 for testing.

Implementation Details. We implement LDCNet in PyTorch on a single RTX 3090 GPU. We train it for 20 epochs with the Adam optimizer, momentum β₁ = 0.9, β₂ = 0.999, and weight decay 1 × 10⁻⁶. The initial learning rate is 1 × 10⁻³ and is halved every 5 epochs. We use synchronized cross-GPU batch normalization [46], resulting in a batch size of 12. Evaluation metrics are consistent with RNW [24] and KITTI [11]. RMSE is measured in meters.
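The learning-rate schedule above (initial 1 × 10⁻³, halved every 5 epochs) amounts to a simple step decay:

```python
def learning_rate(epoch, base_lr=1e-3, drop_every=5, factor=0.5):
    # Step decay: the LR is multiplied by 0.5 after every 5 epochs.
    return base_lr * factor ** (epoch // drop_every)

# Over the 20 training epochs the LR takes the values
# 1e-3, 5e-4, 2.5e-4, and 1.25e-4.
lrs = [learning_rate(e) for e in (0, 5, 10, 15)]
```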

B. Results
Nighttime depth estimation. We compare LDCNet with state-of-the-art nighttime methods, including MD2 [47], DeFeat-Net [32], ADFA [31], ADDS [25], RNW [24], WSGD [48], and STEPS [26]. Based on STEPS, we embed our RICD and IAICD into its image enhancement branch and depth estimation branch, respectively. From Tab. I we can observe that LDCNet achieves almost the lowest errors and the highest accuracy. On the RobotCar dataset, LDCNet is superior to the second-best STEPS in all aspects. Furthermore, LDCNet surpasses the well-known MD2 by large margins. For example, the RMSE of MD2 is reduced from 12.771m to 6.725m, almost a 47% improvement, whilst the accuracy δ₁ increases by 22.9 percentage points. On the CARLA dataset, LDCNet also performs better than the other three approaches. In addition, comparing results across RobotCar and CARLA, we observe that all methods perform worse on CARLA. This can be attributed to the darker color images and larger depth ranges of CARLA. Finally, the visual results in Fig. 5 show that LDCNet predicts more accurate depth maps with more complete and sharper edges, which further verifies the superiority and effectiveness of LDCNet.
Nighttime depth completion. For fair comparison, we retrain existing state-of-the-art daytime depth completion approaches in nighttime scenarios, including FusionNet [50], NCNN [33], pNCNN [49], S2D [3], NLSPN [4], GuideNet [42], RigNet [9], and CFormer [7]. The quantitative results are reported in Tab. II. Overall, LDCNet achieves the best performance on the two nighttime depth perception benchmarks. Specifically, on the RobotCar-Night-DC dataset, LDCNet is comprehensively superior to the other methods. On the CARLA-Night-DC dataset, the more challenging darker environment and greater distances result in poor performance for all methods; for example, the RMSE is at least 6m larger than on RobotCar-Night-DC. Additionally, we notice that NCNN, pNCNN, NLSPN, FusionNet, and CFormer, all of which estimate confidence maps to reweight depth, suffer from quite large RMSE and MAE. We believe the very low-light color images make it rather difficult to predict a credible confidence distribution, resulting in unstable depth refinement. Finally, from Fig. 6 we see that LDCNet recovers object depth more accurately, such as the cars, bus shelters, and buildings in the foreground, and the trees, light poles, and billboards in the background.

C. Ablation Study
For efficient ablation on RobotCar-Night-DC, we halve the size of the two subnetworks in LDCNet by setting the stride of the first-layer convolution to 2.
LDCNet. As reported in Tab. III, the baseline LDCNet-i first removes the RICD and IAICD modules. Then, as an alternative to IAICD, LDCNet-i incorporates the guidance module proposed in GuideNet [42]. When implementing our RICD design (LDCNet-ii), we discover that the two evaluation metrics are consistently improved, i.e., RMSE is reduced by 104mm.

RICD. The basic unit of RICD is the differencing between two convolutions with different kernels. Consequently, we ablate diverse kernel sizes in Tab. IV. Based on LDCNet-i, RICD-i, RICD-ii, and RICD-iii conduct (k + 2) × (k + 2) and k × k convolution differencing. We find that, as the kernel size increases, the two evaluation metrics decrease gradually. For example, the MAE of k = 5 is 109mm lower than that of k = 1. This is due to the learnable differencing center design, which regards the small-kernel-convolution feature as the center of the large-kernel-convolution feature. Such differencing convolutions with larger local receptive fields can predict a reliable illumination distribution by aggregating the surrounding light information. Further, RICD-iv increases the kernel size gap from 2 to 4. For one thing, it is clear that the 1 × 1 convolution of RICD-i is not very suitable as the differencing center because it cannot leverage ambient information; thus, RICD-iv performs better than RICD-i despite the larger gap. For another, with the larger size gap, the larger-kernel convolution introduces redundant light reference over long distances, while the smaller-kernel convolution can only capture the light in local regions; therefore, RICD-iv performs worse than RICD-ii and RICD-iii, which have a smaller size gap. In addition, based on RICD-ii, Fig. 7 (left) shows the ablation on RICD with different recurrent steps. We observe that RICD performs better as the step grows. As shown in Fig. 8, RICD can strengthen the representation of relative light intensity, contributing to more precise illumination. Finally, we select RICD-ii and step-3 as the default.

IAICD. Different from central differencing convolution (CDC) [19] with its fixed center, IAICD first aggregates all neighboring pixels and then employs the illumination-affinitive weight to produce its learnable center. Fig. 7 (right) shows that both of these strategies contribute to consistent improvement over vanilla convolution and CDC. Furthermore, to evaluate the robustness of IAICD, we introduce Gaussian noise into the raw color images. As can be seen, IAICD still performs better than CDC and achieves performance very close to that with raw color images. All of this evidence demonstrates the effectiveness and robustness of IAICD.

D. Generalization
Here we further evaluate the generalization capabilities of our LDCNet on both daytime depth completion [11] and low-light image enhancement [37] tasks.
Tab. V reports the comparison results on the KITTI depth completion dataset [11], which is collected during the daytime. We can observe that the performance of current state-of-the-art methods [7], [9], [10], [42] is very similar; for example, the ranking metric RMSE is around 730mm for all of them. Although our LDCNet is specifically designed for nighttime scenarios, it still achieves competitive performance on this daytime benchmark.
Based on the self-supervised SCI [13], which is trained for 600 epochs, we replace its illumination estimation module with our RICD block. From Tab. VI we can see that RICD consistently improves the baseline in both the no-reference NIQE [23] and DE [53] metrics and the full-reference PSNR and SSIM metrics. Furthermore, Fig. 9 demonstrates the superiority of our method again, i.e., higher quality with lower training cost.
V. CONCLUSION

In this paper, we extended the conventional depth completion task into nighttime environments to complement safe self-driving. We identified the key challenge as the guidance from color images with low visibility and complex illumination. As a result, we proposed RICD and IAICD to improve the poor visibility and reduce the negative influence of varying illumination, respectively. RICD predicts explicit global illumination to enhance visibility, treating the small-kernel-convolution feature as the center of the large-kernel-convolution feature from a new perspective. IAICD alleviates local relative light intensity, with a differencing center learned dynamically from the neighboring pixels and the illumination maps of RICD; the center is thus robust and illumination-affinitive. Finally, extensive experiments on depth perception datasets have verified the effectiveness of LDCNet.

Fig. 1. Visual results of different image-guided depth completion methods in daytime (first row) and nighttime (second row) scenarios.

Fig. 4. Learnable differencing center network (LDCNet). The low-light input is first fed into RICD to predict a credible image and a reasonable illumination map; based on both, IAICD is then applied to alleviate the negative influence of varying illumination.

Fig. 8. Feature comparison of vanilla convolution and our method.

TABLE II
RESULTS OF NIGHTTIME DEPTH COMPLETION ON ROBOTCAR-NIGHT-DC AND CARLA-NIGHT-DC.

TABLE III
ABLATION ON COMPONENTS OF LDCNET.

TABLE V
RESULTS ON KITTI DEPTH COMPLETION BENCHMARK.

TABLE VI
COMPARISON ON DIFFICULT TEST SPLIT OF SCI.