1 Introduction

Vehicle detection [1–4] is a fundamental step in various intelligent transportation tasks, including vehicle type recognition [5] and license plate detection [6–8]. Despite many successes in vehicle detection, many problems remain unsolved. In practice, identifying and detecting vehicles in dim light is a challenging task.

Low brightness and low contrast often hinder the extraction of target features from images and reduce object detection performance. Meanwhile, additional noise arises in dim light, which further degrades the structural information of the target in the image. Worse still for vehicle detection, illumination on the road fluctuates strongly because of vehicle lights and street lamps.

People usually choose image enhancement methods [9–15] to improve the quality of images captured in dim light. For example, RetinexNet [14] uses a decomposition network and an enhancement network to enhance dimly backlit images, exploiting the fact that the reflectance of an image is the same under different illuminations. However, a single enhancement can hardly handle vehicle images whose regions differ in brightness, owing to the uneven illumination caused by vehicle lights and street lamps; it may even overexpose some areas and further damage the features needed for detection. The multi-exposure approach used for high dynamic range (HDR) imaging is therefore a good reference: several shots are taken with different camera parameters and then fused, so that complementary features covering different exposure conditions can be learned for vehicle detection.

In addition, the purpose of enhancing images in dim light conditions is to improve the visual quality of the images, which is not the same as the goal of vehicle detection. For example, the smoothing operation commonly used for image enhancement may have an impact on features that are essential for detection. Therefore, the impact of image enhancement should be taken into consideration in the end-to-end vehicle detection network under dim light conditions.

Motivated by the above observations, this paper proposes a multi-exposure generation fusion network (MEGF-Net). As the name implies, the method consists of two parts. First, a generative network simulates the effect of multiple nonlinear exposures in HDR cameras, using historical information to preserve details in key areas for vehicle detection on the one hand, and generating images with progressively higher exposure on the other. Next, a fusion network smoothly fuses the multiple images to detect vehicles in dim light. From an end-to-end point of view, the multi-exposure fusion network (MEF-Net) provides vehicle detection information to guide the generation network in fine-tuning its parameters, so that the generation network is effectively optimized for the vehicle detection problem, while the multi-exposure generation network (MEG-Net) provides pre-trained enhanced images for the fusion network. In our experiments, we also control and optimize the additional parameters required, obtaining a significant performance gain with only a small increase in parameters.

To summarize, this paper makes the following contributions:

1) For the vehicle detection problem in dim light, we propose an end-to-end MEGF-Net, which couples image enhancement and vehicle detection together for joint training.

2) In MEG-Net, a single gated memory unit with two inputs and one output is used to address the over-smoothing and overexposure problems, and the network performance can be significantly improved by adding only a few parameters.

3) Our method has excellent performance in vehicle detection under dim light conditions on the ODDS dataset, reaching 78.09% mAP. It is 5.90% higher than the fine-tuned object detection baseline, and 2.91% higher than the state-of-the-art method BIMEF [15]. At the same time, the number of parameters and the detection time are basically unchanged.

2 Related works

2.1 Image enhancement under dim light conditions

Image enhancement under dim light conditions has been a longstanding challenge in computer vision, with early solutions often relying on local statistics or intensity mapping techniques such as histogram equalization [16–18] and gamma correction [19]. More recent approaches have focused on Retinex theory [14, 20], aiming to model the relationship between illumination and reflectance. For example, BIMEF [15] designed a weight matrix for image fusion using light estimation techniques and fused the input image with a synthetic image based on this weight matrix. The fusion process uses multi-scale Retinex decomposition to adjust the brightness and contrast of different exposure images, while pixel-level weighting retains the detailed information of each exposure. Deep learning has since further advanced image enhancement [9–13, 21]. Although state-of-the-art methods demonstrate superior results in enhancing dim-light images, the performance gain is still limited when they are applied to vehicle detection under dim light conditions. This is partly because image enhancement and vehicle detection have different goals, and a single enhancement network may struggle to address the uneven illumination within an image. Therefore, it is important to explore new approaches that effectively enhance images in dim light while also facilitating accurate vehicle detection.

2.2 Vehicle detection under dim light conditions

In the past few years, deep learning has emerged as a powerful technique for vehicle detection under dim light conditions. Li et al. [22] proposed a nighttime framework that employs multiple means to enhance nighttime vehicle information. First, the multi-exposure fusion algorithm BIMEF was used to improve the quality of nighttime images. Then, multi-scale salient feature maps of highlighted vehicle regions were combined with vehicle visual feature maps to enrich the vehicle information during detection. Finally, an integration algorithm was designed to combine multiple networks and provide richer vehicle visual features. Gao et al. [23] fused infrared and visible images to improve the detection of nighttime vehicles; they proposed a generative adversarial network (GAN) based detection model, named GF-detection, that fuses infrared and visible images with feature extraction strategies tailored to the differences between the two modalities. Similarly, Shao et al. [24] proposed a cascaded detection network framework, FteGanOd, in which the FTE module is built on CycleGAN and multi-scale feature fusion enhances nighttime vehicle features; the OD module, based on an existing object detection network, is cascaded with the FTE module to detect vehicles on the enhanced feature maps. Furthermore, Mo et al. [25] observed that vehicle highlights such as vehicle lights are highly reliable visual cues at night and proposed a novel framework for nighttime vehicle detection aided by this highlight information, together with a feature aggregation mechanism that combines multi-scale highlight features with the visual features of the vehicle to make full use of the highlight location cues. Our algorithm is inspired by the multi-exposure concept in BIMEF, but we perform end-to-end joint learning of image enhancement and vehicle detection, and we compare the performance with these existing methods.

3 MEGF-Net

As illustrated in Fig. 1, the proposed method consists of two modules: a multi-exposure generation net (MEG-Net) and a multi-exposure fusion net (MEF-Net). MEG-Net uses a neural network to imitate high dynamic range imaging, providing the fusion module with multiple enhanced images on which adaptive fusion and detection are performed during the multi-exposure fusion stage. MEF-Net provides vehicle detection information to guide MEG-Net in fine-tuning its parameters so that the generation network can be effectively optimized for the vehicle detection problem.

Figure 1: The proposed MEGF-Net architecture framework for vehicle detection under dim light conditions. CRM is short for convolutional recurrent modules, and Conv is short for convolutional module

3.1 MEG-Net

In MEG-Net, the easiest way to obtain images with different exposure levels from a dim-light input is to feed the previous image into a convolutional neural network to obtain the next one, forming a single-input, single-output structure. However, because illumination on dark roads with vehicle lights and street lamps is highly uneven, this strategy is likely to cause localized over-smoothing or overexposure, compromising the detailed information needed for the vehicle detection task.

To address this problem, a convolutional recurrent neural network (ConvRNN) [26] with two-stream progressive exposure is applied to the encoder of MEG-Net. On the one hand, each stage takes the information of the previous stage as input, giving the network a memory function: historical information maintains the details of key regions, prevents the network from forgetting more distant stages, and thus reduces the probability of extreme values at key positions in the image. On the other hand, within the same stage, the upper layers complete the encoding process by increasing the number of channel dimensions to extract features of the input image, and the decoder then restores the input size to generate an image with gradually higher exposure. This design effectively avoids the over-smoothing or overexposure caused by the single-input, single-output structure of conventional convolutional neural networks. As presented in Fig. 1, the encoder has multiple stages, each consisting of several convolutional recurrent modules (CRM) connected in series and trained with dual inputs. The encoding and decoding process of each stage can be represented by Eq. (1), which makes the output of one stage available as the input of the next.

$$\begin{aligned} ({I_{t}},{H_{t}}) = {D_{\theta }} \bigl({E_{\omega }}({I_{t - 1}},{H_{t - 1}})\bigr), \quad t = 1,2, \ldots ,T, \end{aligned}$$
(1)

where I represents the image at each stage, H represents the intermediate hidden state of the CRM modules, and \({D_{\theta }}\) and \({E_{\omega }}\) represent the decoder and encoder of MEG-Net with parameters θ and ω, respectively. t denotes the current stage.
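To make the stage recurrence of Eq. (1) concrete, the following PyTorch sketch unrolls T stages, each consuming the previous image and hidden state and emitting a more exposed image together with the updated state. The module structure and channel sizes are our illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class MEGStage(nn.Module):
    """One stage of Eq. (1): (I_t, H_t) = D_theta(E_omega(I_{t-1}, H_{t-1})).

    A minimal sketch: the encoder E_omega is stood in for by a plain
    convolution over the concatenated image and hidden state, and the
    decoder D_theta by a 1x1 convolution back to 3 image channels.
    """
    def __init__(self, hidden_ch: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3 + hidden_ch, hidden_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.decoder = nn.Conv2d(hidden_ch, 3, kernel_size=1)

    def forward(self, img_prev, hidden_prev):
        hidden = self.encoder(torch.cat([img_prev, hidden_prev], dim=1))
        img = torch.sigmoid(self.decoder(hidden))  # next, more exposed image
        return img, hidden

# Unrolling T = 4 stages from a dim-light input I_0 with H_0 = 0.
stage = MEGStage(hidden_ch=32)
img = torch.rand(1, 3, 256, 256)           # I_0
hidden = torch.zeros(1, 32, 256, 256)      # H_0 = 0
outputs = []
for t in range(4):
    img, hidden = stage(img, hidden)
    outputs.append(img)                    # I_1 ... I_T, progressively brighter
```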

The process of each encoder is illustrated in Eq. (2), and the input of each CRM consists of two parts, i.e., the previous layer of the current stage and the same layer of the previous stage.

$$\begin{aligned} H_{t}^{l} = {E^{l}} \bigl(H_{t}^{l - 1},H_{t - 1}^{l}\bigr),\quad l = 1,2, \ldots ,L, \end{aligned}$$
(2)

where \(H_{t}^{l}\) represents the l-th layer of the t-th stage in the CRM. The CRM is initialized with \({H_{0}} = 0\), and \({I_{0}}\) is the original input image. The input of the first layer is the output image of the previous stage, i.e., \(H_{t}^{0} = {I_{t - 1}}\).

Since vehicle detection is often the cornerstone of intelligent transportation systems and many downstream tasks rely on vehicle locations, it is crucial to design a lightweight ConvRNN structure that uses fewer parameters and less time without losing accuracy.

We design a lightweight CRM composed of a single gated memory unit and a few additional parameters to ensure that long-term key information is learned; the structure is illustrated in Fig. 2. Because the number of stages required by MEG-Net is much smaller than the sequence lengths typical of natural language processing, there is no need to use the common convolutional long short-term memory (ConvLSTM) network or convolutional gated recurrent unit (ConvGRU) network. The single gated memory unit has only about half the parameters of ConvLSTM, making it faster to train and use. The memory unit is calculated using Eqs. (3)–(5).

$$\begin{aligned} &f_{t}^{l}= \sigma \bigl(W_{f}^{l} * H_{t}^{l - 1} + U_{f}^{l} * H_{t - 1}^{l} + b_{f}^{l}\bigr), \end{aligned}$$
(3)
$$\begin{aligned} &\tilde{H}_{t}^{l}= \tanh \bigl(W_{h}^{l} * H_{t}^{l - 1} + U_{h}^{l} * \bigl(f_{t}^{l} \odot H_{t - 1}^{l}\bigr) + b_{h}^{l}\bigr), \end{aligned}$$
(4)
$$\begin{aligned} &\hat{H}_{t}^{l}= \bigl(1 - f_{t}^{l} \bigr) \odot H_{t - 1}^{l} + f_{t}^{l} \odot \tilde{H}_{t}^{l}, \end{aligned}$$
(5)

where f represents the single gate that determines the extent to which the history encoding is reset; its value is controlled by a sigmoid function. W and U denote the dilated convolution and the regular convolution in Fig. 2, respectively, and ⊙ denotes the Hadamard product. \(\hat{H}_{t}^{l}\) is the output of the single gated memory unit.
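A minimal PyTorch sketch of Eqs. (3)–(5) is given below; the channel count and dilation rate are assumptions, and the convolution names w_f, u_f, w_h, u_h mirror the W and U operators above.

```python
import torch
import torch.nn as nn

class SingleGatedCRM(nn.Module):
    """Single gated memory unit of Eqs. (3)-(5) (illustrative sketch).

    W_* are dilated convolutions over the current-stage input H_t^{l-1},
    U_* are regular convolutions over the previous-stage state H_{t-1}^l,
    and one gate f both resets the history and blends the candidate state,
    roughly halving the parameters of a ConvLSTM cell.
    """
    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        pad = dilation  # keeps spatial size for a 3x3 dilated kernel
        self.w_f = nn.Conv2d(channels, channels, 3, padding=pad, dilation=dilation)
        self.w_h = nn.Conv2d(channels, channels, 3, padding=pad, dilation=dilation)
        self.u_f = nn.Conv2d(channels, channels, 3, padding=1)
        self.u_h = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, h_cur, h_prev):
        f = torch.sigmoid(self.w_f(h_cur) + self.u_f(h_prev))          # Eq. (3)
        h_tilde = torch.tanh(self.w_h(h_cur) + self.u_h(f * h_prev))   # Eq. (4)
        return (1.0 - f) * h_prev + f * h_tilde                        # Eq. (5)
```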

Figure 2: Convolutional recurrent module internal structure. GMP and GAP represent global maximum pooling and global average pooling. CMP and CAP represent channel maximum pooling and channel average pooling

In addition, as shown in Eq. (6), the lightweight channel attention mechanism \({A_{c}}\) and the spatial attention mechanism \({A_{s}}\) are used within the CRM to focus on important features and suppress unnecessary ones [27].

$$\begin{aligned} H_{t}^{l} &= A_{s}^{l} \bigl(A_{c}^{l}\bigl(\hat{H}_{t}^{l} \bigr)\bigr). \end{aligned}$$
(6)

Specifically, in the channel dimension, global maximum pooling (GMP) and global average pooling (GAP) are applied in parallel to compress the spatial dimensions, and a shared multilayer perceptron produces the channel-wise weight calibration. In the spatial dimension, the channel-wise average and maximum values are taken separately for information extraction, and a convolution kernel of custom size then compresses the number of channels. Finally, a nonlinear function converts these responses into weight coefficients applied to the original feature map, completing the attention module.
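The following sketch illustrates one way to realize Eq. (6) in PyTorch, in the spirit of the CBAM-style attention cited in [27]; the reduction ratio and spatial kernel size are our assumptions.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Lightweight attention of Eq. (6) (a sketch).

    Channel branch: GMP and GAP in parallel, a shared two-layer MLP, then a
    sigmoid gate over the channels. Spatial branch: channel-wise max and mean
    maps, a kxk convolution, then a sigmoid gate over spatial positions.
    """
    def __init__(self, channels: int, reduction: int = 8, kernel_size: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        # Channel attention A_c: shared MLP over GAP and GMP descriptors.
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention A_s: channel-wise mean/max maps -> conv -> gate.
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```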

3.2 Pseudo supervised pre-training of MEG-Net

To achieve a better complementary end-to-end training, we first pre-train MEG-Net. However, in practice, it is difficult to collect image pairs with different exposures of the same scene at the same time. Therefore, we first need to generate pseudo ground-truth images with different exposure levels for images illuminated with dim backlight. We select the camera response model [15] to generate the required pseudo ground-truth images, which can well simulate the nonlinear function relationship between each pixel value of the image and the irradiance, and also has excellent scalability. The model can be expressed as:

$$ {P_{x}} = g({P_{0}},k) = {{ \mathrm{e}}^{b(1 - k_{x}^{a})}}P_{0}^{(k_{x}^{a})}, $$
(7)

where \({P_{x}}\) and \({P_{0}}\) are the pixel values of the desired image and the input image, the exposure rate \({k_{x}}\) is the ratio of their irradiance, and g is the exposure transformation function. The parameters \(a = -0.32\) and \(b = 1.12\) are internal, camera-dependent constants obtained by fitting all camera response curves in the DoRF dataset [28]. The parameter k corresponds to the desired exposure rate of the dataset images; its approximate value needs to be solved for each image and then aggregated statistically.
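A small NumPy sketch of Eq. (7) is shown below; the function name and the clipping of pixel values to [0, 1] are our additions, while a, b and the example k values follow the text (the k values are those listed in Sect. 4.1.2).

```python
import numpy as np

def camera_response(p0: np.ndarray, k: float, a: float = -0.32, b: float = 1.12) -> np.ndarray:
    """Pseudo exposure via the camera response model of Eq. (7).

    p0 is the input image normalized to [0, 1]; k is the desired exposure
    ratio; a and b are the camera-dependent constants fitted on DoRF [28].
    (Illustrative sketch; clipping keeps outputs in the valid range.)
    """
    gamma = k ** a
    px = np.exp(b * (1.0 - gamma)) * np.power(np.clip(p0, 1e-6, 1.0), gamma)
    return np.clip(px, 0.0, 1.0)

# Example: generate the four pseudo ground-truth exposures used for pre-training.
# exposures = [camera_response(img, k) for k in (3.8328, 4.0223, 4.3367, 6.9852)]
```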

After that, MEG-Net is pre-trained with pseudo supervision from the above model to generate images with the desired exposure. To measure the difference between the network-generated image \({I_{t}}\) and the pseudo exposure image \({\hat{I}_{t}}\), we need to construct a suitable loss function. Because of uneven light and darkness in the image, using only the common mean square error (MSE) loss would let overexposed outliers dominate and inflate the loss. Therefore, the first part of the loss combines MSE and mean absolute error (MAE) into a combined loss named SMAE. For locations with small errors, MSE makes the loss converge easily and stably, while MAE reduces the impact of outliers with large errors. As displayed in Eq. (8), the distance between the generated image \({{I_{t}}}\) and the pseudo ground-truth image \({{{\hat{I}}_{t}}}\) is calculated by SMAE. δ is a hyperparameter whose magnitude determines how the SMAE balances MSE and MAE. We use H to denote SMAE [29].

$$ H(I,\hat{I}) = \textstyle\begin{cases} \frac{1}{2}{(I - \hat{I})^{2}}, & \vert I - \hat{I} \vert < \delta , \\ \delta \vert I - \hat{I} \vert - \frac{1}{2}{\delta ^{2}}, & \vert I - \hat{I} \vert \ge \delta . \end{cases} $$
(8)

In addition to the pixel-wise regression loss, we also add a metric of the overall quality of the generated image to the loss function. The structural similarity (\(SSIM\)) index [30] reflects perceptual quality in the visual domain and is calculated using Eq. (9). Applying a Gaussian filter around pixel \({p_{t}}\) yields the variance δ and the mean μ; \({\mathrm{C}_{1}}\) and \({\mathrm{C}_{2}}\) are constants.

$$ SSIM = \frac{(2{\mu _{{p_{t}}}}{\mu _{{\hat{p}_{t}}}} + {\mathrm{C}_{1}})(2{\delta _{{p_{t}}{\hat{p}_{t}}}} + {\mathrm{C}_{2}})}{(\mu _{{p_{t}}}^{2} + \mu _{{\hat{p}_{t}}}^{2} + {\mathrm{C}_{1}})(\delta _{{p_{t}}}^{2} + \delta _{{\hat{p}_{t}}}^{2} + {\mathrm{C}_{2}})}. $$
(9)

Therefore, we finally adopt a loss function combining \(SSIM\) and SMAE, as presented in Eq. (10), where the \(SSIM\) at stage t is subtracted from one. N denotes the number of pixels in an image.

$$ {L_{\mathrm{pse}}}(I,\hat{I}) = \frac{1}{{TN}}\sum _{t} {\bigl(H({I_{t}} , {{\hat{I}}_{t}}) + 1 - SSI{M_{t}}\bigr)}. $$
(10)
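The pre-training loss of Eqs. (8)–(10) can be sketched as follows. Note that the SMAE of Eq. (8) coincides with the Huber loss, and the SSIM here uses simplified global image statistics rather than the per-pixel Gaussian filtering described above, so this is an approximation; δ is left as a free hyperparameter.

```python
import torch
import torch.nn as nn

def ssim_global(x: torch.Tensor, y: torch.Tensor,
                c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    """Simplified SSIM of Eq. (9) using whole-image statistics per sample."""
    mu_x, mu_y = x.mean(dim=(1, 2, 3)), y.mean(dim=(1, 2, 3))
    var_x, var_y = x.var(dim=(1, 2, 3)), y.var(dim=(1, 2, 3))
    cov = ((x - mu_x[:, None, None, None]) * (y - mu_y[:, None, None, None])).mean(dim=(1, 2, 3))
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def pretrain_loss(gen_imgs, pseudo_imgs, delta: float = 1.0):
    """L_pse of Eq. (10): SMAE (Huber, Eq. (8)) plus (1 - SSIM), averaged over stages."""
    smae = nn.HuberLoss(delta=delta, reduction="mean")
    loss = 0.0
    for i_t, i_hat in zip(gen_imgs, pseudo_imgs):   # one pair per stage t
        loss = loss + smae(i_t, i_hat) + (1.0 - ssim_global(i_t, i_hat)).mean()
    return loss / len(gen_imgs)
```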

Pseudo supervised pre-training thus makes our model scalable without relying on image pairs of different luminance, and it also underpins the performance of the final vehicle detection in dim light.

3.3 MEF-Net

For MEF-Net, we choose a commonly used convolutional object detection framework as the basis of the network. MEG-Net can then be placed in front of MEF-Net by handling only the channel dimensions at the connection between the two networks, which enables end-to-end learning and makes our method compatible with any detection framework.

In the object detection literature, a one-stage detector has no region proposal network compared with a two-stage network; it is not only faster but also better suited to the chips embedded in cameras. Therefore, we use the one-stage detector RetinaNet for training, whose focal loss is helpful for detecting difficult vehicles in the dark. In addition, we use a variant of the residual network as the backbone of RetinaNet. Compared with the most commonly used ResNet [31], this variant moves the batch normalization (BN) and ReLU operations before the convolutional layer (see Fig. 3), making it pre-activated and easier to train.

Figure 3: The upper figure shows the structure of RetinaNet and the object detection framework used in MEF-Net. The lower figure shows the selected variant of ResNet as the backbone of RetinaNet. K is the number of classes, and A is the number of anchors

At the interface between MEF-Net and MEG-Net, we modify the first layer of the detector framework and expand the channel dimension of its convolutional kernel [32], which allows the detector to process the multiple images from the generation module simultaneously and perform adaptive fusion. Specifically, for the pre-trained vehicle detector, we replicate the first-layer convolutional kernel weights T times and normalize them, so that MEF-Net has the same channel dimension as the generative network while maintaining better differentiation and complementary region information across the different pseudo exposures. For RetinaNet, after expanding the channel dimension of the first convolutional layer in Fig. 3, we can plug it into the whole framework displayed in Fig. 1 to implement the multi-exposure fusion function.
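A hedged sketch of the first-layer kernel expansion is shown below: the pre-trained 3-channel kernel is replicated T times along the input-channel axis and scaled by 1/T as a normalization (the exact normalization scheme of [32] may differ).

```python
import torch
import torch.nn as nn

def expand_first_conv(conv: nn.Conv2d, t: int) -> nn.Conv2d:
    """Expand a pre-trained 3-channel first conv layer to 3*T input channels.

    Sketch of the kernel-expansion idea: replicate the pre-trained weights
    T times along the input-channel axis and divide by T so the expected
    activation magnitude is preserved for the fused exposures.
    """
    new_conv = nn.Conv2d(
        in_channels=conv.in_channels * t,
        out_channels=conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight.copy_(conv.weight.repeat(1, t, 1, 1) / t)  # replicate + normalize
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv
```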

In addition, since the final output of MEG-Net is T three-channel images, we feed the generated images into a squeeze-and-excitation network (SENet) [33] before they enter the vehicle detector, so that it acts as a “bridge” connecting the two networks. This allows the learning process to use global information to obtain channel-wise weights when the 3T channels are fused, strengthening useful channel features and weakening useless ones.
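The SENet bridge over the 3T stacked channels can be sketched as follows; the reduction ratio is our assumption.

```python
import torch
import torch.nn as nn

class SEBridge(nn.Module):
    """Squeeze-and-excitation 'bridge' [33] over the 3*T stacked exposures (sketch).

    Global average pooling summarizes each channel; a two-layer bottleneck MLP
    produces per-channel weights that strengthen useful exposures and suppress
    redundant ones before the stack enters the detector's first layer.
    """
    def __init__(self, t: int = 4, reduction: int = 4):
        super().__init__()
        channels = 3 * t
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, exposures):
        # exposures: list of T tensors of shape (B, 3, H, W) from MEG-Net.
        x = torch.cat(exposures, dim=1)          # (B, 3T, H, W)
        w = self.fc(x.mean(dim=(2, 3)))          # channel weights, shape (B, 3T)
        return x * w[:, :, None, None]
```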

In this phase of vehicle detection, the loss function consists of two parts, confidence loss \({L_{\mathrm{conf}}}\) and regression loss \({L_{\mathrm{reg}}}\).

$$ {L_{\mathrm{total}}} = \frac{1}{{N_{a}}}\sum_{i} {L_{\mathrm{conf}}}({p_{i}},{\hat{p}_{i}}) + \frac{\lambda }{{N_{p}}}\sum_{i} {\hat{p}_{i}} {L_{\mathrm{reg}}}({g_{i}},{\hat{g}_{i}}), $$
(11)

where \({{N_{a}}}\) represents the total number of samples, \({{N_{p}}}\) represents the number of positive samples, and the parameter λ balances \({L_{\mathrm{conf}}}\) against \({L_{\mathrm{reg}}}\). \({{{\hat{p}}_{i}}}\) is 1 for a vehicle and 0 for the background, and \({{p_{i}}}\) is the confidence score between 0 and 1. \({{g_{i}}}\) denotes the vector of the four coordinates of the predicted vehicle box, and \({{{\hat{g}}_{i}}}\) is the ground-truth bounding box.

\({L_{\mathrm{conf}}}\) is a binary classification loss. The proportion of vehicle targets in the dataset is much smaller than that of background, so the anchors are predominantly negative samples. The focal loss we select adds two dynamic scaling factors to the binary cross-entropy, which significantly reduces the loss contributed by easily distinguishable negative samples during learning and focuses the loss on positive samples that are difficult to distinguish. This is critical for detecting difficult objects under dim light conditions. The confidence loss is given by Eq. (12), where the parameters α and γ act as double balancing factors that smoothly down-weight simple samples, allowing the loss function to focus on vehicle samples that are hard to detect in dim light.

$$ {L_{\mathrm{conf}}}({p_{i}},{\hat{p}_{i}}) = \textstyle\begin{cases} - \alpha {(1 - {p_{i}})^{\gamma }}\log ({p_{i}}), & {{ \hat{p}}_{i}} = 1, \\ - (1 - \alpha ){p_{i}}^{\gamma }\log (1 - {p_{i}}), & {{\hat{p}}_{i}} = 0. \end{cases} $$
(12)
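A sketch of the confidence loss of Eq. (12) is given below; the small epsilon for numerical stability is our addition, and α and γ follow the settings in Sect. 4.1.4. Normalization by the number of anchors, as in Eq. (11), is left to the caller.

```python
import torch

def focal_conf_loss(p: torch.Tensor, p_hat: torch.Tensor,
                    alpha: float = 0.25, gamma: float = 3.0) -> torch.Tensor:
    """Confidence loss of Eq. (12) (focal loss), written as a sketch.

    p is the predicted confidence in (0, 1); p_hat is the 1/0 vehicle label.
    """
    eps = 1e-6
    pos = -alpha * (1.0 - p) ** gamma * torch.log(p + eps)          # p_hat = 1
    neg = -(1.0 - alpha) * p ** gamma * torch.log(1.0 - p + eps)    # p_hat = 0
    return torch.where(p_hat > 0.5, pos, neg).sum()
```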

The regression loss is the other part of the loss function and is computed over positive samples only. We use the SMAE loss to calculate the distance between \({{g_{i}}}\) and \({{{\hat{g}}_{i}}}\), as expressed by Eq. (13):

$$ {L_{\mathrm{reg}}}({g_{i}},{\hat{g}_{i}}) = \sum_{j \in \{ x,y,h,w \} } {H\bigl(g_{i}^{(j)} - \hat{g}_{i}^{(j)}\bigr)}. $$
(13)

Finally, MEG-Net and MEF-Net are tied together for joint training so that the two parts form an end-to-end learning system. From a functional perspective, MEF-Net provides vehicle detection information to guide the generative network in fine-tuning its parameters so that vehicle regions can be specifically enhanced for detection, while MEG-Net provides pre-trained enhanced images for the fusion network, which performs adaptive fusion and detection after the channel-dimension attention mechanism. By coupling the two parts for training, we achieve detailed interaction and feature fusion under different lighting conditions, and end-to-end learning for vehicle detection under abnormal lighting.

4 Evaluation

4.1 Setup

4.1.1 Datasets and evaluation metrics

We select 15,000 images of vehicles under darkness from surveillance photos as our ODDS (object detection under dim light surveillance) dataset. Each of these images has from 1 to 23 labeled vehicles. The size of the labeled vehicles also has a relatively large variance. We divide the training and test sets in a 9:1 ratio, and use 10-fold cross-validation to split a portion of the training set as the validation set. In addition, we also use the large-scale normal-light vehicle dataset UA-DETRAC [34] to pre-train the object detection framework. We select 80,000 photos and more than 500,000 vehicles for pre-training. For the evaluation metrics, the overall performance of the vehicle detection model is measured using mean average precision (mAP), which is calculated as the area under the precision-recall curve.

4.1.2 Pre-training settings

In the pre-training phase, we first set the number of modules for MEG-Net: the number of stages is set to 4, with 4 coding modules per stage. For all nighttime images in the dataset, we solve for the optimal exposure rate and collect the statistics; the resulting histogram is plotted in Fig. 4. We choose the first quartile, second quartile, third quartile and maximum value of the distribution as the four required exposure rates, and set the four stages in the following order: \({{k_{1}}} = 3.8328\), \({{k_{2}}} = 4.0223\), \({{k_{3}}} = 4.3367\) and \({{k_{4}}} = 6.9852\). These four k values are substituted into Eq. (7) to generate the pseudo exposure images. Because the images generated by the camera response model unavoidably contain noise and artifacts, early stopping [35] is used during pre-training to avoid learning these unfavorable factors; specifically, we stop pre-training when the average peak signal-to-noise ratio (PSNR) between \({{I_{t}}}\) and \({{{\hat{I}}_{t}}}\) reaches 25. The deep learning part is implemented with the PyTorch library.
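The exposure-rate selection and the PSNR-based early-stopping criterion described above can be sketched as follows; the function names are our own, while the quartile rule and the threshold of 25 follow the text.

```python
import numpy as np

def exposure_levels(best_ks: np.ndarray) -> list:
    """Pick the four exposure rates as in Sect. 4.1.2 (sketch).

    best_ks holds the per-image optimal exposure rates solved over the
    dataset; the first, second and third quartiles plus the maximum give
    the four stage targets k_1..k_4.
    """
    return [np.percentile(best_ks, 25), np.percentile(best_ks, 50),
            np.percentile(best_ks, 75), float(best_ks.max())]

def psnr(x: np.ndarray, y: np.ndarray, peak: float = 1.0) -> float:
    """PSNR used for the early-stopping criterion: stop pre-training once the
    average PSNR of generated vs. pseudo exposures reaches 25."""
    mse = np.mean((x - y) ** 2)
    return 10.0 * np.log10(peak ** 2 / (mse + 1e-12))
```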

Figure 4: Statistical histogram of the best exposure of ODDS

4.1.3 Data augmentation

We use several data augmentation methods in our experiments, choosing three common ones. First, images in the dataset are flipped horizontally at random with a probability of 50%. Then, the images are rotated randomly clockwise or counterclockwise within a range of 15 degrees. Finally, the images are randomly cropped. These methods increase the effective size of the dataset.
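For illustration, the three augmentations can be expressed with torchvision transforms as below; the crop size is our assumption, and for detection the bounding boxes must be transformed consistently with the images (not shown here).

```python
import torchvision.transforms as T

# A sketch of the three augmentations described above (image side only).
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),               # 50% horizontal flip
    T.RandomRotation(degrees=15),                # random rotation within +/-15 degrees
    T.RandomCrop(size=512, pad_if_needed=True),  # random crop (size assumed)
    T.ToTensor(),
])
```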

4.1.4 Training parameter setting

For the anchors used in detection, we set the anchor scales to 8, 16, 32, 64, 128 and 256, and the anchor ratios to 0.75, 1.00 and 1.33 according to the actual situation. These anchors cover a very wide range of vehicle sizes.
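For reference, the anchor shapes implied by these scales and ratios can be enumerated as in the sketch below; how the ratio maps to width and height is our assumption.

```python
import itertools
import numpy as np

def make_anchors(scales=(8, 16, 32, 64, 128, 256), ratios=(0.75, 1.00, 1.33)):
    """Enumerate the (width, height) anchor shapes from the listed scales and
    aspect ratios (a sketch; area is kept at scale**2 for each ratio)."""
    anchors = []
    for scale, ratio in itertools.product(scales, ratios):
        h = scale * np.sqrt(ratio)
        w = scale / np.sqrt(ratio)
        anchors.append((w, h))
    return anchors  # 18 anchor shapes covering a wide range of vehicle sizes
```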

For model training, we use an initial learning rate of 0.01 with a learning rate decay factor of 0.1. We set the batch size to 16 and the number of epochs to 160, and use the rectified Adam optimizer [36]. The focal loss parameters are set to \(\alpha =0.25\) and \(\gamma =3\). The experiments are run on two Tesla V100 GPUs.

4.2 Performance of the proposed method

We pre-train RetinaNet on the UA-DETRAC dataset and then fine-tune it on the ODDS dataset; the resulting model is defined as the baseline. Our method is compared with this fine-tuned baseline in terms of the number of parameters and overall accuracy. The qualitative comparison is presented in Fig. 5, where our method detects more vehicles in the dark. As shown in Fig. 6, our method yields a significant improvement in mAP, which increases by 5.90% to 78.09%. In terms of model size, the lightweight convolutional recurrent module adds only 0.49 M parameters to the original 33.60 M.

Figure 5: Qualitative comparison of vehicle detection in dim light. The left side shows the result of the fine-tuned baseline; the right side shows the result of our method. The number in the lower-right corner of each image indicates the total number of vehicles detected in darkness

Figure 6: Quantitative results of different methods. Pre-trained versions are suffixed with “p” and fine-tuned versions with “f” for all methods except ours

Meanwhile, we compare MEGF-Net with state-of-the-art methods for improving the quality of dimly backlit images. We select LLNet [9], KinD [10], RRDNet [11], DeepUPE [12], DSLR [13], RetinexNet [14] and BIMEF [15] to pre-process the images. These are all image enhancement methods separate from object detection, rather than end-to-end training like ours. The quantitative comparison of the different methods with the RetinaNet detector is visualized in Fig. 6. The pre-trained baseline reaches 38.51% and the fine-tuned baseline reaches 72.19%. All fine-tuned methods show a higher mAP than the pre-trained results, which also reflects the large difference in light distribution between the normal-light UA-DETRAC dataset and the dim-light ODDS dataset. With pre-processing only, the mAP using LLNet, DeepUPE, DSLR and BIMEF improves by 4.55%, 6.72%, 5.52% and 8.12% over the baseline. After fine-tuning on ODDS, the mAP of LLNet, DeepUPE, DSLR and BIMEF improves by 1.79%, 2.27%, 2.72% and 2.99%, respectively, over the baseline. Because fine-tuning greatly reduces the gap between normal-light and dim-light data, these methods obtain slightly smaller performance gains under fine-tuning than under pre-training.

It is worth noting that KinD and RetinexNet degrade the performance in some way. This may be caused by excessive smoothing (KinD) or artifacts (RetinexNet) in the vehicle detection region when the image is enhanced.

From Fig. 6, we can see that the multiple exposure method BIMEF [15] is second to our method and superior to other image enhancement methods. This also verifies that it is reasonable and effective to use the multiple exposure method in our detection task. Finally, it is easy to see that our mAP values are higher than any of the methods for image enhancement, which also verifies that this task performs better using an end-to-end method than a method that separates enhancement from detection.

We also replace RetinaNet with other detectors for vehicle detection in darkness. As illustrated in Fig. 7, the detectors are replaced with YOLOv3 [37] and FPN [38]. The mAP of the pre-trained baseline for these two detectors reaches 36.46% and 35.56%, respectively, and the fine-tuned baseline reaches 69.97% and 69.33%. Within the framework of our method, the mAP values reach 76.18% and 75.83%. RetinaNet thus gives the best performance in our method, which also verifies that the focal loss is critical for detecting difficult vehicle samples under low illumination. In summary, we validate the superiority of our method from different perspectives, and the high mAP shows that it outperforms existing state-of-the-art techniques.

Figure 7: Quantitative analysis of different object detection methods. Versions that apply the framework proposed in this paper are suffixed with “o”; pre-trained and fine-tuned versions are suffixed with “p” and “f”, respectively

4.3 Ablation studies

Channel dimension attention

We add the channel-dimension attention mechanism SENet between MEG-Net and MEF-Net and perform ablation experiments on this component; the results are displayed in the first row of Table 1. Adding the channel-dimension attention mechanism increases the mAP by 0.38%.

Table 1 Results of ablation experiments performed on different parts of the method in this paper. CA: channel dimension attention; JT: joint training; SG: single gated memory unit; AM: attention mechanism within the convolutional recurrent module; KE: convolutional kernel expansion

Joint training

For different lighting conditions, the exposure required for the images is also different. Meanwhile, before end-to-end training, MEG-Net cannot determine which features will be helpful for vehicle recognition, and MEF-Net can provide assistance for it. To verify the effectiveness of joint training, we freeze the weight values of MEG-Net after pre-training, i.e., the final training only fine-tunes the weight values of MEF-Net. The comparison results are summarized in the second row of Table 1. The value of mAP becomes 73.89%, which is much lower than 78.09%. This verifies that end-to-end joint training is essential.

Single gated memory unit

To verify the performance of the proposed single gated memory unit, we perform ablation experiments on this component, replacing all the coding modules of MEG-Net with convolutional recurrent networks without gating. The results are presented in the third row of Table 1: using the single gated memory unit increases performance by 3.24%.

Attention mechanism within the convolutional recurrent module

Within the convolutional recurrent module of MEG-Net, we include lightweight channel-dimensional and spatial-dimensional attention mechanisms. We remove all the attention mechanisms within the module for comparison in the ablation experiments. The results are shown in the fourth row of Table 1. This part of mixed attention can improve the mAP of the network by 0.88%.

Convolutional kernel expansion

We use convolutional kernel channel-dimension expansion for the first layer of the detector in MEF-Net to handle the interface between the two networks, achieving adaptive fusion by replicating the first-layer parameters T times and normalizing them. As an alternative, we feed the images into separate detectors and fuse the resulting detection boxes, i.e., a post-fusion method for handling the channel-dimension interface. The results of this ablation experiment are presented in the fifth row of Table 1, showing that the kernel-expansion method improves performance by 0.46%.

Model design of MEG-Net

We designed several modules for comparison with our method to prove that our proposed two-input single-output recurrent module is effective, as illustrated in Fig. 8. The methods are listed below.

Figure 8: Different model designs of MEG-Net

1) Branched exposure generation (BEG). The stages are parallel to each other and do not affect each other, which generates images with different exposures according to \({I_{0}}\).

2) Serial exposure generation (SEG). The input of this stage is the output of the previous stage, but the difference with our approach is that the two-stream progressive exposure input strategy is not used in the encoding stage.

3) Recurrent exposure generation (REG). This is the training method proposed in this paper, which uses a two-stream progressive exposure input approach to avoid information loss due to over smoothing or over exposure.

We conduct experiments on each of the three designs above, and the results are presented in Table 2. All designs add only a small number of parameters, but they improve the detection results to different degrees. Specifically, the BEG module improves image quality, which verifies that MEF-Net can provide important guidance for the enhancement module and generate complementary information across different exposures. Meanwhile, SEG produces a more erratic performance gain, probably because overexposure or over-smoothing in the intermediate stages loses critical information. The proposed REG design performs best among the comparisons, which validates that a two-stream progressive exposure input structure with memory capability handles this task effectively. Finally, we visualize the four intermediate multi-exposure outputs of the camera response model, BEG, SEG and REG in Fig. 9, providing visual support for our method.

Figure 9: Four intermediate exposure result images for different methods. The intermediate images are generated by the camera response model (CRM); the trained REG, SEG and BEG networks are also used to show their outcomes

Table 2 Results of the ablation study on the proposed MEG-Net. Params represent the number of parameters for the entire MEGF-Net

4.4 Hyper-parameter analysis

For MEG-Net, we choose four stages as the number of multi-exposures and four convolutional recurrent modules as the encoder of each stage. In general, increasing the number of stages not only increases the number of parameters and the computational cost of the model, but also requires more space to generate additional pseudo-exposure images in advance to support learning, while increasing the number of encoders per stage may cause overfitting due to the greater depth. To select these numbers, we conduct a comparison experiment; the results are presented in Fig. 10. Trading off accuracy against efficiency, we finally choose a 4 × 4 coding structure. In addition, when the number of stages of MEG-Net is 1, our network degenerates into a single-exposure enhanced detection network; the results in Fig. 10 also verify that our multi-exposure method improves mAP by approximately 2% compared with the single-exposure setting.

Figure 10: Hyper-parameter sensitivity studies

5 Conclusion

In this paper, we propose an end-to-end method to improve vehicle detection performance under dim light conditions through a multi-exposure generation and fusion network. To address the uneven light and darkness encountered when detecting vehicles in darkness, we introduce a single gated memory unit with two-stream progressive exposure input in MEG-Net and perform ablation experiments on this component. The experimental results demonstrate that end-to-end joint training boosts mAP to 78.09% on the ODDS dataset, a dataset with poor lighting conditions, while the number of parameters remains almost constant. Our method outperforms a variety of common image enhancement methods, and the ablation experiments on each component also confirm the rationality of each module in the network.