1 Introduction

Low-light image enhancement improves the quality of photographs and videos captured in low-light environments. It increases the visibility and expressiveness of images and has broad application prospects in fields such as aerial photography and mining [1, 2]. Existing low-light enhancement algorithms can be divided into three categories: distribution-based mapping methods, model optimization methods, and deep learning methods. Distribution-based mapping methods, represented by histogram equalization and curve transformation, are prone to color distortion [3]. Model optimization-based methods, represented by the Retinex model, obtain the ideal image by removing the illumination from the low-light input; they reduce problems such as detail loss and color distortion, but their handling of brightly lit images needs further optimization [4, 5]. Deep learning-based methods, represented by Retinex-Net and EnlightenGAN, convert low-light images into normal-light images by training neural network models [6, 7]. Existing deep learning methods fall into two types: luminance enhancement, and joint luminance enhancement and denoising. Luminance enhancement methods raise the image luminance to reveal more structure and content. For example, Li et al. [6] proposed LightenNet, a convolutional neural network for low-light image enhancement whose structure is designed according to Retinex theory; however, the method further amplifies the noise present in real scenes.
To address this problem, researchers have proposed joint luminance enhancement and denoising methods, which combine luminance enhancement with image denoising to suppress noise and improve visual quality [8,9,10,11,12,13,14,15,16,17,18,19]. For example, Chen et al. [18] proposed the data-driven Retinex-Net, a deep network that combines image decomposition with successive enhancement operations and denoises the reflectance using BM3D. Existing low-light image enhancement methods have thus made good progress in luminance enhancement; however, the noise amplification and color bias that arise in real scenes need further investigation [20].

To address the color bias and noise amplification of existing low-light image enhancement algorithms, this paper proposes a low-light image enhancement algorithm that incorporates a channel attention mechanism and a multi-scale pyramid. First, DecomNet is designed to decompose normal-light and low-light images into reflectance and illumination components. Second, the reflectance is adjusted using the UMSIA module to make the reflectance features more structured, and the five-layer feature pyramid and kernel selection in the PRID-net module fuse the contextual information between feature layers of different scales and remove the amplified noise. Finally, the illumination ratio is introduced, and the illumination is adjusted using multi-scale cascading and channel attention mechanisms to enhance the brightness, texture, and other features of the image.

2 Low-light image enhancement model based on DMPH-Net

The traditional Retinex theory is based on color constancy: the human eye’s perception of color depends not only on the surface color of the object itself but also on the surrounding illumination. Retinex theory therefore models human color perception by separating reflectance from illumination, so that a given image can be expressed as [5]:

$$\begin{aligned} S = R \circ I \end{aligned}$$
(1)

where S denotes the given image, R denotes the reflectance image of the object, I denotes the illumination (incident light) image, and \( \circ \) denotes element-wise multiplication. A low-light image is enhanced by adjusting the illumination and the reflectance; however, the traditional Retinex algorithm is sensitive to noise, and problems such as color shift occur when it processes color.
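As an illustration of Eq. (1) and of the noise sensitivity mentioned above, the following minimal sketch composes a toy single-channel image from a reflectance map and a dim illumination map, then inverts the model. The function names, the `eps` guard, and the toy values are assumptions made for this example and are not part of the original method.

```python
def compose(reflectance, illumination):
    """Eq. (1): element-wise product S = R ∘ I for single-channel H x W nested lists."""
    return [[r * i for r, i in zip(r_row, i_row)]
            for r_row, i_row in zip(reflectance, illumination)]


def recover_reflectance(image, illumination, eps=1e-6):
    """Invert Eq. (1) as R = S / I; eps (an assumed guard) avoids division by zero."""
    return [[s / max(i, eps) for s, i in zip(s_row, i_row)]
            for s_row, i_row in zip(image, illumination)]


R = [[0.8, 0.4], [0.2, 0.6]]        # toy reflectance
I_low = [[0.1, 0.1], [0.1, 0.1]]    # dim, uniform illumination
S_low = compose(R, I_low)           # simulated low-light image
R_back = recover_reflectance(S_low, I_low)

# Additive noise of 0.01 in S is scaled by 1 / I = 10 in the recovered R.
noisy = [[s + 0.01 for s in row] for row in S_low]
R_noisy = recover_reflectance(noisy, I_low)
noise_in_R = R_noisy[0][0] - R_back[0][0]
```

Because the reflectance is obtained by dividing by a small illumination value, any noise in the low-light input is magnified by the same factor, which is one way to see why a dedicated denoising stage for the reflectance is useful.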

Fig. 1 DMPH-Net network diagram

To improve the quality of low-light images while overcoming the color distortion and blurring that occur in the traditional Retinex algorithm, this paper proposes the DMPH-Net algorithm, which incorporates an attention mechanism and a multi-scale pyramid. The overall structure of the algorithm is shown in Fig. 1. The algorithm has three main modules: the image decomposition module DecomNet, the reflectance restoration module Restoration_Net, and the illumination adjustment module Illumination_Adjust_Net.

2.1 Image decomposition net

In the image decomposition module based on the Retinex theory, algorithms such as Retinex-Net and KinD typically employ the same network structure to generate the reflectance and illumination components of an image. The illumination component characterizes the image’s structural information and requires the extraction of deep-level features. The reflectance component, on the other hand, represents the image’s color and texture information and requires both deep-level and shallow-level features. When the same network structure generates both components, it is difficult to obtain accurate semantic information for each, and such algorithms therefore produce poor-quality image recovery. To address these issues, this paper fuses two networks, Deep and Unet, to achieve high-quality image recovery. The structure of DecomNet is shown in Fig. 2. After the image is input, Unet uses downsampling to capture the contextual information in the image and upsampling (transposed convolution) to obtain larger feature maps; the deep features are fused with the shallow features through feature concatenation to enrich the information, yielding the reflectance component. In Deep, the maximum of the input image over the channel dimension is computed (keeping the dimensionality unchanged) and fused with the input image, which directs the model’s attention to the brightness of the largest channel component and helps it discover latent brightness features; stacked network layers then produce the feature information of the illumination. The two network layers share parameters.
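The channel-maximum step described for the Deep branch can be sketched as follows. This is only an illustration under the assumption that "fused with the input image" means concatenation along the channel axis; the function name and the nested-list image layout are hypothetical.

```python
def append_max_channel(image):
    """Append the per-pixel channel maximum as an extra channel.

    image: H x W nested list of [r, g, b] pixels.
    Returns an H x W nested list of [r, g, b, max(r, g, b)] pixels
    (concatenation is an assumed reading of the fusion step).
    """
    return [[pixel + [max(pixel)] for pixel in row] for row in image]


tiny = [[[0.1, 0.5, 0.3], [0.9, 0.2, 0.4]]]   # a 1 x 2 RGB image
fused = append_max_channel(tiny)
```

The extra channel carries the brightest color component at each pixel, which is a rough per-pixel cue for the illumination.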

Fig. 2 Image decomposition DecomNet network structure diagram

The DecomNet designed in this paper has two main parts: the Unet network layer and the Deep network layer. The Unet network layer is an encoder-decoder architecture built from convolution, max pooling, upsampling, and feature concatenation operations; it contains 6 convolutional layers and 2 upsampling and downsampling steps. During upsampling, the feature maps from the downsampling process are fused in through skip connections, which makes the features more informative; finally, the feature maps are passed through a convolutional layer with 3 output channels and a Sigmoid activation function, and the output is the reflectance component. In the Deep network layer, the fused input is passed through a network stack that contains 7 convolutional layers and a Sigmoid step.

The loss function of the image decomposition network can be expressed as:

$$\begin{aligned} \begin{aligned} {L_\textrm{rec}}&= \textrm{MAE}({R_\textrm{low}} \times {I_\textrm{low}},{S_\textrm{low}})\\&\quad + \textrm{MAE}({R_\textrm{high}} \times {I_\textrm{high}},{S_\textrm{high}}) \end{aligned} \end{aligned}$$
(2)

where \({L_\textrm{rec}}\) is the reconstruction error loss. The decomposition is further constrained by a reflectance similarity loss, by the illumination mutual consistency loss \({L_{mi}} = {\left\| {M \circ \exp \left( { - c \cdot M} \right) } \right\| _1}\), and by the illumination smoothness loss \({L_{mlh}} = {\left\| {\frac{{\nabla {I_\textrm{low}}}}{{\max \left( {\left| {\nabla {S_\textrm{low}}} \right| ,\varepsilon } \right) }}} \right\| _1} + {\left\| {\frac{{\nabla {I_\textrm{high}}}}{{\max \left( {\left| {\nabla {S_\textrm{high}}} \right| ,\varepsilon } \right) }}} \right\| _1}\), where \(\nabla \) denotes the image gradient and \(\varepsilon \) is a small positive constant.
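The reconstruction term of Eq. (2) and the mutual consistency term can be sketched numerically as follows. This is a minimal illustration on nested lists, not the training code: the helper names are hypothetical, the map M is taken as a given input because its construction is not stated here, and the value c = 10 is an assumption.

```python
import math


def mae(a, b):
    """Mean absolute error between two H x W nested lists."""
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def multiply(r, i):
    """Element-wise product R x I."""
    return [[x * y for x, y in zip(rr, ri)] for rr, ri in zip(r, i)]


def reconstruction_loss(r_low, i_low, s_low, r_high, i_high, s_high):
    """Eq. (2): MAE(R_low x I_low, S_low) + MAE(R_high x I_high, S_high)."""
    return mae(multiply(r_low, i_low), s_low) + mae(multiply(r_high, i_high), s_high)


def mutual_consistency_loss(m, c=10.0):
    """L1 norm of M ∘ exp(-c * M); c = 10 is an assumed value."""
    return sum(abs(v * math.exp(-c * v)) for row in m for v in row)


r = [[0.5, 0.5]]
i_low, i_high = [[0.2, 0.4]], [[0.8, 1.0]]
perfect = reconstruction_loss(r, i_low, multiply(r, i_low),
                              r, i_high, multiply(r, i_high))
```

For a non-negative entry of M, the penalty v * exp(-c * v) peaks at v = 1/c and decays toward zero for large v, so weak values are penalized while strong ones are left largely untouched.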

Fig. 3 Restoration net network structure diagram

2.2 Restoration net

In the low-light enhancement task, problems such as color bias and amplified noise may occur. To solve them, this paper designs a reflectance network that incorporates multiple attention mechanisms and a five-layer pyramid; its structure is shown in Fig. 3. First, the UMSIA attention captures the illumination information in the image at different scales, so that the reflectance acquires the structural information of the illumination. Then, the image noise level is estimated by a feature extraction network, and the estimated noise is concatenated with the reflectance to increase the model's attention to the noise, paving the way for the subsequent reflectance denoising. Next, multi-scale feature extraction applies average pooling with different kernel sizes to generate multi-scale feature maps; a pyramid denoising structure attends to each scale, and successive downsampling and upsampling in U-shaped networks aid denoising, extracting global information while retaining local details. Finally, to cascade the multi-scale results, bilinear interpolation upsamples the multi-level denoised features to the same size, and kernel selection over the network's kernels and their attention weights removes the noise and outputs the final reflectance image.
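The pooling-and-resizing part of the pyramid can be sketched as below: average pooling with several kernel sizes, followed by bilinear upsampling of every level back to the input size so the levels can be cascaded. The kernel sizes (1, 2, 4, 8, 16), the corner-aligned interpolation, and all function names are assumptions for illustration; the per-level U-shaped denoising networks are omitted.

```python
def avg_pool(x, k):
    """Non-overlapping k x k average pooling on an H x W nested list."""
    h, w = len(x), len(x[0])
    return [[sum(x[i + di][j + dj] for di in range(k) for dj in range(k)) / (k * k)
             for j in range(0, w - k + 1, k)]
            for i in range(0, h - k + 1, k)]


def bilinear_resize(x, out_h, out_w):
    """Corner-aligned bilinear interpolation to out_h x out_w."""
    in_h, in_w = len(x), len(x[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1, fy = min(y0 + 1, in_h - 1), y - y0
        row = []
        for j in range(out_w):
            xx = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(xx)
            x1, fx = min(x0 + 1, in_w - 1), xx - x0
            top = x[y0][x0] * (1 - fx) + x[y0][x1] * fx
            bottom = x[y1][x0] * (1 - fx) + x[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def pyramid(x, kernels=(1, 2, 4, 8, 16)):
    """Five pooled levels (assumed kernel sizes), each resized back to the input size."""
    h, w = len(x), len(x[0])
    return [bilinear_resize(avg_pool(x, k), h, w) for k in kernels]


flat = [[0.5] * 16 for _ in range(16)]
levels = pyramid(flat)
```

Coarser levels average over larger neighborhoods, which suppresses noise at the cost of detail; resizing all levels to a common size lets a later stage weigh them against one another.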

The reflectance network proceeds as follows. UMSIA first adjusts the reflectance component. In the illumination attention module, the illumination component passes through one 3\(\times \)3 convolutional layer to extract illumination features, which are fused with the reflectance features by element-wise multiplication so that the network selectively focuses on the illumination information in the image. The multi-scale module then captures information at different scales using convolution, pooling, batch normalization, and transposed convolution, and cascades the extracted information; finally, feature extraction is completed by two 3\(\times \)3 convolutional layers. Noise is then estimated on the adjusted, clearer reflectance component: feature information is extracted and global reflectance information is learned through five 3\(\times \)3 convolutional layers and channel attention (as shown in SE-Net in Fig. 4). Average pooling with kernels of different sizes generates multi-scale feature maps; each branch uses its own U-shaped network, and the feature size of each branch is adjusted by bilinear interpolation for cascading. Features are then extracted through three 3\(\times \)3 convolutional layers. The kernel selection module applies three parallel convolutions with kernel sizes 3, 5, and 7 to obtain U’, U”, and U”’; the three branches are integrated by element-wise summation, followed by global average pooling (GAP) and FC operations for expansion, from which the kernel and attention weight combinations are computed. Finally, feature fusion is performed using a 3\(\times \)3 convolutional layer.
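The kernel selection step (element-wise summation of the three branches, GAP, FC, then attention weights over the branches) can be sketched for a single channel as follows. This is a simplified illustration: the FC layers are replaced by one hypothetical scalar weight per branch, and a softmax across branches is assumed for the attention weights.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of numbers."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_kernels(branches, branch_weights):
    """Fuse parallel branch outputs with attention weights.

    branches: list of H x W maps (e.g. outputs of the 3, 5, 7 convolutions).
    branch_weights: one hypothetical scalar per branch standing in for the FC layers.
    """
    h, w = len(branches[0]), len(branches[0][0])
    # Integrate the branches by element-wise summation.
    fused = [[sum(b[i][j] for b in branches) for j in range(w)] for i in range(h)]
    # Global average pooling reduces the fused map to one descriptor.
    gap = sum(v for row in fused for v in row) / (h * w)
    # Per-branch logits, normalized so the attention weights sum to 1.
    attention = softmax([wt * gap for wt in branch_weights])
    out = [[sum(a * b[i][j] for a, b in zip(attention, branches)) for j in range(w)]
           for i in range(h)]
    return out, attention


u1, u2, u3 = [[1.0, 2.0]], [[2.0, 4.0]], [[3.0, 6.0]]
selected, attn = select_kernels([u1, u2, u3], [0.0, 0.0, 0.0])
```

With equal logits the three branches are simply averaged; learned weights would shift the mix toward the receptive field that suits the content.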

The loss function of the reflectance network can be expressed as:

$$\begin{aligned} {L_R}= & {} 0.5\textrm{Grad}({\widehat{R}},{R_h}) + 0.5\textrm{SSIM}({\widehat{R}},{R_h})\nonumber \\{} & {} \quad + \textrm{MAE}({\widehat{R}},{R_h}) + \textrm{color}({\widehat{R}},{R_h}) \end{aligned}$$
(3)

where \(\textrm{Grad}({\widehat{R}},{R_h})\) is the gradient loss, \(\textrm{SSIM}({\widehat{R}},{R_h})\) is the structural similarity loss, and \(\textrm{MAE}({\widehat{R}},{R_h})\) is the mean absolute error loss on details. To address the color shift that occurs in the low-light enhancement task, this paper proposes a color loss \(\textrm{color}({\widehat{R}},{R_h}) = \frac{1}{C}\left( {{{\left( {\sum \limits _{c = 1}^{C} {\left( {{I_c} - \widehat{{I_c}}} \right) } } \right) }^2} + \sum \limits _{c = 1}^{C} {{{\left( {\sqrt{{\sigma _c}} - \sqrt{\widehat{{\sigma _c}}} } \right) }^2}} } \right) \), where c indexes the C color channels, which makes the color information of the reflectance image more closely resemble that of the normal-light image.
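The color loss can be evaluated on toy data as in the sketch below. The text does not define I_c and σ_c, so the code assumes they are the mean and the variance of channel c; that reading, the function names, and the flat per-channel layout are all assumptions for illustration.

```python
import math


def channel_stats(channel):
    """Mean and variance of one channel given as a flat list of pixel values."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    return mean, var


def color_loss(pred, target):
    """Color loss as written in the text, assuming I_c = mean and sigma_c = variance.

    pred, target: lists of C channels, each a flat list of pixel values.
    """
    c = len(pred)
    stats_p = [channel_stats(ch) for ch in pred]
    stats_t = [channel_stats(ch) for ch in target]
    mean_term = sum(mt - mp for (mp, _), (mt, _) in zip(stats_p, stats_t)) ** 2
    std_term = sum((math.sqrt(vt) - math.sqrt(vp)) ** 2
                   for (_, vp), (_, vt) in zip(stats_p, stats_t))
    return (mean_term + std_term) / c


target = [[0.2, 0.4], [0.3, 0.5], [0.1, 0.3]]
shifted = [[v + 0.1 for v in ch] for ch in target]   # uniform brightness offset
```

A uniform offset d on all three channels leaves the variances unchanged and gives a loss of (3d)^2 / 3 = 3d^2 under this reading, which is 0.03 for d = 0.1.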

Fig. 4 Illumination adjust net network structure diagram

2.3 Illumination adjust net

To make the illumination component of a low-light image converge to normal illumination while keeping the adjustment flexible, this paper designs the Illumination adjust Net network, which fuses a channel attention mechanism with multi-scale cascading; its structure is shown in Fig. 4. In the illumination adjustment module, the ratio (\(I_\textrm{low}\)/\(I_\textrm{normal}\)) is introduced to specify the amount of illumination enhancement for each pixel, so that the illumination can be adjusted flexibly, and the neural network captures extensive contextual information about the light distribution to improve its adaptive adjustment capability. Feature learning is performed along the channel dimension to estimate the importance of each channel, after which the excitation part assigns a different weight to each channel to enhance the expressive capability of the feature map. Finally, the fused feature map is reduced to one channel by a convolutional layer. The Illumination adjust Net network improves the robustness and generalization ability of the model and preserves the details and information in the image, so that the model performs well across different datasets and scenes.
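The channel attention described above follows the squeeze-and-excitation pattern, which can be sketched as follows. The weights `w1` and `w2` are hypothetical stand-ins for the learned fully connected layers, and the flat per-channel layout is chosen only to keep the example short.

```python
import math


def se_attention(features, w1, w2):
    """Squeeze-and-excitation style channel reweighting.

    features: list of C channels, each a flat list of values.
    w1: hidden x C weights, w2: C x hidden weights (hypothetical FC layers).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(ch) / len(ch) for ch in features]
    # Excitation: FC + ReLU, then FC + sigmoid, giving one gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every channel by its gate.
    scaled = [[g * v for v in ch] for g, ch in zip(gates, features)]
    return scaled, gates


feats = [[1.0, 3.0], [2.0, 2.0]]
scaled, gates = se_attention(feats, w1=[[0.0, 0.0]], w2=[[0.0], [0.0]])
```

With all-zero weights every gate is sigmoid(0) = 0.5, so each channel is simply halved; trained weights would raise the gates of informative channels and lower the others.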

The Illumination adjust Net network proceeds as follows. The two inputs, the illumination component and the ratio, are concatenated along the depth dimension, and the resulting tensor is passed through three convolutional layers, conv1, conv2, and conv3, each using a 3\(\times \)3 convolution kernel and a ReLU activation function. The next three layers, deconv1, deconv2, and deconv3, perform a series of upsampling and deconvolution operations using nearest-neighbor interpolation and 3\(\times \)3 convolution kernels, which preserve spatial information while increasing the resolution of the feature map. After deconv3, the outputs of deconv1 and deconv2 are resized to match its dimensions, and the feature maps of deconv1, deconv2, and deconv3 are concatenated along the depth dimension into a single tensor. This tensor is passed through a 3\(\times \)3 convolutional layer and a channel attention mechanism to enhance the important features in the feature map. Finally, the adjusted illumination component is passed through a 3\(\times \)3 convolutional layer with a ReLU activation function and output through a Sigmoid function.
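Two data-handling steps in this pipeline, concatenating the illumination with the ratio along the depth dimension and bringing coarser feature maps to a common size with nearest-neighbor upsampling, can be sketched as below. The convolutions are omitted, and the function names, the constant ratio map, and the factor-of-two sizes are assumptions for illustration.

```python
def concat_depth(*maps):
    """Stack H x W maps along a new last (depth) axis: H x W x len(maps)."""
    h, w = len(maps[0]), len(maps[0][0])
    return [[[m[i][j] for m in maps] for j in range(w)] for i in range(h)]


def upsample_nearest(x, factor=2):
    """Nearest-neighbor upsampling of an H x W map by an integer factor."""
    return [[x[i // factor][j // factor] for j in range(len(x[0]) * factor)]
            for i in range(len(x) * factor)]


illumination = [[0.1, 0.2], [0.3, 0.4]]
ratio_map = [[5.0, 5.0], [5.0, 5.0]]             # constant ratio map (assumed form)
net_input = concat_depth(illumination, ratio_map)

coarse = [[1.0, 2.0]]                             # stands in for a coarser feature map
resized = upsample_nearest(coarse, factor=2)      # now 2 x 4
```

Feeding the ratio as an extra input channel is what lets one trained network produce different brightness levels at inference time.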

The loss of Illumination adjust Net network can be expressed as:

$$\begin{aligned} {L_I}= & {} \textrm{Grad}\left( {{{{\widehat{I}}}_\textrm{low}},{I_\textrm{normal}}} \right) + \textrm{MSE}\left( {{{{\widehat{I}}}_\textrm{low}},{I_\textrm{normal}}} \right) \nonumber \\{} & {} \quad + \textrm{MAE}\left( {{{{\widehat{I}}}_\textrm{low}},{I_\textrm{normal}}} \right) \end{aligned}$$
(4)

where \(\textrm{Grad}({{{\widehat{I}}}_\textrm{low}}, {I_\textrm{normal}})\) is the gradient loss, \(\textrm{MSE}({{{\widehat{I}}}_\textrm{low}}, {I_\textrm{normal}})\) is the mean squared error loss, and \(\textrm{MAE}({{{\widehat{I}}}_\textrm{low}}, {I_\textrm{normal}})\) is the mean absolute error loss.
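Eq. (4) can be evaluated on small maps as in the following sketch. The exact form of the gradient term is not given in the text, so the code assumes it is the mean absolute difference between horizontal and vertical finite differences of the two maps; the function names are hypothetical.

```python
def finite_differences(x):
    """Horizontal and vertical first differences of an H x W map (H, W >= 2)."""
    gx = [[row[j + 1] - row[j] for j in range(len(row) - 1)] for row in x]
    gy = [[x[i + 1][j] - x[i][j] for j in range(len(x[0]))]
          for i in range(len(x) - 1)]
    return gx, gy


def illumination_loss(pred, target):
    """Eq. (4): Grad + MSE + MAE, with an assumed finite-difference Grad term."""
    flat_p = [v for row in pred for v in row]
    flat_t = [v for row in target for v in row]
    n = len(flat_p)
    mse = sum((p - t) ** 2 for p, t in zip(flat_p, flat_t)) / n
    mae = sum(abs(p - t) for p, t in zip(flat_p, flat_t)) / n
    grad = 0.0
    for gp, gt in zip(finite_differences(pred), finite_differences(target)):
        diffs = [abs(a - b) for ra, rb in zip(gp, gt) for a, b in zip(ra, rb)]
        grad += sum(diffs) / len(diffs)
    return grad + mse + mae


pred = [[0.5, 0.5], [0.5, 0.5]]
normal = [[0.6, 0.6], [0.6, 0.6]]
loss_value = illumination_loss(pred, normal)
```

For two constant maps that differ by 0.1 the gradient term vanishes, leaving MSE + MAE = 0.01 + 0.1 = 0.11; the gradient term only contributes when the two maps differ in structure.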

3 Experimental results and analysis

3.1 Data set and training configuration

This paper uses the LOL dataset, made publicly available by the authors of Retinex-Net, which contains 500 low-/normal-light image pairs. The LOL dataset is derived from real scenes, which makes it more challenging. For the image decomposition, reflectance enhancement and denoising, and illumination adjustment networks, batch_size is set to 10, patch_size to 384\(\times \)384, epoch to 2000, and lr to 0.0001. The code is implemented in the TensorFlow1.15+nv framework and run on an NVIDIA TESLA V100 32GB GPU. Adam (Adaptive Moment Estimation) is used as the optimizer during model training.

Table 1 Quantitative evaluation indicators of each algorithm on the LOL dataset

3.2 Analysis of results

To verify the effectiveness of the proposed improvements, this section compares the performance of the proposed low-light image enhancement algorithm DMPH-Net with that of existing low-light image enhancement algorithms.

3.2.1 Performance comparison on LOL dataset

In existing image enhancement research, image quality is usually evaluated using metrics such as PSNR (peak signal-to-noise ratio), SSIM (structural similarity), NIQE (natural image quality evaluator), LPIPS (learned perceptual image patch similarity), and LOE (lightness order error). PSNR is the ratio of the peak signal power to the noise power, and the larger its value, the smaller the distortion. SSIM reflects the structural properties of the objects in the scene, and the closer its value is to 1, the more similar the images are. NIQE measures how well the texture details of an image agree with the visual habits of the human eye, and the smaller its value, the better the visual quality. LPIPS is a perceptual measure of the difference between two images, and the lower its value, the more similar the two images are. LOE reflects how well the naturalness of the image is preserved, and a smaller value indicates that the image has a better lightness order and looks more natural. In addition, evaluating PSNR, SSIM, LPIPS, and LOE requires normal-light images as references.
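As a concrete example of a full-reference metric, PSNR can be computed as in the sketch below. It assumes pixel values scaled to [0, 1], so the peak value is 1.0; the function name and the nested-list layout are chosen for illustration.

```python
import math


def psnr(pred, ref, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two H x W maps.

    max_val is the peak pixel value (1.0 assumes images scaled to [0, 1]).
    """
    diffs = [(p - r) ** 2 for rp, rr in zip(pred, ref) for p, r in zip(rp, rr)]
    mse = sum(diffs) / len(diffs)
    if mse == 0:
        return math.inf   # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


reference = [[0.5, 0.5], [0.5, 0.5]]
enhanced = [[0.6, 0.6], [0.6, 0.6]]     # uniform error of 0.1
score = psnr(enhanced, reference)
```

A uniform error of 0.1 gives an MSE of 0.01 and therefore a PSNR of 20 dB; halving the error raises the score by about 6 dB.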

Table 2 Quantitative results of PSNR, SSIM, and NIQE in the LOL dataset of ablation studies

The objective evaluation results of the low-light image enhancement algorithms on the LOL dataset are shown in Table 1. The PSNR of the proposed DMPH-Net algorithm is 23.3772 and its SSIM is 0.8442; both are higher than those of the other algorithms. Its LPIPS and NIQE values are 0.1386 and 3.5966, respectively, indicating that the enhanced images are more similar to the reference images than those of the other methods. For the LOE metric, low-light and normal-light images are each used as the evaluation reference, and DMPH-Net ranks in the top three in both LOE\(_\textrm{low}\) and LOE\(_\textrm{high}\). These results verify the effectiveness of the proposed algorithm.

To verify the effectiveness of each network in the DMPH-Net algorithm, this paper conducts ablation experiments on the LOL test set. The image decomposition network DecomNet, the Restoration Net network, and the Illumination adjustment Net network are replaced in turn by the corresponding parts of the KinD network, and the effectiveness of each module is analyzed by comparing the experimental results. The results of the ablation experiments are shown in Table 2.

To verify the effectiveness of the DecomNet decomposition, the image decomposition network is replaced by that of the KinD algorithm, which leads to a significant decrease in performance. Because the decomposed images supply the feature information for the subsequent image enhancement, this decrease indicates that DecomNet provides the algorithm with feature information that is beneficial to image enhancement.

To verify the effectiveness of the D&E (reflectance enhancement denoising) module, the ablation experiment replaces it with the corresponding KinD module and compares the performance with that of the full DMPH-Net algorithm. As can be seen from Table 2, the PSNR and SSIM of the w/o D&E variant are 23.3545 and 0.8456, respectively, which are very close to those of the full algorithm; however, PSNR and SSIM are not designed specifically to quantify noise. To evaluate the naturalness and quality of the enhanced images more accurately, this paper also reports the NIQE (natural image quality evaluator) metric in Table 2. Without the D&E network, NIQE rises to 3.7033, and the images contain a large amount of noise. The experimental results therefore show that the D&E module is necessary for coping with different application scenarios and model limitations. We also verified the effectiveness of UMSIA, which uses the structure of the illumination to adjust the structure of the reflectance. If UMSIA is removed and the reflectance is directly concatenated with the illumination for reflectance denoising, the performance degrades significantly. This result shows that the multi-scale feature adjustment and denoising of the D&E module, together with UMSIA, improves various metrics, including NIQE. In summary, the introduction of the D&E module and UMSIA is reasonable and necessary: they adjust and enhance the image from multiple perspectives to improve its naturalness, quality, and denoising performance.

To verify the effectiveness of the Illumination adjustment Net network, it is replaced by the corresponding network of the KinD algorithm, which results in a significant decrease in performance. This shows that the encoder-decoder structure, skip connections, and attention mechanism significantly improve feature expression and fusion.

3.2.2 Comparison of the actual enhancement results of different methods

Figure 5 compares DMPH-Net with other algorithms on the test set of the LOL dataset. The SRIE, BIMEF, Zero-DCE, RUAS, and SCI algorithms improve the brightness of the input but still suffer from unsatisfactory illumination adjustment, noise, and color distortion, whereas DMPH-Net denoises and adjusts the illumination properly in these cases. Methods (b), (g), (h), and ours in Fig. 5 process the image using Retinex theory, and most Retinex-based enhancement methods show color distortion in the reconstructed image. In DMPH-Net, by contrast, the illumination is properly adjusted, the noise is removed, and the color distortion is reduced.

Fig. 5 Comparison of various methods on test images from the LOL dataset

To verify the enhancement ability of the proposed model under different low-light degradation conditions, Fig. 6 compares DMPH-Net with other algorithms on real dark scenes with complex lighting. The outputs of the Retinex-Net algorithm (Fig. 6b), the LIME algorithm (Fig. 6e), and the Zhang algorithm (Fig. 6i and Fig. 6h) show an obvious color shift, with an overall bias toward red. The images enhanced by the GLAD algorithm (Fig. 6f and Fig. 6k) are too bright and lose details. The images enhanced by the SRIE algorithm (Fig. 6c) and the RUAS algorithm (Fig. 6j) are too dark. The images enhanced by the BIMEF algorithm (Fig. 6d), the KinD algorithm (Fig. 6g), and the KinD++ algorithm (Fig. 6h) contain considerable noise. Compared with these algorithms, DMPH-Net gives better results under complex lighting conditions: the enhanced image shows less color shift and noise, and the detail recovery and color restoration accuracy are improved.

Fig. 6 Comparison of various methods for night scenes with complex light sources

3.2.3 Performance comparison on reference-free datasets

To further demonstrate the generalization of the DMPH-Net algorithm to other datasets and its performance advantage over other low-light image enhancement algorithms, this paper conducts experiments on the reference-free LIME, NPE, and MEF datasets, using the reference-free image evaluation index NIQE (natural image quality evaluator) for comparison; the results are shown in Table 3. Table 3 shows that DMPH-Net performs well compared with both traditional and recent algorithms on each reference-free dataset.

Table 3 Quantitative comparison of NIQE for each method on the LIME, NPE, and MEF datasets

4 Conclusions

To address the degradation caused by the dark areas of low-light images and improve the quality of low-light image enhancement, this paper proposes an image enhancement algorithm, DMPH-Net, that combines an attention mechanism and a multi-scale pyramid. The algorithm enhances low-light images in three parts: image decomposition, reflectance restoration, and illumination adjustment. First, the image decomposition part decomposes the low-light image with a deeper network, so that the decomposed illumination carries more information. Then, the reflectance part uses the UMSIA module to adjust the reflectance; the five-layer feature pyramid and kernel selection in the PRID-net module effectively fuse the contextual information between feature layers of different scales while removing the amplified noise, and the added color loss effectively suppresses the color bias of the output image. Finally, the illumination adjustment part uses a multi-scale cascade to capture the multi-scale feature information in the illumination, which improves the structural information contained in the illumination, and uses channel attention for feature learning of global information. The DMPH-Net algorithm was evaluated on the LOL dataset and on the no-reference LIME, NPE, and MEF datasets. On the LOL dataset, its PSNR, SSIM, LPIPS, and NIQE are 23.3772, 0.8442, 0.1386, and 3.5966, respectively; on the no-reference datasets LIME, NPE, and MEF, its NIQE values are 3.0735, 3.1711, and 2.9464, respectively. Each metric is improved compared with RUAS, UnRetinex-Net, and other enhancement algorithms. The results show that the DMPH-Net algorithm has a better enhancement effect in the low-light image enhancement task. However, the DMPH-Net algorithm requires paired datasets for training, and few realistic paired low-light datasets exist.
Future work will further investigate unsupervised or semi-supervised low-light image enhancement.