Single-Image Superresolution Reconstruction Based on Latent Features

In recent years, Deep Learning has shown good results on the Single Image Superresolution Reconstruction (SISR) task and has become the most widely used family of methods in this field. The SISR task is a typical ill-posed problem. Therefore, it is often challenging to simultaneously meet the requirements of high-quality sampling, fast sampling, and diversity of details and texture after sampling. This leads to model collapse, a lack of detail and texture after sampling, and overly long sampling times in High-Resolution (HR) image reconstruction methods. This paper proposes a Denoising Diffusion Probabilistic model for Latent features (LDDPM) to solve these problems. Firstly, a Conditional Encoder is designed to effectively encode Low-Resolution (LR) images, thereby reducing the solution space of reconstructed images and improving reconstruction quality. Then, Normalizing Flow and multi-modal adversarial training are used to model the denoising distribution, which has a complex multi-modal form, so that the generative ability of the model can be preserved with a small number of sampling steps. Experimental results on mainstream datasets demonstrate that our proposed model reconstructs more realistic HR images and obtains better PSNR and SSIM performance than existing SISR methods, thus providing a new idea for SISR tasks.


I. INTRODUCTION
SINGLE-IMAGE super-resolution reconstruction (SISR) tasks are critical in research areas such as Computer Vision [1] and Image Processing [2] [3]. The SISR task is to reconstruct the corresponding HR image from an LR image. Because LR images lose many details and texture features during image degradation, the reconstructed HR images must have rich image details and clear textures. However, an LR image may correspond to infinitely many HR images, and various degraded LR images can also be restored to a single HR image. Hence, the SISR task is a typical ill-posed problem. For the SISR task, researchers have successively proposed various traditional methods, such as Iterative Back Projection [4], Projection onto Convex Sets [5], and Sparse Representation [6]. However, the traditional methods usually explicitly estimate the blur kernel and then reconstruct the HR image. Errors in the estimated blur kernel therefore propagate into the reconstruction, so the HR image reconstruction results are not ideal.
The SISR task can also be regarded as a typical generation task. The goal of a generation task is to fit the probability distribution of the data through a generator so that the generated distribution is as close as possible to the real data distribution. Deep learning-based methods for this task can be divided into five categories: CNN-based methods, Generative Adversarial Network (GAN) based methods [7], Flow based methods [8], Variational Auto-Encoder (VAE) based methods [9], and Denoising Diffusion Probabilistic model (DDPM) based methods [10]. However, these generative models face three major difficulties in the SISR task: high-quality sampling, fast sampling, and diversity of details after sampling. CNN-based methods can fit almost any function but cannot fit an arbitrary probability distribution. Therefore, CNN-based methods alone struggle with the unrealistic perception and artifacts in the reconstructed results. GAN-based methods are also commonly used in SISR. They use Perceptual Loss and Adversarial Loss to reconstruct images. Although they provide fast sampling, they suffer from mode collapse and training instability. Flow based methods can improve the diversity of the generated images by using the Log-Likelihood to infer the latent variables exactly, but the generated images are too smooth. VAE based methods not only generate more diverse data using additional conditions but also provide relatively fast sampling. However, VAE based methods do not sample with high quality, and detail and texture are lost in the HR images.
Recently, DDPM has achieved good results in Image Synthesis [11] and Speech Synthesis [12]. DDPM uses a Markov chain to transform latent variables from a Gaussian distribution into data from a complex distribution, thus addressing the "one-to-many" problem in the SISR task and improving the quality of reconstructed data. However, SISR tasks differ from other generation tasks. Applying DDPM to SISR requires solving the following problems: 1) The reverse diffusion process of DDPM on the SISR task requires a complex probability distribution to model the denoising distribution, so DDPM requires thousands of evaluation steps to sample a single example; with a small number of sampling steps, the generated images are not of high quality. 2) DDPM is based on unconditional or only simple conditional model inputs, whereas SISR tasks need to make full use of LR images as conditional inputs to constrain the solution space of HR images. Therefore, this paper proposes a novel Denoising Diffusion Probabilistic model for Latent features (LDDPM) to solve the problems faced by DDPM in the SISR task: 1) To ensure high-quality sampling with a small number of sampling steps, this paper designs a multimodal distribution based on GAN and Normalizing Flow to model HR images, which enables LDDPM to focus on reconstructing high-frequency details of HR images with fewer diffusion steps. 2) To make full use of LR images as conditional inputs, this paper designs a Conditional Encoder that constrains the solution space of the reconstructed HR images. The model in this paper has the following advantages: Fast and high-quality sampling: We reconstruct HR images through Markov chains and complex multimodal distribution modeling, which enables fast model sampling while reducing the negative impact of model collapse on the modeled HR images, thus producing complex and diverse HR images of high quality.
Stable style and content consistency: Although the probability distribution of HR images is difficult to predict, this paper limits the prediction randomness introduced by the variational lower bound in DDPM by designing a new Conditional Encoder, so that the model trains stably and can generate images with the same style and content as the original HR images.
The LDDPM proposed in this paper is validated experimentally on many datasets. The experimental results show that the proposed model outperforms most existing methods for SISR tasks on multiple datasets. In addition, the code of LDDPM will be open-sourced soon: https://github.com/yanjingke/Image-Super-ResolutionLDDPM.

II. RELATED WORK
In this section, we discuss the Convolutional Neural Network (CNN) based methods, Generative Adversarial Network (GAN) based methods, Flow based methods, Variational Auto-Encoder (VAE) based methods, and Denoising Diffusion Probabilistic model (DDPM) based methods among generative models, as shown in Figure 1.

A. Single Image Superresolution Based on Traditional Generative Models
CNN based methods: Due to the rapid development of Deep Learning, many Deep Learning-based methods have been proposed for SISR. Most of these methods are based on Convolutional Neural Networks (CNN), which learn the probability mapping between LR and HR images end-to-end. For example, Zhang et al. [13] found that most CNN-based methods neither fully explore the contextual information of LR images during feature extraction nor pay much attention to the reconstruction steps of the final HR images, so they proposed a two-stage attention-based single image reconstruction method (TSAN), achieving accurate HR image reconstruction in a coarse-to-fine manner. However, TSAN rarely explores the feature correlation between layers, which reduces the ability of the CNN to learn the probabilistic mapping between LR and HR images. Dai et al. [14] capture long-distance spatial contextual information between features by using a second-order feature statistics module and a non-locally augmented residual module so that the model can learn abstract probabilistic mapping representations. Although the second-order feature statistics module can effectively extract information-rich features in each layer, it processes the features of each convolutional layer independently, thus ignoring the correlation of features across layers. Therefore, Niu et al. [15] proposed a Holistic Attention Network (HAN) based on CNN, which not only considers the correlation between layers to adaptively emphasize inter-layer features but also learns the confidence of each channel so that the model can perform complex probability distribution mapping. In the SISR task, an image contains similar patches that can provide information to each other, which helps a CNN learn the probabilistic mapping between the LR and HR image. Therefore, Zhou et al. [16] divided the LR image into multiple patches and used each patch to search for the K nearest adjacent features to dynamically construct a cross-scale mapping matrix. In this way, the probability distribution in the HR image can be transferred to the query patch of the LR image, thus helping to recover more complex detail and texture features. However, if only CNN-based methods are used to generate the HR image, the result looks unrealistic and contains artifacts. A breakthrough solution to this problem is to use GAN-based methods.
GAN based methods: A GAN obtains Content Loss and Discrimination Loss through its Generator and Discriminator to make the generated image distribution as close as possible to the real image distribution. For example, Wang et al. [17] found that when the perceptual objective minimizes the error in pixel space rather than in feature space, the generated HR image tends toward excessively smooth results, missing high-frequency details. They then proposed the SRGAN network, which uses activation features to improve the Perceptual Loss, thus providing a stronger supervisory signal for luminance consistency and texture recovery. However, SRGAN has a limited ability to reconstruct spectral-spatial invariance, which may lead to spectral-spatial distortion in the generated HR image, especially when the magnification factor is large. Therefore, Shi et al. [7] mapped the generated spectral-spatial features from image space to latent space, thereby generating a coupling component to regularize the generated samples. Although GAN research has been applied to the SISR task, GANs are data-driven, which fundamentally limits their ability to reconstruct high-frequency features of unknown images at test time. Therefore, Liu et al. [18] seamlessly integrated the advantages of CNN-based methods into GAN-based methods, exploiting the adaptivity of CNNs to use the detailed features they capture as prior knowledge that helps GANs generate more realistic details. The above methods reconstruct HR images with fewer artifacts and more realistic perception. Still, they are prone to model collapse and cannot effectively solve the "one-to-many" uncertainty problem or determine the distribution of true samples in the hidden space.
VAE based methods: VAE based methods and Flow based methods are also generative models. A VAE based approach first maps the input to the hidden space for probability density estimation. The VAE then assumes a standard Gaussian prior and trains a probabilistic decoder to map from the hidden space to the real data distribution. For example, Gatopoulos et al. [19], inspired by the way neurons in human vision continuously add new information to enhance existing signals after adapting to light, used downsampled image representations as random variables in a VAE and continuously added random variables to the model during training. However, their method tended to generate blurred images, so Liu et al. [20] added a conditional sampling mechanism to reduce the latent subspace used for reconstruction. Although Liu's method can reconstruct some HR images with simple backgrounds, it uses Mean Square Error (MSE) for optimization, which tends to blur the edges of images with complex backgrounds. Liu et al. [9] considered searching for images of similar style among reference images to guide the reconstruction of HR images. They use a Conditional Variational Auto-Encoder (CVAE) to compress various reference images into a compact hidden space, learn the explicit distribution, and sample corresponding style features from this distribution as conditions or priors, which are used to address the blurring of complex background edges in reconstruction.
Flow based methods: Flow based methods use bijective functions to learn the posterior distribution from the prior distribution through a series of invertible transformations, generating HR images from the posterior distribution. For example, Liang et al. [8] found that a Normalizing Flow can predict detail-rich HR images from LR images by jointly modeling downsampling and upsampling. They modeled the LR image and the remaining high-frequency components so that the model uses the bijective mapping between the HR and LR images to learn the lost high-frequency information. Xiang et al. [21] used a Flow-based model for intra-flow feature extraction, inter-flow dependency extraction, and joint feature learning, which resulted in better reconstruction of HR images. However, the Flow-based models in the above methods use only a small number of convolutional layers, which limits the model's receptive field. Therefore, Jo et al. [22] stack more convolutional layers through affine coupling to expand the receptive field and obtain stronger feature expression ability. Flow based and VAE based methods can not only effectively learn the distribution of samples in hidden space but also solve the "one-to-many" uncertainty problem. However, the detailed features of HR images generated by these methods are too smooth, and training is time-consuming.

B. Single Image Superresolution Based on Denoising Diffusion Probabilistic model
Recently, the Denoising Diffusion Probabilistic model (DDPM) has been used for the SISR task. DDPM is composed of two parametric Markov chains (a forward and a reverse chain) and uses variational inference to generate, in finite time, samples consistent with the original data distribution. The forward chain perturbs the data by gradually adding Gaussian noise according to a predesigned noise schedule until the distribution of the data converges to the prior distribution (a standard Gaussian distribution). The reverse chain learns to gradually recover the original data distribution by iterating from the given prior distribution using a parameterized Gaussian transition kernel. Thus, DDPM is a highly flexible and easy-to-compute generative model that not only effectively avoids the model collapse encountered by GANs but also generates high-quality images. For example, Li et al. [10] designed the SRDiff model based on DDPM, which gradually transforms Gaussian noise into HR images through a Markov chain with residuals. Saharia et al. [23] designed SR3, a Super-Resolution model based on repeated refinement. First, white Gaussian noise is added to the image, and then various noisy images are used to train a UNet model to refine the noise output iteratively. Ryu et al. [24] proposed a Pyramid Denoising Diffusion Probabilistic model. In addition to DDPM, this model uses Position Embeddings to train the score function so that the LR image gradually generates an HR image. Although the above DDPM-based models show strong performance on different super-resolution datasets, DDPM's high-quality sampling, sample diversity, and computational overhead on SISR tasks still deserve investigation.

III. METHODOLOGY
This section introduces the proposed Denoising Diffusion Probabilistic model for Latent features (LDDPM) for the SISR task. Firstly, we briefly introduce the basic architecture of the model. Secondly, we review the Denoising Diffusion Probabilistic model. Then, the critical components of LDDPM are described in detail. Finally, the Loss Function of LDDPM is introduced.

A. Denoising Diffusion Probabilistic model for Latent features
In the SISR task, a given LR image X ∈ R^{w×h×c} is restored to the corresponding HR image Y ∈ R^{s↑w×s↑h×c}, where w, h, and c are the width, height, and number of channels of image X, respectively, and s↑ is the upsampling factor. Therefore, the super-resolution problem for a single image can be described by Eq. (1).
Where n represents Gaussian white noise, k represents the downsampling convolution kernel, and ⊗ represents the convolution operation. The SISR task aims to model Eq. (1) as a maximum a posteriori probability problem, as shown in Eq. (2).
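Since the typeset equations did not survive extraction, the two formulas referenced above can be reconstructed from the symbol definitions in the surrounding text; the downsampling notation ↓_s is an assumption of this reconstruction:

```latex
% Eq. (1): degradation model relating the HR image Y to the LR image X
X = (Y \otimes k)\downarrow_{s} + n
% Eq. (2): maximum a posteriori formulation of SISR
\hat{Y} = \arg\max_{Y}\ \log q(X \mid Y) + \log q(Y)
```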
Where Y represents the reconstructed HR image, log q(Y) represents the prior over HR images optimized by the model, and log q(X|Y) represents the Log-Likelihood of the LR image given the HR image. However, in traditional SISR approaches, the model not only collapses easily but also cannot recover image details well. The Denoising Diffusion Probabilistic model [25] [26] transforms the standard normal distribution into the empirical data distribution (similar to Langevin dynamics) through a series of refinement steps, which reduces model collapse and retains more image details. Therefore, in this paper, the parameters of log q(X_1, X_2, ..., X_T |Y) are learned to approximate Y by a random iterative refinement process in the manner of DDPM. The process gradually maps the source images X_1, X_2, ..., X_T to the target image Y to achieve a "one-to-many" mapping, so that the target image Y gradually becomes as consistent as possible with the multiple source images X_1, X_2, ..., X_T. Therefore, this paper can recast Eq. (2) into a DDPM-based modeling method, as shown in Eq. (3).
Where X_i contains the LR image X together with X_{i−1} plus Gaussian noise at step i, and T is the total number of diffusion steps. In DDPM, Gaussian noise is gradually added to Y to generate the latent variables X_1, X_2, ..., X_T. The LDDPM in this paper is shown in Figure 2 and is built on a T-step DDPM. Instead of directly reconstructing the HR image at each iteration step of LDDPM, a UNet network is used to predict the noise ε in X_i at the current step i. In the LDDPM model, we add a Conditional Encoding Mechanism, divided into Conditional Encoding based on an Adaptive Multi-Head Attention Mechanism and Conditional Encoding based on VAE. In the former, we map the LR image features encoded by the Conditional Encoder to the middle layer of the UNet using the Multi-Head Attention Mechanism, guiding the UNet to learn more latent features of the LR images. In the latter, we use a VAE to sample random feature vectors from the LR image X as the conditional feature F_R and combine the mean map F_µ and variance map F_σ decomposed from the feature vector F_X of the UNet encoder to transfer the conditional features to the hidden space. The VAE not only effectively fills in the information missing when magnifying the LR image but also constrains the solution space of the reconstructed HR image, making it easier for the model to learn the noise at the current step.
For the encoder output feature F_g of the UNet network, we use a Normalizing Flow, which helps the model induce a more complex probability distribution bias. During training, to ensure high-quality sampling from LDDPM, a GAN is adopted to learn the multi-modal distribution of X_{i−1} and X_i. This multi-modal distribution replaces the simple Gaussian distribution learned by the original DDPM. Thus, the Kullback-Leibler Divergence (KL Divergence) between the noise probability distributions of the denoising model and the real model can be reduced.

B. Denoising Diffusion Probabilistic model
In DDPM, the HR image Y is defined as the target variable, and q(Y) is the probability distribution of the target variable. As shown in Figure 3, DDPM consists of forward and reverse diffusion processes. The forward diffusion process of DDPM aims to map Y to a multidimensional normal distribution (Gaussian noise) through a Markov chain, calculated as shown in Eq. (4).
Where we define X_0 as Y, X_i and Y are variables of the same dimension, T is the number of diffusion steps, and q(X_i|X_{i−1}) is defined as a Gaussian distribution. A small amount of Gaussian noise is added at each diffusion step, and the final HR image is transformed into a multidimensional Gaussian distribution whose dimensions are independent of each other. The reverse diffusion process of DDPM generates HR images by sampling from the Gaussian distribution, calculated as shown in Eq. (5).
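The forward perturbation described above has a well-known closed form, q(X_i | X_0) = N(√ᾱ_i X_0, (1−ᾱ_i)I), which follows from composing the per-step Gaussians. The sketch below illustrates it numerically; it is not the authors' implementation, and the linear beta schedule and its endpoints are assumptions:

```python
import numpy as np

def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule; alpha_bar[i] is the cumulative product
    of (1 - beta) up to step i."""
    betas = np.linspace(beta_start, beta_end, T)
    return betas, np.cumprod(1.0 - betas)

def q_sample(x0, i, alpha_bar, noise):
    """Sample X_i ~ q(X_i | X_0) in closed form: the signal is scaled
    by sqrt(alpha_bar[i]) and Gaussian noise fills the remainder."""
    return np.sqrt(alpha_bar[i]) * x0 + np.sqrt(1.0 - alpha_bar[i]) * noise

T = 1000
betas, alpha_bar = make_schedule(T)
x0 = np.ones((16, 16))                              # toy "HR image"
noise = np.random.default_rng(0).normal(size=(16, 16))
x_late = q_sample(x0, T - 1, alpha_bar, noise)      # almost pure noise
```

By the last step, `alpha_bar` is nearly zero, so `x_late` is essentially a sample from the standard Gaussian prior, matching the convergence claim above.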
Through this process, the model can gradually eliminate the Gaussian noise and generate HR images matching the target distribution. It is worth noting that we train only the mean function µ_θ and the variance function σ_θ during model training so that the model can be sampled to generate HR images. In addition, we set σ_θ to a constant so that µ_θ can be rewritten, as shown in Eq. (6), by reparameterization.

Ultimately, DDPM can be interpreted as extracting, given an image Y, noise ε, and step i, the noise ε added at step i from X_i. To achieve this, the model needs to learn valid feature information from X_i, ε, and step i so that the HR image Y can be gradually mapped to the corresponding noise values according to the specified rules, and a distribution similar to that of the HR image Y can be generated from the noise values during the reverse diffusion process. Therefore, the Loss Function of DDPM is defined as shown in Eq. (7).
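The standard simplified DDPM objective is the mean squared error between the injected noise and the network's noise prediction; assuming Eq. (7) takes this common form (the exact weighting used by the authors is not recoverable from the text), it can be sketched as:

```python
import numpy as np

def ddpm_loss(eps_true, eps_pred):
    """Simplified DDPM objective: MSE between the noise injected in the
    forward process and the noise predicted by the denoising network."""
    return float(np.mean((eps_true - eps_pred) ** 2))

eps = np.random.default_rng(0).normal(size=(4, 16, 16))
perfect = ddpm_loss(eps, eps)               # an oracle predictor scores 0
bad = ddpm_loss(eps, np.zeros_like(eps))    # predicting zeros scores ~1
```

In training, `eps_pred` would come from the UNet ε_θ(X_i, i); the loss is zero exactly when the network recovers the injected noise.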

C. Conditional Encoding Mechanism
The purpose of LDDPM on the SISR task is to model the conditional distribution P(X|Y). We can therefore control the synthesis process of HR images by feeding an encoding of the LR image X as a condition to the function ε_θ(·). However, if X and the current noisy image X_i are directly stacked together in the UNet for conditional sampling, the UNet not only tends to ignore detailed features related to the perceptual context but also requires expensive function evaluations in pixel space to extract the noisy features well. Therefore, how to encode the LR image X in DDPM and use it as a conditional input so that the model better learns the noise distribution deserves further investigation.
Conditional Encoding based on Adaptive Multi-Headed Attention Mechanism: For the modeling condition of LDDPM, we not only stack the current noisy image X_i and the LR image X but also design a Conditional Encoding method based on an adaptive Multi-Headed Attention Mechanism, inspired by Rombach et al. [27] and Qin et al. [28]. First, an encoder T_θ(·) for the LR image is designed, and the LR image X is projected by T_θ(·) to the same dimension as the UNet's middle-layer features; then, the middle-layer features and the features projected by T_θ(·) are combined by dot products in the Multi-Headed Attention Mechanism of the UNet. The conditional adaptive Multi-Headed Attention Mechanism is thus calculated as shown in Eq. (8).
Where ϕ_k(·) denotes the feature-flattening operation, and the remaining matrices are the projection matrices of the k-th intermediate layer of the UNet. It is worth noting that in the Glow model based on Normalizing Flow, we replace the Mask matrix in Glow with the feature matrix output by the Conditional Encoder T_θ(·) so that the model better learns the noise ε of the current step i.
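The conditioning in Eq. (8) is scaled dot-product attention with queries from the flattened UNet features and keys/values from the LR encoding. A single-head numpy sketch (the paper uses multiple heads, and the projection matrices here are hypothetical identity placeholders):

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(unet_feat, cond_feat, Wq, Wk, Wv):
    """Single-head sketch of the conditioning: queries come from the
    flattened UNet intermediate features, keys/values from the LR-image
    encoding produced by the conditional encoder T_theta."""
    Q, K, V = unet_feat @ Wq, cond_feat @ Wk, cond_feat @ Wv
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

rng = np.random.default_rng(0)
f_unet = rng.normal(size=(64, 32))   # 64 spatial tokens from the UNet
f_cond = rng.normal(size=(16, 32))   # 16 tokens from T_theta(X)
Wq = Wk = Wv = np.eye(32)            # placeholder projections
out = cross_attention(f_unet, f_cond, Wq, Wk, Wv)
```

Each of the 64 UNet tokens receives a convex combination of the LR-condition values, which is how the LR image steers the denoising at every intermediate layer.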
Conditional Encoding based on Variational Auto-Encoder: For Conditional Encoding in DDPM, we also designed a Conditional Variational Auto-Encoder (CVAE) [29]. The CVAE performs variational inference over the hidden variable Z and the observed variables C ∈ X_{1,2,...,T} using the reparameterization technique. The CVAE can project the condition X_i into the latent space and thus learn the latent conditional probability distribution. The decoder of the UNet refers to the conditional features F_c output by the CVAE to learn the feature map F_g for predicting the UNet noise ε. The CVAE model is mainly divided into a Feature Encoder, the hidden variable Z, and a Feature Decoder, as shown in Fig. 2. The Feature Encoder of the CVAE is mainly used to fit the Likelihood Function, which is calculated as shown in Eq. (9).
The values of the Likelihood Function depend on the results of the µ_θ and σ_θ function calculations. µ_θ and σ_θ learn the mean and variance of the Gaussian distribution, respectively; they thus learn the relationships between pixels and represent them in a probabilistic model. The hidden variable Z can be expressed in terms of µ_θ and σ_θ in the CVAE, as calculated in Eq. (10).
Where the hidden variable Z is sampled from the Gaussian distribution Q(Z) = N(0, I), as shown in Figure 2. To ensure that randomness is introduced in the sampling process and that the probability distribution learned by the CVAE is close to a Gaussian distribution, we use the KL Divergence to optimize the CVAE, as calculated in Eq. (11).
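The two VAE building blocks above have standard closed forms: the reparameterization trick Z = µ + σ·ε with ε ~ N(0, I), and the KL divergence of a diagonal Gaussian against N(0, I). A minimal sketch (the concrete encoder networks producing µ and log σ² are omitted):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Reparameterization trick: Z = mu + sigma * eps, eps ~ N(0, I),
    keeping the sampling step differentiable w.r.t. mu and log_var."""
    return mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions:
    -0.5 * sum(1 + log_var - mu^2 - exp(log_var))."""
    return float(-0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var)))

rng = np.random.default_rng(0)
mu, log_var = np.zeros(8), np.zeros(8)       # posterior == prior
z = reparameterize(mu, log_var, rng)
kl_zero = kl_to_standard_normal(mu, log_var) # exactly 0 at the prior
kl_pos = kl_to_standard_normal(np.ones(8), np.zeros(8))
```

The KL term is zero precisely when the learned posterior equals the standard Gaussian prior, which is what the optimization in Eq. (11) pushes toward.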
In the LDDPM forward diffusion process, we first replace the Gaussian-distributed input of the Feature Decoder with the hidden variable Z. Then, we use a convolutional layer to project the probability distribution output by the Feature Decoder into the spatial domain to obtain the conditional probability mapping feature F_R. Finally, to map the conditional feature F_R onto the output of the UNet encoder, we use convolutional layers to learn the mean F_µ and variance F_σ of the feature map F_X output by the UNet encoder, and the conditional feature F_R and F_X are fused to obtain the fused feature F_g. It is worth noting that the mean F_µ and variance F_σ are spatial statistics of the feature map, not parameters of a Gaussian distribution. In addition, in the reverse diffusion process of LDDPM we can remove the Feature Encoder of the CVAE and use a random Gaussian sample as the input hidden variable Z of the Feature Decoder.

D. Optimized Denoising Diffusion Probabilistic model
Glow-based model Optimization: In the problem setting of the LDDPM-based SISR task, this paper aims to make the Posterior Encoder of LDDPM able to reconstruct the HR image accurately, while the Prior Encoder of LDDPM learns better distributions so that the Posterior Encoder can reconstruct HR images better. A Gaussian distribution is used in the CVAE of LDDPM to parameterize the Prior and Posterior Encoders of the UNet, so we applied the Glow model based on Normalizing Flow [30] to the feature map F_g. Glow is a type of Flow model composed of multiple scale levels, each consisting of a Squeeze operation and a number of Flow steps; each Flow step contains an ActNorm layer, an invertible 1×1 convolution, and a Coupling Layer. We designed Glow to convert the simple Gaussian distribution output by the CVAE into a more complex distribution, according to the law of the changing noise, using the bijective function f_θ(·). This makes it easier for the Posterior Encoder to reconstruct the complex probability distribution of the HR image. Glow is calculated as shown in Eq. (12).
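The invertibility that makes the flow's log-likelihood tractable comes from the coupling layer's structure. A minimal sketch of a single affine coupling step (ActNorm, the 1×1 convolution, and the multi-scale squeeze/split are omitted; the toy `net` stands in for the learned scale/shift network):

```python
import numpy as np

def affine_coupling_forward(x, scale_shift):
    """One affine coupling step as used in Glow-style flows: the first
    half of the dimensions passes through unchanged and parameterizes
    an invertible affine map of the second half."""
    x1, x2 = np.split(x, 2)
    log_s, t = scale_shift(x1)
    return np.concatenate([x1, x2 * np.exp(log_s) + t])

def affine_coupling_inverse(y, scale_shift):
    """Exact inverse: recompute (log_s, t) from the untouched half."""
    y1, y2 = np.split(y, 2)
    log_s, t = scale_shift(y1)
    return np.concatenate([y1, (y2 - t) * np.exp(-log_s)])

# Toy scale/shift "network": any function of x1 works, because
# invertibility is structural, not a property of the network.
net = lambda x1: (np.tanh(x1), x1 ** 2)

x = np.random.default_rng(1).normal(size=8)
y = affine_coupling_forward(x, net)
x_back = affine_coupling_inverse(y, net)     # recovers x exactly
```

Because `x1` is never modified, the inverse can recompute exactly the same `(log_s, t)`, which is why the transform is bijective regardless of how complex the internal network is.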
GAN-based model Optimization: The KL Divergence D_KL(C, Y) describes the degree of information loss when the distribution C is used to approximate the distribution Y. Therefore, LDDPM should keep the KL Divergence between the denoising distribution p_θ(X_{i−1}|X_i) of the reverse diffusion process and the distribution q(X_{i−1}|X_i) of the forward diffusion process as small as possible, to ensure a better match between the probability distributions of the real and reconstructed HR images. Xiao et al. [31] demonstrated that the data distribution approaches a unimodal Gaussian as Gaussian noise is added incrementally in the forward diffusion process, while the true denoising distribution becomes increasingly complex and non-Gaussian as the step size increases. Therefore, in this paper, we design a Conditional GAN to estimate the true denoising distribution q(X_{i−1}|X_i), enabling LDDPM to model multimodal denoising distributions with stronger expressive power. The goal of our Conditional GAN is to minimize the Adversarial Loss, thus minimizing D_KL(C, Y) and improving the match between the distribution p_θ(X_{i−1}|X_i) of LDDPM's reverse diffusion process and the true denoising distribution q(X_{i−1}|X_i) of the forward diffusion process. The Conditional GAN proposed in this paper, shown in Figure 4, sets up a time-dependent Discriminator D_φ(·). The inputs of the Discriminator are X_{i−1} and the X_i computed from the noise ε, and the output is the confidence of the pair (X_{i−1}, X_i). The Discriminator is trained by Eq. (13).
It is worth noting that the output of the GAN generator is the noise ε distribution, so X_{i−1} and X_i can be calculated by Eq. (14).
Fig. 4. Denoising process of LDDPM based on Generative Adversarial Network.
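Assuming the time-dependent discriminator is trained with the standard binary cross-entropy GAN objective (the exact form of Eq. (13) is not recoverable from the text), the per-step training signal can be sketched as:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Standard GAN discriminator objective (an assumption here) for the
    time-dependent discriminator D_phi: d_real are its outputs on true
    denoising pairs (X_{i-1}, X_i) from the forward process q, d_fake
    its outputs on pairs produced by the generator G_theta."""
    eps = 1e-12  # numerical guard for log(0)
    return float(-np.mean(np.log(d_real + eps) + np.log(1 - d_fake + eps)))

# A confident, correct discriminator drives the loss toward 0:
good = discriminator_loss(np.array([0.99]), np.array([0.01]))
# A maximally confused one (0.5 everywhere) sits at 2*log(2):
confused = discriminator_loss(np.array([0.5]), np.array([0.5]))
```

At the GAN equilibrium the discriminator is confused, which is exactly the regime in which the generator's denoising distribution matches the true q(X_{i−1}|X_i).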
Compared with DDPM, the distribution of HR images reconstructed by the reverse diffusion process of LDDPM is more complex, and LDDPM is an implicit model. The forward diffusion process of LDDPM is still an additive Gaussian noise process, so regardless of the step length or the complexity of the data distribution, the forward-process distribution q(X_{i−1}|X_i) retains the nature of a Gaussian distribution. Therefore, the reverse diffusion process p_θ(X_{i−1}|X_i) of LDDPM can be expressed by Eq. (15),
where p_θ(ε_θ|X_i) is the implicit distribution induced by the generator G_θ(·) of the GAN, whose inputs are X_i, Z ∼ p(Z) = N(Z; 0, I), and the diffusion step number i.

E. Loss Function for Training
During training, LDDPM gradually maps Y to a Gaussian distribution through a Markov chain. However, the noise error grows with the number of iteration steps. To prevent this accumulated error from increasing the perceptual distance between the reconstructed HR image Ŷ and the real HR image Y, we use Content-Aware and Style-Aware losses to guide LDDPM in reconstructing HR images. The idea is that, at each iteration step, the real denoised image X_i needs to pass its style information to the predicted denoised image X̂_i while retaining the content information.
Content Loss: Ledig et al. [32] showed that the Mean Square Error (MSE) Loss tends to lose high-frequency details, leading to overly smooth textures in the reconstructed images for the SISR task. Therefore, we calculate the loss between images X_i and X̂_i with a content-perceptual loss based on VGG-19 so as to retain more detailed features of image X_i. Unlike Ledig et al., we also compute a pixel-wise loss between the real HR image Y and the reconstructed HR image Ŷ. The Content Loss of LDDPM is therefore calculated as shown in Eq. (16).
Style Loss: Park et al. [33] used VGG-19 to extract feature maps and computed the mean and variance of those feature maps to reduce the style loss between them. Following Park et al., LDDPM calculates the style loss between X_i and X̂_i, as shown in Eq. (17).
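A statistic-matching style loss in the spirit of Eq. (17) can be sketched as follows; the exact feature layers, weighting, and distance are assumptions for illustration.

```python
import numpy as np

def style_loss(feat_pred, feat_real, eps=1e-5):
    """Match per-channel mean and standard deviation of feature maps,
    following the statistic-matching idea of Park et al.
    feat_* has shape (C, H, W)."""
    mu_p = feat_pred.mean(axis=(1, 2))
    mu_r = feat_real.mean(axis=(1, 2))
    sd_p = np.sqrt(feat_pred.var(axis=(1, 2)) + eps)
    sd_r = np.sqrt(feat_real.var(axis=(1, 2)) + eps)
    return np.mean((mu_p - mu_r) ** 2) + np.mean((sd_p - sd_r) ** 2)
```

Because only channel statistics are compared, the loss transfers style (texture statistics) without forcing the spatial layout of the two feature maps to match, which is exactly the "pass style, keep content" behaviour described above.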
Therefore, the total loss function of the LDDPM model is shown in Eq. (18).

IV. EXPERIMENTS AND ANALYSIS
In this section, we present the experimental details of the model and demonstrate its applicability to SISR tasks. First, the experimental setup is introduced in detail. Then, the proposed model is compared experimentally with other advanced models. Finally, ablation experiments verify the network structure and show that the design in this paper recovers image details and texture features well and reconstructs realistic high-resolution images on the SISR task.

A. Experimental Settings
Datasets: We validated the applicability of LDDPM on face datasets (8×) and on datasets for general tasks (2×). For the face datasets, this paper used Flickr-Faces-HQ (FFHQ) [Karras et al. [34], 2019] and CelebFaces Attributes (CelebA) [Liu et al. [35], 2018]. FFHQ is a high-quality face dataset containing 70K face images; it is rich and varied in age, ethnicity, and image background, and covers a very large range of face-attribute variations. The FFHQ training set contains 30K images, and the test set contains 2,000 images. CelebA is a face-attribute dataset of 200K images covering large pose variations and background clutter. In this experiment, the CelebA training set comprises 54K images and the test set comprises 5,000 images. For training and prediction, we resized the images from CelebA and FFHQ to 128×128 HR images and downsampled them with a bicubic kernel to generate 16×16 LR images.
For the general image super-resolution datasets, we used DIV2K [Agustsson et al. [36], 2017] and Flickr2K [Lim et al. [37], 2017] together for training and testing. We selected 900 images from DIV2K and 2,650 images from Flickr2K as the training set. To test the generalization of the model, Set5, Set14, Urban100, and Manga109 were chosen as test sets. In addition, we cropped each image in the dataset to 128×128 to obtain the HR images and downsampled them with a bicubic kernel to generate 64×64 LR images. Finally, we applied the image degradation algorithm of Zhang et al. [38] to the LR images to improve the model's robustness; it consists of a random permutation of blur degradation, downsampling degradation, and noise degradation. Blur degradation is simulated with isotropic and anisotropic Gaussian blurs. Downsampling degradation uses a method chosen at random from nearest-neighbour, bilinear, and bicubic interpolation. Noise degradation adds Gaussian noise of different levels, JPEG compression of different qualities, and reversed ISP-generated sensor noise to the LR image.
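The blur → downsample → noise chain can be sketched as follows. This simplified numpy version keeps only a Gaussian blur, fixed nearest-neighbour downsampling, and additive Gaussian noise; the random permutation, random interpolation choice, JPEG, and ISP components of Zhang et al.'s full algorithm are omitted, and the kernel size, sigma, and noise level are illustrative.

```python
import numpy as np

def gaussian_kernel(size=7, sigma=1.5):
    """Isotropic Gaussian blur kernel, normalized to sum to 1."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def blur(img, kernel):
    """Naive 'same' convolution with zero padding."""
    pad = kernel.shape[0] // 2
    p = np.pad(img, pad)
    out = np.zeros_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(p[i:i + kernel.shape[0], j:j + kernel.shape[1]] * kernel)
    return out

def degrade(hr, scale=2, noise_sigma=0.02, rng=None):
    """Blur -> downsample -> additive Gaussian noise, a reduced sketch of
    the degradation pipeline described in the text."""
    rng = rng if rng is not None else np.random.default_rng(0)
    lr = blur(hr, gaussian_kernel())[::scale, ::scale]   # strided nearest-neighbour downsample
    return lr + noise_sigma * rng.standard_normal(lr.shape)
```

Randomizing the kernel, interpolation, and noise type per sample (as the full algorithm does) is what forces the model to generalize across unknown real-world degradations.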
Experimental parameters: Model training used eight GeForce RTX 3090 24GB and four TITAN V 16GB graphics cards. Model parameter settings for training and testing on the CelebA, CelebA+FFHQ (CeleHQ), DIV2K, and DIV2K+Flickr2K (DIFL2K) datasets are provided in Table I. These settings are used for all main results, and the same settings were used for all LDDPM variants in the ablation experiments. We chose Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) as the evaluation metrics. PSNR measures the pixel-level difference between the reconstructed HR image and the original HR image, while SSIM measures the structural similarity between the two images.
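The two metrics can be sketched as follows. Note that this SSIM uses a single global window for brevity, whereas standard implementations average the statistic over sliding local windows, so values will differ from library results.

```python
import numpy as np

def psnr(x, y, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB; infinite for identical images."""
    mse = np.mean((x - y) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(max_val**2 / mse)

def ssim_global(x, y, max_val=1.0):
    """SSIM computed over one global window (library versions use sliding
    Gaussian windows). Constants c1, c2 follow the standard formulation."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))
```

PSNR is purely pixel-wise, while SSIM compares luminance, contrast, and structure, which is why the paper reports both.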

B. Comparison of general Image Super-Resolution Experiments
In this section, we evaluate the reconstruction quality of LDDPM by comparing it with advanced image super-resolution models on the face image super-resolution datasets (8×) and the general image super-resolution datasets (2×).
For the general image super-resolution datasets (2×), we first trained EDSR [Lim et al. [37], 2017], RCAN [Zhang et al. [40], 2018], SAN [Dai et al. [14], 2019], IGNN [Zhou et al. [16], 2020], HAN [Niu et al. [15], 2020], NLSA [Mei et al. [41], 2021], and LDDPM (ours) on the DIV2K dataset. Then, SwinIR [Liang et al. [42], 2021], SwinFIR [Zhang et al. [43], 2022], EDT [Li et al. [44], 2022], and LDDPM were trained on the DIFL2K dataset. Next, we used the CelebA dataset to pre-train SR3 [Saharia et al. [23], 2022] and LDDPM and fine-tuned the model parameters on DIFL2K. Finally, we visualized the HR images recovered by several advanced models in Figure 5. From Fig. 5, we can see that current Transformer- and CNN-based models still have much room for improvement in reconstructing the details and textures of complex images, whereas LDDPM largely avoids these problems and reconstructs HR images with high-frequency details.

C. Comparison of experimental results of Super-resolution of face images
In this section, LDDPM is experimentally compared with ESRGAN [Wang et al. [17], 2018], ProgFSR [Kim et al. [45], 2019], SRFlow [Lugmayr et al. [46], 2020], SRDiff [Lin et al. [10], 2022], and SR3 on the CeleHQ dataset, as shown in Table III. LDDPM improves PSNR by 0.96 dB and SSIM by 2.31% compared with the most advanced of these models, indicating that LDDPM can generate high-quality and diverse HR images that remain strongly consistent with the LR input. As seen in Figure 6, compared with other models, the wrinkles on the elderly man's forehead and the woman's hair in the LDDPM-reconstructed images look more natural and have rich details and textures.
In addition, LDDPM (43M) uses fewer model parameters than SR3 (98M) and SRDiff (52M) and takes only about 20 hours to converge on the CeleHQ dataset, compared with 34 hours for SRDiff and 40 hours for SR3, indicating that LDDPM trains efficiently and achieves better performance with a smaller computational overhead.
As shown in Figure 7, we use histograms to visualize the important detail pixels of the woman's face in the HR images reconstructed by LDDPM, SR3, and SRDiff. Figure 7 shows that LDDPM learns a more regular feature distribution, capturing good detail and texture features and obtaining better performance.

D. Ablation Experiments

Model optimization based on Glow and GAN: Row 2 of Table V shows that adding Glow to LDDPM increases PSNR and SSIM by 1.03 dB and 3.19%, respectively, indicating that Glow enables LDDPM to capture a more complex noise probability distribution. Row 3 of Table V shows that adding the GAN raises PSNR and SSIM by 0.87 dB and 0.97%, respectively, indicating that the multimodal distribution learned by the GAN makes the HR images reconstructed by LDDPM in the inverse diffusion process more realistic. In Figure 8, we visualize the features extracted by LDDPM with Glow and GAN added. As seen in Figure 8, compared with the original LDDPM, the variant with Glow and GAN learns better probability distributions with a relatively small total number of sampling steps.
Optimization of experimental hyperparameters: We performed hyperparameter ablation experiments to investigate the effect of the total number of diffusion steps and of the loss function on LDDPM. As shown in Figure 9, image quality improves as the total number of diffusion steps increases. However, larger totals slow down training and inference, so a moderate number of diffusion steps is chosen as the default setting in this paper. Finally, we compare the effects of Content Loss and Style Loss on the experimental results, as reported in Table VI.

E. Experimental comparison of Real Datasets
To evaluate the performance of LDDPM more comprehensively, we collected some low-resolution images from the real world. As shown in Figure 10, the quality of the HR images reconstructed by LDDPM is better than that of SR3, EDT, and SRDiff. Specifically, the images reconstructed by SR3, EDT, and SRDiff in Figure 10 are blurred and missing details and textures, while LDDPM reconstructs images that are not only clear but also complete in details and textures. Experiments on real-world data demonstrate that LDDPM generalizes well and can be applied to SISR tasks in natural environments.

V. CONCLUSION

We designed a Denoising Diffusion Probabilistic Model for Latent features (LDDPM) to address the "one-to-many" uncertainty of the SISR task; it tackles the missing details and textures in reconstructed HR images, the slow sampling of existing models, and the ineffective use of degraded images in existing image super-resolution reconstruction methods. LDDPM uses a Markov chain to convert HR images into a simple Gaussian probability distribution and then reconstructs HR images gradually through the inverse diffusion process. We used a Conditional Encoder in both the forward and reverse processes of LDDPM; it encodes the LR image with an adaptive Multi-Headed Attention mechanism and a Variational Auto-Encoder, which significantly constrains the solution space of the reconstructed image. In addition, to accelerate convergence and stabilize training, we added Normalizing Flow and multimodal adversarial training to the model. These components model each denoising step with a complex distribution, enabling the model to learn the probability distributions of more complex HR images efficiently while significantly reducing the number of diffusion steps. Extensive experiments demonstrate that LDDPM makes better use of the LR image feature information to generate HR images with better perceptual quality in a smaller number of diffusion steps.
Our work still has shortcomings. For example, LDDPM still requires a large number of sampling steps, and it is an open question whether Score-based DDPM can reduce this number further. In addition, in the forward diffusion process of the Markov chain, could LDDPM be trained by adding, alongside Gaussian white noise, JPEG compression noise of different qualities, reversed ISP-generated sensor noise, etc., to improve the robustness of the model?
Our future work focuses on two directions: 1) greatly reducing the number of sampling steps in LDDPM by investigating Score-based DDPM with the score-matching technique; 2) improving the quality of LDDPM-reconstructed HR images by adding different kinds of noise in the forward diffusion process.

Fig. 2. Network architecture of the LDDPM model. Our model can be viewed as a conditional Denoising Diffusion Probabilistic model.

Fig. 3. Overview of the forward and inverse diffusion processes of the Denoising Diffusion Probabilistic model. The forward diffusion process runs from left to right, the inverse diffusion process from right to left, and θ denotes the learnable parameters.

Fig. 8. Visualization of feature sampling by LDDPM with the same number of steps.

Fig. 9.

Fig. 10. LDDPM generates HR images under different steps, where t is the time in seconds to generate the HR images.
EDSR, RCAN, SAN, IGNN, and HAN are mainly CNN-based models; SwinIR, EDT, and SwinFIR are mainly Transformer-based models; SR3 and LDDPM are DDPM-based models. Table II shows the experimental results of the classical single-image super-resolution models. Compared with other advanced models, LDDPM achieves higher performance in reconstructing HR images on several test sets. In particular, LDDPM improves PSNR and SSIM by 2.07 dB and 1.95%, respectively, compared with the EDT model on Urban100, which demonstrates the effectiveness of LDDPM and provides a new idea for the SISR task. Meanwhile, we used CelebA as the pre-training dataset for LDDPM and SR3 in the comparison experiments. Table II shows that LDDPM reaches a PSNR of 42.96 dB and an SSIM of 97% on the Manga109 dataset; compared with the DDPM-based SR3 model, PSNR improves by 6.57 dB and SSIM by 1.75%, indicating that adding a pre-training dataset significantly improves the reconstruction quality of LDDPM.
LDDPM, SRDiff, and SR3 are DDPM-based models; RRDB and ProgFSR are CNN-based models; ESRGAN is a GAN-based model; and SRFlow is a Normalizing-Flow-based model. From Table III, it can be seen that LDDPM outperforms all the above models in terms of the evaluation metrics.

TABLE II QUANTITATIVE COMPARISON OF LDDPM WITH ADVANCED MODELS ON CLASSICAL IMAGE SUPER-RESOLUTION DATASETS (2×).

From Table VI, we can see that adding Content Loss (CL) to LDDPM increases PSNR and SSIM by 0.56 dB and 0.49%, respectively, compared with row 1 of Table VI. Moreover, adding Style Loss (SL) increases them by a further 0.18 dB and 0.64% compared with row 2 of Table VI. These results show that adding content loss and style loss better guides LDDPM to learn more image feature information, leading to more stable training.

TABLE IV COMPARISON OF PSNR (DB) AND SSIM (%) METRICS FOR LDDPM WITH THE CONDITIONAL ENCODING MODULE ADDED, WITH THE BEST RESULTS BOLDED.

TABLE V COMPARISON OF PSNR (DB) AND SSIM (%) OF LDDPM WITH GLOW AND GAN; THE BEST RESULTS ARE SHOWN IN BOLD.

TABLE VI COMPARISON OF PSNR (DB) AND SSIM (%) FOR LDDPM WITH CONTENT LOSS AND STYLE LOSS; THE BEST RESULTS ARE SHOWN IN BOLD.