Product of Gaussian Mixture Diffusion Models

In this work we tackle the problem of estimating the density $ f_X $ of a random variable $ X $ by successive smoothing, such that the smoothed random variable $ Y $ fulfills the diffusion partial differential equation $ (\partial_t - \Delta_1)f_Y(\,\cdot\,, t) = 0 $ with initial condition $ f_Y(\,\cdot\,, 0) = f_X $. We propose a product-of-experts-type model utilizing Gaussian mixture experts and study configurations that admit an analytic expression for $ f_Y (\,\cdot\,, t) $. In particular, with a focus on image processing, we derive conditions for models acting on filter-, wavelet-, and shearlet responses. Our construction naturally allows the model to be trained simultaneously over the entire diffusion horizon using empirical Bayes. We show numerical results for image denoising where our models are competitive while being tractable, interpretable, and having only a small number of learnable parameters. As a byproduct, our models can be used for reliable noise level estimation, allowing blind denoising of images corrupted by heteroscedastic noise.


Introduction
The problem of estimating the probability density $f_X : \mathcal{X} \to \mathbb{R}$ of a random variable $X$ in $\mathcal{X}$, given a set of data samples $\{x_i\}_{i=1}^N$ drawn from $f_X$, has received significant attention in recent years [1][2][3][4][5]. The applications range from purely generative purposes [5,6], over classical image restoration problems [7][8][9], to medical image reconstruction [10][11][12]. This is a challenging problem in high dimensions (e.g. for images of size $M \times N$, i.e. $\mathcal{X} = \mathbb{R}^{M \times N}$), due to extremely sparsely populated regions [13]. A fruitful approach is to estimate the density at different times when it undergoes a diffusion process [3,5]. Intuitively, the diffusion equilibrates high- and low-density regions over time, thus easing the estimation problem.
Let $Y_t$ (carelessly) denote the random variable whose distribution is defined by diffusing $f_X$ for some time $t$. We denote the density of $Y_t$ by $f_Y(\,\cdot\,, t)$, which fulfills the diffusion partial differential equation (PDE) $(\partial_t - \Delta_1) f_Y(\,\cdot\,, t) = 0$ with initial condition $f_Y(\,\cdot\,, 0) = f_X$. The empirical Bayes theory [14] provides a machinery for reversing the diffusion PDE: Given an instantiation of the random variable $Y_t$, the Bayesian least-squares estimate of $X$ can be expressed solely using $f_Y(\,\cdot\,, t)$. Importantly, this holds for all positive $t$, as long as $f_Y$ is properly constructed.
In practice we wish to have a parametrized, trainable model of $f_Y$, say $f_\theta$ where $\theta$ is a parameter vector, such that $f_Y(x, t) \approx f_\theta(x, t)$ for all $x \in \mathcal{X}$ and all $t \in [0, \infty)$. Recent choices [3,15] for the family of functions $f_\theta(\,\cdot\,, t)$ were of practical nature: Instead of an analytic expression for $f_\theta$ at any time $t$, authors proposed a time-conditioned network in the hope that it can learn to behave as if it had undergone the diffusion PDE. Further, instead of worrying about the normalization $\int_{\mathcal{X}} f_Y(\,\cdot\,, t) = 1$ for all $t \in [0, \infty)$, usually they directly estimate the score $-\nabla_1 \log f_Y(\,\cdot\,, t) : \mathcal{X} \to \mathcal{X}$ with some network $s_\theta(\,\cdot\,, t) : \mathcal{X} \to \mathcal{X}$. This has the advantage that normalization constants vanish, but usually the constraint $\partial_j (s_\theta(\,\cdot\,, t))_i = \partial_i (s_\theta(\,\cdot\,, t))_j$ is not enforced in the architecture of $s_\theta$. Thus, $s_\theta(\,\cdot\,, t)$ is in general not the gradient of a scalar function (the negative-log-density it claims to model).
In contrast to this line of works, in this paper we pursue a more principled approach. Specifically, we leverage products of Gaussian mixture model (GMM) experts to model the distribution of responses of transformations acting on natural images. Here, an expert is a one-dimensional distribution modeling certain characteristics of the random variable $Y_t$ (the terminology is borrowed from [1]). In particular, we derive conditions under which $f_Y(\,\cdot\,, t)$ can be expressed analytically from $f_Y(\,\cdot\,, 0)$. We call our model product of Gaussian mixture diffusion model (PoGMDM). Code for training, validation, and visualization, along with pre-trained models, is available at https://github.com/VLOGroup/PoGMDM. This paper is organized as follows: In Section 2, we give background information on diffusion and how it can be used for parameter estimation of learned densities. This section also encompasses an overview of related work. In Section 3, we introduce the backbone of our models and derive conditions under which they obey the diffusion PDE. We demonstrate the practical applicability of our models in Section 4 with numerical experiments. We explore alternative parametrizations and possible extensions of our models in Section 5 and finally conclude the paper, providing future research directions, in Section 6.

Notation and Preliminaries
For the sake of simplicity, throughout this article we assume that all distributions admit a density with respect to the Lebesgue measure, although the numerical experiments only assume access to an empirical distribution. Thus, we use the terms distribution and density interchangeably. In Section 3, we define normal distributions that are supported on a subspace (e.g. the zero-mean subspace $\{x \in \mathbb{R}^n : \langle 1_{\mathbb{R}^n}, x \rangle_{\mathbb{R}^n} = 0\}$). In this case, we restrict our analysis to the support, which is theoretically supported by the disintegration theorem [16]. We use the symbols $\mathbb{R}_+$ and $\mathbb{R}_{++}$ to denote the non-negative real numbers $\{x \in \mathbb{R} : x \geq 0\}$ and the positive real numbers $\{x \in \mathbb{R} : x > 0\}$ respectively. We denote with $\langle\,\cdot\,,\,\cdot\,\rangle_{\mathbb{R}^n} : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R} : (x, y) \mapsto \sum_{i=1}^n x_i y_i$ the standard inner product in the Euclidean space $\mathbb{R}^n$, and with $\|\,\cdot\,\|_2 : \mathbb{R}^n \to \mathbb{R}_+$ the map $x \mapsto \sqrt{\langle x, x \rangle_{\mathbb{R}^n}}$. In addition, $(\,\cdot\, \otimes \,\cdot\,) : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^{n \times n}$ is the standard outer product in $\mathbb{R}^n$: $(x \otimes y)_{ij} = x_i y_j$. $\operatorname{conj}$ denotes element-wise complex conjugation. Let $Q \subset \mathcal{H}$ be a (not necessarily convex) subset of a Hilbert space $\mathcal{H}$. We define by $\operatorname{proj}_Q : \mathcal{H} \to \mathcal{H}$ the orthogonal projection onto the set $Q$. With slight abuse of notation, we ignore that this is a multivalued map in general. $L^2(\Omega)$ denotes the standard Lebesgue space on a domain $\Omega \subset \mathbb{R}^n$. Lastly, $\operatorname{Id}_{\mathcal{H}}$ and $1_{\mathcal{H}}$ denote the identity map and the one-vector in $\mathcal{H}$ respectively.

Background
In this section, we first emphasize the importance of diffusion in density estimation (and sampling) in high dimensions. Then, we detail the relationship between diffusing the density function, empirical Bayes, and denoising score matching [17].

Diffusion Eases Density Estimation and Sampling
Let $f_X$ be a density on $\mathcal{X} \subset \mathbb{R}^d$. A major difficulty in estimating $f_X$ with parametric models is that $f_X$ is extremely sparsely populated in high-dimensional spaces, i.e., $d \gg 1$. This phenomenon has many names, e.g. the curse of dimensionality or the manifold hypothesis [13]. Thus, the learning problem is difficult, since meaningful gradients are rare. Conversely, let us for the moment assume access to a model $\hat{f}_X$ that approximates $f_X$ well. In general, it is still very challenging to generate a set of points $\{x_i\}_{i=1}^I$ such that we can confidently say that the associated empirical density $\frac{1}{I} \sum_{i=1}^I \delta_{x_i}$ approximates $\hat{f}_X$ (let alone $f_X$) well. This is because, in general, there does not exist a procedure to directly draw samples from $\hat{f}_X$, and Markov chain Monte Carlo methods are prohibitively slow in practice, especially for multimodal distributions in high dimensions [3].
The diffusion PDE or heat equation $(\partial_t - \Delta_1) f(\,\cdot\,, t) = 0$ (1) equilibrates the density $f_X$, thus mitigating the challenges outlined above. Here, $\partial_t$ denotes the standard partial derivative with respect to time $\frac{\partial}{\partial t}$ and $\Delta_1 = \operatorname{Tr} \circ \nabla_1^2$ is the Laplace operator, where the subscript $1$ indicates its application to the first argument. We detail the evolution of $f_X$ under this diffusion PDE and relations to empirical Bayes in Section 2.2.
Learning $f(\,\cdot\,, t)$ for $t \geq 0$ is more stable since the diffusion "fills the space" with meaningful gradients [3]. Of course, this assumes that for different times $t_1$ and $t_2$, the models of $f(\,\cdot\,, t_1)$ and $f(\,\cdot\,, t_2)$ are somehow related to each other. As an example of this relation, the recently popularized noise-conditional score-network [3] shares convolution filters over time, but their input is transformed through a time-conditional instance normalization. In this work, we make this relation explicit by considering a family of functions $f(\,\cdot\,, 0)$ for which $f(\,\cdot\,, t)$ can be expressed analytically.
For sampling, $f(\,\cdot\,, t)$ for $t > 0$ can help by gradually moving samples towards high-density regions of $f_X$, regardless of initialization. To utilize this, a very simple idea with relations to simulated annealing [18] is to have a pre-defined time schedule $t_T > t_{T-1} > \dots > t_0 > 0$ and to sample $f(\,\cdot\,, t_i)$, $i = T, \dots, 0$ (e.g. with Langevin Monte Carlo [19]) successively [3]. In [15], instead of considering discrete time steps, the authors propose to model the sampling procedure as a continuous-time stochastic differential equation. We note that the diffusion PDE (1) on the densities corresponds to the stochastic differential equation $\mathrm{d}Y_t = \sqrt{2}\,\mathrm{d}W_t$ on the random variables, where $W$ is the standard Wiener process. This is known as the variance exploding stochastic differential equation in the literature.
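The annealed sampling idea can be sketched on a one-dimensional toy target. The following sketch (the two-mode GMM target, schedule, and step sizes are all illustrative assumptions, not values from our experiments) runs Langevin Monte Carlo at a decreasing sequence of diffusion times, using the closed-form score of the diffused mixture:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, w, s2 = np.array([-2.0, 2.0]), np.array([0.5, 0.5]), 0.09  # toy GMM target

def score(x, t):
    """grad_x log f(x, t) for the diffused GMM: component variances grow by 2t."""
    var = s2 + 2 * t
    comp = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var))
    resp = comp / comp.sum(axis=1, keepdims=True)  # posterior over components
    return (resp * (mu - x[:, None])).sum(axis=1) / var

# Annealed Langevin: sample f(., t_i) for a decreasing schedule t_T > ... > t_0.
ts = np.geomspace(4.0, 1e-3, 12)
x = np.sqrt(s2 + 2 * ts[0]) * rng.standard_normal(2000)
for t in ts:
    eps = 0.05 * (s2 + 2 * t)  # step size tied to the current scale
    for _ in range(50):
        x = x + eps * score(x, t) + np.sqrt(2 * eps) * rng.standard_normal(x.size)

# The samples should populate both modes at +-2 roughly evenly.
assert abs(x.mean()) < 0.3 and abs(np.abs(x).mean() - 2.0) < 0.2
```

Starting at a large $t$, where the diffused density is nearly unimodal, lets the chain spread over both basins before the annealing commits samples to the nearest mode.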

Diffusion, Empirical Bayes, and Denoising Score Matching
In this section, similar to the introduction, we again adopt the interpretation that the evolution in (1) defines the density of a random variable $Y_t$. That is, $Y_t$ is a random variable with probability density $f_Y(\,\cdot\,, t)$, which fulfills the diffusion PDE (1). It is well known that Green's function of (1) is a Gaussian (see e.g. [20]) with zero mean and covariance $2t \operatorname{Id}$. In other words, $f_Y(\,\cdot\,, t)$ is the convolution of $f_X$ with the density of $\mathcal{N}(0, 2t \operatorname{Id})$. Thus, the diffusion PDE constructs a (linear) scale space in the space of probability densities and we refer to $Y_t$ (respectively $f_{Y_t}$) as the smoothed random variable (respectively density). Equivalently, in terms of the random variables, we can write $Y_t = X + \sqrt{2t}\, N$, where $N$ is a random variable with normal distribution $\mathcal{N}(0, \operatorname{Id}_{\mathcal{X}})$. Next, we show how to estimate the corresponding instantiation of $X$ which has "most likely" spawned an instantiation of $Y_t$ using empirical Bayes.
In the school of empirical Bayes [14], we try to estimate a clean random variable given a corrupted instantiation, using only knowledge about the density of the random variable corresponding to the corrupted instance. In particular, for our setup we have the corruption model $y_t = x + \sqrt{2t}\,\eta$ (4) with $x \sim f_X$ and $\eta \sim \mathcal{N}(0, \operatorname{Id}_{\mathcal{X}})$.
A classical Bayesian approach would estimate $x$ from $y_t$ through the posterior expectation $\int x f_{X|Y_t}(x \mid y_t)\,\mathrm{d}x$, which requires choosing an appropriate prior $f_X$. However, a classical result from empirical Bayes estimation reveals that the map $y_t \mapsto \int x f_{X|Y_t}(x \mid y_t)\,\mathrm{d}x$ can be constructed only assuming access to the smoothed density $f_{Y_t}$, without any reference to the prior $f_X$. This result is known as the Miyasawa estimate [21] or Tweedie's formula [22,23] and we derive it here for completeness.
First, by the corruption model (4) we can write $f_{Y_t}(y) = \int_{\mathcal{X}} G_\sigma(y - x) f_X(x)\,\mathrm{d}x$, where $G_\sigma$ denotes the density of $\mathcal{N}(0, \sigma^2 \operatorname{Id}_{\mathcal{X}})$ and we use the relation $\sigma = \sqrt{2t}$. Thus, by Bayes' theorem it follows that $f_{X|Y_t}(x \mid y) = G_\sigma(y - x) f_X(x) / f_{Y_t}(y)$. Taking the gradient of $f_{Y_t}$ w.r.t. $y$ and multiplying by $\sigma^2$ yields, using $\sigma^2 \nabla_y G_\sigma(y - x) = (x - y) G_\sigma(y - x)$, $\sigma^2 \nabla f_{Y_t}(y) = \int_{\mathcal{X}} (x - y) G_\sigma(y - x) f_X(x)\,\mathrm{d}x$, and after dividing by $f_{Y_t}(y)$ it follows that $\sigma^2 \nabla \log f_{Y_t}(y) = \int_{\mathcal{X}} (x - y) f_{X|Y_t}(x \mid y)\,\mathrm{d}x$, where we used the definition of conditional densities. Finally, by noting that $\int_{\mathcal{X}} x f_{X|Y_t}(x \mid y)\,\mathrm{d}x = \mathbb{E}[X \mid Y_t = y]$, the above can be rewritten as $\mathbb{E}[X \mid Y_t = y] = y + \sigma^2 \nabla \log f_{Y_t}(y)$. (9) We refer to the work of Raphan and Simoncelli [23] for an empirical Bayes theory encompassing a more general family of corruptions. They refer to this type of estimator more generally as non-parametric empirical Bayes least-squares (NEBLS). We illustrate the idea of empirical MMSE estimation on a toy example in Fig. 1, where the data distribution consists of Dirac measures. The figure illustrates that $f_{Y_t}$ approaches a simple form as $t$ approaches infinity. Indeed, it has been shown [24] that $f_{Y_t}$ is log-concave for large enough $t$, and $-\log f_{Y_t}$ approaches a quadratic function.
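Tweedie's formula can be sanity-checked in a purely Gaussian toy case: if $f_X = \mathcal{N}(0, s^2)$, then $f_{Y_t} = \mathcal{N}(0, s^2 + 2t)$, and the empirical Bayes estimate must reproduce the textbook posterior mean $\frac{s^2}{s^2 + 2t}\, y$. A minimal numerical check (the values of $s^2$ and $t$ are illustrative):

```python
import numpy as np

# Prior f_X = N(0, s2); smoothed density f_{Y_t} = N(0, s2 + 2t).
s2, t = 1.5, 0.4
y = np.linspace(-3.0, 3.0, 7)

# Tweedie: xhat(y) = y + 2t * grad log f_{Y_t}(y), with the Gaussian score -y/(s2+2t).
score = -y / (s2 + 2 * t)
tweedie = y + 2 * t * score

# Classical posterior mean E[X | Y_t = y] for conjugate Gaussians.
posterior_mean = s2 / (s2 + 2 * t) * y
assert np.allclose(tweedie, posterior_mean)
```

Note that the estimator only ever touches the smoothed density (through its score), never the prior, which is the point of the empirical Bayes construction.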
Recently, (9) has been used for parameter estimation [15,17]: Let $\{x_i\}_{i=1}^I$ be a dataset of $I$ samples drawn from $f_X$ and let $Y_t$ be governed by diffusion. Additionally, let $f_\theta : \mathcal{X} \times [0, \infty) \to \mathbb{R}_+$ denote a parametrized model for which we wish that $f_\theta(\,\cdot\,, t) \approx f_{Y_t}$ for all $t > 0$. Then, both the left- and right-hand side of (9) are known, in expectation. This naturally leads to the loss function $\min_{\theta \in \Theta} \mathbb{E}_t\, \mathbb{E}_{(x, y_t) \sim f_{X, Y_t}} \big[ \| y_t + 2t \nabla_1 \log f_\theta(y_t, t) - x \|_2^2 \big]$. (11) Here, $f_{X, Y_t}$ denotes the joint distribution of the clean and smoothed random variables and $\Theta$ describes the set of feasible parameters. This learning problem is known as denoising score matching [15,17,25].
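A toy instance of denoising score matching makes the mechanics concrete. With a hypothetical one-parameter model $f_\theta = \mathcal{N}(0, \theta)$ (so the model score is $-y/\theta$; all numbers are illustrative assumptions), the empirical loss is minimized when $\theta$ matches the variance $s^2 + 2t$ of the smoothed density:

```python
import numpy as np

rng = np.random.default_rng(0)
s2, t = 1.0, 0.25                       # prior variance and diffusion time
x = np.sqrt(s2) * rng.standard_normal(100_000)
y = x + np.sqrt(2 * t) * rng.standard_normal(x.size)

def dsm_loss(theta):
    """Empirical denoising score matching loss for f_theta = N(0, theta)."""
    return np.mean((y + 2 * t * (-y / theta) - x) ** 2)

thetas = np.linspace(0.5, 3.0, 251)
theta_hat = thetas[np.argmin([dsm_loss(th) for th in thetas])]

# The minimizer recovers the variance of the *smoothed* density, s2 + 2t.
assert abs(theta_hat - (s2 + 2 * t)) < 0.1
```

Both sides of the objective are observable (the noisy sample, the model score, and the clean sample), so no normalization constant of the model ever enters the loss.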

Methods
In this section, we first introduce one-dimensional GMMs as the backbone of our model and recall some properties that are needed for the analysis in the following subsections. Then, we detail how we can utilize PoGMDMs based on filter-, wavelet-, and shearlet-responses to model the distribution of natural images. For all models, we present assumptions under which they obey the diffusion PDE.
The backbone of our models is the one-dimensional GMM expert $\psi_j : \mathbb{R} \times \triangle_L \times [0, \infty) \to \mathbb{R}_+$ with $L$ components, of the form $\psi_j(x, w_j, t) = \sum_{l=1}^{L} w_{jl} \frac{1}{\sqrt{2\pi \sigma_j^2(t)}} \exp\big( -\frac{(x - \mu_l)^2}{2 \sigma_j^2(t)} \big)$. (12) The weights of each expert $w_j = (w_{j1}, \dots, w_{jL})^\top$ must satisfy the unit simplex constraint, i.e., $w_j \in \triangle_L = \{ x \in \mathbb{R}^L : x_l \geq 0,\ \sum_{l=1}^L x_l = 1 \}$. Although not necessary, we assume for simplicity that all experts $\psi_j$ have the same number of components and that the discretization of their means $\mu_l$ over the real line is shared and fixed a priori (for details see Section 4.1).
The main contribution of our work is to show that, under certain assumptions, it suffices to adapt the variances of the individual experts to implement the diffusion of a model built through multiplying experts of the form (12). In detail, we show that the variance $\sigma_j^2 : [0, \infty) \to \mathbb{R}_+$ of the $j$-th expert can be modeled as $\sigma_j^2(t) = \sigma_0^2 + 2t c_j$, (13) where $\sigma_0 > 0$ is chosen a priori to support the uniform discretization of the means $\mu_l$ and $c_j \in \mathbb{R}_{++}$ are derived from properties of the product model such that it obeys the diffusion PDE (1).
In the following subsections we exploit two well-known properties of GMMs to derive how to adapt the variance $\sigma_j^2(t)$ of each expert with diffusion time $t$, such that the product model obeys the diffusion PDE: First, up to normalization, the product of GMMs is again a GMM, see e.g. [26]. This allows us to work with highly expressive models that enable efficient evaluations due to factorization. Second, we use the fact that there exists an analytical solution to the diffusion PDE if $f_X$ is a GMM: Green's function associated with the linear isotropic diffusion PDE (1) is a Gaussian with isotropic covariance $2t \operatorname{Id}$. Due to the linearity of the convolution, it suffices to analyse the convolution of individual components of the product model, which is just the convolution of two Gaussians. Using previous notation, if $X$ is a random variable with normal distribution $\mathcal{N}(\mu_X, \Sigma_X)$, then $Y_t$ follows the distribution $\mathcal{N}(\mu_X, \Sigma_X + 2t \operatorname{Id})$. In particular, the mean remains unchanged and it suffices to adapt the covariance matrix with the diffusion time.
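The second property can be checked numerically in one dimension: convolving a GMM with the heat kernel $\mathcal{N}(0, 2t)$ yields the same mixture with every component variance increased by $2t$. A minimal sketch (the mixture parameters, $\sigma_0$, and $t$ are illustrative):

```python
import numpy as np

# A 1-d GMM with shared component variance sigma0^2 (illustrative values).
mu = np.array([-1.0, 0.0, 1.0])
w = np.array([0.25, 0.5, 0.25])
sigma0, t = 0.5, 0.5

def gauss(x, m, var):
    return np.exp(-(x - m) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def gmm(x, var):
    return sum(wl * gauss(x, ml, var) for wl, ml in zip(w, mu))

# Diffuse f(., 0) numerically by convolving with the heat kernel N(0, 2t) ...
x = np.linspace(-20.0, 20.0, 4001)
dx = x[1] - x[0]
kernel = gauss(np.linspace(-8.0, 8.0, 1601), 0.0, 2 * t)
diffused = np.convolve(gmm(x, sigma0**2), kernel, mode="same") * dx

# ... and compare with the closed form: identical means, variances sigma0^2 + 2t.
center = np.abs(x) < 5.0
assert np.allclose(diffused[center], gmm(x[center], sigma0**2 + 2 * t), atol=1e-6)
```

The means and weights are untouched by the diffusion; only the shared variance grows linearly in $t$, which is exactly the degree of freedom (13) exposes.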
In what follows, we discuss three product models whose one-dimensional GMM experts act on filter-, wavelet-, and shearlet-responses. In particular, we present conditions under which the diffusion of the product model can be implemented by adapting the variances of the one-dimensional GMM experts. For all three models, we give an analytic expression for the constants $c_j$ in (13).

Patch Model
In this section, we approximate the distribution of image patches $p \in \mathbb{R}^a$ of size $a = b \times b$ by a product of $J \in \mathbb{N}$ GMM experts acting on filter-responses. In detail, the model is of the form $f^{\text{filt}}_\theta(p, t) = \frac{1}{Z(\{k_j\}_{j=1}^J, \sigma_0, t)} \prod_{j=1}^{J} \psi_j(\langle k_j, p \rangle, w_j, t)$ (14) with associated filters $k_j \in \mathbb{R}^a$, for all $t > 0$. We denote with $Z(\{k_j\}_{j=1}^J, \sigma_0, t)$ the partition function such that $f^{\text{filt}}_\theta$ is properly normalized. In this model, we can summarize the learnable parameters as $\theta = \{(k_j, w_j)\}_{j=1}^J$. First, the following theorem establishes the exact form of (14) as a GMM on $\mathbb{R}^a$. The covariance matrix and the means are endowed with the subscript $\mathbb{R}^a$ to emphasize that the resulting GMM models patches of this size; the models based on wavelet- and shearlet-responses discussed later (Section 3.2 and Section 3.3 respectively) can be applied to images of arbitrary size, which we emphasize by using the subscript $\mathbb{R}^n$.

Theorem 1 (Patch GMM). The model (14) is a homoscedastic GMM on $\mathbb{R}^a$ with precision matrix $(\Sigma_{\mathbb{R}^a})^{-1} = \sum_{j=1}^{J} \frac{k_j \otimes k_j}{\sigma_j^2(t)}$. The mean of the component identified by the choice of the index map $l : \{1, \dots, J\} \to \{1, \dots, L\}$ has the form $\mu_{\mathbb{R}^a}(l) = \Sigma_{\mathbb{R}^a} \sum_{j=1}^{J} \frac{\mu_{l(j)} k_j}{\sigma_j^2(t)}$.

Proof. By definition, the product (14) expands into a sum over all $L^J$ component selections. The general component of the above is uniquely identified by the choice of the map $l$. To find $(\Sigma_{\mathbb{R}^a})^{-1}$ and $\mu_{\mathbb{R}^a}(l)$, we match the gradient of the familiar quadratic form of a Gaussian.

The next theorem establishes a tractable analytical expression for the diffusion process under the assumption of pair-wise orthogonal filters, that is, $\langle k_i, k_j \rangle = 0$ for $i \neq j$. (19)

Theorem 2 (Patch diffusion). Under assumption (19), the model (14) obeys the diffusion PDE (1) if the expert variances are chosen as $\sigma_j^2(t) = \sigma_0^2 + 2t \|k_j\|^2$. Moreover, under (19) the Eigendecomposition of the precision matrix can be trivially constructed. In particular, the normalized filters $k_j / \|k_j\|$ are Eigenvectors with corresponding Eigenvalues $\|k_j\|^2 / \sigma_j^2(t)$.

Proof. We first show that $\psi_j(\,\cdot\,, w_j, t)$ models the marginal distribution of the random variable $U_{j,t} = \langle k_j, Y_t \rangle$. Consider one component of the resulting homoscedastic GMM: the marginal of a linear map of a Gaussian is again Gaussian (see [27] for a proof). Under our orthogonality assumption, this simplifies to $\mathcal{N}(\mu_{l(j)}, \sigma_0^2 + 2t \|k_j\|^2)$. The claim follows from the linear combination of the different components.
The normalization constant is the classical normalization of a Gaussian, which requires the pseudo-determinant of $\Sigma_{\mathbb{R}^a}$ [16]. The pseudo-determinant is easily calculated by the product of the Eigenvalues outlined in Theorem 2.
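The Eigenstructure asserted by Theorems 1 and 2 is easy to verify numerically: for pairwise orthogonal filters, each $k_j / \|k_j\|$ is an Eigenvector of the component precision matrix with Eigenvalue $\|k_j\|^2 / \sigma_j^2(t)$. A small sketch (random orthogonal filters; sizes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
a, J = 16, 16  # patch dimension and number of filters (illustrative)

# Orthogonal filter bank: columns of a random orthogonal matrix, freely scaled.
Q, _ = np.linalg.qr(rng.standard_normal((a, a)))
scales = rng.uniform(0.5, 2.0, size=J)
K = Q * scales                                # column j is k_j

sigma0_sq, t = 0.01, 0.3
norms_sq = np.sum(K**2, axis=0)
sigma_sq = sigma0_sq + 2 * t * norms_sq       # expert variances sigma_j^2(t)

# Precision matrix of one GMM component: sum_j (k_j outer k_j) / sigma_j^2(t).
prec = sum(np.outer(K[:, j], K[:, j]) / sigma_sq[j] for j in range(J))

# k_j/||k_j|| are Eigenvectors with Eigenvalues ||k_j||^2 / sigma_j^2(t).
for j in range(J):
    v = K[:, j] / np.sqrt(norms_sq[j])
    lam = norms_sq[j] / sigma_sq[j]
    assert np.allclose(prec @ v, lam * v)
```

The pseudo-determinant needed for the normalization constant is then simply the product of the non-zero Eigenvalues `norms_sq / sigma_sq`.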

Wavelet Model
The key ingredient in the previous section was the orthogonality of the filters. In other words, the filter bank $\{k_j\}_{j=1}^J$ forms an orthogonal (not necessarily orthonormal) basis for (a subspace of) $\mathbb{R}^a$. In this section, we discuss the application of explicit diffusion models in another well-known orthogonal basis: wavelets. In what follows, we briefly discuss the main concepts of the discrete wavelet transformation needed for our purposes. For the sake of simplicity, we stick to the one-dimensional case but note that the extension to two dimensions is straightforward, see e.g. [28, Chapter 4.4]. The following is largely adapted from [28]; we refer the reader to this and [29,30] for information on the extension to two-dimensional signals as well as efficient implementations using the fast wavelet transformation.

The discrete wavelet transformation
Let $\omega \in L^2(\mathbb{R})$ be a wavelet satisfying the admissibility condition $\int_{\mathbb{R}} \frac{|\hat{\omega}(\xi)|^2}{|\xi|}\,\mathrm{d}\xi < \infty$. The set of functions $\omega_{j,k} = 2^{-j/2}\, \omega(2^{-j}\,\cdot\, - k)$, $j, k \in \mathbb{Z}$, (21) forms an orthonormal basis of $L^2(\mathbb{R})$ under certain conditions that we now recall. Let $(V_j)_{j \in \mathbb{Z}}$ be a multiscale analysis with generator or scaling function $\phi \in V_0$, i.e. a nested sequence of closed subspaces $\dots \supset V_{-1} \supset V_0 \supset V_1 \supset \dots$ with $\bigcap_{j \in \mathbb{Z}} V_j = \{0\}$ and $\overline{\bigcup_{j \in \mathbb{Z}} V_j} = L^2(\mathbb{R})$, such that $u \in V_j \iff u(2\,\cdot\,) \in V_{j-1}$ and $\{\phi(\,\cdot\, - k) : k \in \mathbb{Z}\}$ is an orthonormal basis of $V_0$.
The scaling property of the multiscale analysis $(V_j)_{j \in \mathbb{Z}}$ implies that the functions $\phi_{j,k} = 2^{-j/2}\, \phi(2^{-j}\,\cdot\, - k)$ with $k \in \mathbb{Z}$ form an orthonormal basis of $V_j$. We define the detail or wavelet spaces $W_j$ as the orthogonal complements of the approximation spaces $V_j$ in $V_{j-1}$, i.e. $V_{j-1} = V_j \oplus W_j$. (24)
From this it follows that $V_j = \bigoplus_{m \geq j+1} W_m$ and, due to the density of $\bigcup_{j \in \mathbb{Z}} V_j$ in $L^2(\mathbb{R})$, that $L^2(\mathbb{R}) = \bigoplus_{j \in \mathbb{Z}} W_j$. By the orthogonality, we have that $\operatorname{proj}_{V_{j-1}} = \operatorname{proj}_{V_j} + \operatorname{proj}_{W_j}$ and hence $\operatorname{proj}_{W_j} = \operatorname{proj}_{V_{j-1}} - \operatorname{proj}_{V_j}$. Thus, any $u \in L^2(\mathbb{R})$ can be represented as $u = \sum_{j \in \mathbb{Z}} \operatorname{proj}_{W_j} u$, justifying the name multiscale analysis. Then (see [28, Theorem 4.67] for details) $\omega \in V_{-1}$, defined by the associated two-scale relation, is a wavelet, $\{\omega_{j,k} : k \in \mathbb{Z}\}$ is an orthonormal basis of $W_j$, and in particular the construction (21) is an orthonormal basis of $L^2(\mathbb{R})$.
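As a concrete finite-dimensional illustration, one level of the discrete Haar transform realizes exactly this split into approximation and detail spaces. The sketch below (pure NumPy, signal length and values illustrative) verifies orthonormality, perfect reconstruction, and that the detail coefficients of a constant signal vanish:

```python
import numpy as np

def haar_level(n):
    """One level of the orthonormal discrete Haar transform as an n x n matrix.
    Rows 0..n/2-1 are scaling (approximation) functions, rows n/2.. are wavelets."""
    A = np.zeros((n, n))
    for k in range(n // 2):
        A[k, 2 * k : 2 * k + 2] = [1 / np.sqrt(2), 1 / np.sqrt(2)]           # phi
        A[n // 2 + k, 2 * k : 2 * k + 2] = [1 / np.sqrt(2), -1 / np.sqrt(2)]  # omega
    return A

A = haar_level(8)
assert np.allclose(A @ A.T, np.eye(8))       # orthonormal rows
x = np.arange(8.0)
assert np.allclose(A.T @ (A @ x), x)         # perfect reconstruction
# Detail coefficients of a constant signal vanish: V_j absorbs smooth content.
assert np.allclose((A @ np.ones(8))[4:], 0.0)
```

Iterating the transform on the approximation half yields the usual multi-level decomposition, i.e. the finite-dimensional analogue of $u = \sum_j \operatorname{proj}_{W_j} u$.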

Modeling Wavelet Coefficients
In this section we describe how we can utilize a product of GMM experts to model the distribution of wavelet-responses. For the subsequent analysis, first observe that by (24) the detail spaces (and the approximation spaces) are orthogonal. Utilizing the shorthand notation $W_j$ for the orthogonal projection onto the $j$-th detail space, $W_j$ satisfies the properties $W_j = W_j^\top = W_j^2$ and $W_j|_{W_j} = \operatorname{Id}$ (identity on subspace), (28) where $W_j|_{W_j}$ denotes the restriction of $W_j$ to $W_j$. As in the previous section, we model the wavelet-responses with Gaussian mixture experts. In detail, let $x$ be a signal in $\mathbb{R}^n$ and thus $W_j : \mathbb{R}^n \to \mathbb{R}^n$. Then, the model reads $f^{\text{wav}}_\theta(x, t) = \frac{1}{Z} \prod_{j=1}^{J} \prod_{i=1}^{n} \psi_j\big((W_j x)_i, w_j, t\big)$. (29) Following the approach utilized in Theorem 2, we first describe the exact form of (29) as a GMM on $\mathbb{R}^n$. We denote with $l : \{1, \dots, n\} \to \{1, \dots, L\}$ a fixed but arbitrary selection from the index set $\{1, \dots, L\}$. In addition, the notation $\sum_{l}^{L^n}$ indicates the summation over all $L^n$ possible selections, and for the following derivation we define $\mu_{\mathbb{R}^n}(l) := (\mu_{l(1)}, \mu_{l(2)}, \dots, \mu_{l(n)})$. Expanding the product over the pixels using the index map $l$ and further expanding over the features shows that (29) describes a homoscedastic GMM on $\mathbb{R}^n$ with precision matrix $(\Sigma_{\mathbb{R}^n})^{-1} = \sum_{j=1}^{J} \frac{W_j^\top W_j}{\sigma_j^2(t)}$. (34) Using the properties of a projection (28) and the orthogonality property $W_i \perp W_j$ for $i \neq j$, this simplifies to $(\Sigma_{\mathbb{R}^n})^{-1} = \sum_{j=1}^{J} \frac{W_j}{\sigma_j^2(t)}$. Hence, in analogy to the model based on filter-responses, it suffices to adapt the variance of the one-dimensional GMMs $\psi_j$ with $\sigma_j^2(t) = \sigma_0^2 + 2t$. We can endow the different sub-bands of the wavelet transformation with scalars to weight their influence as follows: Replacing $W_j$ with $\lambda_j W_j$ in (29) (the derivation does not change up to (34)), we find that the variances must be adapted as $\sigma_j^2(t) = \sigma_0^2 + 2t \lambda_j^2$. Thus, the scaling parameters $\lambda_j$ are analogous to the filter-norms $\|k_j\|$ in Theorem 2.
We briefly discuss the extension to two-dimensional signals: Let $x \in \mathbb{R}^{n \times n}$ be a two-dimensional signal. $W^d_j : \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$ is a linear operator corresponding to the $j$-th detail level ($j \in \{1, \dots, J\}$, where $J \in \mathbb{N}$ is the coarsest scale in the decomposition) in the wavelet decomposition in direction $d$. We denote the (now three) detail spaces at scale $j$ as $W^d_j$, where $d \in \{\mathrm{v}, \mathrm{h}, \mathrm{d}\}$ indexes the direction (vertical, horizontal, and diagonal). Our model accounts for the directional sub-bands with individual GMM experts, i.e. every $\psi_j$ is replaced by a triplet $\psi^d_j$ endowed with weights $w^d_j \in \triangle_L$ for $d \in \{\mathrm{v}, \mathrm{h}, \mathrm{d}\}$. Then, the entire previous discussion holds, where in particular $W^d_i \perp W^{d'}_j$ for all $d, d' \in \{\mathrm{v}, \mathrm{h}, \mathrm{d}\}$ and all $i \neq j$. Since the operators $W^d_j$ are derived from a generating sequence $h \in \mathbb{R}^k$ (see Section 4.1.2), the learnable parameters are summarized as $\theta = \{h, \{\lambda^d_j\}_{j,d}, \{w^d_j\}_{j,d}\}$.

Interpretation as Diffusion Wavelet Shrinkage
Wavelet shrinkage is a popular class of denoising algorithms.
The key motivation behind wavelet shrinkage denoising algorithms is the observation that the wavelet coefficients of natural images are sparse, whereas the wavelet coefficients of noisy images are densely filled with "small" values. Thus, a straightforward denoising algorithm is to calculate the wavelet coefficients, "shrink" small coefficients towards zero, and calculate the inverse wavelet transform of the shrunken coefficients. Popular shrinkage operators include the soft-shrinkage $x \mapsto \operatorname{sgn}(x) \max\{|x| - \tau, 0\}$ and the hard-shrinkage $x \mapsto x\, \chi_{\{|x| > \tau\}}$. It is easy to see that these operators promote sparsity in the wavelet coefficients, as they correspond to the proximal maps w.r.t. $\tau \|\,\cdot\,\|_1$ and $\tau \|\,\cdot\,\|_0$ respectively. Here, $\tau > 0$ is a thresholding parameter that has to be chosen depending on the noise level.
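Both shrinkage operators are one-liners; the sketch below also checks numerically that the soft-shrinkage is indeed the proximal map of $\tau|\cdot|$ (the test values are illustrative):

```python
import numpy as np

def soft_shrink(x, tau):
    """Proximal map of tau * |.| : sgn(x) * max(|x| - tau, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def hard_shrink(x, tau):
    """Keep coefficients with magnitude above tau, zero out the rest."""
    return x * (np.abs(x) > tau)

# soft_shrink(y, tau) minimizes 0.5*(x - y)^2 + tau*|x| over x;
# compare against a brute-force grid search.
y, tau = 1.3, 0.5
grid = np.linspace(-3.0, 3.0, 60001)
brute = grid[np.argmin(0.5 * (grid - y) ** 2 + tau * np.abs(grid))]
assert abs(soft_shrink(y, tau) - brute) < 1e-3
```

Applied coefficient-wise to a wavelet decomposition, followed by the inverse transform, this yields the classical shrinkage denoiser described above.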
Historically, research for wavelet shrinkage models has focused on finding the optimal shrinkage parameter $\tau$ (w.r.t. some risk, e.g. the squared error), assuming a particular choice of the shrinkage operator (e.g. the soft-shrinkage). Popular selection methods include VisuShrink [31] and SureShrink [32]. The former is signal independent and the threshold is essentially determined by the dimensionality of the signal as well as the (assumed known) noise level. In contrast, the latter chooses the thresholding parameter depending on the energy in a particular sub-band and does not depend on the dimensionality of the signal explicitly. The BayesShrink [40] method is also sub-band adaptive, and the authors provide expressions (or at least good approximations) for the optimal thresholding parameter under a generalized Gaussian prior on the wavelet coefficients. In particular, they rely on classical noise level estimation techniques to fit the generalized Gaussian to the wavelet coefficients (of the noisy image) and arrive at a simple expression for a sub-band dependent threshold.
The general methodology outlined in the previous section allows us to take a different approach: Instead of fixing the thresholding function and estimating the threshold solely on the corrupted image, we instead propose to learn the distribution of wavelet coefficients in different sub-bands for all noise levels $\sigma > 0$. Notice that an empirical Bayes step on the wavelet coefficients under our model corresponds to applying a point-wise non-linearity.
In contrast to traditional wavelet shrinkage, our model does not prescribe a shrinkage function for which an optimal parameter has to be estimated for different noise levels. Rather, by learning the distribution of the wavelet coefficients at "all" noise levels, we have access to an MMSE-optimal "shrinkage" function in view of the empirical Bayes step on the experts. In addition, our wavelet prior can be used in more general inverse problems, whereas classical shrinkage methods are only applicable to denoising (although the denoising engine could be used in regularization by denoising [41] or plug-and-play [42] approaches).

Convolutional Model
The model based on filter-responses discussed in Section 3.1 cannot account for the correlation of overlapping patches when used for whole-image restoration [43,44]. Similarly, the model based on wavelet-responses is limited in expressiveness since it only models the distribution of a scalar random variable per sub-band. In what follows, we describe a convolutional PoGMDM that avoids the extraction and combination of patches in patch-based image priors and can account for the local nature of low-level image features. The following analysis assumes vectorized images $x \in \mathbb{R}^n$ with $n$ pixels; the generalization to higher dimensions is straightforward.
In analogy to the product-of-experts-type model acting on filter-responses, here we extend the fields-of-experts model [44] to our considered diffusion setting by accounting for the diffusion time $t$ and obtain $f^{\text{conv}}_\theta(x, t) = \frac{1}{Z} \prod_{j=1}^{J} \prod_{i=1}^{n} \psi_j\big((K_j x)_i, w_j, t\big)$. (36) Here, each expert $\psi_j$ models the density of convolution features extracted by convolution kernels $\{k_j\}_{j=1}^J$ of size $a = b \times b$, where $\{K_j\}_{j=1}^J \subset \mathbb{R}^{n \times n}$ are the corresponding matrix representations. Further, $w_j \in \triangle_L$ are the weights of the components of the $j$-th expert $\psi_j$ (see (12)). As with the models based on filter- and wavelet-responses, it is sufficient to adapt the variances $\sigma_j^2(t)$ with the diffusion time, as the following analysis shows.
(36) again describes a homoscedastic GMM on $\mathbb{R}^n$ with precision matrix $(\Sigma_{\mathbb{R}^n})^{-1} = \sum_{j=1}^{J} \frac{K_j^\top K_j}{\sigma_j^2(t)}$. (37) This can be seen by essentially following the derivation of (34) in Theorem 4, at which point we did not yet exploit the special structure of $W_j : \mathbb{R}^n \to \mathbb{R}^n$ (i.e. it may be an arbitrary linear operator).
In order to derive conditions under which (36) fulfills the diffusion PDE, we begin by fixing the convolutions as cyclic, i.e., $K_j x \equiv k_j *_n x$, where $*_n$ denotes a two-dimensional convolution with cyclic boundary conditions. Due to the assumed boundary conditions, the Fourier transformation $F$ diagonalizes the convolution matrices: $K_j = F^* \operatorname{diag}(F k_j) F$. Thus, the precision matrix can be expressed as $(\Sigma_{\mathbb{R}^n})^{-1} = F^* \operatorname{diag}\Big( \sum_{j=1}^{J} \frac{|F k_j|^2}{\sigma_j^2(t)} \Big) F$, where we used the fact that $F F^* = \operatorname{Id}_{\mathbb{R}^n}$ and $\operatorname{conj}(z) z = |z|^2$ (here $|\,\cdot\,|$ denotes the complex modulus acting element-wise on its argument). To get a tractable analytic expression for the variances $\sigma_j^2(t)$, we further assume that the spectra of the $k_j$ have disjoint support, i.e.

$\Gamma_i \cap \Gamma_j = \emptyset$ for $i \neq j$, (38)

where $\Gamma_j = \operatorname{supp} F k_j$. In addition, we assume that the magnitude is constant over the support, i.e.

$|F k_j| = \xi_j \chi_{\Gamma_j}$, (39)

where $\xi_j \in \mathbb{R}$ is the magnitude and $\chi_A$ is the characteristic function of the set $A$.

Theorem 5 (Convolutional Diffusion). Under assumptions (38) and (39), the model (36) obeys the diffusion PDE (1) if the variances are chosen as $\sigma_j^2(t) = \sigma_0^2 + 2t \xi_j^2$.

Proof. In analogy to Theorem 2, we consider (37) under diffusion. The inner sum decomposes over the disjoint supports using (38), and with (39) the numerator reduces to $\sigma_0^2 + 2t \xi_j^2$.
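The Fourier diagonalization used above is easy to verify numerically. The following one-dimensional sketch (illustrative sizes; the two-dimensional case is analogous) checks that cyclic convolution is diagonalized by the DFT and that the Gram matrix $K^\top K$, which enters the precision matrix (37), has spectrum $|F k|^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 64
k = np.zeros(n)
k[:5] = rng.standard_normal(5)   # small, compactly supported kernel k_j
x = rng.standard_normal(n)       # test signal

# Cyclic convolution K x, computed directly and via K = F^* diag(F k) F.
Kx_direct = np.array([sum(k[m] * x[(i - m) % n] for m in range(n)) for i in range(n)])
Kx_fft = np.fft.ifft(np.fft.fft(k) * np.fft.fft(x)).real
assert np.allclose(Kx_direct, Kx_fft)

# K^T K is diagonalized with spectrum |F k|^2: K^T is cyclic correlation.
KtKx_direct = np.array(
    [sum(k[m] * Kx_direct[(i + m) % n] for m in range(n)) for i in range(n)]
)
KtKx_fft = np.fft.ifft(np.abs(np.fft.fft(k)) ** 2 * np.fft.fft(x)).real
assert np.allclose(KtKx_direct, KtKx_fft)
```

With disjoint, flat spectra as in (38) and (39), the diagonal of the precision matrix becomes piecewise constant with values $\xi_j^2 / \sigma_j^2(t)$ on each $\Gamma_j$.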
We emphasize that the convolutional model (36) is distinctly different from the model based on filter-responses discussed in Section 3.1. In particular, the one-dimensional GMM experts $\psi_j(\,\cdot\,, w_j, t)$ do not model the distribution of the filter-responses of their corresponding filter kernels $k_j$, but account for the non-trivial correlation of overlapping patches. We discuss this in more detail in Section 5.3.

Shearlets
In the previous section, we derived abstract conditions under which a product of one-dimensional GMM experts, with each expert modeling the distribution of convolutional features, can obey the diffusion PDE. In particular, we derived that the spectra of the corresponding convolution filters must be non-overlapping and constant on their support. Naturally, the question arises how such a filter bank can be constructed. Luckily, the shearlet transformation [45] (and in particular the non-separable version of [46]) fulfills these conditions. As an extension of the wavelet transformation, the shearlet transformation [45] can represent directional information in multidimensional signals via shearing. Here, we consider the non-separable digital shearlet transformation [46], whose induced frequency tiling is shown schematically in Fig. 2. In particular, the frequency plane is partitioned into non-overlapping cones indexed by the scaling and shearing parameters described in the next paragraph.

Fig. 2: Schematic illustration of the cone-like frequency tiling of the non-separable shearlet transformation [46]. The shearlets are constructed such that their spectra are non-overlapping.
We briefly describe our setup but refer the interested reader to [46,47] for more details. We construct a digital shearlet system, specified by the positive scaling integer $j = 0, \dots, J$, the translations $m \in \mathbb{Z}^2$, and the shearing $|k| \leq \lceil 2^{\lfloor j/2 \rfloor} \rceil$. The system is constructed from a one-dimensional low-pass filter $h_1$ and a two-dimensional directional filter $P$. Given one-dimensional filters $h_{J-j/2}$ and $g_{J-j}$ derived from $h_1$ in a wavelet multiresolution analysis, let $W_j = g_{J-j} \otimes h_{J-j/2}$ and let $p_j$ be the Fourier coefficients of $P$ at scaling level $j$. Then, the system is constructed from $W_j$, $p_j$, and the shearing operator $S_k$ via up- and down-sampling, where $\uparrow_a$ and $\downarrow_a$ denote $a$-fold up- and down-sampling operators. The digital shearlet transformation of an image $x \in \mathbb{R}^{n \times n}$ is then given by the responses to these filters, weighted by learnable weights $\lambda_{j,k} > 0$ that reflect the importance of the respective scale and shear level. The learnable weights $\lambda_{j,k} > 0$ are easily accounted for in the diffusion model by adapting $\xi_{j,k}$ in (39) (where we have swapped the index $j$ for a two-index $j, k$ to account for the scales and shearing levels). Thus, we can summarize the learnable parameters for the model based on shearlet-responses as $\theta = \{h_1, P, \{\lambda_{j,k}\}_{j,k}, \{w_{j,k}\}_{j,k}\}$.

Numerical Results
In this section, we first detail the setup for numerical optimization. In particular, we discuss how we can learn the one-dimensional GMM experts along with the corresponding transformation (filters, wavelets, and shearlets) jointly. Then, we show results for denoising utilizing a simple one-step empirical Bayes scheme as well as denoising algorithms derived for classical diffusion models. In addition, we show that we can use our models for noise level estimation and blind heteroscedastic denoising, and exploit Corollary 1 to derive a direct sampling scheme.

Numerical Optimization
For the numerical experiments, $f_X$ reflects the distribution of rotated and flipped $b \times b$ patches from the 400 gray-scale images in the BSDS 500 [48] training and test set, with each pixel value in the interval $[0, 1]$. We optimize the score matching objective function (11) using projected AdaBelief [49] for 100 000 steps. We approximate the infinite-time diffusion PDE by uniformly drawing $\sqrt{2t}$ from the interval $[0, 0.4]$. For the denoising experiments, we utilize the validation images from [44] (also known as "Set68"). Due to computational constraints, we utilize only the first 15 images of the dataset according to a lexicographic ordering of the filenames. In addition, our wavelet- and shearlet-toolboxes only allow the processing of square images. To avoid boundary artifacts arising through padding images to a square, we only utilize the central region of size $320 \times 320$.
For all experiments, $\psi_j$ is an $L = 125$ component GMM, with equidistant means $\mu_l$ in the interval $[-\eta_j, \eta_j]$ (we discuss the choice of $\eta_j$ for the different models in their respective sections). To support the uniform discretization of the means, the standard deviation of the $j$-th expert is $\sigma_0 = \frac{2 \eta_j}{L - 1}$. In the one-dimensional GMM backbone of all models, we have to project a weight vector $w \in \mathbb{R}^L$ onto the unit simplex $\triangle_L$. We realize the projection $\operatorname{proj}_{\triangle_L} : \mathbb{R}^L \to \mathbb{R}^L$ with the sorting-based method proposed by [50], which is summarized in Algorithm 2. In addition, we further assume that the one-dimensional GMM experts are symmetric around $0$, i.e. that the weights $w_j$ are in the set $\{x \in \mathbb{R}^L : (x \in \triangle_L) \wedge (x = \overleftarrow{x})\}$, where $\overleftarrow{x}$ denotes the reversal of $x$. We implement this by storing only $\lceil L/2 \rceil$ weights and mirroring the tail of $\lceil L/2 \rceil - 1$ elements prior to the projection algorithm and function evaluations. To ensure that the one-dimensional GMM experts are sufficiently peaky around zero (i.e. that one component has mean exactly zero), we always choose $L$ to be odd.
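The sorting-based simplex projection can be sketched in a few lines. The sketch below implements the standard algorithm in the spirit of [50] (it is not necessarily the exact pseudocode of Algorithm 2):

```python
import numpy as np

def proj_simplex(w):
    """Euclidean projection of w onto the unit simplex (sorting-based)."""
    u = np.sort(w)[::-1]                 # sort in decreasing order
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(w) + 1)
    rho = np.nonzero(u > css / ks)[0][-1]  # largest k with u_k > (cumsum_k - 1)/k
    tau = css[rho] / (rho + 1.0)           # optimal shift
    return np.maximum(w - tau, 0.0)

# A point already on the simplex is a fixed point ...
assert np.allclose(proj_simplex(np.array([0.3, 0.7])), [0.3, 0.7])
# ... and a general point lands on the simplex.
p = proj_simplex(np.array([0.2, 1.4, -0.3]))
assert np.all(p >= 0.0) and np.isclose(p.sum(), 1.0)
```

In our setting this is applied to the (mirrored) weight vectors $w_j$ after every optimizer step, so the simplex constraint holds exactly throughout training.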
In the next sections, we detail the constraints the building blocks of the learned transformations have to fulfill and how to satisfy them in practice.

Learning Orthogonal Filters
Let K = [k_1, k_2, . . . , k_J] ∈ R^{a×J} denote the matrix obtained by horizontally stacking the filters.

Algorithm 1: Algorithm for orthogonalizing a set of filters K.

Other than positivity, we do not place any restrictions on λ_1, . . . , λ_J, as these are related to the precisions in our model. Thus, we rewrite the objective as a joint minimization over a semi-orthogonal matrix O and a non-negative diagonal matrix D, with ⟨ · , · ⟩_F denoting the Frobenius inner product.
We propose the following alternating minimization scheme for finding O and D. The solution of the reduced sub-problem in O can be computed by setting O = U, using the polar decomposition DK⊤ = UP, where U is semi-unitary (U⊤U = Id) and P = P⊤ ⪰ 0. The sub-problem in D is solved by setting D_{i,i} = ((O⊤K)_{i,i})_+. The algorithm is summarized in Algorithm 1; we empirically observed fast convergence, with B = 3 steps already yielding satisfactory results. A preliminary theoretical analysis of the algorithm is presented in the supplementary material of our conference paper [7].
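A minimal sketch of this alternating scheme, under the assumption that the objective is the Frobenius distance between K and the product of a semi-orthogonal O and a non-negative diagonal D (function and variable names are ours; the O-step computes the polar factor via an SVD):

```python
import numpy as np

def orthogonalize_filters(K, steps=3):
    """Alternating minimization sketch for
    min_{O^T O = Id, D diagonal >= 0} ||K - O D||_F,
    with K in R^{a x J} and J <= a (cf. Algorithm 1)."""
    a, J = K.shape
    d = np.ones(J)
    for _ in range(steps):
        # O-step: polar factor of K @ diag(d), i.e. the closest
        # semi-orthogonal matrix, obtained from the SVD.
        U, _, Vt = np.linalg.svd(K * d, full_matrices=False)
        O = U @ Vt
        # D-step: non-negative part of the diagonal of O^T K.
        d = np.maximum(np.einsum('ij,ij->j', O, K), 0.0)
    return O, d
```

The returned O has orthonormal columns, and the fit never gets worse than the trivial choice D = 0.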
Assuming a patch size of a = b × b, we use J = b² − 1 filters spanning the space of zero-mean patches Z = {x ∈ R^a : ⟨1_{R^a}, x⟩ = 0}. We found that when implementing proj_{O∩Z} as proj_Z ∘ proj_O, both constraints were almost exactly fulfilled. To ensure the correct projection, an alternative would be to utilize Dykstra's projection algorithm [51]. The filters are initialized by independently drawing their entries from a zero-mean Gaussian distribution with standard deviation b^{−1}. Since the filters can be freely scaled, we simply choose η_j = 1 for all j = 1, . . . , J.
To visually evaluate whether our learned model matches the empirical marginal densities for any diffusion time t, we plot them in Fig. 3. At the top, the learned 7 × 7 orthogonal filters k_j are depicted. The filters bear striking similarity to the eigenimages of the covariance matrices of [43, Fig. 6], who learn a GMM directly on the space of image patches (i.e. without any factorizing structure). This comes as no surprise, since the construction of the patch model (14) can be interpreted as "learning the eigendecomposition"; see Theorem 1 and the proof of Theorem 2. The learned potential functions − log ψ_j( · , w_j, t) and activation functions −∇ log ψ_j( · , w_j, t) associated with the j-th filter are shown below the filters in Fig. 3. Indeed, the learned potential functions match the negative-log empirical marginal response histograms visualized at the bottom almost perfectly, even at extremely low-density tails. This supports the theoretical argument, outlined in the introductory sections, that diffusion eases the problem of density estimation.

Learning Wavelets
The discrete wavelet transformation is characterized by the sequence h ∈ R^K. In addition to learning the parameters of the one-dimensional GMM, we follow [52] and also learn h. The sequence h defines the scaling function ϕ and the wavelet function ω through the usual two-scale relations, with the wavelet sequence g obtained from h by an alternating-sign flip. For ω to be a wavelet, it must satisfy the admissibility criterion (cf. [29]), from which it immediately follows that (Fω)(0) = ∫ω = 0. For the transformation to be unitary, we additionally need ∫ϕ = 1 and orthogonality of h to its even circular shifts. These constraints describe the feasible set Ω of wavelet-generating sequences. Here, ⟲_n : R^K → R^K rolls its argument by n entries, i.e. ⟲_n x = (x_{K−n+1}, x_{K−n+2}, . . . , x_K, x_1, x_2, . . . , x_{K−n})⊤.

Fig. 3: Learned filters k_j (top; the intervals show the values of black and white respectively, amplified by a factor of 10), potential functions − log ψ_j( · , w_j, t) and activation functions −∇ log ψ_j( · , w_j, t). On the bottom, the negative-log empirical marginal response histograms are drawn.
Observe that the orthogonality condition encodes K/2 constraints (we assume that K is even). To project onto Ω, we write the projection problem in its Lagrangian form and find stationary points by solving the associated nonlinear least-squares problem using 10 iterations of Gauss-Newton. To facilitate convergence, we warm-start the Lagrange multipliers Λ_scal, Λ_adm, Λ with the solution from the previous outer iteration. We initialize the sequence h with the generating sequences of the db2 (K = 4) and db4 (K = 8) wavelets. For both, we utilize J = 2 levels. We use the pytorch_wavelets [53] implementation of the discrete wavelet transformation.
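As a sanity check, the scaling, admissibility, and orthogonality constraints can be verified numerically for the db2 generating sequence (a sketch assuming the normalization sum(h) = √2; the admissibility check uses the standard alternating-flip wavelet sequence):

```python
import numpy as np

# db2 generating sequence, normalized so that sum(h) = sqrt(2)
s3 = np.sqrt(3.0)
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))

# scaling constraint: sum(h) = sqrt(2) (so that the integral of phi is 1)
assert np.isclose(h.sum(), np.sqrt(2.0))

# admissibility: the alternating-flip wavelet sequence sums to zero
g = (-1.0) ** np.arange(4) * h[::-1]
assert np.isclose(g.sum(), 0.0)

# orthogonality to even circular shifts: <h, roll(h, 2n)> = delta_{n,0}
assert np.isclose(h @ h, 1.0)
assert np.isclose(h @ np.roll(h, 2), 0.0)
```

Learned sequences stay in this feasible set only up to the Gauss-Newton tolerance, which is what the projection step enforces.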
In contrast to the model based on filter responses, the model based on wavelet responses does not have the freedom to adapt the scaling of the filters. To overcome this, we discretize the means over the real line individually for each sub-band. In detail, for the j-th level and d-th direction, d ∈ {h, v, d}, we choose η_j^d = 1.1 q_j^d, where q_j^d is the 0.999-quantile of the corresponding responses.

Algorithm 2: Simplex projection from [50] (input: x ∈ R^m; output: y = proj_{△_m}(x)).
The initial and learned generating sequences, their corresponding scaling and wavelet functions, along with the learned potential functions and MMSE shrinkage, are shown in Fig. 4. In these figures, it is apparent that our chosen parametrization is sub-optimal. In particular, in order to represent the heavy tails (especially for level j = 1), many intermediate weights are set to 0. This leads to the MMSE shrinkage functions becoming step-like. We emphasize that this is a practical problem of choosing the appropriate parametrization; we discuss alternatives to our equispaced GMM in Section 5.

Learning Shearlets
We initialize the one-dimensional low-pass filter h_1 and the two-dimensional directional filter P with the standard choices from [47]: h_1 is initialized as a maximally flat 9-tap symmetric low-pass filter, and P is initialized as the maximally flat fan filter described in [54]. Furthermore, λ_{j,k} is initialized as 1 for all scale levels j and shearings k, and we set η_{j,k} = 0.5.
The projection operators can be realized as follows: The projection onto the non-negative real line is simply the positive-part operation, and the projection onto the linear constraint encoded in H is the corresponding orthogonal projection. The projection onto the unit one-norm sphere is proj_P(x) = sgn(x) ⊙ proj_{△_m}(|x|) (see e.g. [55,56]), where we ignore the degenerate case of projecting the origin, at which proj_P is not well defined. Our implementation of the shearlet transformation is based on the ShearLab 3D [47] toolbox.
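The one-norm-sphere projection proj_P(x) = sgn(x) ⊙ proj_{△_m}(|x|) can be sketched on top of a standard sorting-based simplex projection (a minimal version; names are ours):

```python
import numpy as np

def proj_simplex(x):
    """Sorting-based Euclidean projection onto the unit simplex."""
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, x.size + 1) > css - 1.0)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(x - tau, 0.0)

def proj_l1_sphere(x):
    """Projection onto the unit one-norm sphere: project the magnitudes
    onto the simplex and restore the signs (degenerate x = 0 ignored)."""
    return np.sign(x) * proj_simplex(np.abs(x))
```

By construction, the output always has unit one-norm and preserves the sign pattern of the input.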
For the numerical experiments, we chose J = 2 scales and 5 shearings (k ∈ {−2, . . . , 2}). We show the initial and learned filter weights λ_{j,k}, the one-dimensional low-pass filter h_1, and the two-dimensional directional filter P in Fig. 5. The resulting shearlet system in the frequency and time domain, along with the learned potential functions, is shown in Fig. 6. We again emphasize that the learned one-dimensional potential functions ψ_{j,k}( · , w_{j,k}, t) are distinctly different from those of the other models. In particular, they exhibit multiple local minima, sometimes different from 0, such that certain image structures can be enhanced under this prior. This is in stark contrast to the learned filter and wavelet responses, which show a single minimum at 0 and the classical heavy-tailed shape.
Fig. 6 also shows that the shearlet system only approximately fulfills the assumptions (38) and (39). We analyze the shearlet system with respect to the assumption of disjoint support (38) by visualizing the pair-wise cosine similarity of the magnitudes of the spectra in Fig. 7, over all scales, shearings k ∈ {−2, . . . , 2}, and both cones. Although less pronounced for the learned shearlet system, the plot is dominated by the main diagonal, indicating that the corresponding spectra are almost non-overlapping. To meet the theoretical assumptions, it would be possible to penalize the pair-wise overlap of the spectra during training. The fact that the spectra are not constant over their support raises the question of how to choose the ξ_a that best approximate (39). During training and evaluation, we simply chose ξ_a = ∥|γ_a|∥_∞, where a is a two-index ranging over the chosen scale-shearing grid. It remains an open question how the violation of the constraints (38) and (39) influences the diffusion, and whether there exists a better choice for ξ_a.

Image Denoising
To exploit our patch-based prior for whole-image denoising, following [43], we define the expected patch log-likelihood of a noisy image y ∈ R^n with variance σ²(t) = 2t. Here, ñ denotes the total number of overlapping patches (e.g. for n = 4 × 4 and a = 3 × 3, ñ = 4), P_i : R^n → R^a denotes the patch-extraction matrix for the i-th patch, and p_i = (Σ_{j=1}^{ñ} P_j⊤ P_j)_{i,i} counts the number of overlapping patches to compensate for boundary effects (see [57, Appendix B] for a more rigorous discussion). The wavelet- and shearlet-based priors can act on images of arbitrary size.
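The overlap counts p_i have a simple interpretation: for each pixel, they count how many patches cover it. A brute-force sketch (our own helper, for illustration only):

```python
import numpy as np

def overlap_counts(shape, b):
    """Per-pixel count of how many b x b patches (stride 1, no padding)
    cover it; this is the diagonal of sum_j P_j^T P_j."""
    H, W = shape
    counts = np.zeros((H, W))
    for i in range(H - b + 1):
        for j in range(W - b + 1):
            counts[i:i + b, j:j + b] += 1
    return counts
```

For the 4 × 4 image with 3 × 3 patches from the example, the four corner pixels are covered once and the four central pixels four times, which is exactly the boundary effect p_i compensates for.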
Let log f_θ be either epll_θ^filt, log f_θ^wave, or log f_θ^conv. We consider two inference methods. The first is the one-step empirical Bayes estimate y_t ↦ y_t + 2t ∇ log f_θ(y_t, t), which corresponds to the Bayesian MMSE estimator. Notice that, in the case of log f_θ = epll_θ^filt, the estimator computes patch-wise MMSE estimates and combines them by averaging. This is known to be a sub-optimal inference strategy, since the averaged patches are not necessarily likely under the model [43]. The discussion of algorithms utilizing patch-based priors for whole-image restoration is beyond the scope of this article; we refer the interested reader to the works of [43,57] for a detailed discussion of this topic. In addition, we refer to our previous conference publication [7], in which we present a proximal gradient continuation scheme that slightly improves over the empirical Bayes estimate by allowing patch-crosstalk.

Fig. 4: Left: db2 initial generating sequence (K = 4). Right: db4 initial generating sequence (K = 8). From top to bottom: initial generating sequence h along with the corresponding scaling and wavelet functions ϕ and ω, learned generating sequence h along with the corresponding scaling and wavelet functions ϕ and ω, learned potential functions − log ψ( · , w, t), and the learned MMSE shrinkage functions y_t ↦ y_t + 2t ∇ log ψ(y_t, w, t).
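The one-step empirical Bayes update y_t ↦ y_t + 2t ∇ log f(y_t, t) can be illustrated in a scalar toy setting where the diffused density is known in closed form: for a Gaussian prior X ∼ N(0, s²), the update reduces to the Wiener/MMSE estimate (a sketch under this Gaussian assumption, not our learned model):

```python
import numpy as np

rng = np.random.default_rng(1)
s, t = 1.0, 0.02
x = rng.normal(0.0, s, size=100_000)
y = x + np.sqrt(2 * t) * rng.normal(size=x.size)   # y_t = x + sqrt(2t) eta

# diffused density f_Y( . , t) = N(0, s^2 + 2t), so the score is linear
score = -y / (s**2 + 2 * t)
x_hat = y + 2 * t * score          # one-step empirical Bayes update

# for a Gaussian prior this is the Wiener/MMSE estimate, so its MSE
# must fall below the raw noise variance 2t
mse = np.mean((x_hat - x) ** 2)
```

Here the update collapses to x_hat = s² y / (s² + 2t), with MSE 2t s² / (s² + 2t) < 2t.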
The second inference method we consider is the stochastic denoising algorithm proposed by [58] and summarized in Algorithm 3. In detail, this algorithm proposes a sampling scheme to approximately sample from the posterior of a denoising problem when utilizing diffusion priors. This is achieved by properly weighting the score ∇ log f_θ with the gradient of the data term while annealing the noise level. Sampling from the posterior, as opposed to directly computing MMSE estimates with an empirical Bayes step, is known to produce sharper results when utilizing modern, highly expressive diffusion models [58,59]. We chose ϵ = 5 × 10^{−6}, σ_C = 0.01, and an exponential noise schedule in i/C, using B = 3 inner loops and C = 100 diffusion steps. Let x ∈ R^n denote a test sample from the distribution f_X, and let x̂ denote the estimate of x given y_t = x + √(2t) η, where η ∼ N(0, Id_{R^n}), obtained through either of the discussed inference methods. In Table 1, we show a quantitative evaluation utilizing the standard metrics peak signal-to-noise ratio (PSNR), i.e. 10 log_10 of the squared peak value over the mean squared error, and structural similarity (SSIM) [60] with a window size of 7 and the standard parameters K_1 = 0.01 and K_2 = 0.03. The column with the heading "Patch-GSM" utilized the Gaussian scale mixture parametrization discussed in Section 5.1. The results are obtained for one run of the algorithms, i.e.
we did not compute the expectation over the noise (neither in the construction of y_t nor during the iterations of the stochastic denoising algorithm). However, we did not observe any significant variation between runs. The quantitative evaluation shows impressive results for the model based on shearlet responses, despite it having very few trainable parameters. In particular, it performs best across all noise levels and inference methods, with the exception of one-step empirical Bayes denoising at σ = 0.2. There, the patch-based model with a = 15 × 15 performs best, but notably has about 50 times the number of trainable parameters. By leveraging symmetries between the cones in the shearlet system, the number of trainable parameters could even be approximately halved. These symmetries are strongly apparent in Fig. 6, where the potential functions of the second cone (the rightmost 5 potentials) are almost a perfect mirror image of the potential functions of the first cone (the leftmost 5 potential functions).
Additionally, the table reveals that the empirical Bayes estimator beats stochastic denoising in every quality metric. This is not surprising, as, in expectation, it is the optimal estimator in the MMSE sense, which directly corresponds to PSNR. Comparing the qualitative evaluation in Fig. 8 (empirical Bayes) to Fig. 9 (stochastic denoising), we do not observe sharper images from the stochastic denoising algorithm; we are unsure why.
The analysis of the posterior variance is outside the scope of this paper. However, techniques for analyzing the posterior induced by diffusion models are also readily applicable to our models. In particular, we refer to [58] and related papers such as [61] for an in-depth discussion of these techniques.

Noise Estimation and Blind Image Denoising
Within this and the following subsection, we describe two applications that arise as byproducts of our principled approach: noise estimation (and, consequently, blind denoising) and analytic sampling. For both, we utilize the model based on filter responses as a stand-in, but emphasize that similar results also hold for the models based on wavelet and shearlet responses.
The construction of our model allows us to interpret f_θ^filt( · , t) as a time-conditional likelihood density. Thus, it can naturally be used for noise level estimation: We assume a noisy patch y constructed as y = x + ση, where x ∼ f_X, η ∼ N(0, Id_{R^n}), and σ is unknown. We can estimate the noise level by maximizing the likelihood of y with respect to the diffusion time t, i.e. t̂ = arg max_t f_θ^filt(y, t), and recover the noise level via σ̂ = √(2t̂).
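This grid-search estimator is easy to sketch; the following toy example (our own construction, with a degenerate prior X = 0 so that f( · , t) = N(0, 2t) is known in closed form) recovers the injected noise level:

```python
import numpy as np

def estimate_noise_level(y, log_density, t_grid):
    """Hypothetical helper: pick the diffusion time t maximizing the model
    log-likelihood of y, then return sigma_hat = sqrt(2 t_hat)."""
    ll = [log_density(y, t) for t in t_grid]
    t_hat = t_grid[int(np.argmax(ll))]
    return np.sqrt(2.0 * t_hat)

# toy check: X = 0 (Dirac prior), so the diffused density is N(0, 2t)
rng = np.random.default_rng(2)
sigma_true = 0.2
y = sigma_true * rng.normal(size=50_000)
log_density = lambda y, t: np.mean(-y**2 / (4 * t) - 0.5 * np.log(4 * np.pi * t))
sigma_hat = estimate_noise_level(y, log_density, np.linspace(1e-4, 0.08, 200))
```

In the paper's setting, log_density would be replaced by the learned log f_θ^filt( · , t); the grid search itself is unchanged.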
To demonstrate the feasibility of this approach, Fig. 10 shows the expected negative-log density E_{p∼f_X, η∼N(0,Id)} l_θ(p + ση, t) over a range of σ and t. The resulting noise level estimate σ ↦ √(2 t̂(σ)), with t̂(σ) = arg min_t E_{p∼f_X, η∼N(0,Id)} l_θ(p + ση, t), matches the identity map almost perfectly. In addition, we can leverage this noise level estimation procedure to perform blind heteroscedastic denoising with the same model as follows: First, for all ñ overlapping patches P_j y in the corrupted image, we estimate the noise level through t̂_j = arg max_t f_θ^filt(P_j y, t). Given the noise levels t̂_j, we estimate the clean image with an empirical Bayes step in which, for each patch P_j y, we utilize the estimated noise level t̂_j. In Fig. 11, the original image is corrupted by heteroscedastic Gaussian noise with standard deviations 0.1 and 0.2 in a checkerboard pattern, which is clearly visible in the noise level map. In the restored image and in the absolute difference to the reference, the checkerboard pattern is no longer apparent.

Sampling
A direct consequence of Corollary 1 is that our models admit a simple sampling procedure: the statistical independence of the components allows drawing random patches component-wise, where U_{j,t} is a random variable on R sampled from the one-dimensional GMM ψ_j( · , w_j, t). The samples in Fig. 12 indicate a good match over a wide range of t. However, for small t the generated patches appear slightly noisy, which is due to an over-smooth approximation of the sharply peaked marginals around 0. This indicates that the (easily adapted) equidistant discretization of the means μ_l over the real line is not optimal. We discuss alternative parametrizations in Section 5.
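A minimal sketch of this direct sampler, assuming an orthonormal filter matrix K so that patches can be synthesized as x = K u from independent expert responses u_j (names and shapes are ours):

```python
import numpy as np

def sample_gmm_1d(w, mu, sigma, rng):
    """One draw from a homoscedastic 1-D GMM with weights w and means mu."""
    return rng.normal(mu[rng.choice(len(w), p=w)], sigma)

def sample_patch(K, weights, mu, sigma, rng):
    """Direct sampler sketch: draw each expert response independently from
    its 1-D GMM and map the coefficient vector back through K (a x J)."""
    u = np.array([sample_gmm_1d(w, mu, sigma, rng) for w in weights])
    return K @ u
```

For the diffused model, sigma would be the time-inflated standard deviation of the experts at the chosen t.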

Alternative Parametrizations
The potential functions of the models based on filter and wavelet responses shown in Fig. 3 and Fig. 4 exhibit leptokurtic behavior, which was noticed quite early in the literature [44, 62-66]. To model these leptokurtic potential functions, our parametrization relies on one-dimensional GMMs with a priori chosen equidistant means on the real line. The GMM is a very natural choice in our framework, as the Gaussian family is the only function family closed under diffusion (i.e. convolution with a Gaussian, cf. the central limit theorem). However, as a consequence, the discretization of the means over the real line has to be fine enough to allow proper modeling of the leptokurtic marginals. Thus, the majority of the learnable parameters are actually the weights of the one-dimensional Gaussian mixtures. This motivates the consideration of other expert functions ψ.
An extremely popular choice for modeling the distribution of filter responses on natural images is the Student-t expert [44,63]. As outlined above, the convolution of this expert with a Gaussian cannot be expressed in closed form. However, there exist approximations, such as the ones shown in [67] or [68, Theorem 1], which we recall here for completeness: Let X be a random variable on R with a Student-t-type density, where Γ(z) = ∫_0^∞ t^{z−1} exp(−t) dt is the Gamma function, and let Y_t be a random variable defined as previously. Then, the referenced results yield an approximation of the density of Y_t.

Designing more expressive models
All architectures discussed until now are shallow in the sense that they model the distribution of filter responses (either directly, or through wavelets or shearlets). A possible extension of our work would be to consider deep networks, i.e. networks with more than one layer. Indeed, many popular image restoration frameworks, such as trainable non-linear reaction diffusion [75] or the cascade of shrinkage fields [76], employ trainable Gaussian mixture potential functions (often referred to more generally as radial basis splines). However, they are typically trained as point estimators in a classic discriminative (task-specific) framework and have not been studied in the context of diffusion priors. We note that the diffusion in trainable non-linear reaction diffusion is a diffusion in image space, whereas our framework considers diffusion in probability space. Extending the idea of diffusion in probability space to deep networks is non-trivial; we believe that such models can only be tackled by approximating the diffusion PDE.

Wavelets: Modeling neighborhoods
In essence, the model based on wavelet responses described in Section 3.2 models the histogram of wavelet coefficients in different sub-bands. However, it does not take the spatial neighborhood (neither in its own sub-band nor of siblings or parents) into account. There have been many attempts at making these types of models more powerful: Guerrero-Colon et al. [77] introduce mixtures of GSMs to model the spatial distribution of wavelet coefficients within and across sub-bands. The authors of [78] extend this idea to mixtures of generalized Gaussian scale mixtures. We believe that these extensions can also be used in our work.
In particular, modeling disjoint neighborhoods leads to a block-diagonal structure in the product GMM, which can be inverted efficiently. However, modeling disjoint neighborhoods is known to introduce artifacts [73]. Still, such models can be globalized, e.g. by utilizing ideas similar to the expected patch log-likelihood [43], which amounts to applying a local model to overlapping local neighborhoods individually and averaging the results. Another interesting research direction with applications to generative modeling would be to condition the distribution of the wavelet coefficients on their parent sub-bands. Notice that when utilizing conditioning and modeling local neighborhoods, we essentially recover the wavelet score-based generative model of Guth et al. [79]. Their model uses the score network architecture proposed in [80], but we believe that modeling local neighborhoods could yield results that are close to theirs.

Patch versus Convolutional Model
One of the major open questions in this work is the relationship between the models based on filter responses and shearlet responses. We again want to emphasize that they are distinctly different: the former "only" models the distribution of filter responses, essentially forming a histogram. In particular, the distribution of the responses of natural images to arbitrary filters will always exhibit leptokurtic behavior [62,65], with sharp peaks at 0 (see the learned potential functions of the model based on filter responses in Fig. 3). The experts in the model based on shearlet responses do not model the marginal distribution of filter responses, but take into account the non-trivial correlation of overlapping patches. This leads to significantly more complex expert functions with multiple minima, sometimes different from zero (see the learned potential functions of the model based on shearlet responses in Fig. 6). Although quite well known in the literature [43,57,75,81], this distinction is sometimes overlooked (e.g. when [44] chose the restrictive Student-t potential functions in their convolutional fields-of-experts model). To the best of our knowledge, this paper is the first to propose strategies for learning patch-based and convolutional priors in a unified framework.
The assumption of non-overlapping spectra of the filters in the convolutional model (38) is analogous to the assumption of pair-wise orthogonality of the filters in the patch model (19): from (38) it immediately follows that ⟨Fk_j, Fk_i⟩_{C^n} = 0 when i ≠ j. Thus, in some sense, the convolutional model becomes a patch-based model in Fourier space. However, the relationship remains unclear and deserves further investigation.
The second assumption, that the spectra are constant over their support (39), restricts the space of admissible filters quite heavily. Unfortunately, we did not find a way to relax this constraint, and we believe that it cannot be relaxed without losing exact diffusion. However, we think that the constraint can be relaxed such that the diffusion PDE is fulfilled within some error bounds.

Conclusion
In this paper, we introduced PoGMDMs as products of Gaussian mixture experts that allow for an explicit solution of the diffusion PDE of the associated density. For models acting on filter, wavelet, and shearlet responses, we derived conditions on the associated filters and potential functions such that the diffusion PDE is fulfilled exactly. Our explicit formulation enables learning image priors simultaneously for all diffusion times using denoising score matching. Numerical results demonstrated that PoGMDMs capture the statistics of the underlying distribution well for any diffusion time. As a byproduct, our models can naturally be used for noise estimation and blind heteroscedastic denoising.
Future work will include the design of multi-layer architectures for which the diffusion can be expressed analytically, or approximated within some error bounds. In addition, the learned models could be evaluated on more involved inverse problems such as deblurring or even medical imaging. Further, the extensive evaluation of the model based on filter responses in terms of sampling and heteroscedastic blind denoising can also be applied to the models based on wavelet and shearlet responses. Finally, the connection between the models based on filter and shearlet responses deserves further investigation.

Fig. 1 :
Fig. 1: Diffusion of an empirical density consisting of the weighted Dirac measures

Fig. 5 :
Fig. 5: Initial (a) and learned (b) building blocks for the shearlet system: the one-dimensional low-pass filter h_1 and the two-dimensional directional filter P. The corresponding shearlet filters, along with their frequency responses, learned weights, and learned potential functions, are shown in Fig. 6.

Fig. 6 :
Fig. 6: The five left columns show the frequency response, time-domain filters and corresponding potential functions of the first cone of the learned shearlet system.The inset numbers show the values of the corresponding weighting factor λ j,k .The five unlabeled right columns show the second cone.

Fig. 7 :
Fig. 7: Cosine similarity between the magnitudes of the spectra of different levels, shearings, and cones of the initial (a) and learned (b) shearlet system. A system exactly fulfilling the assumption (38) would be unpopulated off the main diagonal.

Fig. 12 :
Fig. 12: Ground-truth samples from the random variable Y t (top) and samples generated by the analytic sampling procedure (60) (bottom).
It is well known that the Bayesian minimum mean-squared-error (MMSE) estimate is the conditional mean, i.e. the map y_t ↦ ∫ x f_{X|Y_t}(x | y_t) dx. In classical Bayes theory, such a map would be constructed by utilizing Bayes' theorem, i.e. writing f_{X|Y_t} explicitly.

The arrows show the empirical Bayes estimate y ↦ y + 2t ∇ log f_{Y_t}(y). As t approaches infinity, f_{Y_t} becomes log-concave and − log f_{Y_t} approaches a quadratic function.

We denote with l : {1, . . . , J} → {1, . . . , L} a fixed but arbitrary selection from the index set {1, . . . , L}.

Theorem 1. f_θ^filt( · , 0) is a homoscedastic GMM on R^a with L^J components and precision matrix

Table 1 :
Table 1: Quantitative denoising results in terms of PSNR and SSIM using one-step empirical Bayes denoising and the stochastic denoising algorithm from [58]. The intervals indicate the 0.95 confidence region; bold typeface indicates the best method.