Convergent Data-driven Regularizations for CT Reconstruction

The reconstruction of images from their corresponding noisy Radon transform is a typical example of an ill-posed linear inverse problem as arising in computerized tomography (CT). Since the (naive) solution does not depend continuously on the measured data, regularization is needed to re-establish this continuous dependence. In this work, we investigate simple yet provably convergent approaches to learning linear regularization methods from data. More specifically, we analyze two approaches: a generic linear regularization that learns how to manipulate the singular values of the linear operator, extending our previous work, and a tailored approach in the Fourier domain that is specific to CT reconstruction. We prove that such approaches become convergent regularization methods and that the reconstructions they provide are typically much smoother than the training data they were trained on. Finally, we compare the spectral and the Fourier-based approaches for CT reconstruction numerically, discuss their advantages and disadvantages, and investigate the effect of discretization errors at different resolutions.


Introduction
Linear inverse problems are at the heart of a variety of imaging applications, including restoration tasks such as image deblurring as well as the inference of unknown images from measurements that contain implicit information about them, as arising, for instance, in computerized tomography (CT), positron emission tomography (PET) or magnetic resonance imaging (MRI). All of these problems are commonly modeled as the task of recovering an image u from measurements f^δ = Au + ν (1) for a linear operator A and noise ν characterized by some error bound (noise level) δ ∈ [0, ∞[ to be specified later. A key challenge for most practically relevant problems is that A : X → Y is a compact linear operator with infinite-dimensional range (assuming here that X and Y are infinite-dimensional Hilbert spaces), so that zero is an accumulation point of its singular values and the pseudo-inverse A† is discontinuous.
The compactness of A makes it possible to expand it in terms of its singular values, i.e., to write it in the form Au = Σ_{n∈ℕ} σ_n ⟨u, u_n⟩ v_n, where the singular values σ_n > 0 are non-increasing and {u_n}_{n∈ℕ} and {v_n}_{n∈ℕ} form orthonormal bases of the orthogonal complement of the nullspace N(A)^⊥ and of the closure of the range R(A), respectively. With a slight abuse of notation, ⟨·, ·⟩ denotes the inner product on both X and Y. Classical linear regularization strategies therefore aim to approximate u with so-called spectral regularization operators of the form R(f^δ; g_δ) = Σ_{n∈ℕ} g_δ(σ_n) ⟨f^δ, v_n⟩ u_n (3) for a suitable function g_δ that remains bounded for all δ > 0 but for which g_δ(σ) → 1/σ as δ → 0. With a suitable speed of this pointwise convergence, the continuous dependence on the data, i.e., R(f^δ; g_δ) → A†f, can be re-established. An extensive overview of the requirements for this kind of convergence, as well as classical examples for g_δ including Tikhonov, Lavrentiev or truncated SVD regularization, can be found in [1].
In the specific case of the linear operator being the Radon operator defined on functions on the whole space (which, in contrast to integral operators on bounded domains, is not compact), we have to work with a continuous spectrum. The most commonly used reconstruction technique for inverting the Radon transform is the filtered back-projection, which follows a strategy very similar to (3), but exploits the structure of the operator in a different way: by the central slice theorem, the inverse Radon transform applied to some range element f ∈ R(A) is given by a formula (4) in which A* denotes the adjoint operator of A (also called the back-projection operator), ρ(r) = |r| is the so-called ramp filter, F_1-D is the one-dimensional Fourier transform with respect to the spatial offset variable of the Radon transform, and r (in the definition of ρ) is the variable resulting from the Fourier transform.
In analogy to the regularized version (3), the most common classical way to ensure a stable reconstruction is to replace (4) by a regularized variant (5) with the filter ρ_δ chosen to avoid amplification of high-frequency components with large frequency |r|. Common choices include the Hamming and the ramp/Ram-Lak filters.
In this work, we study (3) and (5) for functions g_δ (respectively ρ_δ) that can be learned from data. In particular, we show that the shape of the optimal functions g_δ and ρ_δ can be characterized in closed form. We prove that the learned approaches result in convergent regularization methods under certain conditions, investigate their behavior under different discretizations, and conduct numerical experiments on CT reconstruction problems.

Related Work
Regularization methods for linear inverse problems in general, and manipulations of the singular values in particular, have long been studied in applied mathematics; cf. the classical reference [1] or the more recent overview [2]. Classical examples of (3) include Lavrentiev, Tikhonov and truncated SVD regularization. Subsequently, a lot of research has focused on nonlinear regularization techniques, such as variational methods or (inverse) scale space flows. More recently, researchers have focused on learning reconstruction schemes through neural networks, which tend to show significantly stronger practical performance but often lack theoretical guarantees, e.g., being convergent regularizations (independent of their discretization) with error bounds in suitable (problem-specific) metrics. We refer to the two overview papers [1, 2] for classical and nonlinear regularization theory and recall some machine-learning-specific regularization approaches below.
The simplest way of benefiting from data-driven approaches are pre- or post-processing networks, i.e., parameterized functions G that are applied either to the data f^δ before exploiting a classical reconstruction technique, or to a preliminary reconstruction like (3). Common architectures in the area of image reconstruction problems are simple convolutional neural networks (CNNs) or multiscale approaches such as the celebrated U-Net architecture [3, 4]. Direct reconstructions (with different types of problem-specific information being accounted for) can, for instance, be found in [5-7]. Natural extensions of pre- and post-processing networks use A and its adjoint to switch between the measurement and reconstruction spaces, with intermediate learnable operations in structures that are often motivated by classical iterative reconstruction/optimization methods such as gradient descent, forward-backward splitting or primal-dual approaches, e.g., LEARN [8], FISTA-Net [9] or the learned primal-dual method [10]. Yet, without further restrictions, such methods do not allow proving error estimates or convergence results.
Coming from the perspective of regularization schemes based on variational methods, many approaches have suggested learning the regularizer: starting from (sparsity-based) dictionary learning, e.g., [11, 12], over learning (convex and nonconvex) regularizers motivated by sparsity penalties, e.g., [13-15], to schemes that merely learn descent directions [16], operators provably converging to a global optimum [17] or playing the role of a proximal operator [18, 19], and reparametrizations of the unknown in unsupervised settings (e.g., the deep image prior [20]) or in a learned latent space of realistic reconstructions [21, 22]. In a similar fashion, deep equilibrium models [23] generalize the optimality conditions arising from learning regularizers (cf. [24]) and can, for instance, be combined with convex neural networks [25] for learning suitable regularizations. Yet, the above approaches are either not viewed in the light of infinite-dimensional problems, or are non-convex, such that standard regularization results do not apply, or at least require the ability to compute global minimizers (as in the analysis of [26]) or strict assumptions like the tangential cone condition [27]. For a recent overview of data-driven approaches in the context of inverse problems, we refer the reader to [28].
Even if properties like convexity of the regularizer allow one to deduce convergence and/or error estimates, the complex parametrization of the above methods makes it highly difficult to understand how a data-driven approach extracts information from the training data, which quantities matter, and what properties a learned regularization has. Therefore, our approach in this paper is to extend our previous work [29] by studying the simple linear (singular-value-based) reconstruction (3) as well as the application-specific learning of a filter in Radon inversion (5), in order to provide more insight into the properties of learned regularization techniques. While optimizing such spectral filters in a data-driven manner has been studied in the past (cf. [30]), to the best of our knowledge it has not been studied to what extent the obtained reconstruction operators fulfill the characteristics of classical regularization methods.
While a large number of different data-driven approaches have been tailored to CT reconstruction (cf. [31-34] for particular examples or [35-37] for surveys), they largely follow the above categories in terms of theoretical guarantees and an understanding of the underlying learning process.
Our framework is close to learning the optimal Tikhonov regularizer, which was extended to the infinite-dimensional case and studied with a focus on generalization to unseen data in [38]. Although our approach is more restricted and requires knowledge of the forward operator and its singular value expansion, it allows for deriving reasonable assumptions that guarantee convergence in the no-noise limit.

Supervised Learning of Spectral Regularizations
In this Section we derive a closed-form solution for g_δ(σ_n) in (3) when the expectation of the squared norm difference between u and the regularization applied to the data f^δ is minimized. Here u and f^δ = Au + ν come from a distribution of training data. We then analyze the corresponding regularization operator and show that it is a convergent linear regularization.

Optimally Learned Spectral Regularization
In this Section, we study the approach of learning the function g_δ in (3) from a theoretical perspective. First of all, note that due to the assumption of A being compact (and hence having a discrete spectrum), only the evaluations of g_δ at the σ_n matter, such that we will focus on the optimal values g_n := g_δ(σ_n) directly. For the sake of simplicity, we guarantee the well-definedness of g(σ_n) = g_n by assuming the singular values σ_n to be strictly decreasing with multiplicity µ(σ_n) = 1. Let us further assume that our data formation process (1) arises with noise drawn independently of u from a zero-mean noise distribution parameterized by the noise level δ.
For a fixed noise level δ, the most common way to learn parameters (i.e., the g_n in our case) is to minimize the expectation of a suitable loss, e.g., the squared norm, over the training distribution of pairs (u, f^δ) of the desired ground truth u and the noisy data f^δ. Thus, the ideal learned method is obtained by minimizing this expected loss over g, where E_{u,ν} denotes the expected value with respect to the joint distribution of the ground truth data and the noise. With the spectral decomposition (3), the error can be expanded coefficientwise, where u_0 ∈ N(A) is the unique projection of u onto the nullspace of the operator. Due to the independence of u and ν together with the zero mean of the noise, the expected loss decouples, where E_u and E_ν denote the expected values with respect to the marginal distributions of the ground truth data and the noise, respectively. Since the whole training only makes sense in the chosen spaces X and Y if the ground truth data u are indeed elements of X, we shall assume this in the following without further notice. The above problem can be minimized for each g_n separately by solving a quadratic minimization problem, whose solution is given by

g_n = σ_n Π_n / (σ_n² Π_n + Δ_n),    (6)

with the variance coefficients Π_n = E_u[⟨u, u_n⟩²] and Δ_n = E_ν[⟨ν, v_n⟩²]. To obtain uniqueness of the solution, we assume that Π_n > 0 or Δ_n > 0 for all n ∈ ℕ throughout this paper.
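As a minimal numpy sketch (all names and toy values below are illustrative assumptions, not the paper's code), the closed-form coefficients g_n = σ_n Π_n / (σ_n² Π_n + Δ_n) can be computed as:

```python
import numpy as np

def optimal_spectral_coefficients(sigma, Pi, Delta):
    """Closed-form minimizer g_n = sigma_n Pi_n / (sigma_n^2 Pi_n + Delta_n).

    sigma : singular values of A (non-increasing)
    Pi    : variances of the clean data coefficients,  Pi_n    = E <u, u_n>^2
    Delta : variances of the noise coefficients,       Delta_n = E <nu, v_n>^2
    """
    sigma, Pi, Delta = map(np.asarray, (sigma, Pi, Delta))
    # Pi_n = Delta_n = 0 would make g_n ill-defined; as in the text we assume
    # Pi_n > 0 or Delta_n > 0, so the denominator is positive for sigma_n > 0.
    return sigma * Pi / (sigma**2 * Pi + Delta)

# Sanity checks: with no noise the coefficients reduce to the pseudo-inverse
# values 1/sigma_n; strong noise drives them toward zero.
sigma = np.array([1.0, 0.5, 0.1])
Pi = np.array([1.0, 0.5, 0.25])
g_clean = optimal_spectral_coefficients(sigma, Pi, np.zeros(3))
g_noisy = optimal_spectral_coefficients(sigma, Pi, np.full(3, 1e3))
```

The two limiting cases mirror the Tikhonov-like damping discussed below: the data-driven term Δ_n/Π_n interpolates between plain inversion and complete suppression of a component.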
Remark 1. The above result can be generalized to the case of singular values that are not strictly decreasing. Since σ_n converges to zero, each singular value has finite multiplicity, and we can rewrite them as a sequence of strictly decreasing singular values σ_n with multiplicity µ(σ_n) ≥ 1 and corresponding singular functions. In this case, the optimal coefficients g_n are again given by (6), where we use suitably generalized variance coefficients.

We observe that for Π_n > 0, Equation (6) can be rewritten as g_n = σ_n / (σ_n² + Δ_n/Π_n), and therefore has the same form as classical Tikhonov regularization with the commonly fixed regularization parameter α replaced by the componentwise adaptive and data-driven term Δ_n/Π_n. We note that the restricted structure of the regularizer prevents it from approximating the nullspace component of the unknown. This is quantified by the term E_u[∥u_0∥²], which is independent of the choice of g and enters the generalization error as an additive constant.

In general we will need an assumption about the structure of the noise in the data set. First of all, we interpret the noise level δ in a statistical sense as a bound for the noise standard deviation, formulated such that it includes white noise; we will assume such a bound throughout the paper without further notice. Note that when studying the zero-noise limit δ → 0, it is natural to interpret Δ_n as depending on δ, and in these instances we will use the explicit notation Δ_n(δ). We will refer to the case Δ_n(δ) = δ² for all n ∈ ℕ as white noise.
Obviously we expect the clean images to be smoother than the noise, which means that their decay in the respective singular basis is faster. We thus formulate the following assumption, which is automatically true for white noise (Δ_n = δ² for all n), since Π_n goes to zero: Assumption 1. For δ > 0, there exists n₀ ∈ ℕ such that the sequence Δ_n/Π_n is well-defined for n ≥ n₀ and diverges to +∞.
For some purposes a weaker form will suffice, requiring only a lower bound instead of divergence to infinity: Assumption 2. There exist c > 0 and n₀ ∈ ℕ such that Δ_n(δ) ≥ c δ² Π_n for every n ≥ n₀ and δ > 0.
With Assumption 2 we can conclude that the operator R(·; g) with g given by (6) is a linear and bounded operator for strictly positive δ.

Lemma 1. Let Assumption 2 be satisfied. Then R(·; g) with g as defined in (6) is a bounded linear operator for δ > 0.
Proof. By construction, the operator R is linear in its argument f^δ. We first note that Π_n = 0 implies g_n = 0. For Π_n > 0 and n ≥ n₀, the condition Δ_n ≥ c δ² Π_n implies g_n ≤ σ_n/(σ_n² + c δ²) ≤ 1/(2δ√c). For n < n₀, only finitely many coefficients remain, which are bounded. Hence, we conclude that the sequence (g_n) is bounded, which implies that the operator R(·; g) is a bounded linear operator for δ > 0.
Remark 2. In practical scenarios with finite (empirical) data (u_i, f_i), i = 1, …, N, one typically minimizes the empirical risk instead of the expectation, leading to coefficients that additionally involve empirical cross terms Γ_n^N between data and noise. Yet, a convergence analysis of the empirical risk minimization requires further (strong) assumptions to control Γ_n^N, such that we limit ourselves to the analysis of the ideal case (6) in this work.
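A small simulation can illustrate Remark 2: in a hypothetical toy model (Gaussian coefficients in the singular basis; not the paper's data), the empirical-risk minimizer computed per component from finite samples approaches the ideal coefficients (6) as N grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: Gaussian data coefficients with variance Pi_n and
# Gaussian noise coefficients with variance Delta_n.
sigma = np.array([1.0, 0.5, 0.2])
Pi = np.array([1.0, 0.4, 0.1])
Delta = np.full(3, 0.01)

N = 200_000
a = rng.normal(0.0, np.sqrt(Pi), size=(N, 3))        # <u_i, u_n>
eta = rng.normal(0.0, np.sqrt(Delta), size=(N, 3))   # <nu_i, v_n>
b = sigma * a + eta                                  # <f_i, v_n>

# Empirical risk minimizer of sum_i (g_n b_i - a_i)^2, per component:
g_emp = (a * b).sum(axis=0) / (b**2).sum(axis=0)
g_analytic = sigma * Pi / (sigma**2 * Pi + Delta)
```

For large N the empirical cross terms between data and noise average out, so `g_emp` concentrates around `g_analytic`; for small N they do not, which is exactly why the remark restricts the analysis to the ideal case.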

Range conditions
We start our analysis of the learned regularization method with an inspection of the range condition (cf. [2]), which for Tikhonov-type regularization methods equals the so-called source condition of classical regularization methods (cf. [1]). More precisely, we ask which elements u ∈ X can be obtained from the regularization operator R(·; g) for some data f, i.e., we characterize the range of R(·; g). Intuitively, one might expect that the range of R includes the set of training samples, or, in a probabilistic setting, that the expected smoothness of elements in the range (with expectation over the noise and training images) equals the expected smoothness of the training images. However, as we shall see below, the expected smoothness of reconstructions is typically higher.
We start with a characterization of the range condition, where we again denote the range of an operator by R.
If Assumption 2 is satisfied, we have in particular u ∈ R(A*), and in the case of white noise the range condition (10) holds if and only if (11) holds.

Proof. Given the form of g, we see that u = R(f; g) for some f ∈ Y is equivalent to (1/g_n)⟨u, u_n⟩ ∈ ℓ²(ℕ). Due to the boundedness of the σ_n, we find σ_n⟨u, u_n⟩ ∈ ℓ²(ℕ) for any u ∈ X. Hence, the range condition is satisfied if and only if the remaining summability condition holds, which is just (10). Under Assumption 2 we have Δ_n/(σ_n Π_n) ≥ c δ²/σ_n, and this implies u = A*w for a suitable w ∈ Y. For white noise, we have Δ_n = δ², which implies that (10) is satisfied if and only if (11) is satisfied.
Let us mention that under Assumption 1 the range condition of the learned regularization method is actually stronger than the source condition u = A*w, since the corresponding quotient even diverges under Assumption 1. This indicates that the learned regularization might be highly smoothing or even oversmoothing, depending on the roughness of the noise compared to the signal, i.e., on the quotient Δ_n/Π_n. Although the range of the reconstruction operator is still dense in the range of A* if Π_n > 0 for all n ∈ ℕ, the oversmoothing effect of the learned spectral regularization method becomes apparent when looking at the expected smoothness of the reconstructions, i.e., Π̃_n = E_{u,ν}[⟨R(Au + ν; g), u_n⟩²].
A straightforward computation yields Π̃_n = Π_n / (1 + Δ_n/(σ_n² Π_n)). While we would expect (at least on average) a similar smoothness of the reconstructions as of the training images, i.e., Π̃_n ≈ Π_n, we find under Assumption 2 that Π̃_n/Π_n ≤ σ_n²/(c δ²) for large n, i.e., Π̃_n/Π_n converges to zero at least as fast as σ_n². The distribution of reconstructions thus has much more mass on smoother elements of X than the training set. The main reason for this behaviour seems to stem from the noise in the data: although the method is supervised and tries to match all reconstructions to the training data, it needs to produce a linear operator that maps noise to suitable elements of X as well. Note that the behaviour changes as δ → 0. For small Δ_n/(σ_n² Π_n), the denominator is approximately equal to one, hence Π̃_n ≈ Π_n, i.e., in the limit the right degree of smoothness is restored.

Another perspective on the expected smoothness is to investigate the bias of the obtained reconstruction operator or, since we cannot expect to reconstruct nullspace elements, its bias on N(A)^⊥. Inserting the definition of g, we derive that the bias converges to zero as δ → 0. We see that, just like the speed of convergence of the variance coefficients Π̃_n, the speed of convergence of the bias is determined by the convergence of the factors Δ_n/(σ_n² Π_n) → 0.
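The oversmoothing effect can be made concrete numerically. The sketch below (with an assumed geometric decay of the singular values and a hypothetical data spectrum, for white noise) evaluates the ratio of the reconstruction variances Π̃_n to the data variances Π_n:

```python
import numpy as np

# Illustration of the oversmoothing effect: the reconstruction coefficients
# have variance Pi~_n = g_n^2 (sigma_n^2 Pi_n + Delta_n), so the ratio
# Pi~_n / Pi_n = 1 / (1 + Delta_n / (sigma_n^2 Pi_n)) decays for large n.
sigma = 0.9 ** np.arange(50)           # hypothetical decaying singular values
Pi = sigma**1.5                        # smooth data: fast coefficient decay
delta = 0.05
Delta = np.full_like(sigma, delta**2)  # white noise

g = sigma * Pi / (sigma**2 * Pi + Delta)
Pi_rec = g**2 * (sigma**2 * Pi + Delta)
ratio = Pi_rec / Pi                    # = 1 / (1 + Delta / (sigma^2 Pi))
```

The ratio is close to one for the leading components and decays monotonically with n, matching the claim that the reconstruction distribution puts more mass on smoother elements than the training distribution.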

Convergence of the Regularization
In the following we investigate the convergence properties of the learned spectral regularization as the noise level δ tends to zero. To this end we write Δ_n(δ) and g(δ) to make clear that we are looking at a sequence converging to zero. We shall naturally consider convergence of the mean-squared error e(u, δ) of the reconstruction of a fixed u from noisy data. This quantity equals the 2-Wasserstein distance (cf. [39, Sec. 2.1]) between the measure concentrated at u and the distribution generated by the regularization applied to random noisy data; convergence can thus be understood as a concentration of measure in the zero-noise limit.
In addition to the convergence of the regularization for single elements u ∈ X, we can also look at the mean-squared error e(δ) over the whole distribution of images and noise, which is identical to the quantity (7) with the dependency of Δ_n on δ made explicit.
Proof. We start by computing the reconstruction error for fixed data and noise. Taking expectations with respect to the noise yields a coefficientwise error expression. Now let ε > 0 be arbitrary. Due to the summability of the coefficients ⟨u, u_n⟩², we can choose n₀ such that the tail of the series is small, and obtain e(u, δ) < ε for δ sufficiently small, which implies convergence to zero. In the same way we can estimate e(δ), where the first equality follows from (7), and obtain convergence by the same argument.

Learned Radon Filters
In this Section we show the correspondence between data-driven regularization of the inverse Radon transform and filtered back-projection. We therefore analyze the properties of an optimal filter for the regularized filtered back-projection (5), i.e., we want to find a function ρ : ℝ × [0, π] → ℂ minimizing the expected reconstruction error. The function ρ now plays the role of the learned spectral coefficients. In particular, ρ(r, θ) = |r| corresponds to an inversion, i.e., to g_n = 1/σ_n, and ρ(r, θ) = 1 yields the adjoint operator, which corresponds to g_n = σ_n. In analogy to the derivation of the optimal coefficients in Section 3, we first rewrite the L²-error between a single u ∈ L²(ℝ²) and the corresponding regularized reconstruction of the measurement f^δ, using the isometry property of the Fourier transform. In the last line, we use the central slice theorem, where the latter equality holds since the range of the Radon transform is dense in L²(ℝ × [0, π]). Note that for the reconstruction operator to be well-defined and continuous on L², the filter function ρ has to be essentially bounded. With the same arguments as in Section 3, we can restrict ψ, and thus ρ, to be real-valued, as adding an imaginary component would always increase the expected value. Accordingly, the optimal filter is real-valued and given by ρ(r, θ) = |r| Π(r, θ) / (Π(r, θ) + Δ(r, θ)).
To fulfill the boundedness of ρ, we impose conditions that can be recognized as the obvious requirement that the ground-truth data are smoother than the noise on average. The expected error using the optimal filter then follows accordingly. Using the central slice theorem again, we can further determine the expected smoothness Π̃(r, θ) of the reconstructions. This reveals the analogue of the smoothing effect obtained by the spectral reconstruction discussed in Section 3.2.
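A sketch of the resulting filtering step for a single projection (fixed angle θ), using numpy's FFT; the spectra Π and Δ below are assumed toy choices, and the subsequent back-projection A* of all filtered projections is omitted:

```python
import numpy as np

# Learned filter rho(r) = |r| Pi(r) / (Pi(r) + Delta(r)) applied to one
# projection; Pi and Delta are hypothetical power spectra of the clean
# sinogram and of the noise at this angle.
n = 256
r = np.fft.fftfreq(n, d=1.0 / n)       # frequency variable of F_1-D

Pi = 1.0 / (1.0 + np.abs(r))**2        # assumed spectrum of clean data
Delta = np.full(n, 1e-2)               # assumed white-noise spectrum
rho = np.abs(r) * Pi / (Pi + Delta)    # damped version of the ramp filter

projection = np.random.default_rng(1).normal(size=n)
filtered = np.fft.ifft(rho * np.fft.fft(projection)).real
```

By construction ρ never exceeds the ramp |r| and vanishes at r = 0, so high frequencies dominated by noise are suppressed rather than amplified.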
Remark 3. Similar to the spectral regularization, optimal filters can also be computed for a finite set of training data: rewriting the empirical risk for input-output pairs and computing the optimality conditions for ψ yields corresponding empirical filter coefficients.

Numerical Experiments with the discretized Radon Transform
In this Section we support the theoretical results of the previous Sections with numerical experiments.

Datasets
The following experiments are performed on a synthetic dataset and on the LoDoPaB-CT dataset (cf. [40]). The synthetic dataset contains 32,000 images of random ellipses supported on a circle completely contained in the image domain (see Figure 4 for an example). Each ellipse has a random width and height and is roto-translated by a random distance and angle; two or more ellipses may overlap. We split the data into 64% training, 16% validation and 20% test data. The LoDoPaB-CT dataset contains over 40,000 human chest scans and is split into 35,820 training, 3,522 validation and 3,553 test images. For both datasets, we simulate sinograms as described in the following Section and model noisy data by adding Gaussian noise with zero mean and variance δ².
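A crude sketch of how one such synthetic sample could be generated (the parameter ranges and intensities are illustrative assumptions; the paper's exact recipe may differ):

```python
import numpy as np

def random_ellipse_image(size=64, n_ellipses=3, seed=None):
    """One synthetic sample: random, possibly overlapping ellipses supported
    on the centered unit disc. All ranges below are hypothetical choices."""
    rng = np.random.default_rng(seed)
    y, x = np.mgrid[-1:1:size * 1j, -1:1:size * 1j]
    img = np.zeros((size, size))
    for _ in range(n_ellipses):
        w, h = rng.uniform(0.1, 0.4, size=2)       # random width / height
        cx, cy = rng.uniform(-0.4, 0.4, size=2)    # random translation
        phi = rng.uniform(0.0, np.pi)              # random rotation
        xr = (x - cx) * np.cos(phi) + (y - cy) * np.sin(phi)
        yr = -(x - cx) * np.sin(phi) + (y - cy) * np.cos(phi)
        img += ((xr / w)**2 + (yr / h)**2 <= 1.0) * rng.uniform(0.2, 1.0)
    return img * (x**2 + y**2 <= 1.0)              # restrict to the disc

sample = random_ellipse_image(seed=0)
# Noisy data are then modeled by adding zero-mean Gaussian noise.
noisy = sample + np.random.default_rng(0).normal(0.0, 0.005, sample.shape)
```
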
In Appendix A we additionally report results for uniformly distributed noise.

Discretization and Optimization
We perform numerical experiments with the Radon transform for discrete data based on the spline-0 pixel model, where we assume the image functions u ∈ L²(ℝ²) to be piecewise constant on a uniform pixel grid, i.e., linear combinations of shifted 0-splines. Here I ∈ ℕ denotes the number of pixels in each direction, u_ij ∈ ℝ for each i, j < I, and the 0-spline β⁰ is the B-spline of degree zero, i.e., the characteristic function of the unit interval. Accordingly we have u_ij = u(i/I, j/I), so with a slight abuse of notation we write u ∈ ℝ^{I×I}. Reformulating the Radon transform for this particular choice of input functions yields a finite-dimensional linear operator (see, e.g., [41] for a detailed derivation), which can thus be represented by a matrix A. For sufficiently small dimensions, the finite set of vectors u_n and v_n is given by the singular value decomposition of A, more precisely A = VΣUᵀ, where u_n and v_n are the column vectors of the orthogonal matrices U and V, and Σ is a diagonal matrix containing the singular values in non-increasing order. Since the considered problem is finite-dimensional, Assumption 2 is always fulfilled. We still see a faster decay of Π_n compared to Δ_n, as depicted in Figure 1.
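For a small stand-in matrix A (a random matrix here, not the actual discrete Radon matrix), the SVD-based regularized inversion can be sketched as follows; note that numpy's `svd` returns the factors in the order corresponding to A = VΣUᵀ in the notation of this Section, and Π and Δ are assumed toy variances:

```python
import numpy as np

rng = np.random.default_rng(2)

# Small random stand-in for the discrete Radon matrix A (40 x 25).
A = rng.normal(size=(40, 25))
V, S, Ut = np.linalg.svd(A, full_matrices=False)   # A = V diag(S) U^T
U = Ut.T

Pi = np.full(25, 1.0)            # assumed data variances in the basis u_n
Delta = np.full(25, 1e-3)        # assumed noise variances (white noise)
g = S * Pi / (S**2 * Pi + Delta)

# Regularized inversion R(f) = U (g * (V^T f)), element-wise product.
u = rng.normal(size=25)
f = A @ u + rng.normal(0.0, np.sqrt(1e-3), size=40)
recon = U @ (g * (V.T @ f))
```

With the small assumed noise level, the reconstruction error stays well below the signal norm, which is the behavior the subsequent experiments quantify at scale.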
Since the computation of the SVD has high memory requirements and the matrix multiplication with U and Vᵀ is computationally expensive, this approach is only feasible for low-resolution data. Using the fact that V is orthogonal, we see the connection to the continuous optimization (16) by rewriting the regularized inversion operator as R(f) = U (g ⊙ (Vᵀf)) with g = diag(Σ) Π / (diag(Σ)² Π + Δ), where the products and the division are to be interpreted element-wise and diag(Σ) returns the main diagonal of Σ. As we will analyze in more detail below, the discretization of the Radon transform leads to (4) being incorrect for F denoting the discrete Fourier transform. Thus, we will compare four different approaches in our numerical results:

1. The spectral regularization (3) with coefficients computed according to the analytical solution (9).
2. The spectral regularization (3) with coefficients optimized in a standard machine learning setting, i.e., initializing the parameters with zeros, partitioning the data into batches of 32 instances and then using the PyTorch implementation of Adam (with default parameters except for a learning rate of 0.1) for optimization. Ideally, these results should be identical to those of Approach 1.
3. The learned filtered back-projection (5) with coefficients computed according to the analytical solution (17), keeping in mind that the resulting coefficients might not be optimal due to the aforementioned discretization error in (5). To partially compensate for this, we replaced the discretization of the ramp filter |r| by filter coefficients optimized on clean data. For all non-zero noise levels we kept these pseudo-ramp filter coefficients from the no-noise baseline.
4. The learned filtered back-projection (5) with coefficients that minimize the empirical risk, where F denotes the discrete Fourier transform. In keeping with the results from Section 4, we only consider real-valued filters. The optimization is done in the same way as described above for the SVD. We point out that for rotation-invariant distributions, i.e., cases where the noise and data distributions are constant in the direction of the angles, the filter ρ can be chosen constant in the direction of the angles as well; this reduces the number of necessary parameters to the number of positions s. We further note that due to the inevitable discretization errors, we cannot expect perfect reconstructions, not even for perfect measurements without noise.
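The paper optimizes the learned coefficients with PyTorch's Adam; as a dependency-free sketch of the same idea, plain gradient descent on the quadratic empirical risk in a hypothetical diagonalized toy model recovers the per-component least-squares optimum:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy stand-in for Approach 2: gradient descent (instead of Adam) on the
# empirical risk mean_i (g_n b_i - a_i)^2, with coefficients in the
# singular basis; all values are illustrative assumptions.
sigma = np.array([1.0, 0.6, 0.3])
Pi = np.array([1.0, 0.5, 0.2])
Delta = np.full(3, 1e-2)

N = 4096
a = rng.normal(0.0, np.sqrt(Pi), size=(N, 3))                 # <u_i, u_n>
b = sigma * a + rng.normal(0.0, np.sqrt(Delta), size=(N, 3))  # <f_i, v_n>

g = np.zeros(3)                                # initialized with zeros
lr = 0.1
for _ in range(2000):
    grad = 2.0 * ((g * b - a) * b).mean(axis=0)   # gradient of the risk
    g -= lr * grad

# Exact per-component minimizer of the same empirical risk:
g_empirical_opt = (a * b).sum(axis=0) / (b**2).sum(axis=0)
```

Since the risk is quadratic, gradient descent converges to the empirical optimum, which in turn concentrates around the analytic coefficients (6); this mirrors the expectation that Approach 2 should reproduce Approach 1.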
We refer to Approaches 2 and 4 as learned and to Approaches 1 and 3 as analytic and compare their behavior below. Although the discrete Radon transform A is an approximation of the continuous Radon transform (denoted by Â in the following), several scaling factors have to be taken into account to relate the discrete regularization to a continuous regularization. We collect these factors in the following remark.
Remark 4. For data supported on the uniform square, we consider discretizations u ∈ ℝ^{I×I} and v ∈ ℝ^{K×L} of functions û and v̂. Here I denotes the number of pixels in each direction, K the number of equispaced angles in [0, π] and L the number of equispaced offset positions. Then, the following approximations hold for (i) the Radon transform and its adjoint, (ii) the singular value decomposition, where û_n, v̂_n are normalized with respect to the L²-norm and u_n, v_n with respect to the Euclidean norm, and (iii) the data-driven (co-)variance terms.

Results
In the following, we present several numerical results for the approaches introduced in the previous Section.

Learned vs. analytic
The loss as well as the PSNR and SSIM values obtained by the different approaches on the test dataset are documented in Table 1. Based on these results, we deduce that the SVD-based regularization performs better than the FFT-based regularization. In the SVD case, the results obtained by the analytic coefficients are still slightly better than those obtained by the learned coefficients. The opposite is true in the FFT case, where the analytic solution performs worse than the learned one. This had to be expected due to the discretization errors made in the computation of the analytic optimum. Figure 2 shows the loss curves for both data-driven optimization approaches as well as the loss obtained by the analytically optimal filters. As expected, for the SVD approach the learned loss converges to the analytic loss. On the other hand, the loss approximated with the analytic formula for the FFT approach is higher than the actual data-driven loss, which may again indicate that the optimization balances out the discretization errors. This gap between the continuous and discrete optima can also be seen for higher-resolution data, as the comparison of the analytic filters to the learned filters in Figure 3 shows. Based on these insights, we only show results for analytic SVD coefficients and learned FFT filters in the following. We further see that the obtained losses of both approaches become more similar the higher the noise level is, which is also visible in the reconstructions (see Figure 4). Here, we compare our two approaches to naïve reconstruction with filtered back-projection (FBP) and to Tikhonov reconstruction, where the reconstruction operator is chosen individually for each example with the discrepancy principle. The reconstruction examples also reveal the predicted oversmoothing effect for both approaches.

Generalization to different resolutions
Since the operator we use in our experiments is discretized, the optimal coefficients (or filters, respectively) depend on the resolution of the operator and the corresponding data. However, both approaches offer ways to transfer coefficients optimized for a certain resolution to another resolution. One way to generalize the learned SVD-based coefficients is to resort to the classic linear regularization formulation (3) and retrieve a function g(σ) by inter-/extrapolating the optimal values g(σ_n) = g_n. A generalization of the FFT filters can be obtained in a similar way by interpolating ρ(θ, r) in the direction of the angles θ and extrapolating in the direction of the frequencies r. Comparing the optimal coefficients obtained for different resolutions after rescaling them in accordance with Remark 4 (see Figure 5 for SVD coefficients and Figure 6 for FFT filters), we see that all curves behave similarly, but there are obvious differences in the magnitude and location of the peak for different resolutions. While the optimal coefficients seem to converge in the domain of small singular values (or low frequencies, respectively), up to the resolutions we were able to compute there is no obvious convergence visible in the domain of large singular values for the SVD approach or the domain of high frequencies for the FFT approach. Note that since high singular values correspond to low frequencies, this reveals an opposite behavior of the two approaches. Inspecting the curves for Π, we see that they are almost identical for high resolutions (see Figure 5). Therefore, a convergence of the optimal spectral coefficients for higher resolutions seems probable.
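The transfer of spectral coefficients between resolutions described above can be sketched with simple inter-/extrapolation over the singular values; the spectra and the coefficient function below are hypothetical placeholders for learned values:

```python
import numpy as np

# Treat the learned values g_n = g(sigma_n) as samples of a function of the
# singular value and evaluate it at a different operator's spectrum.
sigma_low = np.linspace(1.0, 0.05, 100)    # hypothetical low-res spectrum
g_low = sigma_low / (sigma_low**2 + 1e-2)  # hypothetical learned g(sigma)

sigma_high = np.linspace(1.0, 0.01, 400)   # hypothetical high-res spectrum
# np.interp expects increasing abscissae, hence the reversals; values outside
# the low-res range are held constant (a crude form of extrapolation).
g_high = np.interp(sigma_high[::-1], sigma_low[::-1], g_low[::-1])[::-1]
```

Holding the coefficients constant outside the trained range is only one possible choice; as the comparison in Figures 5 and 6 suggests, the high-frequency/small-singular-value tail is exactly where the resolutions disagree most.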

Generalization to different data distributions
To evaluate how well the learned regularizers generalize between different datasets, we show empirical results on the LoDoPaB-CT dataset in this Section. Figures 7-9 compare the results obtained with a regularizer trained on the LoDoPaB-CT dataset with those obtained with a regularizer trained on the synthetic ellipse dataset. While both clearly improve the reconstructions compared to the filtered back-projection, the effect of optimizing the regularizers on the correct dataset is still clearly visible.

Conclusions & Outlook
In this work, we studied the behavior of data-driven linear regularization of inverse problems. For problems arising from compact linear forward operators, we showed that the optimal linear regularization can be computed analytically via the singular value expansion of the operator. An analysis of the range conditions revealed an oversmoothing effect depending on the noise level. We further confirmed that the optimal regularization is convergent under suitable assumptions. By deriving analogous results for the Radon transform, a specific non-compact linear operator, we established a connection to the well-known approach of filtered back-projection. Numerical experiments with a discretized (and thus compact) version of the Radon transform verified our theoretical findings for compact operators. We additionally proposed a computationally more efficient discretization of the filtered back-projection using the Fast Fourier Transform, which allows for finer discretizations of the operator. However, the filters obtained by this approach turned out to deviate strongly from the theoretical results for the continuous case. Especially in the low-noise regime, the discretization errors of this approach caused a heavy loss of accuracy compared to the SVD-approach. Additionally, we empirically evaluated the convergence behavior of both regularization approaches for the discrete Radon transform at increasing levels of discretization.
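The filtering step of the FFT-based discretization of the filtered back-projection mentioned above can be sketched as follows. This is an illustrative assumption-laden sketch, not the paper's implementation: the function name, the random test sinogram, and the use of the classical ramp filter in place of a learned filter ρ(r) are all hypothetical.

```python
import numpy as np

def apply_fft_filter(sinogram, filt):
    """Filter each projection (row of the sinogram) in Fourier space.

    sinogram : array of shape (n_angles, n_detectors)
    filt     : real filter of shape (n_detectors // 2 + 1,), e.g. the ramp
               filter |r| of classical FBP, or a learned filter rho(r).
    """
    # Real FFT along the detector axis, multiplication by the frequency
    # filter, inverse transform: the filtering step of FBP. Applying the
    # filter in Fourier space avoids an explicit convolution per angle.
    S = np.fft.rfft(sinogram, axis=1)
    return np.fft.irfft(S * filt, n=sinogram.shape[1], axis=1)

# Hypothetical usage with the sinogram dimensions from the experiments
# (93 angles, 256 detector bins) and a ramp filter.
n_angles, n_det = 93, 256
sino = np.random.default_rng(0).standard_normal((n_angles, n_det))
ramp = np.abs(np.fft.rfftfreq(n_det))
filtered = apply_fft_filter(sino, ramp)
```

Since the ramp filter vanishes at zero frequency, each filtered projection has zero mean; the subsequent back-projection step is unchanged.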
In the future, we would like to better understand the discretization error arising from the discretization of the Radon transform and its impact on the computed filters. It would also be interesting to study more general nonlinear data-driven regularization methods and to find formulations and criteria under which one can prove that these methods are convergent regularizations, similar to the analysis carried out in Section 3.3.

Figure 1: Comparison of the coefficients ⟨u, u_n⟩ and ⟨ν, v_n⟩ for n = 5, 200, 4000, for 64 × 64 ground truth images, 93 × 256 sinograms and noise level δ = 0.005. While the variance of the data-coefficients decreases with increasing n, the variance of the noise-coefficients remains stable.

Figure 2: Training loss curves for both optimization approaches and different noise levels δ for 64 × 64 ground truth images and 93 × 256 sinograms. The scale on the right shows the loss obtained by the analytically computed filters.
Figure 3:

Figure 4: Comparison of the ellipse reconstructions obtained by the different approaches for different levels of additive Gaussian noise. The first row (δ = 0) shows the reconstructed sinograms before adding noise.
Figure 5:

Figure 6: Comparison of the optimal FFT-filters for different angle and frequency resolutions for 256 × 256 ground truth images with noise level δ = 0.005. Since the filters are not constant in the direction of the angles, multiple slightly different curves correspond to one filter. The filters for higher spatial resolutions are cut off at frequency r = 183.

Figure 7: Example image from the LoDoPaB-CT dataset and the reconstructions of an uncorrupted sinogram by the different methods.

Figure 8: Comparison of LoDoPaB reconstructions obtained by the SVD-approach, trained on different datasets, for different levels of additive Gaussian noise.

Figure 9: Comparison of LoDoPaB reconstructions obtained by the FFT-approach, trained on different datasets, for different levels of additive Gaussian noise.

Figure 11: Comparison of LoDoPaB reconstructions obtained by the SVD-approach, trained on different datasets, for different levels of additive uniform noise.

Figure 12: Comparison of LoDoPaB reconstructions obtained by the FFT-approach, trained on different datasets, for different levels of additive uniform noise.

Table 1: Performance of the different approaches on test data for 64 × 64 ground truth images, 93 × 256 sinograms and different noise levels δ.