1 Introduction

Inverse problems arise in numerous applications such as medical imaging [1–4], phase retrieval [5–7], geophysics [8–13], and machine learning [14–18]. The goal of an inverse problem is to recover a signal \(u_{d}^{\star }\) from a collection of indirect noisy measurements d. These quantities are typically related by a linear mapping A via

$$ d = A u_{d}^{\star }+ \varepsilon, $$
(1)

where ε is measurement noise. Inverse problems are often ill-posed, making recovery of the signal \(u_{d}^{\star }\) unstable for noise-affected data d. To overcome this, traditional approaches estimate the signal \(u_{d}^{\star }\) by a solution \(\tilde{u}_{d}\) to the variational problem

$$ \min_{u } \ell (A u, d) + J(u), $$
(2)

where \(\ell \) is a fidelity term that measures the discrepancy between the measurements and the application of the forward operator A to the signal estimate (e.g. least squares). The function J serves as a regularizer, which ensures both that the solution to (2) is unique and that its computation is stable. In addition to ensuring well-posedness, regularizers are constructed in an effort to instill prior knowledge of the true signal e.g. sparsity \(J(u) = \|u\|_{1}\) [19–22], Tikhonov \(J(u) = \|u\|^{2}\) [23, 24], total variation (TV) \(J(u) = \| \nabla u \|_{1}\) [25, 26], and, more recently, data-driven regularizers [27–29]. A further generalization of data-driven regularization consists of plug-and-play (PnP) methods [30–32], which replace the proximal operators in an optimization algorithm with previously trained denoisers.
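To make the notion of a proximal operator concrete, the following minimal sketch evaluates the proximal operator of the sparsity regularizer, which has the well-known closed form of entrywise soft-thresholding. The weight `lam` and the function name are assumptions introduced for illustration (the text above writes \(J(u) = \|u\|_{1}\) without a weight); in a PnP method this is the step a trained denoiser would replace.

```python
def soft_threshold(v, lam):
    """Proximal operator of lam * ||u||_1, applied entrywise to the list v.

    Illustrative helper (not from the paper): each entry is shrunk toward
    zero by lam, and entries with magnitude below lam are set to zero.
    """
    out = []
    for x in v:
        mag = max(abs(x) - lam, 0.0)  # shrink the magnitude, floor at zero
        out.append(mag if x >= 0 else -mag)
    return out


# Small entries vanish, large entries move toward zero by lam.
shrunk = soft_threshold([3.0, -0.5, 1.2, -2.0], lam=1.0)
```

Entries smaller than `lam` in magnitude are zeroed, which is how the \(\ell _{1}\) regularizer promotes sparsity.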

An underlying theme of regularization is that signals represented in high dimensional spaces often exhibit a common structure. Although hand-picked regularizers may admit desirable theoretical properties leveraging a priori knowledge, they are typically unable to leverage available data. An ideal regularizer will leverage available data to best capture the core properties that should be exhibited by output reconstruction estimates of true signals. Neural networks have demonstrated great success in this regard, achieving state-of-the-art results [33, 34]. However, purely data-driven machine learning approaches do little to leverage the underlying physics of a problem, which can lead to poor compliance with data [35]. On the other hand, fast feasibility-seeking algorithms (e.g. see [36–40] and the references therein) efficiently leverage known physics to solve inverse problems, being able to handle massive-scale sets of constraints [36, 41–43]. Thus, a relatively untackled question remains:

Is it possible to fuse feasibility-seeking algorithms with data-driven regularization in a manner that improves reconstructions and yields convergence? This work answers the above inquiry affirmatively. The key idea is to use machine learning techniques to create a mapping \(T_{\Theta }\), parameterized by weights Θ. For fixed measurement data d, \(T_{\Theta }(\cdot;d)\) forms an operator possessing standard properties used in feasibility algorithms. Fixed point iteration is used to find fixed points of \(T_{\Theta }(\cdot;d)\) and the weights Θ are tuned such that these fixed points both resemble available signal data and are consistent with measurements (up to the noise level).

Contribution

The core contribution of this work is to connect powerful feasibility-seeking algorithms to data-driven regularization in a manner that maintains theoretical guarantees. This is accomplished by presenting a feasibility-based fixed point network (F-FPN) framework that solves a learned feasibility problem. Numerical examples are provided that demonstrate notable performance benefits to our proposed formulation when compared to TV-based methods and fixed-depth neural networks formed by algorithm unrolling.

Outline

We first overview convex feasibility problems (CFPs) and a learned feasibility problem (Section 2). Relevant neural network material is discussed next (Section 3), followed by our proposed F-FPN framework (Section 4). Numerical examples are then provided with discussion and conclusions (Sections 5 and 6).

2 Convex feasibility background

2.1 Feasibility problem

Convex feasibility problems (CFPs) arise in many real-world applications e.g. imaging, sensor networks, radiation therapy treatment planning (see [36, 44, 45] and the references therein). We formalize the CFP setting and relevant methods as follows. Let \({\mathcal {U}}\) and \({\mathcal {D}}\) be finite dimensional Hilbert spaces, referred to as the signal and data spaces, respectively. Given additional knowledge about a linear inverse problem, measurement data \(d\in {\mathcal {D}}\) can be used to express a CFP solved by the true signal \(u_{d}^{\star }\in {\mathcal {U}}\) when measurements are noise-free. That is, data d can be used to define a collection \(\{{\mathcal {C}}_{d,j}\}_{j=1}^{m}\) of closed convex subsets of \({\mathcal {U}}\) (e.g. hyperplanes) such that the true signal \(u_{d}^{\star }\) is contained in their intersection i.e. \(u_{d}^{\star }\) solves the problem

$$ \text{Find $u_{d}$ such that}\quad u_{d} \in {\mathcal {C}}_{d} \triangleq \bigcap_{j=1}^{m} {\mathcal {C}}_{d,j}. $$
(CFP)

A common approach to solving (CFP), inter alia, is to use projection algorithms [46], which utilize orthogonal projections onto the individual sets \({\mathcal {C}}_{d,j}\). For a closed, convex, and nonempty set \({\mathcal {C}}\subseteq {\mathcal {U}}\), the projection \(P_{{ \mathcal {C}}}:{\mathcal {U}}\rightarrow {\mathcal {C}}\) onto \({\mathcal {C}}\) is defined by

$$ P_{{ \mathcal {C}}}(u) \triangleq \mathop {\operatorname {argmin}}_{v\in {\mathcal {C}}} \frac{1}{2} \Vert v-u \Vert ^{2}. $$
(3)
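For the sets most often used in this setting, the minimization in (3) has a closed form. The sketch below shows two standard cases, a hyperplane \(\{v : \langle a, v\rangle = b\}\) and a box; the function names and the plain-list representation of vectors are assumptions made for illustration.

```python
def project_hyperplane(u, a, b):
    """Orthogonal projection of u onto the hyperplane {v : <a, v> = b}."""
    residual = sum(ai * ui for ai, ui in zip(a, u)) - b
    scale = residual / sum(ai * ai for ai in a)
    # Move along the normal direction a until the constraint holds.
    return [ui - scale * ai for ui, ai in zip(u, a)]


def project_box(u, lo=0.0, hi=1.0):
    """Orthogonal projection of u onto the box [lo, hi]^n (entrywise clip)."""
    return [min(max(ui, lo), hi) for ui in u]


p = project_hyperplane([2.0, 0.0], a=[1.0, 1.0], b=1.0)
q = project_box([-0.3, 0.4, 1.7])
```

Both projections cost a single pass over the vector, which illustrates why projecting onto individual sets is cheap compared with projecting onto their intersection.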

Projection algorithms are iterative in nature and each update uses combinations of projections onto each set \({\mathcal {C}}_{d,j}\), relying on the principle that it is generally much easier to project onto the individual sets than onto their intersection. These methods date back to the 1930s [47, 48] and have been adapted to handle huge-size problems of dimensions for which more sophisticated methods cease to be efficient or even applicable due to memory requirements [36]. Computational simplicity derives from the fact that the building blocks of a projection algorithm are the projections onto individual sets. Memory efficiency occurs because the algorithmic structure is either sequential or simultaneous (or hybrid) as in the block-iterative projection methods [49, 50] and string-averaging projection methods [36, 51–53]. These algorithms generate sequences that solve (CFP) asymptotically, and the update operations can be iteration dependent (e.g. cyclic projections). We let \({\mathcal {A}}_{d}^{k}\) be the update operator for the kth step of a projection algorithm solving (CFP). Consequently, each projection algorithm generates a sequence \(\{u^{k}\}_{k\in {\mathbb {N}}}\) via the fixed point iteration

$$ u^{k+1} \triangleq {\mathcal {A}}_{d}^{k} \bigl(u^{k}\bigr)\quad \text{for all $k\in {\mathbb {N}}$.} $$
(FPI)

A common assumption for such methods is that the intersection of all the algorithmic operators’ fixed point sets contains or forms the desired set \({\mathcal {C}}_{d}\) i.e.

$$ {\mathcal {C}}_{d} = \bigcap_{k=1}^{\infty } \mathrm{fix}\bigl({\mathcal {A}}_{d}^{k}\bigr), $$
(4)

which automatically holds when \(\{{\mathcal {A}}_{d}^{k}\}_{k\in {\mathbb {N}}}\) cycles over a collection of projections.
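As an illustration of (FPI) with cyclic control, the following sketch runs cyclic hyperplane projections (the classical Kaczmarz scheme) on a small consistent linear system. The system, the starting point, and the step count are illustrative assumptions and are not taken from the experiments in this work.

```python
def dot(a, u):
    """Euclidean inner product of two equal-length lists."""
    return sum(ai * ui for ai, ui in zip(a, u))


def cyclic_projections(rows, rhs, u, num_steps):
    """Run u <- P_i(u), cycling over hyperplanes {v : <rows[i], v> = rhs[i]}."""
    m = len(rows)
    for k in range(num_steps):
        i = k % m  # cyclic control: sweep through the m constraint sets
        a, b = rows[i], rhs[i]
        scale = (dot(a, u) - b) / dot(a, a)
        u = [ui - scale * ai for ui, ai in zip(u, a)]
    return u


# Two hyperplanes in the plane intersecting at the single point (2, 1).
rows = [[1.0, 1.0], [1.0, 0.0]]
rhs = [3.0, 2.0]
u_limit = cyclic_projections(rows, rhs, [0.0, 0.0], num_steps=200)
```

Here every operator's fixed point set is the corresponding hyperplane, so the intersection in (4) is exactly the solution set of the linear system.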

2.2 Data-driven feasibility problem

As noted previously, inverse problems are often ill-posed, making (CFP) insufficient to faithfully recover the signal \(u_{d}^{\star }\). Additionally, when noise is present, it can often be the case that the intersection is empty (i.e. \({\mathcal {C}}_{d} = \emptyset \)). This calls for a different model to recover \(u_{d}^{\star }\). To date, projection methods have limited inclusion of regularization (e.g. superiorization [54–58], sparsified Kaczmarz [59, 60]). Beyond sparsity via \(\ell _{1}\) minimization, such approaches typically do not yield guarantees beyond feasibility (e.g. it may be desirable to minimize a regularizer over \({\mathcal {C}}_{d}\)). We propose composing a projection algorithm and a data-driven regularization operator in a manner so that each update is analogous to a proximal-gradient step. This is accomplished via a parameterized mapping \(R_{\Theta }:{\mathcal {U}}\rightarrow {\mathcal {U}}\), with weights denoted by Θ. This mapping directly leverages available data (explained in Section 3) to learn features shared among true signals of interest. We augment (CFP) by using operators \(\{{\mathcal {A}}_{d}^{k}\}_{k\in {\mathbb {N}}}\) for solving (CFP) and instead solve the learned common fixed points (L-CFP) problem

$$ \text{Find $\tilde{u}_{d}$ such that}\quad \tilde{u}_{d} \in {\mathcal {C}}_{ \Theta,d} \triangleq \bigcap_{k=1}^{\infty } \mathrm{fix}\bigl({\mathcal {A}}_{d}^{k} \circ R_{\Theta } \bigr). $$
(L-CFP)

Loosely speaking, when \(R_{\Theta }\) is chosen well, the signal \(\tilde{u}_{d}\) closely approximates \(u_{d}^{\star }\).

We utilize classic operator results to solve (L-CFP). An operator \(T\colon {\mathcal {U}}\rightarrow {\mathcal {U}}\) is nonexpansive if it is 1-Lipschitz i.e.

$$ \bigl\Vert T(u) - T(v) \bigr\Vert \leq \Vert u-v \Vert \quad\text{for all $u,v\in {\mathcal {U}}$.} $$
(5)

Also, T is averaged if there exist \(\alpha \in (0,1)\) and a nonexpansive operator \(Q:{\mathcal {U}}\rightarrow {\mathcal {U}}\) such that \(T(u) = (1-\alpha ) u + \alpha Q(u) \) for all \(u\in {\mathcal {U}}\). For example, the projection \(P_{{ \mathcal {C}}}\) defined in (3) is averaged, as are convex combinations of projections [61]. Our method utilizes the following standard assumptions, which are typically satisfied by projection methods (in the noise-free setting with \(R_{\Theta }\) as the identity).

Assumption 2.1

The intersection set \({\mathcal {C}}_{\Theta,d}\) defined in (L-CFP) is nonempty and \(\{({\mathcal {A}}_{d}^{k}\circ R_{\Theta })\}_{k\in {\mathbb {N}}}\) forms a sequence of nonexpansive operators.

Assumption 2.2

For any sequence \(\{u^{k}\}_{k\in {\mathbb {N}}}\subset {\mathcal {U}}\), the sequence of operators \(\{({\mathcal {A}}_{d}^{k}\circ R_{\Theta })\}_{k\in {\mathbb {N}}}\) has the property

$$ \lim_{k\rightarrow \infty } \bigl\Vert \bigl({\mathcal {A}}_{d}^{k} \circ R_{\Theta }\bigr) \bigl(u^{k}\bigr)-u^{k} \bigr\Vert = 0 \quad\Longrightarrow \quad\liminf_{k\rightarrow \infty } \bigl\Vert P_{{\mathcal {C}}_{\Theta,d}}\bigl(u^{k}\bigr)-u^{k} \bigr\Vert = 0. $$
(6)

When a finite collection of update operations are used and applied (essentially) cyclically, the previous assumption automatically holds (e.g. setting \({\mathcal {A}}_{d}^{k} \triangleq P_{{\mathcal {C}}_{d,i_{k}}}\) and \(i_{k} \triangleq k\ \text{mod}(m) + 1\)). We use the learned fixed point iteration to solve (L-CFP)

$$ u^{k+1} \triangleq \bigl( {\mathcal {A}}_{d}^{k}\circ R_{\Theta }\bigr) \bigl(u^{k}\bigr) \quad\text{for all $k\in {\mathbb {N}}$.} $$
(L-FPI)

Justification of the (L-FPI) iteration is provided by the following theorems, which are rewritten from their original form to a manner that matches the present context.

Theorem 2.1

(Krasnosel’skiĭ–Mann [62, 63])

If \(({\mathcal {A}}_{d}\circ R_{\Theta })\colon {\mathcal {U}}\rightarrow {\mathcal {U}}\) is averaged and has a fixed point, then, for any \(u^{1}\in {\mathcal {U}}\), the sequence \(\{u^{k}\}_{k\in {\mathbb {N}}}\) generated by (L-FPI), taking \({\mathcal {A}}_{d}^{k} \circ R_{\Theta }= {\mathcal {A}}_{d}\circ R_{\Theta }\), converges to a fixed point of \({\mathcal {A}}_{d}\circ R_{\Theta }\).

Theorem 2.2

(Cegielski, Theorem 3.6.2, [61])

If Assumptions 2.1 and 2.2 hold, and if \(\{u^{k}\}_{k\in {\mathbb {N}}}\) is a sequence generated by the iteration (L-FPI) satisfying \(\|u^{k+1}-u^{k}\|\rightarrow 0\), then \(\{u^{k}\}_{k\in {\mathbb {N}}}\) converges to a limit \(u^{\infty }\in {\mathcal {C}}_{\Theta,d}\).
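As a worked instance of the averagedness property required by Theorem 2.1, consider the projection \(P_{{\mathcal {C}}}\) from (3). The following decomposition relies on the standard fact that the reflection \(2P_{{\mathcal {C}}} - I\) across a closed convex set is nonexpansive; it is included here as an illustration rather than as part of the cited results.

$$ P_{{\mathcal {C}}} = \tfrac{1}{2} I + \tfrac{1}{2} (2P_{{\mathcal {C}}} - I ), \qquad \bigl\Vert (2P_{{\mathcal {C}}} - I ) (u) - (2P_{{\mathcal {C}}} - I ) (v) \bigr\Vert \leq \Vert u-v \Vert \quad\text{for all $u,v\in {\mathcal {U}}$.} $$

Hence \(P_{{\mathcal {C}}}\) is averaged with \(\alpha = 1/2\) and \(Q = 2P_{{\mathcal {C}}} - I\). Since compositions of averaged operators (with respect to the same norm) are again averaged, the composition \({\mathcal {A}}_{d}\circ R_{\Theta }\) fits Theorem 2.1 whenever both factors are averaged in a common norm and a fixed point exists.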

3 Fixed point networks overview

One of the most promising areas in artificial intelligence is deep learning, a form of machine learning that uses neural networks containing many hidden layers [64, 65]. Deep learning tasks in the context of this work can be cast as follows. Given measurements d drawn from a distribution \({\mathbb {P}}_{{ \mathcal {D}}}\) and corresponding signals \(u_{d}^{\star }\) drawn from a distribution \({\mathbb {P}}_{{ \mathcal {U}}}\), we seek a mapping \({\mathcal {N}}_{\Theta }\colon {\mathcal {D}}\rightarrow {\mathcal {U}}\) that approximates a one-to-one correspondence between the measurements and signals i.e.

$$ {\mathcal {N}}_{\Theta }(d) \approx u_{d}^{\star }\quad \text{for all $d\sim {\mathbb {P}}_{{ \mathcal {D}}}$.} $$
(7)

Depending on the nature of the given data, the task at hand can be regression or classification. In this work, we focus on solving regression problems where the learning is supervised i.e. the loss function explicitly uses a correspondence between input and output data pairings. When the loss function does not use this correspondence (or when not all data pairings are available), the learning is semi-supervised if partial pairings are used and unsupervised if no pairings are used.

3.1 Recurrent neural networks

A common model for \({\mathcal {N}}_{\Theta }\) is given by recurrent neural networks (RNNs) [66], which have enjoyed a great deal of success in natural language processing (NLP) [67], time series [68], and classification [68]. An N-layer RNN takes observed data d as input and can be modeled as the N-fold composition of a mapping \(T_{\Theta }\) via

$$ \begin{aligned} {\mathcal {N}}_{\Theta }\triangleq \underbrace{T_{\Theta } \circ T_{\Theta } \circ \ldots \circ T_{\Theta }}_{N \text{ times}}. \end{aligned} $$
(8)

Here, \(T_{\Theta }(u; d)\) is an operator comprised of a finite sequence of applications of (possibly) distinct affine mappings and nonlinearities, and u is initialized to some fixed value (e.g. the zero vector). To identify a faithful mapping \({\mathcal {N}}_{\Theta }\) as in (7), we solve a training problem. This is modeled as finding weights that minimize an expected loss, which is typically solved using optimization methods like SGD [69] and Adam [70]. In particular, we solve the training problem

$$ \min_{\Theta } {\mathbb {E}}_{d \sim {\mathbb {P}}_{{ \mathcal {D}}}} \bigl[ \ell \bigl({\mathcal {N}}_{\Theta }(d), u_{d}^{\star }\bigr) \bigr], $$
(9)

where \(\ell \colon {\mathcal {U}}\times {\mathcal {U}}\to {\mathbb {R}}\) models the discrepancy between the prediction \({\mathcal {N}}_{\Theta }(d)\) of the network and the training data \(u_{d}^{\star }\). In practice, the expectation in (9) is approximated using a finite subset of data, which is referred to as training data. In addition to minimizing over the training data, we aim for (7) to hold for a set of testing data that was not used during training (which tests the network’s ability to generalize).

Remark 3.1

We emphasize that a time intensive offline process is used to find a solution \(\Theta ^{\star }\) to (9) (as is common in machine learning). After this, in the online setting, we apply \({\mathcal {N}}_{\Theta ^{\star }}(d)\) to recover a signal \(u_{d}^{\star }\) from its previously unseen measurements d, which is a much faster process.

Remark 3.2

If we impose a particular structure to \(T_{\Theta }\), as shown in Figure 1, an N-layer RNN can be interpreted as an unrolled fixed point (or optimization) algorithm that runs for N iterations. Our experiments compare our proposed method to such an unrolled scheme.

Figure 1

Diagram for update operations in the learned fixed point iteration (L-FPI) to solve (L-CFP). Here \(R_{\Theta }\) is comprised of a finite sequence of applications of (possibly) distinct affine mappings (e.g. convolutions) and nonlinearities (e.g. projections on the nonnegative orthant i.e. ReLUs). For each \(k\in {\mathbb {N}}\), we let \({\mathcal {A}}_{d}^{k}\) be a projection-based algorithmic operator. The parameters Θ of \(R_{\Theta }\) are tuned in an offline process by solving (9) to ensure signals are faithfully recovered.

3.2 Fixed point networks

Increasing neural network depth leads to more expressibility [71, 72]. A recent trend in deep learning asks: what happens when the number of recurrent layers N goes to infinity? Due to ever growing memory requirements (growing linearly with N) to train networks, directly unrolling a sequence generated by successively applying \(T_{\Theta }\) is, in general, intractable for arbitrarily large N. However, the sequence limit can be modeled using a fixed point equation. In this case, evaluating a fixed point network (FPN) [73] is equivalent to finding the unique fixed point of an averaged operator \(T_{\Theta }(\cdot;d)\) i.e. an FPN \({\mathcal {N}}_{\Theta }\) is defined by

$$ {\mathcal {N}}_{\Theta }(d) \triangleq u_{\Theta,d},\quad \text{where } u_{\Theta,d} = T_{\Theta }(u_{\Theta,d};d). $$
(10)

Standard results [74–76] can be used to guarantee the existence of fixed points of nonexpansive \(T_{\Theta }\). Iteratively applying \(T_{\Theta }\) produces a convergent sequence (Theorem 2.1). However, for different d, the number of steps to converge may vary, and so these models belong to the class of implicit depth models. As mentioned, it is computationally intractable to differentiate with respect to Θ by applying the chain rule through each of the N layers (when N is sufficiently large). Instead, the gradient \(\mathrm{d}\ell /\mathrm{d}\Theta \) is computed via the implicit function theorem [77]. Specifically, the gradient is obtained by solving the Jacobian-inverse equation (e.g. see [78–80])

$$ \frac{\mathrm{d}\ell }{\mathrm{d}\Theta } = {\mathcal {J}}_{\Theta }^{-1} \frac{\partial T}{\partial \Theta }, \quad\text{where } {\mathcal {J}}_{\Theta }\triangleq I - \frac{dT_{\Theta }}{d u}. $$
(11)

Recent works that solve the Jacobian-inverse equation in (11) to train neural networks include deep equilibrium networks [78, 81] and monotone equilibrium networks [79]. A key difficulty arises when computing the gradient via (11), especially when the signal space has large dimensions (e.g. when \(u_{d}^{\star }\) is a high-resolution image). Namely, a linear system involving the Jacobian term \({\mathcal {J}}_{\Theta }\) must be approximately solved to estimate the gradient of \(\ell \). Recently, a new framework for training implicit depth models, called Jacobian-free backpropagation (JFB) [73], was presented in the context of FPNs, which avoids the intensive linear system solves at each step. The idea is to replace gradient \(\mathrm{d}\ell /\mathrm{d}\Theta \) updates with \(\partial T/\partial \Theta \), which is equivalent to a preconditioned gradient (since \({\mathcal {J}}_{\Theta }^{-1}\) is coercive [73, Lemma A.1]). JFB provides a descent direction and was found to be effective and competitive for training implicit-depth neural networks at substantially reduced computational costs. Since the present work solves inverse problems where the signal space has very high dimension, we leverage FPNs and JFB to solve (9) for our proposed method.
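The difference between the implicit gradient and the JFB direction can be seen on a scalar toy model. The sketch below assumes a hypothetical operator \(T_{\theta }(u;d) = \theta u + d\) with \(|\theta | < 1\) and a squared loss against a made-up target; it writes the chain rule out with the loss derivative as an explicit factor. None of these choices come from the experiments in this work.

```python
def fixed_point(theta, d, tol=1e-12, max_iter=10_000):
    """Iterate the toy map T_theta(u; d) = theta * u + d until it stalls."""
    u = 0.0
    for _ in range(max_iter):
        u_next = theta * u + d
        if abs(u_next - u) < tol:
            return u_next
        u = u_next
    return u


theta, d, target = 0.5, 1.0, 1.0
u_star = fixed_point(theta, d)   # analytically d / (1 - theta)

dl_du = u_star - target          # derivative of 0.5 * (u - target)**2
dT_dtheta = u_star               # partial derivative of T in theta at u_star
jacobian = 1.0 - theta           # scalar version of I - dT/du

implicit_grad = dl_du * dT_dtheta / jacobian  # requires the Jacobian solve
jfb_grad = dl_du * dT_dtheta                  # JFB: drop the inverse Jacobian
```

In this toy case the two quantities differ only by the positive factor \(1/(1-\theta )\), so the JFB direction has the same sign as the true gradient, which is the scalar analogue of the preconditioning interpretation above.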

3.3 Learning to optimize

An emerging field in machine learning is known as “learning to optimize” (L2O) (e.g. see the survey works [82, 83]). As a paradigm shift away from conventional optimization algorithm design, L2O uses machine learning to improve an optimization method. Two approaches are typically used for model-based algorithms. Plug-and-Play (PnP) methods learn a denoiser in the form of a neural network and then plug this denoiser into an optimization algorithm (e.g. to replace a proximal operator for total variation). Here training of the denoiser is separate from the task at hand. On the other hand, unrolling methods incorporate tunable weights into an algorithm that is truncated to a fixed number of iterations, forming a neural network. Unrolling the iterative soft thresholding algorithm (ISTA), the authors in [84] obtained the first major L2O scheme, learned ISTA (LISTA), by letting each matrix in the updates be tunable. Follow-up papers also demonstrate empirical success in various applications, including compressive sensing [85–93], denoising [29, 88, 93–99], and deblurring [88, 93, 95, 100–105]. L2O schemes are related to our method, but no L2O scheme has, to our knowledge, used a fixed point model as in (L-CFP). Additionally, our JFB training regime differs from the L2O unrolling and PnP schemes.

4 Proposed method

Herein we present the feasibility-based FPN (F-FPN). Although based on FPNs, here we replace the single operator of FPNs by a sequence of operators, each taking the form of a composition. Namely, we use updates in the iteration (L-FPI). The assumptions necessary for convergence can be approximately ensured (e.g. see Subsection A.4 in the Appendix). This iteration yields the F-FPN \({\mathcal {N}}_{\Theta }\), defined by

$$ {\mathcal {N}}_{\Theta }(d) \triangleq \tilde{u}_{d},\quad \text{where } \tilde{u}_{d} \in \bigcap_{k=1}^{\infty } \mathrm{fix}\bigl({\mathcal {A}}_{d}^{k} \circ R_{\Theta } \bigr), $$
(12)

assuming the intersection contains a single point. This is approximately implemented via Algorithm 1. The weights Θ of the network \({\mathcal {N}}_{\Theta }\) are tuned by solving the training problem (9). In an ideal situation, the optimal weights \(\Theta ^{\star }\) solving (9) would yield feasible outputs (i.e. \({\mathcal {N}}_{\Theta }(d) \in {\mathcal {C}}_{d}\) for all data \(d\in {\mathcal {D}}\)) that also resemble the true signals \(u_{d}^{\star }\). However, measurement noise in practice makes it unlikely that \({\mathcal {N}}_{\Theta }(d)\) is feasible, let alone that \({\mathcal {C}}_{d}\) is nonempty. In the noisy setting, this is no longer a concern since we augment (CFP) via (L-CFP) and are ultimately concerned with recovering a signal \(u_{d}^{\star }\), not solving a feasibility problem. In summary, our model is based on the underlying physics of a problem (via the convex feasibility structure), but is also steered by available data via training problem (9). Illustrations of the efficacy of this approach are provided in Section 5.

Algorithm 1

Feasibility-based fixed point network (F-FPN)
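The forward pass can be sketched as the iteration (L-FPI) run until successive iterates stop changing, which mirrors the condition \(\|u^{k+1}-u^{k}\|\rightarrow 0\) in Theorem 2.2. The code below is a toy illustration and not a reproduction of Algorithm 1: the regularization operator is a hypothetical stand-in (an entrywise clip to \([0,1]\)) rather than a trained network, the operators \({\mathcal {A}}_{d}^{k}\) are cyclic hyperplane projections, and the tolerance, iteration cap, and all names are assumptions.

```python
import math


def project_hyperplane(u, a, b):
    """Orthogonal projection of u onto the hyperplane {v : <a, v> = b}."""
    residual = sum(ai * ui for ai, ui in zip(a, u)) - b
    scale = residual / sum(ai * ai for ai in a)
    return [ui - scale * ai for ui, ai in zip(u, a)]


def regularizer(u):
    """Hypothetical stand-in for the trained R_Theta: clip entries to [0, 1]."""
    return [min(max(ui, 0.0), 1.0) for ui in u]


def ffpn_forward(rows, rhs, u, tol=1e-10, max_iter=1000):
    """Apply u <- A_d^k(R(u)) until the update norm drops below tol."""
    m = len(rows)
    for k in range(max_iter):
        i = k % m  # cycle through the constraint sets defined by the data
        u_next = project_hyperplane(regularizer(u), rows[i], rhs[i])
        step = math.sqrt(sum((x - y) ** 2 for x, y in zip(u_next, u)))
        u = u_next
        if step < tol:
            return u, k + 1
    return u, max_iter


# Toy data: the two hyperplanes meet at (0.4, 0.6), which lies in the box.
rows = [[1.0, 1.0], [1.0, 0.0]]
rhs = [1.0, 0.4]
u_tilde, num_iters = ffpn_forward(rows, rhs, [0.0, 0.0])
```

The number of iterations is not fixed in advance and depends on the data, which is the implicit-depth behaviour described in Section 3.2.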

5 Experiments

Experiments in this section demonstrate the relative reconstruction quality of F-FPNs and comparable schemes—in particular, filtered backprojection (FBP) [106], total variation (TV) minimization (similarly to [107, 108]), total variation superiorization (based on [109, 110]), and an unrolled L2O scheme with an RNN structure.

5.1 Experimental setup

Comparisons are provided for two low-dose CT examples: a synthetic dataset, consisting of images of random ellipses, and the LoDoPaB dataset [111], which consists of human phantoms. For both datasets, CT measurements are simulated with a parallel beam geometry with a sparse-angle setup of only 30 angles and 183 projection beams, resulting in 5490 equations and 16,384 unknowns. Additionally, we add 1.5% Gaussian noise corresponding to each individual beam measurement. Moreover, the images have a resolution of \(128 \times 128\) pixels. The quality of the image reconstructions is determined using the peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM). We use the PyTorch deep learning framework [112] and the Adam [70] optimizer. We also use the operator discretization library (ODL) Python library [113] to compute the filtered backprojection solutions. The CT experiments are run on a Google Colab notebook. For all methods, we use a single diagonally relaxed orthogonal projections (DROP) [37] operator for \({\mathcal {A}}_{d}\) (i.e. \({\mathcal {A}}_{d}^{k} = {\mathcal {A}}_{d}\) for all k), noting DROP is nonexpansive with respect to a norm dependent on A [114]. The loss function used for training is the mean squared error between reconstruction estimates and the corresponding true signals. The synthetic dataset consists of random phantoms of combined ellipses as in [115]. The ellipse training and test sets contain 10,000 and 1000 pairs, respectively. The LoDoPaB phantoms are derived from actual human chest CT scans via the benchmark low-dose parallel beam dataset (LoDoPaB) [111]. The LoDoPaB training and test sets contain 20,000 and 2000 pairs, respectively.
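The problem dimensions and the PSNR metric can be checked with a few lines of code. The sketch below uses the standard PSNR definition with peak value 1 (matching pixel values in \([0,1]\)); the exact PSNR variant used in the experiments, the function name, and the tiny example vectors are assumptions. SSIM is omitted because it requires windowed statistics.

```python
import math


def psnr(estimate, truth, peak=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images.

    Assumes the two inputs differ somewhere (a zero error would divide by 0).
    """
    mse = sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth)
    return 10.0 * math.log10(peak ** 2 / mse)


# Sparse-angle setup: one equation per (angle, beam) pair, one unknown per pixel.
num_equations = 30 * 183
num_unknowns = 128 * 128

# Made-up four-pixel example, only to exercise the formula.
value = psnr([0.1, 0.5, 0.9, 0.3], [0.0, 0.5, 1.0, 0.3])
```

With roughly three times as many unknowns as equations, the linear system is underdetermined, which is why regularization is needed in addition to data consistency.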

5.2 Experiment methods

TV superiorization

Sequences generated by successively applying the operator \({\mathcal {A}}_{d}\) are known to converge even in the presence of summable perturbations, which can be intentionally added to lower a regularizer value (e.g. TV) without compromising convergence, thereby giving a “superior” feasible point. Compared to minimization methods, superiorization typically only guarantees feasibility, but is often able to do so at reduced computational cost. This scheme, denoted as TVS, generates updates

$$ u^{k+1} = {\mathcal {A}}_{d} \biggl(u^{k} - \alpha \beta ^{k} D_{-}^{\top } \biggl(\frac{D_{+}u^{k}}{ \Vert D_{+}u^{k} \Vert _{2}+\varepsilon } \biggr) \biggr)\quad \text{for $k=1,2,\ldots,20$,} $$
(13)

where \(D_{+}\) and \(D_{-}\) are the forward and backward differencing operators, respectively, \(\varepsilon >0\) is added for stability, and 20 iterations are used as early stopping to avoid overfitting to noise. The differencing operations yield a derivative of isotropic TV (e.g. see [116]). The scalars \(\alpha >0 \) and \(\beta \in (0,1)\) are chosen to minimize training mean squared error. See the superiorization bibliography [117] for further TVS materials.

TV minimization

For a second analytic comparison method, we use anisotropic TV minimization (TVM). In this case, we solve the constrained problem

$$ \min_{u\in [0,1]^{n}} \Vert D_{+}u \Vert _{1} \quad\text{such that } \Vert Au-d \Vert \leq \varepsilon, $$
(TVM)

where \(\varepsilon > 0\) is a hand-chosen scalar reflecting the level of measurement noise and the box constraints on u are included since all signals have pixel values in the interval \([0,1]\). We use linearized ADMM [118] to solve (TVM) and refer to this model as TV minimization (TVM). Implementation details are in the Appendix.

F-FPN structure

The architecture of the operator \(R_{\Theta }\) is modeled after the seminal work [119] on residual networks. The F-FPN and unrolled scheme both leverage the same structure \(R_{\Theta }\) and DROP operator for \({\mathcal {A}}_{d}\). The operator \(R_{\Theta }\) is the composition of four residual blocks. Each residual block takes the form of the identity mapping plus the composition of a leaky ReLU activation function and convolution (twice). The number of network weights in \(R_{\Theta }\) for each setup was 96,307, a small number by machine learning standards. Further details are provided in the Appendix.

5.3 Experiment results

Our results show that F-FPN outperforms all classical methods as well as the unrolled data-driven method. We show the result on an individual reconstruction via wide and zoomed-in images from the ellipse and LoDoPaB testing datasets in Figures 2 and 3 and Figures 4 and 5, respectively. The average SSIM and PSNR values on the entire ellipse and LoDoPaB testing datasets are shown in Tables 1 and 2. We emphasize that the type of noise depends on each individual ray in a similar manner to [120], making the measurements noisier than in some related works. This noise and the ill-posedness of our underdetermined setup are illustrated by the poor quality of analytic method reconstructions. (However, we note improvement by using TV over FBP and further improvement by TV minimization over TV superiorization.) Although the unrolled method is nearly identical in structure to F-FPNs, these results show it to be inferior to F-FPNs in these experiments. We hypothesize that this is due to two factors: the large memory requirements of unrolling (unlike F-FPNs), which limit the number of unrolled steps (∼20 steps versus 100+ steps of F-FPNs), and the fact that F-FPNs are tuned to optimize a fixed point condition rather than a fixed number of updates.

Figure 2

Ellipse reconstruction with test data for each method: filtered back projection (FBP), TV superiorization (TVS), TV minimization (TVM), unrolled network, and the proposed feasibility-based fixed point network (F-FPN).

Figure 3

Zoomed-in ellipse reconstruction with test data of Figure 2 for each method: FBP, TVS, TVM, unrolling, and the proposed F-FPN.

Figure 4

LoDoPaB reconstruction with test data for each method: filtered back projection (FBP), TV superiorization (TVS), TV minimization (TVM), unrolled network, and the proposed feasibility-based fixed point network (F-FPN).

Figure 5

Zoomed-in LoDoPaB reconstruction with test data of Figure 4 for each method: FBP, TVS, TVM, unrolling, and the proposed F-FPN.

Table 1 Average PSNR and SSIM on the 1000 image ellipse testing dataset.
Table 2 Average PSNR and SSIM on the 2000 image LoDoPaB testing dataset.

6 Conclusion

This work connects feasibility-seeking algorithms and data-driven algorithms (i.e. neural networks). The F-FPN framework leverages the elegance of fixed point methods while using state of the art training methods for implicit-depth deep learning. This results in a sequence of learned operators \(\{{\mathcal {A}}_{d}^{k}\circ R_{\Theta }\}_{k\in {\mathbb {N}}}\) that can be repeatedly applied until convergence is obtained. This limit point is expected to be nearly compatible with provided constraints (up to the level of noise) and resemble the collection of true signals. The provided numerical examples show improved performance obtained by F-FPNs over both classic methods and an unrolling-based network. Future work will extend FPNs to a wider class of optimization problems and further establish theory connecting machine learning to fixed point methods.