Introduction

Magnetic resonance imaging (MRI) has long been at the forefront of medical imaging, offering an unparalleled ability to visualize the human body's internal structures and functions. Because it is noninvasive and provides exceptional soft-tissue contrast, it has become an indispensable tool in modern diagnostic medicine. From detecting subtle pathological changes to guiding therapeutic interventions, MRI's versatility is unmatched. However, the full potential of MRI is often constrained by inherent limitations in imaging speed and resolution, which are crucial for accurate diagnosis and patient comfort [1].

Long acquisition times represent a major barrier in clinical MRI, primarily due to the inherent trade-off between image quality and imaging speed [2,3,4,5,6,7,8]. High-resolution images, which are crucial for accurate diagnosis, require extended scan durations that can be uncomfortable for patients and increase the risk of motion artifacts [9]. This challenge is particularly significant in dynamic imaging, such as cardiac and abdominal studies, where fast physiological motion can cause blurring and other distortions. Moreover, sequences with long repetition times (TRs), such as those used in diffusion MRI, extensive field-of-view (FOV) coverage, and protocols that require multiple contrasts significantly extend the duration of MRI scans. Additionally, the acquisition of 3D images, which provide comprehensive spatial detail for better clinical evaluation, necessitates longer scan times due to the increased volume of data being collected. These complexities highlight the pressing need for faster imaging techniques that reduce MRI acquisition times while maintaining high image quality.

Significant research efforts have been dedicated to accelerating MRI. Central to these endeavors is the development of methods for image reconstruction from under-sampled data. Techniques based on parallel imaging (PI), introduced in the 1990s, were an important watershed [3,4,5,6,7,8]. These techniques leverage the spatial diversity of multi-coil arrays to reconstruct images, thus allowing for reduced scan times by acquiring less data. The early 2000s saw the emergence of compressed sensing (CS) methods, which constituted a novel approach to MRI reconstruction [10,11,12,13,14,15,16,17]. CS exploits the sparsity of MRI images, enabling the reconstruction of high-quality images from a much smaller set of measurements than traditionally required. The development of these techniques represented major milestones in MRI, and contributed to substantial reductions in scan duration and improvements in image quality.

However, both parallel imaging and CS methods have limitations that affect their practicality in clinical settings. Parallel imaging is highly dependent on the geometry and sensitivity of the coil arrays, with suboptimal configurations leading to uneven image quality and potential artifacts. It also faces a practical limit on acceleration, beyond which significant noise amplification degrades image quality. CS, on the other hand, is computationally demanding due to the complex optimization problems it involves, particularly with non-Cartesian trajectories, and often incurs long compute times. It also relies heavily on hand-crafted priors for image reconstruction, which may not be applicable across different types of scans, thus limiting its adaptability. Furthermore, in low-rank CS MRI, the challenge of choosing an appropriate rank to balance image fidelity and computational efficiency often leads to a trade-off between reconstruction accuracy and speed. These challenges underscore the need for continued advancements in MRI technology to balance speed, image quality, and usability in clinical environments.

In recent years, the advent of deep learning (DL) has heralded a new era in MRI by offering promising solutions to these longstanding challenges. DL, a subset of machine learning characterized by algorithms based on computational neural networks, has had remarkable success in extracting complex patterns from large datasets [18,19,20,21]. In the realm of MR image reconstruction, DL methods focus on learning from vast amounts of data to transform under-sampled or noisy data into high-fidelity images. These methods have demonstrated their ability to mitigate artifacts, enhance resolution, and accelerate the imaging process [22,23,24,25,26,27,28,29,30,31]. There are already multiple public datasets curated to enable training of DL models on MRI, e.g. [32,33,34,35,36], as well as community challenges related to MRI reconstruction problems [37,38,39,40,41].

This review provides a comprehensive overview of recent advances and applications of deep learning (DL) to the reconstruction of magnetic resonance (MR) images. Given the rapid pace at which this field is evolving, encapsulating the entirety of the published literature is a formidable challenge. Previous reviews have laid the groundwork by detailing the fundamental components of DL architectures and providing theoretical analysis [26,27,28,29,30, 42,43,44,45,46,47,48,49]. Here we cover a broad range of approaches, and highlight emerging methods such as self-supervised learning and diffusion models. Furthermore, we also review closely related topics, such as DL methods for k-space trajectory optimization, pulse sequence design, quantitative MRI, motion correction, and multi-task pipelines. Moreover, we address areas where DL encounters significant hurdles, including susceptibility to distribution shifts, instabilities, and inherent biases. Finally, building on our hands-on experience, we propose actionable strategies for effectively enhancing the robustness of DL models to such challenges.

Background on MRI reconstruction

In this section, we describe the image formation process and forward model, and discuss conventional, optimization-based, non-DL image reconstruction methods.

Image formation and forward model

The acquisition process in many imaging schemes can be modeled by an operator \({{\textbf {A}}}\) applied to the continuous-domain image \({{\textbf {x}}}\), where the process of collecting measurements is described by \({{\textbf {y}}}={{\textbf {A}}}({\textbf {x}}) + {{\textbf {n}}}\). In MRI, the measurement operator \({{\textbf {A}}}\) commonly corresponds to a multi-coil Fourier sampling operator. Although the acquisition is continuous, the general practice is to discretize the problem. Thus, we consider the reconstruction of an image vector \({{\textbf {x}}}\) from linear measurements, modeled by a matrix \({{\textbf {A}}}\), as:

$$\begin{aligned} {{\textbf {y}}}={{\textbf {A}}}{{\textbf {x}}} + {{\textbf {n}}}. \end{aligned}$$
(1)

The above equation is a numerical model for the imaging device, and is often referred to as the forward model. In many imaging methods, the forward model is known precisely. However, there are many applications where the forward model is unknown or only partially known. Examples include imaging in the presence of motion during the acquisition, trajectory errors, and field inhomogeneity effects in MRI acquisitions.

Due to MRI's long scan durations, many scans are accelerated by sampling k-space at a sub-Nyquist rate. In these cases, the forward model \({{\textbf {A}}}\) is often rank-deficient, making the recovery of \({{\textbf {x}}}\) an ill-posed problem. In MRI, the forward map \({\textbf {A}}\) includes undersampling, the Fourier transform, and coil sensitivity maps.
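To make the discrete forward model concrete, the following is a minimal NumPy sketch of a Cartesian multi-coil operator and its adjoint. It is illustrative only: the function and variable names are ours, and fftshift conventions are omitted for brevity.

```python
import numpy as np

def forward(x, sens_maps, mask):
    """Forward model A = M F S: image -> undersampled multi-coil k-space.

    x:         (H, W) complex image
    sens_maps: (C, H, W) complex coil sensitivity maps (S)
    mask:      (H, W) binary k-space sampling mask (M), 1 = sampled
    """
    coil_images = sens_maps * x                                      # apply S
    kspace = np.fft.fft2(coil_images, axes=(-2, -1), norm="ortho")   # apply F
    return mask * kspace                                             # apply M

def adjoint(y, sens_maps, mask):
    """Adjoint A^H: undersampled multi-coil k-space -> coil-combined image."""
    coil_images = np.fft.ifft2(mask * y, axes=(-2, -1), norm="ortho")
    return np.sum(np.conj(sens_maps) * coil_images, axis=0)
```

With the orthonormal FFT convention, `adjoint` is the exact adjoint of `forward`, which is what iterative reconstruction algorithms require.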

Conventional model-based image recovery

When the recovery of \({{\textbf {x}}}\) is ill-posed, many MRI schemes such as SENSE [5] pose the recovery as an optimization problem \({{\textbf {x}}} = \arg \min _{{{\textbf {x}}}} {\mathcal {L}}({{\textbf {x}}})\) with an objective function

$$\begin{aligned} {\mathcal {L}}({{\textbf {x}}})=\underbrace{\Vert {{\textbf {A}}}\,{{\textbf {x}}}-{{\textbf {y}}}\Vert _2^2}_{\text{ data } \text{ consistency }} + {\lambda }\,\underbrace{{\mathcal {R}}({{\textbf {x}}})}_{\text{ regularization }}, \end{aligned}$$
(2)

where the first term is often called a data consistency term, and the second term is called a regularization prior. The objective (2) is sometimes called a variational objective [24].

The prior \({\mathcal {R}}: {\mathbb {C}}^n \rightarrow {\mathbb {R}}_{+}\) is used to restrict the solutions to the space of desirable images. The prior \({\mathcal {R}}({{\textbf {x}}})\) has a large value when \({{\textbf {x}}}\) is an undesirable image and is small for a desirable image. A common prior used in compressive sensing methods is wavelet-domain sparsity, where the number of non-zero wavelet coefficients or their surrogates are used as priors [10, 11]. In this case, the optimization algorithm facilitates the recovery of an image \({{\textbf {x}}}\) that has few non-zero wavelet coefficients.

From a Bayesian perspective, the above formulation can be viewed as a maximum a-posteriori (MAP) estimate [28, 50], where the goal is to find an image \({{\textbf {x}}}\) that maximizes the posterior distribution \(p({{\textbf {x}}}|{{\textbf {y}}}) = {p({{\textbf {y}}}|{{\textbf {x}}})\,p({{\textbf {x}}})}/{p({{\textbf {y}}})}\). The estimate is obtained by minimizing the negative log posterior

$$\begin{aligned} -\log p({{\textbf {x}}}|{{\textbf {y}}}) = \underbrace{-\log p({{\textbf {y}}}|{{\textbf {x}}})}_{\text{ data } \text{ consistency }} - \underbrace{\log p({{\textbf {x}}})}_{\text{ prior }}. \end{aligned}$$
(3)

Here, the first term is the data consistency term. It yields the squared-error data consistency term in Eq. (2) when the noise vector \({{\textbf {n}}}\) in Eq. (1) has i.i.d. Gaussian entries. Data consistency terms appear in both compressed sensing [10] and DL methods [51], as they ensure that the reconstructed images adhere closely to the acquired data. The second term incorporates prior information on the images [50, 52].

Over the past few decades, substantial research efforts have been dedicated to crafting effective priors. Tikhonov regularization, for instance, employs a Gaussian prior on \({{\textbf {x}}}\), resulting in a regularization term \({\mathcal {R}}({{\textbf {x}}}) =\Vert {{\textbf {x}}}\Vert ^2\), and compressed sensing methods mentioned earlier promote sparsity.
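For example, the objective (2) with the Tikhonov prior is quadratic and admits a closed-form minimizer (a standard result, stated here for concreteness):

$$\begin{aligned} {{\textbf {x}}}^{\star } = \left( {{\textbf {A}}}^H{{\textbf {A}}} + \lambda {{\textbf {I}}}\right) ^{-1}{{\textbf {A}}}^H{{\textbf {y}}}, \end{aligned}$$

where \({{\textbf {A}}}^H\) denotes the conjugate transpose of \({{\textbf {A}}}\). Most other priors, including sparsity-promoting ones, admit no such closed form, which motivates the iterative algorithms discussed next.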

Optimization algorithms

The loss in Eq. (2) is typically minimized by applying iterative first-order optimization algorithms such as gradient descent. Starting from \({\textbf {x}}_0\) with a stepsize of \(\eta _t\), iteration \(t+1\) of gradient descent is described by

$$\begin{aligned} {\textbf {x}}_{t+1}&= {\textbf {x}}_t- \eta _t\nabla \mathcal L({\textbf {x}}_t) \nonumber \\&= {\textbf {x}}_t- \eta _t\left( {{\textbf {A}}}^H ({\textbf {A}}{\textbf {x}}_t- {\textbf {y}}) + \nabla {\mathcal {R}}({\textbf {x}}_t) \right) . \end{aligned}$$
(4)

where H denotes a Hermitian transpose (i.e., conjugate transpose). The above algorithm depends on the gradient of the regularizer \(\nabla {\mathcal {R}}\). When \({\mathcal {R}}({{\textbf {x}}}) = -\log p({{\textbf {x}}})\) is the negative log-prior mentioned earlier, \(\nabla {\mathcal {R}}({{\textbf {x}}}) = -\nabla \log p({{\textbf {x}}})\), where \(\nabla \log p({{\textbf {x}}})\) is often referred to as the score of the distribution. The gradient step thus moves the estimate towards a signal with higher likelihood.

Other popular fast iterative algorithms for minimizing the objective (2) include the alternating direction method of multipliers (ADMM) [53] and the fast iterative shrinkage thresholding algorithm (FISTA) [54]. For example, the ADMM scheme considers the equivalent problem,

$$\begin{aligned} {{\textbf {x}}} = \arg \min _{\textbf{x}} \min _{{{\textbf {v}}}} \Vert {{\textbf {A}}}\,{{\textbf {x}}}-{{\textbf {y}}}\Vert _2^2 + {\lambda }\,{\mathcal {R}}({{\textbf {v}}}) \, \text{ such } \text{ that }\,\, {{\textbf {v}}}={{\textbf {x}}} \end{aligned}$$
(5)

The above problem is solved by alternating between the following steps

$$\begin{aligned} {{{\textbf {x}}}}_{t+1}= & {} \arg \min _{{{\textbf {x}}}} \Vert {{\textbf {A}}}\,{{\textbf {x}}}-{{\textbf {y}}}\Vert _2^2 + \beta \Vert {{\textbf {x}}}-({{\textbf {v}}}_t-{{\textbf {u}}}_t)\Vert ^2 \end{aligned}$$
(6)
$$\begin{aligned} {{{\textbf {v}}}}_{t+1}= & {} \arg \min _{{{\textbf {v}}}}\beta \Vert {{\textbf {v}}}-\underbrace{\left( {{\textbf {x}}}_{t+1}-{{\textbf {u}}}_t\right) }_{\overline{{{\textbf {x}}}}}\Vert ^2 + \lambda {\mathcal {R}}({{\textbf {v}}}) \end{aligned}$$
(7)
$$\begin{aligned} {{\textbf {u}}}_{t+1}= & {} {{\textbf {u}}}_{t} + ({{{\textbf {x}}}_{t+1}} - {{{\textbf {v}}}}_{t+1}) \end{aligned}$$
(8)

where \({\textbf {u}}\) is the Lagrange multiplier and \({{\textbf {v}}}\) is an auxiliary variable introduced in the ADMM algorithm to split the original optimization problem into smaller, more manageable subproblems. The second step of the above optimization scheme

$$\begin{aligned} {{{\textbf {v}}}} = \arg \min _{{{\textbf {v}}}} \beta \Vert {{\textbf {v}}}-\overline{{{\textbf {x}}}}\Vert ^2 + \lambda {\mathcal {R}}({{\textbf {v}}}) = {\mathcal {D}}_{\beta }(\overline{{{\textbf {x}}}}) \end{aligned}$$
(9)

can be viewed as a denoising step that cleans the current solution \(\overline{{{\textbf {x}}}}\), thus yielding \({{\textbf {v}}}\). For many penalties (e.g., the \(\ell _1\) norm), the solution to (9) can be evaluated as a proximal mapping. Here, \(\beta\) is a continuation parameter that can be interpreted as \({1}/{\sigma _{\beta }^2}\), where \(\sigma _{\beta }^2\) is the variance of the noise in \(\overline{{{\textbf {x}}}}\), which decreases with the iterations. The first step (6) involves an inversion step that reduces a cost function composed of a linear combination of the data consistency error and the deviation from the denoised image \({{\textbf {v}}}\). This provides an iterative denoising interpretation, which is used in plug-and-play algorithms (discussed below, in “Pretrained plug-and-play (PnP) methods”). One challenge with the above convex optimization schemes is their high computational complexity, due to the numerous iterations required for convergence. In particular, the data consistency step (6) involves the evaluation of the forward model and its adjoint, which is often computationally expensive.
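As a concrete instance (a standard result, added here for illustration): for the \(\ell _1\) penalty \({\mathcal {R}}({{\textbf {v}}}) = \Vert {{\textbf {v}}}\Vert _1\), the minimization in (9) decouples element-wise and the proximal mapping is the soft-thresholding operator

$$\begin{aligned} \left[ {\mathcal {D}}_{\beta }(\overline{{{\textbf {x}}}})\right] _i = \text {sign}(\overline{x}_i)\,\max \left( |\overline{x}_i|-\frac{\lambda }{2\beta },\, 0\right) . \end{aligned}$$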

DL reconstruction: approaches and architectures

In this section we provide an overview of the main approaches and architectures in the MRI reconstruction landscape. While many different methods are available, and those often incorporate elements from other techniques, we classify them into five main categories (Fig. 1): (i) neural networks trained end-to-end; (ii) approaches based on pre-trained denoisers, often called plug-and-play (PnP) methods; (iii) approaches based on generative models; (iv) un-trained methods; and (v) self-supervised methods. Additionally, we identify several recent architectures, e.g., transformers and dual-domain networks, that are used across these classes.

Interestingly, the top performing models in both the 2020 FastMRI challenge and the 2024 CMRxRecon challenge all used neural networks trained end-to-end [41, 55]. However, relative comparisons of algorithms depend on the problem setup and metrics. In addition, the field is progressing rapidly, so new and diverse benchmarking studies would be valuable.

Fig. 1: Overview of the DL-based MRI reconstruction landscape. While many different methods are available, and those often incorporate elements from other techniques, we classify them into five main categories


Neural networks trained end-to-end

Networks in this category are trained to map the acquired data, which is often noisy and degraded by undersampling artifacts, to a target ground-truth image. Their training hence commonly requires such paired data.

Let \(f_\theta\) be a neural network, which receives the measurements as input and produces clean, reconstructed images as output. Given a training set consisting of pairs of measurements and target images \(\{({\textbf {y}}_1,{\textbf {x}}_1),\ldots ,({\textbf {y}}_n,{\textbf {x}}_n)\}\), the network is trained by minimizing the loss between the prediction of the network and the target images, i.e.,

$$\begin{aligned} \mathcal L(\varvec{\theta }) = \sum _{i=1}^n \text {loss}( f_\theta ({\textbf {y}}_i), {\textbf {x}}_i ). \end{aligned}$$
(10)
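As an illustrative sketch of this supervised training procedure (Eq. (10)), the following PyTorch loop uses a toy two-layer CNN as a stand-in for \(f_\theta\) and random tensors as a stand-in for a real (measurement, target) dataset; the \(\ell _1\) loss shown is one common choice among several.

```python
import torch
import torch.nn as nn

# A small stand-in for f_theta; in practice a U-Net or unrolled network is used.
model = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 1, 3, padding=1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Dummy (measurement, target) pairs standing in for a real MRI dataset.
loader = [(torch.randn(4, 1, 64, 64), torch.randn(4, 1, 64, 64)) for _ in range(10)]

for y, x in loader:                                # pairs (y_i, x_i) in Eq. (10)
    x_hat = model(y)                               # prediction f_theta(y_i)
    loss = torch.nn.functional.l1_loss(x_hat, x)   # loss(f_theta(y_i), x_i)
    optimizer.zero_grad()
    loss.backward()                                # backpropagate through the network
    optimizer.step()                               # update theta
```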

Networks mapping a noisy measurement to a clean image

Many architectures have been developed for mapping an image with noise and undersampling artifacts to a clean image. Here we review some of the most well-known approaches. One of the early architectures is known as AUTOMAP (Automated Transform by Manifold Approximation) [22]. It utilizes a fully connected network followed by a convolutional network as the architecture \(f_\theta\) in Eq. (10). This architecture does not incorporate the known forward model directly. Instead, it maps data from k-space to image domain, and learns the forward model from the data. A representative AUTOMAP application is shown in Fig. 2.

Fig. 2: Elimination of hardware artifacts at low field MRI (6.5 mT) using AUTOMAP. Two slices from a 3D bSSFP acquisition (NA = 50) are shown. When reconstructed with the IFFT (a, b), a vertical artifact (red arrows) is present across slices. When the same raw data were reconstructed with AUTOMAP (c, d), the artifacts are eliminated. The error maps of each slice with respect to a reference scan (NA = 100) are shown for both the IFFT and AUTOMAP reconstructions. e–g: Uncorrupted k-space (NA = 50) reconstructed with AUTOMAP (e) and IFFT (f). Adapted and modified from Koonjoo, N. et al. Sci Rep 11, 8248 (2021). https://doi.org/10.1038/s41598-021-87482-7 [56]

Other architectures commonly utilize image-to-image neural networks and incorporate the forward map. This class of approaches maps a coarse reconstruction of the image, for example an image generated from the zero-filled k-space, to a target image. In the notation above, the architecture \(f_\theta\) consists of a linear map computing the zero-filled image followed by application of an image-to-image neural network.

The architecture of the image-to-image network is most commonly a convolutional neural network (CNN). One of the pioneering works in this area was by Wang et al. [57], which demonstrated that using a CNN as the image-to-image network enables substantial improvements in both speed and image quality. Another pioneering work was by Jin et al. [58], which showed that this approach is applicable to a wide range of linear forward models.

Additionally, numerous other studies have embedded UNET, ResNet or recurrent neural networks as the backbone architecture [26,27,28,29,30, 42,43,44,45]. More recently, vision transformers have been utilized instead of a CNN as the image-to-image network. Several studies demonstrated that transformers can provide improvements [59,60,61]. However, they are computationally more expensive. In practice, a UNET architecture is often chosen as a simple starting point, as it provides a good trade-off between image quality and computational performance. However, the choice of architecture depends on the problem at hand, the database size, the acceleration rate, and the desired reconstruction accuracy.

Architectures of unrolled networks

Currently, some of the best-performing neural networks are based on unrolled architectures [55]. These networks are obtained by unrolling an iterative algorithm such as gradient descent. The idea of unrolled networks was first introduced by [62], and several pioneering works applied it in the context of MRI reconstruction [24, 25, 63, 64]. These architectures iterate between two types of blocks: (i) data-consistency blocks, which can be computed using different algorithms [51], and (ii) blocks that remove noise and artifacts, which are commonly implemented by a deep neural network.

One of the early works in this context was by Hammernik et al. [24], who introduced the variational network (Fig. 3). This approach relies on a gradient descent algorithm to minimize the variational objective (2) where the regularizer \(\mathcal R\) is taken as the total-variation norm. In this case, the gradient of the regularizer in the gradient descent iterations (4) takes the form of a convolution. Thus, the gradient descent iterations can be interpreted as a neural network that applies data consistency operations (originating from the gradient of the least-squares loss) and the application of a convolutional network (originating from the gradient of the regularizer). Motivated by this observation, these so-called variational networks initialize \({\textbf {x}}_0={{\textbf {A}}}^H{\textbf {y}}\) and then perform the following computations with a neural network:

$$\begin{aligned} {\textbf {x}}_{t+1} = {\textbf {x}}_t- \eta _t{{\textbf {A}}}^H ({\textbf {A}}{\textbf {x}}_t- {\textbf {y}}) + \text {CNN}_t({\textbf {x}}_t). \end{aligned}$$
(11)

Here, both the parameter \(\eta _t\) and the parameters of the CNN are learnable. The original variational network [24] used a relatively shallow CNN, inspired by the parameterization provided by the total variation norm and its generalizations, specifically the fields-of-experts model. This yields a well-performing network with very few parameters. Later work has also shown that using a UNET within the unrolled network can improve the overall performance [25, 65].
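A minimal PyTorch sketch of such an unrolled network, implementing the iterations of Eq. (11), is given below. It assumes complex images stored as 2-channel real tensors and forward/adjoint operators A and AH passed in as callables; the small CNN blocks are illustrative placeholders.

```python
import torch
import torch.nn as nn

class UnrolledNet(nn.Module):
    """Sketch of Eq. (11): unrolled gradient steps with learned regularizers."""
    def __init__(self, num_iters=8):
        super().__init__()
        # one small CNN_t per unrolled iteration (placeholder architecture)
        self.cnns = nn.ModuleList([
            nn.Sequential(nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(32, 2, 3, padding=1))
            for _ in range(num_iters)])
        self.etas = nn.Parameter(0.5 * torch.ones(num_iters))  # learnable eta_t

    def forward(self, y, A, AH):
        x = AH(y)                                    # x_0 = A^H y
        for eta, cnn in zip(self.etas, self.cnns):
            x = x - eta * AH(A(x) - y) + cnn(x)      # Eq. (11)
        return x
```

Training proceeds exactly as in Eq. (10), with the whole unrolled computation differentiated end-to-end.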

Fig. 3: Variational network (VN) training procedure. The objective is learning a set of VN parameters during an offline training procedure. For this purpose, the current reconstruction of the VN is compared to an artifact-free reference using a similarity measure. This yields the reconstruction error which is propagated back to the VN to compute a new set of parameters. Reproduced with permission from Hammernik, K. et al. (2018) Magn. Reson. Med., 79: 3055–3071. https://doi.org/10.1002/mrm.26977 [24]

Other unrolled methods adopted alternative algorithms to minimize the variational objective (2) (see “Optimization algorithms”). These replaced the CNN in Eq. (11) with other image-to-image architectures; see [25, 63, 64, 66, 67] for a few examples. Furthermore, other architectures replaced the image-domain CNN with a k-space CNN, a dual-domain (k-space and image domain) network [68,69,70], or a transformer [66].

Computational considerations. The unrolling step requires multiple physical realizations of the CNN block during training, which translates into a high memory demand. This restricts the application of unrolled networks in higher-dimensional (e.g., 3D, 4D) settings. Programming solutions such as gradient checkpointing are now available to reduce the memory demand, at the expense of increased computational complexity. An alternative approach relies on deep equilibrium models [71, 72]. These models use a single CNN block and iterate the steps (6)–(7) and (8) until convergence to a fixed point, similar to PnP methods, and implement fixed-point iterations for back-propagation. They thus enable the evaluation of the forward and backward passes using a single physical CNN block, reducing the memory demand. The MOL method [73] also imposes a local Lipschitz constraint on the CNN block, which offers theoretical guarantees and robustness without sacrificing performance.

Pretrained plug-and-play (PnP) methods

Early CS methods relied on convex priors \({\mathcal {R}}({{\textbf {x}}})\), such as the total-variation norm. Plug-and-play (PnP) methods make it possible to solve inverse problems with pre-trained denoisers instead. Because the prior incorporates information only about the image, an additional benefit is that these methods work with arbitrary forward models.

ADMM and FISTA are iterative optimization methods for solving the regularized least-squares problem; they involve evaluations of the proximal operator \({\mathcal {D}}_{\beta }\) of the regularizer, defined in (9). One class of PnP methods replaces the proximal operator with a pre-trained denoiser; two well-known examples are PnP-ADMM and PnP-FISTA [74, 75]. While early methods relied on off-the-shelf image denoisers such as BM3D [76], pre-trained CNN denoisers are now considered to be more effective [74, 75, 77, 78]. Note that the proximal step in (9)

$$\begin{aligned} {\mathcal {D}}_{1/2\sigma ^2}(\overline{{{\textbf {x}}}})= & {} \arg \min _{{{\textbf {v}}}} \frac{1}{2\sigma ^2}\Vert {{\textbf {v}}}-\overline{{{\textbf {x}}}}\Vert ^2 + \lambda \, {\mathcal {R}}({{\textbf {v}}}) \end{aligned}$$
(12)

can be seen as the maximum a-posteriori (MAP) estimate of \(\overline{{{\textbf {x}}}}\) from its noise corrupted measurements

$$\begin{aligned} {{\textbf {v}}} = \overline{{{\textbf {x}}}} + \sigma \,{{\textbf {n}}}. \end{aligned}$$
(13)

Here, \({{\textbf {n}}} \sim {\mathcal {N}}(0,{{\textbf {I}}})\) is a sample from a standard Gaussian distribution, so that the noise \(\sigma \,{{\textbf {n}}}\) has variance \(\sigma ^2\). The CNN modules are hence pre-learned from training data as MAP denoisers, where noise-corrupted images are fed as input and the model is trained to yield noise-free images. During inference, steps (6)–(7) and (8) are iterated until the algorithm converges to a fixed point. Similar to CS methods, several iterations are often needed for convergence, which translates into higher computational complexity than the unrolled approaches described in “Architectures of unrolled networks”.
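A minimal sketch of this PnP-ADMM procedure is shown below, assuming callables A/AH for the forward model and its adjoint and a pre-trained denoiser network. The x-update (6) is solved approximately with conjugate gradients on the normal equations; note that we use the standard scaled-ADMM sign convention for the dual variable u.

```python
import torch

def conjugate_gradient(op, rhs, x, num_iters=10):
    """Solve op(x) = rhs for a positive-definite linear operator op."""
    r = rhs - op(x)
    p = r.clone()
    rs = torch.vdot(r.flatten(), r.flatten()).real
    for _ in range(num_iters):
        op_p = op(p)
        alpha = rs / torch.vdot(p.flatten(), op_p.flatten()).real
        x = x + alpha * p
        r = r - alpha * op_p
        rs_new = torch.vdot(r.flatten(), r.flatten()).real
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def pnp_admm(y, A, AH, denoiser, beta=1.0, num_iters=30):
    """PnP-ADMM sketch: steps (6)-(8), with the proximal step (7)
    replaced by a pre-trained denoiser."""
    x = AH(y)
    v, u = x.clone(), torch.zeros_like(x)
    for _ in range(num_iters):
        # step (6): (A^H A + beta I) x = A^H y + beta (v - u), solved with CG
        x = conjugate_gradient(lambda z: AH(A(z)) + beta * z,
                               AH(y) + beta * (v - u), x)
        v = denoiser(x + u)            # step (7): denoising replaces the prox
        u = u + (x - v)                # step (8): dual update
    return x
```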

Another PnP framework is known as regularization by denoising (RED) [79]. This framework is more general than the above because it does not rely on any specific optimization algorithm, i.e., it enables using methods other than ADMM and FISTA. Furthermore, it offers great flexibility in choosing the denoising algorithm, as it can incorporate almost any denoiser. Further information can be found in recent reviews of PnP methods [74, 75].

Generative priors

Another successful approach to DL-based MRI reconstruction is to learn an image prior parameterized by a generative neural network. Several major classes of generative methods have emerged, based on variational autoencoders [80, 81], Generative Adversarial Networks (GANs) [82,83,84,85,86,87,88], and, very recently, diffusion models [89,90,91,92,93,94,95]. Here we focus on the latter two, which have attracted substantial attention.

One of the major advantages of generative approaches for image reconstruction is that they are flexible with regard to changes of the forward model, and at the same time perform well for reconstructing high-quality images from undersampled data. Furthermore, their probabilistic nature provides measures for uncertainty quantification, which is highly important for clinical imaging [96,97,98].

GANs

GANs [82] are a framework for generative modeling. A GAN consists of two competing neural networks: a generator, which aims to produce data indistinguishable from a given dataset of real images, and a discriminator, whose role is to distinguish between the generator’s output and the real data. GANs are trained using an adversarial loss [82]; this process enables the generator to learn to generate high-quality realistic images. After training, the generator can be used either to generate images that look similar to those in the training set, or as a prior for image reconstruction.

In the context of MRI reconstruction, GANs have attracted substantial attention over the last few years [83,84,85,86,87,88]. For example, DAGAN (Deep De-Aliasing Generative Adversarial Networks) [83] was a pioneering work that proposed a conditional GAN with a refinement-learning stage, and used a loss function composed of an adversarial and a perceptual component. Mardani et al. [99] proposed a reconstruction framework where GANs were used for learning the low-dimensional manifold that underlies high-quality MR images. However, images generated by the generator are not necessarily consistent with the acquired measurements. To ensure such consistency, they included an affine projection operation, conducted by a layer placed between the generator and discriminator. Another approach for tackling this was proposed by Quan et al. [85], who introduced a novel cyclic loss in their GAN architecture to enforce data consistency. These methods, and many others [87, 100], showcased the potential of GANs to produce clinically viable MRI reconstructions.

Diffusion models

Diffusion models, a class of generative models that have garnered substantial attention in recent years, are making an impact in a variety of fields, including MRI reconstruction [89,90,91,92,93,94,95, 101]. These models operate by learning to reverse a diffusion process that gradually transforms random noise into structured images, and have shown a remarkable capability to generate high-quality, detailed images.

Diffusion models have been derived using different approaches [89], including discretized corruptions, e.g., denoising diffusion probabilistic models (DDPMs) [93], denoising score matching [102], and continuous formulations based on stochastic differential equations (SDEs) [103].

For a general probability density function p(x), these approaches approximate the score function, defined by \(\nabla _{x} \log p(x)\), using a neural network \(s_\theta (x)\). To do so, the network is used to approximate a series of conditional score functions, \(s_t(x_t) = \nabla _{x_t} \log p(x_t | x_{t+1})\), which guide the denoising process from pure noise, i.e., \(x_T\) drawn from a standard Gaussian distribution for some maximum iteration value T, to a clean sample \(x_0 \sim p(x)\). Once trained, these models can be used to sample unconditionally from the prior distribution by running the reverse diffusion process, and hence generate new samples.

In the context of inverse problems in general, and MRI reconstruction in particular, the diffusion process can be hijacked to approximately sample from the posterior distribution \(p(x|y)\) instead. One method involves conditioning on the k-space measurements y and applying Bayes’ rule to the series of score functions, i.e.,

$$\begin{aligned} \nabla {\log p(x_t|y,x_{t+1})} = \nabla {\log p(y|x_t)} + \nabla {\log p(x_t|x_{t+1})}. \end{aligned}$$
(14)

The second term, corresponding to the prior conditioned on the denoising process, is unchanged from the original diffusion model and can be learned by training on clean, fully sampled images. The first term, corresponding to the likelihood conditioned on the denoising process, can be approximated through various approaches [93, 94, 104]. In a naive approximation,

$$\begin{aligned} \nabla {\log p(y|x_t)} \approx A^H(Ax_t - y), \end{aligned}$$
(15)

given the MRI forward model.
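As a schematic sketch, the following annealed-Langevin-style step combines a learned prior score with the naive likelihood approximation of Eq. (15). The function name, step size, and guidance weight are ours, and the exact scalings and noise schedule depend on the specific diffusion formulation (DDPM, SDE, etc.).

```python
import torch

def guided_langevin_step(x_t, t, y, A, AH, score_net, eta=1e-4, lam=1.0):
    """One reverse step with measurement guidance (sketch of Eqs. (14)-(15)).

    score_net(x_t, t) approximates the prior score grad log p(x_t); the
    likelihood gradient uses the naive approximation A^H(A x_t - y).
    """
    prior_score = score_net(x_t, t)            # learned term in Eq. (14)
    likelihood_grad = AH(A(x_t) - y)           # Eq. (15); note it enters with
    posterior_score = prior_score - lam * likelihood_grad  # a minus sign here
    noise = torch.randn_like(x_t)              # injected Gaussian noise
    return x_t + eta * posterior_score + (2 * eta) ** 0.5 * noise
```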

A growing body of work demonstrates that diffusion models work well for accelerated MRI and exhibit flexibility when handling various sampling patterns [94,95,96, 98, 105, 106]. For example, in a pioneering work, Jalal et al. [107] demonstrated that training a score-based generative model using Langevin dynamics, without making any assumptions on the measurement system, could yield competitive reconstruction results for both in-distribution and out-of-distribution data. Chung et al. [96] demonstrated that score-based diffusion models trained solely on magnitude images can be utilized for reconstructing complex-valued data. Luo et al. [97] described a comprehensive approach using data-driven Markov chains for MRI reconstruction which not only facilitates efficient image reconstruction across variable sampling schemes, but also enables the generation of uncertainty maps.

The flexibility afforded by explicitly decoupling the image prior (which is learned with diffusion models) from the statistical measurement model has also enabled other extensions. These include incorporating errors into the forward model, e.g., due to motion [108] and field inhomogeneity [109], and incorporating multiple image contrasts [110].

Un-trained neural networks

Un-trained methods are DL models that do not rely on training data apart from hyper-parameter tuning. Instead of conventional training on large datasets, these methods are typically based on fitting a randomly initialized neural network to a specific measurement. Here we discuss two types of methods: un-trained neural networks based on CNNs, and methods based on coordinate-wise implicit neural networks.

Un-trained CNNs for single image recovery

CNNs can be used as an image prior by fitting a randomly initialized CNN with gradient descent to a measurement. This approach, termed the deep image prior (DIP), was introduced in a pioneering work by Ulyanov et al. [111]. The optimization problem is formulated as

$$\begin{aligned} {{\textbf {x}}} = \arg \min _{\theta } \Vert {{\textbf {A}}}{{\textbf {x}}}-{{\textbf {y}}}\Vert ^2 \,\,\text{ such } \text{ that }\,\, {{\textbf {x}}} = {\mathcal {G}}_{\theta }(z), \end{aligned}$$
(16)

where \({\mathcal {G}}_{\theta }\) is a CNN generator whose input \({{\textbf {z}}}\) is a noise vector drawn from some noise distribution. The optimization is performed using gradient descent or Adam [112], starting with a random initialization of the network weights, and early stopping is used for regularization. The image quality first improves with the number of iterations, and then degrades as the network begins to fit the measurement noise in \({{\textbf {y}}}\). This behavior is caused by the implicit bias of CNNs towards natural images: when trained with gradient descent, CNNs fit smooth images before noise, as formalized in [113].
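A minimal sketch of DIP-based reconstruction (Eq. (16)) is shown below; the toy generator, fixed noise input, and iteration budget are illustrative choices, with early stopping serving as the only regularizer.

```python
import torch
import torch.nn as nn

def deep_image_prior(y, A, num_iters=1000, lr=1e-2):
    """DIP sketch (Eq. (16)): fit a randomly initialized generator to one scan.

    A is the forward model (callable). Early stopping via num_iters is the
    only regularization; the CNN below is a toy stand-in for G_theta.
    """
    cnn = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(64, 2, 3, padding=1))   # toy G_theta
    z = torch.randn(1, 32, 128, 128)                      # fixed random input
    optimizer = torch.optim.Adam(cnn.parameters(), lr=lr)
    for _ in range(num_iters):
        x = cnn(z)                                        # x = G_theta(z)
        loss = torch.sum(torch.abs(A(x) - y) ** 2)        # ||A x - y||^2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return cnn(z).detach()
```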

Un-trained networks perform very well for denoising [111, 114] and compressive sensing. They can provably denoise smooth signals [113] and provably reconstruct undersampled smooth images [115]. They also work quite well for accelerated MRI, where they provide significant improvement over sparsity-based methods in the 2D setting [116].

One key benefit of un-trained networks is that they do not need training data. However, this benefit comes at the expense of performance: the images produced by DIP methods are commonly inferior to those from the pre-trained networks discussed above. In addition, DIP often suffers from longer run times compared to the unrolled and direct-inversion approaches, because of the need for Adam or gradient descent optimization during reconstruction.

Un-trained CNNs for joint recovery of multiple images

Recently, the DIP framework was extended to dynamic imaging applications [117, 118] where the images in a time series are modeled as the output of a generator

$$\begin{aligned} \gamma _{t}({{\textbf {r}}}) = \mathcal {G}_{\theta }\left[ {\textbf {z}}_{t}\right] . \end{aligned}$$
(17)

Unlike the fixed noisy input used in the original DIP work [111], here \({\textbf {z}}_{t}\) are low dimensional latent vectors at a specific time point t. \(\mathcal {G}_{\theta }\) is a deep CNN generator, whose weights \(\theta\) are independent of t. For example, in a free-breathing cardiac MRI, the images in the time series at a specific time t can be viewed as non-linear functions of cardiac and respiratory phases captured by \({{\textbf {z}}}_t\). This model (17) can be viewed as a non-linear mapping/lifting from a low-dimensional subspace \({{\textbf {Z}}}\) to the image space. The low-dimensional nature of the latent vectors enables the exploitation of the non-local redundancies between images at different time points, thus facilitating the fusion of information between them as in [119, 120].

The network parameters \(\theta\) and the latent variables \({\textbf {z}}\) are jointly optimized by minimizing the cost function

$$\begin{aligned} {\mathcal {C}}({{\textbf {z}}},\theta ) =\sum _{t=1}^N\Vert {{\textbf {A}}}_{t}\,{\mathcal {G}}_{\theta }[{{\textbf {z}}}_{t}] - {{\textbf {y}}}_{t}\Vert ^2 + \lambda _1 \underbrace{\Vert \nabla _{{{\textbf {z}}}} {\mathcal {G}}_{\theta }\Vert ^2}_{\begin{array}{c} \text {network}\\ \text {regularization} \end{array}} + {{\lambda }}_2 \underbrace{\mathcal {R}({\textbf {z}})}_{\begin{array}{c} \text {latent}\\ \text {regularization} \end{array}}. \end{aligned}$$
(18)

The network regularization is an \(\ell _2\) penalty on the weights \(\theta\), which was shown to minimize the need for early stopping and provide improved performance. The latent vector regularization term involves a smoothness regularization to capitalize on the temporal smoothness of the images in the time series.

The above approach can also be extended to 3D applications involving the joint alignment and recovery of data from different slices, where data acquired in separate acquisitions may differ in cardiac/respiratory motion. Different sets of latent vectors are used for different slices to account for differences in breathing patterns and cardiac motion. In this case, a Kullback–Leibler divergence term is used to encourage the latent vectors of all the slices to follow a zero-mean Gaussian distribution, thus facilitating the alignment of data from different slices.

Coordinate-based networks

Coordinate-based neural representations, also known as implicit neural representations (NeRF-type networks), have recently emerged as an efficient way to represent and work with images, 3D shapes, and other signals [121]. They are commonly used for representing scenes and performing view synthesis in vision [122, 123]. To represent a 2D or 3D object, these models map a coordinate input (e.g., (x, y) coordinates for 2D or (x, y, z) coordinates for 3D) to a pixel value, for example a real number for a gray-scale image or two real numbers for a complex-valued image.

Coordinate-based networks can be used in an analogous fashion to un-trained CNNs to reconstruct an image [124,125,126,127,128,129]. Specifically, they can replace the CNN in an un-trained network and be fitted to the measurement data. Networks with Fourier-feature inputs (such as NeRF [122], SIREN [130], and Fourier-feature networks [125]) impose a smoothness prior similar to that of the un-trained CNNs discussed in the previous section.

Networks with Fourier-feature inputs take a coordinate input (e.g., an (x, y) coordinate \({{\textbf {z}}}\)) and map it to a feature representation via the map \([\sin ({\textbf {C}}_0 {{\textbf {z}}}),\cos ({\textbf {C}}_0 {{\textbf {z}}}) ] \in \mathbb {R}^{2m}\), where \({\textbf {C}}_0\) is a matrix of randomly initialized parameters, which can be fixed or trainable. Those features are then mapped to an output by a standard MLP with trainable parameters. If two coordinates are close, their Fourier features are close; how close is controlled by the scale (variance) of the initialization.
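The following PyTorch sketch illustrates such a coordinate network with random Fourier features; the layer sizes and the scale parameter are illustrative.

```python
import torch
import torch.nn as nn

class FourierFeatureNet(nn.Module):
    """Coordinate network: (x, y) -> pixel value, via random Fourier features."""
    def __init__(self, num_features=128, scale=10.0, hidden=256, out_dim=2):
        super().__init__()
        # C_0: random projection; its scale controls the smoothness prior
        self.register_buffer("C0", scale * torch.randn(num_features, 2))
        self.mlp = nn.Sequential(
            nn.Linear(2 * num_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim))          # e.g., real/imaginary parts

    def forward(self, coords):                   # coords: (N, 2) in [0, 1]^2
        proj = coords @ self.C0.T                # C_0 z
        features = torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)
        return self.mlp(features)
```

To reconstruct an image, the network is fitted to the measurements by evaluating it on a grid of coordinates and passing the result through the forward model, exactly as in the DIP objective (16).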

If used as an image prior, coordinate networks with Fourier features perform worse than un-trained CNNs in terms of image quality. In the context of MRI reconstruction, however, coordinate networks have been shown to be useful for representing high-dimensional objects such as 3D volumes and scenarios with motion. For example, [124] used coordinate networks to perform cardiac MRI reconstruction by fitting a network to the k-space data. This can be computationally efficient since the undersampled k-space data is sparse. [131] also used coordinate networks for free-breathing cardiac MRI reconstruction, by fitting a coordinate network in the image domain.

Self-supervised methods

Neural networks, such as the end-to-end networks discussed in “Neural networks trained end-to-end”, are usually trained in a supervised manner (see Eq. (10)). This requires pairs of measurement and target (ground-truth) images. However, in practice, such pairs cannot always be acquired, e.g., due to scan time constraints, signal decay effects along echo trains, or physiological motion. Therefore, self-supervised methods are attracting increased research interest. These methods make it possible to train networks without target or ground-truth data by either making assumptions on the measurements or using additional noisy or partial measurements. A plethora of approaches has been developed, including methods for learning from under-sampled data [132, 133], unpaired data [134], or limited-resolution data [135]. Here we describe several approaches that are architecture-agnostic. For recent reviews on this topic see [134, 136, 137].

Learning of algorithms based on Stein’s Unbiased Risk Estimate (SURE)

We start with a method that is based on assumptions on the noise distribution, called Stein’s Unbiased Risk Estimate (SURE) [138]. We consider the estimation of \({{\textbf {x}}}\), denoted by \(\widehat{{{\textbf {x}}}}\), from its noisy measurements \({\textbf {v}}= {{\textbf {x}}} + {{\textbf {n}}}\). Here \({{\textbf {n}}}\) is zero-mean Gaussian noise with variance \(\sigma ^2\). In practice, the estimate \(\widehat{{{\textbf {x}}}}\) is derived from the noisy measurements \({\textbf {v}}\) using a deep network as \(\widehat{{{\textbf {x}}}} = f_{\theta }({\textbf {v}})\). When the noiseless reference image \({{\textbf {x}}}\) is available, the true mean-square error (MSE), denoted by

$$\begin{aligned} \text {MSE} = {\mathbb {E}}_{{{\textbf {x}}}} \,\Vert \widehat{{{\textbf {x}}}} - {{\textbf {x}}}\Vert ^2 \end{aligned}$$
(19)

can be used.

By contrast, the SURE [138] approach uses the loss function

$$\begin{aligned} \text {SURE}(f_{\theta }({\textbf {v}}),{\textbf {v}}) = \Vert f_{\theta }({\textbf {v}})- {\textbf {v}}\Vert ^2_2 + 2 \sigma ^2 \nabla _{{\textbf {v}}} \cdot f_\theta ({\textbf {v}}) - N\sigma ^2, \end{aligned}$$
(20)

which is an unbiased estimate of (19). Note that the expression in (20) does not depend on the noise-free images \({{\textbf {x}}}\); it only depends on the noisy images \({\textbf {v}}\) and the network parameters \(\theta\). In (20), \(\nabla _{{\textbf {v}}} \cdot f_\theta ({\textbf {v}})\) denotes the network divergence, which is often estimated using Monte-Carlo simulations [139]. Several researchers have adapted SURE as a loss function for the unsupervised training of deep image denoisers [140, 141] and demonstrated performance approaching that of supervised methods.
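The following sketch illustrates a Monte-Carlo estimate of the SURE loss in Eq. (20), using a single-probe divergence estimator; the function name and probe step size are ours.

```python
import torch

def sure_loss(net, v, sigma, eps=1e-3):
    """Monte-Carlo estimate of the SURE loss in Eq. (20).

    The divergence term is estimated with a single random probe b:
    div f(v) ~ b^T (f(v + eps*b) - f(v)) / eps.
    """
    f_v = net(v)
    n = v.numel()                                     # N in Eq. (20)
    b = torch.randn_like(v)                           # random probe vector
    div = torch.sum(b * (net(v + eps * b) - f_v)) / eps
    return torch.sum((f_v - v) ** 2) + 2 * sigma ** 2 * div - n * sigma ** 2
```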

The SURE approach was extended to inverse problems with a rank-deficient measurement operator, known as the generalized SURE (GSURE) [142]. GSURE provides an unbiased estimate of the projected MSE, which is the expected error of the projections onto the range space of the measurement operator. The GSURE approach was recently used for inverse problems in [140]. The experiments in [140] showed that the GSURE-based projected MSE is a poor approximation of the actual MSE in the highly undersampled setting. To improve performance, the authors trained the denoisers at each iteration of a message-passing algorithm in a layer-by-layer fashion using classical SURE, an approach termed LDAMP-SURE [140]. This approach assumes that the residual aliasing errors at each iteration are Gaussian random noise. As this assumption is violated in many inverse problems, the performance of this layer-by-layer training approach does not match that of supervised methods.

The ENSURE framework circumvents the poor approximation of the true MSE by GSURE by considering different sampling operators for different images. Similar to classical SURE metrics [142, 143], the ENSURE loss metric has a data consistency term and a divergence term. The data consistency term in ENSURE is the sum of the weighted projected losses [142] from multiple subjects; the weighting depends on the class of sampling operators. When the different sampling patterns from different subjects fully cover k-space, the ENSURE metric is an unbiased estimate of the true image-domain MSE and is hence a better loss function than the projected SURE [142]. A comparison of the above methods shows that the ENSURE approach can provide performance comparable to that of supervised training.

Self-supervised DL based on Noise2noise

Noise2noise [144] is a well-established framework, which constructs a self-supervised loss based on independent noisy measurements of the same object. Recall that for single-coil accelerated MRI, the forward map is \({\textbf {A}}= {\textbf {M}}{\textbf {F}}\), where \({\textbf {M}}\) is an undersampling mask and \({\textbf {F}}\) is the Fourier transform.

Suppose we are given two measurements \({\textbf {y}}= {\textbf {M}}{\textbf {F}}{\textbf {x}}\) and \({\textbf {y}}' = {\textbf {M}}'{\textbf {F}}{\textbf {x}}\), where \({\textbf {M}}\) and \({\textbf {M}}'\) are two different random undersampling masks. From these measurements, we can construct the self-supervised loss

$$\begin{aligned} \ell _{\text {SS}}( f_{\varvec{\theta }}({\textbf {y}}), {\textbf {y}}') = \left\| {\textbf {M}}'{\textbf {F}}f_{\varvec{\theta }}({\textbf {y}}) - {\textbf {y}}'\right\| ^2. \end{aligned}$$
(21)

It can be shown that in expectation over the random measurements, a minimizer of the self-supervised loss is also a minimizer of the expectation of the supervised loss (see Prop. 2 in [145]). Thus, with enough training examples, such self-supervised training can approach the performance of supervised training [145].

One notable method that has implemented this approach successfully for MRI reconstruction is Self-Supervised Learning via Data Undersampling (SSDU) [132]. This method partitions the available k-space measurements into two disjoint sets; the first set is used in the data consistency units of the unrolled network, i.e., for the forward pass, and the other one is used for computing the loss, i.e., for supervision. SSDU can hence be trained using under-sampled data alone. In their work, Yaman et al. [132] demonstrated that SSDU achieved comparable performance to fully supervised learning methods while offering practical advantages in real-world MRI applications.
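The following sketch illustrates this SSDU-style partitioning and loss for a single sample. It is schematic: the network is assumed to map a subsampled k-space (and its mask) to an image, and the centered-FFT helper, float 0/1 mask convention, and split ratio are our illustrative choices.

```python
import torch

def fft2c(x):
    """Centered 2D FFT of a complex image tensor (assumed convention)."""
    return torch.fft.fftshift(
        torch.fft.fft2(torch.fft.ifftshift(x, dim=(-2, -1)), norm="ortho"),
        dim=(-2, -1))

def ssdu_loss(y, mask, net, rho=0.4):
    """SSDU-style self-supervision for one sample.

    Splits the sampled k-space locations into two disjoint sets: Theta
    (network input / data consistency) and Lambda (held out for the loss).
    y: measured k-space; mask: binary sampling mask (float 0/1).
    """
    rand = torch.rand_like(mask)
    loss_mask = mask * (rand < rho).float()      # Lambda: supervision set
    input_mask = mask - loss_mask                # Theta: disjoint input set
    x_hat = net(input_mask * y, input_mask)      # reconstruct from the subset
    diff = loss_mask * (fft2c(x_hat) - y)
    ref = loss_mask * y
    # normalized mixed l1/l2 loss, as used in the original SSDU work
    return diff.norm(2) / ref.norm(2) + diff.norm(1) / ref.norm(1)
```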

Recently, Millard and Chiew [146] introduced a general theoretical framework that extends Noiser2Noise [147] and also explains SSDU. Unlike the SSDU formulation, where one set is recovered from the other, they applied two subsampling masks to the data. They proposed a weighted \(\ell _2\) loss, computed in k-space, with a weighting that compensates for the sampling and sampling-partitioning densities. They derived the framework analytically and showed that when the weighting matrix W is rank-deficient and fulfils certain conditions, the method reduces to SSDU. They also showed analytically that SSDU with an \(\ell _2\) k-space loss approximates fully sampled reconstruction in expectation. It is worth mentioning that their analysis was done for an \(\ell _2\) k-space loss, while the original SSDU method was trained with a mixed \(\ell _1/\ell _2\) loss.

Self-supervised DL using k-space bands

The self-supervised methods described above focused on learning from under-sampled data acquired with variable-density or parallel-imaging schemes. Although such datasets have undersampling artifacts, they effectively constitute high-resolution data, because the sampling masks commonly cover the entire k-space extent (note that under-sampling creates artifacts but does not necessarily reduce the resolution). However, the acquisition of high-resolution data can be challenging. In dynamic MRI, for example, there is often a trade-off between the spatial and temporal resolutions, which requires acquisition compromises.

Recently, the k-band framework was proposed for self-supervised learning from partial, limited-resolution data [135]. This framework is based on the acquisition of k-space bands, where each band acquires data with high resolution in the MRI readout dimension and limited resolution in the phase encoding (PE) dimension. The authors suggested acquiring different bands from different subjects, and randomizing the bands’ orientation across subjects; fundamentally, this randomization serves to expose the network to all k-space areas across the training iterations (Fig. 4). Thus, even though the network does not get a full k-space from any single subject, it can learn connections across all the k-space regions. To enable self-supervised learning from limited-resolution data without limiting the resolution during inference, the authors introduced an optimization method dubbed stochastic gradient descent (SGD) over k-space subsets.

In this framework, the loss is computed in k-space and formulated by

$$\begin{aligned} \ell _{k\_band} = \Vert {\textbf {W}}{\textbf {B}}({\textbf {F}}f_{\theta }({\textbf {y}}) - {\textbf {F}}{\textbf {x}}) \Vert _1 \end{aligned}$$
(22)

where \({\textbf {B}}\in \{{\textbf {B}}_i\}_{i=1,...,180}\) is a binary band sampling operator that samples a band with angle i, and \({\textbf {W}}\) is a loss weighting mask

$$\begin{aligned} {\textbf {W}}= 180 \left( \sum _{i=1}^{180} {\textbf {B}}_i\right) ^{-1}. \end{aligned}$$
(23)

This loss-weighting compensates for the over-exposure of the network to low-frequency k-space data and enhances learning in the k-space periphery. This is beneficial because in the k-band acquisition setting, the center of k-space is included in all bands (Fig. 4), unlike the periphery. The authors showed analytically that when this loss-weighting mask is applied, the self-supervised training process stochastically approximates fully supervised training in expectation. They demonstrated that learning from limited-resolution data can result in performance comparable to supervised and self-supervised methods trained on high-resolution data, offering a practical solution for cases where such data are unavailable.
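A minimal NumPy sketch of the loss-weighting mask in Eq. (23) is shown below; the function name and input convention (a stack of binary band masks) are ours.

```python
import numpy as np

def kband_loss_weights(band_masks):
    """Compute the loss-weighting mask W of Eq. (23).

    band_masks: (180, H, W) array of binary masks B_i, one per band
    orientation. W down-weights the k-space center, which every band
    covers, and boosts the rarely covered periphery.
    """
    coverage = band_masks.sum(axis=0)           # how many bands hit each location
    W = np.zeros_like(coverage, dtype=float)
    covered = coverage > 0                      # avoid division by zero
    W[covered] = 180.0 / coverage[covered]      # Eq. (23), elementwise inverse
    return W
```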

Fig. 4: Example of the input training data for three DL reconstruction methods. The fully-supervised MoDL method [25] receives var-dens sampled data as input and uses the entire k-space for supervision. The self-supervised SSDU method [132] receives var-dens data as input, splits it into two subsets, and uses one set for data consistency and the other for supervision. In this example, the var-dens data were sampled from parallel-imaging (equispaced) acquired data, as in [132]. The k-band method [135] receives var-dens sampled data from a k-space band, and uses data from the whole band for supervision, without any supervision outside the band. Different bands are acquired from different subjects, with random orientations. At inference, the input to all three methods is var-dens data from the entire k-space, similar to that shown here for MoDL

Loss-weighting

Several recent studies have independently proposed applying spatially varying weighting to k-space loss functions and demonstrated that such weighting can improve the performance of self-supervised DL methods [133, 135]. Interestingly, this general concept emerged even though the studies analyzed different sampling schemes and loss functions. For example, Millard and Chiew [133] analyzed SSDU with variable-density sampling masks that cover the entire k-space area, and an \(\ell _2\) k-space loss function. In contrast, the authors of k-band [135] explored training on band-limited data using an \(\ell _1\) k-space loss function. Despite these differences, these studies arrived at similar conclusions: they derived loss-weighting masks that weigh down the loss in the center of k-space and enhance it in the periphery. These masks hence inhibit the learning of low-frequency data and facilitate learning of high-frequency details, so that eventually all frequencies are weighted equally.

A related concept was proposed by Huang et al., who developed a neural implicit k-space representation model for cardiac MRI [124]. To account for the large variations in k-space magnitudes, they proposed a log transform that compresses high-magnitude k-space data, making their magnitudes similar to those of low-magnitude data. However, because such a non-linear transform has an undesired effect on the noise distribution, the authors proposed an approximation using a linear function [148]. Altogether, this is an alternative approach for balancing the contributions of different parts of k-space.

Recent architectures

Transformers and dual-domain networks

In addition to the training methods described above, much progress has also been made in the development of advanced architectures. For example, two architectures that recently garnered substantial attention are transformers and dual-domain networks. Transformers [149, 150] have powerful computational capabilities due to their use of an attention mechanism [151] that makes it possible to weigh the importance of different parts of the input data and capture long-range dependencies. Transformers first made a substantial impact in the field of natural language processing [152, 153] and then became highly influential in computer vision [149].

In the context of MRI reconstruction, recent studies demonstrated that transformers offer excellent performance and ability to deliver improved structural and textural fidelity. For example, Korkmaz et al. [154] developed an unsupervised MRI reconstruction method based on a generative vision transformer. Their method utilizes cross-attention transformer blocks, which receive both global and local latent variables as input and progressively map them to MR images with increasing spatial resolution. This style-generative architecture enhances representational learning and improves model invertibility. Feng et al. [155] introduced the \(T^2Net\) for simultaneous MRI reconstruction and super-resolution. This network has two branches dedicated to these two tasks, and incorporates a task transformer module to facilitate effective feature sharing between them. Guo et al. [61] introduced the ReconFormer, an architecture that leverages recurrent pyramid transformer layers and scale-wise attention mechanisms. It effectively captures multi-scale information and deep feature correlations, leading to efficient, high-quality image reconstruction and computational efficiency.

Another emerging type of architecture is known as dual-domain networks, which commonly integrate information from the image and k-space domains [69, 156,157,158,159]. This approach, exemplified by MD-Recon-Net [158], leverages the complementary strengths of these two domains to enhance reconstruction quality. A study by Souza et al. [156] demonstrated the effectiveness of such networks in multi-channel MRI reconstructions. Singh et al. [159] demonstrated that layers utilizing joint learning of image and frequency domain features can directly replace standard convolutional layers. This is useful for numerous tasks, including image reconstruction, motion correction and denoising.

Transformers and dual-domain networks have recently been integrated, leading to state-of-the-art architectures. For example, Zhao et al. [160] introduced SwinGAN, a dual-domain Swin Transformer-based GAN. This network combines frequency-domain and image-domain generators, both utilizing Swin Transformer backbones. This design allows for effective capture of long-distance dependencies in MR images. SwinGAN also features a contextual image relative position encoder, which enhances its ability to capture local information. Wang et al. introduced DCT-Net, a dual-domain transformer network for MRI reconstruction [70], which integrates image and frequency domain information through its cross-attention and fusion-attention blocks. DCT-Net is designed to enhance MRI reconstruction performance, particularly under low sampling rates, by leveraging the complementary strengths of both domains. In summary, these recent architectures offer high computational power to improve image reconstruction quality.

Recent architectures incorporating diffusion models

Recently, some of the architectures mentioned earlier have been integrated with diffusion models, yielding state-of-the-art methods. For example, Korkmaz et al. [161] introduced the Self-Supervised Diffusion Reconstruction (SSDiffRecon) method, which casts a diffusion model as an unrolled network with interleaved cross-attention transformer blocks and physics-driven data-consistency steps. Furthermore, Zhao et al. [162] introduced DiffGan, an architecture that combines a local vision transformer with a diffusion model, which mitigates computational challenges in training generative models.

DL for acquisition optimization

The previous section discussed various reconstruction approaches for retaining data integrity and accuracy when k-space is sub-sampled. This section highlights two complementary DL-based interventions at the acquisition stage that can further improve performance and enable additional acceleration. First, we explore methods that optimize the k-space sampling trajectories in tandem with the reconstruction. Next, we describe recent advances in harnessing DL to design and refine MRI pulse sequences.

Optimizing k-space trajectories

The computational design of sampling patterns has a long history in MRI. Generally, two types of approaches have been taken. Algorithm-agnostic methods, e.g. [163,164,165,166,167], consider specific image properties (e.g., the Cramér-Rao bound or image support) and optimize the sampling pattern to improve the measurement diversity for that class. Algorithm-dependent methods, on the other hand, e.g. [168,169,170,171], optimize the sampling pattern assuming specific reconstruction algorithms. These are typically CS algorithms, which employ regularizers such as TV, wavelet-domain sparsity, or pre-trained diffusion models [171].

The main challenge with the above computational approaches is their high computational complexity. In particular, algorithm-dependent schemes need to solve the CS problem for each image in the dataset to evaluate the loss for a specific sampling pattern. The design of the sampling pattern thus involves a nested optimization strategy: the sampling pattern is optimized in an outer loop, while image recovery is performed in an inner loop to evaluate the cost associated with the sampling pattern.

DL provides an opportunity to speed up the computational design, because DL inference schemes enable rapid evaluation of the loss for each sampling pattern. This enables a joint strategy that simultaneously optimizes the acquisition scheme and the reconstruction algorithm. Early DL-based joint optimization schemes solved for a binary sampling mask [172, 173]. The PILOT method, for example, solves for a hardware-constrained k-space trajectory [172]. The LOUPE method, on the other hand, learns the optimal sampling density in tandem with a reconstruction network [173]. It was first developed for 2D Cartesian imaging [173] and later extended to non-Cartesian sampling [174]. Other studies have focused on 3D Cartesian sampling with a variational reconstruction network [175].
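
As an illustration of this family of approaches, the sketch below learns a probabilistic Cartesian mask jointly with a simple reconstruction CNN, in the spirit of LOUPE [173]; the sigmoid relaxation, layer sizes, and training details are simplifications, and the renormalization to a fixed acceleration rate used in the original method is omitted.

```python
import torch
import torch.nn as nn

class LearnableMask(nn.Module):
    """LOUPE-flavored mask: each k-space location has a learnable sampling
    probability, relaxed to a differentiable soft mask so that the mask and
    the reconstruction network receive gradients from one loss."""
    def __init__(self, shape=(64, 64), slope=5.0):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(shape))
        self.slope = slope

    def forward(self):
        probs = torch.sigmoid(self.logits)              # sampling density
        u = torch.rand_like(probs)                      # one random draw
        return torch.sigmoid(self.slope * (probs - u))  # soft binary mask

mask_layer = LearnableMask()
recon_net = nn.Sequential(nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(32, 2, 3, padding=1))
opt = torch.optim.Adam(list(mask_layer.parameters()) +
                       list(recon_net.parameters()), lr=1e-3)

# one joint training step on stand-in images:
x = torch.randn(8, 1, 64, 64, dtype=torch.complex64)
k = torch.fft.fft2(x) * mask_layer()        # simulated masked acquisition
zf = torch.fft.ifft2(k)                     # zero-filled reconstruction
inp = torch.cat([zf.real, zf.imag], dim=1)
tgt = torch.cat([x.real, x.imag], dim=1)
loss = torch.mean((recon_net(inp) - tgt) ** 2)
opt.zero_grad(); loss.backward(); opt.step()
```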

More recent work represents the sampling locations \(\phi\) as continuous variables and jointly solves for them and for the parameters of the DL algorithms. These methods consider a forward model \({{\textbf {A}}}_{\phi }\), where \(\phi\) denotes the sampling locations. This forward model may be represented either by using an analytical Fourier transform [176] or a non-uniform Fourier transform [177]. We denote the reconstruction algorithm (which can be unrolled, direct inversion, or plug-and-play) by

$$\begin{aligned} \hat{{{\textbf {x}}}} = {\mathcal {M}}_{\theta ,\phi }({{\textbf {y}}}), \end{aligned}$$
(24)

where \(\theta\) denotes the parameters of the reconstruction algorithm and \(\phi\) are the sampling locations corresponding to the forward model. Joint optimization schemes, e.g. [176, 177], are designed to optimize the sampling locations \(\phi\) and the network parameters \(\theta\) in tandem, i.e.,

$$\begin{aligned} \{\theta ^*,\phi ^*\} = \arg \min _{ \theta ,\phi } \sum _{i=1}^{N} \Vert {\mathcal {M}}_{\theta ,\phi } \left( {{\textbf {A}}}_{\phi }({{\textbf {x}}}_i)\right) -{{\textbf {x}}}_i \Vert _2^2. \end{aligned}$$
(25)

Several methods have been developed within this framework. For example, J-MODL focuses on a model-based reconstruction and utilizes an unrolled network [176]. In a different work, Wang et al. [177] parameterized trajectories with quadratic B-spline kernels, and performed the optimization under penalties describing realistic MRI hardware constraints, e.g. the slew rate and gradient amplitude. This work was later extended to a generalized Stochastic optimization framework for 3D NOn-Cartesian samPling trajectorY (SNOPY) [178], which can accommodate several optimization objectives. Chaithya and Ciuciu [179] introduced the PROJeCTOR framework, which enables joint learning of non-Cartesian trajectories and reconstruction networks by using a projected gradient descent algorithm. Alkan et al. [180] formulated joint sampling and reconstruction optimization via variational information maximization, using an encoder to represent non-uniform sampling and an unrolled neural network as the decoder. Xie et al. [181] introduced the PUERT method for learning probabilistic sampling patterns along with an interpretable reconstruction method; their learning module incorporated a dynamic gradient estimation strategy. Finally, Zou et al. [182] demonstrated that joint optimization can reduce the bias and uncertainty of pharmacokinetic parameter estimation in dynamic contrast enhanced MRI, and hence contribute to higher diagnostic value. Altogether, these methods have shown significant benefits from jointly optimizing the sampling pattern and the reconstruction algorithm.
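
The structure of Eq. (25) can be illustrated with a toy 1D PyTorch sketch: the continuous sampling locations \(\phi\) and the reconstruction parameters \(\theta\) live in one computational graph and are updated by the same optimizer. The analytic Fourier forward model \({{\textbf {A}}}_{\phi }\) and the small direct-inversion network below are stand-ins, not implementations of the cited methods.

```python
import torch

N, M = 64, 24                                     # image size, #samples
phi = torch.nn.Parameter(torch.rand(M) * N)       # continuous k-space locations
net = torch.nn.Sequential(torch.nn.Linear(2 * M, 128), torch.nn.ReLU(),
                          torch.nn.Linear(128, N))
opt = torch.optim.Adam([{"params": [phi], "lr": 1e-2},
                        {"params": net.parameters(), "lr": 1e-3}])

for step in range(200):
    x = torch.randn(8, N)                         # stand-in 1D images
    n = torch.arange(N, dtype=torch.float32)
    # analytic Fourier forward model, differentiable in phi
    E = torch.exp(-2j * torch.pi * phi[:, None] * n[None, :] / N)
    y = x.to(torch.complex64) @ E.T               # measurements (8, M)
    inp = torch.cat([y.real, y.imag], dim=1)      # feed Re/Im to network
    loss = torch.mean((net(inp) - x) ** 2)        # the Eq. (25) objective
    opt.zero_grad(); loss.backward(); opt.step()
```

Practical methods replace the dense Fourier matrix with a (non-uniform) FFT and add hardware penalties on \(\phi\), but the gradient flow is the same.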

Pulse sequence design

The previous section focused on accelerating MRI scans via k-space sub-sampling. The complementary element of this effort is the optimization of the remaining pulse sequence parameters, namely the set of radio-frequency (RF) pulse powers, shapes, and durations, to enable the shortest possible scan time while retaining sufficient contrast, SNR, and consistency with conventional (and lengthy) alternatives.

Pulse sequence design was traditionally hand-crafted by MR experts, who combined strong intuition and an understanding of spin physics with mathematical solutions of the Bloch equations. While a remarkable number of contrast mechanisms and imaging schedules have been developed since the invention of MRI, the reliance on analytically solvable differential equations severely limits our ability to reach a globally optimized schedule and reduce the scan time. Recent developments in DL architectures and computational frameworks have created new opportunities for the automatic and efficient optimization of rapid acquisition protocols.

Zhou et al. [183, 184] introduced the representation of the Bloch equations as a computational graph. By treating each of the acquisition parameters as a neural network node weight, an efficient gradient-descent-based optimization was realized, where simulated signal trajectories were fed into the network, enabling the automatic generation of pulse sequences. The resulting protocols were characterized by non-intuitive gradient waveforms, in which continuous off-resonant excitation was applied while the receive channel was simultaneously recorded. This approach yielded ultra-short scan times for 1D \(\hbox {T}_1/\hbox {T}_2\) mapping. Extending this to 2D imaging, Lee et al. [185] used automatic differentiation to optimize the Cramér-Rao Lower Bound (CRLB) of multiple-echo spin echo \(\hbox {T}_2\) mapping, driven equilibrium single pulse observation of \(\hbox {T}_1\) (DESPOT1) mapping, and the MRF IR-FISP sequence.
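
The computational-graph idea can be illustrated with a few lines of PyTorch: a spoiled gradient-echo Bloch recursion is written with differentiable operations, so a flip-angle schedule can be optimized by gradient descent against a target signal trajectory. All parameter values and the target below are illustrative, and the model ignores T2 effects for brevity.

```python
import torch

TR, T1 = 10e-3, 1.0                          # seconds (illustrative)
E1 = torch.exp(torch.tensor(-TR / T1))       # T1 recovery factor per TR

def simulate(flips):
    mz, signal = torch.tensor(1.0), []
    for a in flips:
        signal.append(mz * torch.sin(a))     # transverse signal after pulse
        mz = mz * torch.cos(a)               # longitudinal loss from pulse
        mz = 1.0 + (mz - 1.0) * E1           # T1 recovery over one TR
    return torch.stack(signal)

flips = torch.nn.Parameter(torch.full((64,), 0.1))   # schedule (radians)
target = torch.full((64,), 0.05)                     # desired flat signal
opt = torch.optim.Adam([flips], lr=1e-2)
for _ in range(500):
    loss = torch.mean((simulate(flips) - target) ** 2)
    opt.zero_grad(); loss.backward(); opt.step()
```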

Loktyushin et al. [186] developed a supervised learning framework termed MR-Zero where a target contrast of interest is used for learning the optimal set of RF events, the gradient moment, and the delay times (Fig. 5). One important feature of this approach is the use of a task-driven cost function that provides the user with the flexibility to prioritize the characteristics required from the output protocol, such as high data fidelity, short scan time, or the specific absorption rate (SAR) limits. In a later study, the same group used this approach to optimize the refocusing flip angles and minimize \(\hbox {T}_2\)-induced blurring in accelerated spin echo sequences [187].

Fig. 5: Automated discovery of MRI acquisition protocols using supervised learning. A differentiable MR scanner utilizes the Bloch equations for in-silico signal generation and the later reconstruction of the target contrast of interest from real, acquired data. Reproduced from Loktyushin et al. Magn. Reson. Med. 2021; 86: 709-724 [186]

In the molecular MRI field, an end-to-end DL-based framework was developed for the discovery of rapid, quantitative chemical exchange saturation transfer (CEST) and semisolid magnetization transfer acquisition and reconstruction protocols [188]. The system was based on a computational-graph representation of the Bloch-McConnell analytical solution, which receives the molecular imaging scenario of interest as input and outputs an optimized set of acquisition parameters and a corresponding reconstruction network that translates the raw data into quantitative parameter maps. In vivo experiments showed that data could be acquired in merely 35 s and parameter maps reconstructed in less than 1 s. The use of recurrent neural networks and training over a wide range of saturation pulse frequency offsets has further increased the robustness of this approach to \(\hbox {B}_{{0}}\) and \(\hbox {B}_{{1}}\) inhomogeneity [189].

All these approaches exploit DL-based strategies to optimize and derive novel acquisition routes offline. Recently, a different optimization paradigm was suggested in which the acquisition parameters are modified and adapted on the fly, during data acquisition [190]. By combining a Bayesian framework, CRLB calculation, and model-based reconstruction, the acquisition parameters for a series of images can be optimized in real time based on the previous image history. This concept has demonstrated up to a 3.3-fold acceleration of multi-echo sequences in human subjects and molecular imaging phantoms.

As an intermediate conclusion, while the concept of machine-learning-based pulse sequence design is relatively young and not heavily explored, the first reports suggest a promising new avenue for optimizing image contrast, shortening the scan time, and finding new acquisition schemes beyond human intuition.

Advanced techniques and applications

In this section we discuss DL methods for quantitative MRI and dynamic MRI.

DL methods for quantitative MRI

The goal of quantitative MRI is to extract one or more tissue parameter maps from a series of qualitative images [191]:

$$\begin{aligned} {{\textbf {I}}}_m={\varvec{\Phi }}_m(T_{param}){\varvec{\rho }} \end{aligned}$$
(26)

where \(\hbox {I}_m\) denotes the contrast-weighted images for m=1,...,M acquisitions, \(\rho\) denotes the spin density, \(\hbox {T}_{{param}}\) denotes the tissue parameters (\(\hbox {T}_1\), \(\hbox {T}_2\), etc.), and \(\Phi _m\) is the biophysical function connecting the acquisition parameters with the resulting contrast-weighted images. The mapping of tissue properties enables de-biasing of imaging protocols and harmonization of the final diagnosis across sites, vendors, and physicians. It thus provides sensitive and standardized tools for reproducible interpretation of MRI-based information [192]. The classical approach to MR property mapping mandates repeated acquisitions, in which all protocol parameters are held fixed and only a single parameter is gradually varied (e.g., the flip angle or the repetition time across the M acquisitions). The resulting long scan time hinders the widespread use of quantitative MRI in clinical settings [193]. Moreover, the reconstruction of the acquired raw data series demands a parameter-fitting procedure that is computationally intensive and slow.
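
As a concrete instance of Eq. (26), consider multi-echo \(\hbox {T}_2\) mapping, where \(\Phi _m=\exp (-TE_m/T_2)\). The classical per-voxel fit then reduces to log-linear least squares over the M echoes, as in this NumPy sketch (all values illustrative):

```python
import numpy as np

TE = np.array([10e-3, 30e-3, 50e-3, 70e-3])      # echo times (s)
rho_true, T2_true = 0.9, 60e-3
I = rho_true * np.exp(-TE / T2_true)             # noiseless signals I_m

# log I_m = log(rho) - TE_m / T2, i.e. linear in the unknowns
A = np.stack([np.ones_like(TE), -TE], axis=1)
coef, *_ = np.linalg.lstsq(A, np.log(I), rcond=None)
rho_hat, T2_hat = np.exp(coef[0]), 1.0 / coef[1]
print(f"T2 = {T2_hat * 1e3:.1f} ms")             # ~60 ms
```

Repeating such a fit for every voxel, and for every parameter of interest, is what makes the classical pipeline slow.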

The development of powerful DL architectures such as CNNs, UNets, GANs, ResNets and recurrent neural networks has been leveraged to accelerate and enhance the performance of quantitative MRI [194]. To accelerate relaxometry studies, Liu et al. [195] developed a model-augmented neural network that receives a series of incoherently-sampled multi-echo images and uses a CNN to reconstruct the \(\hbox {T}_2\) parameter maps. The supervised learning was guided by a parameter-space loss, which compares the reconstructed \(\hbox {T}_2\) maps to the ground truth reference, and a k-space loss. The latter was designed to ensure that the physical-model-based synthetic undersampled k-space measurements matched the originally acquired k-space information. In a later work, the same group developed a model-guided self-supervised DL framework for rapid \(\hbox {T}_1/\hbox {T}_2\) mapping [196], to obviate the need for fully sampled training references.

While many DL-based quantitative mapping strategies focus on k-space sub-sampling, a further acceleration potential lies in reducing the number of contrast-weighted images acquired. In a recent work, Li et al. [197] trained a deep residual CNN to receive just three k-space under-sampled contrast-weighted images and output the corresponding \(\hbox {T}_{1\rho }\) and \(\hbox {T}_2\) parametric maps (which are particularly useful for the study of osteoarthritis).

Magnetic resonance fingerprinting (MRF), first reported in a 2013 Nature paper [198] and increasingly studied since then, constitutes a paradigm shift in MRI-based tissue characterization. Unlike traditional relaxometry studies, MRF starts with the acquisition of tens or hundreds of images, using a pseudo-random acquisition pattern that accommodates a series of short repetition times, small flip angles, and heavily under-sampled k-space data (e.g., via a single variable-density spiral trajectory). Although each of the resulting raw images is extremely noisy, the temporal evolution of the signal at each pixel constitutes a unique fingerprint. By comparing the experimental trajectory to a Bloch-equation-derived dictionary of simulated signals, the inverse problem can be solved to uncover the associated parameter maps (\(\hbox {T}_{{param}}\)).

Fig. 6: A demonstration of two MRI quantification strategies/architectures. a Deep learning reconstruction of quantitative magnetic resonance fingerprinting (MRF) information. A fully connected neural network is trained using simulated signal trajectories. During inference, it receives a series of raw MRF images pixel-wise (gray-scale images, left), as well as auxiliary maps (color, top left), yielding quantitative parameter maps (top right). b A further acceleration in quantitative MRI scan time can be achieved by training a generative adversarial network (GAN) using a smaller subset of raw input data to yield the same quantitative output maps. Reproduced and modified from Weigand-Whittier et al. [199]

While the resulting acquisition times are remarkably short (e.g., <13 s [198, 200]), the quantitative image reconstruction step (via pattern matching) may take hours, because the similarity between each acquired signal trajectory and all possible dictionary entries needs to be calculated.
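
The cost of this matching step is easy to appreciate from a sketch: every voxel trajectory must be correlated with every dictionary entry. The brute-force NumPy version below, with illustrative stand-in sizes and random data, is exactly the computation that scales poorly.

```python
import numpy as np

n_entries, n_timepoints, n_voxels = 50_000, 500, 1024
D = np.random.randn(n_entries, n_timepoints).astype(np.float32)
D /= np.linalg.norm(D, axis=1, keepdims=True)            # normalized entries
params = np.random.rand(n_entries, 2).astype(np.float32) # (T1, T2) per entry

voxels = np.random.randn(n_voxels, n_timepoints).astype(np.float32)
voxels /= np.linalg.norm(voxels, axis=1, keepdims=True)

corr = np.abs(voxels @ D.T)      # (n_voxels, n_entries): the costly step
best = corr.argmax(axis=1)       # best-matching dictionary entry per voxel
t1_t2_maps = params[best]        # flattened (T1, T2) parameter maps
```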

In recent years, several DL-based strategies have been suggested to overcome this challenge. Cohen et al. [201] trained a fully connected neural network using synthetic signal dictionaries to reconstruct MRF data in less than 100 ms. To take advantage of the inherent dependencies between adjacent image pixels, Balsiger et al. [202] designed a spatiotemporal CNN where the time-evolution dimension is the third dimension of the CNN patch kernel. In-vivo human brain validation studies using this approach demonstrated improved performance compared to alternative MRF networks [201, 203]. To accommodate 3D imaging, Gomez et al. [204] combined fully connected reconstruction networks with radial and spiral readout trajectories, and achieved whole-brain reconstruction in less than 7 min (compared to \(>1.5\) h using traditional reconstruction).
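
A minimal sketch in the spirit of the dictionary-trained fully connected approach follows: an MLP maps each voxel's fingerprint directly to (T1, T2), replacing exhaustive matching with a single forward pass. The layer sizes, training data, and loop are stand-ins rather than the configurations of the cited works.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(500, 300), nn.Tanh(),
                    nn.Linear(300, 300), nn.Tanh(),
                    nn.Linear(300, 2))                   # outputs (T1, T2)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

trajectories = torch.randn(50_000, 500)                  # simulated dictionary
labels = torch.rand(50_000, 2)                           # matching (T1, T2)
for _ in range(100):                                     # toy training loop
    idx = torch.randint(0, 50_000, (256,))
    loss = torch.mean((net(trajectories[idx]) - labels[idx]) ** 2)
    opt.zero_grad(); loss.backward(); opt.step()

maps = net(torch.randn(4096, 500))    # inference: all voxels in one pass
```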

Another MRF-associated bottleneck relates to the time required to generate the synthetic signal dictionary, which increases exponentially with the number of simulated parameters [205]. Even when high-end computer clusters are used for this task, the computation time may reach hours or days for complex multi-pool imaging [206, 207]. Recently, this challenge was addressed by training a fully connected neural network on a variety of dictionaries to learn the nonlinear relations embedded in the physical model. The resulting system enabled the rapid generation of simulated signals for various protocols and imaging scenarios [208]. NN-based simulators can be further combined with reconstruction networks to provide a unified rapid method for MRF analysis [209]. A different approach to circumvent the need for exhaustive dictionary generation involves the direct synthesis of multi-contrast images (e.g., \(\hbox {T}_1\)-weighted, \(\hbox {T}_2\)-weighted, and FLAIR) from raw MRF data. While synthetic images can be derived from quantitative MRF data by applying the forward model with the desired acquisition parameters [210, 211], Wang et al. [212] showed that a dictionary-free conditional GAN (trained on MRF raw data and paired ground-truth weighted images) can perform the same task much faster. For cases where full quantitative information is required, a different work demonstrated that multi-parameter maps can still be extracted with GANs, even when merely 30% of the acquired MRF data is used [199] (Fig. 6).

Dynamic MRI

Deep learning has become a transformative force in the realm of dynamic MRI, particularly in addressing the challenges related to limited acquisitions and motion correction, which constitute substantial hurdles in clinical imaging [9, 45, 213, 214]. DL excels at learning signal evolution [157, 215], a critical factor when aiming to accurately visualize and interpret dynamic changes in the body.

Motion-resolved reconstruction

Motion-resolved algorithms can effectively learn spatio-temporal correlations and reconstruct images from highly undersampled sequential data [216, 217]. These methods have primarily been developed in the context of cardiac MRI [64, 120, 213, 216, 218, 219]. For instance, supervised unrolled algorithms have been used to recover cardiac cine MRI from breath-held acquisitions using 4D (3D spatial + time) convolutions [64, 213]. In other studies, architectures included unrolled algorithms that combine manifold [120] or low-rank priors [216, 218], and joint learning of motion estimation and segmentation in cardiac MRI [219]. From a clinical perspective, DL has been found useful for measuring myocardial displacement [220], noninvasive diagnosis of myocardial ischemia [221], and evaluation of cardiac function in pediatric imaging [222].

To tackle the scarcity of training data, unsupervised implicit learning approaches have recently been introduced for dynamic MRI [217, 223] (for more information see “Un-trained CNNs for joint recovery of multiple images” and “Coordinate-based networks”). These methods were also extended to multi-slice dynamic MRI [224], where the dynamic data \({{\textbf {x}}}_i(t)\) of each slice i are acquired sequentially at different time points. This model has been generalized to recover a pseudo-3D reconstruction by modeling the data as \({{\textbf {x}}}_i(t)={\mathcal {G}}_{\theta }\Big ({{\textbf {z}}}_i(t)\Big )\), where the latent variables \({{\textbf {z}}}_i(t)\) are allowed to vary across slices.

Motion-compensated reconstruction

The development of DL techniques for motion estimation and correction is a highly active research field, as DL can accurately detect and compensate for both rigid and non-rigid motion artifacts, which leads to more diagnostically valuable images. A thorough review of motion estimation and correction techniques is beyond the scope of this manuscript. Here we highlight some of the main applications, and more information can be found in recent reviews [44, 45, 225,226,227].

One of the main applications where DL is highly effective for motion correction is brain MRI, which is characterized by rigid-body motion [226,227,228,229]. One of the early works in this field, by Johnson and Drangova [230], proposed conditional GANs to infer clean images from motion-corrupted data. More recent techniques include co-optimization for jointly estimating the motion parameters and the reconstructed image [134, 228], methods for detection and correction of motion-corrupted k-space lines [229, 231], and the use of score-based generative models [232].

DL approaches are also highly useful for tackling non-rigid, irregular motion. Applications include imaging of the body trunk [233], fetal MRI [234, 235], abdominal MRI [236, 237], and MR angiography [238]. Unsupervised implicit learning methods that recover the deformable motion fields at each time point have also been introduced and found to be effective in motion-compensated recovery [239]. DL is also making strides in the field of real-time interventional MRI. Here, the rapid processing capabilities of DL algorithms enable real-time feedback and guidance during medical procedures, thus enhancing both the safety and efficacy of interventions [240,241,242].

Multi-task pipelines

The fundamental goal motivating the acquisition of diagnostic-quality MR images is to extract clinically useful insights to further clinical care or to interrogate disease activity. Consequently, efficient, high-quality image acquisition is just the first step (typically referred to as upstream DL) in the imaging workflow, which is followed by image analysis and insight extraction (typically referred to as downstream DL) [243]. In many applications, these upstream and downstream processes are disconnected, leading to insufficient insights as to whether a novel MRI acceleration and reconstruction technique can reliably produce the requisite diagnostic information [244]. As a result, there is a substantial need to combine the upstream and downstream processes to ultimately harness advances in MRI physics, hardware, and DL for end-to-end acquisition-to-analysis workflows.

Conventional and DL-based reconstruction techniques can potentially be combined with downstream tasks of clinical utility to guide useful model development. Specifically, MRI reconstruction workflows can be combined with three different downstream tasks that use whole images as inputs: (i) image classification, which produces a binary (yes/no) determination of the presence of one or more disorders; (ii) abnormality detection, which localizes one or more disorders in the image via bounding boxes; and (iii) image segmentation, which classifies each voxel as belonging to a particular tissue or disease class. The DL sub-field of multi-task learning can learn multiple tasks simultaneously with positive task transfer, where learning one task improves the performance of other tasks.

In the context of accelerated MRI, combining the upstream task of MRI reconstruction with the downstream tasks of classification, detection, or segmentation can improve performance on all tasks. It also contributes to optimizing reconstruction techniques with clinically informed metrics. One of the greatest challenges in doing so, however, is the lack of available datasets that can merge both sets of tasks. The fastMRI raw-data dataset was recently supplemented with the fastMRI+ dataset that includes classification and detection bounding box annotations for knee and brain abnormalities at the slice level [245]. Such datasets can enable the design of end-to-end techniques to optimize reconstruction, subject to high performance on lesion detection [246]. Similarly, even beyond end-to-end methods, such abnormality labels can be used to design clinical task-specific undersampling trajectories [247].

Beyond fastMRI+, the SKM-TEA dataset includes raw k-space data as well as classification labels, detection bounding boxes, segmentation masks, and quantitative T2 relaxation time maps [35]. The original work profiled how different reconstruction approaches combined with different segmentation tools affected a common musculoskeletal biomarker of cartilage T2 relaxation time. Despite differences in the performance of the individual DL blocks, the overall impact on regional cartilage T2 values was small, a surprising finding that has been replicated for cartilage morphology and T2 tasks [248, 249]. Recent work has evaluated new approaches that combine generic pre-training tasks, such as image reconstruction, with fine-tuning for different clinically relevant downstream tasks [250]. This approach achieves high performance in image acceleration as well as segmentation. Similarly, the K2S challenge at MICCAI 2022 combined knee MRI reconstruction with bone/cartilage segmentation and bone shape analysis [40]. Yet again, there were only weak correlations between the metrics of reconstruction and segmentation quality, with one of the best segmentation models producing highly artifactual reconstructions but high-quality segmentations.

Joint estimation of sensitivity maps and reconstruction

The power of DL has also been harnessed for improving parallel multi-coil MRI, where the coil sensitivity maps must be estimated and incorporated in the image reconstruction process. Many DL methods utilize the popular ESPIRiT algorithm [251] for computing the sensitivity maps prior to the reconstruction process. However, joint estimation of the sensitivity maps and reconstructed data could contribute to improving image quality, as indicated in different studies, first with classical approaches [252, 253] and later using DL [254,255,256,257]. DL frameworks were hence recently developed for joint estimation of the sensitivity maps and reconstruction data. For example, the well-known E2E-VarNet method [254] included a module for sensitivity maps estimation and incorporated it into a larger unrolled network, trained end-to-end. A similar approach was taken by Jun et al., who proposed the IC-Net [255]. Luo et al. suggested using a deep image prior [256], and Zhang et al. proposed a zero-shot learning method which is trained solely on data from a specific subject and jointly estimates the sensitivity maps and temporal data [258]. Most recently, Hu et al. [257] introduced the self-supervised SPICER method, which enables joint reconstruction and sensitivity maps estimation with training only on noisy data.

Other applications

The powerful capabilities of DL have also been exploited for other computational tasks in the MRI workflow aside from image reconstruction. A detailed review of these applications is beyond the scope of this manuscript, which focuses on MRI reconstruction. Specific examples include the joint recovery of multi-contrast MRI data [110, 259], the synthesis of missing contrasts or synthesis of quantitative maps based on anatomical data [259, 260], super-resolution [261, 262], B0 estimation and off-resonance correction [263], enhancement of low-field MRI data, where the low SNR degrades image quality [56, 264], and automated scan prescription [265].

Datasets and software

Datasets

The availability of public datasets and open-source code repositories has played a crucial role in the rapid development of DL techniques [266]. In the MRI reconstruction field, several major databases such as fastMRI [32], SKM-TEA [35], mridata.org [33], and Calgary-Campinas [34] catalyzed development by making available large amounts of raw k-space data, which are useful for developing and benchmarking methods [51, 267]. Other resources provide valuable data for specific applications, including MR imaging of speech production [268], cardiovascular imaging [269], and low-field MRI [270]. Many other MRI datasets are also available on the web, but these were typically designed for downstream, non-reconstruction tasks and hence do not always contain raw k-space data. Examples include the Human Connectome Project [271], IXI [272], BRaTS [273, 274], ADNI [275], UK Biobank [276] and OASIS [277].

Open-source software

The adoption of open-source frameworks has significantly accelerated the development of DL methods, as they provide researchers with robust, flexible platforms for developing new algorithms. The two most prominent general-purpose DL frameworks are PyTorch [278] and TensorFlow [279], which offer extensive libraries that facilitate the design, training, and deployment of DL models.

Several open-source software frameworks have been developed specifically for MRI. These offer useful computational tools for handling raw k-space data, implementing algorithms, and computing MRI-related metrics. For example, BART (Berkeley Advanced Reconstruction Toolbox) [280] is a large and highly popular software package. It enables efficient data processing and contains implementations of various iterative reconstruction algorithms. Recent versions of BART also contain general-purpose tools that are highly useful for the development of DL reconstruction models, e.g., an automatic differentiation framework compatible with complex-valued data, and implementations of well-established DL models [281]. Gadgetron [282] is another popular package, which offers extensive tools for image reconstruction, data management, and implementations of iterative solvers. A different package is MRIReco.jl [283], which is written entirely in Julia and utilizes the ISMRMRD file format. This package offers many building blocks for data management, simulations, and image reconstruction. Another example is SigPy [284], which offers a set of operators, blocks, and algorithms that are highly suitable for iterative reconstruction. Unlike other toolboxes, SigPy is written entirely in Python and can hence be integrated easily with frameworks such as PyTorch.
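
As a brief usage example, a typical SigPy workflow combines ESPIRiT coil sensitivity calibration with an iterative reconstruction app. The snippet below follows SigPy's documented app interface with random stand-in data; the regularization weight is illustrative and defaults may differ across versions.

```python
import numpy as np
import sigpy.mri as mr

# stand-in multi-coil k-space: (coils, ky, kx)
ksp = (np.random.randn(8, 256, 256) +
       1j * np.random.randn(8, 256, 256)).astype(np.complex64)

mps = mr.app.EspiritCalib(ksp).run()                      # sensitivity maps
img = mr.app.L1WaveletRecon(ksp, mps, lamda=0.005).run()  # l1-wavelet recon
```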

While the above packages focus on data management and image reconstruction, SNOPY [178] is a framework that offers practical tools for optimizing k-space sampling trajectories. These tools include a differentiable MRI system model and loss functions corresponding to image quality, hardware constraints (e.g., maximum slew rate and gradient strength), and peripheral nerve stimulation (PNS). Finally, a different framework is Yarra (https://cai2r.net/resources/yarra/), which provides tools for the automated collection of raw k-space data and can thereby facilitate the creation of new datasets.

Another key area is pulse sequence development. One of the main challenges in reproducing complex pulse sequences across different sites and scanners is the dedicated prototyping environment and software used by each vendor. Pulseq [285] is a rapid, hardware-independent pulse sequence prototyping framework, which enables intuitive high-level programming of acquisition protocols in Matlab or Python. It enables easy deployment across different field strengths and hardware. Importantly, the spin physics associated with the specific acquisition protocol compiled at the scanner can be accurately simulated as part of the Pulseq framework or its derivatives [286]. Techniques for data harmonization can help mitigate challenges in transferring protocols across different systems [287].

In the context of quantitative imaging, the qMRLab software was developed to facilitate reproducibility across MRI systems [288]. It consists of practical tools for analyzing and processing quantitative MRI data acquired by different vendors. The user-friendly interface and modular design of qMRLab enable researchers to easily implement and share quantitative MRI techniques.

Many other open-source codes can be found on online platforms such as GitHub, Papers with Code (https://paperswithcode.com/), and the two dedicated websites of the ISMRM: MR-Hub (https://ismrm.github.io/mrhub/), which hosts toolboxes, and MR-Pub (https://ismrm.github.io/mrpub/), which hosts git repositories published together with articles.

The toolboxes and platforms described above are essential for enhancing reproducible research in the MRI community. The development of a unified data format, the ISMRMRD [289], can also facilitate easy translation of datasets and methods across sites and research groups. The recent development of techniques for federated learning [290, 291] is also useful for training algorithms collaboratively without sharing the data; this can help address data-privacy issues.

Robustness challenges

In this section we discuss challenges related to developing, evaluating, and benchmarking DL reconstruction methods, and suggest approaches for mitigating them.

Distribution shifts

In deep learning, generalization refers to the ability of a trained model to accurately reconstruct images that it has never seen before, particularly when these new images differ substantially from the training data. Good generalization of DL-based MRI reconstruction models is critical for clinical workflows. However, achieving good generalization is challenging because MRI data can vary substantially across many factors, e.g. MRI hardware, vendor-specific scanning protocols, patient populations, and the anatomical regions being imaged. This variability can lead to a model that performs well on data from one source but poorly on data from another, a phenomenon known as domain shift or distribution shift. Here we review some of the main challenges related to this issue.

Domain shifts have been studied from several perspectives. Johnson et al. [292] analyzed the robustness of the models submitted to the 2019 fastMRI challenge to distribution shifts, e.g. small structural changes, addition of noise to k-space data, and changes in the number of coils. The study found that many of these models were sensitive to the distribution shifts. Darestani et al. [293] evaluated the robustness of DL reconstruction methods with regard to out-of-distribution data, and found that both trained and untrained networks were affected by distribution shifts. Avidan et al. studied another type of distribution shift [294], related to sampling; methods trained on specific sampling schemes may not generalize well to other schemes. Altogether, distribution shifts can lead to substantial performance drops in MRI and can hence be a major limiting factor in practice.

Potential mitigation strategies. In cases where only a few training examples from a target domain are available, pre-training a network on other data and fine-tuning it to the target domain can improve performance [295,296,297]. In the challenging case where no target data are available for fine-tuning, test-time training, which involves adapting the model to a single test sample at inference, is a viable performance-enhancing alternative [298]. Another strategy that can help mitigate the performance drop due to distribution shifts is to train on broad and diverse data [299].
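
A minimal sketch of the fine-tuning strategy is shown below: the early layers of a (stand-in) pre-trained model are frozen, and the remaining weights are adapted on a few target-domain pairs with a small learning rate. The network, data, and commented-out checkpoint path are all illustrative placeholders.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(32, 2, 3, padding=1))
# model.load_state_dict(torch.load("pretrained_source_domain.pt"))  # hypothetical
for p in model[0].parameters():                  # freeze the earliest layer
    p.requires_grad = False
opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad),
                       lr=1e-5)                  # small fine-tuning rate
for _ in range(20):                              # few target-domain examples
    y = torch.randn(4, 2, 64, 64)                # zero-filled inputs (stand-in)
    x = torch.randn(4, 2, 64, 64)                # target images (stand-in)
    loss = nn.functional.mse_loss(model(y), x)
    opt.zero_grad(); loss.backward(); opt.step()
```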

A growing body of work has pointed out the advantages of diffusion models in robustifying networks to distribution shifts. As described above (“Diffusion models”), diffusion models decouple the image prior from the statistical measurement model. They can hence generalize easily to various anatomies and sampling patterns [94,95,96, 98, 105, 106]. Another technique to robustify networks to shifts in sampling patterns is to provide the network with the undersampling mask, and train the network to generalize to various sampling masks [294]. In addition, generative networks can be trained solely on magnitude images and applied to complex-valued data with different sub-sampling patterns and out-of-distribution data [98].

Bias and “data crimes”

In the field of AI, the term bias is often associated with gender-related or population-related bias. This can occur when models are trained on datasets that lack balanced representation of genders or ethnicities, or that contain only a narrow set of medical conditions [300]. Such training can lead to algorithmic failure for under-served populations [301, 302] or rare conditions [300].

However, when solving inverse problems, bias can also arise from a naive, seemingly appropriate use of open-access datasets. One of the primary challenges in the development of DL reconstruction methods is the need for raw k-space data, which are scarce and difficult to acquire due to the high cost of MRI scans and the long scan duration. While several databases offer such data, e.g. [32,33,34,35, 250, 268], many other databases offer non-raw MRI data. These are generally designed for downstream tasks, e.g., segmentation and classification, and hence are frequently preprocessed. Nevertheless, due to their high availability, researchers sometimes download them and use them to synthesize k-space data for training DL reconstruction models. This has been referred to as “off-label” data use, because those datasets are used for a different task than the one they were designed for [303].

Surprisingly, training DL reconstruction algorithms on “off-label” data can give rise to good-looking results; nevertheless, these results are often biased and overly optimistic, i.e. they are “too good to be true” [303]. This is due to subtle preprocessing steps which, although imperceptible to the naked eye, impact algorithm performance. Common preprocessing steps include k-space zero-padding, coil combination, and JPEG data compression. The authors of [303] demonstrated that CS, dictionary learning, and DL algorithms are all sensitive to these preprocessing steps and yield biased results. This underscores the potential risk, as these algorithms are intended for clinical use. The findings also demonstrated that DL algorithms trained on preprocessed data may not generalize well to real-world clinical data, and could potentially eliminate crucial clinical details [303].

Another concerning finding is that popular error metrics, e.g., the normalized root mean square error (NRMSE) and SSIM, can be blind to the preprocessing [303] and miss the bias. This is because these metrics compare the reconstructed image to a reference derived from the same underlying preprocessed data, so they cannot measure the true image quality. DL algorithms may therefore achieve “good” NRMSE and SSIM scores even when their performance is poor. This makes head-to-head comparison of results across papers very difficult, because some papers report experiments with raw k-space data while others report results for preprocessed data, where the metrics tend to be better. To raise awareness of this phenomenon, the authors dubbed the publication of such misleading results “data crimes”.

Potential mitigation strategies. In the context of gender-related or population-related biases, the best strategy is to train on large, diverse datasets [300,301,302]. In the context of bias that stems from training on preprocessed data, the optimal strategy is to train solely on raw k-space data. When raw, zero-padded k-space data are available, and such data have not been subject to any other preprocessing steps, one simple correction step is to crop k-space back to its original size. However, this scenario is quite rare. In practice, other preprocessing steps are often applied, e.g. coil combination (e.g. using a root-sum-of-squares operation) and JPEG compression. These steps are irreversible, hence there is no simple technique to remedy them, and the bias cannot be easily prevented.

Techniques for data synthesis. When raw k-space data are unavailable, one possible strategy is to train on synthetic data [304,305,306]. When magnitude data are available, synthetic complex-valued data can be obtained, for example, by adding a synthetic phase to the magnitude images and training a generative model to learn priors of complex-valued images [306]. Multi-coil data can be synthesized by multiplying the phase-enhanced magnitude data with sensitivity maps [304]. Another approach is to leverage the Bloch equations to simulate realistic data [305, 307]. However, it should be emphasized that the synthesis of complex-valued data does not guarantee good performance for real-world data; i.e., it cannot automatically prevent the bias described above. Models trained on synthetic data must therefore be tested using real-world, raw k-space data.
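
The synthesis pipeline described above can be sketched in a few lines of NumPy: a magnitude image receives a smooth synthetic phase, is multiplied by simulated coil sensitivities, and is Fourier transformed into multi-coil k-space. The phase and sensitivity models below are illustrative stand-ins; as noted in the text, models trained on such data must still be validated on real raw k-space.

```python
import numpy as np

H = W = 128
mag = np.random.rand(H, W)                                 # magnitude image
yy, xx = np.mgrid[0:H, 0:W] / H
img = mag * np.exp(1j * 2 * np.pi * (0.3 * yy**2 + 0.2 * xx))  # smooth phase

ncoils = 8
cy = np.random.rand(ncoils, 1, 1)                          # coil centers
cx = np.random.rand(ncoils, 1, 1)
mps = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / 0.5)     # Gaussian coils
ksp = np.fft.fftshift(
    np.fft.fft2(np.fft.ifftshift(mps * img, axes=(-2, -1)), axes=(-2, -1)),
    axes=(-2, -1))                                         # multi-coil k-space
```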

Hallucinations

The term hallucinations refers to the generation of false, realistic-looking features that are not present in the actual data. This can arise from the use of inaccurate priors [308], e.g., when there is a distribution shift between the training and test data, as described above. Strikingly, the team that organized the second fastMRI challenge found that many of the top-performing models produced hallucinations, and that these hallucinations were not captured by image quality metrics such as SSIM [37]. They also noticed that hallucinations could morph abnormal structures into seemingly normal ones.

Several studies have also highlighted and explored manifestations of hallucinations. For example, Cohen et al. (2018) [309] discussed how distribution matching losses in medical image translation can lead to hallucinated features. Bhadra et al. [308] reported hallucinations in the context of tomographic image reconstruction and introduced the concept of hallucination maps to identify and understand the impact of prior information in regularized reconstruction methods. Gottschling et al. [310] explored hallucinations from a theoretical perspective, and highlighted problematic scenarios of in-distribution hallucinations. The issue of hallucinations is a critical concern in DL-based image reconstruction, as it can lead to misleading results and false diagnoses.

Potential mitigation strategies. At present, there is a pressing need for techniques to mitigate hallucinations. Currently, the best strategy is to have radiology experts evaluate the reconstructed images, in the hope that they will be able to detect hallucinations. However, more research and development are required.

Adversarial attacks and instabilities

Neural networks for image classification are known to be sensitive to adversarial attacks, i.e. small, imperceptible, adversarially chosen perturbations. Generally, such attacks are not a major concern for clinical MRI systems, as those systems are typically closed and require password access. However, adversarial attacks have been utilized for analyzing the robustness of DL reconstruction methods, and this topic has attracted substantial attention [309,310,311,312]. For example, Antun et al. [312] provided simulations demonstrating that small, adversarially selected perturbations in undersampled measurements can result in severe reconstruction artifacts. They also showed that there is a trade-off between robustness and performance, and that classical sparsity-based reconstruction methods are also sensitive to adversarially selected perturbations.

Robustness to adversarial attacks is still an open issue. For example, at present there is no conclusive evidence indicating that DL reconstruction methods are more sensitive than classical methods such as sparsity-based reconstruction to worst-case perturbations. Krainovic et al. [313] provided theoretical results on a worst-case optimal estimator. Morshuis et al. [314] found that simple end-to-end variational networks are as sensitive to perturbations as U-nets, and Darestani et al. [293] showed that both un-trained and trained networks are sensitive to such perturbations.

Strategies for enhancing robustness. Goujon et al. [315] demonstrated that the robustness of reconstruction algorithms can be improved by constraining the CNN module to be convex or using a monotone constraint [72]. However, a global convexity or monotone constraint often translates into reduced performance [72, 315], as predicted by Antun et al. [312]. A recent work showed that the global monotone constraint described in [72] can be replaced by a local constraint around the image manifold [316], to achieve improved robustness without compromising on performance. Following the approach implemented in non-convex algorithms, this algorithm is theoretically guaranteed to converge to a minimum, provided it is initialized with SENSE reconstructions. From a different perspective, [317] and [318] suggested the use of adversarial attacks during training to reduce false negatives.

Benchmarking challenges

When comparing the performance of different reconstruction and image analysis methods, it is crucial to have appropriate benchmarks that characterize true clinical utility. Conventionally, image quality metrics such as the mean square error (MSE), peak signal-to-noise ratio (PSNR), and the structural similarity metric (SSIM) [319] are used to assess the quality of reconstructed MR images. These metrics have been described extensively in the computer vision literature and have a moderate correspondence with human-perceived image quality [319]. However, as there is a substantial domain shift between natural images and medical images, studies have shown that such traditional image quality metrics do not correlate well with radiologist-perceived metrics [320]. One likely reason for this phenomenon is that not all pixels in a given image have similar diagnostic value. Consider, for example, knee MRI scans, e.g. those contained in the popular fastMRI database. Most abnormalities in knee scans are likely to be located in small, subtle regions of the cartilage, meniscus, and ligaments, while the tissues that dominate the scans, such as bone and muscle, are likely to have substantially fewer abnormalities. Traditional image quality metrics that weigh all image pixels equally may therefore not reflect how a radiologist perceives the images. Thus, it is essential to define metrics that correspond to downstream clinical utility, so that the same metrics can be used for benchmarking models.
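
For reference, the conventional metrics can be computed with scikit-image as below (stand-in data); note that both weigh all pixels equally, which is precisely the limitation discussed above.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

ref = np.random.rand(256, 256).astype(np.float32)            # reference image
rec = (ref + 0.05 * np.random.randn(256, 256)).astype(np.float32)
drange = float(ref.max() - ref.min())
psnr = peak_signal_noise_ratio(ref, rec, data_range=drange)
ssim = structural_similarity(ref, rec, data_range=drange)
print(f"PSNR = {psnr:.1f} dB, SSIM = {ssim:.3f}")
```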

Another issue of concern is that the performance of DL reconstruction algorithms is typically evaluated using a relatively narrow test set, which is similar in nature to the training data. The evaluation results can thus be misleading, since they do not yield a reliable estimation of the model’s generalization ability; i.e., the ability to perform well on test data that deviate from the training data, which is of critical importance in clinical settings.

Yet another challenge is related to the common train/test data split. Even for a single dataset, different research groups or studies may apply different splits, and hence train and test the model with different data subsets. For example, even if two studies used the same number of training images, it is not always possible to verify whether they used the same images. Previous work has shown that there can be significant variation in how well the same deep learning models perform on different imaging exams within the same larger dataset [293]. Consequently, even if the number of training examples is maintained across studies, some examples may be easier or more challenging to train and evaluate on. This makes a head-to-head comparison of the published literature difficult to interpret.

Strategies for mitigating benchmarking challenges. Initiatives such as the large-scale fastMRI database and challenge [32] allow multiple different methods to be compared on identical training, validation, and testing data splits. This type of consistent evaluation platform allows for head-to-head comparisons across different methods and can shed light on their pros and cons.

More recent studies dealing with image quality assessment have gone beyond PSNR and SSIM and used perceptual metrics of image quality instead of relying on handcrafted metrics [321]. Perceptual metrics utilize representations extracted from pre-trained neural networks and have been used for evaluating and optimizing MR reconstruction quality [322, 323]. For example, the Learned Perceptual Image Patch Similarity (LPIPS) is commonly used for assessing the quality of natural images, and an analogous metric, the Self-Supervised Feature Distance (SSFD), has been proposed for assessing MR image quality [324, 325]. With these perceptual metrics, a reference image and an evaluation image are fed into the same pre-trained network, and the difference between their representations is taken as the measure of similarity between the two images. The impact of common image perturbations on such deep feature metrics is shown in Fig. 7. These perceptual metrics can be computed using networks pre-trained either on natural images or on medical images. A recent study demonstrated that although there may not be considerable differences between these two types of networks, using perceptual metrics is better than using traditional image quality metrics when evaluating against radiologists’ perceived image quality [322]. However, due to the challenges in developing image quality metrics that reflect radiological assessment, further research is needed in this context.
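
A typical perceptual-distance computation with the lpips package (AlexNet backbone) is sketched below. LPIPS expects 3-channel inputs scaled to [-1, 1], so the grayscale MR image is repeated across channels; this repetition is a common convention rather than an MRI-specific standard, and the data here are stand-ins.

```python
import torch
import lpips

loss_fn = lpips.LPIPS(net='alex')                 # downloads weights once
ref = torch.rand(1, 1, 256, 256) * 2 - 1          # reference in [-1, 1]
rec = torch.clamp(ref + 0.05 * torch.randn_like(ref), -1, 1)
d = loss_fn(ref.repeat(1, 3, 1, 1), rec.repeat(1, 3, 1, 1))
print(d.item())                                   # 0 = perceptually identical
```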

Fig. 7: Impact of common image perturbations on image quality metrics. A variety of image perturbations applied to a sample image from the fastMRI dataset (top row): noise addition, image blurring, pixel rolling (shifting the image by a number of pixels), and physics-based subject motion. The impact of these corruptions is shown for conventional image quality metrics (SSIM, PSNR) and deep feature distance metrics (LPIPS, designed for natural images; SSFD, designed for MR images). The deep feature metrics exhibit a larger dynamic range in response to the noise, blurring, and motion corruptions, but change very little under pixel rolling, which does not alter image quality. These properties make deep feature metrics well suited for assessing MRI reconstruction quality

Beyond the metrics of image quality, another approach for benchmarking reconstruction models aims to directly benchmark the downstream value provided by the MR image. For example, the SKM-TEA dataset is designed to enable the evaluation of downstream tasks such as clinical classification, segmentation, and articular cartilage T2 quantification [35]. This dataset emphasizes the combination of automated cartilage segmentation with T2 quantification, a known metric with clinical and research significance [326, 327]. Similarly, the K2S dataset aims to optimize cartilage volume and thickness quantification, whereas the fastMRI+ dataset assesses classification and detection tasks [40, 246]. Such benchmarking metrics, which relate to clinical significance, provide a promising path forward for assessing image quality when used in conjunction with traditional or perceptual image quality metrics. Doing so can contribute to optimizing both image quality and image value for radiologists.

Uncertainty estimation

Connecting both upstream and downstream DL tools with clinical practice first requires promoting confidence among the users of these tools. Robustly characterizing the uncertainty of DL models may help increase the trust of downstream users in the outputs of DL algorithms prior to their integration into routine clinical practice. Uncertainty quantification, a popular subfield of DL research, can help build such trust for both image reconstruction and automated image analysis algorithms.

Several works have proposed methods to evaluate the uncertainty of accelerated MRI reconstructions [97, 328,329,330,331,332]. Given that MRI reconstruction is an ill-posed problem, in which multiple plausible high-quality images can be consistent with the same undersampled measurements, quantifying the uncertainty of a sample reconstruction can help assess the fidelity of the overall reconstruction process. The developed approaches leverage the variational formulation of inverse image recovery to sample not just a single reconstruction output, but a variety of different outputs from a learned model of the posterior distribution. Sampling a multitude of image outputs makes it possible to evaluate the voxel-level variance and thereby compute uncertainties within a given image. This technique of uncertainty quantification can be used to determine whether specific regions of interest have high uncertainty values around abnormal image findings.

For downstream image analysis tasks, techniques like Monte Carlo Dropout have gained popularity for estimating output uncertainty [333]. Although dropout is typically employed for model regularization and for reducing overfitting during training by randomly setting parameters to zero, it can also be applied at inference time. This involves producing multiple outputs with different neural network weights set to zero. Monte Carlo Dropout approximates a Gaussian process, which facilitates the computation of uncertainty from the variance across all the generated outputs. Although straightforward to implement, this approach requires multiple forward passes during inference.
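
A minimal PyTorch sketch of Monte Carlo Dropout at inference follows: dropout layers are kept stochastic while the rest of the model stays in evaluation mode, and the voxel-wise variance over repeated forward passes serves as the uncertainty map. The network and number of passes are illustrative.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                      nn.Dropout2d(p=0.2),
                      nn.Conv2d(32, 1, 3, padding=1))
model.eval()
for m in model.modules():                    # re-enable dropout only
    if isinstance(m, nn.Dropout2d):
        m.train()

x = torch.randn(1, 1, 128, 128)              # stand-in input image
with torch.no_grad():
    samples = torch.stack([model(x) for _ in range(32)])
mean_image = samples.mean(dim=0)             # point estimate
uncertainty = samples.var(dim=0)             # voxel-wise uncertainty map
```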

Looking forward, despite our ability to generate voxel-level uncertainties, the optimal utilization of these generated uncertainty maps still remains unclear. This elicits a number of intriguing research questions concerning the ways in which these uncertainties can be utilized beyond simply presenting them to end-users. For instance, it is unlikely that a radiologist would review both the output of a deep learning reconstruction and the corresponding uncertainty map since doing so would nearly double clinical read durations, and the radiologist may not necessarily know how to contextualize regions of high and low uncertainty.

Alternatively, these uncertainty maps could be applied in iterative reconstruction techniques [334]. In this approach, the uncertainty associated with a specific step in an image reconstruction pipeline could serve as input for the subsequent step. The reconstruction network’s objective would be to reconstruct the same image but with the new constraint of reducing the underlying uncertainty. This type of formulation holds potential for adaptive, case-based sampling schemes tailored to individual patients.

Another promising area of research involves directly integrating uncertainties from MR image reconstruction with those of the relevant downstream parameters of interest [335]. This methodology could serve to estimate maximum acceleration rates without compromising quantitative clinical parameters. Achieving such end-to-end analysis necessitates datasets that provide raw case-based data alongside downstream image analysis annotations (e.g., SKM-TEA, fastMRI+, K2S).

Implementation issues

To practically reproduce, implement, and further develop the DL-based strategies discussed in this review, one must have access to appropriate hardware resources. This section highlights the main computational challenges to be expected, especially for large networks. For purposes of illustration, we focus on the case of unrolled networks.

Memory demands of unrolled algorithms. PnP and score-based algorithms that pre-train deep learning modules as denoisers are associated with low memory demands. In contrast, unrolled algorithms, which offer state-of-the-art performance compared to PnP methods, involve a number of unrolled iterations, and their training is restricted by the memory of the GPU devices. This often limits the applicability of unrolled algorithms to large-scale multi-dimensional problems.

Strategies for computational efficiency. Several strategies have been introduced to overcome the memory limitations of unrolled methods. For an unrolled network with N iterations and CNN modules shared across iterations, the computational complexity and memory demands of backpropagation are \({\mathcal {O}}(N)\) and \({\mathcal {O}}(N)\), respectively. The forward steps can be recomputed during backpropagation, which reduces the memory demand to \({\mathcal {O}}(1)\), while the computational complexity increases to \({\mathcal {O}}(N^2)\). Forward checkpointing [336] saves the variables every K layers during forward propagation, which reduces the computational demand to \({\mathcal {O}}(NK)\), while the memory demand is \({\mathcal {O}}(N/K)\). Reverse recalculation has been proposed to reduce the memory demand to \({\mathcal {O}}(1)\) and the computational complexity to \({\mathcal {O}}(N)\) [337]. However, the approach in [337] requires multiple iterations to invert each CNN block, resulting in high computational complexity in practical applications. The deep equilibrium (DEQ) model [338] was recently adapted to inverse problems to significantly improve the memory demand of unrolled methods [71, 72]. Unlike unrolled methods, DEQ schemes run the iterations until convergence, similar to PnP algorithms. This property makes it possible to perform forward and backward propagation using fixed-point iteration involving a single physical layer, which reduces the memory demand to \({\mathcal {O}}(1)\), while the computational complexity is \({\mathcal {O}}(N)\); this offers better tradeoffs than the alternatives discussed above [336, 337]. The runtime of DEQ methods, which are iterated until convergence, is variable, in contrast to unrolled methods, which use a fixed number of iterations. In addition, the convergence of the iterative algorithm is crucial for the accuracy of the backpropagation steps in DEQ, unlike in unrolled methods. Convergence guarantees were introduced in [71, 72].
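
As an illustration of the checkpointing trade-off, the PyTorch sketch below wraps each unrolled iteration in torch.utils.checkpoint, discarding intermediate activations in the forward pass and recomputing them during backpropagation; the shared CNN and residual "iteration" are simplified stand-ins for a real unrolled reconstruction step.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Unrolled(torch.nn.Module):
    def __init__(self, n_iters=10):
        super().__init__()
        self.cnn = torch.nn.Sequential(          # shared across iterations
            torch.nn.Conv2d(2, 32, 3, padding=1), torch.nn.ReLU(),
            torch.nn.Conv2d(32, 2, 3, padding=1))
        self.n_iters = n_iters

    def step(self, x):
        return x + self.cnn(x)                   # one simplified iteration

    def forward(self, x):
        for _ in range(self.n_iters):
            # activations of each step are recomputed during backprop
            x = checkpoint(self.step, x, use_reentrant=False)
        return x

x = torch.randn(1, 2, 64, 64, requires_grad=True)
Unrolled()(x).sum().backward()      # activation memory no longer grows with N
```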

In summary, while computational infrastructure is continuously evolving and expanding, both within research institutes and as cloud services, deep-learning-based reconstruction algorithms may require tremendous storage, RAM, and parallelization capabilities to facilitate the training and inference of massive and complex models. Computational efficiency should therefore be considered an important aspect when assessing the applicability of a new ML approach.

Conclusion

The introduction and rapid development of deep-learning-based strategies for MRI reconstruction have brought about a dramatic acceleration of the acquisition process, paved the way for rapid and accurate parameter mapping, and facilitated automatic schedule optimization. While several key challenges still lie ahead, especially in terms of robust generalization, careful consideration of training data diversity, ongoing model validation, and, potentially, the development of adaptive or continuous-learning systems are expected to enable adjustment to new data distributions over time.