1 Introduction

In the past two decades, magnetic resonance imaging (MRI) has revolutionized diagnostic and therapeutic imaging owing to its non-ionizing, radiation-free nature. It can reveal the structure and function of internal tissues and organs in a high-quality and safe manner. However, the main barrier in contemporary MRI is the slow data acquisition process, which results in long scan times and potentially more severe motion artefacts; accelerating MR acquisition is therefore in high demand. Various efforts have been made to reduce scan time, and they can be categorized into two complementary directions: physics- and hardware-based methods, and signal-processing-based methods. The former mainly focus on designing fast imaging sequences and exploiting information from multiple receiving coils. For example, generalized autocalibrating partially parallel acquisitions (GRAPPA) [1] exploits the diverse information contained in coil sensitivity maps. The latter rely on prior knowledge of sparsity, with Compressed Sensing MRI (CS-MRI) [2] as an important representative. In CS-MRI, data are acquired in k-space (the frequency domain) with random undersampling (usually below 50%). The artefacts introduced by random undersampling can be treated as noise-like interference and removed through sparse representation [2]. Although this assumption of sparsity in a transform domain succeeds in many applications [3,4,5,6,7], one remaining problem is the computational complexity of the resulting optimization.

Recent progress in artificial neural networks has opened new opportunities to solve classification [8], recognition [9], and ill-posed inverse problems [10] more efficiently than conventional signal processing methods. When dealing with ill-posed inverse problems, Convolutional Neural Networks (CNNs) outperform a great number of traditional model-based methods across different tasks, such as image super-resolution [11], segmentation [12], de-noising [13], and pose estimation [14].

To address the computational complexity resulting from the above assumption, preliminary studies on deep learning based MRI reconstruction have made great progress. Wang et al. [15] first proposed using an end-to-end CNN to learn the mapping between zero-filled and fully-sampled data. Schlemper et al. [16] incorporated data consistency as a layer when cascading CNNs for MRI reconstruction and demonstrated that training with different Cartesian masks is beneficial for generic applications. Yang et al. [17] took advantage of the Alternating Direction Method of Multipliers (ADMM) and significantly reduced reconstruction time while matching the quality of traditional model-based methods. Sun et al. [18] used a recursive dilated architecture that introduces dilated convolutions to reduce the number of network parameters. However, training a traditional end-to-end CNN with a pixel-wise loss function may produce overly smooth structural detail and lack the perceptually coherent detail required for diagnostic purposes. Goodfellow et al. [19] proposed Generative Adversarial Networks (GANs), in which the generator network acts as a nonlinear transformation, addressing perceptual generation problems at a high level. Wang et al. [20] proposed a state-of-the-art single image super-resolution (SISR) model called Enhanced SRGAN, which combines a residual dense block with a GAN architecture. Analogous computer vision tasks (e.g. super-resolution, de-noising, and reconstruction) with perceptual-quality-driven goals have gradually made use of GAN architectures and shown promising results [20, 21].

Mardani et al. [22] first incorporated GANs into Compressed Sensing MRI. Quan et al. [23] used a cyclic loss function while learning the residual content of undersampled scans. Yang et al. [24] incorporated perceptual and frequency domain losses, and Li et al. [25] introduced a structure regularization called Patch Correlation Regularization (PCR), which aims to restore structural information at both local and global scales. Chen et al. [26] trained a GAN to provide two MRI contrasts during a single scan.

The motivation of this study comes mainly from two observations. In recent years, image restoration and de-noising tasks have laid great emphasis on perceptual quality [27] based on the human visual system (HVS). In addition, it is straightforward to formulate a single reconstruction from a single observation instead of repeatedly refining the generated sample, which is beneficial for clinical hardware implementation.

Taking the above observations into consideration, in this study we adopt and improve the Generative Adversarial Network (GAN) framework, aiming for the first time at preserving global structure with an MS-SSIM oriented training objective while realizing data correction after a single reconstruction. Specifically, our contributions are as follows:

  • We propose to incorporate an MS-SSIM oriented loss function into an unbalanced U-Net based generator architecture, further balanced against an NMSE loss and a frequency domain loss.

  • We propose to add a single data consistency correction after one-time reconstruction using the GAN, which can be further integrated into the model.

  • We present a theoretical analysis of the proposed differentiable loss function, conduct extensive comparison experiments to examine our model, and demonstrate the efficiency of the proposed method.

The rest of this paper is organized as follows. In Sect. 2 the problem is stated, and the Structure Oriented Generative Adversarial Network (SOGAN) is motivated by an evaluation of conventional and deep learning methods. Section 3 presents the method and our contributions. In Sect. 4, training details and data evaluation are presented. Discussion and conclusion are in Sect. 5.

2 Problem Formulation

Compressed Sensing MRI can be treated as an ill-posed linear system \(y= \varvec{\varPhi } x + \epsilon \) with \(\varvec{\varPhi } \in {\mathbb {C}}^{M \times N}\), where \(\epsilon \) denotes noise and other unmodeled dynamics. The observation and the desired reconstruction are denoted by y and x respectively, where \(y \in {\mathbb {C}}^{M}\) and \(x \in {\mathbb {C}}^{N}\) with \(M \ll N\). The image acquisition process is described by the matrix \(\varvec{\varPhi } \in {\mathbb {C}}^{M \times N}\), so the goal is to estimate an inverse mapping from \({\mathbb {C}}^{M}\) to \({\mathbb {C}}^{N}\), which is underdetermined. Another unstable factor is the unmodeled dynamics \(\epsilon \). The reconstructed image is often estimated by

$$\begin{aligned} \hat{x} = \mathop {\arg \min }_{x} \ \{ \frac{1}{2}\Vert \varvec{\varPhi } x - y\Vert _2^2 + \sum _{l=1}^L \lambda _l g_l(x) \} \end{aligned}$$
(1)

in which \(\varvec{\varPhi } = \varvec{PF} \in {\mathbb {C}}^{M \times N}\) is the measurement matrix, with \(\varvec{P}\) denoting the undersampling operation and \(\varvec{F}\) the Fourier transform. The \(g_l(\cdot )\) are regularization terms that make use of prior information; an \(l_q\)-regularizer (\(q \in [0, 1]\)) is usually adopted for compressed sensing problems.
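To make the forward model concrete, the following minimal numpy sketch (an illustration only, not the paper's implementation; the mask variable is a hypothetical binary Cartesian under-sampling pattern) shows how an undersampled observation and the corresponding zero-filled baseline reconstruction are obtained:

```python
import numpy as np

def undersample(x, mask):
    # Forward model y = P F x: Fourier-transform the image and keep only the
    # k-space locations selected by the binary mask (Cartesian mask assumed).
    return mask * np.fft.fft2(x)

def zero_filled_recon(y):
    # Baseline reconstruction: inverse FFT of the zero-filled k-space data.
    return np.abs(np.fft.ifft2(y))
```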

For the learning based problem formulation, no further information is available beyond the training samples and the corresponding noisy observations. The goal is to estimate \(x'\) from newly acquired data \(y'\). We denote the fully-sampled training data as the set \(\varvec{X} = \{x_1, x_2, \dots , x_t\}\) and the corresponding observations as the set \(\varvec{Y} = \{y_1, y_2, \dots , y_t\}\), so the training set can be written as \(\varvec{S} = \{(x_1, y_1), (x_2, y_2), \dots , (x_t, y_t)\}\).

2.1 Conventional Model-Based MRI

Magnetic resonance imaging relies on radio frequency pulse sequences. Model-based theories can be categorized into two parts: transform-based methods and dictionary learning-based methods. Conventional Compressed Sensing (CS) theory relies on the sparsity of the acquired signal, and transforms such as the Fourier transform, wavelets [4], and the discrete cosine transform [2] are used. However, solving the resulting minimization problem may introduce computational complexity and also produce block artefacts [28]. The strength of dictionary learning-based methods [29] is that a dictionary can be designed specifically for the desired dataset.

2.2 Generative Adversarial Networks

Generative Adversarial Networks for general image processing derive from a zero-sum game between two CNN-based neural networks, called the generator \(\varvec{G}\) and the discriminator \(\varvec{D}\). The generator aims to learn the mapping from the latent space z to the distribution of the ground truth data x. The discriminator learns to classify whether an input sample lies within the real data distribution \(P_{data}\) or the generated data distribution \(P_g\). Together, the training objective L can be formulated as

$$\begin{aligned} \mathop {\min }_{G} \mathop {\max }_{D} L(D, G) = E_{x \sim P_{data}(x)}[log D(x)] + E_{z \sim P_{z}(z)}[log(1-D(G(z)))] \end{aligned}$$
(2)

The GAN can be trained by simultaneous optimization of \(\varvec{G}\) and \(\varvec{D}\) with a stochastic gradient descent algorithm. However, in the initial stage an over-confident discriminator may cause a vanishing gradient problem, so the gradient step for the generator often takes the non-saturating form

$$\begin{aligned} \varDelta _G = \nabla _G E_{z \sim P_{z(z)}}[-log(D(G(z)))] \end{aligned}$$
(3)

The optimal balance between generator and discriminator is finally reached at

$$\begin{aligned} D_G'(x) = \frac{P_{data}(x)}{P_{data}(x)+P_g(x)} \end{aligned}$$
(4)

in which the discriminator's prediction approximates a random guess and the training process is then stopped. At this final stage, the generated sample distribution \(P_g\) approximates the real data distribution \(P_{data}\).
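For illustration, a minimal PyTorch sketch of the non-saturating generator update in Eq. (3) is given below (the function and variable names are hypothetical; the paper's implementation is built on TensorLayer):

```python
import torch

def generator_step(G, D, z, opt_G, eps=1e-12):
    # Non-saturating generator update of Eq. (3): minimize -log(D(G(z)))
    # instead of log(1 - D(G(z))) to avoid vanishing gradients early in training.
    opt_G.zero_grad()
    fake = G(z)
    loss_g = -torch.log(D(fake) + eps).mean()
    loss_g.backward()
    opt_G.step()
    return loss_g.item()
```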

3 Method

In the proposed SOGAN architecture, a U-Net based generator is adopted for several reasons. Skip connections between the down-sampling encoder and the up-sampling decoder preserve structural information at different scales, which has proven effective in medical image processing tasks [30]. Unlike super-resolution, where the output is upscaled, the input and output of our task must remain the same size. Moreover, the residual connections effectively propagate gradients and avoid the vanishing gradient problem.

The generator used in this study is shown in Fig. 1. It utilizes skip connections; each down-sampling stage decreases the feature map size by a factor of 2 and each up-sampling stage increases it by a factor of 2. Empirically, doubling the feature maps in the decoder, which results in an unbalanced U-Net, helps to reconstruct more detail and yields better results. Each down-sampling and up-sampling stage consists of three parts: a convolutional (or deconvolutional) layer, a batch normalization layer, and a Leaky ReLU layer.

Fig. 1. U-Net structure of the generator used in this study
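The sketch below illustrates such an unbalanced U-Net in PyTorch (the channel widths and depth are assumptions for illustration; the actual generator follows Fig. 1 and is implemented in TensorLayer): the encoder halves and the decoder doubles the feature map size at each stage, with skip connections concatenating encoder features into the decoder.

```python
import torch
import torch.nn as nn

def down(c_in, c_out):
    # Down-sampling stage: strided convolution + batch norm + Leaky ReLU,
    # halving the feature map size.
    return nn.Sequential(nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2))

def up(c_in, c_out):
    # Up-sampling stage: transposed convolution + batch norm + Leaky ReLU,
    # doubling the feature map size.
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2))

class UnbalancedUNet(nn.Module):
    # Decoder channels are doubled relative to a symmetric U-Net, as in the text.
    def __init__(self):
        super().__init__()
        self.d1, self.d2, self.d3 = down(1, 64), down(64, 128), down(128, 256)
        self.u3 = up(256, 256)                 # decoder channel count doubled
        self.u2 = up(256 + 128, 128)           # skip connection from d2
        self.u1 = up(128 + 64, 64)             # skip connection from d1
        self.out = nn.Conv2d(64, 1, 3, padding=1)

    def forward(self, x):
        e1 = self.d1(x)                                   # 256 -> 128, 64 ch
        e2 = self.d2(e1)                                  # 128 -> 64, 128 ch
        e3 = self.d3(e2)                                  # 64 -> 32, 256 ch
        y3 = self.u3(e3)                                  # 32 -> 64
        y2 = self.u2(torch.cat([y3, e2], dim=1))          # 64 -> 128
        y1 = self.u1(torch.cat([y2, e1], dim=1))          # 128 -> 256
        return torch.tanh(self.out(y1))                   # outputs in [-1, 1]
```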

3.1 Multi-scale SSIM Loss

PSNR and SSIM are the two most important factors for evaluating the content quality of a reconstructed image. More importantly, the perceptual quality of an image is often assessed by SSIM [31], so a perceptually motivated error function is adopted in this study. SSIM at pixel p is evaluated as

$$\begin{aligned} SSIM(p) = \frac{2\mu _x \mu _y+C_1}{\mu _x^2 + \mu _y^2 + C_1} \cdot \frac{2\sigma _{xy}+C_2}{\sigma _x^2 + \sigma _y^2 + C_2} =l(p) \cdot cs(p) \end{aligned}$$
(5)

The SSIM driven loss function can then be written as

$$\begin{aligned} L^{SSIM}(P) = \frac{1}{N}\sum _{p \in P} (1-SSIM(p)) \end{aligned}$$
(6)

To deal with boundary regions, SSIM and its derivatives are evaluated at the center pixel \(\tilde{p}\) of patch P rather than at every pixel p. For back-propagation, the derivative of the SSIM loss is calculated as

$$\begin{aligned} \frac{\partial {L^{SSIM}(P)}}{\partial x(q)} = -\frac{\partial }{\partial x(q)}SSIM(\tilde{p}) = -\left( \frac{\partial l(\tilde{p})}{\partial x (q)} \cdot cs(\tilde{p})+l(\tilde{p})\cdot \frac{\partial cs (\tilde{p})}{\partial x(q)}\right) \end{aligned}$$
(7)

where q is any pixel in patch P, and \(l(\tilde{p})\) and \(cs(\tilde{p})\) denote the two terms in computing \(SSIM(\tilde{p})\). For the multi-scale SSIM computation, the loss function is then written as

$$\begin{aligned} L^{MS-SSIM}(P) = 1-MSSSIM(\tilde{p}) \end{aligned}$$
(8)

However, using MS-SSIM as the sole optimization goal may introduce shifts of brightness or colors [31]; based on this observation, multiple loss functions are combined, as demonstrated in the next section.
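A single-scale PyTorch sketch of the SSIM loss in Eq. (6) follows for illustration (it uses a uniform local window instead of the Gaussian window of [31] and omits the multi-scale averaging of Eq. (8); the constants assume inputs rescaled to a dynamic range of 1):

```python
import torch
import torch.nn.functional as F

def ssim_loss(x, y, window=11, c1=0.01 ** 2, c2=0.03 ** 2):
    # Local statistics over a uniform window (Gaussian weighting in [31]).
    pad = window // 2
    mu_x = F.avg_pool2d(x, window, stride=1, padding=pad)
    mu_y = F.avg_pool2d(y, window, stride=1, padding=pad)
    var_x = F.avg_pool2d(x * x, window, stride=1, padding=pad) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, window, stride=1, padding=pad) - mu_y ** 2
    cov_xy = F.avg_pool2d(x * y, window, stride=1, padding=pad) - mu_x * mu_y
    l = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)   # luminance term l(p)
    cs = (2 * cov_xy + c2) / (var_x + var_y + c2)               # contrast-structure cs(p)
    return (1.0 - l * cs).mean()                                # Eq. (6)
```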

3.2 Generator Loss

For a structurally oriented training objective, the MS-SSIM loss, adversarial GAN loss, pixel-wise l2 loss, and frequency domain l2 loss are combined into the total optimization goal. We use the loss function in Eq. (8) for the MS-SSIM term, which is the main training guidance. For fast convergence and robustness, the pixel-wise NMSE loss dominates at the start of training and helps to guide the overall gradient descent. The pixel-wise NMSE loss can be written as

$$\begin{aligned} \mathop {\min }_{G}L_{NMSE}(G) = \frac{\Vert x_t - \hat{x}_g\Vert _2^2}{\Vert x_t\Vert _2^2} \end{aligned}$$
(9)

Recall that \(x_t\) denotes the ground truth and \(\hat{x}_g\) the generated sample. Unlike single image super-resolution (SISR) tasks, in which images do not have a clear frequency domain pattern, MRI data are naturally acquired in k-space. Thus a frequency domain NMSE loss is also added to the training loss

$$\begin{aligned} \mathop {\min }_{G}L_{k-space}(G) = \frac{\Vert f_t - \hat{f}_g\Vert _2^2}{\Vert f_t\Vert _2^2} \end{aligned}$$
(10)

where \(f_t\) and \(\hat{f}_g\) are the k-space data corresponding to \(x_t\) and \(\hat{x}_g\). Finally, the adversarial loss based on the discriminator is written as

$$\begin{aligned} \mathop {\min }_{G} L_{GAN}(G) = -log(D(\hat{x}_g)) \end{aligned}$$
(11)

The total loss function for SOGAN generator is then as follows

$$\begin{aligned} L_{SOGAN}(G) = \alpha L_{MSSSIM} + \beta L_{NMSE} + \gamma L_{kspace} + \eta L_{GAN} \end{aligned}$$
(12)
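To illustrate how the terms of Eq. (12) can be combined in practice, a minimal PyTorch sketch is given below (the function names are hypothetical, the default weights are those reported later in Sect. 4.1, and the MS-SSIM term is assumed to be precomputed, e.g. with the single-scale sketch of Sect. 3.1):

```python
import torch

def nmse(a, b):
    # Normalized mean squared error of Eqs. (9)-(10): ||a - b||^2 / ||a||^2.
    return torch.sum(torch.abs(a - b) ** 2) / torch.sum(torch.abs(a) ** 2)

def sogan_generator_loss(x_t, x_g, d_out, msssim_loss,
                         alpha=10.0, beta=15.0, gamma=0.5, eta=1.0):
    # Eq. (12): weighted sum of the MS-SSIM, image-domain NMSE, k-space NMSE
    # and adversarial terms; d_out is the discriminator output for x_g.
    l_nmse = nmse(x_t, x_g)                                       # Eq. (9)
    l_kspace = nmse(torch.fft.fft2(x_t), torch.fft.fft2(x_g))     # Eq. (10)
    l_gan = -torch.log(d_out + 1e-12).mean()                      # Eq. (11)
    return alpha * msssim_loss + beta * l_nmse + gamma * l_kspace + eta * l_gan
```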

3.3 Single Data Consistency Correction

The task of MRI reconstruction is analogous to super-resolution and de-noising, yet also different: the intrinsic down-sampling is performed in k-space, which results in global aliasing and blurring artefacts.

After training the network, a mapping from the observed sample \(x_t\) to the reconstructed sample \(\hat{x}_g = f_{recon}(x_t)\) is obtained. To fully respect the original k-space data, we apply a data consistency layer after the single reconstruction. Since \(\mathcal {F}(x_t)\) contains the originally acquired samples plus the padded zeros, the generator should only fill in the missing k-space entries. During this nonlinear interpolation, however, \(f_{recon}(x_t)\) inevitably also alters the originally acquired data, because the reconstruction is performed in the image domain. Hence the correction \(f_{DC}(\cdot )\) is carried out in k-space, and the output of the final network is

$$\begin{aligned} \hat{x}_g' = \mathcal {F}^{-1}(f_{DC}(\mathcal {F}(\hat{x}_g))) \end{aligned}$$
(13)

where we transform to the frequency domain and replace the corresponding data points with the originally acquired ones. The overall undersampling and reconstruction process is illustrated in Fig. 2.
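A minimal numpy sketch of this correction is given below (variable names are hypothetical; mask is the binary under-sampling pattern and y_k the originally acquired, zero-filled k-space data):

```python
import numpy as np

def data_consistency(x_g, y_k, mask):
    # Eq. (13): re-insert the originally acquired k-space samples into the
    # spectrum of the reconstruction, keep the generator's estimate elsewhere,
    # and transform back to the image domain.
    f_g = np.fft.fft2(x_g)
    f_corrected = np.where(mask, y_k, f_g)
    return np.fft.ifft2(f_corrected)
```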

Fig. 2. Overall reconstruction process of SOGAN

4 Experiment Settings and Results

In the following, experiments are performed to test the capability of SOGAN, and the results are compared with other state-of-the-art methods.

4.1 Experiment Settings

Dataset. We tested SOGAN on the MICCAI 2013 Grand Challenge on Multi-Atlas Labeling [32] and used the deep brain structure data for training, validation and testing. For training, 12729 T2-weighted brain scans were included, with 3879 for validation and 7082 for testing. All T2-weighted brain scans are \(256 \times 256\) and were normalized to \([-1, 1]\). During the under-sampling process, all networks were tested under different masks: 10%, 20%, and 30%, corresponding to 10, 5 and 3.3 times acceleration. For robustness of the training process, a time-decreasing data augmentation was applied to the training set: both the additive white Gaussian noise (AWGN) and the random interpolation of the image start at a ratio of 1 and decrease as the training epochs increase. To test the model, 50 images from the test set were randomly chosen.
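The noise part of this time-decreasing augmentation can be sketched as follows (the linear decay schedule and noise level are assumptions for illustration, not the exact settings used in training):

```python
import numpy as np

def augment(image, epoch, total_epochs, max_sigma=0.05, rng=None):
    # Time-decreasing augmentation: the perturbation ratio starts at 1 and
    # decays linearly to 0 over the training epochs (schedule assumed).
    rng = rng or np.random.default_rng()
    ratio = max(0.0, 1.0 - epoch / total_epochs)
    noisy = image + ratio * max_sigma * rng.standard_normal(image.shape)  # AWGN
    return np.clip(noisy, -1.0, 1.0)  # training images are normalized to [-1, 1]
```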

Network Settings and Training Details. After empirical experiments, we set the loss function to \(L_{SOGAN}(G) = \alpha L_{MSSSIM} + \beta L_{NMSE} + \gamma L_{kspace} + \eta L_{GAN}\), with \(\alpha = 10, \beta = 15, \gamma = 0.5\), and \( \eta = 1\). In the early stage, the NMSE and frequency domain losses decrease dramatically and provide the main driving force; later on, the SSIM and adversarial losses refine the details of the output. The architecture of SOGAN was inspired by [19] and implemented with the TensorLayer API. Training was conducted on an NVIDIA Tesla K80 with 12 GB of memory. The initial learning rate was set to 1E-3 and the batch size to 25.

4.2 Results on Real MRI Data

We use the Structural Similarity Index (SSIM), Normalized Mean Square Error (NMSE), and Peak Signal-to-Noise Ratio (PSNR) as the three evaluation metrics. The PSNR and NMSE results are shown in Table 1, and the SSIM results in Table 2. For visual results, SOGAN reconstructions under different under-sampling rates are shown in Fig. 3. To compare the efficiency of the different components of our method, we first trained the full network, Structure Oriented GAN with data consistency correction (SOGAN-DC), and then removed the data consistency layer (SOGAN). We also examined our model with the SSIM loss removed (Pixel-SOGAN) to show that structurally oriented training is beneficial. To compare performance across models, we trained the state-of-the-art deep learning based method DAGAN [24] and also compared our model with ADMM-Net [17].

Table 1. Evaluations for PSNR and NMSE
Table 2. Evaluations for SSIM
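For reference, PSNR on the normalized images can be computed as in the short sketch below (a hypothetical helper, assuming a dynamic range of 2 for data in \([-1, 1]\)); NMSE follows Eq. (9).

```python
import numpy as np

def psnr(reference, reconstruction, data_range=2.0):
    # Peak Signal-to-Noise Ratio; data_range is 2.0 for images in [-1, 1].
    mse = np.mean((reference - reconstruction) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)
```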

4.3 Comparison

Comparison of Different Training Variables. It can be observed that the data consistency layer plays an important role in reconstruction quality, and its benefit grows as the sampling rate increases, because the correction then involves more acquired data. The SSIM oriented loss function also improves the performance of the network, not only in SSIM but also in PSNR. Note that although the data consistency layer helps significantly, the correction does not introduce computational redundancy, as the reconstruction time barely increases.

Fig. 3. SOGAN reconstruction results under different under-sampling rates

Fig. 4. Comparison of the reconstructed samples' differences from the fully sampled data under 10% under-sampling

Fig. 5. Pixel values drawn from a horizontal line of the reconstructed samples

Comparison of Different Sampling Rates. As can be observed from Table 1, the sampling rate strongly influences the reconstruction result. The current reconstruction results at 10% undersampling are approximately at the same level as zero-filled images at 30% undersampling. Moreover, the performance increases non-linearly: when the undersampling rate is very low (10%), the relative gain from reconstruction is largest.

Comparison with Other Models. We compared SOGAN with other state-of-the-art models, as shown in Table 1; the performance is improved thanks to the k-space correction and the combined loss function. For structural evaluation in particular, SOGAN-DC outperforms the other models in SSIM, as shown in Table 2. For visual comparison, we plotted the pixel values along a horizontal line of a reconstructed sample. The differences of the reconstructed samples from the fully sampled data for the different methods are shown in Fig. 4; SOGAN-DC also achieves a perceptually satisfying result. Moreover, it can be observed from the pixel values drawn along a horizontal line in Fig. 5 that SOGAN-DC preserves structural contrast most successfully.

5 Discussion and Conclusion

In this paper, we propose to bring the structurally oriented training techniques of image restoration into the reconstruction of compressed sensing MRI, and we present our novel SOGAN model. For compressed sensing MRI, preserving structural information is critical for clinical diagnostic purposes. This study focuses on preserving frequency domain information with a k-space correction layer as well as a structurally oriented MS-SSIM loss function. A theoretical analysis of the structural loss function is given and the combined training strategies are illustrated.

Numerical experiments show that our architecture efficiently learns the mapping from zero-filled acquisitions to perceptually convincing reconstructions. For future exploration, it would be interesting to design an architecture with fewer parameters and a pruning strategy for hardware implementation.