Abstract
Multi-image super-resolution (MISR) usually outperforms single-image super-resolution (SISR) under proper inter-image alignment by explicitly exploiting the inter-image correlation. However, the large computational demand encumbers the deployment of MISR in practice. In this work, we propose a distributed optimization framework based on data parallelism for fast large-scale MISR using multi-GPU acceleration, named FL-MISR. The scaled conjugate gradient (SCG) algorithm is applied to the distributed subfunctions, and the local SCG variables are communicated to synchronize the convergence rate over the multi-GPU system towards a consistent convergence. Furthermore, an inner-outer border exchange scheme is performed to obviate the border effect between neighboring GPUs. The proposed FL-MISR is applied to the computed tomography (CT) system by super-resolving the projections acquired via sub-pixel detector shift. The SR reconstruction is performed on the fly during the CT acquisition such that no additional computation time is introduced. FL-MISR is extensively evaluated from different aspects, and experimental results demonstrate that it effectively improves the spatial resolution of CT systems in terms of modulation transfer function (MTF) and visual perception. Compared to a multi-core CPU implementation, FL-MISR achieves a more than 50\(\times\) speedup on an off-the-shelf 4-GPU system.
Introduction
Super-resolution (SR) is a fundamental task in image processing and has been an active research field for decades [1,2,3]. SR is an algorithm-based image enhancement technique dedicated to improving the spatial resolution of imaging systems beyond the hardware limit by exploiting low-resolution (LR) acquisitions, and it is widely used in applications such as medical diagnostics, surveillance, and remote sensing.
In recent years, we have witnessed tremendous progress of deep learning in multiple image processing and computer vision tasks such as image denoising [4, 5], super-resolution [6, 7], deformable registration [8, 9], and semantic segmentation [10, 11]. Despite the great success of deep learning in SR, most of the work focuses on single-image SR (SISR) [6, 7, 12,13,14,15,16,17]. In fact, SR reconstruction can significantly benefit from the availability of correlated input images captured of the same view. Multi-image SR (MISR) exploits the correspondences contained in the multiple input images and usually outperforms SISR when the relative movements between the reference image and the other input images are well estimated. However, the learning-based MISR approaches in the literature are mainly dedicated to video applications [18,19,20,21,22]. Besides, the quality of learning-based methods highly depends on the fidelity of the training datasets. In practice, preparing synthetic datasets which adequately resemble real-world measurements under diverse imaging conditions is challenging. Furthermore, although learning-based approaches are able to describe more sophisticated image priors, “hallucinated” structures can be constructed unpredictably, which may impede the deployment of trained models in applications such as metrology and quality control.
Different from deep learning-based SR methods, optimization-based MISR algorithms [23,24,25,26,27] reconstruct the latent high-resolution (HR) image explicitly from the real acquisitions rather than from training datasets. Nowadays, due to the technological development of sensor manufacturing, sensors or detectors with large resolutions of 8, 16 Mpixels or even higher are employed in applications such as medical imaging and industrial inspection. Coping with large-scale multi-image input can be computationally expensive and costly in hardware. Optimization-based SR methods usually suffer from their iterative nature, which leads to undesirable computation time. In this work, we present a multi-GPU accelerated framework for large-scale MISR reconstruction based on distributed optimization. The proposed framework is applied to the computed tomography (CT) imaging system and achieves a real-time SR reconstruction during the CT acquisition without introducing additional computation time. The contributions of this work can be summarized as follows:

We propose a distributed optimization framework for MISR, named FL-MISR, which deals with large-sized images based on multi-GPU acceleration. Each GPU accounts for an allocated partition, and the latent SR image is obtained by image fusion.

To obtain a consistent resolution enhancement among all the GPUs, the update of the partitions is synchronized by unifying the local variables of the scaled conjugate gradient (SCG) method. To avoid the border effect between neighboring GPUs, an inner-outer border exchange scheme is performed.

The proposed FL-MISR is applied to real-time CT imaging by super-resolving the projections acquired via sub-pixel detector shift. Extensive evaluation from different aspects demonstrates that FL-MISR not only achieves a significant resolution enhancement for CT systems but also provides very promising results for natural images. Compared to a multi-core CPU implementation, FL-MISR achieves a more than \(50\times\) speedup on a 4-GPU system.
Related work
Optimization-based iterative methods
In the literature, conventional optimization-based iterative SR methods can be traced back to the 1980s, and they are mainly grouped into two categories: frequency-domain-based and spatial-domain-based methods [1, 2]. Huang et al. [28] first addressed the MISR problem in the frequency domain. Although frequency-domain-based methods have low computational complexity, they are extremely sensitive to model errors and have limited ability to integrate a priori knowledge as regularization. The majority of iterative MISR approaches solve the problem in the spatial domain based on maximum likelihood (ML), maximum a posteriori (MAP), and projection onto convex sets (POCS) [23, 25, 27, 29,30,31,32,33]. Most of the work focuses on reconstruction accuracy and only little of it addresses computation time. Specifically, Elad et al. [30] propose a fast MISR algorithm for the special case of pure translation and space-invariant blur. In [23], Farsiu et al. present a robust MISR method based on MAP using the L1 norm data fidelity term and bilateral total variation regularization. Jens et al. [32] introduce a GPU-accelerated MISR approach for image-guided surgery which supports a 2\(\times\) SR reconstruction from 4 LR images of size 200\(\times\)200 in 60 ms. However, due to the GPU memory limit, their method cannot handle large-sized images. In [34], the authors propose a fast MISR method for satellite images which consists of registration, fusion, and sharpening using high-order spline interpolation. Nevertheless, only the image fusion is performed on the GPU while the other two steps run on the CPU, which results in degraded runtime performance.
Deep learning-based methods
In the last decade, deep learning has been very successfully adopted in SR and has harvested fruitful results. Dong et al. [6] introduce the convolutional neural network (CNN) into SISR, which demonstrates the great potential of CNNs for feature extraction. Inspired by the distinguished performance of CNNs, a series of work from plain CNNs to densely connected GANs, from 2D natural images to 3D medical volumes, has been successively proposed [7, 12, 13, 15,16,17]. Compared to the traditional iterative methods, CNN-based SR approaches focus on super-resolving a single LR image by exploiting the relation learned exclusively from the LR-HR image pairs in an external example database. The learning-based MISR methods are mainly proposed to cope with natural video streams [18,19,20,21]. Although some work is intended for real-time applications using GPUs or FPGAs [21, 22, 35], the video SR (VSR) performance highly relies on the fidelity of the synthesized LR-HR frame pairs and the quality of the training datasets. Furthermore, since the supervised learning scheme requires ground truth (GT) HR images during the training phase, the performance of the trained model is limited by the quality of the GT obtainable in practice. This is especially true for CT imaging due to the lack of publicly available high-quality HR datasets like DIV8K [36] for natural images.
To the best of our knowledge, the literature on GPU-accelerated MISR methods for large-scale images is very limited despite its importance. In this paper, we extend our previous work [37] in two main respects. First, the locally applied scaled conjugate gradient (SCG) algorithm is adapted to achieve a synchronized convergence rate over multi-GPU systems. Second, instead of performing region averaging, we employ a so-called inner-outer border exchange scheme to preserve the sharpness of the overlapped regions. In particular, in [37] we introduced a multi-GPU implementation of a MISR method based on data parallelism where each GPU deals with an allocated partition of the latent SR image. Although overlapped regions between neighboring GPUs are exchanged and averaged, we found that the resolved SR image in [37] is not globally optimized and the fused SR image suffers from inhomogeneous resolution enhancement due to the inconsistent convergence rate of the local SCG and the region averaging. We address these issues in this work and propose a generalized framework for multi-GPU supported MISR. We conducted extensive experiments to validate the proposed FL-MISR. Experimental results show that the exchange of local SCG variables and overlapped regions among GPUs has limited impact on the overall runtime and leads to a consensus convergence over multiple GPUs without causing border effects. Besides, it is shown that super-resolving four input images of size 4096 \(\times\) 4096 by an upscaling factor of 2\(\times\) can be achieved within 2.4 s on a 4-GPU system.
Methods
Distributed optimization for MISR
The common formulation of the SR model in the pixel domain is

\(\mathbf {y}=\mathbf {A}\mathbf {x}+\varvec{\varepsilon }(\mathbf {x})\)    (1)

with \(\mathbf {x}\in \mathbb {R}^{n\times 1}, \mathbf {y}\in \mathbb {R}^{m\times 1}\) being, respectively, the latent and the captured image rearranged in lexicographic order. The system matrix \(\mathbf {A}\in \mathbb {R}^{m\times n}\) is usually expressed as \(\mathbf {A}=\mathbf {D}\mathbf {B}\mathbf {M}\) with \(\mathbf {D}\in \mathbb {R}^{m\times n}, \mathbf {B}\in \mathbb {R}^{n\times n}\), and \(\mathbf {M}\in \mathbb {R}^{n\times n}\) describing the decimation, blurring, and motion effects, respectively. The vector \(\varvec{\varepsilon }(\mathbf {x})\in \mathbb {R}^{m\times 1}\) denotes the additive noise of the imaging system. A more detailed description of the system model can be found in [27]. To simplify the calculation, in this work we assume that \(\varvec{\varepsilon }(\mathbf {x})\) is an intensity-independent additive noise and that the system matrix \(\mathbf {A}\) is known.
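As an illustration, the forward model \(\mathbf {y}=\mathbf {D}\mathbf {B}\mathbf {M}\mathbf {x}+\varvec{\varepsilon }\) can be sketched in a few lines. The blur width, noise level, and decimation factor below are illustrative assumptions, not the values used in the experiments:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, shift

def simulate_lr(x, dx, dy, sigma_blur=0.5, factor=2, noise_std=0.01, rng=None):
    """Apply the forward model y = D B M x + eps to an HR image x.

    M: sub-pixel translation by (dx, dy) HR pixels,
    B: Gaussian blur standing in for the detector PSF,
    D: decimation by `factor` (block averaging),
    eps: additive white Gaussian noise.
    """
    rng = rng or np.random.default_rng(0)
    moved = shift(x, (dy, dx), order=1, mode="nearest")       # M x
    blurred = gaussian_filter(moved, sigma_blur)              # B M x
    h, w = blurred.shape
    lr = blurred[:h - h % factor, :w - w % factor]
    lr = lr.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))  # D B M x
    return lr + rng.normal(0.0, noise_std, lr.shape)          # + eps

# Four LR frames shifted by one HR pixel (half an LR pixel), as in Fig. 4:
hr = np.random.default_rng(1).random((64, 64))
frames = [simulate_lr(hr, dx, dy) for dx, dy in [(0, 0), (1, 0), (1, 1), (0, 1)]]
```

Each LR frame is half the HR size per axis; the four sub-pixel offsets jointly sample the HR grid, which is what MISR exploits.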
Since SR is an ill-posed problem, involving a well-defined image prior can effectively constrain the solution domain. Therefore, the MAP estimator is preferably adopted for SR reconstruction. The posterior probability \(P(\mathbf {x}|\mathbf {y})\) of the SR image \(\mathbf {x}\) is formulated based on Bayes’ theorem:

\(P(\mathbf {x}|\mathbf {y})=\frac{P(\mathbf {y}|\mathbf {x})P(\mathbf {x})}{P(\mathbf {y})}\)    (2)
Assuming the noise \(\varepsilon _i\in \varvec{\varepsilon }\) in each pixel i is white Gaussian and i.i.d. with \(\varepsilon _i\sim N(0, \sigma ^2)\) and \(P(\varepsilon _i)=\frac{1}{\sqrt{2\pi \sigma ^2}}e^{-\frac{\varepsilon _i^2}{2\sigma ^2}}\), we yield the likelihood function

\(P(\mathbf {y}|\mathbf {x})=\prod _{i=1}^{m}P(\varepsilon _i)=\left( \frac{1}{\sqrt{2\pi \sigma ^2}}\right) ^{m}e^{-\frac{\Vert \mathbf {A}\mathbf {x}-\mathbf {y}\Vert _2^2}{2\sigma ^2}}\)    (3)
Taking the natural logarithm, the associated negative log-likelihood can be formulated as

\(-\log P(\mathbf {y}|\mathbf {x})=\frac{1}{2\sigma ^2}\Vert \mathbf {A}\mathbf {x}-\mathbf {y}\Vert _2^2+c\)    (4)

where c is a constant. For brevity, we will omit the weight \(\frac{1}{2\sigma ^2}\) and the constant c in the following formulations.
For MISR with k independent LR images \(\mathbf {y}_i\), \(i\in [1\ldots k]\), the posterior probability can be extended as

\(P(\mathbf {x}|\mathbf {y}_1\ldots \mathbf {y}_k)=\frac{\prod _{i=1}^{k}P(\mathbf {y}_i|\mathbf {x})\,P(\mathbf {x})}{P(\mathbf {y}_1\ldots \mathbf {y}_k)}\)    (5)
and the data fidelity term is hence formulated as

\(D(\mathbf {x})=\sum _{i=1}^{k}\Vert \mathbf {A}_i\mathbf {x}-\mathbf {y}_i\Vert _2^2\)    (6)
It should be noted that in the case of additive white Laplacian noise, which models impulse noise (salt-and-pepper noise), we obtain an L1 norm data fidelity term [38]. The L1 norm data term usually has better robustness against pixel outliers [23]. Without loss of generality, the data fidelity term can be formulated as

\(D(\mathbf {x})=\sum _{i=1}^{k}\Vert \mathbf {A}_i\mathbf {x}-\mathbf {y}_i\Vert _p^p\)    (7)

with the \(L_p\) norm, \(p\in \{1,2\}\).
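A minimal sketch of the \(L_p\) data fidelity term, with the system matrices abstracted as callables; the identity operator in the toy check is purely for illustration:

```python
import numpy as np

def data_fidelity(x, frames, ops, p=1):
    """D(x) = sum_i ||A_i x - y_i||_p^p with p in {1, 2};
    ops[i] is a callable applying the system matrix A_i."""
    return float(sum(np.sum(np.abs(A(x) - y) ** p) for A, y in zip(ops, frames)))

# Toy check with A_i = identity (hypothetical, just to exercise the formula):
x = np.array([1.0, 2.0, 3.0])
ys = [np.array([1.0, 2.0, 2.0]), np.array([0.0, 2.0, 3.0])]
ident = lambda v: v
print(data_fidelity(x, ys, [ident, ident], p=1))  # -> 2.0
```

With p = 1 the residuals are penalized by their absolute value, which is what gives the robustness against outliers mentioned above.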
In the literature, there are several well-known hand-crafted image priors \(P(\mathbf {x})\) including the total variation (TV) [39], the Huber-Markov prior [40], the bilateral total variation (BTV) [23], the non-local total variation (NLTV) [41], and the more recent bilateral spectrum weighted total variation (BSWTV) [27]. In this paper, aiming to reduce the computational complexity, we leverage the BTV as the image prior, and the regularization term is therefore expressed as

\(R(\mathbf {x})=\sum _{\mathbf {d}}\gamma (\mathbf {d})\Vert \mathbf {x}-S_\mathbf {d}\mathbf {x}\Vert _1\)    (8)

where \(\mathbf {d}\in \mathbb {N}^2\) with \(d_x, d_y\in [0,w-1]\) and w denotes the window size accounting for the neighbors in the x, y directions. \(S_\mathbf {d}\) represents the shifting operator along the x and y axes by \(d_x\) and \(d_y\) pixels. \(\gamma (\mathbf {d}):=\alpha ^{d_x+d_y}\) embodies the spatial decaying effect with constant \(\alpha <1\).
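The BTV prior can be sketched directly from its definition. The circular shift via np.roll is a simplification of the border handling, used here only to keep the sketch short:

```python
import numpy as np

def btv(x, w=2, alpha=0.4):
    """Bilateral total variation: sum over d = (dx, dy) != (0, 0) of
    alpha**(dx + dy) * ||x - S_d x||_1, with 0 <= dx, dy <= w - 1."""
    total = 0.0
    for dx in range(w):
        for dy in range(w):
            if dx == 0 and dy == 0:
                continue
            # S_d x: shift by dx columns and dy rows (circular, for brevity)
            shifted = np.roll(np.roll(x, dx, axis=1), dy, axis=0)
            total += alpha ** (dx + dy) * np.abs(x - shifted).sum()
    return total
```

A constant image incurs zero penalty, while any intensity variation is penalized with a weight that decays geometrically with the shift distance.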
As the denominator of Eq. (5) is independent of the image \(\mathbf {x}\), maximizing the posterior probability \(P(\mathbf {x}|\mathbf {y}_1\ldots \mathbf {y}_k)\) is equivalent to minimizing the negative logarithm of the numerator, whose two terms are formulated, respectively, in Eqs. (7) and (8). Hence, we yield the overall objective function based on the MAP framework as follows:

\(\min _{\mathbf {x}}\; f(\mathbf {x})=\sum _{i=1}^{k}\Vert \mathbf {A}_i\mathbf {x}-\mathbf {y}_i\Vert _p^p+\lambda R(\mathbf {x})\)    (9)

where the scaling factor \(\frac{1}{2\sigma ^2}\) of the fidelity term in Eq. (4) is absorbed into the weighting parameter \(\lambda\). In the experiments, we used the L1 norm data term for better robustness.
To accelerate the computation and alleviate the GPU memory load, especially when coping with a sequence of large input images, we distribute the computational demand over multiple GPUs by data parallelism and follow a consensus-based convergence manner to guarantee a centralized solution. The latent SR image \(\mathbf {x}\) is finally obtained by data fusion. In particular, Eq. (9) can be rewritten as

\(f(\mathbf {x})=\sum _{i=1}^{k}D_i(\mathbf {x})+\lambda R(\mathbf {x})\)    (10)
with \(D_i\) representing the corresponding data term and R being the regularization term. In this regard, the subfunction associated with the hth GPU is expressed as

\(f_h(\mathbf {x}_h)=\sum _{i=1}^{k}D_i(\mathbf {x}_h)+\lambda R(\mathbf {x}_h),\quad h\in [1\ldots g]\)    (11)

where \(\mathbf {x}_h\) is the fraction of the latent image \(\mathbf {x}\) assigned to the hth GPU and g denotes the number of employed GPUs. To drive the distributed optimization towards a centralized solution, we allow communication between the local GPU nodes and the host CPU for a consensus update decision. Specifically, we utilize the SCG algorithm [42] to iteratively solve the subproblem described in Eq. (11) on each GPU. Instead of using a hand-crafted step size or performing line search, SCG employs a step size scaling mechanism based on an adaptive scalar, which achieves a faster and more robust convergence than widely used approaches such as conjugate gradient with line search (CGL) and Broyden–Fletcher–Goldfarb–Shanno (BFGS).
Aiming to synchronize the update of the individual \(\mathbf {x}_h\) towards a centralized solution, we unify the local SCG scalar variables \(\sigma , \lambda , \delta , \mu , \alpha\) by data communication. As these variables are calculated from inner products of vectors, we can obtain the consensus variables by aggregating the broadcast local ones. By means of the consensus variables, the subfunctions converge synchronously and a homogeneous resolution among the GPUs is guaranteed. In Table 1, we list the unified scalar variables and vectors (in bold) of SCG.
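The key observation behind the cheap synchronization is that an inner product of partitioned vectors is the sum of per-partition inner products, so only one scalar per GPU must cross the bus per variable. A minimal sketch under this assumption (plain Python lists stand in for the per-GPU buffers):

```python
import numpy as np

def consensus_inner(parts_a, parts_b):
    """Each node h holds partitions a_h, b_h of global vectors a, b.
    Since <a, b> = sum_h <a_h, b_h>, the host only needs to aggregate
    g scalars per SCG iteration -- no large-vector transfer."""
    local = [float(a @ b) for a, b in zip(parts_a, parts_b)]  # per-GPU scalars
    return sum(local)                                         # host-side all-reduce

# The consensus value equals the centralized inner product:
rng = np.random.default_rng(0)
a, b = rng.random(16), rng.random(16)
parts_a, parts_b = np.split(a, 4), np.split(b, 4)
assert np.isclose(consensus_inner(parts_a, parts_b), a @ b)
```

Every SCG scalar built from such inner products therefore takes exactly the same value on all nodes as it would in a single-device run, which is what keeps the convergence rates identical.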
In addition, to avoid border discontinuity between neighboring partitions, region overlapping between neighboring GPUs is required. Instead of naively averaging the overlapped regions, which sacrifices sharpness and visual quality, we perform an inner-outer border exchange in each SCG iteration as shown in Fig. 1. A 4-GPU system is depicted, where each GPU deals with its allocated image partition \(\mathbf {x}_h\). The overlapped regions marked in violet are exchanged between neighboring GPUs. In particular, since the inner borders can be correctly calculated only if the outer borders are consistent with the neighboring GPUs, the outer borders are replaced by the received ones and the inner borders are broadcast to the neighbors as exhibited in Fig. 1b. Consequently, an agreement in the overlapped regions is achieved as shown in Fig. 1c without compromising image sharpness. Without loss of generality, assuming g GPU nodes are employed, the architecture of the proposed multi-GPU framework for SR is illustrated in Fig. 2. The local variables and overlapped regions are interchanged in each SCG iteration via the host CPU and updated in a consensus scheme.
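The inner-outer exchange for two vertically adjacent partitions can be sketched as follows. The partition layout (b owned "inner" rows facing b ghost "outer" rows on each side of the cut) is an assumption made for the sketch:

```python
import numpy as np

def exchange_borders(top, bottom, b=2):
    """Inner-outer border exchange between two vertically adjacent partitions.

    Each partition carries b extra 'outer' (ghost) rows duplicating the
    neighbour's edge: top's last b rows mirror bottom's first owned rows,
    and bottom's first b rows mirror top's last owned rows. Per iteration,
    each node keeps its inner border and overwrites its outer border with
    the neighbour's freshly computed inner border."""
    top, bottom = top.copy(), bottom.copy()
    top[-b:] = bottom[b:2 * b]        # top's outer rows <- bottom's inner rows
    bottom[:b] = top[-2 * b:-b]       # bottom's outer rows <- top's inner rows
    return top, bottom
```

After the exchange, both copies of the overlap agree with the rows actually owned by each side, so no averaging (and hence no blurring) of the overlap is needed.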
In Algorithm 1, we present a detailed description of the proposed distributed optimization framework for MISR based on the SCG approach. The local GPU computation is marked in red and the centralized computation on the host CPU is denoted in blue. The local variables, overlapped regions, and consensus variables are exchanged after the local and central updates, respectively. The algorithm variables are initialized according to SCG [42], and the calculation of the system matrices \(\mathbf {A}_i\) is explained in Sect. 4. The SR image \(\mathbf {x}\) is fused when the SCG iterations are complete.
In the implementation, we used the OpenCL framework. In order to optimize the data deployment in GPU memory, we exploited the local memory in the kernel functions to the greatest extent. Sparse matrix representations were employed to calculate the system matrix \(\mathbf {A}_i=\mathbf {D}_i\mathbf {B}_i\mathbf {M}_i\) and its transpose \(\mathbf {A}^T_i\) due to the sparseness of the downsampling, blurring, and motion matrices. Although memory transfer of local variables and overlapped regions between the GPUs and the host CPU is required to maintain the consensus convergence, transfer of large amounts of data is obviated during the SR reconstruction. It is worth noting that the proposed distributed optimization framework is based on data parallelism and consensus SCG. It can easily be applied to other applications such as SISR and image denoising by replacing the objective function in Eq. (9).
Real-time MISR for CT
SR is always preferable in CT imaging, where spatial resolution plays a determinant role in image quality assessment. We applied the proposed FL-MISR to the industrial CT scanner shown in Fig. 3. During the CT acquisition, the object is rotated by 360\(^{\circ }\), and at each rotation angle, four LR projections (X-ray images) are captured via detector shifts rightwards, downwards, leftwards, and upwards by half a pixel as illustrated in Fig. 4. As soon as all four LR projections of the same view are collected, SR reconstruction is launched, as denoted in green along the scan time axis. This capture-reconstruct fashion repeats until the whole CT acquisition is accomplished. Since SR reconstruction usually takes less time than the accumulated time of projection acquisition (in red) and object rotation (in gray), SR can be performed in real time during the CT scan without introducing extra runtime. The super-resolved projections are utilized for CT reconstruction, and hence an improved spatial resolution in CT is achieved through the increased detector sampling rate. We demonstrate the experimental results in Sect. 4. It should be noted that since the same detector movement pattern is repeated for all rotation angles during the CT scan, the system matrices \(\mathbf {A}_i\) with \(i\in [1,4]\) are calculated once at the beginning of the CT acquisition and shared by all rotation angles.
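The capture-reconstruct overlap amounts to a producer-consumer pipeline: the acquisition thread emits a 4-frame set per angle while an SR worker processes the previous set. The sketch below illustrates only the scheduling pattern; the angle count, stub strings, and placeholder SR call are hypothetical:

```python
import queue
import threading

def acquisition_loop(out_q, n_angles=4, frames_per_angle=4):
    """Producer: emits one 4-frame LR set per rotation angle
    (a stub for the detector-shift capture of Fig. 4)."""
    for angle in range(n_angles):
        frames = [f"angle{angle}_pos{p}" for p in range(frames_per_angle)]
        out_q.put((angle, frames))
    out_q.put(None)  # signal: acquisition finished

def sr_worker(in_q, results):
    """Consumer: runs SR on each completed set while the next angle is
    still being captured, so SR adds no wall-clock time to the scan."""
    while (item := in_q.get()) is not None:
        angle, frames = item
        results[angle] = f"sr({len(frames)} frames)"  # placeholder for FL-MISR

q, results = queue.Queue(), {}
t = threading.Thread(target=sr_worker, args=(q, results))
t.start()
acquisition_loop(q)
t.join()
```

As long as one SR reconstruction finishes before the next 4-frame set is complete, the queue never grows and the scan time is unchanged.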
Experiments and results
In this section, we conduct extensive experiments to evaluate the performance of the proposed FL-MISR from different aspects, mainly resolution enhancement and computation acceleration. Specifically, FL-MISR is evaluated for real-time CT imaging based on synthetic and real-world CT measurements. Besides, the application of FL-MISR to natural images is evaluated using the public dataset DIV8K [36].
The CT measurements were carried out on the Nikon HMX ST 225 CT scanner shown in Fig. 3, which is equipped with a flat panel Varian PaxScan 4030E detector of pixel size 127 \(\times\) 127 \(\upmu m\). The detector is mounted on controllable linear stages for x- and y-positioning which support detector displacement with a movement accuracy of up to 1 \(\upmu m\). The focal spot size of the tungsten X-ray tube is power dependent, and for power under 7 W, which was used in our experiments, the effective focal spot size is about 6 \(\upmu m\) as measured with the JIMA RT RC04 micro chart.
The calculation of the system matrix \(\mathbf {A}_i\) is thoroughly described in our previous work [26]. For an upscaling of 2\(\times\) with half-pixel detector shift and a \(3\times 3\) Gaussian blur for \(\mathbf {B}_i\), a 12-row block area in the HR grid is required as the overlapped region between neighboring GPUs. The weighting parameters \(\lambda\) and \(\alpha\) were set to 0.05 and 0.4, respectively. The number of SCG iterations was limited to 20. In practice, a larger \(\lambda\) should be chosen in case of strong noise, and fewer SCG iterations should be used for fast CT acquisitions. To quantify the resolution enhancement of FL-MISR on CT systems, we adopted the modulation transfer function (MTF), which was measured according to the standard ASTM E1695.
Evaluation of FL-MISR on spatial resolution enhancement
Before we evaluate FL-MISR on CT imaging, we briefly introduce the CT system and the assessment metric. A CT scanner mainly consists of two components: the X-ray tube and the X-ray sensitive detector. The spatial resolution of the CT system is hence primarily limited by the focal spot size of the X-ray tube and the detector pixel size. Usually, the spatial resolution of imaging systems is assessed by the MTF, which is calculated as the normalized magnitude of the Fourier transform of the point spread function (PSF). The MTF of the CT system is formulated as \(\mathrm{MTF}_\mathrm{sys}=\mathrm{MTF}_\mathrm{fs}\cdot \mathrm{MTF}_\mathrm{det}\cdot \mathrm{MTF}_\mathrm{others}\), where \(\mathrm{MTF}_\mathrm{fs}\) and \(\mathrm{MTF}_\mathrm{det}\) respectively denote the MTF of the X-ray focal spot and of the detector. Other components such as the reconstruction algorithm, X-ray beam hardening, and the display monitor usually have less influence on the overall \(\mathrm{MTF}_\mathrm{sys}\). In this work, we perform sub-pixel detector shift to achieve a higher detector sampling rate, which leads to an effective improvement of \(\mathrm{MTF}_\mathrm{sys}\) when \(\mathrm{MTF}_\mathrm{det}\) dominates \(\mathrm{MTF}_\mathrm{fs}\), which is the case in many CT applications.
Evaluation on synthetic CT images
To analyze the effectiveness of sub-pixel detector shift for spatial resolution enhancement in CT, we first demonstrate the impact of \(\mathrm{MTF}_\mathrm{det}\) on \(\mathrm{MTF}_\mathrm{sys}\). To simplify the system model, we consider only the primary components and therefore set \(\mathrm{MTF}_\mathrm{sys} := \mathrm{MTF}_\mathrm{fs}\cdot \mathrm{MTF}_\mathrm{det}\). The \(\mathrm{MTF}_\mathrm{fs}\) is modeled by a Gaussian function and the \(\mathrm{MTF}_\mathrm{det}\) is represented by a sinc function due to the assumed rectangular shape of each pixel. As shown in Fig. 5, the left plot indicates the case where \(\mathrm{MTF}_\mathrm{fs}\) dominates \(\mathrm{MTF}_\mathrm{det}\), for instance when the object is extremely close to the X-ray source, and the right one depicts the situation where \(\mathrm{MTF}_\mathrm{det}\) dominates. The MTF of the detector with full pixel size and with half pixel size is denoted as \(\mathrm{Detector}_\mathrm{LR}\) and \(\mathrm{Detector}_\mathrm{HR}\), respectively. The MTF at \(10\%\) is usually considered the visible limit in practice and is marked by the gray dotted line. It is shown that halving the detector pixel size doubles the \(\mathrm{MTF}_\mathrm{det}\) and improves the overall \(\mathrm{MTF}_\mathrm{sys}\) effectively when \(\mathrm{MTF}_\mathrm{det}\) dominates, whereas in the case where \(\mathrm{MTF}_\mathrm{fs}\) dominates, \(\mathrm{MTF}_\mathrm{sys}\) shows a negligible improvement.
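The simplified model, a Gaussian \(\mathrm{MTF}_\mathrm{fs}\) times the |sinc| of a rectangular pixel aperture, can be computed directly. The focal spot width and frequency range below are illustrative assumptions, not the measured values of our scanner:

```python
import numpy as np

def mtf_sys(f, fs_sigma, pixel_pitch):
    """Simplified system MTF = MTF_fs * MTF_det.

    f           : spatial frequency in lp/mm
    fs_sigma    : assumed Gaussian focal-spot width (mm)
    pixel_pitch : detector pixel pitch (mm); rectangular aperture
                  gives MTF_det = |sinc(pitch * f)| (numpy sinc
                  is sin(pi x) / (pi x))."""
    mtf_fs = np.exp(-2.0 * (np.pi * fs_sigma * f) ** 2)
    mtf_det = np.abs(np.sinc(pixel_pitch * f))
    return mtf_fs * mtf_det

# Halving the pixel pitch (sub-pixel shift + SR) raises MTF_det everywhere
# below the original cut-off, lifting the overall system MTF:
f = np.linspace(0.0, 8.0, 200)
full = mtf_sys(f, fs_sigma=0.003, pixel_pitch=0.127)
half = mtf_sys(f, fs_sigma=0.003, pixel_pitch=0.0635)
```

Plotting `full` and `half` against `f` reproduces the qualitative behavior of Fig. 5: the gain from halving the pixel is large when the sinc term limits the product and vanishes when the Gaussian term does.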
Based on the analysis above, we evaluate FL-MISR on CT images quantitatively and qualitatively. Specifically, we conducted CT scans of an aluminium cylindrical phantom with a diameter of 20 mm, shown in Fig. 3b, which was fixed perpendicular to the rotation table, and of a QRM bar pattern resolution phantom, both at a magnification of 20. Considering these as the ground truth (GT), we simulated four sets of 0.5\(\times\) LR projections by shifting the GT projections rightwards, downwards, leftwards, and upwards by one pixel followed by a \(2 \times 2\) binning. The downscaled LR projections were fused by interpolation and by FL-MISR. As the inter-image offset is assumed to be one pixel and accurate, for the interpolation-based fusion we inserted the pixel values of the LR images into the corresponding integer locations in the HR grid. The super-resolved projections were then used for CT reconstruction by filtered backprojection (FBP). The CT cross sections of the aluminium cylindrical phantom and the associated MTF are demonstrated in Fig. 6. The LR CT was reconstructed from the reference (upper left) set of the downscaled projections. We can clearly see that FL-MISR resembles the MTF of the GT extremely well and almost doubles the MTF of the LR image. To illustrate the performance of FL-MISR visually, we present the CT images of the QRM bar pattern target in Fig. 7. FL-MISR provides a more pleasant result with sharper structures and better visual quality.
Evaluation on real-world CT images
As the spatial resolution of CT systems depends on the magnification, we evaluate FL-MISR on real-world CT scans at different magnifications. In particular, we conducted CT measurements of aluminium cylindrical phantoms with diameters of 10 mm and 20 mm, a QRM bar pattern phantom with spatial resolution ranging from 3.3 to 100 lp/mm, a QRM bar pattern nano phantom which covers resolutions from 50 to 500 lp/mm, and a cylindrical dry concrete joint with a diameter of 50 mm. The aluminium cylindrical phantoms and the QRM resolution targets were both scanned at magnifications of 5 (voxel size of 25.4 \(\upmu m\)), 10 (voxel size of 12.7 \(\upmu m\)), and 25 (voxel size of 5.08 \(\upmu m\)), and the concrete joint was acquired at magnifications of 3 (voxel size of 42.3 \(\upmu m\)) and 5. The detailed measurement setup is summarized in Table 2. As illustrated in Fig. 4, the X-ray detector was repeatedly displaced clockwise by half a pixel in a precisely controlled way. The projection at each detector position took 3 s, i.e., at each rotation angle 4 \(\times\) 3 s was required for the acquisition. The object table rotated over 360\(^\circ\) with 0.1 degree resolution in a stop-move manner, and hence in total 4 \(\times\) 3600 projections were taken. Aluminium filters were utilized to absorb the soft X-ray beam and suppress beam hardening artifacts. We compare FL-MISR with multi-image interpolation and with standard CT without detector shift, where the exposure time was set to 12 s, the same as for FL-MISR.
In Fig. 8, we demonstrate the MTF measured with the aluminium cylindrical phantoms at different magnifications according to the standard ASTM E1695. It is shown that FL-MISR performs significantly better than standard CT at all the investigated magnifications, covering voxel sizes down to 5.08 \(\upmu m\). The multi-image interpolation performs worse than FL-MISR, as expected, due to its naive manner of fusion.
The CT images of the QRM bar pattern phantom and the QRM bar pattern nano phantom are illustrated in Fig. 9 with the corresponding close-up views. Compared to the standard CT images, we can observe that FL-MISR and multi-image interpolation both improve the spatial resolution by exploiting the additional information captured via sub-pixel detector shift. However, multi-image interpolation is less robust than the optimization-based FL-MISR. FL-MISR generates sharper edges and provides more pleasant results in visual perception. In fact, the spatial resolution estimated by the visibility of the QRM bar patterns coincides with the MTF measured with the cylindrical phantoms.
In Fig. 10, we illustrate the CT images of a dry concrete joint with a zoomed-in region of interest (ROI). Fig. 10a and b show, respectively, the results of standard CT without detector shift and of FL-MISR at a magnification of 3. Fig. 10c exhibits the result of standard CT at a magnification of 5, which is considered the reference image. It is shown that, compared to standard CT with a voxel size of 42.3 \(\upmu m\) at a magnification of 3, FL-MISR generates sharper contours with more detailed structures and resembles the CT measurement at a magnification of 5 more closely.
Evaluation on border effect and consensus convergence
As explained in Fig. 1, we exchange the overlapped regions between neighboring GPUs to avoid border discontinuity. In Fig. 11, we demonstrate the super-resolved projections and the associated CT images of the synthetic (top row) and the real-world measurements (bottom row). For the synthetic image, we employed four GPUs, and for the real-world one, two GPUs were in use. The individual \(\mathbf {x}_h\) of each GPU is delimited by the red dotted lines. We can observe that the overlapped regions, a 12-row block surrounding the borders (the red dotted lines), are of inherent sharpness without intensity discontinuity, and the border effect is fundamentally obviated. Besides, to avoid inhomogeneous resolution in different partitions, we synchronize the update of the partitioned \(\mathbf {x}_h\) among all the GPUs by exchanging the local variables of SCG. In Fig. 12, we illustrate the convergence curve of the centralized objective of Eq. (9) running on a single GPU and of the distributed objective of Eq. (11) running on four GPUs. The consensus convergence is reflected in two aspects. First, the four GPUs have exactly the same convergence trend, with their curves almost overlaid, due to the sharing of the SCG variables. Second, the distributed objective follows the same convergence trend as the centralized one, and moreover, the sum of the four distributed objectives equals the centralized one thanks to the scheme we adopt for the calculation of the consensus variables of SCG as described in Sect. 3.1. In addition, we can observe that the objective function has almost converged after 5 SCG iterations.
Evaluation on natural images
Since the distributed optimization of FL-MISR is based on data parallelism, FL-MISR is not limited to a certain application. We evaluate the proposed FL-MISR on natural images using the public dataset DIV8K [36]. In particular, we randomly selected 7 natural images with vertical or horizontal resolution ranging from 1920 to 5760 pixels as the GT. For each GT image, 4 and 9 LR images were generated for upscaling factors of 2\(\times\) and 3\(\times\), respectively, according to Eq. (1) with \(\varepsilon \sim N(0,1)\) and translational movements of \(\frac{1}{2}\) and \(\frac{1}{3}\) pixel. We performed SR reconstruction only for the luminance channel on 4 GPUs and set the number of SCG iterations to 10. The SR performance is assessed by PSNR, SSIM, and runtime. The quantitative evaluation is summarized in Table 3. As we can see, the proposed FL-MISR outperforms the multi-image interpolation by a large margin in PSNR and SSIM. Although the iterative FL-MISR requires 2\(\sim\)5\(\times\) the runtime of the naive interpolation, it supports an SR output of 5760 \(\times\) 5760 resolution within 1.3 s for an upscaling of 2\(\times\) and within 1.93 s for an upscaling of 3\(\times\). It is interesting to note that the runtime of SCG depends not only on the image size but also on the number of iterations with a successful reduction of the objective function, as explained in [42]. In Fig. 13, we illustrate the close-up views of images \(\textit{0002}, \textit{0027}\), and \(\textit{0055}\) of DIV8K. The top two rows demonstrate the results for an upscaling of 2\(\times\) and the bottom row for an upscaling of 3\(\times\). We can observe that FL-MISR provides pleasant results with significantly better visual quality than the multi-image interpolation.
Evaluation of FL-MISR on acceleration
To demonstrate the acceleration performance of FL-MISR, we conducted SR reconstruction of inputs of different sizes ranging from 512 \(\times\) 512 to 4096 \(\times\) 4096 for an upscaling factor of 2\(\times\) on a multi-core CPU, a single GPU, and multi-GPU systems. In particular, the CPU experiments were performed on the Intel Xeon Gold 5120 CPU with 755 GB memory, which contains two nodes, each equipped with 28 cores. The GPU experiments were carried out on Nvidia GeForce GTX 1080 GPUs with 11 GB memory. Since FL-MISR is based on the iterative SCG algorithm, we also evaluated the runtime with regard to the number of SCG iterations. Besides, we report the runtime of the multi-image interpolation as the baseline. The performance of the different configurations was averaged over 100 runs and is summarized in Table 4, where N/A denotes not applicable due to the large GPU memory footprint. As we can see, compared to the 56-core CPU variant, the single-GPU implementation accelerates the computation by more than 25\(\times\) for LR images of size 2048 \(\times\) 2048, and the multi-GPU implementation using 4 GPUs achieves a speedup of up to 50\(\times\). For large-scale images of size 2300 \(\times\) 3200 and 4096 \(\times\) 4096, FL-MISR running on 4 GPUs obtains a more than 55\(\times\) speedup over the CPU implementation, while a single GPU cannot fulfill the memory requirement. For small-sized inputs such as 512 \(\times\) 512 and 1024 \(\times\) 1024, the single-GPU implementation performs similarly to the multi-GPU one and achieves a 20\(\times\) speedup compared to the multi-core CPU. Although the iterative FL-MISR requires more runtime than the naive interpolation, FL-MISR achieves much better SR performance, and the runtime difference diminishes as the image dimension increases.
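The averaging protocol behind these figures is the standard one; a minimal timing harness (the helper names are our own, not from the paper's code) could look like:

```python
import statistics
import time

def avg_runtime(fn, runs=100):
    """Mean wall-clock runtime of fn() over `runs` repetitions,
    matching the 100-run averaging protocol used for the table."""
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()  # the workload under test, e.g. one SR reconstruction
        times.append(time.perf_counter() - t0)
    return statistics.mean(times)

def speedup(t_baseline, t_accelerated):
    """Speedup factor of the accelerated variant over the baseline."""
    return t_baseline / t_accelerated
```

On GPU systems, the timed callable would need to include device synchronization so that asynchronous kernel launches are fully accounted for.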
In addition, we analyzed the runtime distribution between the local and the central computation on a 4-GPU system, where the data communication time is aggregated into the central computation. We show the average runtime distribution over 100 runs for input images of different sizes in Fig. 14. It is shown that the time consumed by consensus computing is almost negligible compared to the local computation, while it is fundamentally necessary to avoid border effects between neighboring GPUs and to guarantee a consensus convergence over multi-GPU systems.
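Why the consensus step is so cheap can be seen from what it actually communicates: a handful of scalars per iteration plus the thin halo rows. The sketch below is our own illustration, under the assumption that the aggregated SCG quantities are sums of local inner products, which is what makes the distributed objective add up to the centralized one:

```python
import numpy as np

def consensus_sum(local_scalars):
    """The central step is essentially an all-reduce: one scalar per node
    is summed and the result is shared back with every node."""
    return float(sum(local_scalars))

# Sanity check: per-chunk squared norms sum to the centralized squared norm,
# so every node can derive the same SCG step from purely local reductions.
rng = np.random.default_rng(1)
r = rng.standard_normal(4096)   # a centralized residual vector
chunks = np.array_split(r, 4)   # one chunk per simulated GPU
assert np.isclose(consensus_sum(c @ c for c in chunks), r @ r)
```

Only one float per aggregated quantity crosses the GPU boundary, which is consistent with the negligible consensus time observed in Fig. 14.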
Conclusion
In this paper, we propose a multi-GPU accelerated large-scale multi-image super-resolution (MISR) framework based on data parallelism. Specifically, each GPU node accounts for a designated region of the latent high-resolution (HR) image by applying an adapted scaled conjugate gradient (SCG) algorithm to the distributed subproblem. The local variables of the SCG algorithm are broadcast and aggregated in each iteration to synchronize the convergence rate over the multi-GPU system towards a centralized optimum and a consistent resolution. Furthermore, an inner-outer border exchange mechanism is performed in the overlapped regions of neighboring GPUs to avoid border effects without compromising sharpness.
The proposed FL-MISR is seamlessly integrated into computed tomography (CT) systems by super-resolving projections of the same view captured via subpixel detector shift. The SR reconstruction is performed on the fly during the CT acquisition such that no additional computation time is induced. Extensive experiments were conducted on simulated data and real-world CT measurements of cylindrical phantoms, QRM bar pattern resolution targets, and cylindrical dry concrete joints to quantitatively and qualitatively evaluate the proposed FL-MISR. Experimental results demonstrate that the spatial resolution of CT systems is significantly improved in terms of modulation transfer function (MTF) and visual perception by the application of FL-MISR. Moreover, compared to a multi-core CPU implementation, the multi-GPU accelerated FL-MISR achieves a more than 50\(\times\) speedup on a 4-GPU system, and it is shown that the exchange of local SCG variables and overlapped regions between GPUs has limited impact on the overall runtime. Last but not least, the evaluation on the public dataset DIV8K shows that FL-MISR is not confined to CT imaging but also provides very promising results for natural images.
References
 1.
Park, S., Park, M., Kang, M.G.: Super-resolution image reconstruction: a technical overview. IEEE Signal Process. Mag. 20(5), 21–36 (2003)
 2.
Nasrollahi, K., Moeslund, T.B.: Super-resolution: a comprehensive survey. Mach. Vis. Appl. 25(6), 1423–1468 (2014)
 3.
Yang, W., Zhang, X., Tian, Y., Wang, W., Xue, J.: Deep learning for single image super-resolution: a brief review. IEEE Trans. Multimedia 21(12), 3106–3121 (2019)
 4.
Zhang, K., Zuo, W., Chen, Y., Meng, D., Zhang, L.: Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising. IEEE Trans. Image Process. 26(7), 3142–3155 (2017)
 5.
Mildenhall, B., Barron, J.T., Chen, J., Sharlet, D., Ng, R., Carroll, R.: Burst denoising with kernel prediction networks. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2502–2510 (2018)
 6.
Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image super-resolution. In: Proc. Eur. Conf. Comput. Vis., pp. 184–199 (2014)
 7.
Lim, B., Son, S., Kim, H., Nah, S., Lee, K.M.: Enhanced deep residual networks for single image super-resolution. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshop, pp. 136–144 (2017)
 8.
Balakrishnan, G., Zhao, A., Sabuncu, M.R., Guttag, J., Dalca, A.V.: VoxelMorph: a learning framework for deformable medical image registration. IEEE Trans. Med. Imaging 38(8), 1788–1800 (2019)
 9.
Sun, K., Simon, S.: FDRN: a fast deformable registration network for medical images. Med. Phys. 2021, 1–11 (2021)
 10.
Isensee, F., Jaeger, P.F., Kohl, S.A.A., Petersen, J., Maier-Hein, K.H.: nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods 18(2), 203–211 (2021)
 11.
Zhou, Z., Siddiquee, M.M.R., Tajbakhsh, N., Liang, J.: UNet++: redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans. Med. Imaging 39(6), 1856–1867 (2020)
 12.
Dong, C., Loy, C.C., Tang, X.: Accelerating the super-resolution convolutional neural network. In: Proc. Eur. Conf. Comput. Vis., pp. 391–407 (2016)
 13.
Kim, J., Lee, J.K., Lee, K.M.: Accurate image super-resolution using very deep convolutional networks. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1646–1654 (2016)
 14.
Zhang, K., Zuo, W., Zhang, L.: Deep plug-and-play super-resolution for arbitrary blur kernels. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1671–1681 (2019)
 15.
Wang, X., et al.: ESRGAN: enhanced super-resolution generative adversarial networks. In: Proc. Eur. Conf. Comput. Vis., pp. 1–16 (2018)
 16.
Pham, C.H., et al.: Multiscale brain MRI super-resolution using deep 3D convolutional networks. Comput. Med. Imaging Graph. 77, 101647 (2019)
 17.
Chen, Y., Shi, F., Christodoulou, A.G., Zhou, Z., Xie, Y., Li, D.: Efficient and accurate MRI super-resolution using a generative adversarial network and 3D multi-level densely connected network. In: Proc. Int. Conf. Med. Imag. Comp. Comput. Assist. Interv., pp. 91–99 (2018)
 18.
Kappeler, A., Yoo, S., Dai, Q., Katsaggelos, A.K.: Video super-resolution with convolutional neural networks. IEEE Trans. Comput. Imaging 2(2), 109–122 (2016)
 19.
Caballero, J., et al.: Real-time video super-resolution with spatio-temporal networks and motion compensation. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2848–2857 (2017)
 20.
Haris, M., Shakhnarovich, G., Ukita, N.: Recurrent back-projection network for video super-resolution. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3897–3906 (2019)
 21.
Sajjadi, M.S.M., Vemulapalli, R., Brown, M.: Frame-recurrent video super-resolution. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6626–6634 (2018)
 22.
Sun, K., Koch, M., Wang, Z., Jovanovic, S., Rabah, H.: An FPGA-based residual recurrent neural network for real-time video super-resolution. IEEE Trans. Circ. Syst. Video Technol. 2021, 1–12 (2021)
 23.
Farsiu, S., Robinson, M.D., Elad, M., Milanfar, P.: Fast and robust multi-frame super-resolution. IEEE Trans. Image Process. 13(10), 1327–1344 (2004)
 24.
Yue, L., Shen, H., Yuan, Q., Zhang, L.: A locally adaptive L1–L2 norm for multi-frame super-resolution of images with mixed noise and outliers. Signal Process. 105(1), 156–174 (2014)
 25.
Köhler, T., Huang, X., Schebesch, F., Aichert, A., Maier, A., Hornegger, J.: Robust multi-frame super-resolution employing iteratively reweighted minimization. IEEE Trans. Comput. Imaging 2(1), 42–58 (2016)
 26.
Sun, K., Tran, T., Krawtschenko, R., Simon, S.: Multi-frame super-resolution reconstruction based on mixed Poisson–Gaussian noise. Signal Process. Image Commun. 82, 115736 (2020)
 27.
Sun, K., Simon, S.: Bilateral spectrum weighted total variation for noisy-image super-resolution and image denoising. IEEE Trans. Signal Process. 1–13 (2021). arXiv:2106.00768
 28.
Huang, T., Tsai, R.: Multi-frame image restoration and registration. Adv. Comput. Vis. Image Process. 1, 317–339 (1984)
 29.
Stark, H., Oskoui, P.: High-resolution image recovery from image-plane arrays, using convex projections. J. Opt. Soc. Am. A 6(11), 1715–1726 (1989)
 30.
Elad, M., Hel-Or, Y.: A fast super-resolution reconstruction algorithm for pure translational motion and common space-invariant blur. IEEE Trans. Image Process. 10(8), 1187–1193 (2001)
 31.
Tipping, M.E., Bishop, C.M.: Bayesian image super-resolution. In: Adv. Neural Inf. Process. Syst., pp. 1303–1310 (2003)
 32.
Wetzl, J., Taubmann, O., Haase, S., Köhler, T., Kraus, M., Hornegger, J.: GPU-accelerated time-of-flight super-resolution for image-guided surgery. In: Bildverarbeitung für die Medizin 2013, pp. 21–26. Springer, Berlin (2013)
 33.
Xu, J., Liang, Y., Liu, J., Huang, Z., Liu, X.: Online multi-frame super-resolution of image sequences. EURASIP J. Image Video Process. 2018(1), 1–10 (2018)
 34.
Anger, J., Ehret, T., de Franchis, C., Facciolo, G.: Fast and accurate multi-frame super-resolution of satellite images. ISPRS J. Photo. Rem. Sens. 5(1), 1–8 (2020)
 35.
Kim, Y., Choi, J., Kim, M.: A real-time convolutional neural network for super-resolution on FPGA with applications to 4K UHD 60 fps video services. IEEE Trans. Circ. Syst. Video Technol. 29(8), 2521–2534 (2019)
 36.
Gu, S., Lugmayr, A., Danelljan, M., Fritsche, M., Lamour, J., Timofte, R.: DIV8K: diverse 8K resolution image dataset. In: IEEE Int. Conf. Comput. Vis. Workshop, pp. 3512–3516 (2019)
 37.
Sun, K., Kieß, S., Simon, S.: Spatial resolution enhancement based on detector displacement for computed tomography. In: Proc. Conf. Industrial Computed Tomography, pp. 1–8 (2019)
 38.
Rodríguez, P.: Total variation regularization algorithms for images corrupted with different noise models: a review. J. Electr. Comput. Eng. 2013, 5 (2013)
 39.
Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Phys. D: Nonlinear Phenom. 60(1–4), 259–268 (1992)
 40.
Huber, P.J.: Robust estimation of a location parameter. Ann. Math. Stat. 35(1), 73–101 (1964)
 41.
Gilboa, G., Osher, S.: Nonlocal operators with applications to image processing. Multiscale Model. Simul. 7(3), 1005–1028 (2009)
 42.
Møller, M.F.: A scaled conjugate gradient algorithm for fast supervised learning. Neural Netw. 6(4), 525–533 (1993)
Acknowledgements
This work was supported by the German Research Foundation (DFG, Germany) under the DFG project SI 587/18-1 in the priority program SPP 2187.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Sun, K., Tran, T.-H., Guhathakurta, J. et al. FL-MISR: fast large-scale multi-image super-resolution for computed tomography based on multi-GPU acceleration. J. Real-Time Image Proc. (2021). https://doi.org/10.1007/s11554-021-01181-0
DOI: https://doi.org/10.1007/s11554-021-01181-0
Keywords
 Super-resolution
 Computed tomography
 Distributed optimization
 Data parallelism
 MultiGPU
 Subpixel detector shift