FL-MISR: Fast Large-Scale Multi-Image Super-Resolution for Computed Tomography Based on Multi-GPU Acceleration

Multi-image super-resolution (MISR) usually outperforms single-image super-resolution (SISR) under a proper inter-image alignment by explicitly exploiting the inter-image correlation. However, the large computational demand encumbers the deployment of MISR in practice. In this work, we propose a distributed optimization framework based on data parallelism for fast large-scale MISR using multi-GPU acceleration named FL-MISR. The scaled conjugate gradient (SCG) algorithm is applied to the distributed subfunctions and the local SCG variables are communicated to synchronize the convergence rate over multi-GPU systems towards a consistent convergence. Furthermore, an inner-outer border exchange scheme is performed to obviate the border effect between neighboring GPUs. The proposed FL-MISR is applied to the computed tomography (CT) system by super-resolving the projections acquired by subpixel detector shift. The SR reconstruction is performed on the fly during the CT acquisition such that no additional computation time is introduced. FL-MISR is extensively evaluated from different aspects and experimental results demonstrate that FL-MISR effectively improves the spatial resolution of CT systems in modulation transfer function (MTF) and visual perception. Comparing to a multi-core CPU implementation, FL-MISR achieves a more than 50x speedup on an off-the-shelf 4-GPU system.


Introduction
Super-resolution (SR) is a fundamental task in image processing and has been an attractive research field for decades [1][2][3]. SR is an algorithm-based image enhancement technique dedicated to improving the spatial resolution of the imaging systems beyond the hardware limit by exploiting the low-resolution (LR) acquisitions and it is widely applied in many applications such as medical diagnostics, surveillance, and remote sensing.
In recent years, we have witnessed tremendous progress of deep learning in multiple image processing and computer vision tasks such as image denoising [4,5], super-resolution [6,7], deformable registration [8,9], and semantic segmentation [10,11]. Despite of the great success of deep learning in SR, most of the work focuses on single-image SR (SISR) [6,7,[12][13][14][15][16][17]. In fact, SR reconstruction can significantly benefit from the available correlated input images which are captured of the same view. Multi-image SR (MISR) exploits the correspondences entailed in the multiple input images and usually outperforms SISR when the relative movements between the reference image and the other input images are well estimated. However, the learning-based MISR approaches in the literature are mainly dedicated to video applications [18][19][20][21][22]. Besides, the quality of the learning-based methods highly depends on the fidelity of the training datasets. In practice, preparing synthetic datasets which adequately resemble the real-world measurements covering diverse imaging conditions would be challenging. Furthermore, arXiv:2108.04315v2 [eess.IV] 5 Oct 2021 although learning-based approaches are able to describe more sophisticated image priors, "hallucinated" structures can be unpredictably constructed which may impede the employment of the trained models in applications such as metrology and quality control.
Different from the deep learning-based SR methods, optimization-based MISR algorithms [23][24][25][26][27] reconstruct the latent high-resolution (HR) image explicitly based on the real acquisitions but not the training datasets. Nowadays, due to the technological development of sensor manufacturing, sensors or detectors with large resolutions such as 8, 16 Mpixels or even higher are employed in applications such as medical imaging and industrial inspection. Coping with largescale multi-image input can be computationally expensive and hardware costly. The optimization-based SR methods usually suffer from the iterative manner which leads to undesirable computation time. In this work, we present a multi-GPU accelerated framework for large-scale MISR reconstruction based on distributed optimization. The proposed framework is applied to the computed tomography (CT) imaging system and achieves a real-time SR reconstruction during the CT acquisition without introducing additional computation time. The contribution of this work can be summarized as following: -We propose a distributed optimization framework for MISR, named FL-MISR, dealing with large sized images based on multi-GPU acceleration. Each GPU accounts for an allocated partition and the latent SR image is obtained by image fusion. -In order to obtain a consistent resolution enhancement among all the GPUs, the update of the partitions is synchronized by unifying the local variables of the scaled conjugate gradient (SCG) method. To avoid border effect between neighboring GPUs, an inner-outer border exchange scheme is performed. -The proposed FL-MISR is applied to real-time CT imaging by super-resolving the projections acquired via subpixel detector shift. Extensive evaluation from different aspects demonstrates that FL-MISR not only achieves a significant resolution enhancement for CT systems but also provides very promising results for natural images. Comparing to a multi-core CPU implementation, FL-MISR achieves a more than 50× speedup on a 4-GPU system.

Optimization-Based Iterative Methods
In the literature, conventional optimization-based iterative SR methods can be traced back to 1980's and they are mainly grouped into two categories: the frequency domain based and the spatial domain based methods [1,2]. In [28], Huang et al. firstly address the MISR problem in the frequency domain. Although the frequency domain based methods have low computational complexity, they behave extremely sensitive to model errors and have limited ability to integrate a priori knowledge as regularization. The majority of the iterative MISR approaches solve the problem in the spatial domain based on the maximum likelihood (ML), the maximum a posteriori (MAP), and the projection onto convex sets (POCS) [23,25,27,[29][30][31][32][33].
Most of the work focuses on the reconstruction accuracy and only few concerns the performance in computation time. Specially, Elad et al. [30] propose a fast MISR algorithm concerning the special case of pure translation and space invariant blur. In [23], Farsiu et al. present a robust MISR method based on MAP using the L1 norm data fidelity term and the bilateral total variation regularization. Jens et al. [32] introduce a GPU-accelerated MISR approach for image-guided surgery which supports a 2× SR reconstruction from 4 LR images of size 200×200 in 60 ms. However, due to the GPU memory limit, their method can not handle large sized images. In [34], the authors propose a fast MISR method which composes of registration, fusion, and sharpening for satellite images using high-order spline interpolation. Nevertheless, purely image fusion is performed on a GPU and the rest two steps are on the CPU which results in a degraded performance in runtime.

Deep Learning-Based Methods
In the last decade, deep learning has been very successfully adopted in SR and has harvested fruitful results. Dong et al. [6] introduce the convolutional neural network (CNN) into SISR which demonstrates the great potential of CNN for feature extraction. Inspired by the distinguished performance of CNN, a series of work from plain CNN to densely connected GAN, from 2D natural image to 3D medical volume, has been successively proposed [7,12,13,[15][16][17]. Comparing to the traditional iterative methods, CNN-based SR approaches focus on super-resolving single LR image by exploiting the relation learned exclusively from the LR-HR image pairs in the external example database. The learning-based MISR methods are mainly proposed to cope with natural video streams [18][19][20][21]. Although some work is intended for real-time applications using GPU or FPGA [21,22,35], the video SR (VSR) performance highly relies on the fidelity of the synthesized LR-HR frame pairs and the quality of the training datasets.
Furthermore, since the supervised learning scheme requires the ground truth (GT) HR images during the training phase, the performance of the trained model will be limited by the available quality of the GT acquired in practice. It is especially true for CT imaging due to the lack of publicly available high-quality HR datasets like DIV8K [36] for natural images.
To the best of our knowledge, the literature on GPU-accelerated MISR methods for large-scale images is very limited despite of its importance. In this paper, we extend our previous work [37] mainly in two folds. First, the locally applied scaled conjugate gradient (SCG) algorithm is adapted to achieve a synchronized convergence rate over multi-GPU systems. Second, instead of performing region averaging, we employ the so-called inner-outer border exchange scheme to preserve the sharpness of the overlapped regions. Particularly, in [37] we introduce a multi-GPU implementation of a MISR method based on data parallelism where each GPU deals with an allocated partition of the latent SR image. Although overlapped regions between neighboring GPUs are exchanged and averaged, we found that the resolved SR image in [37] is not globally optimized and the fused SR image suffers from inhomogeneous resolution enhancement due to the inconsistent convergence rate of the local SCG and region averaging. We address these issues in this work and propose a generalized framework for multi-GPU supported MISR. We conducted extensive experiments to validate the proposed FL-MISR. Experimental results show that the exchange of local SCG variables and overlapped regions among GPUs has limited impact on the overall performance of runtime and leads to a consensus convergence over multi-GPUs without causing border effects. Besides, it is shown that super-resolving four input images of size 4096×4096 by an upscaling of 2× can be achieved within 2.4s on a 4-GPU system.

Distributed Optimization for MISR
The common formulation of SR model in the pixel domain is presented as with x ∈ R n×1 , y ∈ R m×1 being respectively the latent and captured image rearranged in lexicographic order. The system matrix A ∈ R m×n is usually expressed as A = DBM with D ∈ R m×n , B ∈ R n×n , and M ∈ R n×n describing the decimation, blurring, and motion effects, respectively. The vector ε(x) ∈ R m×1 denotes the additive noise existing in the imaging systems. More detailed description of the system model can be found in [27]. To simplify the calculation, in this work we assume ε(x) is an intensity-independent additive noise and the system matrix A is known.
Since SR is an ill-posed problem, involving a welldefined image prior can effectively constrain the solution domain. Therefore, MAP estimator is preferably adopted for SR reconstruction. The posterior probability P (x|y) of the SR image x is formulated based on the Bayes' theorem: Assuming the noise ε i ∈ ε in each pixel i is white Gaussian and i.i.d with ε i ∼ N (0, σ 2 ) and P (ε i ) = 1 √ 2πσ 2 e − ε 2 i 2σ 2 , we yield the likelihood function as Taking the natural logarithm, the associated negative log-likelihood can be formulated as where c is a constant. For brevity, we will omit the weight 1 2σ 2 and the constant c in the latter formulation. For MISR with k independent LR images y i , i ∈ [1 . . . k], the posterior probability can be extended as and the data fidelity term is hence formulated by It should be noted that in case of additive white Laplacian noise which models the impulse noise (Salt & Pepper noise), we have the L1 norm data fidelity term [38]. Usually, L1 norm data term has better robustness against pixel outliers [23]. Without loss of generality, the data fidelity term can be formulated as with the L p norm p ∈ {1, 2}.  Param. Description f c , f c new consensus of the objective function p c consensus of the conjugate weight vector r c , r c new consensus of the steepest descent direction σ c , λ c consensus of the scalars δ c , µ c consensus of the variables in step size α c consensus of the update step size In the literature, there are several well-known handcrafted image priors P (x) including the total variation (TV) [39], Huber-Markov prior [40], bilateral total variation (BTV) [23], nonlocal total variation (NLTV) [41], and the more recent bilateral spectrum weighted total variation (BSWTV) [27]. In this paper, aiming for reducing the computational complexity, we leverage the BTV as the image prior and the regularization term is therefore expressed as and w denotes the window size accounting for the neighbors in the x, y directions. S d represents the shifting operator along x and y axis by d x and d y pixels. γ(d) := α dx+dy embodies the spatial decaying effect with constant α < 1. (5) is independent from the image x, maximizing the posterior probability P (x|y 1 . . . y k ) is equivalent to minimizing the negative logarithm of the numerator which is formulated respectively in Eqs (7) and (8). Hence, we yield the overall objective function based on the MAP framework as following: where the scaling factor of the fidelity term 1 /2σ 2 in Eq. (4) is actually absorbed into the weighting param-eter λ. In the experiments, we have used the L1 norm data term for a better robustness. In order to accelerate the computation and alleviate the GPU memory load especially when coping with a sequence of large input images, we distribute the computational demand over multi-GPUs by data parallelism and follow a consensus-based convergence manner to guarantee a centralized solution. The latent SR image x is finally obtained by data fusion. In particular, Eq. (9) can be rewritten as with D i representing the corresponding data term and R being the regularization term. In this regard, the subfunction associated with the hth GPU is expressed as where x h is a fraction of the latent image x assigned to the hth GPU and g denotes the number of employed GPUs. To enforce the distributed optimization towards a centralized solution, we allow communication between the local GPU node and the host CPU for a consensus update decision. Specially, we utilize the SCG algorithm [42] to iteratively solve the subproblem described in Eq. (11) in each GPU. Instead of using the handcrafted step size or performing line search, SCG employs a step size scaling mechanism based on an adaptive scalar which achieves a faster and more robust convergence than the widely used approaches such as conjugate gradient with line search (CGL) and Broyden-Fletcher-Goldfarb-Shanno (BFGS). Aiming for synchronizing the update of the individual x h towards a centralized solution, we unify the local SCG scalar variables σ, λ, δ, µ, α by data communication. As these variables are calculated based on the inner product of vectors, we can obtain the consensus  variables by the aggregate of the broadcast local ones. By means of consensus variables, the subfunctions can converge synchronically and a homogeneous resolution among multi-GPUs is guaranteed. In Table 1, we list the unified scalar variables and vectors (in bold) of SCG.
In addition, to avoid border discontinuity of neighboring partitions, region overlapping between neighboring GPUs is required. Instead of the naive averaging of the overlapped regions which sacrifices the sharpness and visual quality, we perform an inner-outer border exchange in each SCG iteration as shown in Fig. 1. A 4-GPU system is demonstrated and each GPU deals with the allocated image partition x h . The overlapped regions marked in violet are exchanged between neighboring GPUs. Particularly, since the inner borders can be correctly calculated only in case that the outer borders are consistent with the neighboring GPUs, the outer borders are replaced by the received ones and the inner borders are broadcast to the neighbors as exhibited in Fig. 1b). Consequently, an agreement in the overlapped regions is achieved as shown in Fig. 1c) without compromising the image sharpness. Without loss of generality, assuming g GPU nodes are employed, the architecture of the proposed multi-GPU framework for SR is illustrated in Fig. 2. The local variables and overlapped regions are interchanged in each SCG iteration over the host CPU and updated in a consensus scheme.
In Algorithm 1, we present a detailed description of the proposed distributed optimization framework for MISR based on the SCG approach. The local GPU computation is marked by red and the centralized computation in the host CPU is denoted in blue. The local variables, overlapped regions, and consensus variables are respectively exchanged after the local and central update. The algorithm variables are initialized based  on SCG [42] and the calculation of the system matrix A i is explained in Section 4. The SR image x is fused when the SCG iterations are complete.
In the implementation, we have used the OpenCL framework. In order to optimize the data deployment on GPU memory, we exploited the local memory in the kernel functions to the most extent. Sparse matrix was employed to calculate the system matrix A i = D i B i M i and the transpose A T i due to the sparseness of the downsampling, blurring, and motion matrices. Although memory transfer of local variables and overlapped regions between the GPU and host CPU is intended to hold the consensus convergence, transfer of  object table  object table  object table  object table   θ  large amounts of data is obviated during the SR reconstruction. It is worthy noting that the proposed distributed optimization framework is based on data parallelism and consensus SCG. It can be easily applied to other applications such as SISR and image denoising by replacing the objective function in Eq. 9.

Real-Time MISR for CT
SR is always preferable in CT imaging where spatial resolution plays a determinant role in image quality assessment. We have applied the proposed FL-MISR on the industrial CT scanner as shown in Fig. 3. During the CT acquisition, the object is rotated by 360 • and at each rotation angle, four LR projections (X-ray images) are captured via detector shift rightwards, downwards, leftwards, and upwards by half a pixel as illustrated in Fig. 4. As long as all the four LR projections of the same view are collected, SR reconstruction is launched as denoted in green along the scan time axis. The capture-reconstruct fashion repeats until the whole CT acquisition is accomplished. Due to the fact that SR reconstruction usually takes less time than the accumulated time of projection acquisition (in red) and object rotation (in gray), SR can be performed in real-time during the CT scan without introducing extra runtime. The super-resolved projections are utilized for CT reconstruction and hence, an improved spatial resolution in CT is achieved by the increased detector sampling rate. We demonstrate the experimental results in Section 4. It is necessary to note that since the same detector movement pattern is repeated for all the rotation angles during CT scan, the system matrices A i with i ∈ [1,4] are calculated once at the beginning of the CT acquisition and shared by all the rotation angles.

Experiments and Results
In this section, we conduct extensive experiments to evaluate the performance of the proposed FL-MISR from different aspects, mainly on resolution enhancement and computation acceleration. Specially, FL-MISR is evaluated for real-time CT imaging based on the synthetic and real-world CT measurements. Besides, the application of FL-MISR on natural images is evaluated using the public dataset DIV8K [36].
The CT measurements were carried out on the Nikon HMX ST 225 CT scanner as shown in Fig. 3 which is equipped with a flat panel Varian PaxScan 4030E detector of pixel size 127×127 µm. The detector is mounted on the controllable linear stages for xand y-positioning which supports detector displacement with a movement accuracy up to 1 µm. The focal spot size of the tungsten X-ray tube is power dependent and for the power under 7 W , which was utilized in our experiments, the effective focal spot size is about 6 µm measured by the JIMA RT RC-04 micro chart.
The calculation of the system matrix A i is thoroughly described in our previous work [26]. For an upscaling of 2× with half pixel detector shift and a 3 × 3 Gaussian blur for B i , a 12-row block area in the HR grid is required as the overlapped region between neighboring GPUs. The weighting parameters λ and α were respectively set as 0.05 and 0.4. The SCG iteration was limited to 20. In practice, larger λ should be opted in case of strong noise and fewer SCG iterations should be used for fast CT acquisitions. To quantify the resolution enhancement by FL-MISR on CT systems, we adopted the modulation transfer function (MTF) which was measured according to the standard ASTM-E 1695.

Evaluation of FL-MISR on Spatial Resolution Enhancement
Before we evaluate FL-MISR on CT imaging, we briefly introduce the CT system and the assessment metric. CT scanner mainly consists of two components: the Xray tube and the X-ray sensitive detector. The spatial resolution of the CT system is hence primarily limited by the focal spot size of the X-ray tube and the detector pixel size. Usually, spatial resolution of imaging systems is assessed by the MTF which is calculated as the normalized magnitude of the Fourier Transform of the point spread function (PSF). The MTF of the CT system is formulated by M T F sys = M T F f s · M T F det · M T F others , where M T F f s and M T F det respectively denote the MTF of the X-ray focal spot and the detector. Other components such as the reconstruction algorithm, X-ray beam hardening, and display monitor are usually of less influence on the overall M T F sys . In this work, we perform subpixel detector shift to achieve a higher detector sampling rate which will lead to an effective improvement of M T F sys when M T F det dominates M T F f s , which is usually the case in many CT applications.

Evaluation on Synthetic CT Images
In order to analyze the effectiveness of subpixel detector shift on the spatial resolution enhancement in CT, we firstly demonstrate the impact of M T F det on the M T F sys . To simplify the system model, we consider only the primary components and therefore, we yield M T F sys := M T F f s · M T F det . The M T F f s is modeled by a Gaussian function and the M T F det is represented by a sinc function due to the assumed rectangular shape of each pixel. As shown in Fig. 5, the left plot indicates the case where M T F f s dominates M T F det , for instance when the object is extremely close to the X-ray source and the right one depicts the situation where M T F det dominates. The MTF of the detector with full pixel size and with half pixel size is respectively denoted as Detector LR and Detector HR . The MTF at 10% is usually considered as the visible limit in practice and is marked by the gray dotted line. It is shown that halving the detector pixel size doubles the M T F det and improves the overall M T F sys effectively when M T F det dominates, while for the case M T F f s dominates, M T F sys has a negligible improvement.
Based on the analysis above, we evaluate FL-MISR on the CT images quantitatively and qualitatively. Specially, we conducted CT scans of an aluminium cylindrical phantom with a diameter of 20 mm as shown in Fig. 3b) which was fixed perpendicular to the rotation table and a QRM bar pattern resolution phantom at the magnification of 20. Considering them as the ground truth (GT), we simulated four sets of 0.5× LR projections by shifting the GT projections rightwards, downwards, leftwards, and upwards by one pixel followed by a 2 × 2 binning. The downscaled LR projections were fused by interpolation and by FL-MISR. As the inter-image offset is assumed to be one pixel and accurate, for interpolation-based fusion we inserted the pixel values of the LR images into the corresponding integer location in the HR grid. The super-resolved projections were then used for CT reconstruction by filter  backprojection (FBP). The CT cross sections of the aluminium cylindrical phantom and the associated MTF are demonstrated in Fig. 6. The LR CT was reconstructed by the reference (upper left) set of the downscaled projections. As we can clearly see that FL-MISR resembles the MTF of the GT extremely well and almost doubles the MTF of the LR image. To illustrate the performance of FL-MISR visually, we present the CT images of the QRM bar pattern target in Fig. 7. It is shown that FL-MISR provides a more pleasant result with sharper structures and better visual quality.

Evaluation on Real-World CT Images
As the spatial resolution of CT systems depends on the magnification, we evaluate FL-MISR on the real-world CT scans at different magnifications. Particularly, we conducted CT measurements of aluminium cylindrical phantoms with diameters of 10 mm and 20 mm, QRM bar pattern phantom with spatial resolution ranging from 3.3 lp/mm to 100 lp/mm, QRM bar pattern nano phantom which covers resolution from 50 lp/mm to 500 lp/mm, and a cylindrical dry concrete joint with a diameter of 50 mm. The aluminium cylindrical phantoms and the QRM resolution targets were both scanned at magnifications of 5 (voxel size of 25.4 µm), 10 (voxel size of 12.7 µm), and 25 (voxel size of 5.08 µm) and the concrete joint was acquired at magnifications of 3 (voxel size of 42.3 µm) and 5. The detailed measurement setup is summarized in Table 2. As illustrated in Fig. 4, the X-ray detector was repeatedly displaced clockwise by half a pixel in a precisely controlled way. The projection at each detector position took 3 s, namely at each rotation angle 4×3 s was required for the acquisition. The object table rotated over 360 • with 0.1 degree resolution following a stop-move manner and hence in total 4×3600 projections were taken. Aluminium filters were utilized to absorb the soft X-ray beam and suppress the  beam hardening artifact. We compare FL-MISR with multi-image interpolation and the standard CT without detector shift where the exposure time was set as 12 s, the same as FL-MISR.
In Fig. 8, we demonstrate the MTF measured by the aluminium cylindrical phantoms at different magnifications according to the standard ASTM-E 1695. It is shown that FL-MISR performs significantly better than the standard CT at all the investigated magnifications covering voxel size up to 5.08 µm. The multi-image interpolation behaves worse than FL-MISR as expected due to the naive manner of fusion.
The CT images of the QRM bar pattern phantom and QRM bar pattern nano phantom are illustrated in Fig. 9 with the corresponding closeup views. Comparing to the standard CT images, we can observe that FL-MISR and multi-image interpolation both improve the spatial resolution by exploiting the additional information captured via subpixel detector shift. However, multi-image interpolation is less robust than the optimization-based FL-MISR. FL-MISR generates sharper edges and provides more pleasant results in visual perception. In fact, the spatial resolution estimated by the visibility of the QRM bar patterns coincides with the MTF measured by the cylindrical phantoms.
In Fig. 10, we illustrate the CT images of a dry concrete joint with the zoomed-in region of interest (ROI) .  Fig 10a and Fig 10b represent respectively the results of the standard CT without detector shift and FL-MISR at the magnification of 3. Fig 10c exhibits the results of standard CT at the magnification of 5 which is considered as the reference image. It is shown that comparing to the standard CT with a voxel size of 42.3 µm at the magnification of 3, FL-MISR generates sharper contours with more detailed structures which resembles the CT measurement at the magnification of 5 better.

Evaluation on Border Effect and Consensus Convergence
As explained in Fig. 1, we exchange the overlapped regions between neighboring GPUs to avoid border discontinuity. In Fig 11, we demonstrate the superresolved projections and the associated CT images of the synthetic (top row) and the real-world measurements (bottom row). For the synthetic image, we employed four GPUs and for the real-world one, two GPUs were in use. The individual x h of each GPU is partitioned by the red dotted line. As we can observe that the overlapped regions, a 12-row block surrounding the borders (the red dotted lines), are of inherent sharpness without intensity discontinuity and the border effect is fundamentally obviated. Besides, in order to avoid inhomogeneous resolution in different partitions, we synchronize the update of the partitioned x h among all the GPUs by exchanging the local variables of SCG. In Fig. 12, we illustrate the convergence curve of the centralized objective of Eq. 9 running on a single GPU and the distributed objective of Eq. 11 running on 4 GPUs. The consensus convergence is reflected in two aspects. First, the 4 GPUs have exactly the same convergence trend, where they are almost overlaid, due to the share of the SCG variables. Second, the distributed objective follows the same convergence trend as the centralized one and moreover, the sum of the 4 distributed objectives equals the centralized one by resorting to the scheme we adopt for the calculation of the consensus variables of SCG as described in Section 3.1. In addition, we can observe that the objective function is almost converged after 5 SCG iterations.

Evaluation on Natural Images
Since the distributed optimization of FL-MISR is based on data parallelism, FL-MISR is not limited to a certain application. We evaluate the proposed FL-MISR on natural images using the public dataset DIV8K [36]. Particularly, we randomly selected 7 natural images  with the vertical or horizontal resolution ranging from 1920 to 5760 pixels as the GT. For each GT image, 4 and 9 LR images were respectively generated for upscaling factors of 2× and 3× according to Eq. 1 with ε ∼ N (0, 1) and translational movement of 1 /2 and 1 /3 pixel. We performed SR reconstruction only for the luminance channel on 4 GPUs and set the SCG iterations as 10. The SR performance is assessed by PSNR, SSIM, and runtime. Quantitative evaluation is summarized in Table 3. As we can see, the proposed FL-MISR outperforms the multi-image interpolation by a large margin in PSNR and SSIM. Although the iterative FL-MISR requires 2∼5× runtime as the naive interpolation one, it supports an SR output of 5760 × 5760 resolution within   1.3s for the upscaling of 2× and 1.93s for the upscaling of 3×. It is interesting to find that the runtime of SCG depends not only on the image size but also on the count of iterations with successful reduction in the objective function as expressed in [42]. In Fig. 13, we illustrate the closeup views of images 0002, 0027, and 0055 of DIV8K. The top two rows demonstrate the results for an upscaling of 2× and the bottom row is for the upscaling of 3×. We can observe that FL-MISR provides pleasant results with significantly better visual quality than the multi-image interpolation.

Evaluation of FL-MISR on Acceleration
In order to demonstrate the performance of FL-MISR in acceleration, we conducted SR reconstruction of different sized inputs ranging from 512×512 to 4096×4096 for an upscaling factor of 2× on a multi-core CPU, single GPU, and multi-GPU systems. In particular, the CPU experiments were performed on the Intel Xeon Gold 5120 CPU with 755GB memory which contains two nodes and each is equipped with 28 cores. The GPU experiments were carried out on the Nvidia GeForce GTX 1080 GPUs with 11GB memory. Since FL-MISR is based on the iterative SCG algorithm, we evaluated the runtime also with regard to the number of SCG iterations. Besides, we also demonstrate the run-time of the multi-image interpolation as the baseline. The performance of different configurations was calculated based on an average of 100 runs and is summarized in Table 4 where N/A denotes not applicable due to the large GPU memory footprint. As we can see, comparing to the 56-core CPU variant, the single GPU implementation accelerates the computation by more than 25× for LR images of size 2048×2048 and the multi-GPU implementation which uses 4 GPUs achieves a speedup up to 50×. For large-scale images of size 2300×3200 and 4096×4096, FL-MISR running on 4 GPUs obtains a more than 55× speedup than the CPU implementation, while single GPU can not fulfill the memory requirement. For small sized inputs like 512×512 and 1024×1024, single GPU implementation has similar performance as multi-GPU and achieves a 20× speedup comparing to the multi-core CPU. Although the iterative FL-MISR requires more runtime than the naive interpolation one, FL-MISR has much better SR performance and the runtime difference becomes less as the image dimension increases.
In addition, we analyzed the runtime distribution for the local and central computation on a 4-GPU system where the data communication time is aggregated into the central computation. We exhibit the average runtime distribution over 100 runs for input images of different sizes in Fig. 14. It is shown that the consumed time for consensus computing is almost negligible comparing to the local computation, while it is fundamentally necessary to avoid border effects between neighboring GPUs and guarantee a consensus convergence over multi-GPU systems.

Conclusion
In this paper, we propose a multi-GPU accelerated large-scale multi-image super-resolution (MISR) framework based on data parallelism. Specially, each GPU node accounts for a designated region of the latent highresolution (HR) image by applying an adapted scaled conjugate gradient (SCG) algorithm to the distributed subproblem. The local variables of the SCG algorithm are broadcast and aggregated in each iteration to synchronize the convergence rate over multi-GPUs towards a centralized optimum and consistent resolution. Furthermore, an inner-outer border exchange mechanism is performed in the overlapped regions of neighboring GPUs to avoid border effect without compromising the sharpness.
The proposed FL-MISR is seamlessly integrated into the computed tomography (CT) systems by superresolving projections of the same view captured via subpixel detector shift. The SR reconstruction is performed on the fly during the CT acquisition such that no additional computation time is induced. Extensive experiments were conducted based on simulated data and real-world CT measurements of cylindrical phantoms, QRM bar pattern resolution targets, and cylindrical dry concrete joints to quantitatively and qualitatively evaluate the proposed FL-MISR. Experimental results demonstrate that the spatial resolution of CT systems is significantly improved in modulation transfer function (MTF) and visual perception by the application of FL-MISR. Moreover, comparing to a multi-core CPU implementation, the multi-GPU accelerated FL-MISR achieves a more than 50× speedup on a 4-GPU system and it is shown that the exchange of local SCG variables and overlapped regions between GPUs has limited impact on the overall runtime. Last but not least, evaluation on public dataset DIV8K shows that FL-MISR is not confined to CT imaging but also provides very promising results for natural images.