Computational scatter correction in near real-time with a fast Monte Carlo photon transport model for high-resolution flat-panel CT

In computed tomography (CT), scattering causes server quality degradation of the reconstructed CT images by introducing streaks and cupping artifacts which reduce the detectability of low contrast objects. Monte Carlo (MC) simulation is considered the most accurate approach for scatter estimation. However, the existing MC estimators are computationally expensive, especially for high-resolution flat-panel CT. In this paper, we propose a fast and accurate MC photon transport model which describes the physics within the 1 keV to 1 MeV range using multiple controllable key parameters. Based on this model, scatter computation for a single projection can be completed within a range of a few seconds under well-defined model parameters. Smoothing and interpolation are performed on the estimated scatter to accelerate the scatter calculation without compromising accuracy too much compared to measured near scatter-free projection images. Combining the fast scatter estimation with the filtered backprojection (FBP), scatter correction is performed effectively in an iterative manner. To evaluate the proposed MC model, we have conducted extensive experiments on the simulated data and real-world high-resolution flat-panel CT. Compared to the state-of-the-art MC simulators, the proposed MC model achieved a 15×\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} acceleration on a single-GPU compared to the GPU implementation of the Penelope simulator (MCGPU) utilizing several acceleration techniques, and a 202 ×\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} speed-up on a multi-GPU system compared to the multi-threaded state-of-the-art EGSnrc MC simulator. Furthermore, it is shown that for high-resolution images, scatter correction with sufficient accuracy is accomplished within one to three iterations using a FBP and the proposed fast MC photon transport model.


Introduction
Computed tomography (CT) imaging is a powerful nondestructive testing technique. It provides inception about the inner of the scanned object and is widely used for industrial and medical applications. However, this technique suffers from severe quality degradation artifacts. Among these artifacts the scatter, which results from the change in the direction or the direction and the energy of the photon penetrating the object, degrades the quality of the CT reconstructed image by inserting cupping and streak artifacts. These artifacts reduce the contrast of this image [1,2], and the contrast-to-noise [3][4][5]. MC method can estimate the scatter accurately due to the accurate modeling of the physics involved in the photon matter interactions, the shape and the composition of the object require no simplification and could be easily simulated, the ability of this method to model any order of scatter, and requires no simplification or alteration of the real-world scanner configuration, i.e., any scan protocol and setting can be adopted by the MC method [1]. Although it is accurate, the MC method requires a huge computation time.
An accurate and fast MC model accelerated over multiple-GPU is implemented in this paper, which simulates the fundamentals physics involved within the keV range including Compton scattering, Rayleigh scattering, and photoelectric absorption. This MC model has been extensively evaluated against other MC simulators and real-world CT scanner. On the other hand, an iterative scatter correction algorithm which requires a few iterations for scatter compensation is also implemented, the latter has been embedded with the implemented MC model for fast scatter estimation. The estimated scatter from this MC model is used to correct the scatter-corrupted raw projections from the real-world scanner. Moreover, smoothing and interpolation are used to accelerate the iterative scatter correction algorithm. Smoothing reduces the time needed for the MC simulation by denoising the estimated scatter acquired using a reduced number of photons, while interpolation is used to estimate a full set of projections from half the number, and to upsampling the simulated projection from low-resolution to high-resolution. The correction result from the iterative scatter correction algorithm is in fine agreement with the near scatter-free collimator-based reconstructed data set.

Implementation of an accurate and parameterized multi-GPU accelerated MC forward projection model with postprocessing
The proposed MC model incorporates basic physical properties such as Compton scattering, Rayleigh scattering, and photoelectric absorption, as well as parameterized postprocessing such as filtering and resolution interpolation to reduce the computation time through slight approximations that lead to minor but tolerable deviations in the computation result. In addition, the basic parameters of the MC model, such as the step size of the ray tracing, which control the number of voxels crossed during the ray tracing, and the splitting number of photons, can be controlled to balance the trade-off between the speed-up and the accuracy of the scattering estimate. It is shown that the proposed MC model has achieved more than 15 × acceleration in comparison to the GPU implementation of the Penelope simulator (MCGPU) [35] utilizing these parameters. Moreover, this MC model has been accelerated using the multi-GPU platform such that thousands of photons are simulated simultaneously. In comparison with the state-of-the-art multi-threaded CPU MC simulator EGSnrc, our model achieves 202 × speed-up on a four GPUs system.

Fast iterative scatter correction near real-time
MC simulation is considered the gold standard for the scatter estimation [5][6][7][8][9][10][11]14], due to the accurate modeling of the physics involved in the photon transport. However, it is computationally expensive. This has led to a choice of methods other than MC simulation for scatter correction for the reconstruction of volume representations in computed tomography in the past. The huge computation time is particularly true for the high resolutions in non-medical computed tomography, i.e., computed tomography for materials, engineering, and natural sciences in general. In this work, we show that by optimizing the parameters of the GPUaccelerated MC simulation and by the use of smoothing and interpolation techniques, accurate and fast scatter correction are achieved simultaneously. This makes the MC simulation and correction of scattering suitable for integration into a real-world high-resolution flat-panel CT reconstruction. Based on this approach, we achieved a significant speedup of MC simulation in a demonstrated use case from 45.4 to 0.28 h for 3000 projections of a computed tomography scan with a flat-panel detector of 2k × 3k resolution. This computation time for MC simulation is even lower than the acquisition time required by the real-world CT scanner for the same number of projections which takes almost 0.8 h. In Ref. [15], and for the same scanner model, the reported scan time is 0.435 h for 3142 projections with an exposure time of 500 ms, a voltage of 120 kV, and a current of 90 μ A. In the proposed work, the scan was performed using 3000 projections with an exposure time of 1 s, a voltage of 200 kV, and a current of 50 μ A. Moreover, this simulation time using this model is less than the time of the non-iterative filtered backprojection (FBP). Furthermore, the achieved simulation time is very close to the time required by the fast Boltzmann equation solver methods [17][18][19][20]. However, the simulation time achieved by the proposed MC model is without the need to down-sample the detector and the voxelized volume to the same extent suggested by these methods. Considering all the above mentioned, scatter correction using the proposed MC model can be achieved in near real-time or even in real-time, i.e., faster processing time than the required acquisition time for the projections images from the realworld CT scanners [16]. Additionally, it is shown that the proposed MC model and the scatter correction algorithm provide a solution to the problem of scatter correction with sufficiently high accuracy, which is close to a nearly scatter-free measurement generated using a collimator.

Related work
Several methods have been introduced to compensate for scatter. According to [21], these methods are divided into two approaches. The first approach is the direct elimination of the scatter during the scan process by the use of an antiscatter grid [22], bowtie filter [23], or optimizing the geometry by increasing the distance between the scattering object and detector [24][25][26]. These methods can only reduce the scatter in the reconstructed image due to a better CT-scanner setup and do not lead to a complete scatter correction. The second approach is by computing the scatter and its subtraction in each projection using different computational models like an analytical or empirical approach, by neural networks [11] or by MC simulation.
Recently, a new approach to estimate the scatter has been introduced [17][18][19][20]. Unlike the MC approach, which solves the Boltzmann problem of the photon transport 1 3 stochastically, these methods try to solve the problem deterministically. Using this approach, the estimation time of a single scatter projection by these methods is within seconds on the GPU platform. To reach this execution time, a severe reduction of the detector and the volume resolutions is required in these works [17,18], this hardens the possibility to track high-frequency features of the scatter. In addition, the recent use of the neural network to estimate the scatter results in very fast estimation. Nevertheless, a significant reduction in the accuracy can be encountered when the network is used to predict the scatter in a region that is not a part of the training data set of this network or under a different scan voltage far from the training data [12]. This imposes limitations on the flexibility of the scan geometry and the type of object which can be used.
MC simulation is an accurate approach to model the scatter due to the accurate modeling of the physics involved in the photon-matter interaction. Many MC simulators have been introduced for this purpose. Among these simulators are MCNP [27], aRTist [28,29], EGSnrc [30], Penelope [31], and others. The main drawback of the existing MC simulators is their required computational effort. To reduce the computation time, many acceleration methods are adopted. In Refs. [5,8,[32][33][34], it is mentioned that the use of what is known as the variance reduction techniques, such as photon splitting, Russian Roulette, and forced detection could enhance the efficiency of the photon transport and reduce the execution time by several orders of magnitude. However, even with the extensive use of the variance reduction techniques, the MC simulation still required long computation times [35]. Single instruction multiple data (SIMD) based computation is an approach that is used to accelerate the computation through the vectorization of the data which results in an enhanced matrix multiplication [36]. The SIMD technique shows good performance when it is used to accelerate the ray-tetrahedron intersection for the MC photon transport simulator in Ref. [37]. Moreover, random number generation using the SIMD acceleration is also tested in Ref. [37]. The combination of these techniques results in a 22% speed improvement in comparison to the non-SIMD case.
Apart from the aforementioned algorithmic acceleration techniques, GPU is employed to further accelerate the MC simulation from the hardware aspect. In Ref. [35], the authors have accelerated the MC simulation in a voxelized geometry using the Penelope MC simulation physics by the use of the CUDA programming model. A speed-up factor of 27 compared to the CPU version of this simulator is achieved. The authors in Ref. [38] have implemented the simulation of the photon transport of the EGSnrc simulator on the GPU by the use of the CUDA programming model. Between 20 to 40 × speed-up is achieved depending on the number of voxels used in the simulation. The Geant4 MC simulator is the base of the work in Ref. [39] in which the authors took advantage of the varieties of the physics available in this simulator and performed the acceleration on the GPU with a speed-up factor of 86. Variance reduction techniques have not been implemented in the aforementioned works, which imposes a limitation on the speed and the efficiency of these works [41,53]. In contrast to the all methods mentioned above, which perform the MC simulation using a voxelized object, the method in Ref. [40] uses quadratic functions to represent the bounding surfaces of a region. As a result, a ∼ 3 × higher computation time is required in comparison to the use of a voxelized geometry. A hybrid approach of using a GPU accelerated MC simulation combined with the use of the variance reduction technique is introduced in Ref. [41]. A GPU accelerated MC simulation for mega-voltage cone-beam computed tomography (MV-CBCT) is introduced in Ref. [42]. The main physics simulated using this model is the photoelectric and the Compton scattering only, while ignoring the Rayleigh scattering.
In Ref. [43] the authors have developed a GPU-based MC code for dose calculation in the radiotherapy treatment known as GPU dose planning method (gDPM) based on the original CPU-based dose calculation package dose planning method (DPM) [45]. An improved version of this work has been published in Ref. [44]. In this version, the simulations of the electron and the photon have been separated by placing the particles to be simulated into two different arrays in which the GPU simulates particles in only one array at a time. Following this approach reduces the GPU thread divergence that occurs due to the different particle transport physics for different types of particles [44,46], which results in a faster execution time in comparison to the work in Ref. [43]. The GPU-based coupled photon-electron MC dose calculation package, GPU Monte Carlo dose (GPUMCD), has been implemented for a voxelized geometry in Ref. [47]. This work follows the same aforementioned approach to separate the transport of the electrons and the photons. The work in Ref. [48] introduces the MC photon-electron transport package GPU Monte Carlo (GMC). In contrast to the gDPM and the GPUMCD simulators, which simulate the full trajectory of an electron through their GPU kernels, the GPU kernel in this simulator transports the electron by only one step [46,48]. The GPU-based MC photon dose code (MCPDC), introduced in Ref. [49], supports the dose calculation by simulating only the photon transport. However, the above-mentioned works are for dose calculation and radiotherapy problems only. Their main application is for treatment planning [50][51][52]. In contrast to the proposed MC model, the dose calculation methods can not be used for non-destructive testing in industrial applications. In addition, their implementation followed simple photon physics treatment. In this treatment, the Rayleigh scattering has been ignored, and the Compton scattering is considered to occur with free electrons only. Such an assumption does not include the electron-atom binding effect on both the angular and energy distribution (Doppler broadening) of the scattered photon. As a consequence, they are only suitable for high-energy photons (above 1 MeV simulation). Full physics simulation has been reported in Ref. [53]. In this work, a GPU accelerated MC dose calculation code (gCTD) has been introduced. The result from the gCTD simulator is in good agreement with the EGSnrc simulator. However, in the same work, it is stated that there is no implementation of the variance reduction techniques in the gCTD simulator. As mentioned before, this affects both the convergence rate and efficiency of the MC simulator [53].
The acceleration of the MC simulation over the GPU platform has been conducted for positron emission tomography (PET) in Refs. [54,55]. The accuracy of the GPUbased MC gPET simulation tool has been compared with the GATE 8.0 (Geant4 application for tomography emission). It is mentioned that in terms of accuracy, both simulators are very close. The work in Ref. [56] uses the GPU platform to accelerate the ray-tracing calculations that are involved in the MC simulation. Another GPU-based MC code for PET named pet aimed novel nuclear images (PANNI) is implemented in Ref. [57]. This work aims to implement a PET image reconstruction with the iterative maximum likelihood expectation maximization (MLEM) relying on the accuracy of MC to calculate the elements of a PET system matrix. The authors in Ref. [58] have proposed a fast GPU-based PET simulator named Ultra-fast MC PET simulator (UMC-PET). All the relevant physics related to the transport, emission and detection of the radiation in the PET simulation have been considered in this simulator. This includes the scatter, positron range, the attenuation inside the patient, photon interaction from the environment, and the detector response. The simulation results from this simulator are in fine agreement with the results from the PeneloPET. PET imaging is a powerful tool that is mainly used in the medical field to evaluate organs and tissues for possible diseases. Therefore the main application of the above-mentioned MC-based PET simulator is in the medical field [59], in which they do not target the industrial non-destructive testing applications as well.
Scatter correction based on MC computation is the focus of several works. In Ref. [8], the authors have used the EGSnrc simulator supported by the use of the variance reduction techniques to correct the scatter-corrupted projections iteratively for low-resolution projections. The same approach is extended for a real phantom study in Ref. [5] also for low-resolution projections. Other works based on CPU MC simulators are found in Refs. [6,7,60,64]. Although the number of projections and photons used in these works is low, their simulations required long computation time beyond the applicability to high-resolution flatpanel CT and with compromised correction quality [32].
Acceleration techniques are employed to accelerate the MC simulators and the scatter correction algorithms. The iterative scatter correction in Ref. [1] is based on the fast estimation of the scatter using a low number of photons. The resultant noisy scatter estimation was then efficiently denoised by a three dimensional fitting of the Gaussian basis function. However, this approach works well for small objects only [34]. The MC-based scatter correction algorithm proposed in Ref. [65] estimates the scatter on a small number of detector nodes combined with a reduced number of projections. Linear interpolation is then used between the nodes and projections to derive the complete scatter estimation of the scan. Such an approach could not track possible high spatial frequencies in the scatter distribution especially if the interpolation grid is too coarse [34]. The fast scatter correction algorithm proposed in [32], which is based on a single-GPU MC scatter simulation and extended in Ref. [9] for multi-GPU, relies again on the usage of a very low number of photons and projections and requires the availability of a priori information such as the planning CT scan. By utilizing a GPU-accelerated MC model in Ref. [41], the scatter is corrected using a two-stage of scattering estimation which is rather expensive. The fully iterative GPU-based scatter correction algorithm in Ref. [63] is used to correct the scatter from breast examination only. Every voxel in the volume is assumed to be composed of two materials glandular and adipose. The range of the voltages supported by this work is from 10 to 50 keV with a simplified version of MC simulation, this includes the assumption of 100% detector efficiency.
Unlike most of the previously mentioned works, the scatter correction algorithm introduced here does not rely on heavily smoothing or interpolating the simulated projections. By applying certain acceleration techniques, the proposed MC model could achieve a 162 × speed-up in comparison to the standard case without acceleration using the same model.

The proposed MC photon transport model
A detailed MC photon transport model through a voxelized geometry in a GPU has been implemented using the OpenCL 2.1 programming model. The implemented MC model simulates the photon physics within the 1 keV to 1 MeV range. In this model, the X-ray source is treated as a point source that emits a polychromatic spectrum. This spectrum is simulated using the Geant4 MC simulator and it is imported into this model as a discrete spectrum with several energy bins of 2 keV resolution. The total number of photons to be simulated is distributed among the energy bins according to the spectrum. Simulation is then conducted for each energy bin sequentially. The attenuation coefficients of the materials used in the simulation are from the national institute of standard and technology (NIST) database [66]. The major physics of the photon-matter interaction such as Compton scattering, Rayleigh scattering, and photoelectric absorption have been considered in this model. The Compton scattering angular distribution was simulated using the Compton scattering PDF which is an extension of the Klein-Nishina PDF by including the form factor tabulated by Hubbel [68], this PDF is given in Eq. 1 [67].
where = 2 (1 − cos( )) , is the scatter angle of the photon, r 0 is the electron radius 2.8179 × 10 −15 m, and ′ are the incident and final photon energies in units of 0.511 MeV, = E∕(mc 2 ) , where m is the mass of the electron and c is the speed of light, and � = ∕[1 + (1 − cos( ))] , S(q, Z) is an appropriate scattering factor modifying the Klein-Nishina cross-section taken from [68] with q being the inverse length and Z being the atomic number of the material. Taking into account the scattering form factor is necessary to include the binding effect of the electron-atom on the angular distribution of the photon. Moreover, to include the binding effect on the energy distribution, the Doppler broadening of the energy of the photon is also simulated according to [69,70]. For the Rayleigh scattering, the angular distribution is simulated using the Rayleigh scattering PDF which is the extension of the Thomson scattering by the inclusion of the scattering form factor, this PDF is given in Eq. 2 [71].
where F(q, z) is a form factor modifying the energy-independent Thomson cross-section taken from [68].
To simulate the scatter event, a photon that reaches the volume from the source is transported to the first interaction point with a distance l = −(1∕ (E, )) log( ) [27], where is the linear attenuation value, E is the energy of the photon, is the density of the material, and is a random number generated. The scattering interaction type is then determined by sampling from the cumulative probability formed using the probability of each interaction type. If the photon encounters a photoelectric absorption it is then immediately terminated. Otherwise, the scattering angle is sampled, using the rejection and inversion methods [71][72][73][74], from Eq. 1 or Eq. 2 depending on the selected interaction type. Figure 1a, b show the sampling results from the Compton scattering, and the Rayleigh scattering PDFs, respectively, taken from the proposed MC model. To reduce the variance of the scattering image, the photon is split into what is known as pseudo-particles to randomly selected pixels on the detector [27,34]. The weight of the photon is then divided by the selected split number to keep the result unbiased [27]. The scoring of the scatter on the detector in this model is done by the use of the point detector method, i.e., the contribution to the detector from each interaction point is calculated using the following equation [27].
where D(E) is the energy response of the detector, p( d ) represents the probability of scattering toward the detector given by Eqs. 4 and 5 for the Compton and the Rayleigh scattering, respectively, [27], d is the cosine of the angle between the photon path and the direction to the detector [27], W is the photon's weight, L is the distance from the interaction point to the detector. From each interaction point, where is the spatial position, is calculated by checking the material type in a specific step length which represents the number of voxels that the ray-tracer crosses. At each step, the code checks the material type and interpolates the value from the table of the selected material. The tables of the values of the relevant materials are imported to the code from external files already prepared and stored within the code directory. The voxelized volume is segmented and assigned to different materials using multi-threshold derived using the Otsu method [76] before the start of the simulation. r 2 0 is a constant and Incoh , Coh , are the integrated incoherent and coherent cross-section, respectively, taken from [68]. The scatter intensity projection on the detector is calculated by summing up the contributions from all the interaction points of all the simulated photons according to Eq. 6. The selection of the pixel, where the scoring occurs, is done randomly.
where N is the number of energy bins of the polychromatic source, M is the number of photons in the energy bin, and K is the number of interactions.
Regarding the GPU implementation, two kernels were implemented in this model. The first kernel is to calculate the primary projection as given in Eq. 7 [75].
While the second kernel is used to calculate the scatter projection. The number of threads that are used in the simulation is 16,384, as this number gives the fastest execution. On the other hand, the number of work items per work group was determined automatically by the OpenCL implementation. In addition, the GPU MC model supports both single and double-precision with a significant enhancement of the performance regarding the speed using the single-precision format. The implemented GPU-based MC model is designed to estimate the scatter and the primary projections using a voxelized volume to enable the use of this model in the correction of the scatter artifacts iteratively [5].
This GPU-based MC model is a parameterized model in which it is not only accelerated using the multi-GPU platform but is embedded with several controllable key parameters to speed up the simulation. An example of these powerful key parameters is the ray-tracing step size, in which the number of voxels that are crossed during the ray-tracing can be controlled to reduce the simulation time. Other key parameters are the number of photons and the splitting number of the photons. To demonstrate the effect of these parameters on the simulation time, it is shown in Sect. 3.5 that the implemented model can achieve 15 × speed-up in comparison to the MCGPU simulator utilizing these key parameters. Moreover, Sect. 3.6 shows how the proper adjustment of these parameters can lead to a real-time MC simulation with a simulation time that is even lower than the time required by the real scanner considering the same number of projections. From the hardware aspect, the number of projections, that are needed to be simulated, are distributed among multi-GPU. This scale the simulation time almost linearly with the number of GPUs. As a result, the simulation time for the full set of projections is divided by the number of GPUs. Moreover, the achievable speed-up is neither with the extensive reduction in the projection and the volume resolution nor by the use of a very low number of photons and projections, which is the approach almost used in all the scatter correction methods that adopt the MC simulation to estimate the scatter [1, 5-7, 32, 60, 64, 65].
This GPU-based MC model is made a part of an in-house tool that contains several CT reconstruction algorithms and solutions. The GPU implementation of this CT tool is achieved using the OpenCL 2.1 framework. Consequently, the GPU implementation of the proposed MC model has been performed using the same OpenCL version.
The CPU version of the implemented MC model is designed to perform the simulation on a computer-aided design (CAD) surface model using triangular elements. This version of the implemented MC model is embedded by a fast highly vectorized ray-tracer algorithm named Embree [61]. This MC version is also parallelized using multi CPU threads. This parallelization on the CPU platform is implemented using the OpenMP framework [62]. However, being independent of the CAD model provides numerous advantages if the estimated scatter is to be used in the iterative scatter correction process. First, the CAD model is not always available in all the cases or it only describes parts of the object, second, it is required a registration that should be robust against the CT artifacts, and finally, it is possible that doing the correction using the CAD model could bias the measurement toward the CAD model itself [13]. The implementation and execution of the GPU and the CPU versions of the implemented MC model are performed using a Linux system.

Applied acceleration schemes
From the hardware aspect, we have accelerated the proposed MC model using the GPU platforms. The GPU implementation is achieved using the OpenCL framework in which multiple thousands of photons are simulated simultaneously by exploiting the threads of the GPU. Regarding multi-GPU implementation, the number of projections is equally assigned to the available GPUs so that these projections can be simulated simultaneously.
In addition to the GPU acceleration, other techniques are utilized to accelerate the scatter correction process. To reduce the variance in the scatter image and enhance the speed, the variance reduction techniques are used [34]. Examples of these techniques are splitting and Russian Roulette. Splitting reduces the variance in the image by distributing any high contribution on the detector among many pixels. Russian Roulette is used to discard some of the split particles that have low contributions to the result overall to save the time of the expensive tracking process [27].
Moreover, the ray-tracing through the voxelized volume is an expensive method, as millions of voxels need to be crossed by these rays to score the values on the detector. Therefore, the number of voxels of the original volume is down-sampled from 3164 × 2304 × 3164 voxels to 791 × 576 × 791 voxels to fit the GPU first, and to reduce the number of voxels that the ray-tracer crosses. In addition, we have shown, in Sect. 3.6, that instead of looping on every voxel and checking the material type during the ray-tracing, one can adjust the loop to make the ray crosses several voxels before checking the material type through the use of a higher step size. This has led to a significant improvement in the simulation time while maintaining the accuracy of the result. A study of the effect of different step sizes on the accuracy of the scatter correction can be found in Sect. 3.6.
Interpolation techniques have been also used. First, cubic interpolation is used to up-sample the simulated primary and scatter projections from 576 × 800 pixels resolution to 2304 × 3200 pixels resolution; second, only half the number of scatter projections is simulated using the proposed MC model, and the other half is generated using the linear interpolation.
On the other hand, the smoothing operation enhances the estimation speed of the scatter by more than 50% by reducing the number of photons required in the MC simulation [8]. Many previous works adopt smoothing filters [1,78]. The EGSnrc simulator uses the Savitzky-Golay filter [8,79]. The Savitzky-Golay filter is used to denoise the noisy scatter estimates in this work, as this type of filter can preserve the high-frequency components of the smoothed image [80]. Figure 2 shows that the profile line from the denoised scatter projection fits well with the profile line of the scatter projection acquired with a 10 × higher number of photons in comparison to the first case [8].

Simulation of the polychromatic behavior
To achieve an accurate CT scan simulation, the polychromatic behavior of the source and the detector should be accurately simulated. The simulations of the two have been performed using the Geant4 MC simulator taking into account the internal construction and the components setup of both.

Iterative scatter correction
The iterative scatter correction algorithm based on a fast FBP and the proposed fast photon transport model is shown in Fig. 3. Usually, one to three iterations are required for the scatter correction in considered cases shown later.
The algorithm starts by importing the scatter corrupted raw intensity projections from the scanner. These projections are of 2304 × 3200 pixels resolution. The intensity projections are then converted into linear attenuation projections and reconstruction using the FBP method is then performed to get a volume using these projections. The reconstructed volume is down-sampled to 791 × 576 × 791 voxels to fit the GPU. The different materials in the down-sampled volume are segmented using multi-threshold derived from the Otsu method and then converted into densities. A MC simulation is then applied, using the proposed MC model, to acquire the scatter and the primary intensity projections using the voxelized volume. These projections are then used in Eq. 8 to perform the scatter correction in the iterative algorithm.

Scanner Corrupted
Projs.  where c is the corrected projection, a is the scatter-corrupted linear attenuation projection from the scanner, and Is, Ip are the scatter and the primary intensity projections taken from the proposed MC model, respectively. As mentioned before, this work aims to correct high-resolution projection from the scanner. Two approaches can be followed in the iterative scatter correction algorithm. The first approach is to calculate the scatter and the primary projections using the proposed MC model with the same resolution of 2304 × 3200 from the scanner. These projections should be denoised by the smoothing filter before using them in the correction step. This approach is time-consuming, as performing the MC simulation on a high-resolution detector requires a high number of photons to get a low noise projection. The second approach is to calculate the scatter and the primary projections using a lower resolution, i.e., 576 × 800 with less number of photons. These projections are then up-sampled to the high-resolution case using cubic interpolation to prepare them for the correction step.
In CT, a high number of projections should be used to get a proper reconstruction quality using the FBP method. Simulating the scatter for all these projections is timely expensive. Since the scatter tends to change slowly between projections [77], only half of the projections are simulated and the other half is acquired using linear interpolation between every two projections.
The primary and the scatter intensities projections, prepared in the previous step, are used in the initial correction of the projections from the scanner using Eq. 8. This initial correction lacks accuracy since the estimation of the scatter and the primary intensities in this step has been done using a scatter-corrupted volume.
The correction process is repeated again using the partially corrected projections until an adequate correction is derived. Different parameters of the MC setting were tested for the sake of optimizing the time. It is shown that certain settings of these parameters minimize the MC simulation time to the lowest extent while maintaining the accuracy of the scatter correction. This is one of the main differences between the iterative scatter correction algorithm given in Fig. 3 compared to previous works [5,8]. Further differences are the low-resolution, low number of photons, low number of projections, and the CPU instead of GPU-based scattering corrections.
The iterative scatter correction algorithm has been accelerated using multi CPU threads. Only the MC simulation, as it is the most time-consuming part of the algorithm, has been accelerated using multi-GPU.

Experiments and results
In this section, the proposed MC model is evaluated on real-world data sets and compared with other simulators. The real-world datasets were acquired using the Nikon scanner available at our institute for different test objects.
In particular, several examples of cement-based materials were acquired using objects prepared also at our institute.
In addition, results of the scatter correction algorithm and comparison of these results with and without the use of different acceleration and optimization techniques are also shown for different experimental objects.

Collimator scatter-suppressed results
Two copper blocks of 2 cm × 2 cm × 4 cm dimension have been used as a simple collimator in this work. These blocks were placed on top of each other in front of the X-ray source in which the long side was positioned perpendicularly with the exit window of the X-ray source with an opening slit of 1 mm thickness. As a result, the original cone beam is converted into a fan beam and the scanned slit with this fan beam is used as ground truth in this work.

Verification of the proposed MC model
Several methods were used to verify the proposed GPUbased MC model. First, the model is used to correct a scatter-corrupted projection from the real scanner. This correction result is compared with the experimental near scatter-free projection acquired experimentally using the collimator. Second, the model is used to correct a scatter-corrupted projection from the EGSnrc simulator in which the result is compared with the scatter-free projection available from this simulator. Figure 4 shows the result of the scatter correction of the scatter-corrupted linear attenuation projection from the scanner. The correction was done using Eq. 8 utilizing the scatter and the primary projections simulated by the proposed MC model. The result of the correction of this projection matches well with the near scatter-free result of the same scan acquired using the collimator.
On the other hand, the result of the scatter correction of the scatter-corrupted projection from the EGSnrc simulator is shown in Fig. 5. It is shown that the scatter correction result matches the scatter-free projection from this simulator.

Scatter correction for motorcycle cylinder head
The scatter is corrected for an aluminum motorcycle cylinder head using the proposed iterative scatter correction method with three iterations only. 1500 scatter and 3000 primary projections were simulated using the proposed MC model with 2.5 × 10 8 photons and a 576 × 800 resolution. The scatter projections were linearly interpolated to 3000 projections and then up-sampled along with the primary projections to the same resolution of the scanner before using them in the iterative scatter correction method. Figure 6 shows the results of the third iteration of the scatter correction. As shown in the figure, the results of the scatter correction using the iterative algorithm coincide with the collimator results.
The scatter correction results in each one of the three iterations are shown in Fig. 7 (the collimator result is not shown in this figure). This figure shows that some parts of the aluminum material were wrongly corrected as steel in the first and the second iterations, as the thresholds derived by the Otsu method fail to distinguish between the steel and the aluminum Fig. 4 Correction of a scatter-corrupted attenuation projection acquired from the scanner for an aluminum motorcycle cylinder head. a Scatter-corrupted projection. b Scatter-corrected projection. c The white dashed profile lines of a, b, and from the near scatter-free projection acquired using a collimator c The white dashed profile lines of a, b, and the near scatter-free volume's slice from the collimator. d Slice from the scattercorrupted volume (side view). e Same slice from the corrected volume. f The white dashed profile lines of d, e, and the near scatter-free volume's slice from the collimator. The profile lines in this example are averaged over multiple rows to suppress the noise 1 3 materials due to the severe scatter artifacts. However, the third iteration shows a good scatter correction result has been achieved.
In Ref. [5], it is suggested that the result of each iteration can be quantified with the reference scatter-free result by the use of the square root of the mean square error (RMSE), given in Eq. 9, of the reconstructed voxels. Figure 7f shows the plot of the RMSE values for all the iterations. These RMSE values are calculated for the regions marked with blue circles in Fig. 7b-d, and from the result of the fourth iteration (not shown here). The scatter-free reference image, used in Eq. 9, is taken from the collimator. According to Fig. 7f, the fourth iteration gives almost the same RMSE value as the third iteration. Therefore, only three iterations were used to correct the scatter artifact.
Here i and i,ref are the mean of the linear attenuation values in the scatter-corrected volume region of interest (ROI) and the reference near scatter-free volume ROI, respectively, for material i.
The setting of the scan parameters used in this example and the time required for the scatter correction of this object using three iterations are shown in Tables 1 and 2, respectively. In Sect. 3.6, it is shown that a near real-time scatter correction is achieved for this example following certain optimization of the MC key parameters. Figure 8 shows the scatter correction results of different cement-based objects. The first row shows the scatter

Scatter correction of cement objects with steel rods
correction result of a cement cylinder with 11 cm outer diameter and 5 cm inner diameter with eight steel rods inserted in this cylinder. The second row shows the correction result of a 7 cm diameter cement cylinder with eight steel rods. The last row shows the result for a 7 cm diameter cement cylinder without any insertion. 3000 The profile lines in this example are averaged over multiple rows to suppress the noise Table 1 Measurements parameters for the motorcycle cylinder head and the cement objects scans  Table 2 Time required for three iterations of scatter correction using four GPUs with and without the use of the interpolation technique for the cylinder head example, the MC simulation time is for one iteration only projections from the scanner for each one of these objects were corrected using the implemented iterative scatter correction algorithm following the same approach which is used to correct the motorcycle cylinder head object. The number of photons used in the MC simulation, in this case, was 4.9 × 10 8 photons.
Since the reconstructed volumes from the scanners' projections are well segmented, only a single iteration was used for the correction. The good segmentation result is a consequence of the high diversity between the linear attenuation values of the cement and the steel. Although the steel material is highly affected by the scatter in these examples, the value of the linear attenuation of the steel is still easily distinguishable from the one of the cement. The primary and the scatter projections, simulated using the proposed MC model in this single iteration, were accurate enough for the scatter correction process.
The results of the scatter correction of these objects resemble the near scatter-free results from the collimator. Besides, it is also shown that the scatter correction successfully reduces the cupping and the streak artifacts that are severely affecting the original volumes from the scanner. Table 1 shows the parameters of the scan which are used to acquire the projections from the scanner. Table 3 shows the time required for a single iteration of the scatter correction and for the MC simulation of the three objects.

Evaluation of execution time of the proposed MC model
Three different objects were used to evaluate the required execution time of the CPU and the GPU versions of the MC model against other CPU-based simulators. Object (3) is shown in Fig. 5, objects (1) and (2) are not shown here however, the occupation of these objects on the screen is the same as the object in Fig. 5 (1), (2) and (3),  In the first case, a single thread was used in the simulations of the CPU simulators, whereas a single GPU was used in the simulation of the proposed MC GPU model. Table 4 shows the time required for a single projection. Considering the comparison case of the object (2), 347 × acceleration is achieved compared to the EGSnrc simulator, while more than 18 × acceleration is achieved compared to the aRTist simulator using the GPU version of the MC model. Compared to the CPU version of the implemented MC model, a 19 × acceleration is achieved compared to the EGSnrc simulator, and with a slightly better performance compared to the aRTist simulator. It is worth mentioning that the aRTist simulator is one of the fastest simulators available on the CPU platform [28].
In the second case, the evaluation is done by comparing the total simulation time required for the full set of projections. Thus, 3000 projections were simulated on four GPUs using the proposed MC model. Whereas 12 threads were used in the simulation of the CPU simulators. Table 5 shows the time required for the simulation. For the comparison case of the object (2), the multi-GPU implementation achieves a 202 × and 9 × acceleration compared to the EGSnrce simulator and the aRTist simulator, respectively. While using the CPU version, a 17 × acceleration has been achieved compared to the EGSnrc simulator. In comparison to the aRTist simulator, the CPU version is slightly slower. This slow performance is due to the use of some atomic operations in the OpenMP implementation of our CPU version. On the other hand, a lower speed-up factor is achieved using the GPU MC model in this case, as the use of the multi-GPU imposes an extra time to copy the data from all the devices to the CPU memory and merge them, which is a time-consuming process. It is worth mentioning that the achieved speed-up factor is not the same for all the objects. It is shown that the performance using object (2) is better than the others. This is related to the occupation of the object on the detector and how complex the object is, as ray tracing through complex objects requires more time.
Moreover, the execution time of the proposed GPU-based MC model has been evaluated against the MCGPU simulator. The two GPU-based models were used to correct a scatter-corrupted projection from the EGSnrc simulator. Figure 9 shows that the scatter correction results from the two models match the scatter-free projection from the EGSnrc simulator. Regarding the speed and the accuracy, the proposed MC model achieved a scatter correction result with a better contrast-to-noise ratio (CNR) and speed compared to the MCGPU simulator considering the use of other acceleration methods in addition to the GPU platform. In contrast to the MCGPU model, the proposed MC model supports the usage of variance reduction and other acceleration techniques to accelerate the acquisition of the projections. By

Optimization of the runtime of MC simulation
In this work, key parameters are made controllable to optimize the runtime for the MC simulation as it is the most expensive part of the correction process. The MC simulation alone consumes almost 85% of the complete simulation time required for the iterative scatter correction process shown in Fig. 3. Table 6   simulation time is reduced to only 0.28 h using these parameter settings. Considering the last simulation case, case 4, the number of photons and the splitting number of photons are both reduced in comparison to cases 1-3, while the step size is set to 2. The simulation time is recorded as 0.2 h only. The quality of the result of the scatter correction has been assessed using quantitative evaluation methods such as the CNR, the mean square error (MSE), and the normalized cross-correlation (NCC). The results of this evaluation are shown in Table 7. According to this evaluation, the results of both cases 2 and 3 which are shown in Fig. 10b, c, respectively, produce good scatter correction quality with a great enhancement of the computation time in comparison to case 1 (shown in Fig. 10a) and the standard case (not shown). Case 3, for example, can simulate a single scatter projection within 2.7 seconds only, whereas the full set of the projections is calculated within 0.28 h using four GPUs in near real-time, which is comparable to real-world CT acquisition. This gives a 162 × speed-up in comparison to the standard case (column 1) which requires 45.4 h to acquire the full set of projections. Although further optimization of the key parameters in case 4 results in less required computation time in comparison to the other cases, the result of the scatter correction, in this case, is noisy as shown in Fig. 10d.

Conclusion
In this work, a multi-GPU accelerated MC photon transport model has been implemented and embedded into the iterative scatter correction algorithm for high resolution flat-panel CT. Especially, fundamental physics including Compton scattering, Rayleigh scattering, and photoelectric absorption are implemented in the proposed MC model. The MC model is accelerated by splitting the projections equally over multi-GPU to enable a full parallelization. In addition, several key parameters are made adjustable in this model. It is shown that certain adjustments of these key parameters enhanced the speed of the scatter simulation while maintaining accuracy. The most effective factor in the speed-up achieved using the proposed MC model is the use of the multi-GPU in conjunction with the above-mentioned key parameters speeding up the simulation significantly with a proper adjustment. Moreover, the MC code itself is efficiently written, by excluding the multi-GPU acceleration and the key parameters from the count, the CPU version itself achieves a comparable speed to the aRTist simulator, which is one of the fastest simulators on the CPU platform. To validate the speed and accuracy of the proposed MC model, comparisons have been conducted with the state-of-the-art MC simulators, i.e., EGSnrc, aRTist, MCGPU, and with the real-world scanner. In comparison with the multi-threaded MC simulator EGSnrc, a 202 × speed-up is achieved for a 2304 × 3200 pixels detector using four GPUs. Furthermore, using the implemented iterative scatter correction algorithm, we show the possibility to perform the correction of the scatter artifact iteratively in near real-time. Compared to the reference images acquired using the collimator, the scattering artifacts are effectively suppressed within one to three iterations of the proposed scatter correction algorithm. It is shown that the proposed MC photon transport model is both sufficiently fast (real-time) compared to the acquisition time by the real-world CT scanner and to FBP and sufficiently accurate for scattering artifact correction in high-resolution flat-panel CT reconstruction. With the achieved speed using the proposed GPU MC model, the integration of this model into an iterative reconstruction method, e.g., the MLEM method, can increase its accuracy by considering the full physics in the forward projection part of the reconstruction iteration.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.