Reviewing GPU architectures to build efficient back projection for parallel geometries

Chilingaryan, Suren; Ametova, Evelina; Kopmann, Andreas; Mirone, Alessandro

doi:10.1007/s11554-019-00883-w

Original Research Paper
Open access
Published: 26 June 2019

Volume 17, pages 1331–1373, (2020)

Journal of Real-Time Image Processing

Suren Chilingaryan ORCID: orcid.org/0000-0002-2909-6363¹,
Evelina Ametova²,³,
Andreas Kopmann¹ &
Alessandro Mirone⁴

Abstract

Back-Projection is the major algorithm in Computed Tomography to reconstruct images from a set of recorded projections. It is used for both fast analytical methods and high-quality iterative techniques. X-ray imaging facilities rely on Back-Projection to reconstruct internal structures in material samples and living organisms with high spatial and temporal resolution. Fast image reconstruction is also essential to track and control processes under study in real time. In this article, we present efficient implementations of the Back-Projection algorithm for parallel hardware. We survey a range of parallel architectures presented by the major hardware vendors during the last 10 years. We analyze the similarities and differences between these architectures and highlight how specific features can be used to enhance the reconstruction performance. In particular, we build a performance model to find hardware hotspots and propose several optimizations to balance the load between the texture engine, the computational and special function units, and the different types of memory, maximizing the utilization of all GPU subsystems in parallel. We further show that targeting architecture-specific features allows one to boost the performance 2–7 times compared to the current state-of-the-art algorithms used in standard reconstruction codes. The suggested load-balancing approach is not limited to back-projection but can be used as a general optimization strategy for implementing parallel algorithms.

1 Introduction

X-ray tomography is a powerful tool to investigate materials and small animals at the micro- and nano-scale [1]. Information about X-ray attenuation and/or phase changes in the sample is used to reconstruct its internal structure. Recent advances in X-ray optics and detector technology have paved the way for a variety of new X-ray imaging experiments aiming to study dynamic processes in materials and to analyze small organisms in vivo. At the Swiss Light Source (SLS), scientists were able to take high-quality 3D snapshots of 150 Hz oscillations of a blowfly flight motor [2]. A temporal resolution of 20 ms was achieved during a tensile test performed at SLS [3] and also in the analysis of morphological dynamics of fast-moving weevils at the ANKA synchrotron at KIT [4].

To achieve these results, the instrumentation used at imaging beamlines has recently undergone a major update. The installed streaming cameras are able to deliver up to hundreds of thousands of frames per second with a continuous data rate of up to 8 GB/s [5]. Newly developed control systems at ANKA [6], SLS [5], and other synchrotron facilities use the acquired imaging information to track the processes under study and adjust the instrumentation accordingly. These control systems rely heavily on the performance of the integrated image processing frameworks. Faster acquisition and a high level of automation are essential to study dynamic phenomena and at the same time enable experiments with significantly increased sample throughput. For example, in 2015 Diamond Light Source (DLS) reported that typically about 3000 scans are recorded during 5 days of operation at a single imaging beamline [7]. Consequently, the amount of data generated at imaging beamlines grows quickly and results in a steep rise in the required computing power. In order to achieve higher temporal resolution and to prolong the duration of experiments, advanced methods are being developed that incorporate a priori knowledge in the reconstruction procedure. These methods are able to produce high-quality images from undersampled and underexposed measurements, as demonstrated in [8, 9]. Unfortunately, these methods are significantly more computationally demanding than traditional reconstruction algorithms and further increase the load on the computing infrastructure [10].

To tackle the performance challenge, several reconstruction frameworks have been developed and optimized to utilize the parallel capabilities of modern computing architectures. At SLS, GridRec, a fast reconstruction approach optimized for conventional CPU technology, has been adopted [11]. The reconstruction is scheduled across a dedicated cluster and reconstructs a 3D image within a couple of minutes [5]. Other frameworks use GPUs to accelerate the computation and are able to achieve minute-scale reconstructions on a single node equipped with multiple GPU adapters. PyHST is developed at ESRF and uses the CUDA framework to offload image reconstruction to NVIDIA GPUs [12]. The second version of PyHST also provides a number of iterative reconstruction techniques [13]. The UFO parallel computing framework is used at the ANKA synchrotron to realize in-vivo tomography and laminography experiments [14, 15]. It constructs a data processing workflow by combining basic building blocks in a graph structure. OpenCL is used to execute the reconstruction on parallel accelerators with a primary focus on NVIDIA and AMD GPUs. ASTRA is a fast and flexible development platform for tomographic algorithms with MATLAB and Python interfaces [16, 17]. It is implemented in C++ and uses CUDA to offload computations to the GPU. Several other frameworks are based on the ASTRA libraries to provide GPU-accelerated reconstruction, for instance the Savu framework at DLS [7] or TomoPy at the Advanced Photon Source (APS) [18]. Recent versions of TomoPy also support UFO and GridRec as backends. All of the GPU-accelerated frameworks are also capable of distributing the computation across a GPU cluster.

While most current imaging frameworks rely heavily on parallel hardware to speed up the reconstruction, specific features of the GPU architecture are rarely considered. On the other hand, the hardware architectures differ significantly [19]. The organization of memory and cache hierarchies, the performance balance between different types of operations, and even the type of parallelism vary. A significant speed-up is possible if details of the specific architecture are taken into account, as illustrated in [20]. Fast execution is especially important if the reconstruction is embedded in a control workflow. Minimal latency is essential to track faster processes and to improve the achieved spatial and temporal resolutions. Due to unavoidable communication overhead, it is not always possible to reduce the latencies by scaling the reconstruction cluster.

For online monitoring and control, fast analytical methods are normally used to reconstruct 3D images. There are two main approaches: Filtered Back Projection (FBP) and methods based on the Fourier Slice Theorem [21]. The latter methods are asymptotically faster, but due to the interpolation involved in the Fourier domain they are more sensitive to the quality of the available projections. For typical geometries, Fourier-based methods are several times faster on the same computing hardware [22] and should be preferred if the computing infrastructure is limited to general-purpose processors only [5]. A recent study suggests implementing back projection as a convolution in log-polar coordinates in order to gain high reconstruction speed with interpolation in the image domain [23]. However, this new method has not yet been adopted in production environments. Filtered Back Projection is still the method of choice, largely due to its simplicity and robustness. Therefore, the efficiency of the FBP implementation remains crucial for the monitoring and control systems in operation. Furthermore, methods used for low-dose tomography normally consist of sequences of forward and back projections. Thus, a faster implementation of the back projection also lowers the computational demands of high-quality offline reconstruction and might reduce the required hardware investments.
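To make the back-projection step of FBP concrete, the following is a minimal pure-Python sketch of a pixel-driven back projection for a parallel-beam geometry. It is an illustration under stated assumptions and not the implementation discussed in this article: the function name `back_project`, the centered rotation axis, the sign convention of the detector coordinate, and the zero contribution outside the detector are all choices made for this example. The filtering step of FBP is omitted, so the input rows are assumed to be already filtered.

```python
import math


def back_project(sinogram, angles, size):
    """Accumulate parallel-beam projections into a size x size slice.

    sinogram: list of projection rows (one list of detector bins per angle),
              assumed to be already filtered.
    angles:   projection angles in radians, one per sinogram row.
    """
    n_bins = len(sinogram[0])
    center = (size - 1) / 2.0        # assumption: rotation axis at the image center
    det_center = (n_bins - 1) / 2.0  # assumption: axis projects onto the detector center
    image = [[0.0] * size for _ in range(size)]

    for row, theta in zip(sinogram, angles):
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        for y in range(size):
            for x in range(size):
                # Detector coordinate hit by the ray passing through pixel (x, y).
                pos = (x - center) * cos_t + (y - center) * sin_t + det_center
                lo = math.floor(pos)
                frac = pos - lo
                if 0 <= lo < n_bins - 1:
                    # Linear interpolation between the two neighboring bins.
                    image[y][x] += (1.0 - frac) * row[lo] + frac * row[lo + 1]
                elif lo == n_bins - 1 and frac == 0.0:
                    # Ray hits the last bin exactly; nothing to interpolate.
                    image[y][x] += row[lo]
                # Rays falling outside the detector contribute nothing.
    return image


# A single projection at angle 0 is smeared along the y axis.
slice_ = back_project([[1.0, 2.0, 3.0, 4.0]], [0.0], 4)
print(slice_[0])
```

Every pixel performs one interpolated detector read per projection, so the cost grows with the number of pixels times the number of projections. The interpolated read in the inner loop is the kind of operation that a GPU implementation can either delegate to the texture engine or compute explicitly on the arithmetic units.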

While there are several articles aiming at the optimization of Back Projection for general-purpose processors and Intel Xeon Phi accelerators [24], to our knowledge there are no publications considering the variety of GPU architectures. A number of papers address specific GPU architectures [25, 26]. Multiple papers perform a general analysis of a range of GPU architectures, reveal undisclosed details through micro-benchmarking, and propose guidelines for performance optimization [27, 28, 29]. This information is invaluable for understanding the factors limiting performance on a specific architecture and for finding an alternative approach that achieves better performance. Several papers propose methods to auto-tune computation kernels [30]. However, the tuning is limited to finding the optimal configuration of pre-defined parameters such as the desired occupancy or the dimensions of execution blocks. For instance, there are no automated solutions to tune the balance between the texture engine and the computational cores.

In [31], we presented two highly optimized back-projection algorithms for NVIDIA Pascal GPUs and a hybrid approach to balance the load between different GPU subsystems using both in parallel. While the algorithms can be used on different hardware, multiple modifications are required to address the differences in the architectures efficiently. Furthermore, the proposed hybrid approach is only suitable for the latest few generations of NVIDIA GPUs. A different scheme to balance the load is required for AMD, Intel, and older NVIDIA GPUs. In this paper, we review a variety of parallel architectures presented in the last 10 years and establish a methodology to extend the original work to different parallel hardware. We discuss hardware differences in detail, build a performance model, and demonstrate how these differences can be addressed to optimize the performance of the FBP algorithm. In particular, we suggest modifications to adapt the developed algorithms to architectures with on-chip memory optimized for 64-bit access. To address further differences in memory subsystems, we propose several alternative caching methods. We introduce an approach to reduce the overall number of executed instructions for systems with a bottleneck in the instruction throughput. We discuss optimal blocking strategies in great detail and suggest how the code generation can be tweaked on the NVIDIA platform. We also propose two new methods to balance the load between different GPU subsystems. One targets the NVIDIA Kepler architecture and the other can be applied universally, but with a minor penalty to the quality. The proposed performance model also allows us to estimate the speed for future architectures and to select the appropriate modification and parametrization of the algorithms. To our knowledge, we present the first comprehensive overview of GPU architectures across multiple vendors and GPU generations. Furthermore, using the back-projection algorithm as an example, we illustrate how specific hardware features can be addressed and estimate the possible gains. The contribution of this paper therefore goes beyond the proposed back-projection algorithm and also suggests optimization strategies suitable for other applications.

In this paper, we focus on the optimizations of the back-projection algorithm and only briefly mention the organization of the data flow, as it is already explained in the literature [12, 15]. We also do not cover scaling issues, since the proposed optimizations can be easily integrated into existing frameworks like ASTRA, PyHST, or UFO, which already provide multi-GPU and GPU-cluster support. The article is organized as follows. The hardware setup, software configuration, and pseudo-code conventions are listed in Sect. 2. A short introduction to parallel architectures that is required to understand the proposed optimizations is given in Sect. 3. In this section we also highlight the differences between the considered parallel architectures. The Filtered Back Projection algorithm and its state-of-the-art implementation are presented in Sect. 4. A number of optimizations to this implementation are proposed in Sect. 5. An alternative implementation relying on a different set of hardware resources is developed in Sect. 6. A hybrid approach combining both implementations to fully utilize all hardware resources is presented in Sect. 7. The achieved performance improvements are finally discussed in Sect. 8.

2 Setup, methodology, and conventions

2.1 Hardware platform

To evaluate the performance of the proposed methods, we have selected 9 AMD and NVIDIA GPUs with different micro-architectures. Table 1 summarizes the considered GPUs. These GPUs were assembled into three GPU servers. The newer NVIDIA cards with Maxwell and Pascal architectures were installed in a Supermicro 7047GT-based server specified in Table 2. The older NVIDIA cards and all AMD cards were installed in two identical systems based on the Supermicro 7046GT platform. The full specification is given in Table 3. Additionally, we have tested how the developed code performs on an Intel Xeon Phi 5110P accelerator. The accelerator was installed in the first platform along with the newer NVIDIA cards.

Table 1 List of selected GPU architectures
