1 Introduction

Graphics Processing Units (GPUs), among the most widely available parallel processors, have proven their power in facilitating research in a wide range of fields, including high-performance computing, data centers, medical imaging, and machine learning. Among these, GPU-based machine learning applications, and more specifically deep learning, have grown significantly in use in recent years [18]. To address this need, NVIDIA has introduced a specialized computing unit, called the Tensor Core, that speeds up neural network training and inference in deep learning by offering enormous acceleration of matrix computations [21]. Tensor Core-powered GPUs can offer more than a hundred TFLOPS of performance [4].

Although the Tensor Core’s tremendous performance is tempting to use, its applicability is more restricted than that of general-purpose CUDA cores. First, Tensor Cores only perform a specific form of matrix multiply–accumulate operation. Second, the precision of Tensor Core floating-point operations is limited to half precision. Since any application with matrix operations at its core, including non-machine-learning image and video processing applications, could potentially benefit from the performance improvements these newly introduced Tensor Cores provide, it is valuable to analyze the performance and the effects of these constraints for such non-machine-learning applications.

Computed tomography (CT) is one such application; it has been widely applied in non-destructive testing and contact-free inspection, such as medical imaging, age determination, and industrial materials testing [16]. Two main challenges in CT are decreasing radiation exposure to reduce the risk of biological and material damage, and improving the reconstruction process by decreasing processing time while increasing resolution. In this regard, several hardware- and software-based approaches have been proposed to address these concerns [10, 25]. Compared to the traditional methods used to reconstruct CT images, Iterative Reconstruction (IR) methods provide superior images but require more processing resources. Thanks to improvements in processor performance, IR is now considered the preferred reconstruction method due to the simultaneous noise reduction and quality improvement it can provide [25]. Although it has recently received increasing attention from the high-performance computing domain, reconstruction is still a resource-hungry application [27, 28]. Thus, novel parallel architectures provide new opportunities for accelerating IR, and given the growing number of real-time applications in both medical and non-medical scanning, many areas would benefit from this technology.

In this regard, we demonstrate the application of NVIDIA Tensor Cores to accelerating the CT forward-projection (FP) and back-projection (BP) algorithms, which are among the most demanding kernels in iterative reconstruction approaches. For this purpose, the distance-driven projection method is used to build a system matrix. Although any other projection method could be used in our approach, distance-driven projection is chosen because of its fast and accurate reconstruction [5, 6]. To take advantage of GPU hardware acceleration, the Tensor Cores are used to perform the forward-projection by multiplying the system matrix with a vector of the image pixel intensities, and the back-projection by multiplying the transpose of the system matrix with a vector of the sinogram intensities. Because Tensor Cores operate in mixed 16-bit/32-bit floating-point precision, we expect a loss in accuracy compared to a purely 32-bit floating-point implementation. Therefore, our goal is also to investigate whether this loss in accuracy is acceptable for CT reconstruction applications.

The contributions of this paper can be summarized as follows:

  • Application of Tensor Cores to a non-machine learning application; in particular, we discuss how to deal with several algorithmic challenges when using Tensor Cores.

  • A pseudo-Morton ordering algorithm is proposed to improve the sparsity pattern of the system matrix, which will improve performance and memory utilization.

  • An approach is proposed for exploiting the symmetry of the system and to further reduce the required system matrix memory.

Although, as far as we are aware, applying Tensor Cores to this purpose has not been reported yet, GPU-accelerated (non-Tensor Core) reconstruction algorithms have already been reported for both CT and Positron Emission Tomography (PET) applications, e.g., [13, 17, 26]. The work presented in [9] developed a fast GPU-based algorithm for reconstructing high-quality images from under-sampled and noisy data. Xie et al. [29] proposed a 10–16 times faster CUDA-based implementation of the projection step in iterative CT reconstruction algorithms. Both Li et al. [16] and Sabne et al. [23] implemented GPU-optimized Model-Based Iterative Reconstruction (MBIR). Finally, Hidayetoğlu et al. [14] proposed a GPU-based memory-centric X-ray CT reconstruction implemented using multi-stage buffering, pseudo-Hilbert ordering, and sparse matrix–vector multiplication (SpMV). Because of its memory-centricity, it shares some similarities with our work, such as dealing with sparse matrix data structures and using space-filling curves. However, our work focuses specifically on the application of Tensor Cores and their limitations (e.g., programming and precision limitations) in a non-machine-learning application. We use different data structures and an ordering method with lower computational complexity than the ordering algorithm used in their work.

This paper is organized in five sections. Background information on the concepts used later in the paper is presented in Sect. 2. Section 3 presents the proposed method in detail. The experimental results and a discussion are presented in Sect. 4. Section 5 contains the conclusion.

2 Background information

In this section, we give some background information on the three main concepts we build our approach on: (1) the distance-driven projection method, (2) representations of large sparse matrices, and (3) NVIDIA Tensor Cores.

2.1 Distance-driven projection

Projection and back-projection in CT reconstruction are typically conducted using one of three main methods: pixel-driven, ray-driven, and distance-driven [5]. The pixel-driven method passes a ray from the source through the center of each pixel of interest to the detector cells and then interpolates the contribution of the two adjacent detector cells to this projected pixel value. The ray-driven method passes a ray from the source through the image to the center of each detector cell of interest and then integrates the pixel values along this ray to calculate the projection value for that detector cell. The distance-driven method, on the other hand, converts the projection–backprojection problem into a one-dimensional re-sampling problem by mapping both the detector cell boundaries and the pixel boundaries onto a common axis, e.g., the x-axis [5, 6]. It then calculates the length of overlap to determine the contribution of each pixel to the corresponding detector cell.

Because grid artifacts tend to appear in the image domain when using ray-driven back-projection and in the projection domain when using pixel-driven projection, the ray-driven method is well suited for forward-projection, while the pixel-driven method works well for back-projection [5, 6].

The distance-driven method usually leads to better reconstructed image quality than both the ray-driven and pixel-driven methods. Its projection and back-projection operations are matched, or symmetric, and can be used in iterative reconstruction algorithms where many projections and back-projections are performed [24]. In addition, this method has the potential for better performance because of its sequential memory access pattern (similar to the pixel-driven method) and relatively low arithmetic complexity (comparable to the ray-driven method), which make it more suitable for hardware implementations [5, 6].

Figure 1 shows a schematic of the distance-driven projection for a given row, in which the pixel boundaries and the detector cell boundaries are both mapped onto the x-axis. In Fig. 1, detector boundaries are denoted by \(X_i\), pixel boundaries by \({X_i^\prime }\), detector values by \(d_i\), and pixel values by \(P_i\). In this example, the value of detector \(d_1\) can be calculated by:

Fig. 1
figure 1

Distance-driven projection, where the detector and the pixel boundaries are mapped onto the x-axis

$$\begin{aligned} d_1 = \frac{ ({{X_1^\prime }} - {X_1}) P_0 + ({ {X_2} - {X_1^\prime }}) P_1}{{X_2} - {X_1}}, \end{aligned}$$

which is a linear equation in \(P_0\) and \(P_1\). This means that the distance-driven projection describes a linear transformation from the image domain to the projection domain. As described in Sect. 3, this linear transformation can be represented by a large sparse system matrix.
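As a quick illustration, the detector value in the equation above can be computed directly from the mapped boundary positions. The sketch below uses hypothetical boundary values on the common axis:

```python
def distance_driven_value(X1, X2, X1p, P0, P1):
    """Detector value d1 from the overlap of pixel boundaries with the
    detector cell [X1, X2], where X1' (X1p) lies between X1 and X2."""
    return ((X1p - X1) * P0 + (X2 - X1p) * P1) / (X2 - X1)

# Hypothetical boundary positions on the common x-axis
d1 = distance_driven_value(X1=0.0, X2=1.0, X1p=0.4, P0=10.0, P1=20.0)
print(d1)  # overlap-weighted average: 0.4*10 + 0.6*20 = 16.0
```

The result is simply a weighted average of the two pixel values, with weights given by the overlap lengths.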

2.2 Representation of large sparse matrices

When multiplying a large sparse matrix by a dense vector, even if the sparse matrix fits in memory, the bandwidth for reading it from memory limits performance. Consequently, most algorithms store the sparse matrix in a compressed format. Several compressed data structures achieve compression by storing only the non-zero values in a list, and the efficiency of these methods usually depends on the number of bytes required to store the indices or pointers of the non-zero values. However, when the application’s performance is essential, the required memory bandwidth must also be taken into account when choosing the data structure. The memory bandwidth depends strongly on the application’s memory access pattern, and as a result, there is a trade-off between memory usage and performance.

There are several well-known compressed data structures for storing sparse matrices in scientific computing, such as COO, CSC, and CSR [22]. Among these, compressed sparse row (CSR) is one of the most efficient and widely used standard formats.

The CSR format stores the non-zero values of each matrix row in consecutive memory locations, as well as an index to the first stored element of each row. In a common general-purpose version, CSR stores the sparse matrix using a floating-point array and two one-dimensional integer arrays for the column indices (CI) and row indices (RI). The non-zero values are stored in the floating-point array, and the column indices of these non-zero values are stored in the CI array. The RI array then stores the position of the first element of each row inside the floating-point array. Since the RI array contains one index per row of the sparse matrix (plus a terminating index), only \(m+1+\mathrm {nnz}\) indices are required to store an \( m \times n \) sparse matrix containing \(\mathrm {nnz}\) non-zero values.
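The layout can be sketched in a few lines (variable names are ours, not taken from any particular library):

```python
def to_csr(dense):
    """Build CSR arrays from a dense matrix given as a list of rows:
    values (non-zeros), CI (column indices), RI (start of each row)."""
    values, ci, ri = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                ci.append(j)
        ri.append(len(values))  # first element of the next row
    return values, ci, ri

A = [[5, 0, 0],
     [0, 0, 8],
     [0, 3, 0]]
values, ci, ri = to_csr(A)
print(values)  # [5, 8, 3]
print(ci)      # [0, 2, 1]
print(ri)      # [0, 1, 2, 3]
```

Row i of the matrix is recovered as the slice `values[ri[i]:ri[i+1]]` with columns `ci[ri[i]:ri[i+1]]`.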

Another compression format that is more suitable for applications requiring fetching a block of data from memory is the Block Compressed Sparse Row (BSR) format [22]. The BSR method stores two-dimensional non-empty blocks of elements instead of storing only the non-zero values, and uses a similar indexing scheme to the CSR method. Although block-based formats are less memory efficient, they can provide better performance for applications that work on blocks of data, such as the application we present here. Additional details on the data structure used in this work are presented in Sect. 3.2.

2.3 Tensor Cores

As mentioned in the Introduction, expanding AI and deep-learning applications motivated NVIDIA to release its recent architectures, Volta, Turing, and Ampere, with new specialized execution units designed specifically for performing matrix operations, called Tensor Cores. A Tensor Core is capable of performing one matrix-multiply-and-accumulate operation on a \(4\times 4\) matrix in one GPU clock cycle. In its first version, introduced by the Volta architecture, the Tensor Core could perform computations on half-precision (IEEE 754-2008 binary16) and mixed-precision floating-point numbers. In both modes, Tensor Cores perform the matrix multiplication in half precision, but the accumulation can be performed in half or single precision. NVIDIA’s Turing architecture, the successor to Volta, introduced an enhanced version of the Tensor Core by adding new integer modes, INT8 and INT4, for applications that can tolerate reduced-precision computations, such as neural networks. In this architecture, the Streaming Multiprocessor (SM) consists of four processing blocks that NVIDIA refers to as Sub-Cores, each with two Tensor Cores and one warp scheduler.

Although Tensor Cores perform \(4\times 4\) matrix multiplication at the hardware level, the programming interface exposes the matrix operations only at the warp level and in terms of larger matrices. This interface, the Warp Matrix Multiply-Accumulate (WMMA) API, gives the programmer access to the Tensor Cores as warp-wide operations. This means that all threads in a warp must cooperatively work together to perform the computation of \(\varvec{D} = \varvec{A} \times \varvec{B} + \varvec{C}\), where \(\varvec{A}, \varvec{B}, \varvec{C}\) and \(\varvec{D}\) can be tiles of larger matrices. However, the programmer does not have direct access to the \(4\times 4\) matrix multiplications. Instead, the possible matrix sizes, called fragment sizes, are specified by the WMMA API.

The fragment sizes are represented using the notation \(M \times N \times K\), where \(M \times K\) is the dimension of Tile \(\varvec{A}\), \(K \times N\) is the dimension of Tile \(\varvec{B}\), and thus \(\varvec{C}\) and \(\varvec{D}\) have dimension \(M \times N\). Table 1 presents the possible configurations according to the fragment sizes available in CUDA 10.2 (the latest version at the time of writing) for half-precision mode.

Table 1 Available configurations based on the fragment size

To program the Tensor Cores, CUDA code must be adapted to (1) use the WMMA API and (2) use the available fragment sizes. For matrix multiplications of sufficiently large matrices, the fragment size can be chosen freely. However, for other tasks, the algorithm may have to be modified to fit the Tensor Core architecture.
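Conceptually, one warp-wide WMMA operation computes \(\varvec{D} = \varvec{A} \times \varvec{B} + \varvec{C}\) on fragments of the stated shapes. The pure-Python sketch below emulates this on the host for a single \(16 \times 16 \times 16\) fragment; it only illustrates the data shapes involved and is not actual WMMA code:

```python
def wmma_fragment(A, B, C, M=16, N=16, K=16):
    """Emulate D = A @ B + C for one M x N x K fragment
    (matrices given as lists of lists)."""
    assert len(A) == M and len(A[0]) == K
    assert len(B) == K and len(B[0]) == N
    return [[C[i][j] + sum(A[i][k] * B[k][j] for k in range(K))
             for j in range(N)] for i in range(M)]

# One fragment of all-ones inputs with a constant accumulator
A = [[1.0] * 16 for _ in range(16)]
B = [[1.0] * 16 for _ in range(16)]
C = [[0.5] * 16 for _ in range(16)]
D = wmma_fragment(A, B, C)
print(D[0][0])  # each entry is 16 * 1.0 + 0.5 = 16.5
```

A real kernel would tile the full matrices into such fragments, with each warp loading its fragments, calling the WMMA operation, and storing the result.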

From a reconstruction quality perspective, accuracy may be reduced due to the half-precision floating point used in the matrix multiplication, compared to single-precision floating-point calculations. Therefore, the choice of using Tensor Cores presents an ‘accuracy versus speed-up’ trade-off. The question is then whether this decrease in accuracy is acceptable for practical applications. We investigate this further in Sect. 4.
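The magnitude of the half-precision rounding error can be illustrated with Python’s `struct` module, which supports the IEEE 754-2008 binary16 format:

```python
import struct

def to_half(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

x = 0.1
print(to_half(x))           # 0.0999755859375
print(abs(to_half(x) - x))  # rounding error on the order of 1e-5
```

With only 10 mantissa bits, individual values carry relative errors around \(10^{-4}\), and accumulating many such products amplifies the error; this is why Tensor Cores optionally accumulate in single precision.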

3 The proposed approach

3.1 Overview

The distance-driven projection describes a linear transformation from the image domain to the projection domain, as described in Sect. 2.1. Thus, reconstruction from projection can be considered as solving a system of linear equations:

$$\begin{aligned} \varvec{A} \varvec{x} = \varvec{b}, \end{aligned}$$

where \(\varvec{A}\) is a system matrix that describes the distance-driven projection, \(\varvec{x}\) is an \(N \times 1\) column vector that represents the intensities of the image, and \(\varvec{b}\) is the projection data for each angle and detector position collected by the sensors. In this notation, vectors are obtained from images using column stacking. The dimensions of the system matrix are determined by the number of projection angles, m, the number of projections per angle, k, and the image dimensions, \(w \times h\). Therefore, each entry of the system matrix has two sub-indices, ij and rq, where the sub-indices ij refer to the pixel location and the indices rq refer to the projection number and ray angle. That is, \(\varvec{A}\) is an \(M \times N\) rectangular matrix where \(M=m k\) and \(N=w h\).

The reconstruction problem may be either over-determined or under-determined. Consequently, the system is not uniquely solvable in general. Instead, we use least-squares minimization of \(||\varvec{A} \varvec{x} - \varvec{b}||^2\); its solution is given by the pseudo-inverse:

$$\begin{aligned} \hat{\varvec{x}} = (\varvec{A}^T\varvec{A})^{-1}\varvec{A}^T \varvec{b}. \end{aligned}$$

Due to the size of the matrix \(\varvec{A}\), direct inversion of \(\varvec{A}^T\varvec{A}\) is not practical; therefore most methods in the literature revert to iterative approaches. In this work, the pseudo-inverse is calculated iteratively using the Conjugate Gradient method (Algorithm 1). The resulting iterative CT reconstruction is carried out as iterations that compute residual and gradient matrices by performing forward-projection (\(\varvec{A} \varvec{x}\)) and its adjoint, back-projection (\(\varvec{A}^T \varvec{b}\)), between the image and projection domains. Therefore, each iteration contains two system matrix multiplications. Even if the image and projection sizes are as small as \(512 \times 512\) and \(720 \times 512\), respectively, more than \(380 \times 10^9\) bytes of memory are needed to store the system matrix when a 4-byte word is dedicated to each element, which is itself a great challenge. While the memory problem could be avoided by instead performing on-the-fly computations, the more than \(2 \times 10^{16}\) floating-point operations that would then be required for each projection are beyond the tolerance of most applications, even with the acceleration provided by NVIDIA Tensor Cores.
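The dense-storage figure quoted above follows directly from the stated dimensions:

```python
# Projection data: 720 angles x 512 detector cells; image: 512 x 512 pixels
M = 720 * 512            # rows of the system matrix
N = 512 * 512            # columns of the system matrix
bytes_dense = M * N * 4  # one 4-byte word per element
print(M, N)         # 368640 262144
print(bytes_dense)  # 386547056640, i.e. more than 380e9 bytes
```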

figure a
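A minimal dense sketch of the conjugate gradient iteration applied to the normal equations \(\varvec{A}^T\varvec{A}\varvec{x} = \varvec{A}^T\varvec{b}\) is given below. A tiny dense matrix stands in for the sparse system matrix; in the actual reconstruction, the two products per iteration are the Tensor Core-accelerated forward- and back-projections:

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def cg_normal_equations(A, b, iters=25):
    """Conjugate gradient on A^T A x = A^T b; A @ p is the forward-projection
    and A^T @ (.) is the back-projection of the reconstruction loop."""
    At = transpose(A)
    x = [0.0] * len(A[0])
    r = matvec(At, b)                  # initial gradient: back-projection of b
    p = r[:]
    rs = sum(ri * ri for ri in r)
    for _ in range(iters):
        Ap = matvec(At, matvec(A, p))  # forward- then back-projection
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < 1e-20:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Tiny over-determined system as a stand-in for the CT problem
A = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [2.0, 1.0, 2.0]
x = cg_normal_equations(A, b)
print([round(v, 6) for v in x])  # least-squares solution, approximately [1.0, 1.0]
```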

Fortunately, as we will explain below, the system matrix is sparse, so significantly less memory and far fewer floating-point computations are required after some pre-computation.

As previously mentioned, the distance-driven projection method is used in this work to generate the elements of the system matrix. Accordingly, each row of the system matrix gives the contribution of every image pixel to a single ray going through the image. Since only a small percentage of the pixels contribute to any individual ray, most of the system matrix elements are equal to zero. This means the system matrix \(\varvec{A}\) can be considered a large sparse matrix, and the required memory can therefore be greatly reduced by using the data formats described in Sect. 2.2. Due to the requirements of Tensor Cores, the BSR format with a non-square block size is used in this work. The block sizes (the size of Tile B) are defined based on the fragment sizes of the Tensor Cores, as presented in Table 1.

Since the efficiency of the sparse matrix data structures depends on the sparsity pattern of the system matrix, an application-specific coordinate mapping algorithm is used in this work to improve the efficiency of the proposed approach. The high-level block diagram of the proposed approach is presented in Fig. 2. As we will explain, the coordinate mapping algorithm is also applied to the image and the projection matrices in addition to the system matrix.

Finally, we will show that by exploiting the system’s symmetry, further improvements are possible in both computational performance and the forward-projection’s memory requirements.

In the following sections, we will explain the BSR data structure used to store the sparse system matrix, the application-specific coordinate mapping algorithm to improve the sparsity pattern, the method used to exploit the symmetry of the system, and finally, the complete structure of the proposed method.

Fig. 2
figure 2

High-level block diagram of the proposed approach

3.2 Data structure

Due to the hardware characteristics of Tensor Cores and the requirements that are specified by the WMMA APIs, the BSR data structure has been implemented in this work for efficient storage of the system matrix. In the BSR format, non-empty blocks are the building elements, instead of the non-zero values that are the basic elements of the CSR method. In this format, a block containing at least one non-zero element is considered a non-empty block.

Given \(\alpha \) and \(\beta \) as the block size parameters, the BSR method partitions the \(M \times N\) sparse matrix \(\varvec{A}\) into \(M/\alpha \times N/\beta \) blocks. No compression is applied inside the blocks, so blocks contain both zero and non-zero values, and a fixed amount of memory is allocated for each non-empty block. An example application of the BSR method is displayed in Fig. 3. As shown, a one-dimensional floating-point array, \(blocks[\alpha \times \beta \times NEB]\), is allocated to store the values of the non-empty blocks, and two one-dimensional integer arrays, \(row\_indices[1+M/\alpha ]\) and \(col\_indices[NEB]\), store the indices of these non-empty blocks, where NEB is the number of non-empty blocks.
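A minimal sketch of this block-wise storage (array names match the description above; the layout of any particular library may differ):

```python
def to_bsr(dense, alpha, beta):
    """Store non-empty alpha x beta blocks of a dense matrix in BSR form.
    Returns a flat blocks array plus row_indices (length 1 + M/alpha)
    and col_indices (length NEB)."""
    M, N = len(dense), len(dense[0])
    blocks, col_indices, row_indices = [], [], [0]
    for bi in range(M // alpha):
        for bj in range(N // beta):
            block = [dense[bi * alpha + i][bj * beta + j]
                     for i in range(alpha) for j in range(beta)]
            if any(v != 0 for v in block):  # block with at least one non-zero
                blocks.extend(block)        # zeros inside the block are kept
                col_indices.append(bj)
        row_indices.append(len(col_indices))
    return blocks, row_indices, col_indices

A = [[1, 2, 0, 0],
     [0, 3, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 4, 0]]
blocks, ri, ci = to_bsr(A, alpha=2, beta=2)
print(ri)      # [0, 1, 2]
print(ci)      # [0, 1]
print(blocks)  # [1, 2, 0, 3, 0, 0, 4, 0]
```

Note that the two empty \(2 \times 2\) blocks are not stored at all, while the zeros inside the two non-empty blocks are stored explicitly.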

Finally, since the Tensor Cores perform mixed-precision matrix operations, the blocks array is stored as half-precision floating-point values.

Fig. 3
figure 3

An example of storing a sparse matrix using the BSR method

3.3 Pseudo-Morton ordering algorithm

The efficiency of using Tensor Cores for sparse matrix multiplications highly depends on the sparsity pattern of the matrix. Maximum performance will be achieved when the density of the non-empty blocks is maximal, which also means having the minimum number of non-empty blocks.

Although the sparsity pattern of the system matrix and, consequently, the sparsity pattern of the non-empty blocks depend on the physics of the system, this pattern can be improved by applying an appropriate mapping algorithm. However, the system matrix pattern must first be analyzed to determine the optimal mapping algorithm.

As mentioned in Sect. 3, the system matrix entries give the contribution of pixel ij to the ray rq passing from the source to the detector. Therefore, pixels close to each other are likely to contribute to the same rays passing through their region.

Given the locality properties of the system matrix, space-filling curves, and especially Morton space-filling curves, are good candidates [3, 8, 14, 19]. Morton codes are constructed by interleaving the bits of the integer coordinates representing a spatial position. Therefore, Morton codes map n-dimensional data points to a one-dimensional space while preserving the spatial locality of the data points. As a result, coordinates that are spatially close to each other in the n-dimensional space will have Morton numbers that are also close to each other.
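For reference, a standard 2D Morton encoding by bit interleaving looks as follows (this is the plain Morton code, not the pseudo-Morton variant of this work):

```python
def morton_encode(x, y, bits=16):
    """Interleave the bits of (x, y) into a standard 2D Morton code."""
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (2 * b)      # x bits at even positions
        code |= ((y >> b) & 1) << (2 * b + 1)  # y bits at odd positions
    return code

# Neighbouring coordinates map to nearby Morton codes
print(morton_encode(0, 0))  # 0
print(morton_encode(1, 0))  # 1
print(morton_encode(0, 1))  # 2
print(morton_encode(1, 1))  # 3
print(morton_encode(2, 2))  # 12
```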

In addition, since the Morton code has a lower computational cost than other space-filling curves, a pseudo-Morton mapping is applied to the system matrix in this work. For efficiency, our pseudo-Morton mapping does not convert the coordinate indices all the way to Morton codes, but instead performs bit-shuffling operations directly on the coordinate indices themselves, hence the name pseudo-Morton. We assume blocks of size \(B_x \times B_y\), where \(B_x\) and \(B_y\) are both powers of 2. The main idea is that each subsequent range of \(B_x B_y\) 1D indices corresponds to 2D coordinates that all lie within the same block of size \(B_x \times B_y\). It is not necessary to map the \(\log _2(B_x)+\log _2(B_y)\) least significant bits onto Morton codes, as this does not bring any improvement to the rest of our algorithm; this allows the operation to be implemented using a limited number of bitwise operations. The proposed mapping algorithm is shown in Alg. 2, where \(B_x\) and \(B_y\) are the mapping block sizes.

The mapping algorithm is applied to both the row indices and the column indices. Hence, to reverse the effect of the permutation applied to the system matrix, the mapping algorithm is also applied to the images before performing the multiplications, and to the result matrices.

figure b
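The key property of the mapping, namely that each consecutive run of \(B_x B_y\) 1D indices covers exactly one \(B_x \times B_y\) block, can be sketched with a simple block-linear mapping. This is an illustrative stand-in for, not a reproduction of, the bit-shuffling operations of Alg. 2:

```python
def block_linear_index(x, y, W, Bx, By):
    """Map 2D coordinates to a 1D index so that each consecutive run of
    Bx*By indices covers exactly one Bx x By block (W, Bx, By powers of 2).
    Illustrative stand-in for the pseudo-Morton mapping of Alg. 2."""
    block = (y // By) * (W // Bx) + (x // Bx)  # which block, row-major
    local = (y % By) * Bx + (x % Bx)           # position inside the block
    return block * (Bx * By) + local

# With Bx=4, By=2 and an 8-pixel-wide image, the first 8 indices
# all fall inside the top-left 4x2 block:
W, Bx, By = 8, 4, 2
first_block = sorted(block_linear_index(x, y, W, Bx, By)
                     for y in range(2) for x in range(4))
print(first_block)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

A dense BSR block then corresponds to a contiguous index range, which is what makes the non-empty blocks denser after mapping.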

3.4 Exploiting symmetries of the system matrix

Symmetry can be considered one of the most important properties of any system, especially when dealing with system matrices. In a symmetric system, one part of the system matrix becomes exactly the same as another part after applying certain transformations, such as rotation or flipping of the input image; as a result, the size of the system matrix can be greatly reduced.

In addition, symmetric system matrices are of great interest for parallel structured computing architectures. By loading a single part of the system matrix from memory, several parallel kernels can perform their computations. However, the extra computation required for performing the transformation can itself limit the maximum achievable performance [7].

In Sect. 3.1, it was explained that in forward-projection, \(\varvec{x}\) is a column vector representing a single input image. Since the Tensor Cores perform matrix–matrix rather than matrix–vector multiplications, \(\varvec{x}\) must be used as a 2D matrix with dimensions according to the fragment size selected for the Tensor Core operations. Filling the other columns of \(\varvec{x}\) with zeros is not an option, because the advantage of using Tensor Cores for acceleration would then disappear. A better alternative is to fill the matrix \(\varvec{x}\) with several input images. This works well, e.g., when multiple CT slices or volumes must be reconstructed in parallel (assuming the data fits in CPU or GPU memory). However, for real-time applications in which a time dimension is present (e.g., real-time reconstruction during a surgical procedure), it cannot be considered an ideal method, since latency is increased.

Our approach relies on splitting a single input image into several columns. Due to the symmetry of the system, this is possible by splitting the input image into several sub-images and then using the transformed versions of the resulting sub-images as the columns of matrix \(\varvec{x}\). In addition to exploiting the maximum performance of the Tensor Cores, this approach reduces the size of the system matrix by the same factor, which is advantageous for memory consumption.

However, there is one practical restriction on the possible transformations: the symmetry transformations should not impose heavy computations by requiring re-sampling of the pixel grid. Therefore, only \(90^\circ \) rotations and reflections, which do not require re-sampling of the pixel grid, are acceptable [7]. While other inexpensive sampling schemes could be considered [1], they are not required in our application because the 8 symmetries nicely match the minimum required number of columns of \(\varvec{x}\) (i.e., eight, see Sect. 2.3). Figure 4 shows the eight available transformations (combinations of \(90^\circ \) rotations and reflections, taking the restriction into account) and the section that is extracted from each transformed image to fill the columns of \(\varvec{x}\).
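The eight transformations can be enumerated with simple index permutations, without any re-sampling of the pixel grid (a minimal sketch for a square image):

```python
def dihedral_transforms(img):
    """The eight transforms of a square image obtained by combining
    90-degree rotations and reflections (no grid re-sampling needed)."""
    def rot90(m):
        return [list(row) for row in zip(*m[::-1])]
    def flip(m):
        return [row[::-1] for row in m]
    out, cur = [], img
    for _ in range(4):
        out.append(cur)        # rotation by 0, 90, 180, 270 degrees
        out.append(flip(cur))  # the same rotation followed by a reflection
        cur = rot90(cur)
    return out

img = [[1, 2],
       [3, 4]]
ts = dihedral_transforms(img)
print(len(ts))  # 8 transformed copies, one per column of x
print(ts[2])    # 90-degree rotation of img: [[3, 1], [4, 2]]
```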

In addition to the transformations applied to the input image, as displayed in Fig. 5, a new transformation kernel including shifts and flipping is applied to the columns of the projection matrix to reverse the initial image transformations. Then, the transformed columns of the projection matrix are summed to build the final projection image. Finally, a masking kernel is applied to the system matrix to remove the unused elements.

In this work, this idea is only applied to the forward-projection. The extension to back-projection, and consequently to the entire iterative reconstruction process, is left as future work.

Although the aforementioned system matrix symmetries are not present in all CT systems, there exist solutions to mitigate the lack of symmetry in certain cases. In particular, CT systems may contain a detector pixel offset with a fractional (non-integer) part, causing the symmetry to break. By performing band-limited re-sampling of the detector array (e.g., sinc-interpolation with a Kaiser window), any asymmetries due to the offsets can be avoided. Another solution is to quantize the slopes of the rays (i.e., the tangents of the ray angles) to rational numbers, which makes the weight functions periodic as a function of the system matrix row and/or column. While this technique can also be exploited to reduce the system matrix memory requirements, we did not consider it for this work because it involves an approximation of the ray angles, which may degrade the accuracy of the reconstruction.

Fig. 4
figure 4

The eight available transformations, obtained by mirroring, flipping and \(90^\circ \) rotation

Fig. 5
figure 5

High-level block diagram of the forward-projection using the proposed approach for exploiting the symmetry of the system

Finally, the back-projection uses the transposed version of the forward-projection matrix. Since the transposition can be done on-the-fly when reading out the matrix coefficient blocks, no separate data structure is required for the back-projection. Remarkably, this observation also applies to the implementation with the eight symmetry transformations.
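The on-the-fly transpose can be illustrated with the scalar CSR analogue: \(\varvec{y} = \varvec{A}^T \varvec{v}\) is computed from the same arrays used for \(\varvec{A} \varvec{x}\), by scattering row contributions instead of gathering them (a minimal sketch; the actual implementation works on BSR blocks):

```python
def csr_matvec(values, ci, ri, x):
    """y = A x from CSR arrays (gather along each row)."""
    return [sum(values[k] * x[ci[k]] for k in range(ri[i], ri[i + 1]))
            for i in range(len(ri) - 1)]

def csr_transpose_matvec(values, ci, ri, n_cols, v):
    """y = A^T v from the *same* CSR arrays, scattering row contributions;
    no separate transposed data structure is built."""
    y = [0.0] * n_cols
    for i in range(len(ri) - 1):
        for k in range(ri[i], ri[i + 1]):
            y[ci[k]] += values[k] * v[i]
    return y

# A = [[1, 0], [2, 3]] in CSR form
values, ci, ri = [1.0, 2.0, 3.0], [0, 0, 1], [0, 1, 3]
print(csr_matvec(values, ci, ri, [1.0, 1.0]))               # [1.0, 5.0]
print(csr_transpose_matvec(values, ci, ri, 2, [1.0, 1.0]))  # [3.0, 3.0]
```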

4 Results

In this section, the experimental setup and the sizes considered for the images and projections are explained. After that, the efficiency of the applied mapping algorithm in reducing the size of the system matrix is presented. The results of the proposed iterative reconstruction are given next, followed by the measured reconstruction error. As mentioned in Sect. 3.4, the proposed idea for exploiting the symmetries of the system matrix is only implemented for the forward-projection. Thus, the last results are dedicated to the projection of a single image split into 8 sectors based on the proposed idea.

4.1 Experimental setup and configurations

To evaluate the performance and efficiency of the proposed method, the experiments have been carried out on a system containing a GeForce RTX 2080 Ti, an i7-9800X CPU, and 32GB of DDR4 RAM. The operating system was Ubuntu 18.04 and CUDA 10.2 was used to build the kernels.

The NVIDIA GeForce RTX 2080 Ti has a TU102 GPU, which is based on the Turing architecture. The full implementation of the TU102 includes more than 18 billion transistors produced using TSMC’s 12 nm FFN process and contains 72 Streaming Multiprocessors (SMs), each with 64 CUDA cores and 8 Tensor Cores, for a total of 4608 CUDA cores and 576 Tensor Cores. However, the GPU integrated in the RTX 2080 Ti includes 68 SMs and 544 Tensor Cores. Each Tensor Core can perform up to 64 floating-point fused multiply–add (FMA) operations per clock using FP16 inputs. The 8 Tensor Cores in an SM can thus perform a total of 512 FP16 multiply–accumulate operations per clock, or 1024 floating-point operations per clock.

The evaluations are performed with a fixed projection size of \(720 \times 512\) and six different input image sizes: \(512 \times 512\), \(384 \times 384\), \(320 \times 320\), \(256 \times 256\), \(128 \times 128\), and \(64 \times 64\). Specifications of the system matrix for the different image sizes are presented in Table 2.

Table 2 System matrix specifications

4.2 Mapping algorithm

Our approach’s computational and memory efficiency strongly depends on the sparsity pattern of the system matrix, which can be controlled by adjusting the parameters of the mapping algorithm. In practice, we found the choice \(B_x=4, B_y=2\) to give good sparsity patterns. A visualization of its effect is shown in Fig. 6. Figure 6a, b display a random section of the system matrix before and after applying the mapping algorithm, where the black points indicate the presence of non-zero values.

Table 3 presents the percentage of non-empty blocks (which in fact determines the size of the system matrix) for the available fragment sizes. As shown in Table 3, the mapping code reduces the matrix size by a factor of 2 to 3. In addition, it shows that the minimum percentage of non-empty blocks, and consequently the smallest system matrix, is achieved for fragment size 8.32.16. Therefore, it can be expected that the best performance will also be achieved for this fragment size.

As Table 3 shows, fragment size 32.8.16 yields the worst result of the evaluated fragment sizes. However, exploiting the symmetry of the system through the eight available transformations, as explained in Sect. 3.4, requires exactly this fragment size. This restriction justifies further investigation of other inexpensive sampling schemes that would increase the number of available transformations to 16 or even 32 for the fragment sizes 16.16.16 and 8.32.16, respectively.

Fig. 6

Visualizing the effect of the mapping algorithm on the sparsity pattern of the system matrix: (a) a random section of the system matrix before mapping; (b) the same section after mapping

Table 3 Number of non-empty blocks (NEBs) before and after applying the pseudo-Morton mapping code

4.3 Evaluation of Tensor Cores for iterative reconstruction

In this section, we present the results of the proposed iterative reconstruction, which consists of both accelerated forward-projection and back-projection. To allow a better comparison of the performance of Tensor Cores for this application, several non-tensor approaches are evaluated: ASTRA [26], Tigre [2], and Tomopy [12]. All three are open-source libraries for high-performance 2D and 3D tomographic data processing and image reconstruction. For the ASTRA toolbox, we have tested three different reconstruction methods: CGLS, SIRT, and SART. We have used the CGLS method for the Tigre toolbox and the SIRT method for Tomopy.

In addition to these, we have implemented four other non-tensor GPU-accelerated approaches. The half2 implementation employs NVIDIA’s native half-precision arithmetic [15] through the half2 datatype, since the maximum throughput for half-precision operations is achieved when using half2 [15]. The second implementation uses the CUDA sparse matrix library (cuSPARSE) with the basic CSR format and single-precision (32-bit) floating-point data. The third approach, implemented in Quasar [11], performs the iterative reconstruction using on-the-fly computation without pre-computing the system matrix. This reference method is included to assess the impact of memory latency versus computation latency: it calculates the weights on the fly, which is computationally much more costly but avoids weight memory accesses. The last implementation is a Quasar implementation of the MLEM reconstruction method [20].

In this experiment, assuming a multi-slice fan-beam method, several images (slices) are reconstructed in parallel by loading them into the columns of the image matrix. The image matrix \(\varvec{x}\) must have at least 8, 16, and 32 columns for the fragment sizes 32.8.16, 16.16.16, and 8.32.16, respectively, to allow straightforward mapping onto the GPU Tensor Cores. To reconstruct the same number of images in all configurations, however, we assume that 32 images are reconstructed in parallel for all fragment sizes.
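A minimal NumPy emulation of one such \(m \times n \times k\) fragment operation for the 8.32.16 configuration may clarify the mapping; the shapes follow the text, while the random data is illustrative and the FP32 accumulation mirrors the Tensor Core convention (a sketch, not the actual CUDA kernel):

```python
import numpy as np

# One m x n x k multiply-accumulate fragment for the 8.32.16 configuration:
# a k-wide FP16 block of the system matrix times n = 32 image columns
# (one column per parallel slice), accumulated in FP32.
m, n, k = 8, 32, 16
rng = np.random.default_rng(0)
a_frag = rng.standard_normal((m, k)).astype(np.float16)   # system matrix block
x_frag = rng.standard_normal((k, n)).astype(np.float16)   # 32 parallel slices
acc = np.zeros((m, n), dtype=np.float32)                  # FP32 accumulator
acc += a_frag.astype(np.float32) @ x_frag.astype(np.float32)
```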

For single-image reconstruction, we explained in Sect. 3.4 that the symmetry of the system can be exploited to reconstruct single images in the 32.8.16 configuration. We note again that this symmetry-exploiting approach is not applied in this experiment.

The average reconstruction time per image for a single iteration is presented in Table 4 for the evaluated approaches. Since our implementation is a gradient-based method, it is best compared with other gradient-based implementations, because each method requires a different number of iterations for convergence and has a different error at convergence. Thus, while we report the execution times of the different methods (including CGLS, SART, SIRT, and MLEM), the reconstruction error (see Sect. 4.4) should be taken into account when comparing them. We emphasize that our approach is not limited to the conjugate gradient method; it is a general approach that can work with other reconstruction methods.
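For context, the gradient-based CGLS scheme underlying this comparison can be sketched as a minimal dense loop; this is an illustrative reference implementation for \(\min_x \Vert A x - b \Vert_2\), not the Tensor Core code:

```python
import numpy as np

def cgls(A, b, iters, tol=1e-12):
    """Minimal dense CGLS sketch for min ||A x - b||_2 (illustration only)."""
    x = np.zeros(A.shape[1])
    r = b - A @ x                 # residual in projection space
    s = A.T @ r                   # gradient direction
    p = s.copy()
    gamma = float(s @ s)
    for _ in range(iters):
        q = A @ p
        alpha = gamma / float(q @ q)
        x += alpha * p
        r -= alpha * q
        s = A.T @ r
        gamma_new = float(s @ s)
        if gamma_new < tol:       # converged
            break
        p = s + (gamma_new / gamma) * p
        gamma = gamma_new
    return x
```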

According to Table 4, the maximum performance is achieved for the fragment size 8.32.16, which is faster than all other implementations and about 5 times faster than the cuSPARSE implementation. Compared to the on-the-fly method, this result shows the advantage of storing the weights in memory and hence using the proposed BSR data structure. Regarding the comparison with the ASTRA SART implementation, it should be noted that the SART method processes only a single projection direction in each iteration.

Table 4 also presents the average energy consumed per image by the NVIDIA GPU during the execution of a single iteration. The average energy consumption values are obtained by polling the NVML APIs during kernel execution. According to Table 4, the proposed approach consumes 3–5 times less energy than the other implementations.

Finally, Table 5 presents the required memory for storing the system matrix for the cuSPARSE implementation (the basic CSR with 32-bit single-precision floating-point data type) and Tensor Core implementation (the BSR data structure explained in Sect. 3.2 with half-precision floating-point data type). Since the Half2 implementation uses the same data structure as the Tensor Core method, it is not presented in Table 5.

Table 4 Reconstruction time and energy consumption per image (for a single iteration)
Table 5 Memory usage for storing the system matrix (GigaBytes)

4.4 Reconstruction error

Since Tensor Cores perform mixed-precision floating-point operations, a larger numerical error is expected than in the case of single-precision floating-point calculations. Therefore, we measure the reconstruction error in comparison with 32-bit floating-point reconstructions. In this paper, the relative reconstruction error is calculated as follows:

$$\begin{aligned} \mathrm{{Error}} = {\left\Vert P - P^\mathrm{o}\right\Vert _{F}}/ {\left\Vert P^\mathrm{o}\right\Vert _{F}}, \end{aligned}$$

where \(P^\mathrm{o}\) is the original image and P is the reconstructed image generated using the different reconstruction implementations. \({\left\Vert . \right\Vert _{F}}\) denotes the Frobenius norm.
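This error metric is straightforward to compute; a minimal NumPy sketch (note that np.linalg.norm defaults to the Frobenius norm for 2-D arrays):

```python
import numpy as np

def relative_error(p, p_o):
    """Relative reconstruction error ||P - P^o||_F / ||P^o||_F."""
    return np.linalg.norm(p - p_o) / np.linalg.norm(p_o)
```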

For measuring the error, a dataset containing 23 input images of class uint8 with the size of \(512 \times 512\) was used. The input images are reconstructed images, collected from existing CT scanners (e.g., the X-O CT system, Gamma Medica-Ideas, Northridge, California, USA). Figure 7 displays four examples of the dataset. To generate the projection data required for reconstruction, we have used the distance driven forward-projection for the Tensor core and Quasar implementations, and we have used the “line-fanflat” algorithm for the ASTRA implementation.

Figure 8 presents the average error versus the number of iterations for the different implementations. It shows that the conjugate gradient-based methods perform well on the provided dataset. Although a larger error was expected when using the Tensor Cores due to the mixed-precision datatype, Fig. 8 shows that the accuracy of the Tensor CG and the non-Tensor (32-bit CG) implementations is nearly equivalent, indicating that the mixed-precision computation offered by Tensor Cores has the potential to be used for CT reconstruction.

Fig. 7

Four example images of the dataset used for error measurement

Fig. 8

Average relative reconstruction error

4.5 Forward-projection by exploiting the symmetry of the system

The last set of results evaluates the projection of a single image split into 8 sectors, based on the proposed method and the transformations for exploiting the symmetry of the system explained in Sect. 3.4. This approach has numerical accuracy identical to the multi-slice approach when the detector pixel offset is zero. The execution time of the forward-projection, the consumed energy, and the memory usage for storing the system matrix are presented in Tables 6 and 7. The achieved results not only show the efficiency of the proposed method for real-time applications, but also indicate that the reasonable size of the system matrix warrants extending this method to the whole reconstruction process.
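For illustration, the eight symmetries of a square image (rotations by multiples of 90° with an optional flip) can be enumerated as below. This generic sketch stands in for the sector transformations of Sect. 3.4; the exact sector indexing used in the paper is not reproduced here.

```python
import numpy as np

def eight_symmetries(img):
    """Enumerate the eight symmetries of a square image:
    rotations by 0/90/180/270 degrees, each with and without a flip."""
    out = []
    for quarter_turns in range(4):
        rotated = np.rot90(img, quarter_turns)
        out.append(rotated)
        out.append(np.fliplr(rotated))
    return out
```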

Table 6 Execution time and energy consumption of the forward-projection kernel
Table 7 Memory usage for storing the system matrix (GigaBytes)

5 Conclusion

In this paper, we demonstrated the feasibility of using NVIDIA Tensor Cores for accelerating iterative CT reconstruction, as an example of a non-machine learning application. For this purpose, we used the Tensor Cores to perform the sparse matrix–matrix multiply–accumulate operations of the projection and back-projection kernels. To overcome the hardware and software limitations of Tensor Cores, as well as the challenges of implementing a memory-centric iterative reconstruction, we proposed and implemented an approach that combines a compressed BSR data structure for storing the sparse system matrix, a pseudo-Morton encoding to improve the sparsity pattern of the system matrix, and a method to exploit the symmetry of the system, which together make this approach a practical solution for real-time or memory-limited CT applications.

The proposed approach was evaluated by measuring performance, power, accuracy, and memory usage. The presented performance results indicate that approximately a 5× speedup and corresponding energy savings are achieved with the proposed method, compared to a cuSPARSE-based implementation on the same GPU. The accuracy results show that the mixed-precision computing of Tensor Cores does not introduce any significant reconstruction error.

In future work, in addition to extending the proposed idea of exploiting symmetry to the entire reconstruction process, we plan to investigate the extension of our approach to cone-beam (3D) reconstruction and helical CT reconstruction.