Introduction

Although computed tomography (CT) is a widely used imaging modality in clinical practice [1], it inherently involves the hazard of radiation exposure [2]. Iterative reconstruction (IR) has facilitated technical efforts to reduce the radiation dose without compromising diagnostic performance [3,4,5].

Recently, however, deep learning–based image reconstruction algorithms (DLRs) have emerged as an alternative technical approach. For example, a DLR introduced by GE Healthcare (TrueFidelity™) was reported to reduce the radiation dose by 36–50% compared with IR, while decreasing image noise and improving lesion detectability [6,7,8,9,10]. However, because it is a vendor-specific algorithm, it can be installed and operated only on state-of-the-art GE Healthcare scanners. The need for a dose reduction technique applicable to any existing CT scanner was strongly reinforced by the coronavirus disease 2019 pandemic, which increased the demand for chest CT [11]. Therefore, the development and evaluation of vendor-agnostic deep learning–based algorithms that improve image quality at a lower radiation dose is a timely and necessary task.

A few vendor-agnostic techniques that can reduce image noise with non-inferior detectability compared to existing vendor-specific IR and DLR have been introduced [8, 12,13,14]. ClariCT.AI (ClariPI) is a vendor-agnostic deep learning model (DLM) that works in the image post-processing domain and does not need projection data. According to a recent study by Nam et al., for ultralow-dose CT, the DLM achieved better subjective image quality with fewer artifacts than did the DLR, although the vendor-specific DLR showed less noise and a higher signal-to-noise ratio than did the vendor-agnostic DLM [15]. However, that study assessed the subjective image quality at a single radiation dose level, and the diagnostic performance was not evaluated.

With the development of various CT techniques (e.g., IR), AAPM Task Group (TG) 233 proposed a guideline for evaluating the performance of CT images. This guideline includes a detectability index, which is a task-based detection performance metric that indicates the clinical performance in detecting lesions [16]. To the best of our knowledge, the dose reduction potentials of vendor-agnostic DLM and DLR have not yet been compared using the detectability index.

Therefore, in this study, we compared the dose reduction potential (DRP) of the vendor-agnostic DLM and DLR using the detectability index.

Materials and methods

Data acquisition and deep learning–based denoising algorithm

We used a multi-sized image quality phantom (Mercury v4.0 phantom, GAMMEX Sun Nuclear Corp.) suggested by AAPM TG 233 for assessing a CT system; it is a cylindrical polyethylene phantom with an expected background CT value of −90 HU at 120 kVp. It consists of five sections with different diameters (36/31/26/21/16 cm) and a total length of 52 cm. Each section includes five inlets representing solid water, bone-mimicking material, polystyrene, 10-mg/ml iodine, and air (Fig. 1), with expected CT value differences from the background of 90, 1000, 50, 335, and −895 HU, respectively [16]. We measured the background CT value by drawing a concentric circular ROI around each inlet, excluding the area of the inlet (Supplementary Appendix Figure E1).

Fig. 1

a Mercury phantom used in the study. b Schematic of the Mercury phantom (v4.0), which consists of five sections of different diameters joined by tapered transitional sections. c In each section, five 3-cm inlets are arranged radially at equal distances from the center

Images were acquired using Revolution CT (GE Healthcare) equipped with the TrueFidelity algorithm. Scanning was done under six radiation dose levels (volume CT dose index (CTDIvol): 15.47, 7.74, 3.87, 1.93, 0.97, and 0.48 mGy) with a tube voltage of 120 kV. We set the full-dose level in this study to 15.47 mGy by referencing the diagnostic reference level of abdominal pelvic CT, which is typically set at the 75th percentile of CTDIvol for patients with water equivalent diameters between 29 and 33 cm [17]. The applied CT parameters are listed in Table 1.

Table 1 CT scan parameters for scanning Mercury phantom v4.0

CT images of the Mercury phantom were reconstructed using the FBP and a DLR (TrueFidelity-L/M/H). The inference engine of TrueFidelity generates three selectable reconstruction strength levels (i.e., low, medium, high (L/M/H)) to control the degree of noise reduction. Reconstruction strength levels are qualitatively divided according to the noise reduction and are selectable based on the clinical application. A body kernel was used for reconstruction with FBP. Scanning was performed 10 times at each radiation dose level [16, 18]. To achieve a reliable estimate of the task-based transfer function (TTF), which represents the spatial resolution of the CT image, the contrast-to-noise ratio (CNR) should be at least 15. However, for low-contrast inlets with high image noise, the minimum CNR may not be achievable in a single slice. In that case, multiple slices should be averaged to lower the noise and increase the CNR [16, 18]. We followed the averaging method specified by AAPM TG 233 to achieve a total effective CNR greater than 15 [16, 18]. We obtained TTF to calculate a metric representing the CT image quality (i.e., detectability index), which will be described in detail later.
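The slice-averaging requirement can be illustrated with a minimal sketch. It assumes that the noise is uncorrelated between slices, so that averaging n slices raises the CNR by a factor of √n; this assumption and the function name `slices_needed` are illustrative and are not taken from the AAPM TG 233 report or from the analysis software.

```python
import math


def slices_needed(cnr_single: float, cnr_target: float = 15.0) -> int:
    """Estimate how many slices must be averaged to reach cnr_target.

    Assumption: noise is uncorrelated between slices, so averaging n slices
    divides the noise by sqrt(n) and multiplies the CNR by sqrt(n).
    """
    if cnr_single <= 0:
        raise ValueError("cnr_single must be positive")
    # Solve cnr_single * sqrt(n) >= cnr_target for the smallest integer n.
    return max(1, math.ceil((cnr_target / cnr_single) ** 2))


# Hypothetical example: a low-contrast inlet with a single-slice CNR of 5
n_slices = slices_needed(5.0)  # 9 slices under the stated assumption
```

In practice, inter-slice noise correlation would make the required number of slices larger than this estimate.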

Deep learning–based denoising from the FBP images was performed using the DLM [19]. The DLM is an image-based CT denoising solution having a U-net-based CNN, trained by taking a noise-added CT image as input to produce an original CT image as output. To generate the training dataset, a synthetic sinogram-based low-dose CT simulation technique was applied to create a paired set of low-dose and standard-dose CT images from a set of given standard-dose CT images [20]. To solve the problem of overfitting and to train a model that minimizes bias, we trained the DLM using various combinations of clinical and phantom images from four major vendors. Various vendor-specific datasets were used for the training, so that the DLM could acquire vendor-agnostic denoising capability. The training dataset comprised more than 1 million clinical CT images encompassing 2,100 different combinations of scan and reconstruction conditions, including various kVp, mAs, automatic exposure controls, slice thicknesses, contrast enhancements, and convolution kernels, with 24 scanner models from four different CT manufacturers (GE Healthcare, Siemens, Philips Healthcare, and Canon). In this study, we used the DLM version (ver.1.2.1), which was further trained using the images of standard QA phantoms such as AAPM, ACR, and Catphan considering phantom-based image quality assurance practices. The DLM performance has been previously evaluated in several clinical studies [12,13,14,15, 21].

Task-based image quality assessment

We used a dedicated analysis software tool (imQuest, Duke University) to evaluate the detectability index (d′) [22]. This index is a metric of the image quality that is specifically designed to represent the detectability of target objects; an increase in d′ suggests that the target object is easier to detect, and the image quality is better [16]. The detectability index consists of three primary components: a task to be performed, an observer to perform the task, and the image to be evaluated. The task is to detect a circular signal defined by a task function (Wtask). Wtask encodes the properties of the image (i.e., size, shape, contrast, and contrast profile), and these encoded profiles are used to calculate the detectability index. We used the target signals of designer profile with three sizes of circular signals (i.e., 1, 5, and 10 mm) for each of the five inlets. For the observer model, a non-pre-whitening (NPW) matched filter was adopted that correlates strongly with human performance in the low-contrast detection task [16].

To acquire the detectability index, the noise power spectrum (NPS) and TTF were first measured. The noise was measured in terms of the NPS to analyze the components of the Mercury phantom at each dose level. The TTF was measured as an indicator of the spatial resolution for each dose level and inlet. The spatial frequency at which the TTF reached 50% was indicated as TTF50% and was used to represent the system resolution under the given condition [16]. The change in TTF50% compared with the FBP was obtained with respect to the object and radiation dose. NPS and a detectability index were obtained according to the methods specified by AAPM task group 233.
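As an illustration of how TTF50% and its change relative to the FBP might be extracted from a sampled TTF curve, a minimal sketch follows. Linear interpolation between samples is an assumption, and the function names and sample values are hypothetical; the analysis software may use a different procedure.

```python
def ttf50(freqs, ttf):
    """Spatial frequency at which the TTF first falls to 0.5.

    Assumption: linear interpolation between adjacent samples.
    Returns None if the curve never crosses 0.5.
    """
    for i in range(1, len(freqs)):
        upper, lower = ttf[i - 1], ttf[i]
        if upper >= 0.5 >= lower and upper != lower:
            frac = (upper - 0.5) / (upper - lower)
            return freqs[i - 1] + frac * (freqs[i] - freqs[i - 1])
    return None


def percent_change(value, reference):
    """Percentage change of value relative to reference (e.g., the FBP)."""
    return (value - reference) / reference * 100.0


# Hypothetical sampled TTF curves (frequencies in mm^-1)
freqs = [0.0, 0.2, 0.4, 0.6]
f50_fbp = ttf50(freqs, [1.0, 0.9, 0.5, 0.2])
f50_dl = ttf50(freqs, [1.0, 0.8, 0.4, 0.1])
change = percent_change(f50_dl, f50_fbp)  # negative: resolution loss vs. FBP
```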

The d′ was obtained for DLM and DLR images using the methods introduced by AAPM TG 233 under various combinations of five inlets, five phantom diameters, and six radiation dose levels [16].
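For orientation, the commonly cited NPW form of the detectability index is d′² = [∬|Wtask|²·TTF² du dv]² / ∬|Wtask|²·TTF²·NPS du dv. The sketch below evaluates this expression numerically under the simplifying assumption that Wtask, TTF, and NPS are radially symmetric and sampled on a common radial frequency grid. It is not the imQuest implementation, and all names are illustrative.

```python
import math


def trapz(y, x):
    """Trapezoidal integration of samples y over grid x."""
    return sum(0.5 * (y[i] + y[i + 1]) * (x[i + 1] - x[i])
               for i in range(len(x) - 1))


def npw_dprime(rho, w_task, ttf, nps):
    """NPW detectability index from radially sampled inputs.

    rho    : radial spatial frequencies (mm^-1), ascending
    w_task : magnitude of the task function at each rho
    ttf    : task-based transfer function at each rho
    nps    : noise power spectrum at each rho
    Assumption: radial symmetry, so a 2D integral of f(rho) becomes
    2*pi times the 1D integral of f(rho)*rho.
    """
    signal = [(w * t) ** 2 * r for w, t, r in zip(w_task, ttf, rho)]
    noise = [(w * t) ** 2 * n * r
             for w, t, n, r in zip(w_task, ttf, nps, rho)]
    numerator = (2.0 * math.pi * trapz(signal, rho)) ** 2
    denominator = 2.0 * math.pi * trapz(noise, rho)
    return math.sqrt(numerator / denominator)


# Sanity check with flat inputs: d'^2 = pi * f_max^2 / NPS
rho = [i / 100 for i in range(101)]
flat = [1.0] * len(rho)
d_check = npw_dprime(rho, flat, flat, [math.pi] * len(rho))
```

With flat inputs and a constant NPS, the expression reduces to d′² = π·f_max²/NPS, which gives a quick consistency check of the integration.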

For convenience, the normalized d′ value was used in consideration of the wide variation in d′ across the five inlets. The normalization of d′ at each dose level was achieved by dividing the value by that of the same inlet on the full dose (15.47 mGy) FBP image. For example, if the d′ of detecting polystyrene target on the image acquired under 7.74 mGy and denoised with the DLM was 63, and if d′ of that on the full-dose FBP image was 28, the normalized d′ would be 63/28 = 2.25. The formula for normalized d′ was as follows:

$$ \text{Normalized } d' = \frac{d' \text{ of the specific material at a specific radiation dose for the DLR or DLM algorithm}}{d' \text{ of the same material on the full-dose FBP image}} $$
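The normalization can be written as a short sketch; the function name is illustrative, and the numbers are those of the worked polystyrene example in the text.

```python
def normalized_dprime(dprime, dprime_fbp_full_dose):
    """Normalize d' by the d' of the same inlet on the full-dose FBP image."""
    if dprime_fbp_full_dose <= 0:
        raise ValueError("reference d' must be positive")
    return dprime / dprime_fbp_full_dose


# Polystyrene at 7.74 mGy with the DLM (d' = 63) vs. full-dose FBP (d' = 28)
example = normalized_dprime(63, 28)  # 2.25
```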

Dose reduction potential

The performances of the DLM and DLR-L/M/H were compared by obtaining the DRP as the final comparison value. The DRP was defined as the amount of radiation dose that could be reduced by using either the DLM or DLR instead of the FBP. For example, if the radiation dose at which the normalized d′ of polystyrene reaches 1 is 16 mGy for the FBP and 6 mGy for the DLM, the DRP of the DLM is (16 − 6) mGy/16 mGy × 100 (%) = 62.5 (%) [23]. For comparison, DRP values were averaged across the phantom sizes and inlets for each reconstruction algorithm (i.e., DLM, DLR-L/M/H). For example, the radiation dose at which the d′ of the 10-mm polystyrene target on DLM images became equal to that on FBP images acquired under the full dose (15.47 mGy) was 1.76, 1.93, 2.23, 2.29, and 2.67 mGy in the 16-, 21-, 26-, 31-, and 36-cm phantoms, respectively. We obtained the average of those values, which was 2.18 mGy. We repeated the same process for the other four inlets, and the averaged values for the iodine, bone, water, and air were 1.89 mGy, 1.86 mGy, 1.94 mGy, and 1.87 mGy, respectively. Finally, the average of the values from the five inlets (1.95 mGy) was used for the calculation of the DRP. The DRP of the DLM in detecting 10-mm targets in this case was (15.47 − 1.95)/15.47 × 100 = 87%.
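The DRP arithmetic can be reproduced with the minimal sketch below. The averaging and DRP steps use the values quoted in the text; the helper `dose_at_target`, which finds the dose at which the normalized d′ reaches 1, assumes linear interpolation between measured dose levels, which is an illustrative choice rather than a documented part of the method.

```python
def dose_at_target(doses, dprimes, target=1.0):
    """Dose at which the normalized d' first reaches target.

    Assumption (not stated in the text): linear interpolation between
    measured dose levels, with doses sorted in ascending order.
    """
    for i in range(1, len(doses)):
        lo, hi = dprimes[i - 1], dprimes[i]
        if lo <= target <= hi and hi != lo:
            frac = (target - lo) / (hi - lo)
            return doses[i - 1] + frac * (doses[i] - doses[i - 1])
    return None


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def dose_reduction_potential(full_dose, matched_dose):
    """DRP (%) = (full dose - matched dose) / full dose * 100."""
    return (full_dose - matched_dose) / full_dose * 100.0


# Matched doses (mGy) for 10-mm polystyrene on DLM images, 16- to 36-cm sections
polystyrene_mean = mean([1.76, 1.93, 2.23, 2.29, 2.67])   # about 2.18 mGy
# Per-inlet averages (mGy): polystyrene, iodine, bone, water, air
overall_mean = mean([2.18, 1.89, 1.86, 1.94, 1.87])       # about 1.95 mGy
drp_dlm_10mm = dose_reduction_potential(15.47, 1.95)      # about 87%
drp_example = dose_reduction_potential(16, 6)             # 62.5%
```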

Results

The detectability index was analyzed with respect to the inlet, phantom diameter, and radiation dose. For all inlets and phantom diameters, d′ of the DLM was higher than those of all DLR levels (Figs. 2 and 3). The difference in d′ between the DLM and the DLR-H was greater when the diameter of the phantom was larger. At higher dose levels (3.87/7.74/15.47 mGy), d′ of the DLM was higher than that of the DLR-H. For radiation doses below 1.93 mGy, d′ of the DLM was lower than that of the DLR-H but was comparable to that of the DLR-M (Fig. 4). Figure 5 shows images of the polystyrene object obtained using the different algorithms at each radiation dose level. For each algorithm, the CT value differences between the background and each inlet according to radiation dose and phantom diameter are presented in the supplementary appendix (Supplementary Appendix Tables 1–3).

Fig. 2

Normalized detectability index (d′) of the DLM and three levels of the DLR according to the five inlets for 10-mm (a), 5-mm (b), and 1-mm (c) target objects. DLM, deep learning model; DLR-H, deep learning–based image reconstruction algorithm, high strength; DLR-M, deep learning–based image reconstruction algorithm, medium strength; DLR-L, deep learning–based image reconstruction algorithm, low strength

Fig. 3

Normalized detectability index (d′) of the DLM and three levels of the DLR according to phantom diameters for 10-mm (a), 5-mm (b), and 1-mm (c) target objects. DLM, deep learning model; DLR-H, deep learning–based image reconstruction algorithm, high strength; DLR-M, deep learning–based image reconstruction algorithm, medium strength; DLR-L, deep learning–based image reconstruction algorithm, low strength

Fig. 4

Normalized detectability index of the DLM and three levels of the DLR according to radiation dose levels for 10-mm (a), 5-mm (b), and 1-mm (c) target objects. DLM, deep learning model; DLR-H, deep learning–based image reconstruction algorithm, high strength; DLR-M, deep learning–based image reconstruction algorithm, medium strength; DLR-L, deep learning–based image reconstruction algorithm, low strength

Fig. 5

A 3 × 3 cm² region of interest centered on the polystyrene inlet on the FBP, DLM, DLR-H, DLR-M, and DLR-L images according to the radiation dose, for the phantom diameter of 31 cm. FBP, filtered back projection; DLM, deep learning model; DLR-H, deep learning–based image reconstruction algorithm, high strength; DLR-M, deep learning–based image reconstruction algorithm, medium strength; DLR-L, deep learning–based image reconstruction algorithm, low strength

The average DRPs of the DLM, DLR-L, DLR-M, and DLR-H were 86% (81–88%), 60% (46–67%), 76% (60–81%), and 87% (78–92%), respectively. For large target objects (10 mm), the DRP of the DLM was slightly higher than those of all DLR levels. However, for smaller targets (1 mm and 5 mm), the DRP of the DLR-H (89% and 88%, respectively) was greater than that of the DLM (87% and 84%, respectively) (Table 2). The DLM showed a narrower DRP range (81–88%) than did the DLR-H (78–92%). The CT values of the five inlets and the background, as well as the NPS and TTF results obtained to calculate d′, are presented in the supplementary appendix (Supplementary Appendix Fig. E2–7 and Tables 1–5). The CT values of the five inlets and the background were relatively constant across the radiation doses and phantom sizes.

Table 2 Dose reduction potential of the DLM and DLR for 10 mm, 5 mm, and 1 mm target objects

The DLM tended to show higher spatial resolution than the DLR for inlets with high contrast (bone-mimicking material and iodine) (Table 3). In contrast, for polystyrene, the inlet with the lowest contrast among the five, the spatial resolution of the DLM was degraded compared with the FBP and fell between those of the medium and high reconstruction levels of the DLR. For air, the spatial resolution of the DLR was higher than that of the DLM.

Table 3 A percentage change (range) of a spatial resolution (TTF50%) of the DLM and DLR compared to FBP according to inlets

Moreover, the spatial resolutions of the DLM and DLR were dependent on the level of radiation exposure (Table 4). The TTF50% of the FBP, DLM, and DLR-H, -M, and -L decreased as the radiation dose decreased. The spatial resolution of the DLR decreased as the reconstruction strength increased.

Table 4 Mean TTF50% values of the FBP, DLM, and DLR according to radiation dose levels

Discussion

In this study, performances of the vendor-agnostic DLM and a vendor-specific DLR were compared in terms of the DRP. We found that the DLM showed DRP comparable to that of DLR. Although several vendor-specific DLRs are now available, to the knowledge of the authors, this study is the first to compare a vendor-agnostic DLM with a commercially available DLR based on the method specified by the AAPM TG 233 group.

Analysis of CT image quality based on conventional metrics such as CNR or modulation transfer function is insufficient for an objective evaluation of DL algorithms, as those metrics cannot appropriately evaluate the non-linear and multifarious nature of the DL algorithms [24]. To reflect the object-dependent resolution and noise performance of DL algorithms, we employed metrics (e.g., TTF, NPS, and d′) by adopting the methodology specified by AAPM TG 233 to better represent the clinical performance.

It was unclear whether the vendor-agnostic DLM could exhibit DRP performance comparable to that of the DLR. The DLR was expected to have integrated into its neural networks many vendor-specific experiences and knowledge regarding the physical and mathematical properties of the whole reconstruction process. In contrast, the DLM focuses on a single task of image noise reduction without using raw projection data.

Therefore, our finding that the DLM showed DRP comparable to that of the DLR is encouraging for attempts to reduce radiation doses. The algorithm can potentially be adapted by many institutions having CT scanners from different vendors.

Indeed, compared to reported DRPs of the currently used IRs (range from 25 to 65%) [4, 5], the DLM in our study showed DRP of a higher range (81–88%). If this degree of dose reduction potential can be translated into clinical practice, CT images of the abdomen could be acquired under a 2-mGy radiation dose (15 mGy × 14%). When the effective dose (mSv) is calculated with a conversion coefficient of 0.015 mSv·mGy⁻¹·cm⁻¹ and a scan length of 50 cm, the radiation exposure is approximately 1.6 mSv, which is equivalent to that of two abdomen x-rays [25]. Previous studies have shown that the clinical outcome using 2-mSv CT is not inferior to that using conventional dose CT for acute appendicitis. There is a dearth of similar research in other disease groups [26, 27]. Our results may facilitate future research on reducing CT radiation used for various diseases.
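The effective dose estimate above can be checked with a short sketch. It assumes the usual relation in which the effective dose is the product of the conversion coefficient and the dose-length product (CTDIvol × scan length); the function name is illustrative.

```python
def effective_dose_msv(ctdi_vol_mgy, scan_length_cm, k=0.015):
    """Effective dose (mSv) = k * DLP, where DLP = CTDIvol * scan length.

    k is the conversion coefficient in mSv per mGy per cm
    (0.015 is the value quoted in the text).
    """
    dlp_mgy_cm = ctdi_vol_mgy * scan_length_cm
    return k * dlp_mgy_cm


# 15 mGy reduced by the average DRP of 86% leaves 14%, i.e. about 2.1 mGy
reduced_ctdi = 15.0 * (1.0 - 0.86)
dose_msv = effective_dose_msv(reduced_ctdi, 50.0)  # about 1.6 mSv
```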

In the subgroup analysis, the DLM showed d′ higher than that of the DLR as the diameter of the phantom increased. Such strength of DLM may be translated to its application in obese patients. In low radiation-dose levels, however, the DLM showed lower d′ than did the DLR-H, owing to its relatively low noise reduction. This performance change may be attributed to the difference in denoising principles between the two algorithms. Because the DLM is an image-based noise reduction algorithm, it was trained to reduce the noise level only to the extent where the image texture could be preserved. In contrast, in the DLR, which is a projection sinogram-based algorithm, the noise reduction power might override the preservation of image texture. In fact, a previous study reported that when applied to low radiation dose CT scans in its full strength, the DLR caused image texture degradation and texture distortion artifacts that can be mistaken for a true lesion [15].

In previous DLR studies, the image texture was preserved, and the NPS peak remained unchanged compared with the FBP [28]. However, analysis of the noise and spatial resolution revealed that the DLM showed a slightly right-shifted NPS peak, reflecting a sharpening of the image texture. This shift of the NPS peak in the DLM resulted in d′ values higher than those of the DLR-H for all object materials. Further evaluation is needed to determine whether such a shift of the NPS peak of the DLM affects clinical performance.

Previous studies have demonstrated that for DLR, spatial resolution depends on contrast of the object material and radiation dose [29,30,31]. Our results were consistent with those of the previous studies, showing a reduction in the spatial resolution (TTF) with a lower radiation dose and target contrast, in both DLR and DLM images. Owing to such dependency, McCollough et al found that a radiation dose reduction of more than 25% results in lower spatial resolution in IR images than in FBP images [30]. Therefore, the decrease in spatial resolution is likely to be worse for objects having contrast lower than that of polystyrene (50 HU). To overcome the degradation of the spatial resolution, the use of a generative adversarial network (GAN) was proposed in deep learning–based algorithms. The GAN focuses on migrating the data noise distribution from the strong to the weak statistically, to avoid over-smoothing or structural detail losses [32]. However, verification of GAN in clinical CT remains insufficient. Therefore, further study is needed to determine whether the DLM incorporating a GAN can further preserve the spatial resolution for low-contrast object materials. Meanwhile, our results also showed the degradation of spatial resolution of DLM below 1.97 mGy. There is no clear explanation for this phenomenon, and further research is needed regarding how the spatial resolution of DLM is affected at a very low radiation dose level.

This study has several limitations. First, because the DLM and DLR were evaluated using a phantom, further research using clinical CT images is needed to validate their performance in clinical applications. Second, although we tried to remove vendor-specific overfitting bias by training with various combinations of clinical and phantom images from four vendors, the DLM and DLR were evaluated using only one type of CT equipment. Therefore, to demonstrate the generalizability of the DLM, it should be evaluated on various other CT machines. Third, because we used FBP images reconstructed only with the body kernel, further research using other kernels may be needed. Finally, as mentioned, owing to the dependency of spatial resolution on contrast, further study is needed on low-contrast objects (below 50 HU) based on the AAPM TG 233 guideline to validate the clinical applicability of the DLM.

In conclusion, the dose reduction potential of the vendor-agnostic DLM was shown to be comparable to that of the vendor-specific DLR at high strength and superior to those of the DLRs at medium and low strengths.