Introduction

The quality of medical images can vary due to complex relationships between various factors, such as equipment used, patient conditions, imaging conditions, and image reconstruction conditions, even when the same object is taken. Lesions may be overlooked if medical images have poor quality, and an accurate diagnosis would not be made, which is detrimental to patients. Therefore, proper assessment of medical image quality is a critical task in providing appropriate medical care.

Image quality assessment (IQA) methods are roughly divided into two types: subjective and objective assessments. Subjective assessments mean that humans visually judge the superiority or inferiority of image quality. For instance, in the evaluation of specific protocols or techniques, subjective assessments are the most reliable and gold standard method for IQA because humans are the final image handlers [1,2,3,4,5]. However, subjective assessments are time- and labor-consuming [1, 3,4,5,6,7,8]. In contrast, objective assessments use calculations to determine image quality; thus, these methods can be used as an alternative to subjective assessments because they address the problems associated with them. Physical indices such as modulation transfer function (MTF) and noise power spectrum (NPS) are commonly used as objective assessment methods for medical image quality. MTF is an index for assessing the resolution characteristics of an imaging system, and NPS is an index for assessing the noise characteristics of an imaging system. These indexes are, by their nature, based on signals obtained from dedicated phantoms or structures under defined conditions; therefore, directly assessing the quality of images containing complex structures, such as actual clinical images, can be difficult.

In the field of natural images, a lot of research on IQA methods for complex structures has been conducted [8,9,10,11,12,13,14,15], and their performances have been compared [4,5,6,7, 16,17,18,19,20]. IQA is classified into three types: full-reference IQA (FR-IQA), reduced-reference IQA (RR-IQA), and no-reference IQA (NR-IQA) [4, 5, 10, 15, 18, 20]. FR-IQA calculates an image quality score by comparing a reference image with a distorted image. The quality of the distorted image can be assessed by using a non-distorted image as the reference image. RR-IQA uses partial information from a reference image to calculate an image quality score for a distorted image. NR-IQA does not use a reference image and instead calculates an image quality score based solely on a distorted image. An IQA method with good performance judges the image quality as high quality in cases where humans perceive it as good, and low quality in cases where humans perceive it as poor. In other words, a good IQA method is strongly correlated with subjective assessments. In general, FR-IQA tends to calculate a highly correlated score with subjective assessments because it uses more information than NR-IQA [4]. Therefore, FR-IQA is useful when a reference image with the same geometric position as the target image can be acquired.

In the medical imaging field, IQA methods developed for natural images, especially traditional FR-IQA methods such as peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), are widely used to evaluate imaging conditions, reconstruction conditions, and image processing algorithms, including noise reduction and super-resolution technology [3, 21,22,23,24,25,26,27,28,29,30]. However, because these IQA methods were designed for natural images, they may be inappropriate for assessing medical images with different properties. It is important to verify the applicability of IQA methods developed for natural images to medical images. Several research groups have evaluated the adaptability of FR-IQA methods to magnetic resonance (MR) images [31,32,33]. These studies evaluated the correlation between several FR-IQA methods and subjective assessments of various distorted MR images. Renieblas et al. evaluated several SSIM-related methods for distorted X-ray and MR images [34]. Furthermore, Kumar et al. investigated whether PSNR and SSIM adequately represent human subjective assessments as evaluation indices for determining the degree of image compression in teleradiology [35]. Currently, these are the only studies that have evaluated the adaptability of FR-IQA methods to medical images.

One of the major roles of IQA in medical imaging is to minimize patient burden while ensuring a certain level of image quality. Computed tomography (CT) is a modality with a relatively higher exposure dose than other imaging examinations; therefore, the importance of IQA is high. Nevertheless, no research has extensively investigated the applicability of IQA methods to CT images with various types of distortions.

Therefore, this study aimed to investigate the correlation between objective assessment by existing FR-IQA methods and human subjective assessment of CT images and evaluate whether FR-IQA methods developed for natural images can apply to CT image quality assessment.

In light of the current widespread use of the FR-IQA methods for medical image quality assessment, it is crucial to determine the reliability of the results obtained from these methods and that has great clinical significance. This study provides valuable evidence in evaluating the trustworthiness of the FR-IQA methods. Furthermore, this research identifies specific methods that are particularly useful for assessing CT image quality, thereby contributing to the advancement of objective assessment techniques in the field.

Materials and Methods

CT Images

Six CT images were selected from open online databases as reference images, which were determined by a radiological technologist to have no significant artifacts and relatively low noise and blurring visually, and confirmed as such by another radiologist. The reference images included two images each of the head, chest, and abdomen (Fig. 1). The head images were obtained from CQ500 dataset [36], and the chest and abdominal images were obtained from DeepLesion dataset [37]. The CQ500 dataset provided by the Centre for Advanced Research in Imaging, Neurosciences, and Genomics (GARING) includes CT images of 491 individuals. The DeepLesion dataset published by the National Institutes of Health (NIH) contains CT images of various parts from the head to the legs of 4427 individuals. All the reference images obtained from both datasets were unsigned 16-bit portable network graphics (PNG) format, with image matrix size: 512 × 512 and slice thickness: 5 mm. In this study, we set the window width/window level to 80/40 Hounsfield units (HU), 1500/-650 HU, and 350/40 HU for the head, chest, and abdomen images, respectively, and then converted them linearly to 8-bit pixel values. The width/window level was determined in accordance with the clinical conditions actually used at Shiga University of Medical Science Hospital.

Fig. 1
figure 1

The six cases used as reference images

Distorted Image Creation

The reference images were distorted using two types of distortion: Gaussian noise and Gaussian blur. In total, 210 distorted images were created by adding each or both types of distortion with five-grade intensity to the six reference images (Table 1 and Fig. 2). ImageJ software version 1.53e (Rasband W.S., Image J, U.S. National Institute of Health, Bethesda, Maryland, USA) was used to create the distorted images.

Table 1 Summary of distortions applied to reference images
Fig. 2
figure 2

Examples of generated distorted images. a original image, b σnoise = 14 and σblur = 0, c σnoise = 30 and σblur = 0, d σnoise = 0 and σblur = 1.4, e σnoise = 0 and σblur = 3.0, f σnoise = 30 and σblur = 3.0

Objective Assessment

This study employed nine widely used FR-IQA methods for natural images: PSNR, SSIM [9], feature similarity (FSIM) [10], information fidelity criterion (IFC) [11], visual information fidelity (VIF) [8], noise quality measure (NQM) [12], visual signal-to-noise ratio (VSNR) [13], multi-scale SSIM (MSSSIM) [14], and information content-weighted SSIM (IWSSIM) [15]. PSNR is derived from mean squared error and indicates the ratio between the maximum pixel value and the power of the distortion. SSIM is a metric based on an assumption that human perception is highly adapted for extracting structural information from visual scenes and is calculated from three components: luminance, contrast, and structure. MSSSIM and IWSSIM were derived from SSIM. MSSSIM is calculated by weighting each component of SSIM on various scales. Consequently, MSSSIM is more flexible than SSIM in incorporating variations in viewing conditions. IWSSIM employs a weighted pooling strategy to estimate global image quality after local image quality measurements and uses the MSSSIM for local image quality measurements. FSIM is based on the fact that humans recognize an image mainly by its low-level features. FSIM computes quality estimates using phase congruency (PC) as the main feature and gradient magnitude (GM) as the complementary feature. IFC quantifies the amount of mutual information shared between the reference and distorted images. VIF is developed based on IFC and is calculated by normalizing IFC with reference image information. NQM is a human visual system (HVS) model-based index calculated by modeling the effect of additive noise on human perception systems. Similar to NQM, VSNR is an HVS model-based method. It is calculated based on both low-level and mid-level properties of human vision.

In subjective assessments, the image quality of background areas with no diagnostic significance is expected to be almost ignored and not considered in the assessment, whereas, in objective assessments, the image quality of background areas affects the results to some extent. In this study, objective assessments were conducted on cropped images in which the background areas were minimized in order to purely focus on the distortion of the subject (Fig. 3).

Fig. 3
figure 3

An example of cropping process to minimize background areas

Subjective Assessment

Double stimulus continuous quality scale (DSCQS), a type of dual stimulation method proposed by the International Telecommunication Union Radiocommunication sector (ITU-R) Recommendation BT.500 [38], was used as the subjective assessment method. ITU-R BT.500 provides methodologies for the subjective assessment of image quality, including general testing methods, grading scales used during assessments, and viewing conditions recommended for carrying out assessments. Several IQA studies have performed subjective assessments based on this recommendation [5,6,7, 31, 39,40,41,42]. Figure 4 shows the assessment flow of the DSCQS method. Both reference and distorted images were presented in pairs alternately twice. An assessment score was assigned to both images at the second presentation. Observers were blind to which image was the reference image, and the display order for each image pair was random. The observers marked quality scores for both images on a continuous scale (Fig. 5). The marked quality scores were then normalized from 0 to 100, and the difference between the quality scores for the pair of images was calculated. The DSCQS score was calculated by averaging these quality differences for all the observers. The DSCQS score is derived from the quality difference; thus, a smaller value indicates better image quality, and a larger value indicates poorer image quality. Thanks to the dual-stimulus method and continuous scale, DSCQS is suitable for discriminating minute differences in image quality and is widely used in IQA for medical images [42,43,44]. However, because the reference and distorted images are presented twice separately, the assessment time is longer than for other subjective assessment methods. Here, we set the presentation time for each image to 5 s and the image display interval to 3 s.

Fig. 4
figure 4

Assessment flow of the DSCQS method. Image A and B are a pair of reference and distorted images

Fig. 5
figure 5

Rating scale used in the DSCQS method

For all FR-IQA methods used, higher and lower values indicate better and poorer image quality, respectively, which is inversely related to the DSCQS score. To interpret the results intuitively and easily, we defined the subjective score using the following formula:

$$Subjective\; Score=100-DSCQS\; Score.$$
(1)

Four radiologists and two radiological technologists were involved in the subjective assessment. Regarding the viewing conditions, the monitor was a two-megapixel medical monitor (RadiForce RX250, EIZO Corporation, Ishikawa, Japan), the illumination surrounding the display was set to 50 lx or less, and the viewing distance was arbitrary. The order of the displayed image pairs was randomized for each observer. At the beginning of the assessment, a training session was conducted to stabilize the observer’s opinion, using another case different from the six cases used for the assessment.

To assess the inter-rater variability of the subjective assessment, the intraclass correlation coefficient (ICC) was employed. There are several types of ICCs, depending on their intended use. In this study, we used ICC(2,1), which is the case for absolute agreement and single measurements. ICC(2,1) is a commonly used measure to evaluate agreement among multiple observers. The ICC(2,1) values range from 0 to 1, where higher values indicate greater agreement among observers. ICC(2,1) was calculated using SPSS version 25 (IBM Corp., Chicago, IL, USA).

Performance Evaluation of FR-IQA Methods

Pearson linear correlation coefficient (PLCC), Spearman rank order correlation coefficient (SROCC), and root-mean-square error (RMSE) are generally used to evaluate IQA performance in terms of the prediction accuracy, monotonicity, and consistency of the models, respectively [5, 8,9,10,11, 13,14,15,16, 18, 20, 31, 39, 41, 42, 45]. The scores predicted using FR-IQA methods often do not correlate linearly with subjective assessment scores. PLCC and RMSE can be applied to linear systems; therefore, the scores require correction to eliminate nonlinearity using logistic regression as a pre-processing step. Here, we used five-parameter logistic regression [4, 5, 8, 15, 16, 19, 42, 45], which regression function is given as:

$${Y}_{L}={\beta }_{1}\left(\frac{1}{2}-\frac{1}{1+exp\left({\beta }_{2}\left(Y-{\beta }_{3}\right)\right)}\right)+{\beta }_{4}Y+{\beta }_{5}$$
(2)

where Y is the image quality score calculated by each IQA method; YL is the image quality score after regression; and β1, β2, β3, β4, and β5 are the regression model parameters. After the regression process, PLCC, SROCC, and RMSE were calculated using the following equations:

$$PLCC\left(X,{Y}_{L}\right)=\frac{{\sum }_{i}^{n}\left({X}_{i}-\overline{X }\right){\sum }_{i}^{n}\left({Y}_{Li}-\overline{{Y }_{L}}\right)}{\sqrt{{\sum }_{i}^{n}{\left({X}_{i}-\overline{X }\right)}^{2}}\sqrt{{\sum }_{i}^{n}{\left({Y}_{Li}-\overline{{Y }_{L}}\right)}^{2}}}$$
(3)
$$SROCC\left(X,Y\right)=1-\frac{6\sum_{i=1}^{n}{d}_{i}^{2}}{n\left({n}^{2}-1\right)}$$
(4)
$$RMSE\left(X,{Y}_{L}\right)=\sqrt{\frac{{\sum }_{i}^{n}{\left({Y}_{Li}-{X}_{i}\right)}^{2}}{n}}$$
(5)

where X is the subjective assessment score, d is the difference between X and Y, and n is the total number of images used in the evaluation. In this study, we applied these three metrics to 210 distorted images for overall FR-IQA performance evaluation. In addition, these metrics were also applied to 30 noise-distorted images without additional blurs, another 30 blur-distorted images without additional noises, 70 head images, 70 chest images, and 70 abdominal images to evaluate the performance of the FR-IQA methods for specific types of distortion and specific regions.

We also conducted statistical testing to confirm which differences between IQM performances were statistically significant. In the field of IQA, the F-test is usually performed on the residuals between the IQM scores after the regression and the subjective scores [4, 16, 32, 41]. The null hypothesis is that the data in the two IQA residual vectors have the same variance and they are statistically indistinguishable. If the null hypothesis is rejected after performing a one-tailed test at 5% significance level, then the performance of the two IQA methods is statistically significantly different.

Finally, we compared the average computation time required to assess an image for each FR-IQA method. We used 210 distorted images of 512 × 512 pixels before cropping and the average value of five calculations to calculate time. The CPU of the PC was an Intel Core i5-10210U with 8.00 GB memory, and the software platform was MATLAB R2021b (The MathWorks, Inc., Natick, Massachusetts). The MATLAB source code for all FR-IQA methods, except PSNR and SSIM, was obtained from the original authors. We calculated PSNR and SSIM using the built-in functions of MATLAB.

Results

Subjective Assessment

Figure 6 shows the results of the subjective assessment of each distortion factor. The cases used in these results were independently distorted images that did not contain any other distortion factors. In other words, in Fig. 6a, only noise was added to the image, and the blur was fixed at σblur = 0. In Fig. 6b, only blur was added, and the noise was fixed at σnoise = 0. The assessment results were obtained for the data of 30 cases because five levels of distortion intensity were applied to the six cases. The subjective score decreased as the distortion intensity increased for both the noise and blur. In addition, the rate of decrease became lower as the distortion intensity increased.

Fig. 6
figure 6

Results of subjective assessment for each distortion factor: a Gaussian noise and b Gaussian blur. Dots represent mean subjective scores of six observers; error bars represent standard deviation

The obtained ICC(2,1) value for our subjective assessment was 0.68 (95% confidence interval: 0.61–0.74), indicating a moderate level of agreement among the observers. The 95% confidence interval suggests that the true ICC value falls between 0.61 and 0.74 with a high degree of certainty.

IQA Performance Comparison

Figure 7 shows scatter plots of the relationship between the objective scores calculated by each FR-IQA method and the subjective scores for the 210 distorted images. Table 2 compares the PLCC, SROCC, and RMSE results. All FR-IQA methods had PLCC and SROCC values higher than 0.8, indicating strong positive correlations. Particularly, VIF exhibited the best performance for all three indices: PLCC = 0.967, SROCC = 0.968, and RMSE = 5.380. In contrast, the SSIM was inferior to other FR-IQA methods: PLCC = 0.817, SROCC = 0.882, and RMSE = 12.075. Table 3 shows the statistical significance testing results among the nine FR-IQA methods for all 210 distorted images. The symbol “1” means that the method in the row is statistically better than the method in the column, the symbol “-” means that it is statistically indistinguishable, or the symbol “0” means that it is statistically worse. VIF and IWSSIM performed statistically better than the other IQM methods. In contrast, SSIM was statistically worse than any other IQM method.

Fig. 7
figure 7

Scatter plots of the relationship between subjective scores and objective scores calculated by nine FR-IQA methods for 210 distorted images

Table 2 PLCC, SROCC, and RMSE values between subjective scores and objective scores by nine FR-IQA methods for 210 distorted images
Table 3 Statistical significance testing results based on prediction residuals of nine FR-IQA methods for all 210 distorted images. The symbol “1”, “-”, or “0” means that the method in the row is statistically (with 95% confidence) better, indistinguishable, or worse than the method in the column, respectively

Table 4 compares the performance of the FR-IQA methods on 30 images degraded by each distortion type alone. Regarding noise, VIF had the best performance for PLCC and RMSE, and VSNR had the best performance for SROCC. Regarding blur, VIF had the best performance for PLCC and SROCC, and IWSSIM had the best performance for RMSE. Table 5 shows the statistical significance testing results among the nine FR-IQA methods for a particular distortion. Each entry in the table is a codeword consisting of two symbols. The symbol location within a codeword represents each type of distortion in the following order, from left to right: Gaussian noise and Gaussian blur. Regarding noise, IFC, VIF, VSNR, MSSSIM, and IWSSIM performed statistically better than the other four IQM methods. Regarding blur, VIF had the highest number of statistical superiority over the other methods and was statistically superior to PSNR, SSIM, IFC, NQM, and VSNR.

Table 4 PLCC, SROCC, and RMSE values between subjective scores and objective scores by nine FR-IQA methods for 30 distorted images by Gaussian noise without additional blur and another 30 distorted images by Gaussian blur without additional noise
Table 5 Statistical significance testing results based on prediction residuals of nine FR-IQA methods for 30 images degraded by each distortion type alone. Each entry in the table is a codeword consisting of two symbols. The symbol location within a codeword represents each type of distortion in the following order: [Gaussian noise, Gaussian blur]. The symbol “1”, “-”, or “0” means that the method in the row, for a particular distortion, is statistically (with 95% confidence) better, indistinguishable, or worse than the method in the column, respectively

Table 6 compares the values of PLCC, SROCC, and RMSE for the head (70 images), chest (70 images), and abdomen (70 images). Regarding the head images, VIF had the best performance for PLCC and RMSE, and IWSSIM had the best performance for PLCC and SROCC. Regarding the chest images, VIF had the best performance for all indices, and IWSSIM also had the best performance for PLCC. Regarding the abdominal images, IWSSIM had the best performance for PLCC and SROCC, and VIF had the best performance for RMSE. Table 7 shows the statistical significance testing results among the nine FR-IQA methods for a particular target site. Each entry in the table is a codeword consisting of three symbols. The symbol location within a codeword represents each target site in the following order, from left to right: head, chest, and abdomen. Regarding head and abdomen, VIF and IWSSIM had the highest number of statistical superiority over the other methods and was statistically superior to PSNR, SSIM, FSIM, NQM, VSNR, and MSSIM. Regarding chest, VIF and IWSSIM also had the highest number of statistical superiority over the other methods and was statistically superior to PSNR, SSIM, FSIM, IFC, VSNR, and MSSIM.

Table 6 PLCC, SROCC, and RMSE values between subjective scores and objective scores by nine FR-IQA methods for 70 head images, 70 chest images, and 70 abdominal images
Table 7 Statistical significance testing results based on prediction residuals of nine FR-IQA methods for 70 images by each target site. Each entry in the table is a codeword consisting of three symbols. The symbol location within a codeword represents each target site in the following order: [head, chest, abdomen]. The symbol “1”, “-”, or “0” means that the method in the row, for a particular target site, is statistically (with 95% confidence) better, indistinguishable, or worse than the method in the column, respectively

Table 8 shows the average computation time required to calculate an image’s score using each FR-IQA method. PSNR had the shortest time (0.0041 s), while VIF had the longest (0.6183 s), with a difference of 0.6142 s.

Table 8 Average computation time per image for each FR-IQA method

Discussion

In this study, subjective assessments using DSCQS method were used as the ground truth to evaluate the performance of the FR-IQA methods. We assessed the images distorted by two types of degradation factors: Gaussian noise and Gaussian blur. Gaussian noise makes low-contrast objects difficult to percept and degrades image quality in terms of graininess, whereas Gaussian blur obscures small objects and fine details and degrades image quality in terms of sharpness [31, 46]. The results in Fig. 6 reflect these facts and support that subjective assessments were properly conducted. In addition, as the intensity of distortion increased, the subjective rating decreased, and the rate of decrease progressively declined, approaching a plateau. We set the maximum distortion intensity values to σnoise = 30 and σblur = 3.0. Considering this result, even if the distortion intensity was increased further, we expected no significant differences in the obtained subjective scores. Therefore, we considered that the distortion intensities used in this study cover a wide range of degradation that humans can perceive in medical images, and we believe that the settings of the distortion intensities were appropriate.

This study aimed to evaluate whether FR-IQA methods developed for natural images could be adapted to the quality assessment of medical images, such as CT images. Objective scores by all the nine FR-IQA methods used in this study had strong positive correlations (PLCC and SROCC) with the subjective assessment scores (Table 2). This suggests that FR-IQA methods developed for natural images can be used to assess the quality of CT images. Among the nine FR-IQA methods, VIF had the best performance, with the highest PLCC and SROCC and the lowest RMSE (Table 2). In statistical significance testing, VIF was also statistically better than other FR-IQA methods except for IWSSIM (Table 3). These results indicate that VIF is one of the most accurate methods for estimating CT image quality as a surrogate for subjective assessments. VIF was also shown to perform best in a previous study comparing subjective assessment with FR-IQA for MR images [32]. Furthermore, in the field of natural images, several studies have reported that VIF has superior performance for some datasets [6, 16, 17], particularly for multiple distortion databases [4]. Hence, we think that VIF is a method that expresses human perception well in the entire imaging field including all kinds of medical and natural images. Excluding IFC, VIF is unique among the nine IQA methods as it calculates a quality score based on the mutual information between a reference and a distorted image. IFC, a mutual information-based method like VIF, had the third highest correlation coefficient and third lowest RMSE among the nine FR-IQA methods. These approaches, based on mutual information, are considered useful in FR-IQA. Unlike IFC, VIF performs normalization based on reference image information, suppressing content dependence. This could explain why VIF outperformed IFC.

SSIM, now widely used to assess medical images, showed the lowest performance (Table 2). This is similar to previous studies that targeted MR images [32]. MSSSIM, an index derived from SSIM, showed performance improvement compared to SSIM. While SSIM is a single-scale approach, MSSSIM is a multi-scale approach. Based on the assumption that the optimal scale depends on conditions such as resolution and the distance between the image and the viewing point, MSSSIM is calculated by repeatedly low-pass filtering and down-sampling image pairs and weighting each component on various scales. This suggests that the single scale used in SSIM may not be optimal for this study’s image resolution and observation distance conditions. Moreover, among the SSIM-based methods used, IWSSIM performed the best. Both SSIM and MSSSIM employ a method that averages the scores calculated from each pixel to estimate global image quality, assuming each pixel has the same importance. In contrast, IWSSIM employs an information content-weighting strategy to estimate global image quality, which calculates weighted values for regions that humans are likely to pay attention to perceptually. Incidentally, IWSSIM uses MSSSIM for local quality estimation, which is also flexible to scale variations as MSSSIM. IWSSIM showed the second-best performance among the nine FR-IQA methods (Table 2) and was statistically better than other FR-IQA methods, similar to VIF (Table 3). This indicates that the pooling strategy used in IWSSIM performed excellently. Therefore, if SSIM-based methods are used, IWSSIM is the best choice because it is more flexible for scale and calculates a weighted global score.

In this study, even when the data were evaluated by dividing each distortion type and target site (Tables 4, 5, 6, and 7), VIF and IWSSIM, which demonstrated high performance in the overall evaluation results, showed consistently high performance and little bias due to distortion types or target sites.

In previous studies, FR-IQA methods for medical images evaluated single distorted images by each of several distortion types [31,32,33,34,35]. However, actual image distortion is caused by a mixture of various factors. In this study, we used multiple distorted images, including noise and blur, and achieved an evaluation of more complex distortion. Therefore, we believe that this study’s results can be applicable to more clinically realistic situations.

The biggest disadvantage of the FR-IQA method is that the image quality is evaluated using a reference image, which may be difficult to obtain in clinical situations. Techniques for directly evaluating clinical CT image quality belong to the NR-IQA category, and research in this category is also active. However, since the FR-IQA method mainly calculates the score from some kind of difference or similarity between the evaluation image and the reference image, it can be used in a wide range of applications regardless of the type of deterioration or target part. Therefore, FR-IQA is frequently used to evaluate imaging conditions, reconstruction conditions, and image processing algorithms, including noise reduction and super-resolution technology. However, since these indices were originally developed for natural images, this study’s motivation was to verify whether these indices can also be used to evaluate medical images. The results obtained in this study will support the reliability of the results of studies using the FR-IQA in the evaluation of CT image quality in the past.

This study had several limitations. First, some FR-IQA methods require the setting of several parameters. We set the parameters for each FR-IQA method based on the values recommended in the original papers. Optimizing these parameters for medical images may improve the performance of the FR-IQA methods because these recommended values are for natural images. Second, the distribution of subjective assessment scores for the 210 distorted images was uneven. Figure 8 depicts a histogram of the subjective assessment scores. The volume zone was approximately 40, and the distribution was not uniform. Therefore, our results may have been strongly influenced by the distorted images near the volume zone. Finally, the distorted images used in this study were simulated artificially and may differ from actual clinical images. The noise in CT images follows a Gaussian distribution [42]; therefore, we adopted Gaussian noise. Unlike artificial distortions, clinical images do not necessarily exhibit uniform degradation throughout the entire image, and the degree of distortion may vary depending on the local area. Realistic distorted images are necessary; however, preparing images containing inhomogeneous various degrees of known degradations is difficult.

Fig. 8
figure 8

Histogram of subjective assessment scores for 210 distorted images

Conclusion

The nine FR-IQA methods demonstrated excellent correlations with subjective assessment, suggesting that FR-IQA methods developed for natural images can be applied to CT images, and will support the reliability of the results of studies using the FR-IQA in the evaluation of CT image quality in the past. In particular, VIF demonstrated the highest performance and was the most useful method for assessing the quality of CT images.