Evaluation of an AI-based, automatic coronary artery calcium scoring software

Objectives To evaluate an artificial intelligence (AI)–based, automatic coronary artery calcium (CAC) scoring software, using a semi-automatic software as a reference.
Methods This observational study included 315 consecutive, non-contrast-enhanced calcium scoring computed tomography (CSCT) scans. A semi-automatic and an automatic software obtained the Agatston score (AS), the volume score (VS), the mass score (MS), and the number of calcified coronary lesions. Semi-automatic and automatic analysis times were registered, including a manual double-check of the automatic results. Statistical analyses were Spearman's rank correlation coefficient (⍴), intra-class correlation (ICC), Bland Altman plots, weighted kappa analysis (κ), and Wilcoxon signed-rank test.
Results The correlation and agreement for the AS, VS, and MS were ⍴ = 0.935, 0.932, 0.934 (p < 0.001), and ICC = 0.996, 0.996, 0.991, respectively (p < 0.001). The correlation and agreement for the number of calcified lesions were ⍴ = 0.903 and ICC = 0.977 (p < 0.001), respectively. The Bland Altman mean difference and 1.96 SD upper and lower limits of agreement for the AS, VS, and MS were −8.2 (−115.1 to 98.2), −7.4 (−93.9 to 79.1), and −3.8 (−33.6 to 25.9), respectively. Agreement in risk category assignment was 89.5% and κ = 0.919 (p < 0.001). The median time for the semi-automatic and automatic method was 59 s (IQR 35–100) and 36 s (IQR 29–49), respectively (p < 0.001).
Conclusions There was an excellent correlation and agreement between the automatic software and the semi-automatic software for three CAC scores and the number of calcified lesions. Risk category classification was accurate but showed a tendency toward overestimation. Also, the automatic method was less time-demanding.
Key Points
• Coronary artery calcium (CAC) scoring is an excellent candidate for artificial intelligence (AI) development in a clinical setting.
• An AI-based, automatic software obtained CAC scores with excellent correlation and agreement compared with a conventional method but was less time-consuming.


Introduction
Non-contrast-enhanced, ECG-triggered, coronary calcium scoring computed tomography (CSCT) detects coronary artery calcifications (CAC) at low radiation doses [1] and is reliable in predicting future cardiovascular (CV) events for asymptomatic patients, independent of conventional risk models [2]. Clinical guidelines in the USA [3,4] and Europe [5] recommend CSCT in selected asymptomatic individuals, typically with an intermediate probability in a pre-test, clinical CV risk assessment.
The most commonly used CSCT technique for CAD grading and risk assessment is the Agatston score (AS); other techniques are the volume score (VS) and the mass score (MS) [6].
CAC scoring is traditionally performed by experts using semi-automatic software, which requires manual identification and marking of the calcified coronary artery lesions. Guidelines worldwide endorse the use of CSCT, and the use of the method is likely to increase. Consequently, there is a need for more efficient automatic systems. The last few years have brought improvements in artificial intelligence (AI) radiology systems. A recent study in CT diagnostics of lung cancer, for example, demonstrated AI to be on par with or even outperforming radiologists [7]. In CAC scoring, AI could have a similar potential to assist or replace the human reader, thereby reducing clinical workload and increasing efficiency.
The aim of the study presented herein was to compare an automatic AI-based CSCT post-processing software prototype to a semi-automatic, conventional software by evaluating the correlation and agreement of the total AS, VS, MS, and the number of calcified coronary lesions. Classification into five commonly used CV risk categories was also compared. Finally, time for analysis was evaluated.

Ethics
This study was conducted according to the principles set forward in the Declaration of Helsinki and according to Good Clinical Practice. Permission was obtained from the regional ethical review board (Dnr 2018/535-32). In accordance with the ethical regulations for Swedish registries and Swedish legislation, patients were informed about their participation in a registry and about their right to deny participation or have data removed, which waives any requirement for written consent.
For the multi-hospital acquired training data, necessary ethics approval and/or patient consent was obtained when required.

Study sample
In this observational, cross-sectional study, patients and their baseline characteristics were retrospectively collected from a nationwide quality registry, SWEDEHEART [8]. All patients with a CSCT performed on a particular state-of-the-art CT scanner between 1 December 2017 and 31 January 2019 (n = 342) at Linköping University Hospital were consecutively included. The indication was suspected ischemic heart disease. Exclusion criteria were, as previously suggested [9], anatomical abnormalities (n = 2), intracoronary stents (n = 0), metal implants (n = 18), and CSCT scans with severe motion artifacts or a high level of noise determined by visual inspection (n = 6) (Fig. 1). In addition, one CSCT scan (n = 1) was excluded due to incomplete scanning of the coronary arteries. The final main dataset consisted of 315 CSCT scans. In total, 13 of the 20 CSCT scans with anatomical abnormalities and metal implants were considered readable for CAC scoring and represented an independent dataset (n = 13).

CT acquisition parameters and image reconstruction
All CSCT scans were acquired with a Siemens SOMATOM Force (Siemens Healthineers) MDCT. A prospectively ECG-triggered high-pitch spiral CSCT scan was performed, with a tube voltage of 120 kV and automated tube current modulation (CARE Dose4D, Siemens) with a setting of 40 quality ref. mAs. Further settings were as follows: gantry rotation time 0.25 s, pitch 3.2, collimation 192 × 0.6 mm, matrix size 512 pixels, and temporal resolution 66 ms. The scan was set to start at 65% of the cardiac cycle. Reconstructions were made with a routine weighted filtered back projection (WFBP, Siemens) algorithm using a medium sharp convolution kernel (Qr36), 3.0 mm section thickness, and 1.5 mm increment. Beta blockers were administered if the heart rate was > 65 bpm. After CSCT scanning, a CCTA was performed in the same session.

AI-based, automatic system overview
The automatic software was trained on multi-vendor, multi-scanner, and multi-hospital, anonymized data from routine coronary calcium scoring acquisitions. No training datasets were used in the current study.
During model training, the locations of the coronaries created a territory map in a heart-centric coordinate system. This map serves to assign prior likelihood of different voxels belonging to the coronary arteries.
For each evaluated CSCT scan, a model segments the heart to establish a heart-centric coordinate system. The pre-computed coronary territory weights are mapped to the local size and shape of the patient's heart. All voxels > 130 HU are extracted. Around each such voxel, an image patch is extracted to represent the local spatial characteristics; it is combined with the prior likelihood from the territory map and the location (x, y, z) of the voxel in the heart-centric coordinate system. The model uses these features to predict whether the voxel belongs to the coronary arteries. Some previous work [10][11][12] already used patient-specific, heart-centric coordinate systems but relied on manually placed markers, or used local image coordinates in combination with a computationally expensive registration to an atlas-based model [13,14]. Another work used a heart segmentation but no further classification besides voxel intensity [15]. As far as we know, the evaluated machine learning model, which combines the location within this coordinate system, the local image information around a voxel, and the coronary territory map, is novel.
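The feature construction described above can be illustrated with a minimal sketch. It is not the evaluated software: the patch size, the air-valued padding, the normalization of the heart-centric coordinates by a heart extent, and all function names are assumptions made for illustration. Only the > 130 HU candidate threshold and the three feature groups (local patch, territory prior, heart-centric location) come from the text. Indices are ordered (z, y, x).

```python
from typing import List, Sequence, Tuple

Volume = List[List[List[float]]]  # indexed volume[z][y][x], values in HU
Index = Tuple[int, int, int]      # (z, y, x)

HU_THRESHOLD = 130.0  # candidate threshold stated in the text


def candidate_voxels(volume: Volume) -> List[Index]:
    """Return the indices of all voxels strictly above 130 HU."""
    return [
        (z, y, x)
        for z, plane in enumerate(volume)
        for y, row in enumerate(plane)
        for x, hu in enumerate(row)
        if hu > HU_THRESHOLD
    ]


def extract_patch(volume: Volume, center: Index, radius: int = 1,
                  fill: float = -1000.0) -> List[float]:
    """Flattened cubic patch around `center` (hypothetical size: 3 x 3 x 3).

    Positions outside the volume are padded with `fill` (air), an assumption.
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    cz, cy, cx = center
    patch = []
    for z in range(cz - radius, cz + radius + 1):
        for y in range(cy - radius, cy + radius + 1):
            for x in range(cx - radius, cx + radius + 1):
                inside = 0 <= z < depth and 0 <= y < height and 0 <= x < width
                patch.append(volume[z][y][x] if inside else fill)
    return patch


def heart_centric(center: Index, heart_center: Sequence[float],
                  heart_extent: Sequence[float]) -> List[float]:
    """Offset from the heart centre, scaled by heart size (assumed normalization)."""
    return [(c - h) / e for c, h, e in zip(center, heart_center, heart_extent)]


def voxel_features(volume: Volume, prior_map: Volume, center: Index,
                   heart_center: Sequence[float],
                   heart_extent: Sequence[float]) -> List[float]:
    """Concatenate patch intensities, territory prior, and heart-centric location."""
    z, y, x = center
    return (extract_patch(volume, center)
            + [prior_map[z][y][x]]
            + heart_centric(center, heart_center, heart_extent))
```

Each feature vector would then be passed to a classifier that outputs the likelihood that the voxel is coronary calcium; the classifier itself is not specified in the text and is therefore left out of the sketch.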

Data reporting
A standard reference was obtained with a semi-automatic, previously validated [16], post-processing software (syngo.via, Siemens Healthineers). All 315 CSCT scans were double read by two radiologists in at least two sessions (M.S. and S.S., both with 10 years' experience of cardiac CT reading) and all interpretation differences were resolved by consensus. To determine the presence of CAC, an attenuation threshold was set at > 130 HU. Calcified coronary objects having an area of ≥ 1 mm² were included, as originally described [17], using default software settings. Every calcified region of interest was manually identified and marked to attain the total AS, VS, MS, and the number of calcified coronary lesions. The time used for the first read was registered.
A total of 62 (20%) CSCT scans from the standard reference underwent a second opinion evaluation from two additional readers (A.P., radiologist, 20 years' experience of cardiac CT reading, and L.H., cardiac imaging radiographer, 2 years' experience of cardiac CT research reading). CSCT scans selected for second opinion were those considered to potentially shift in risk category due to reader arbitrariness (n = 32), calcifications close to the coronary ostia (n = 27), or difficulty discriminating peripheral calcified coronary lesions from noise (n = 3). After consensus was reached, two changes were made, both with AS difference ≤ 5, and there was no shift in risk category.
For inter-reader agreement, a subset of 106 (33.6%) CSCT scans was randomly selected and assigned to two independent radiologists. One radiologist was assigned 71 CSCT scans (G.N., 1 year's experience of cardiac CT reading) and one radiologist 35 CSCT scans (A.B., 16 years' experience of cardiac CT reading), both blinded to previous results.
The automatic software was implemented in MeVisLab on a regular workstation. All CSCT scans (n = 315) were analyzed with the automatic software, retrieving the total AS, VS, MS, and number of calcified coronary lesions. The automatic system run-time and the time for a manual double-check of the results were registered. The double-check comprised localizing all CAC and checking the automatically derived number of calcified coronary lesions against the images.
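To make the scoring rule concrete, the following is a minimal single-slice sketch of an Agatston-type score. It is not the syngo.via implementation. The > 130 HU threshold and the ≥ 1 mm² minimum area come from the text; the density weights (1 to 4 at 130, 200, 300, and 400 HU) are those of the conventional Agatston method, and the 4-connectivity used to group pixels into lesions is an assumption. A clinical score sums over all slices and coronary arteries and must account for slice thickness and increment, which this sketch ignores; likewise, the lesion count here is per slice, not per three-dimensional lesion.

```python
from typing import List, Tuple

HU_THRESHOLD = 130.0  # attenuation threshold stated in the text
MIN_AREA_MM2 = 1.0    # minimum lesion area stated in the text


def density_weight(peak_hu: float) -> int:
    """Conventional Agatston density factor based on the lesion's peak attenuation."""
    if peak_hu >= 400:
        return 4
    if peak_hu >= 300:
        return 3
    if peak_hu >= 200:
        return 2
    return 1  # callers only pass lesions above the 130 HU threshold


def slice_agatston(slice_hu: List[List[float]],
                   pixel_area_mm2: float) -> Tuple[float, int]:
    """Return (score, number of lesions) for one axial slice.

    Pixels above threshold are grouped by 4-connectivity (an assumption);
    each group of at least MIN_AREA_MM2 contributes area x density weight.
    """
    height, width = len(slice_hu), len(slice_hu[0])
    seen = [[False] * width for _ in range(height)]
    score, lesions = 0.0, 0
    for y0 in range(height):
        for x0 in range(width):
            if seen[y0][x0] or slice_hu[y0][x0] <= HU_THRESHOLD:
                continue
            # Flood fill one connected lesion, tracking size and peak HU.
            stack = [(y0, x0)]
            seen[y0][x0] = True
            n_pixels, peak = 0, slice_hu[y0][x0]
            while stack:
                y, x = stack.pop()
                n_pixels += 1
                peak = max(peak, slice_hu[y][x])
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx]
                            and slice_hu[ny][nx] > HU_THRESHOLD):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            area = n_pixels * pixel_area_mm2
            if area >= MIN_AREA_MM2:
                score += area * density_weight(peak)
                lesions += 1
    return score, lesions
```

In a semi-automatic workflow the reader decides which of these candidate lesions lie in a coronary artery; in the automatic workflow that decision is made by the model.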
No human interaction was needed, except for loading data into the software. For each CSCT scan, a visual CSCT feedback with crosshairs was displayed in three dimensions, allowing multiplanar reconstructions (Fig. 2).
The readable CSCT scans (n = 13) with coronary abnormalities and metal implants were independently reported, following the same routine. However, another radiologist (G.W., 16 years' experience of cardiac CT reading) performed the semi-automatic double-read, and there was no second opinion.

Statistics
Continuous data are presented as mean ± standard deviation if normally distributed, or as median and interquartile range (IQR) if non-normally distributed. Categorical data are presented as numbers and percentages. Normality was tested with the Shapiro-Wilk test. The correlation and agreement between the standard reference and the automatic software for the AS, VS, MS, and the number of lesions were assessed with Spearman's rank correlation coefficient (⍴) and intraclass correlation coefficient (ICC), as appropriate for non-parametric data. Bland Altman plots displayed the bias and the 95% limits of agreement (mean difference ± 1.96 SD). Differences in risk classifications were assessed by weighted kappa analysis (κ) and accuracy. Inter-observer agreement was demonstrated with ICC and Spearman's rank correlation coefficient (⍴). Difference in time was analyzed with the Wilcoxon signed-rank test. A two-sided p < 0.05 was considered statistically significant. Randomization for inter-rater agreement was performed in Excel (Microsoft Office 365); all other analyses were performed using IBM SPSS v.24 (IBM SPSS).
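Two of the analyses named above can be sketched in plain Python for readers who want to reproduce them outside SPSS. This is an illustration, not the analysis code of the study. The function names are hypothetical, ties are given average ranks, and the sign convention of the difference (reference minus automatic) is an assumption.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def bland_altman(reference: Sequence[float],
                 automatic: Sequence[float]) -> Tuple[float, float, float]:
    """Return (bias, lower limit, upper limit) with limits at bias +/- 1.96 SD."""
    diffs = [r - a for r, a in zip(reference, automatic)]  # assumed sign convention
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```

With the assumed sign convention, a negative bias means that the automatic scores are on average higher than the reference scores.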

Results
The correlation for the automatic software in relation to the standard reference with respect to the AS, VS, and MS was assessed with Spearman's rank correlation coefficient, showing ⍴ = 0.935, 0.932, and 0.934 (p < 0.001), respectively (Fig. 3).
The agreement for the automatic software in relation to the standard reference with respect to the AS, VS, and MS was assessed with the ICC, showing 0.996, 0.996, and 0.991, respectively (p < 0.001).
The Bland Altman mean differences, with 1.96 SD upper and lower limits of agreement, were as follows: AS −8.2 (−115.1 to 98.2), VS −7.4 (−93.9 to 79.1), and MS −3.8 (−33.6 to 25.9) (Fig. 4). A few outliers contributed to an overestimation tendency of CAC scores by the automatic software, mostly in the lower ranges. A confusion matrix for risk category assignment demonstrated an accuracy of 89.5% and weighted kappa analysis (κ) = 0.919 (p < 0.001) (Table 2). In total, 33 CSCT scans were misclassified: 27 were overestimated and six underestimated. In total, 29 CSCT scans were off by one category, 19 of those shifting from AS 0 to AS 1–10. All 19 CSCT scans shifting from AS 0 to AS 1–10 were due to erroneous registration of image noise in the heart and/or adjacent structures; 13 had an AS difference < 2, and four had an AS difference between two and eight. Three CSCT scans were off by two categories, two due to inclusion of aortic root calcifications, and one due to inclusion of a pericardial calcification. One CSCT scan was off by four categories due to inclusion of a mitral valve calcification. All underestimated CSCT scans were off by one category; four had an AS difference of ≤ 2 and one had an AS difference of 21, the latter not including a calcification close to the right coronary ostium.
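The risk category comparison can be sketched as follows. Only the two lowest categories (AS 0 and AS 1–10) are named in the text; the remaining cut-points (11–100, 101–400, > 400) are a commonly used scheme and are an assumption here, as is the use of linear weights for kappa, since the weighting scheme is not stated. The function names are hypothetical.

```python
from typing import List, Sequence


def risk_category(agatston: float) -> int:
    """Map an Agatston score to one of five categories (0..4).

    Cut-points above 10 are assumed, not taken from the text.
    """
    if agatston <= 0:
        return 0  # AS 0
    if agatston <= 10:
        return 1  # AS 1-10
    if agatston <= 100:
        return 2  # assumed: AS 11-100
    if agatston <= 400:
        return 3  # assumed: AS 101-400
    return 4      # assumed: AS > 400


def accuracy(reference: Sequence[int], automatic: Sequence[int]) -> float:
    """Fraction of scans assigned to the same category by both methods."""
    agree = sum(1 for r, a in zip(reference, automatic) if r == a)
    return agree / len(reference)


def weighted_kappa(reference: Sequence[int], automatic: Sequence[int],
                   n_categories: int = 5) -> float:
    """Cohen's kappa with linear disagreement weights |i - j| / (k - 1).

    Undefined (division by zero) if both raters use a single identical category.
    """
    k, n = n_categories, len(reference)
    observed: List[List[int]] = [[0] * k for _ in range(k)]
    for r, a in zip(reference, automatic):
        observed[r][a] += 1
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row[i] * col[j] / n
    return 1.0 - weighted_observed / weighted_expected
```

As a check on the reported figures, 33 misclassified scans out of 315 leave 282 correctly classified, and 282 / 315 rounds to the stated accuracy of 89.5%.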
The correlation and agreement for the automatic software in relation to the standard reference with respect to the number of calcified coronary lesions were assessed with the Spearman's rank correlation coefficient and the ICC showing ⍴ = 0.903 and ICC = 0.977 (both with p < 0.001), respectively.
The inter-reader correlation and agreement between the standard reference and the independent readers with respect to AS were assessed with Spearman's rank correlation coefficient and the ICC, showing ⍴ = 0.968 and ICC = 1.000 (both with p < 0.001), respectively.
The correlation and agreement for the automatic software in relation to the independent readers with respect to AS were assessed with the Spearman's rank correlation coefficient and the ICC, showing ⍴ = 0.909 and ICC = 0.979 (both with p < 0.001), respectively.
Among the 13 separate CSCT scans with coronary abnormalities and metal implants, the correlation and agreement between the standard reference and the automatic software with respect to AS were assessed with Spearman's rank correlation coefficient and the ICC, showing ⍴ = 0.939 and ICC = 0.956 (both with p < 0.001), respectively. Risk category assignment demonstrated an accuracy of 54% and weighted kappa analysis (κ) = 0.621 (p = 0.001).

Discussion
In this study, three CAC scores and the number of calcified coronary lesions obtained from an AI-based, automatic post-processing software were evaluated using a semi-automatic post-processing software as a reference. Correlation, agreement, and subsequent risk classification were excellent, and the automatic analysis was less time-consuming.
The correlation and agreement of the AS, VS, and MS of the automatic software compared with the standard reference were excellent. The Bland Altman plots for the AS, VS, and MS demonstrated narrow limits of agreement, but a small overestimation bias in the lower range of CAC scores. The risk group categorization was accurate, yet 33 (10.5%) CSCT scans were misclassified, the majority shifting from AS 0 to AS 1–10. This misclassification bias in the lowest AS range could be a clinical shortcoming, since AS > 0 is suggested to represent incipient CAD, and may therefore be a gatekeeper for prescription of medication [19][20][21]. Notably, this bias in the low AS range may be exaggerated due to skewness in the dataset, since 140 out of 315 CSCT scans (44%) had an AS of 0. All CSCT scans off by two or more categories were due to errors not likely to be made by experts, possibly reflecting AI-specific challenges. However, these false CAC results are recognizable in the visual CSCT feedback and are therefore unlikely to be clinically problematic.
Table 2 Confusion matrix with distribution of cardiovascular disease (CVD) risk categorization comparing the standard reference with the automatic software. Accuracy = 89.5% and weighted kappa analysis (κ) = 0.919 (p < 0.001). Columns to the right demonstrate a summary of risk category shifting. No risk category shifting is indicated in italics
There was an excellent correlation and agreement regarding the number of calcified coronary lesions, a feature previously demonstrated to have a prognostic value [22].
The number of studies evaluating automatic CSCT software is limited, and comparisons with other studies are difficult due to differences in study designs, inclusions, image acquisition, reconstructions, and quantitative evaluation methods. However, the correlation and agreement for CAC scoring and risk category classification were overall in line with previous studies having roughly similar prerequisites [12][13][14]. Automatic systems applied on the orCaScore framework have demonstrated excellent results [9,23], but the orCaScore dataset has an even distribution amongst risk categories and is smaller. Also, CAC scoring is mainly applied to asymptomatic patients, who are probably more likely to have a CAC distribution similar to that of this study.
Substantial efforts were made in creating a strong standard reference, both with double readings and a subset of CSCT scans for second opinion. A close to perfect inter-reader agreement was achieved, which is not unique for this study [13,14,24], but it indicates an adequate reliability. One radiologist assigned CSCT scans for inter-reader agreement was relatively inexperienced, but the results were still excellent, possibly indicating that CAC scoring is not particularly difficult. The fact that CSCT requires expert interaction, yet is relatively simple, makes it an excellent candidate for AI development in a clinical setting. However, an acceptable run-time is an important prerequisite; the run-time in this study was congruent with, or slightly faster than, what has previously been described [23]. While the automatic software is faster, it also takes expert time to confirm the results. Therefore, the automatic run-time plus the reader confirmation time may be a clinically more appropriate registration, which in this study was still shorter than the time for the semi-automatic method. However, the CAC scoring feedback is not editable. Interestingly, 48% of the excluded CSCT scans were still applicable for semi-automatic reading, showing an excellent correlation and agreement compared with the automatic software but demonstrating less accurate risk group categorization. However, this should be analyzed in larger samples.
No data were missing, but there are limitations to this study. First, all CSCT scans were conducted at a single center on only one CT scanner. This could be a shortcoming since the AS derived from ex vivo human hearts examined on CT scanners from different vendors varies [25]. Yet, there were no substantial differences in inter-scan variability in an in vivo study with 30 patients [24]. Second, the automatic software is only compared with a semi-automatic software from the same vendor. However, this semi-automatic software was previously compared with other vendors, demonstrating similar results [16], thereby supporting external validity. Third, a total of 27 CSCT scans (8.6%) were excluded, limiting generalizability. Fourth, this study is limited to numerical correlation and agreements; no comparisons were performed for individual CSCT scans. Possible false positive and false negative high attenuation objects are thus not evaluated. Fifth, all CSCT scans were obtained from symptomatic patients with no known ischemic heart disease, while the technique is more commonly applied to asymptomatic patients. However, the vast majority had a low pre-test probability and a large proportion had no detectable CAD, and the sample is therefore probably similar to an asymptomatic population. Sixth, time for semi-automatic reading did not include the double read or the second opinion. Nonetheless, CAC scoring is usually performed by one single reader. Seventh, time used for CSCT reading may have inter-reader variations, and our registrations were derived from only one radiologist. Eighth, in the double-check included in the automatic time registration, numerical correlations were not applicable in cases of extensive CAC, due to indistinguishable calcified lesions. Lastly, a larger dataset could reach statistically stronger results. Still, with one exception [14], this is the largest known study evaluating automatic CAC scoring derived from CSCT scans.

Conclusion
There was an excellent correlation and agreement between the automatic software and the reference standard for three CAC scores and for the number of calcified coronary lesions. Risk category classification was accurate, but with an overestimation bias tendency, especially when AS was zero. Double-checking the results should therefore be mandatory. Nonetheless, the automatic run-time plus a manual double-check of the results was still less time-consuming than using the reference standard.