1 Introduction

Computed tomography (CT) is a significant tool in the diagnosis and staging of pneumonia, providing important information for the management of the recent COVID-19 pandemic [1]. However, the pulmonary lesions seen on CT are typical but not pathognomonic [2], which limits the method's diagnostic specificity [3].

Artificial intelligence (AI) is now commonplace in medical imaging owing to its ability to automate image segmentation, improve lesion detection, support quality assurance, and increase inter- and intra-rater reliability [4,5,6,7]. Moreover, structured reports have been reported to enhance radiology reporting by enabling standardization, productivity, and improved information transmission [8, 9], which makes them well suited to supporting AI-based decision-making technologies [8, 10, 11].

Recent studies have shown promising trends for radiomics and AI in classifying pneumonia on CT [12, 13]. Although AI has been widely applied to different data types for COVID-19 diagnosis, such as radiomics and clinical data, it has rarely been applied to imaging reports [14]. Liu et al. developed a clinical-radiomics model that included lesion distribution features and clinical data, demonstrating the added value of reporting lesion distribution in their models [13]. In this scenario, integrating minable data with artificial intelligence can provide tangible benefits for radiology practices and patient care.

In this work, we explored radiological features from multi-center, standardized web-based structured reports together with radiomics. We developed AI models and evaluated their performance in differentiating COVID-19 from other types of pneumonia, as well as their impact on inter-rater reliability.

2 Materials and Methods

2.1 Subjects

Cases were defined as subjects with COVID-19 confirmed by RT-PCR who were referred for thorax CT between April 2020 and April 2021, sequentially selected from three Brazilian academic hospitals (Table 1). Controls were defined as sequentially selected subjects with pneumonia from before the COVID-19 pandemic who were referred for thorax CT between January 2018 and October 2019.

Table 1 Subject and center characteristics

Exclusion criteria were (a) CT images with poor quality; (b) small or imperceptible lesions; (c) unavailability of RT-PCR results for the confirmed COVID-19 group; (d) images that were not compliant with the institutional protocol for lung imaging (e.g., matrix size, kernel); and (e) images that were no longer available on the Picture Archiving and Communication System (PACS) during the retrieval process.

The study was approved by the ethics committee at each institution: Hospital Universitario Prof. Edgard Santos (UFBA) (4.494.511), Hospital Universitario Alcides Carneiro (UFCG) (4.569.389), and Hospital Universitario de Juiz de Fora (UFJF) (4.926.688).

2.2 Structured Report

The structured report comprised ten questions concerning the radiological findings, five on lesion analysis and five on lesion distribution, each eliciting a binary (yes/no) response:


Analysis:

  • A1: Predominant ground-glass opacity with a rounded morphology.

  • A2: Ground-glass opacities with superimposed interlobular septal thickening and intralobular septal thickening (crazy-paving pattern).

  • A3: Ground-glass opacity and pulmonary consolidation.

  • A4: Pulmonary consolidation with air bronchograms.

  • A5: Reversed halo sign or cryptogenic organizing pneumonia (COP).

Distribution:

  • D1: Bilateral and multifocal.

  • D2: Peripheral distribution.

  • D3: Prevalent in lower lobes and dorsal region.

  • D4: Peribronchovascular opacities and peripheral distribution.

  • D5: Diffuse opacities.

Experienced radiologists (one from each hospital) retrieved the corresponding CT images from the hospital PACS and selected the applicable structured report findings. Only image series with axial slices of the lung were considered. These data were then entered into the Nuclearis software (Salvador, Brazil), a web-based radiology information system capable of personalizing standard structured reports. No other information (e.g., side notes, remarks, or additional details) was analyzed.

2.2.1 Structured Report Model

An Extreme Gradient Boosting (XGBoost) model was trained on the training cohort (n = 73) and validated on the independent cohort (n = 55), producing a structured report score between 0 and 1, with 0 representing non-COVID-19 and 1 representing COVID-19.
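For illustration, the sketch below shows an equivalent classifier in Python with the xgboost package (the study itself used Orange [17] for the machine learning analysis); the file names, column names, and hyperparameters are assumptions, not taken from the study.

```python
import pandas as pd
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

# The ten binary structured report findings described above.
FEATURES = ["A1", "A2", "A3", "A4", "A5", "D1", "D2", "D3", "D4", "D5"]

# Hypothetical files: one row per subject, 0/1 findings plus a "covid" label.
train = pd.read_csv("structured_reports_train.csv")  # training cohort, n = 73
valid = pd.read_csv("structured_reports_valid.csv")  # validation cohort, n = 55

model = XGBClassifier(n_estimators=100, max_depth=3, eval_metric="logloss")
model.fit(train[FEATURES], train["covid"])

# Continuous structured report score in [0, 1]: probability of COVID-19.
scores = model.predict_proba(valid[FEATURES])[:, 1]
print("Validation AUC:", roc_auc_score(valid["covid"], scores))
```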

2.3 Radiomics

2.3.1 Automatic Segmentation

The process involved 37 COVID-19 individuals and a further 36 individuals with pneumonia not associated with COVID-19 (hereafter denominated the feature selection cohort). The CT images of these patients were manually segmented using LIFEx, version 6.30 (www.lifex.org), to train a 2D convolutional neural network [15] under a fivefold cross-validation scheme. A total of 12,780 CT slices were used, and the data were augmented by applying 1 mm erosion to the LIFEx ROIs and flips and rotations to the original images and ROIs, resulting in a total of 51,120 images. In each fold, 60 subjects were used for training, and the remaining subjects were segmented automatically. The 2D output probability maps were filtered with a 3D Gaussian kernel before thresholding to obtain the final 3D lesion segmentation, as sketched below. Only the independently segmented lesions were included in the radiomic feature selection analysis. In summary, this process used the established image segmentation software LIFEx to delineate the ground truth of lung lesions, with the aim of developing proprietary automatic segmentation software for the radiomic feature extraction.
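The sketch below illustrates this post-processing step under stated assumptions: the per-slice probability maps are taken as already predicted by the network, and the Gaussian sigma and threshold values are illustrative, since the original values are not reported here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, label

def probability_maps_to_mask(prob_maps_2d, sigma=1.0, threshold=0.5):
    """Stack per-slice 2D probability maps, smooth in 3D, and threshold."""
    volume = np.stack(prob_maps_2d, axis=0)          # (slices, height, width)
    smoothed = gaussian_filter(volume, sigma=sigma)  # 3D Gaussian filtering
    mask = smoothed > threshold                      # final binary 3D mask
    # Label connected components so that each lesion can later be scored
    # individually by the lesion-level radiomic model.
    lesions, n_lesions = label(mask)
    return mask, lesions, n_lesions

# Example with synthetic probability maps (128 slices of 64 x 64 pixels).
maps = np.random.rand(128, 64, 64)
mask, lesions, n = probability_maps_to_mask(list(maps))
```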

2.3.2 Feature Extraction

Radiomic features were extracted using PyRadiomics [16]. The matrix grid was resampled to voxels of 1 × 1 × 2 mm³, and the discretization within each ROI was scaled to 128 grayscale levels. Radiomic feature classes and the corresponding features are presented in the supplementary material. A total of 93 features were calculated, including 18 first-order and 75 textural features: 16 gray-level run length matrix (GLRLM), 16 gray-level size zone matrix (GLSZM), five neighborhood gray-tone difference matrix (NGTDM), 14 gray-level dependence matrix (GLDM), and 24 gray-level co-occurrence matrix (GLCM) features.
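A minimal sketch of these extraction settings using the PyRadiomics API is shown below; the image and mask paths are placeholders, and restricting the extractor to the six feature classes listed above is our reading of the study's configuration.

```python
from radiomics import featureextractor

settings = {
    "resampledPixelSpacing": [1, 1, 2],  # x, y, z voxel size in mm
    "binCount": 128,                     # 128 grayscale levels per ROI
}
extractor = featureextractor.RadiomicsFeatureExtractor(**settings)

# Enable only the classes used in the study: first order plus the five
# texture families (GLRLM, GLSZM, NGTDM, GLDM, GLCM).
extractor.disableAllFeatures()
for cls in ["firstorder", "glrlm", "glszm", "ngtdm", "gldm", "glcm"]:
    extractor.enableFeatureClassByName(cls)

# Placeholder paths: image and lesion mask for one segmented lesion.
features = extractor.execute("ct_image.nii.gz", "lesion_mask.nii.gz")
```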

2.3.3 Feature Selection

The feature selection process is illustrated in Fig. 1. Lung lesions from the feature selection cohort were automatically segmented, and two additional segmentation datasets (artificial observers) were generated using erosion functions of 1 and 3 mm. Blur and sharpening kernels were additionally applied to the original images to simulate different scanner/image reconstructions. The 93 radiomic features were computed for each of the three observers in both the erosion and image quality groups. The intraclass correlation coefficient (ICC) was used to estimate reproducibility, and features with ICC > 0.90 were selected, as sketched below.
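The sketch below shows one way to implement this reproducibility screen in Python, assuming the per-observer feature values are in a long-format table; the pingouin package and the ICC2 variant are our assumptions, since the study does not state which ICC form or software was used.

```python
import numpy as np
import pandas as pd
import pingouin as pg

def select_stable_features(df, feature_names, icc_cutoff=0.90):
    """Keep features whose ICC across observers exceeds the cutoff.

    df is long-format: columns 'lesion', 'observer', one column per feature.
    """
    stable = []
    for feat in feature_names:
        icc = pg.intraclass_corr(
            data=df, targets="lesion", raters="observer", ratings=feat
        )
        # ICC2 (two-way random, absolute agreement) is an assumption here;
        # the paper does not state which ICC variant was used.
        value = icc.set_index("Type").loc["ICC2", "ICC"]
        if value > icc_cutoff:
            stable.append(feat)
    return stable

# Synthetic example: 10 lesions rated by the original segmentation and
# the two eroded "artificial observers".
rng = np.random.default_rng(0)
base = rng.normal(size=10)
rows = [
    {"lesion": i, "observer": obs, "energy": base[i] + rng.normal(scale=0.05)}
    for i in range(10)
    for obs in ("original", "erosion_1mm", "erosion_3mm")
]
print(select_stable_features(pd.DataFrame(rows), ["energy"]))
```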

Fig. 1 Feature selection framework

2.3.4 Radiomic Model

We used the same feature selection cohort for model building (n = 73) and an independent cohort of 55 subjects (27 COVID-19 and 28 non-COVID-19) for model validation. A random forest model was built at the lesion level (segmented lesions = 1060), with each lesion annotated according to its subject's status. The algorithm classified lesions using a radiomic score ranging from 0 to 1, with 0 representing a non-COVID-19 lesion and 1 representing a COVID-19 lesion. The model (rf mean) was validated at the subject level by averaging the scores of all lesions belonging to each subject, as sketched below.
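A minimal sketch of the lesion-level model and the subject-level averaging is given below, using scikit-learn in place of Orange [17], which the study actually used; the data here are synthetic placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic placeholders: 1060 lesions x 3 selected features; each lesion
# carries the COVID-19 status (0/1) of the subject it belongs to.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1060, 3))
y_train = rng.integers(0, 2, size=1060)
X_valid = rng.normal(size=(300, 3))
subject_ids = rng.integers(0, 55, size=300)   # lesion-to-subject mapping

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)                      # lesion-level training

# Lesion-level radiomic score in [0, 1] (probability of COVID-19).
lesion_scores = rf.predict_proba(X_valid)[:, 1]

# "rf mean": subject-level score as the mean of that subject's lesion scores.
subject_scores = {
    sid: lesion_scores[subject_ids == sid].mean()
    for sid in np.unique(subject_ids)
}
```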

2.4 Physician’s Performance

Two experienced physicians, one radiologist and one pneumologist, scored the CT images in the validation group (n = 55) to classify the pneumonia as COVID-19 or non-COVID-19. Both physicians were blinded to the RT-PCR results and clinical information. The radiologist was the same person who completed the structured reports (Hospital B); the pneumologist participated only in this validation step. They initially analyzed the CT images independently, with no support from the machine learning models. After 30 days, they repeated the analysis with the support of both the radiomic and structured report models, each providing a binary classification.

2.5 Statistical Analysis

The odds ratios of the radiological findings were assessed as an indication of feature importance, using the entire dataset (n = 128). The mean Dice coefficient (F1 score) was computed using Python 3.9.2 to assess the performance of our automatic COVID-19 segmentation method. Machine learning analysis was performed using Orange software, version 3.29.3 [17]. The sensitivity and specificity of the models were calculated from the categorical classification outputs of Orange, and the 95% confidence intervals were estimated using OpenEpi software [18]. Receiver operating characteristic (ROC) curves were assessed to estimate the area under the ROC curve (AUC) and the 95% confidence intervals (CI), using the continuous scores from Orange and SPSS 18.0 for Windows. The mean ROC curve for the physicians' performance was simulated by iteratively perturbing the true values with a random Gaussian function to improve readability and interpretation; the procedure was stopped when the simulated ROC curve reproduced the observed sensitivity and specificity [19]. Inter-rater reliability was assessed with and without the support of the machine learning models using Cohen's kappa coefficient (k), classified as weak (k ≤ 0.20), fair (0.20 < k ≤ 0.40), moderate (0.40 < k ≤ 0.60), substantial (0.60 < k ≤ 0.80), or strong (k > 0.80) [20].
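For reference, the sketch below shows minimal implementations of three of these metrics (Dice coefficient, Cohen's kappa, and an odds ratio with a Fisher exact p-value); the inputs are synthetic, and the study itself used OpenEpi and SPSS for the confidence intervals.

```python
import numpy as np
from scipy.stats import fisher_exact
from sklearn.metrics import cohen_kappa_score

def dice(mask_a, mask_b):
    """Dice coefficient (F1 score) between two binary masks."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * intersection / (mask_a.sum() + mask_b.sum())

# Synthetic segmentation overlap.
a = np.zeros((4, 4), dtype=bool); a[1:3, 1:3] = True
b = np.zeros((4, 4), dtype=bool); b[1:3, 2:4] = True
print("Dice:", dice(a, b))

# Cohen's kappa between two readers' binary COVID-19 calls.
reader1 = [1, 0, 1, 1, 0, 1, 0, 0]
reader2 = [1, 0, 0, 1, 0, 1, 0, 1]
print("Kappa:", cohen_kappa_score(reader1, reader2))

# Odds ratio of a binary finding from a 2x2 contingency table:
# [[finding present & COVID, present & non-COVID],
#  [finding absent & COVID,  absent & non-COVID]]
odds_ratio, p_value = fisher_exact([[30, 10], [15, 35]])
print("OR:", odds_ratio, "p:", p_value)
```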

3 Results

3.1 Structured Reports

The XGBoost model achieved 81.6% (66.6–90.8%) sensitivity and 82.9% (67.3–91.9%) specificity in the training cohort (AUC = 0.91), and 77.3% (56.6–89.9%) sensitivity and 69.7% (52.7–82.7%) specificity in the validation cohort (AUC = 0.82) (Table 3).

The odds ratios (95% CI) revealed the feature importance presented in Table 2. According to this analysis, features A1, A2, A4, D1, and D2 were the strongest predictors (p < 0.05), all occurring more frequently in the COVID-19 group except A4, which appeared more frequently in the non-COVID-19 group.

Table 2 Feature importance for structured reports

3.2 Radiomics

The mean F1 score of the auto-segmentation algorithm was 0.72. Figure 2 shows the axial view of an individual infected with COVID-19 (Fig. 2a), the auto-segmentation (Fig. 2b), ROI erosion functions at 1 mm (Fig. 2c) and 3 mm (Fig. 2d), and image enhancements with blurred (Fig. 2e) and sharpened (Fig. 2f) kernels.

Fig. 2 Auto-segmentation and observer augmentation

The intersection between erosion and image quality groups resulted in three features showing excellent reproducibility: GLDM Gray Level Non-Uniformity, First Order Energy, and First Order Total Energy.

The rf mean model with these selected features produced AUC = 1.0 in the training cohort and AUC = 0.84 (0.73–0.95) in the validation cohort, both at the subject level. Sensitivity and specificity in the validation cohort were 70.4% (51.5–84.2%) and 89.3% (72.3–96.3%), respectively (Table 3).

Table 3 Model performance on the validation dataset

3.3 Physicians

In the physicians' assessment, the pneumologist performed with a sensitivity and specificity of 81.5% (63.3–91.8%) and 67.9% (49.3–82.1%), respectively, while the radiologist performed with a sensitivity and specificity of 85.2% (67.5–94.1%) and 60.7% (42.4–76.4%), respectively. When the physicians were assisted by the artificial intelligence models, their sensitivity and specificity were, respectively, 85.2% (67.5–94.1%) and 85.7% (68.5–94.3%) for the pneumologist, and 88.9% (71.9–96.2%) and 96.4% (82.3–99.4%) for the radiologist. Cohen's kappa between the two physicians was k = 0.66 (0.46–0.86) without the support of machine learning, rising to 0.78 (0.61–0.94) with it. The AI-assisted readings therefore raised the lower confidence limit of agreement from moderate to substantial. Figure 3 depicts the overall readers' performance based on the AUC, while Fig. 4 depicts the readers' reliability.

Fig. 3 Physicians and artificial intelligence performance

Fig. 4 Readers' reliability. Physicians' agreement without artificial intelligence (A) and with artificial intelligence (B)

Although a significant difference with and without AI was found only for the radiologist's specificity (p < 0.05), our results suggest that standardized and semi-automatic measures based on questionnaires and radiomics may assist physicians with classification.

4 Discussion

This study shows that structured reports, radiomics, and machine learning algorithms provide a reliable classification of CT scans for identifying COVID-19 individuals and assist physicians in their decisions. Our study characterized pneumonia CT examinations using radiomics, a structured report questionnaire covering typical lung CT findings in pneumonia, and AI models. Furthermore, we adopted a careful methodological and statistical design to avoid pitfalls in the modeling workflow and analysis, thus promoting the reliability and repeatability of the results [21, 22].

Syeda et al. [14] published a systematic review on the role of machine learning in tackling coronavirus disease. Of the forty studies that evaluated the diagnosis of COVID-19, however, none used imaging features extracted from standardized structured reports. Although several authors have reported applications of AI for COVID-19 diagnosis using CT, their work was based on image processing techniques and/or clinical data.

We were not able to demonstrate that the models' performance was statistically higher than that achieved by the physicians, but the observed trends are encouraging. More importantly, the physicians improved their reporting when assisted by these models. Interestingly, the results of our radiomic model are similar to those of Guiot et al. [12], who achieved a sensitivity of 69.5% and a specificity of 91.6%. Liu et al. [13] developed both clinical-radiological and clinical-radiomic models, reporting a sensitivity and specificity of 63% and 84%, respectively, for their clinical-radiological model, which included age, gender, neutrophil ratio, lymphocyte count, location, distribution, reticulation, and lung involvement as features. Its overall performance of AUC = 0.83 (0.75–0.90) is equivalent to that of our structured report model, while their clinical-radiomic model achieved AUC = 0.93 (0.85–1.00) with a sensitivity and specificity of 85% and 90%. Although clinical data may provide additional predictive information, such data are not available in many clinical settings, whereas our models require only simple scoring and automated image quantification.

Our work also confirmed the higher prevalence of typical features previously reported in COVID-19 cases (A1, A2, D1, D2) [1, 23], demonstrating the contribution of these patterns to the artificial intelligence models. The performance of these models motivates a more extensive clinical evaluation based on a larger cohort, designed to evaluate the gains in accuracy and inter/intra-rater reliability achieved by radiologists with different levels of experience using our algorithm.

Our study has certain limitations. Firstly, the limited sample size imposes relatively wide confidence intervals on our results; sensitivity and specificity estimates are not biased by sample size, but larger samples yield narrower confidence intervals. Our results are presented with 95% confidence intervals and 98% statistical power [18]. Secondly, our radiomic model can only be generalized when the feature selection and extraction parameters used in this work are reproduced. In addition, our algorithm outputs two scores (radiomic and structured report), which limits physicians' decision-making when the two scores are discordant (n = 18/55). Finally, our models are disease specific: the AI models were trained and validated on subjects affected by pneumonia, so our results are not necessarily valid for cancer or other inflammatory lesions. Our method only informs the likelihood that the lesions' etiology is COVID-19.

The management of COVID-19 patients is multifaceted. On CT, COVID-19 lesions usually present as rounded or polygonal ground-glass opacities with a predominantly peripheral and bilateral distribution, without pulmonary consolidation (Table 2). Since inflammatory lesions in the lungs are not pathognomonic, the etiology of some pneumonia cases is difficult to differentiate using CT alone, without complementary technologies. Thus, decision-making technologies for handling coronavirus disease in real time are valuable tools both to avoid its transmission and to rationalize resources. The proposed approach aims to improve standardization, automate the identification of infected patients, and rule out cases of pneumonia that are unlikely to involve COVID-19. Our efforts are aligned with the global demand to make minable data and artificial intelligence feasible and translational to clinical practice, given that radiomics, reporting standardization, and statistical tools are increasingly deployed [24].

5 Conclusion

Radiomics, structured reports, and machine learning algorithms enable good classification of CT scans to identify COVID-19 individuals. In addition, this approach assisted physicians in better standardizing the classification and reporting of COVID-19 on computed tomography.