Initial chest radiographs and artificial intelligence (AI) predict clinical outcomes in COVID-19 patients: analysis of 697 Italian patients

Objective To evaluate whether the initial chest X-ray (CXR) severity assessed by an AI system may have prognostic utility in patients with COVID-19.
Methods This retrospective single-center study included adult patients presenting to the emergency department (ED) between February 25 and April 9, 2020, with SARS-CoV-2 infection confirmed on real-time reverse transcriptase polymerase chain reaction (RT-PCR). Initial CXRs obtained on ED presentation were evaluated by a deep learning artificial intelligence (AI) system and compared with the Radiographic Assessment of Lung Edema (RALE) score, calculated by two experienced radiologists. Death and critical COVID-19 (admission to intensive care unit (ICU) or deaths occurring before ICU admission) were identified as clinical outcomes. Independent predictors of adverse outcomes were evaluated by multivariate analyses.
Results Six hundred ninety-seven (697) patients were included in the study: 465 males (66.7%), median age of 62 years (IQR 52–75). Multivariate analyses adjusting for demographics and comorbidities showed that an AI system-based score ≥ 30 on the initial CXR was an independent predictor both for mortality (HR 2.60 (95% CI 1.69–3.99; p < 0.001)) and critical COVID-19 (HR 3.40 (95% CI 2.35–4.94; p < 0.001)). Other independent predictors were RALE score, older age, male sex, coronary artery disease, COPD, and neurodegenerative disease.
Conclusion AI- and radiologist-assessed disease severity scores on CXRs obtained on ED presentation were independent and comparable predictors of adverse outcomes in patients with COVID-19.
Trial registration ClinicalTrials.gov NCT04318366 (https://clinicaltrials.gov/ct2/show/NCT04318366).
Key Points
• An AI system-based score ≥ 30 and a RALE score ≥ 12 on CXRs performed at ED presentation are independent and comparable predictors of death and/or ICU admission in COVID-19 patients.
• Other independent predictors are older age, male sex, coronary artery disease, COPD, and neurodegenerative disease.
• The comparable performance of the AI system in relation to a radiologist-assessed score in predicting adverse outcomes may represent a game-changer in resource-constrained settings.

Severe complications of COVID-19 include severe pneumonia, acute respiratory distress syndrome, multiple organ failure, and death [2]. Since initial detection, the virus has rapidly spread across the world, infecting more than 14 million people and killing over 590,000 [3].
As the COVID-19 pandemic keeps overwhelming healthcare systems worldwide, a prompt, large-scale public health surveillance and response program is needed. Real-time reverse transcriptase polymerase chain reaction (RT-PCR) remains the reference standard for diagnosis, but its high false-negative rate, limited testing capacity, and long turnaround times hinder its effectiveness, most likely contributing to the spread of the infection within communities [4].
In this scenario, the role of imaging with chest X-ray (CXR) and chest computed tomography (CT) may become fundamental in quickly providing results that can guide in terms of triage and clinical management [5].
The few currently available reports on the use of CXR for COVID-19 diagnosis suggest that this imaging modality lacks sensitivity for some of the findings that are evident on CT [6][7][8][9]. However, during the SARS coronavirus outbreak in 2003, CXR findings were found to be associated with worse clinical outcomes [10][11][12], similar to what has been recently reported in COVID-19 infection in young and middle-aged adults [13].
In resource-constrained settings with large throughputs to handle, such as with the COVID-19 pandemic, artificial intelligence (AI) may help expedite reading times, thus becoming an important asset in the clinical management of these patients. An AI system for the detection of COVID-19 was recently shown to be able to identify COVID-19 pneumonia on CXR with performance comparable to six independent radiologists [14]. However, the role of AI on CXR as a prognostic tool has not yet been evaluated in COVID-19 patients.
Taken together, the immediate and widespread availability of CXR, its reduced risk of cross-infection and low cost compared with CT imaging, and its potential to become a first-line triage tool warrant further investigation to increase our understanding of the predictive role of radiographic features in COVID-19 [5].
In this context, our study aims to identify and quantify COVID-19 CXR findings, assess the relationship between initial CXR severity and clinical outcomes, and evaluate the use of an AI system as an initial prognostic tool in COVID-19.

Patients
This series is part of the COVID-19 institutional clinical-biological cohort assessing patients with COVID-19 (COVID-BioB, ClinicalTrials.gov NCT04318366) at a 1350-bed tertiary care academic hospital in Milan, Italy. The study was approved by the ethics committee (EC) (protocol number 34/INT/2020). All procedures were conducted in agreement with the 1964 Helsinki declaration and its later amendments; informed consent was collected from all patients according to the EC guidelines.
All consecutive patients aged ≥ 18 years, admitted to the Institution's Emergency Department (ED) with a positive RT-PCR nasopharyngeal swab between February 25 and April 9, 2020, were initially considered. Patients with a CXR obtained on presentation were included in the study. Exclusion criteria were patients who acquired infection during hospitalization, those transferred to the institution from other hospitals or later transferred to other hospitals, and those with positive RT-PCR as outpatients. A complete exclusion flow diagram is provided in Fig. 1.

Clinical data collection
All prospectively collected clinical data were retrospectively extracted from the study's dedicated electronic database.
The time-to-event for clinical outcomes, i.e., death, admission to intensive care unit (ICU), and discharge, was calculated from the date of hospital admission to the date of the event; follow-up was right-censored on May 5, 2020.
Clinical outcomes categories were defined as (i) death (primary) and (ii) critical COVID-19, which included patients admitted to ICU and deaths occurring before ICU admission.
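The time-to-event and composite outcome definitions above can be illustrated with a minimal sketch. The function names and the handling of events recorded after the censoring date are assumptions made for illustration and are not taken from the study database; only the censoring date (May 5, 2020) and the outcome definitions come from the text.

```python
from datetime import date
from typing import Optional, Tuple

# Administrative right-censoring date stated in the text.
CENSOR_DATE = date(2020, 5, 5)


def time_to_event(admission: date, event: Optional[date]) -> Tuple[int, bool]:
    """Return (days of follow-up, event_observed).

    Assumption: an event dated after the censoring date is treated as
    not observed, and follow-up stops at the censoring date.
    """
    if event is not None and event <= CENSOR_DATE:
        return (event - admission).days, True
    return (CENSOR_DATE - admission).days, False


def critical_covid_date(icu_admission: Optional[date],
                        death: Optional[date]) -> Optional[date]:
    """Date of the composite 'critical COVID-19' outcome.

    ICU admission counts when present; otherwise a death
    (necessarily before any ICU admission) counts.
    """
    return icu_admission if icu_admission is not None else death
```

For example, a patient admitted on April 1, 2020 with no recorded event would contribute 34 days of censored follow-up under these assumptions.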

Imaging data collection and evaluation
Conventional chest X-ray (CXR) images were acquired in the posteroanterior (PA) or in the anteroposterior (AP) projection for patients not able to stand. All AP projection images were acquired with portable X-ray machines with patients in a supine position or sitting up.
Radiographs obtained on ED presentation were reviewed by two radiologists (F.D.C. and C.M.A.M., respectively, with 30 years and 24 years of experience in thoracic imaging); agreement was obtained by consensus. To minimize bias, reviewers had no knowledge of clinical data other than COVID-19 positivity.
The following radiographic findings were evaluated: hazy opacities, consolidation, hilar enlargement, and pleural effusion [15]. The distribution of lung opacities was assessed and categorized as follows: peripheral/peri-hilar predominance, upper/lower quadrant predominance, or no predominance, and as bilateral or unilateral involvement.
The severity of lung involvement, on all baseline CXRs, was quantified by a deep learning artificial intelligence (AI) system (qXR v2.1 c2, Qure.ai Technologies) and compared with a radiologist-assessed score.
qXR is a CE-certified deep learning AI system based on a set of convolutional neural networks (CNNs) trained to detect a number of specific abnormalities on frontal CXRs (blunting of costophrenic angle, cardiomegaly, cavitation, consolidation, fibrosis, hilar enlargement, nodules, opacities and pleural effusion). The specific architectures that form the basic blocks in the systems and detect individual abnormalities are versions of residual neural networks (ResNets) with squeeze-excitation modules with abnormality-specific modifications. The AI system identifies normal CXRs and detects and localizes suspect abnormalities providing results in terms of percentage of involvement and, if necessary, reports the pre-defined tags.
The algorithm was trained on a set of 2.3 million CXRs collected from different centers in different geographical locations [16]. Two different datasets (respectively consisting of more than 89,000 and 2000 distinct CXRs) were used for validation and another set of images for algorithm development. A validated natural language processing algorithm identified the defined abnormalities in the original radiology reports, in the largest dataset, which were considered the gold standard. The developers report that the algorithm, using the radiologists' assessment as the gold standard, achieved an area-under-the-curve (AUC) for the detection of the specific abnormalities varying from 0.89 to 0.98; notably, AUCs of 0.95 (95% CI 0.92-0.98) for consolidation and 0.94 (95% CI 0.93-0.96) for opacities [16]. The algorithm was additionally tuned with recent images from COVID-19-positive and COVID-19-negative patients [17].
For the purpose of our study, the software output was customized to report only the extent of consolidation and lung opacities. The severity of the lung involvement was calculated by the AI system as the percentage of pixels involved by opacity or consolidation for each lung (cutoff 3%). The average of the two values ((percentage of right lung involvement + percentage of left lung involvement)/2), the Qure AI "score," was then obtained to reflect total lung involvement (minimum score 0 = no lung involvement; maximum score 100 = complete opacification/consolidation of both lungs) as described in Fig. 2. The same CXRs were then evaluated by the radiologists using the Radiographic Assessment of Lung Edema (RALE) score to quantify the severity of lung abnormalities [18]. Each CXR was divided into quadrants, and each quadrant was assigned a score by a radiologist which described the (i) extent of opacities (0-4; absence, < 25%, 25-50%, 50-75%, and > 75% involvement) and (ii) density of opacity (1-3; hazy, moderate, or dense). The final score (maximum 48) was obtained by summing the product of the extent and density scores for each of the four quadrants.
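The arithmetic of the two scores can be sketched as follows. This is an illustration of the formulas stated above, not the vendor's or the radiologists' implementation; the function names are hypothetical, and the way the 3% cutoff is applied (per-lung values below it are treated as zero) is an assumption, since the text does not specify it.

```python
from typing import Iterable, Tuple


def qure_ai_score(right_pct: float, left_pct: float, cutoff: float = 3.0) -> float:
    """Average of right and left lung involvement percentages (0-100).

    Assumption: a per-lung percentage below `cutoff` counts as 0.
    """
    for pct in (right_pct, left_pct):
        if not 0.0 <= pct <= 100.0:
            raise ValueError("percentages must lie in [0, 100]")
    right = right_pct if right_pct >= cutoff else 0.0
    left = left_pct if left_pct >= cutoff else 0.0
    return (right + left) / 2.0


def rale_score(quadrants: Iterable[Tuple[int, int]]) -> int:
    """Sum over four quadrants of extent (0-4) times density (1-3)."""
    quadrants = list(quadrants)
    if len(quadrants) != 4:
        raise ValueError("exactly four quadrants are required")
    total = 0
    for extent, density in quadrants:
        if extent not in range(0, 5):
            raise ValueError("extent must be 0-4")
        # Density is irrelevant when a quadrant has no opacity.
        if extent > 0 and density not in range(1, 4):
            raise ValueError("density must be 1-3")
        total += extent * density
    return total
```

With these definitions, four quadrants each scored extent 4 and density 3 give the stated maximum RALE score of 48, and complete involvement of both lungs gives the maximum AI score of 100.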

Statistical analysis
Patients' characteristics were assessed with standard descriptive statistics. Frequencies presented as percentages were used to express categorical values; median values with respective interquartile ranges (IQR) were used for continuous variables. Imputation for missing data was not performed.
To evaluate the sensitivity of the initial CXR, radiological scores of > 0 were interpreted as positive.
The correlation between the two radiological scores was assessed by Kendall's rank correlation test.
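As an illustration of the rank correlation used here, the sketch below computes Kendall's tau with the tie-corrected tau-b formula. The choice of the tau-b variant is an assumption (the text does not state which variant was used), and the function name is hypothetical; the study itself used SPSS and SAS.

```python
from math import sqrt
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b: (C - D) / sqrt((n0 - n1) * (n0 - n2)).

    C and D are the counts of concordant and discordant pairs, n0 is the
    total number of pairs, and n1 and n2 are the numbers of pairs tied
    in x and in y, respectively.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx != 0 and dy != 0:
                if dx * dy > 0:
                    concordant += 1
                else:
                    discordant += 1
    n0 = n * (n - 1) // 2
    denom = sqrt((n0 - ties_x) * (n0 - ties_y))
    if denom == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return (concordant - discordant) / denom
```

The pairwise loop is quadratic in the number of patients, which is acceptable for a cohort of a few hundred paired scores.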
Baseline CXR lung opacity characteristics and radiological scores of patients with symptoms suggestive for COVID-19 for < 7 days or ≥ 7 days at ED presentation were compared using the chi-square test and Mann-Whitney U test, respectively; the cutoff at 7 days was selected as the median value.
The ability of the AI-calculated total lung involvement and the radiologist-assessed RALE score to predict mortality and critical COVID-19 was determined by the area-under-the-curve (AUC) of receiver operating characteristics (ROC) curves. The optimal cutoff values were determined on the highest Youden index value for the primary outcome (mortality) and were used to estimate Kaplan-Meier curves for survival and ICU-free survival, which were compared by the log-rank test.
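The cutoff selection by Youden index can be sketched as below. This is a minimal illustration, assuming that a score at or above the candidate cutoff is classified as positive; the function name and the toy data in the usage note are hypothetical and are not study data.

```python
from typing import Sequence, Tuple


def youden_cutoff(scores: Sequence[float],
                  events: Sequence[bool]) -> Tuple[float, float]:
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    A patient is classified positive when score >= cutoff. Ties in J are
    resolved in favor of the lowest cutoff.
    """
    if len(scores) != len(events):
        raise ValueError("scores and events must have equal length")
    n_pos = sum(1 for e in events if e)
    n_neg = len(events) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both outcome classes are required")
    best_cutoff, best_j = None, float("-inf")
    for cutoff in sorted(set(scores)):
        true_pos = sum(1 for s, e in zip(scores, events) if e and s >= cutoff)
        false_pos = sum(1 for s, e in zip(scores, events) if not e and s >= cutoff)
        # sensitivity - (1 - specificity) is the same quantity as J.
        j = true_pos / n_pos - false_pos / n_neg
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j
```

On a toy input such as scores `[5, 10, 20, 35, 40, 50]` with events only in the three highest-scoring patients, the function selects the lowest score that separates the two groups.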
A Cox proportional hazard model including sex and age (model 1) was used to evaluate the association between radiological scores and clinical outcomes. A second more comprehensive Cox proportional hazard model (model 2) which in addition to model 1 variables included important comorbidities, or known risk factors, was also used. Effect estimates were reported as hazard ratios (HRs) with 95% confidence intervals (CIs).
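For reference, the standard form of the Cox proportional hazards model underlying model 1 can be written as follows. The symbols are illustrative assumptions rather than the authors' notation; model 2 would add one term per comorbidity or risk factor inside the exponential, and the Wald-type confidence interval shown is the conventional default, assumed here rather than confirmed by the text.

```latex
% Model 1: radiological score adjusted for age and sex
\[
h(t \mid x) = h_0(t)\,
  \exp\!\left(\beta_{\mathrm{score}}\,x_{\mathrm{score}}
            + \beta_{\mathrm{age}}\,x_{\mathrm{age}}
            + \beta_{\mathrm{sex}}\,x_{\mathrm{sex}}\right)
\]

% Hazard ratio and 95% confidence interval for the score term
\[
\mathrm{HR}_{\mathrm{score}} = \exp\!\left(\hat{\beta}_{\mathrm{score}}\right),
\qquad
95\%\ \mathrm{CI} =
  \exp\!\left(\hat{\beta}_{\mathrm{score}}
      \pm 1.96\,\mathrm{SE}\!\left(\hat{\beta}_{\mathrm{score}}\right)\right)
\]
```

Here $h_0(t)$ is the unspecified baseline hazard, and $x_{\mathrm{score}}$ can be read as an indicator for a score at or above the chosen cutoff, so that the hazard ratio compares patients above and below that cutoff with the other covariates held fixed.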
The correlation between the RALE or Qure AI score and clinical signs (temperature and PaO2/FiO2 ratio) was evaluated using two-tailed Pearson's correlation or Kendall's tau, depending on the distribution of the variables.
Two-tailed tests were performed, and a p value of < 0.05 was considered statistically significant.
Statistical analyses were performed using SPSS 26 (SPSS Inc./IBM) and SAS version 9.4.

Clinical data
Six hundred and ninety-seven (697) patients were included in the study (
The AI system reported no involvement in 140 patients (20.4%) while the RALE score reported negative CXRs (RALE score = 0) in 91 patients (13.1%); therefore, the sensitivities were 79.6% and 86.9%, respectively. There were no patients with a Qure AI score in the lower quartile (Q1, score < 11) and a RALE score in the upper quartile (Q4, score > 13). There were, however, two cases with a RALE score in the lower quartile (0 and 3) but a Qure AI score in the upper quartile (> 44; respectively 46 and 44.5). The first case was due to the presence of dense breast tissue, which was interpreted by the AI system as an increased density of the underlying lung. The second case was a patient with a large unilateral pleural effusion that was interpreted by the AI system as an extensive parenchymal consolidation.
Of the 45 patients with a negative first RT-PCR at ED presentation, 8 were reported as no involvement by the AI system and 5 were scored negative by radiologists' assessment (RALE score = 0).
The Mann-Whitney U test showed that the two radiological scores were significantly higher in patients with symptoms for ≥ 7 days at ED presentation compared with those symptomatic for less than 7 days (p = 0.031 for Qure AI score and p = 0.014 for RALE score). Peripheral predominance was the only radiographic finding significantly different between the two groups (p = 0.001) (Table 3).

Clinical outcomes
Regarding the ability of the two scores to predict outcomes, the optimal cutoffs identified in the ROC curves were 30 for the Qure AI score and 12 for the RALE score.
Temperature at ED presentation was not correlated with the RALE or Qure AI score. At ED presentation, higher baseline scores were significantly correlated with lower PaO2/FiO2 (RALE score: − 0.353, p < 0.001 (Kendall's tau); Qure AI score: − 0.476, p < 0.001 (Pearson's test)).
Full results for other independent predictors are reported in Table 4.

Discussion
Current COVID-19 radiological literature is dominated by CT, and only a limited number of reports describe the role of CXRs [6][7][8][9]. CT was proposed as a first-line investigation at the start of this pandemic; however, this approach has some limitations [19]. Not only does it increase the risk of transmission to healthcare workers and other patients, but the decontamination procedures required after scanning COVID-19 patients can also disrupt radiological service availability in settings where a dedicated COVID-19 CT scanner is not available. Thus, the Fleischner Society Consensus and the American College of Radiology caution against this approach, and the latter suggests the use of CXR to minimize the risk of cross-infection [5][20][21][22]. Considering all this, and the fact that CXR is widely available and already routinely obtained in the ED, improving our understanding of the role of COVID-19 CXR radiographic features is essential. In our cohort, radiographic features were consistent with those of other reports; the distribution of lung opacities (consolidation and hazy opacities) was typically bilateral, peripheral, and basilar, with limited cases of pleural effusion [6,8,13].
To evaluate the ability of a deep learning AI-based system (qXR v2.1 c2, Qure.ai Technologies) to predict adverse outcomes in COVID-19 patients, we compared its performance with the RALE score, a radiographic score with an excellent inter-observer agreement that has been validated to assess the severity and predict outcomes in ARDS patients [18].
The two scores, assessed on the initial CXRs obtained at ED presentation, were found to be independent outcome predictors in multivariate regression models including age, sex, and comorbidities; patients with higher Qure AI (cutoff 30) and RALE scores (cutoff 12) were more likely to become critical and have a fatal outcome. Other independent predictors of an adverse outcome were older age, male sex, coronary artery disease, COPD, and neurodegenerative disease. A correlation between higher baseline CXR scores and lower PaO2/FiO2 was also observed. Hazard ratios of the two radiological scores were similar and mostly higher than those of the clinical risk factors. Our results confirm those of Toussie et al., who previously validated the use of initial CXR severity scores as independent outcome predictors, but on a larger population which included older patients and a longer follow-up (at least 25 days) with a significant number of negative outcome events [13]. Lung disease severity, assessed on the CXR at ED presentation, represents a valuable prognostic factor, which should be taken into consideration by medical teams in triage decisions.
In addition to this, we evaluated the use of an AI system to predict outcomes in patients with COVID-19. The comparable performance of the AI system with respect to a radiologist-assessed score in predicting adverse outcomes could represent a game-changer for resource-constrained settings as the COVID-19 pandemic keeps spreading, especially for those countries with a shortage of radiological expertise. The possibility of having the lung disease severity rapidly assessed by an AI system, together with patients' clinical data, may help medical teams identify patients at a higher risk of an adverse outcome directly at ED presentation and thus allocate the limited resources more efficiently.
The main limitation of this study is the retrospective design, which can lead to observer bias. Another limitation is that this single-center study was conducted in one of the hospitals on the frontline of the COVID-19 outbreak in the Lombardy region (Italy), which put the local health system under severe strain and forced restrictions on hospital access. Presentation to EDs was advised only for those with moderate-severe clinical conditions, while patients with mild symptoms were instructed to remain at home. This could explain the high sensitivity of CXR in our cohort (79.6% and 86.9% for the AI system and the radiologists' assessment, respectively). Furthermore, the decision to group patients in the combined outcome critical COVID-19, which included all patients admitted to ICU (n = 80) and those who died prior to being transferred to ICU (n = 99), may have led to an overlap between the outcomes. Nevertheless, the scores obtained on CXRs at admission were able to predict mortality of SARS-CoV-2-positive patients and can be used as repeatable, accurate, and defined tools to stratify patients and predict outcomes upon presentation. In addition, the results of this retrospective analysis have to be verified prospectively in a larger population which should include patients with mild disease. Lastly, this study considered only the initial CXR analysis; further studies with sequential CXR analysis are required to understand the disease progression in relation to therapeutic response.

Conclusions
Our study has shown that initial CXR severity assessed by a deep learning AI system may have prognostic value in COVID-19 patients, with a performance comparable to a radiologist-assessed score. A Qure AI score ≥ 30 and a RALE score ≥ 12 on the CXR at ED presentation were independent and comparable predictors of adverse outcomes. We suggest that lung disease severity at ED presentation, seen as opacification on the initial CXR, should be considered in the risk stratification of COVID-19 patients, especially in resource-constrained settings.
Funding The authors state that this work has not received any funding.

Compliance with ethical standards
Guarantor The scientific guarantor of this publication is Francesco De Cobelli M.D.

Conflict of interest
The authors of this manuscript declare no relationships with any companies whose products or services may be related to the subject matter of the article.
Statistics and biometry One of the authors has significant statistical expertise.
Informed consent Written informed consent was obtained from all subjects (patients) in this study.
Ethical approval Institutional Review Board approval was obtained.
Study subjects or cohorts overlap In the paper "Early predictors of clinical outcomes of COVID-19 outbreak in Milan, Italy" by Ciceri et al, published online in Clinical Immunology (https://doi.org/10.1016/j.clim.2020.108509), clinical and chest radiography findings of 410 patients of the COVID-19 institutional clinical-biological cohort (Covid-BioB, ClinicalTrials.gov NCT04318366) were analyzed in relation to outcomes.
In this paper, of which I am a co-author, radiological findings were only a minor part of all the predictors evaluated and were only briefly mentioned in the discussion. The present study expanded the sample, despite the stricter exclusion criteria, focusing on the radiological findings.

Methodology
• retrospective
• diagnostic or prognostic study
• performed at one institution
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.