This retrospective, single-center study was approved by our institutional review board, and the requirement for written informed consent was waived. Reporting was done in accordance with the Standards for Reporting Diagnostic Accuracy Initiative (STARD) recommendations.
We retrospectively reviewed medical records from our university hospital archives to search for consecutive patients who underwent chest CT and RT-PCR testing for suspected COVID-19 from March 9, 2020, to May 3, 2020. Chest CT and RT-PCR testing were performed for clinical suspicion of COVID-19 based on the presence of at least one of the following respiratory tract infection symptoms: (I) fever > 37.5 °C, (II) cough, and (III) clinically relevant dyspnea, with or without a history suggestive of exposure to SARS-CoV-2, including (a) close relationship with a confirmed positive individual, (b) travel or residential history in areas with high prevalence of disease, or (c) contact with individuals with fever or respiratory symptoms from those areas within 14 days prior to the CT scan.
Exclusion criteria were as follows: (I) lack of RT-PCR testing results, (II) time interval between CT scan and RT-PCR longer than 7 days, and (III) uninterpretable CT scans due to motion artifacts or incomplete scanning.
CT Technique and image analysis
All CT scans were performed using a single 64-slice CT scanner (LightSpeed VCT, GE Healthcare). All patients were scanned in the supine position during a single deep-inspiration breath-hold. No contrast medium was administered. Scanning parameters were as follows: tube voltage of 100 or 120 kV according to the patient’s body size, variable tube current with automatic mAs modulation (Smart mA, GE Healthcare), 0.6-mm section thickness, a pitch of 1.388, and iterative reconstruction (ASIR) at 40%.
All CT scans were retrieved from the Picture Archiving and Communication Systems, anonymized and uploaded onto a dedicated workstation (SuiteEstensa 2.0, EBIT - Esaote Group Company) for image analysis.
Each CT scan was independently analyzed by twelve readers, stratified into four groups according to their experience as follows: high-experience group (R1, R2, and R3 [D.B., U.D., and E.D.], board-certified radiologists with more than 10 years of experience in thoracic imaging and more than 100 COVID-19-positive CTs reported); intermediate-experience group (R4, R5, and R6 [S.M., M.G.C., and M.I.], board-certified radiologists with more than 50 and fewer than 100 COVID-19-positive CTs reported); low-experience group (R7, R8, and R9 [M.D.I., F.G., and E.O.], radiologists in training with fewer than 50 COVID-19-positive CTs reported); and group of radiographers (R10, R11, and R12 [S.P., V.C., and C.G.], all with a background of more than 50 CTs performed on COVID-19-positive patients). A training set of 30 CTs, in which findings corresponding to each CO-RADS category were equally distributed, was provided to each reader. Furthermore, all readers had a general familiarity with CO-RADS, having adopted it at our institution since its introduction, approximately a month before the start of our study.
All readers scored each CT scan by assigning a CO-RADS category reflecting their overall suspicion of COVID-19 lung involvement as follows: CO-RADS 1, very low probability; CO-RADS 2, low probability; CO-RADS 3, equivocal/unsure probability; CO-RADS 4, high probability; and CO-RADS 5, very high probability. For a detailed description of all the CT findings associated with each CO-RADS category, please refer to the original paper by Prokop et al. All readers were blinded to the RT-PCR results, to the clinical information and radiological reports of individual patients, and to the disease prevalence in the study sample.
RT-PCR testing performed on respiratory specimens obtained by nasopharyngeal and throat swabs served as the reference standard for the diagnosis of COVID-19. Clinical information and index test results were not available to the assessors of the reference standard. As per our institution's guidelines, patients with an initial negative RT-PCR but CT findings suggestive of COVID-19 underwent repeated RT-PCR testing up to a maximum of three times within 7 days after the CT scan. Patients who showed at least one positive RT-PCR were considered to be positive for COVID-19; otherwise, they were considered negative. Conversely, patients with an initial negative RT-PCR and negative CT findings underwent a 14-day follow-up and were considered to be negative if no worsening of symptoms or laboratory findings consistent with COVID-19 occurred.
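The labeling rule described above can be sketched as a small decision function. This is only an illustrative reading of the text, not the study's actual workflow: the function name and its parameters are hypothetical, and the cases the text does not address (no RT-PCR result, or a follow-up that was not uneventful) are returned as `None` rather than guessed.

```python
from typing import Optional, Sequence


def reference_standard_label(
    rt_pcr_results: Sequence[bool],
    ct_suggestive: bool,
    followup_uneventful: Optional[bool] = None,
) -> Optional[bool]:
    """Return True (COVID+), False (COVID-), or None (not covered by the text).

    rt_pcr_results: outcome of each RT-PCR test, in chronological order.
    ct_suggestive: whether CT findings were suggestive of COVID-19.
    followup_uneventful: True if the 14-day follow-up showed no worsening of
        symptoms and no laboratory findings consistent with COVID-19.
    """
    if not rt_pcr_results:
        return None  # no RT-PCR result available
    if any(rt_pcr_results):
        return True  # at least one positive RT-PCR
    if ct_suggestive:
        return False  # suggestive CT, but every repeated RT-PCR was negative
    if followup_uneventful:
        return False  # negative RT-PCR, negative CT, uneventful follow-up
    return None  # assumption: this case is not specified in the text
```

For example, `reference_standard_label([False, True], ct_suggestive=True)` yields `True`, whereas three negative tests with a suggestive CT yield `False`.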
Categorical variables were expressed as frequencies or percentages. Continuous variables were expressed as means ± standard deviations (SD). The χ² test was used to assess differences in sex, symptoms, and number of RT-PCR tests between COVID+ and COVID− participants. The Mann-Whitney U test was performed to assess differences in age between the two groups.
Fleiss’ kappa statistics were used to evaluate interreader agreement for CO-RADS rating both among all readers and within each group of readers. Kappa values were interpreted as follows: κ ≤ 0.20, slight agreement; κ = 0.21–0.40, fair agreement; κ = 0.41–0.60, moderate agreement; κ = 0.61–0.80, substantial agreement; and κ = 0.81–1.0, almost perfect agreement.
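As an illustration of the agreement analysis, the following minimal sketch computes Fleiss' kappa from a table of ratings (one row per CT scan, one column per reader) and maps the result onto the agreement bands listed above. The analyses in this study were run in MedCalc; the function names and the input layout here are assumptions made only for the example.

```python
from collections import Counter
from typing import Sequence


def fleiss_kappa(ratings: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table with one row per subject, one column per rater."""
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_subjects == 0 or n_raters < 2:
        raise ValueError("need at least one subject and two raters")

    category_totals: Counter = Counter()
    observed_sum = 0.0
    for row in ratings:
        if len(row) != n_raters:
            raise ValueError("every subject must be rated by the same number of raters")
        counts = Counter(row)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this subject.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        observed_sum += agreeing_pairs / (n_raters * (n_raters - 1))

    p_observed = observed_sum / n_subjects
    total_ratings = n_subjects * n_raters
    # Chance agreement from the overall share of each category.
    p_expected = sum((c / total_ratings) ** 2 for c in category_totals.values())
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1.0 - p_expected)


def agreement_label(kappa: float) -> str:
    """Map a kappa value onto the interpretation bands given in the text."""
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"
```

Calling `fleiss_kappa` on the ratings of all twelve readers gives the overall agreement; restricting the columns to the three readers of one group gives the within-group value.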
For each reader, the receiver operating characteristic (ROC) curve and the corresponding area under the curve (AUC) were calculated by using the method of DeLong et al. to assess the CO-RADS diagnostic performance. The mean AUCs across observers in each of the four reader groups and their corresponding 95% confidence intervals (95% CI) were computed, and a pairwise comparison of AUCs from all readers was performed by means of the method of DeLong et al.
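For readers unfamiliar with this approach, the sketch below shows how a single reader's empirical AUC and its DeLong standard error can be obtained from the CO-RADS scores of COVID+ and COVID− participants. It is a simplified illustration, not a reproduction of the MedCalc computation: the function names are hypothetical, the confidence interval uses a normal approximation (an assumption), and the pairwise comparison between readers, which also requires the covariance between their AUCs, is not shown.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def _psi(pos_score: float, neg_score: float) -> float:
    """Pair kernel: 1 if the positive case scores higher, 0.5 on a tie, else 0."""
    if pos_score > neg_score:
        return 1.0
    if pos_score == neg_score:
        return 0.5
    return 0.0


def delong_auc(
    pos_scores: Sequence[float], neg_scores: Sequence[float]
) -> Tuple[float, float]:
    """Return (AUC, standard error) for one reader.

    pos_scores: scores of reference-standard positive cases.
    neg_scores: scores of reference-standard negative cases.
    """
    m, n = len(pos_scores), len(neg_scores)
    if m < 2 or n < 2:
        raise ValueError("need at least two positive and two negative cases")
    # Structural components: one per positive case and one per negative case.
    v10 = [mean(_psi(x, y) for y in neg_scores) for x in pos_scores]
    v01 = [mean(_psi(x, y) for x in pos_scores) for y in neg_scores]
    auc = mean(v10)
    se = sqrt(variance(v10) / m + variance(v01) / n)
    return auc, se


def auc_confidence_interval(
    auc: float, se: float, z: float = 1.96
) -> Tuple[float, float]:
    """Normal-approximation CI (an assumption), clipped to the range [0, 1]."""
    return max(0.0, auc - z * se), min(1.0, auc + z * se)
```

Because CO-RADS takes only five values, ties between positive and negative cases are frequent, which is why tied pairs contribute 0.5 to the estimate.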
For each reader, the highest Youden index (J = sensitivity + specificity − 1) was calculated to select the optimal threshold to discriminate between COVID+ and COVID− participants, and the corresponding sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were computed. Inconclusive results (i.e., CO-RADS 3) were included in the analysis of diagnostic performance and were treated as positive or negative according to the optimal threshold identified by the ROC curve and Youden index analysis.
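The threshold selection can be illustrated with a short sketch. It assumes that a scan is called positive when its CO-RADS category is at or above the candidate threshold, and that ties in J are resolved in favor of the lowest threshold; both conventions, like the function name, are assumptions made for the example rather than details stated in the text.

```python
from typing import Dict, Optional, Sequence


def youden_optimal_threshold(
    scores: Sequence[int],
    labels: Sequence[bool],
    thresholds: Sequence[int] = (2, 3, 4, 5),
) -> Dict[str, Optional[float]]:
    """Pick the 'CO-RADS >= t' cutoff that maximizes J = sens + spec - 1."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not any(labels) or all(labels):
        raise ValueError("need both positive and negative cases")

    best: Optional[Dict[str, Optional[float]]] = None
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= t)
        fn = sum(1 for s, y in zip(scores, labels) if y and s < t)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if not y and s < t)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        j = sensitivity + specificity - 1
        # Strict comparison keeps the lowest threshold when J is tied.
        if best is None or j > best["youden_j"]:
            best = {
                "threshold": t,
                "youden_j": j,
                "sensitivity": sensitivity,
                "specificity": specificity,
                # PPV/NPV are undefined when no scan falls on that side.
                "ppv": tp / (tp + fp) if tp + fp else None,
                "npv": tn / (tn + fn) if tn + fn else None,
            }
    return best
```

If the selected threshold is 3, CO-RADS 3 readings count as positive; if it is 4, they count as negative, which is how inconclusive results enter the performance estimates.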
In addition, the number and the percentage of readings assigned to each CO-RADS category were determined for both COVID+ and COVID− participants. False-positive CO-RADS 4 and 5 patients and false-negative CO-RADS 1 and 2 patients were subsequently investigated to clarify the reason for erroneous classification.
In all cases, p < 0.05 was considered the threshold for assessing statistical significance. All statistical analyses were performed with commercially available software (MedCalc Statistical Software version 19.2.5, MedCalc Software Ltd).