1 Introduction

Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The disease was first identified in December 2019 in Wuhan, the capital of China’s Hubei province, and has since spread worldwide (Andersen et al. 2020). On January 30, 2020 the World Health Organization (WHO) declared a global health emergency [1]. Common symptoms of COVID-19 include fever, cough, and shortness of breath (Guan et al. 2020; Xu et al. 2020). While the majority of cases result in mild symptoms, some progress to viral pneumonia. By 7 August 2020, over 19 million officially confirmed cases were reported in practically every corner of the Earth with 717,687 officially reported deaths documented (Dong et al. 2020).

As the first countries explore deconfinement strategies (Cousins 2020; Salathé et al. 2020) after a long period of quarantine, the death toll of COVID-19 keeps rising, especially in the US, the UK, and Brazil (Dong et al. 2020). New technologies and strategies have emerged to support healthcare systems during the pandemic (Hu et al. 2020; Ting et al. 2020). As early as March 2020, Chinese hospitals used artificial intelligence (AI)-assisted computed tomography (CT) imaging analysis to screen COVID-19 cases and streamline diagnosis (Jin et al. 2020).

In this work, we build a large multiclass dataset of CT scans for SARS-CoV-2 infection identification. The dataset builds on the recently introduced dataset of Soares et al. (2020). The proposed dataset contains 4173 CT scans of 210 different patients, divided into 3 classes (healthy, COVID-19, and other pulmonary diseases). These data were collected from real patients in hospitals in Sao Paulo, Brazil. The aim of this dataset is to encourage the research and development of AI methods that can identify whether a person is infected by SARS-CoV-2 through the analysis of his/her CT scans.

An open-source dataset for COVID-19 identification through CT scans has been proposed by Yang et al. (2020). However, the data for that dataset were acquired from scientific journals and may not provide the quality needed to train an algorithm for such a complex application. Moreover, other authors, such as Santa Cruz et al. (2021), Mangal et al. (2020), Pham (2021), and Signoroni et al. (2021), provided open-source datasets and solutions based on X-ray scans, which are not as detailed as CT scans.

As a baseline result for the new CT scan dataset, we consider the eXplainable Deep Learning approach (xDNN) (Angelov and Soares 2020). We chose xDNN as the baseline because the explanation of AI systems is essential to medical applications. xDNN is a prototype-based approach that allows users to audit the decisions of the network through its similarity mechanism. xDNN obtained an F1 score of \(97.31\%\), which is higher than traditional deep learning approaches such as ResNet.

2 Methods

The proposed dataset is composed of 4173 CT scans of 210 different patients, divided into: 80 patients infected by SARS-CoV-2; 80 patients with other pulmonary diseases such as non-COVID pneumonia, bronchitis, and lung cancer; and 50 patients with healthy lung conditions. The data were collected from March 15 to June 15, 2020 in the Public Hospital of the Government Employees of Sao Paulo and the Metropolitan Hospital of Lapa, Sao Paulo, Brazil. The following demographic data were collected during the clinical analysis of each patient:

  • Sex

  • Age

  • Number of days since the 1st symptoms

  • Comorbidities

  • Hypertension

  • Diabetes

  • Chronic obstructive pulmonary disease (COPD)

  • Obesity

  • Pulmonary involvement \(> 50\%\)

  • Outcome

Table 1 details the patients considered in this study.

Table 1 Number of patients considered to compose the dataset. We considered data from 80 patients infected by SARS-CoV-2, of whom 41 were male and 39 were female. We also considered data from 80 patients presenting other pulmonary diseases such as lung cancer, bronchitis, etc. The dataset also includes CT scans of 50 patients that do not present any pulmonary disease

The inclusion criteria for this study are listed as follows:

  • Patients with a positive new coronavirus nucleic acid antibody and confirmed by the CDC;

  • Patients who underwent thin-section CT;

  • Age \(\ge 18\);

  • Presence of lung infection in CT images.

The median duration from the onset of the illness to CT scan was 5 days, ranging from 1 to 14 days. The CT protocol was as follows: 120 kV; automatic tube current (180–400 mA); iterative reconstruction; 64 mm detector; rotation time, 0.35 s; slice thickness, 5 mm; collimation, 0.625 mm; pitch, 1.5; matrix, \(512\times 512\); and breath hold at full inspiration. The reconstruction kernel used is set as “lung smooth with a thickness of 1 mm and an interval of 0.8 mm”. During reading, the lung window (with window width 1200 HU and window level −600 HU) was used. Figure 2 illustrates some examples of CT scans found in the dataset.
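The lung window quoted above (width 1200 HU, level −600 HU) corresponds to a displayed range of −1200 HU to 0 HU. The following is a minimal sketch of how such a window maps Hounsfield units to 8-bit grey levels; the function names and the sample HU values are illustrative assumptions, and the released PNG files may have been produced differently.

```python
from typing import List, Sequence, Tuple


def window_bounds(level_hu: float, width_hu: float) -> Tuple[float, float]:
    """Return the (lower, upper) HU limits of a display window."""
    half = width_hu / 2.0
    return level_hu - half, level_hu + half


def apply_window(hu_values: Sequence[float],
                 level_hu: float = -600.0,
                 width_hu: float = 1200.0) -> List[int]:
    """Clip HU values to the window and rescale them to 0..255 grey levels."""
    lower, upper = window_bounds(level_hu, width_hu)
    grey = []
    for hu in hu_values:
        clipped = min(max(hu, lower), upper)  # saturate outside the window
        grey.append(round((clipped - lower) / (upper - lower) * 255))
    return grey


# Illustrative HU values only (not taken from the dataset).
print(window_bounds(-600.0, 1200.0))
print(apply_window([-1500, -700, 0, 400]))
```

Values below the lower limit saturate to black and values above the upper limit to white, which is why soft tissue and bone appear uniformly bright in a lung window.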

3 Data records

The database can be downloaded from Synapse (https://www.synapse.org/#!Synapse:syn22174850). It is provided in two formats, PNG and CSV: the PNG files contain the CT scans and the CSV files contain the demographic data. Figure 1 illustrates the data distribution for the patients infected by SARS-CoV-2 considered in this study.

Fig. 1

The study considered data of 80 different patients (41 male and 39 female). The data revealed that the majority of the patients are 50–59 years old

The data types of the demographic data variables considered in this study are depicted below:

  • Sex (Boolean)

  • Age (Integer)

  • Number of days since the 1st symptoms (Integer)

  • Comorbidities (Boolean)

  • Hypertension (Boolean)

  • Diabetes (Boolean)

  • Chronic obstructive pulmonary disease (COPD) (Boolean)

  • Obesity (Boolean)

  • Pulmonary involvement \(> 50\%\) (Boolean)

  • Outcome (Boolean)
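The variable types listed above can be sketched as a typed record. The column names, the Boolean encoding (for example which sex or outcome maps to true), and the sample row below are assumptions made for illustration; the actual CSV headers in the Synapse release may differ.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class PatientRecord:
    sex: bool                            # encoding is an assumption
    age: int
    days_since_first_symptoms: int
    comorbidities: bool
    hypertension: bool
    diabetes: bool
    copd: bool
    obesity: bool
    pulmonary_involvement_over_50: bool
    outcome: bool                        # meaning of True is an assumption


INT_FIELDS = ["age", "days_since_first_symptoms"]
BOOL_FIELDS = ["sex", "comorbidities", "hypertension", "diabetes", "copd",
               "obesity", "pulmonary_involvement_over_50", "outcome"]


def parse_row(row: dict) -> PatientRecord:
    """Convert one CSV row (strings) into a typed PatientRecord."""
    values = {name: int(row[name]) for name in INT_FIELDS}
    for name in BOOL_FIELDS:
        values[name] = row[name].strip().lower() in ("1", "true", "yes")
    return PatientRecord(**values)


# Synthetic row with hypothetical headers, not real patient data.
SAMPLE_CSV = (
    "sex,age,days_since_first_symptoms,comorbidities,hypertension,"
    "diabetes,copd,obesity,pulmonary_involvement_over_50,outcome\n"
    "1,54,5,1,1,0,0,0,0,1\n"
)

records = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(records[0])
```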

Figure 2 illustrates different examples of data available in the proposed dataset.

Fig. 2

a A 27-year-old male patient presented with fever and headache for 2 days. CT scans do not show the presence of any pulmonary disease. The RT-PCR test was negative for SARS-CoV-2 infection. b A 63-year-old female patient presented with shortness of breath and cough for 4 days. CT scan shows the presence of subpulmonic pleural effusion. The RT-PCR test was negative for SARS-CoV-2. c A 31-year-old female patient presented with fever, dry cough, and shortness of breath for 4 days. CT scan revealed multifocal bilateral consolidation with ground-glass opacities with typical distribution. The RT-PCR test was positive for SARS-CoV-2

4 Technical validation

To validate our data, in this section we report the results obtained by different classification approaches. The following metrics have been used to evaluate the classification of the CT scans into the different classes:

Accuracy:

$$\begin{aligned} Accuracy\, (\%) = \frac{TP+TN}{TP+FP+TN+FN} \times 100, \end{aligned}$$
(1)

Precision:

$$\begin{aligned} Precision\, (\%) = \frac{TP}{TP+FP} \times 100, \end{aligned}$$
(2)

Recall:

$$\begin{aligned} Recall\, (\%) = \frac{TP}{TP+FN} \times 100, \end{aligned}$$
(3)

F1 Score:

$$\begin{aligned} F1~ Score\, (\%) = 2 \times \frac{Precision\times Recall}{Precision+Recall} \times 100, \end{aligned}$$
(4)

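where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively. For Eq. (4) to yield a percentage, Precision and Recall enter it as fractions in [0, 1], i.e., before the multiplication by 100 in Eqs. (2) and (3).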
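As a worked illustration of Eqs. (1)–(4), the sketch below computes the four metrics from confusion-matrix counts. The function name and the counts are made-up values for illustration, not results from this study.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return accuracy, precision, recall and F1 score, all in percent.

    Assumes the denominators are non-zero (no guard against empty classes).
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn) * 100
    precision = tp / (tp + fp) * 100
    recall = tp / (tp + fn) * 100
    # precision and recall are already percentages here, so their harmonic
    # mean is a percentage too and needs no further factor of 100.
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy counts chosen for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}%")
```

The F1 score computed this way equals \(2TP/(2TP+FP+FN) \times 100\), which offers a quick consistency check.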

The area under the receiver operating characteristic curve (AUC) is defined through the TP rate and the FP rate.
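In the usual construction, the ROC curve plots the TP rate against the FP rate as the decision threshold is swept, and the AUC is the area under that curve. The following is a minimal sketch using the trapezoidal rule; the labels and scores are toy values, and the paper does not state how its AUC was computed.

```python
def roc_points(labels, scores):
    """Return (FP rate, TP rate) pairs from sweeping a threshold over scores.

    labels are 1 (positive) or 0 (negative); both classes must be present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Integrate the ROC curve with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example: six scans with classifier scores for the positive class.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(labels, scores)))
```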

In this section, we report the results obtained by the xDNN classification approach (Angelov and Soares 2020; Soares et al. 2019) when applied to the proposed SARS-CoV-2 CT scan data set. We divided the dataset into 80% for training purposes and 20% for validation purposes. The division has been made in terms of patients; therefore, we separated data of 168 patients for training and data for 42 patients for validation. Results presented in Table 2 compare the performance of the xDNN algorithm with other state-of-the-art approaches, including ResNet, GoogleNet, VGG-16, AlexNet, Decision Tree, and AdaBoost.
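A patient-level split such as the one described above keeps all slices of a patient on the same side of the split, so no patient contributes to both training and validation. The sketch below is one way to do this; the random shuffling, the seed, and the patient identifiers are assumptions, since the text does not specify how patients were assigned.

```python
import random


def split_by_patient(patient_ids, train_fraction=0.8, seed=0):
    """Split unique patient IDs into disjoint training and validation sets."""
    unique_ids = sorted(set(patient_ids))
    rng = random.Random(seed)          # fixed seed for reproducibility
    rng.shuffle(unique_ids)
    n_train = round(len(unique_ids) * train_fraction)
    return set(unique_ids[:n_train]), set(unique_ids[n_train:])


# Hypothetical identifiers for the 210 patients.
patient_ids = [f"patient_{i:03d}" for i in range(210)]
train_ids, val_ids = split_by_patient(patient_ids)
print(len(train_ids), len(val_ids))
```

Each CT slice is then assigned to training or validation according to the set its patient belongs to.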

Table 2 Results considering different methods for the COVID-19 identification

The xDNN (Angelov and Soares 2020, 2020) classifier provided highly interpretable results (Angelov et al. 2021) that may be helpful for specialists (medical doctors). Rules generated from the identified prototypes are illustrated in Figs. 3 and 4. xDNN identified data of 18 patients with COVID-19 and data of 11 non-infected patients as prototypes. The training time for the xDNN algorithm (Angelov and Soares 2020) was only 11.82 s for all images (an average of 5 milliseconds per image). In contrast, traditional deep learning approaches may take hours for the same task, usually require hardware accelerators such as GPUs, and, once trained, are not flexible to new data. We stress that xDNN does not require full re-training if new data are presented (Angelov and Soares 2021); it keeps all prototypes identified so far and may add new ones if the data pattern requires it (Soares et al. 2019, 2020).

Balanced one-way ANalysis Of VAriance (ANOVA) (McHugh 2011) was used to compare the results provided by the classification methods. The null hypothesis is that the mean results provided by the methods are the same. A p-value below the 0.05 cutoff suggests that the accuracy of at least one of the algorithms is significantly different from the others. A value of \(p = 4.38 \times 10^{-22}\) was obtained; therefore, the null hypothesis was rejected and the mean accuracies of the algorithms are not all the same.
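For readers unfamiliar with the test, the following sketch computes the one-way ANOVA F statistic from per-method accuracy samples. The accuracy values are invented for illustration and are not the results behind the reported p-value; a library routine such as SciPy's `scipy.stats.f_oneway` would additionally return the p-value.

```python
def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up accuracies (%) over three runs for three hypothetical methods.
toy_accuracies = [
    [97.1, 97.4, 97.3],
    [94.8, 95.2, 95.0],
    [91.5, 92.1, 91.8],
]
print(one_way_anova_f(toy_accuracies))
```

A large F statistic means the spread between method means is large relative to the spread within each method, which is what leads to a small p-value.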

The Tukey Honestly Significant Difference (HSD) test (McHugh 2011) was performed to compare pairs of classifiers in terms of accuracy. Table 3 shows the results of the Tukey HSD test, with 95% confidence intervals for the true difference of the means.

Table 3 Tukey Test Results

If \(p_{adj} < 0.05\), the null hypothesis is rejected and the difference between the two methods is statistically significant. As shown in Table 3, the results of the proposed xDNN are statistically different from those of 4 traditional approaches, including well-known deep learning approaches such as GoogleNet, VGG-16, and AlexNet.

Through the xDNN method we generated (extracted from the data) linguistic IF...THEN rules which involve actual images of both cases (COVID-19 and NO COVID-19) as illustrated in Figs. 3 and 4. Such transparent rules can be used in a clear decision-making process for early diagnostics for COVID-19 infection. Rapid detection with high sensitivity of viral infection may allow better control of the viral spread. Early diagnosis of COVID-19 is crucial for disease treatment and control.
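The idea of a prototype-based IF...THEN rule can be illustrated with a simplified nearest-prototype classifier. This is not the xDNN algorithm: the Euclidean distance, the two-dimensional feature vectors, and the prototype values below are all assumptions chosen to keep the sketch short.

```python
import math


def nearest_prototype(features, prototypes):
    """Return (label, prototype index, distance) of the closest prototype.

    prototypes maps a class label to a list of feature vectors.
    """
    best = None
    for label, vectors in prototypes.items():
        for index, proto in enumerate(vectors):
            distance = math.dist(features, proto)  # Euclidean distance
            if best is None or distance < best[2]:
                best = (label, index, distance)
    return best


def explain(features, prototypes):
    """Phrase the decision as a human-readable IF...THEN rule."""
    label, index, distance = nearest_prototype(features, prototypes)
    return (f"IF (scan ~ {label} prototype #{index}) THEN '{label}' "
            f"(distance {distance:.2f})")


# Toy feature vectors standing in for image features (hypothetical values).
prototypes = {
    "COVID-19": [[0.9, 0.8], [0.7, 0.9]],
    "NO COVID-19": [[0.1, 0.2], [0.2, 0.1]],
}
print(explain([0.85, 0.80], prototypes))
```

Because the decision points to a specific stored example, a specialist can inspect the matching prototype image alongside the new scan.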

Fig. 3

Final rule given by the xDNN classifier for COVID-19 identification. Unlike typical deep neural networks, xDNN provides highly interpretable rules which can be visualised and used by human experts for the early evaluation of patients suspected of COVID-19 infection. The classification is based on the similarity of the unlabeled CT scan slice to the identified prototypes

Fig. 4

Non-SARS-CoV-2 final rule given by the proposed eXplainable Deep Learning classifier

5 Conclusion

In the context of a pandemic and the urgency to contain the crisis, research has increased exponentially in order to alleviate the burden on healthcare systems (Cohen et al. 2020). However, many prediction models for the diagnosis and prognosis of COVID-19 infection are at high risk of bias and overfitting and are poorly reported, so their claimed performance is likely optimistic. To prevent premature implementation in hospitals, tools must be robustly evaluated through practical tests. Indeed, while some AI-assisted tools might be powerful, they do not replace clinical judgment, and their diagnostic performance cannot be assessed or claimed without a proper clinical trial.

Moreover, the lack of a public database has made it difficult to conduct large-scale robust evaluations. The resulting small number of samples prevents proper cohort selection, which is a limitation of this study and exposes our evaluation to sample bias. In this study, we present a database composed of 4173 CT scans of 210 different patients, out of which 2168 correspond to 80 patients infected with SARS-CoV-2 and confirmed by RT-PCR. These data have been collected at the Public Hospital of the Government Employees of Sao Paulo and the Metropolitan Hospital of Lapa, Sao Paulo, Brazil. Sao Paulo is now one of the global epicenters of the COVID-19 disease.

As a baseline result for the proposed dataset, we used an explainable deep learning approach. The xDNN classifier achieved an F1 score of \(97.31 \%\) for the proposed task. Moreover, the xDNN approach provided insights into the decision-making process, which helps support specialists in the diagnosis of the disease. This is of great importance for medical specialists to understand and diagnose COVID-19 at its early stages via computed tomography. The proposed dataset is available at https://www.synapse.org/#!Synapse:syn22174850 and the xDNN (Angelov and Soares 2020) code is available at https://github.com/Plamen-Eduardo/xDNN-SARS-CoV-2-CT-Scan.