A large multiclass dataset of CT scans for COVID-19 identification

Soares, Eduardo; Angelov, Plamen; Biaso, Sarah; Cury, Marcelo; Abe, Daniel

doi:10.1007/s12530-023-09511-2

A large multiclass dataset of CT scans for COVID-19 identification

Original Paper
Open access
Published: 27 June 2023

Volume 15, pages 635–640, (2024)
Cite this article

Download PDF

You have full access to this open access article

Evolving Systems Aims and scope Submit manuscript

A large multiclass dataset of CT scans for COVID-19 identification

Download PDF

Eduardo Soares ORCID: orcid.org/0000-0002-2634-8270¹,
Plamen Angelov¹,
Sarah Biaso²,
Marcelo Cury³ &
…
Daniel Abe²

2196 Accesses
7 Citations
2 Altmetric
Explore all metrics

Abstract

The infection by SARS-CoV-2 which causes the COVID-19 disease has spread widely over the whole world since the beginning of 2020. Following the epidemic which started in Wuhan, China on January 30, 2020 the World Health Organization (WHO) declared a global health emergency and a pandemic. In this paper, we describe a publicly available multiclass CT scan dataset for SARS-CoV-2 infection identification. Which currently contains 4173 CT-scans of 210 different patients, out of which 2168 correspond to 80 patients infected with SARS-CoV-2 and confirmed by RT-PCR. These data have been collected in the Public Hospital of the Government Employees of Sao Paulo and the Metropolitan Hospital of Lapa, both in Sao Paulo – Brazil. The aim of this data set is to encourage the research and development of artificial intelligent methods that are able to identify SARS-CoV-2 or other diseases through the analysis of CT scans. As a baseline result for this data set, we used the recently introduced eXplainable Deep Learning approach (xDNN), which is a transparent deep learning approach that allows users to inspect the decisions of the network.

Early Detection of COVID-19 from CT Scans Using Deep Learning Techniques

Deep Learning-Based COVID-19 Diagnosis and Trend Predictions

COVID-19 Diagnosis with Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The disease was first identified in December 2019 in Wuhan, the capital of China’s Hubei province, and has since spread worldwide (Andersen et al. 2020). On January 30, 2020 the World Health Organization (WHO) declared a global health emergency [1]. Common symptoms of COVID-19 include fever, cough, and shortness of breath (Guan et al. 2020; Xu et al. 2020). While the majority of cases result in mild symptoms, some progress to viral pneumonia. By 7 August 2020, over 19 million officially confirmed cases were reported in practically every corner of the Earth with 717,687 officially reported deaths documented (Dong et al. 2020).

As the first countries explore deconfinement strategies (Cousins 2020; Salathé et al. 2020) after a long period of quarantine, the death toll of COVID-19 keeps rising, specially in US, UK, and Brazil (Dong et al. 2020). New technologies and strategies have emerged in order to support healthcare systems during this pandemic time (Hu et al. 2020; Ting et al. 2020). As early as March 2020, Chinese hospitals used artificial intelligence (AI)-assisted computed tomography (CT) imaging analysis to screen COVID-19 cases and streamline diagnosis (Jin et al. 2020).

In this work, we build a large multiclass dataset of CT scans for SARS-CoV-2 infection identification. The dataset is built upon on the recently introduced dataset (Soares et al. 2020). The proposed dataset contains 4173 CT-scans of 210 different patients which are divided into 3 different classes (healthy, COVID-19, and other pulmonary diseases). These data have been collected from real patients in hospitals from Sao Paulo, Brazil. The aim of this dataset is to encourage the research and development of artificial intelligence (AI) methods that are able to identify if a person is infected by SARS-CoV-2 through the analysis of his/her CT scans.

An open-source dataset for COVID-19 identification through CT scans has been proposed by Yang et al. (2020), however, the data collected for this dataset has been acquired from scientific journals and may not provide the necessary quality to train an algorithm for complex applications as such. Moreover, other authors as Santa Cruz et al. (2021); Mangal et al. (2020); Pham (2021); Signoroni et al. (2021) provided open-source datasets and solutions based on X-ray scans which are not detailed as CT scans.

As a baseline result for the new dataset based on CT scans, we consider the eXplainable Deep Learning approach (xDNN) (Angelov and Soares 2020). As the explanation of AI systems is essential to medical applications, we used the xDNN approach as baseline for this application. XDNN is a prototype-based approach that allows users to audit the decisions of the network through its similarity mechanism. XDNN obtained an F1 score of $97.31\%$, which is higher than traditional deep learning approaches such as ResNet.

2 Methods

The proposed dataset is composed of 4173 CT-scans of 210 different patients which are divided into: 80 patients infected by SARS-CoV-2; 80 patients with other pulmonary diseases as non-COVID pneumonia, bronchitis, and lung cancer; and 50 patients with healthy lung conditions. The data was collected from March 15 to June 15 2020 in the Public Hospital of the Government Employees of Sao Paulo, and the Metropolitan Hospital of Lapa, Sao Paulo – Brazil. The following demographic data have been collected during the clinical analysis of each patient:

Sex
Age
Number of days since the 1st symptoms
Comorbidities
Hypertension
Diabetes
Chronic obstructive pulmonary disease (COPD)
Obesity
Pulmonary involvement $> 50\%$
Outcome

Table 1 details the patient’s considered in this study.

Table 1 This table demonstrates the number of patients considered to compose the dataset. In this case, we considered data from 80 patients infected by SARS-CoV-2, out of which 41 were male and 39 were female. We also considered data from 80 patients presenting other pulmonary diseases such as lung cancer, bronchitis, etc. The dataset is also composed of CT scans that do not present any pulmonary disease, These data refer to data of 50 patients

Full size table

The inclusion criteria for this study are listed as follows:

Patients with a positive new coronavirus nucleic acid antibody and confirmed by the CDC;
Patients who underwent thin-section CT;
Age$>=18$;
Presence of lung infection in CT images.

The median duration from the onset of the illness to CT scan was 5 days, ranging from 1 to 14 days. The CT protocol was as follows: 120 kV; automatic tube current (180 mA-400 mA); iterative reconstruction; 64 mm detector; rotation time, 0.35 sec; slice thickness, 5 mm; collimation, 0.625 mm; pitch, 1.5; matrix, $512\times 512$; and breath hold at full inspiration. The reconstruction kernel used is set as “lung smooth with a thickness of 1 mm and an interval of 0.8 mm”. During reading, the lung window (with window wiDecision Treeh 1200 HU and window level-600 HU) was used. Figure 2 illustrates some examples of CT scans found in the dataset.

3 Data records

The database can be downloaded from Synapse (https://www.synapse.org/#!Synapse:syn22174850), and it has been presented in two formats: PNG and CSV, where PNG represents the CT scans files and CVS are the demographic data. Figure 1 illustrates the data distribution for the patients infected by SARS-CoV-2 and considered in this study.

The data types of the demographic data variables considered in this study are depicted below:

Sex (Boolean)
Age (Integer)
Number of days since the 1st symptoms (Integer)
Comorbidities (Boolean)
Hypertension (Boolean)
Diabetes (Boolean)
Chronic obstructive pulmonary disease (COPD) (Boolean)
Obesity (Boolean)
Pulmonary involvement $> 50\%$ (Boolean)
Outcome (Boolean)

Figure 2 illustrates different examples of data available in the proposed dataset.

4 Technical validation

In order to validate our data in this section we report the results by different classification approaches. The following metrics have been used to evaluate the classification of the CT scans in different classes:

$$\begin{aligned} Accuracy\, (\%) = \frac{TP+TN}{TP+FP+TN+FN} \times 100, \end{aligned}$$

(1)

Precision:

$$\begin{aligned} Precision\, (\%) = \frac{TP}{TP+FP} \times 100, \end{aligned}$$

(2)

Recall:

$$\begin{aligned} Recall\, (\%) = \frac{TP}{TP+FN} \times 100, \end{aligned}$$

(3)

F1 Score:

$$\begin{aligned} F1~ Score\, (\%) = 2 \times \frac{Precision\times Recall}{Precision+Recall} \times 100, \end{aligned}$$

(4)

where TP, FP, TN, FN denote true and false, negative and positive respectively.

The area under the curve, AUC, is defined through the TP rate and FN rate.

In this section, we report the results obtained by the xDNN classification approach (Angelov and Soares 2020; Soares et al. 2019) when applied to the proposed SARS-CoV-2 CT scan data set. We divided the dataset into 80% for training purposes and 20% for validation purposes. The division has been made in terms of patients; therefore, we separated data of 168 patients for training and data for 42 patients for validation. Results presented in Table 2 compare the performance of the xDNN algorithm with other state-of-the-art approaches, including ResNet, GoogleNet, VGG-16, AlexNet, Decision Tree, and AdaBoost.

Table 2 Results considering different methods for the COVID-19 identification

Full size table

The xDNN (Angelov and Soares 2020, 2020) classifier provided highly interpretable results (Angelov et al. 2021) that may be helpful for specialists (medical doctors). Rules generated by the identified prototypes are illustrated by Figs. 3 and 4, respectively. xDNN identified data of 18 patients with COVID-19 as prototypes and data of 11 patients non-infected as prototypes. The training time for the xDNN algorithm (Angelov and Soares 2020) was only 11.82 s for all images (an average of 5 milliseconds per image. On the other hand, the traditional deep learning approach may take hours for the same task and usually requires hardware accelerators such as GPUs and once trained is not flexible to new data. We have to stress that xDNN does not require full re-training if new data is presented (Angelov and Soares 2021)—it keeps all prototypes identified so far and may add new ones if the data pattern requires that Soares et al. (2020, 2019).

Balanced one-way ANalysis Of VAriance (ANOVA) (McHugh 2011) was used to compare the results provided by the classification methods. The null hypothesis is that the mean results provided by the methods are the same. A cutoff value p less than 0.05 suggests that the accuracy of at least one of the algorithms is significantly different from the others. A $p = 4.38e-22$ was obtained and, therefore, the mean accuracy of the algorithms is not all the same; the null hypothesis was rejected.

The Tukey Honestly Significant Difference (HSD) test (McHugh 2011) was performed to compare pairs of classifiers over accuracy. Table 3 shows the results of the Tuckey HSD test for a 95% confidence interval for the true difference of the means.

Table 3 Tukey Test Results

Full size table

If the $p-adj < 0.05$ then the null hypothesis is rejected and the difference between the methods is statistically significant. As shown in Table 3 the proposed xDNN has results statistically different from 4 traditional approaches, including well-known deep learning approaches as GoogleNet, VGG-16, and AlexNet.

Through the xDNN method we generated (extracted from the data) linguistic IF...THEN rules which involve actual images of both cases (COVID-19 and NO COVID-19) as illustrated in Figs. 3 and 4. Such transparent rules can be used in a clear decision-making process for early diagnostics for COVID-19 infection. Rapid detection with high sensitivity of viral infection may allow better control of the viral spread. Early diagnosis of COVID-19 is crucial for disease treatment and control.

5 Conclusion

In the context of a pandemic and the urgency to contain the crisis, research has increased exponentially in order to alleviate the healthcare systems burden (Cohen et al. 2020). However, many prediction models for diagnosis and prognosis of COVID-19 infection are at high risk of bias and model overfitting as well as poorly reported, their alleged performance being likely optimistic. In order to prevent premature implementation in hospitals, tools must be robustly evaluated along several practical tests. Indeed, while some AI-assisted tools might be powerful, they do not replace clinical judgment and their diagnostic performance cannot be assessed or claimed without a proper clinical trial.

Moreover, The lack of a public database made it difficult to conduct large-scale robust evaluations. This small number of samples prevents proper cohort selection which is a limitation of this study and exposes our evaluation to sample bias. In this study, we present a database which is composed of 4173 CT-scans of 210 different patients, out of which 2168 correspond to 80 patients infected with SARS-CoV-2 and confirmed by RT-PCR. These data have been collected at the Public Hospital of the Government Employees of Sao Paulo and the Metropolitan Hospital of Lapa, Sao Paulo, Brazil. Sao Paulo is now one of the global epicenters of the COVID-19 disease.

As a baseline result for the proposed dataset, we used an explainable deep learning approach. The xDNN classifier presented an F1 score of $97.31 \%$ for the proposed task. Moreover, the xDNN approach provided insights into the decision-making process which is helpful to support specialists in the diagnosis of the disease. This is of great importance for medical specialists to understand and diagnose COVID-19 at its early stages via computer tomography. The proposed dataset is available https://www.synapse.org/#!Synapse:syn22174850 and xDNN (Angelov and Soares 2020) code is available at https://github.com/Plamen-Eduardo/xDNN-SARS-CoV-2-CT-Scan.

Data availability

The data that support the findings of this study are openly available in Synapse at https://www.synapse.org/#!Synapse:syn22174850 and a small version of it in Kaggle at https://www.kaggle.com/datasets/plameneduardo/sarscov2-ctscan-dataset.

Code availability

We provided the code used in this research at https://github.com/Plamen-Eduardo/xDNN-SARS-CoV-2-CT-Scan. Other codes are available upon request to the corresponding author.

References

Andersen KG, Rambaut A, Lipkin WI, Holmes EC, Garry RF (2020) The proximal origin of sars-cov-2. Nat Med 26(4):450–452
Article Google Scholar
Angelov P, Soares E (2020) Towards explainable deep neural networks (xDNN). Neural Netw 130:185–194
Article Google Scholar
Angelov P, Soares E (2021) Detecting and learning from unknown by extremely weak supervision: exploratory classifier (xclass). Neural Comput Appl 33(22):15145–15157
Article Google Scholar
Angelov PP, Soares EA, Jiang R, Arnold NI, Atkinson PM (2021) Explainable artificial intelligence: an analytical review. WIREs Data Min Knowl Discovery 11(5):e1424. https://doi.org/10.1002/widm.1424
Article Google Scholar
Angelov P, Soares E (2020) Towards deep machine reasoning: a prototype-based deep neural network with decision tree inference. In: 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 2092–2099. https://doi.org/10.1109/SMC42975.2020.9282812
Cohen JP, Dao L, Roth K, Morrison P, Bengio Y, Abbasi AF, Shen B, Mahsa HK, Ghassemi M, Li H, et al (2020) Predicting covid-19 pneumonia severity on chest x-ray with deep learning. Cureus 12(7)
Cousins S (2020) New Zealand eliminates covid-19. The Lancet 395(10235):1474
Article Google Scholar
Dong E, Du H, Gardner L (2020) An interactive web-based dashboard to track covid-19 in real time. The Lancet infectious diseases
Guan Wj, Ni Zy, Hu Y, Liang Wh, Ou Cq, He Jx, Liu L, Shan H, Lei Cl, Hui DS et al (2020) Clinical characteristics of coronavirus disease 2019 in china. N Engl J Med 382(18):1708–1720
Hu Z, Ge Q, Li S, Jin L, Xiong M (2020) Artificial intelligence forecasting of Covid-19 in China. arXiv preprint arXiv:2002.07112(01-20)
Jin YH, Cai L, Cheng ZS, Cheng H, Deng T, Fan YP, Fang C, Huang D, Huang LQ, Huang Q et al (2020) A rapid advice guideline for the diagnosis and treatment of 2019 novel coronavirus (2019-ncov) infected pneumonia (standard version). Mil Med Res 7(1):4
Google Scholar
Mangal A, Kalia S, Rajgopal H, Rangarajan K, Namboodiri V, Banerjee S, Arora C (2020) Covidaid: Covid-19 detection using chest x-ray. arXiv preprint arXiv:2004.09803
McHugh ML (2011) Multiple comparison analysis testing in anova. Biochem Med 21(3):203–209
Article MathSciNet Google Scholar
Pham TD (2021) Classification of covid-19 chest X-rays with deep learning: new models or fine tuning? Health Inf Sci Syst 9:1–11
Article Google Scholar
Salathé M, Althaus CL, Neher R, Stringhini S, Hodcroft E, Fellay J, Zwahlen M, Senti G, Battegay M, Wilder-Smith A et al (2020) Covid-19 epidemic in Switzerland: on the importance of testing, contact tracing and isolation. Swiss Med Wkly 150(11–12):w20225
Google Scholar
Santa Cruz BG, Bossa MN, Sölter J, Husch AD (2021) Public covid-19 x-ray datasets and their impact on model bias-a systematic review of a significant problem. Med Image Anal 74:102225
Article Google Scholar
Signoroni A, Savardi M, Benini S, Adami N, Leonardi R, Gibellini P, Vaccher F, Ravanelli M, Borghesi A, Maroldi R et al (2021) Bs-net: Learning covid-19 pneumonia severity on a large chest x-ray dataset. Med Image Anal 71:102046
Article Google Scholar
Soares E, Angelov P, Biaso S, Froes MH, Abe DK (2020) Sars-cov-2 ct-scan dataset: a large dataset of real patients ct scans for sars-cov-2 identification. medRxiv 1(1)
Soares E, Angelov P, Costa B, Castro M (2019) Actively semi-supervised deep rule-based classifier applied to adverse driving scenarios. In: 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. https://doi.org/10.1109/IJCNN.2019.8851842
Soares E, Angelov P, Filev D, Costa B, Castro M, Nageshrao S (2019) Explainable density-based approach for self-driving actions classification. In: 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), pp 469–474. https://doi.org/10.1109/ICMLA.2019.00087
Ting DSW, Carin L, Dzau V, Wong TY (2020) Digital technology and covid-19. Nat Med 26(4):459–461
Article Google Scholar
World Health Organization Declares Global Emergency (2020) A review of the 2019 novel coronavirus (covid-19). Int J Surg 76:71–76. https://doi.org/10.1016/j.ijsu.2020.02.034
Article Google Scholar
Xu Z, Shi L, Wang Y, Zhang J, Huang L, Zhang C, Liu S, Zhao P, Liu H, Zhu L et al (2020) Pathological findings of covid-19 associated with acute respiratory distress syndrome. Lancet Respir Med 8(4):420–422
Article Google Scholar
Yang X, He X, Zhao J, Zhang Y, Zhang S, Xie P (2020) Covid-ct-dataset: a ct scan dataset about covid-19. arXiv preprint arXiv:2003.13865, 14

Download references

Author information

Authors and Affiliations

School of Computing and Communications, Lancaster University, Lancaster, LA1 4WA, UK
Eduardo Soares & Plamen Angelov
Public Hospital of the Government Employees of Sao Paulo, 01532-001, Sao Paulo, BR, Brazil
Sarah Biaso & Daniel Abe
Paulista School of Medicine, Federal University of Sao Paulo, 04021-001, Sao Paulo, BR, Brazil
Marcelo Cury

Authors

Eduardo Soares
View author publications
You can also search for this author in PubMed Google Scholar
Plamen Angelov
View author publications
You can also search for this author in PubMed Google Scholar
Sarah Biaso
View author publications
You can also search for this author in PubMed Google Scholar
Marcelo Cury
View author publications
You can also search for this author in PubMed Google Scholar
Daniel Abe
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Eduardo Soares.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Soares, E., Angelov, P., Biaso, S. et al. A large multiclass dataset of CT scans for COVID-19 identification. Evolving Systems 15, 635–640 (2024). https://doi.org/10.1007/s12530-023-09511-2

Download citation

Received: 20 October 2022
Accepted: 24 May 2023
Published: 27 June 2023
Issue Date: April 2024
DOI: https://doi.org/10.1007/s12530-023-09511-2

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

A large multiclass dataset of CT scans for COVID-19 identification

Abstract

Similar content being viewed by others

Early Detection of COVID-19 from CT Scans Using Deep Learning Techniques

Deep Learning-Based COVID-19 Diagnosis and Trend Predictions

COVID-19 Diagnosis with Artificial Intelligence

1 Introduction

2 Methods

3 Data records

4 Technical validation

5 Conclusion

Data availability

Code availability

References

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

A large multiclass dataset of CT scans for COVID-19 identification

Abstract

Similar content being viewed by others

Early Detection of COVID-19 from CT Scans Using Deep Learning Techniques

Deep Learning-Based COVID-19 Diagnosis and Trend Predictions

COVID-19 Diagnosis with Artificial Intelligence

1 Introduction

2 Methods

3 Data records

4 Technical validation

5 Conclusion

Data availability

Code availability

References

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation