Introduction

The pandemic disease caused by the SARS-CoV-2 virus, named COVID-19, is requiring unprecedented responses of exceptional intensity and scope from more than 200 countries around the world, after having infected, in the first 4 months since its outbreak, between 2 and 20 million people and caused at least 200,000 deaths. To cope with the spread of the infection, governments all over the world have taken drastic measures, such as the quarantine of hundreds of millions of residents.

However, because of the COVID-19 symptomatology, which includes a large proportion of asymptomatic cases [12], these efforts are limited by the difficulty of differentiating between COVID-19 positive and negative individuals. Tests to identify the SARS-CoV-2 virus are therefore believed to be crucial to identify positive cases and thus curb the pandemic.

To this aim, the current test of choice is the reverse transcriptase polymerase chain reaction (RT-PCR) assay performed in the laboratory on respiratory specimens. Taking this as a gold standard, machine learning techniques have been employed to detect COVID-19 from lung CT scans with 90% sensitivity and high AUROC (0.95) [19, 27]. Although chest CT has been found to have high sensitivity for the diagnosis of COVID-19 [1], this kind of exam can hardly be employed for screening tasks, due to the radiation dose, the relatively low number of devices available, and the related operation costs. A similar attempt was recently made on chest x-rays [4], a low-dose and less expensive test, with promising statistical performance (e.g., sensitivity 97%). However, since almost 60% of chest x-rays taken in patients with confirmed and symptomatic COVID-19 have been found to be normal [45], systems based on this exam need to be thoroughly validated in real-world settings [6]. Further, despite these promising results, concerns have been raised about these and other works, most of which have not yet undergone peer review: a recent critical survey [46] reported that all of the surveyed studies were possibly subject to high bias and risk of over-fitting, and showed little compliance with reporting and replication standards.

The public health emergency requires an unprecedented global effort to increase testing capacity [33]. The large demand for rRT-PCR tests (also commonly known as nasopharyngeal swab tests) due to the worldwide extension of the virus is highlighting the limitations of this type of diagnosis at large scale: long turnaround times (on average 2 to 3 hours to generate results), and the need for certified laboratories, trained personnel, and expensive equipment and reagents for which demand can easily exceed supply [28]. In Italy, for instance, the scarcity of reagents and specialized laboratories forced the government to limit swab testing to people who clearly showed symptoms of severe respiratory syndrome, leading to a number of infected people and a contagion rate that were largely underestimated [39].

For this reason, and also in light of the predictable wide adoption of mobile apps for contact tracing [15], which will likely increase the demand for population screening, there is an urgent need for alternative (or complementary) testing methods by which to quickly identify infected COVID-19 patients, mitigate virus transmission, and guarantee prompt patient treatment.

In a previous work published in the laboratory medicine literature [14], we showed how simple blood tests might help identify false positive/negative rRT-PCR tests. This work and the considerations made above strongly motivated us to apply machine learning methods to routine, low-cost (see Footnote 1) blood exams, and to evaluate the feasibility of predictive models for the mass screening of potential COVID-19 infected individuals. A comprehensive literature review on the use of machine learning for COVID-19 screening and diagnosis has recently been published [46]; after searching the PubMed, Scopus and Web of Science search engines, we confirm its finding that, to date, no machine learning solution has been applied to blood counts and other comprehensive routine blood tests for COVID-19 screening and diagnosis. The only study available so far in the peer-reviewed literature that applied this approach, although in combination with CT-based diagnosis, was proposed in [32], but it was limited to white blood cell counts. In what follows, we report the study that proves the feasibility of our approach.

Methods

The aim of this work is to develop a predictive model, based on machine learning techniques, to predict positivity or negativity for COVID-19. In the rest of this section we report on the dataset used for model training and on the data analysis pipeline adopted.

Data description

The dataset used for this study was made available by the IRCCS Ospedale San Raffaele (see Footnote 2) and consisted of 279 cases, randomly extracted from patients admitted to that hospital from the end of February 2020 to mid-March 2020. Each case included the patient's age, gender, values from routine blood tests extracted as in [13], and the result of the rRT-PCR test for COVID-19, performed on a nasopharyngeal swab. The parameters collected by the blood test are reported in Table 1.

Table 1 Features of the dataset considered in the present study

The dependent variable “Swab” is binary: it is equal to 0 in the absence of COVID-19 infection (negative swab test) and equal to 1 in the case of COVID-19 infection (positive swab test). The negative and positive classes comprised 102 (37%) and 177 (63%) cases, respectively; the dataset was thus slightly imbalanced towards positive cases.

Table 2 summarizes the descriptive statistics of the continuous features considered in this work. In Fig. 1 we report the violin plots that show the feature distribution of the most predictive features employed to build the machine learning models of this case study.

Fig. 1

Violin plots for selected features in the training dataset (chosen for their predictive importance)

Table 2 Descriptive statistics for the features considered in the present study

Figure 2 shows the pairwise correlation of the features used for this study, while Fig. 3 focuses on variables “Age”, “WBC”, “CRP”, “AST” and “Lymphocytes”.
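As a minimal illustration, a pairwise Pearson correlation matrix like that of Fig. 2 can be computed and rendered with pandas and matplotlib; this is a sketch, assuming `df` is a DataFrame holding the features of Table 1 (not the exact plotting code used for the figure):

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_correlations(df: pd.DataFrame) -> None:
    # Pairwise Pearson correlations; pandas ignores missing values pairwise.
    corr = df.corr(method="pearson")
    fig, ax = plt.subplots(figsize=(8, 8))
    im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
    ax.set_xticks(range(len(corr.columns)))
    ax.set_yticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticklabels(corr.columns)
    fig.colorbar(im, ax=ax)
    plt.tight_layout()
    plt.show()
```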

Fig. 2

Pairwise Pearson correlation of the features taken into account for this case study

Fig. 3

Distribution plots and pairwise scatter plots of selected features. Red points and red distributions represent patients positive to COVID-19, while blue points represent negative patients

Data manipulation

First, the categorical feature Gender was transformed into two binary features by one-hot encoding, as sketched below. Further, we note that the dataset was affected by missing values in most of its features (see Table 3).
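A minimal sketch of this encoding step with pandas, assuming `df` is the raw DataFrame and that Gender takes values such as "M"/"F" (an assumption for illustration):

```python
import pandas as pd

# One-hot encode the categorical Gender column into two binary columns
# (e.g., "Gender_M" and "Gender_F").
df = pd.get_dummies(df, columns=["Gender"])
```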

Table 3 Features and missing values in the dataset

To address data incompleteness, we performed missing data imputation by means of the Multivariate Imputation by Chained Equations (MICE) method [5]. MICE is a multiple imputation method that works iteratively: in each imputation round, one feature with missing values is selected and modeled as a function of all the other features; the estimated values are then used to impute the missing values and are re-used in the subsequent imputation rounds.

We chose this method because multiple imputation techniques are known to be more robust and better able to account for uncertainty than single imputation ones [38] (as they employ the joint distribution of the available features), especially when the proportion of missing values on some features is large. Further, in order to avoid data leakage and control the bias due to imputation, we performed the missing data imputation inside the nested cross-validation (described in the following section), using for the imputation only the data in each training fold: this allows the influence of the data imputation on the results to be quantified by observing the variance of the results across the folds.
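A hedged sketch of how leakage-free imputation can be arranged in scikit-learn: wrapping a chained-equations imputer in a Pipeline ensures it is re-fitted on the training folds only. Note that `IterativeImputer` is scikit-learn's MICE-style imputer (a single-imputation variant); the exact MICE configuration used in the study is not reproduced here.

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    # Chained-equations imputation: each feature with missing values is
    # iteratively modeled as a function of the other features.
    ("impute", IterativeImputer(max_iter=10, random_state=0)),
    ("clf", RandomForestClassifier(random_state=0)),
])
# When this pipeline is passed to cross_val_score / GridSearchCV, the imputer
# is re-fitted on each training fold, so no information leaks from the test fold.
```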

Model training, selection and evaluation

We compared different classes of machine learning classifiers. In particular, we considered the following classifier models (a scikit-learn instantiation sketch follows the list):

  • Decision Tree [40] (DT);

  • Extremely Randomized Trees [17] (ET);

  • K-nearest neighbors [2] (KNN);

  • Logistic Regression [21] (LR);

  • Naïve Bayes [25] (NB);

  • Random Forest [23] (RF);

  • Support Vector Machines [41] (SVM).
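For illustration, these classifier families can be instantiated in scikit-learn as sketched below; the hyperparameter grids actually searched are not reproduced, and the Gaussian Naïve Bayes and RBF-kernel SVM variants are assumptions:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# One default-configured instance per compared model family.
candidate_models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "ET": ExtraTreesClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "LR": LogisticRegression(max_iter=1000),
    "NB": GaussianNB(),
    "RF": RandomForestClassifier(random_state=0),
    "SVM": SVC(probability=True, random_state=0),  # probability scores for ROC
}
```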

We also considered a modification of the Random Forest algorithm, called the three-way Random Forest classifier [7] (TWRF), which allows the model to abstain on instances for which it can express only low confidence; in so doing, a TWRF achieves higher accuracy on the effectively classified instances at the expense of coverage (i.e., the number of instances on which it makes a prediction). We decided to also consider this class of models as it can provide more reliable predictions in a large portion of cases, while exposing the uncertainty regarding the other cases, so as to suggest further (and more expensive) tests on them.

From a technical point of view, Random Forest is an ensemble algorithm that relies on a collection of Decision Trees (i.e., a forest, hence the name) trained on mutually independent subsets of the original data, in order to obtain a classifier with lower variance and/or lower bias. The independent datasets on which the Decision Trees in the forest are trained are obtained from the original dataset by both sampling the instances with replacement and selecting a random subset of the features (see [20] for more details about the Random Forest algorithm). Since Random Forests are probability-scoring classifiers (that is, for each instance the model assigns a probability score to every possible class), abstention is performed on the basis of two thresholds α, β ∈ [0,1]: denoting the positive class with 1 and the negative class with 0, each instance is classified as positive if score(1) > α and score(1) > score(0), as negative if score(0) > β and score(0) > score(1), and otherwise the model abstains. In these models the performance is usually evaluated only on the non-abstained instances [16], and the coverage is a further performance element to be considered.
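This abstention rule can be sketched as a thin wrapper around a scikit-learn Random Forest. The following is an illustrative implementation, not the TWRF code used in the study; it assumes the class labels are 0 and 1 (as for the Swab variable), and the default thresholds are placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class ThreeWayRandomForest:
    ABSTAIN = -1  # label returned when the model withholds a prediction

    def __init__(self, alpha=0.7, beta=0.7, **rf_kwargs):
        self.alpha, self.beta = alpha, beta
        self.rf = RandomForestClassifier(**rf_kwargs)

    def fit(self, X, y):
        self.rf.fit(X, y)
        return self

    def predict(self, X):
        # With labels {0, 1}, predict_proba columns are [score(0), score(1)].
        proba = self.rf.predict_proba(X)
        s0, s1 = proba[:, 0], proba[:, 1]
        out = np.full(proba.shape[0], self.ABSTAIN)
        out[(s1 > self.alpha) & (s1 > s0)] = 1  # confident positive
        out[(s0 > self.beta) & (s0 > s1)] = 0   # confident negative
        return out

    def coverage(self, X):
        # Fraction of instances on which a prediction is actually made.
        return float(np.mean(self.predict(X) != self.ABSTAIN))
```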

The models mentioned above were trained and evaluated through a nested cross-validation procedure [9, 20]. This procedure allows for an unbiased estimate of the generalization error while the hyperparameter search (including feature selection) is performed: an inner cross-validation loop finds the optimal hyperparameters via grid search, while an outer loop evaluates the model performance on five folds.
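A minimal sketch of this scheme in scikit-learn, where `X` and `y` denote the feature matrix and the Swab labels; the parameter grid shown is illustrative, not the one searched in the study:

```python
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: hyperparameter search via grid search.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=inner_cv, scoring="balanced_accuracy")

# Outer loop: each fold re-runs the whole search on its training split,
# so the outer scores are unbiased by model selection.
outer_scores = cross_val_score(search, X, y, cv=outer_cv)
```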

Models were evaluated in terms of accuracy, balanced accuracy (see Footnote 3), positive predictive value (PPV, see Footnote 4), sensitivity, specificity and, except for the three-way Random Forest, the area under the ROC curve (AUC). After discussing with the clinicians involved in this study, we considered accuracy and sensitivity to be the main quality metrics, since false negatives (that is, patients positive to COVID-19 who are nevertheless classified as negative, and possibly sent home) are more harmful than false positives in this screening task.
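For reference, these metrics can all be derived from the binary confusion matrix, as in the following sketch, where `y_true` and `y_pred` are assumed to be 0/1 arrays (1 = positive swab):

```python
from sklearn.metrics import confusion_matrix

def screening_metrics(y_true, y_pred):
    # With labels=[0, 1], ravel() yields tn, fp, fn, tp in this order.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "ppv": tp / (tp + fp),     # positive predictive value
        "sensitivity": sensitivity,
        "specificity": specificity,
    }
```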

Results

For all the preprocessing steps and tested classifiers, we employed the standard Python data analysis ecosystem, comprising pandas [31] (for data loading and pre-processing), scikit-learn [35] (for both pre-processing and the classifier implementations) and matplotlib [22] (for visualization purposes). The experiments were executed on a PC with an Intel i7 processor (6 cores, 3.2 GHz clock frequency) and 12 GB RAM: model selection required around 2 minutes of computation time, while both fitting the model on the training set and computing the predictions on the test/validation sets required around 1 second.

Tables 4 and 5 show the 95% confidence intervals of, respectively, the average accuracy and the average balanced accuracy (that is, the average of sensitivity and specificity) of the models on the nested cross-validation, trained on the two best-performing sets of features: the first, dataset A, includes all the variables, while the second, dataset B, excludes the “Gender” variable, as it was found to be of negligible predictive value.

Table 4 The models’ performance: 95% C.I. of model accuracy on 5-folds nested CV
Table 5 The models’ performance: 95% C.I. of model balanced accuracy on 5-folds nested CV

Figure 4 shows the performance of the traditional models (i.e., the TWRF model was excluded) on the nested cross-validation.

Fig. 4

Violin plots of the accuracy distributions reached by each model on five folds (on dataset B)

To further validate the above findings, the entire dataset was split into training and test/validation sets of, respectively, 80% and 20% of the total instances. The performance of the models, with optimal hyper-parameters as selected through nested cross-validation, is shown in Fig. 5, which depicts the ROC curves for all the models. Two models, Logistic Regression and Random Forest, exhibited comparable performance (differences of at most one percentage point) in terms of AUC (LR = 85%, RF = 84%) and sensitivity (LR = 93%, RF = 92%), but Random Forest reported higher accuracy (LR = 78%, RF = 82%) and much higher specificity (LR = 50%, RF = 65%): thus, Random Forest was selected as the reference best performing model. This model, trained on dataset B, achieved the following results on the test/validation set: accuracy = 82%, sensitivity = 92%, PPV = 83%, specificity = 65%, AUC = 84%. Figure 6 shows its performance in the precision/recall space.
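As an illustration, this hold-out evaluation can be sketched as follows, assuming `X`, `y` and a tuned classifier `best_rf`; the stratification and the random seed are assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

# 80/20 split, preserving the class balance in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

best_rf.fit(X_train, y_train)
scores = best_rf.predict_proba(X_test)[:, 1]  # probability of the positive class
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))
```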

Fig. 5

The ROC curves (true positive rate vs. false positive rate, i.e., sensitivity vs. 1 − specificity) of the evaluated models. The best performing algorithm, Random Forest, is highlighted

Fig. 6

The precision/recall (i.e., positive predictive value / sensitivity) curve, and the area under this curve

The optimal hyperparameters found are shown in Table 6.

Table 6 Optimal hyperparameters for the Random Forest classifier. For the sake of reproducibility, the random seed is also reported

Similarly, the best three-way Random Forest classifier achieved, on the validation set: accuracy = 86%, sensitivity = 95%, PPV = 86%, specificity = 75%, coverage = 70% (that is, the model abstained on 30% of the validation instances).

The feature importances assessed for the best performing model (Random Forest on dataset B) are shown in Fig. 7. They were computed by estimating, for each feature, the total normalized reduction, across the Decision Trees in the trained Random Forest, of the impurity of the target variable (which, for a binary target, is proportional to its variance); hence, greater importance values denote a greater contribution to explaining the target. This computation was performed via the reference Random Forest implementation provided in the scikit-learn library.
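For reference, these scores can be read directly from a fitted scikit-learn forest; a sketch, assuming `best_rf` is the trained Random Forest and `feature_names` the column names of dataset B:

```python
import pandas as pd

# Impurity-based (mean decrease in impurity) importances, normalized to sum to 1.
importances = pd.Series(best_rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```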

Fig. 7

Feature importance scores for the best performing model

Finally, it is worth noting that while the best performing model obtained good predictive performance, Random Forest is known to be a black-box model, that is, a model unable to directly provide interpretable insight into how its predictions are made, as these predictions are obtained by averaging the Decision Trees in the forest. In order to provide an interpretable overview (in the sense of eXplainable AI [18]) of this predictive model, we also developed a Decision Tree model, shown in Fig. 8, to approximate the decision-making steps implemented by the Random Forest model; one possible way to derive such a surrogate is sketched below. Although the depicted Decision Tree has lower discriminative performance than the two former (inscrutable) models, such a tree can be used as a simple decision aid by clinicians interested in using blood values to assess suspected COVID-19 cases.
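A hedged sketch of one common way to obtain such an interpretable surrogate: fitting a shallow Decision Tree to mimic the Random Forest's predictions (model distillation). Whether the paper's tree was built this way or trained directly on the swab labels is not specified here; this is purely illustrative, and `X_train`, `best_rf` and `feature_names` are assumed:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Shallow tree trained on the forest's predicted labels, so its splits
# approximate the forest's decision boundary while remaining readable.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X_train, best_rf.predict(X_train))
print(export_text(surrogate, feature_names=list(feature_names)))
```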

Fig. 8

An interpretable Decision Tree, developed in order to support the interpretation of the predictions from the other models. Color gradients denote predictivity for either class (shades of blue correspond to COVID-19 negativity, shades of orange to positivity)

Discussion

We have developed two machine learning models to discriminate between patients who are positive or negative to SARS-CoV-2, the coronavirus causing the COVID-19 pandemic. In this task, patients are represented in terms of a few basic demographic characteristics (gender, age) and a small array of routine blood tests, chosen for their convenience and low cost, and because their results are usually available within 30 minutes from the blood draw in a regular emergency department. The ground truth was established through rRT-PCR swab tests.

We presented the best traditional model, as is common practice, and a three-way model, which guarantees the best sensitivity and positive predictive value: sensitivity is the proportion of infected (and contagious) people who will have a positive result, and is therefore useful to clinicians when deciding which test to use; PPV, on the other hand, is useful for patients, as it tells the odds of having COVID-19 given a positive result.

The performance achieved by these two best models (sensitivity between 92% and 95%, accuracy between 82% and 86%) provides evidence that this kind of data, and these computational models, can be used to discriminate potential COVID-19 infectious patients with sufficient reliability, and with sensitivity similar to that of the current gold standard. This is the most important contribution of our study.

From the clinical point of view too, the feature selection was considered valid by the clinicians involved. Indeed, the specialist literature has found that COVID-19 positivity is associated with lymphopenia (that is, an abnormally low level of lymphocytes in the blood), damage to liver and muscle tissue [44, 48], and significantly increased C-reactive protein (CRP) levels [10]. In [29] a comprehensive list of the most frequent abnormalities in COVID-19 patients has been reported: among the 14 conditions considered, they report increased aspartate aminotransferase (AST), decreased lymphocyte count, increased lactate dehydrogenase (LDH), increased C-reactive protein (CRP), increased white blood cell count (WBC) and increased alanine aminotransferase (ALT).

These parameters are also the most predictive features identified by the best classifier (Random Forest), together with the Age attribute. Other studies also confirm the relevance of these features and their association with COVID-19 positivity [8, 34, 37, 50], compared to other kinds of pneumonia [49]. This further confirms that our models are grounded in clinically relevant features, and that most of these values can be extracted from routine blood exams.

The interpretable Decision Tree model provides further confirmation of the soundness of the approach (see Fig. 8): the clinicians (ML, GB) and the biochemist (DF) involved in this study found it reasonable that AST should be the first parameter to consider (mirrored by the fact that AST is the root of the Decision Tree) and that it was found to be the most important predictive feature. Indeed, values of AST above 25 are good predictors of COVID-19 positivity (accuracy = PPV = 76%), while values below 25 are a good predictor of COVID-19 negativity (accuracy = negative predictive value = 83%). Similar observations can also be made about CRP, Lymphocytes and general WBC counts.

No statistically significant difference was found between the accuracy and the balanced accuracy of the models (as mirrored by the overlap of the 95% confidence intervals), a sign that the dataset was not severely imbalanced.

Moreover, we can note that the best performing ML classifier (Random Forest) exhibited very high sensitivity (∼90%) but, in comparison, a limited specificity of only 65%. This is the main motivation for the three-way classifier: this model offers a trade-off between increased specificity (a 10% increment compared with the best traditional ML model) and reduced coverage, as the three-way approach abstains on uncertain instances (i.e., the cases that cannot be classified with high confidence as either positive or negative). This means that the model yields more robust and reliable predictions for the classified instances (as mirrored by the increase in all of the performance measures), while for the other instances it is still useful in suggesting further tests, e.g., an rRT-PCR swab test or a chest x-ray.

With regard to the specificity exhibited by our models, we can further note that, even though these values are relatively low compared with other tests (which are more specific but slower and less accessible), this may not be too much of a limitation: there is a significant disparity between the costs of false positives and false negatives, and our models in fact favor sensitivity (thus avoiding false negatives). Further, the high PPV (> 80%) of our models suggests that the large majority of cases identified as positive by our models would indeed be COVID-19 positive cases.

That said, the study presents two main limitations. The first, more obvious one regards the relatively low number of cases considered. This was tackled by performing nested cross-validation in order to control for bias [43], and by employing models that are known to be effective also with moderately sized samples [3, 36, 42]. Nonetheless, further research should aim at confirming our findings, by integrating hematochemical data from multiple centers and increasing the number of cases considered. The second limitation may be less obvious, as it regards the reliability of the ground truth itself. Although this was built by means of the current gold standard for COVID-19 detection, i.e., the rRT-PCR test, a recent study observed that the accuracy of this test may be highly affected by problems such as inadequate procedures for collection, handling, transport and storage of the swabs, sample contamination, and the presence of interfering substances, among others [30]. As a result, some recent studies have reported up to 20% false-negative results for the rRT-PCR test [24, 26, 47], and a recent systematic review reported an average sensitivity of 92% and cautioned that “up to 29% of patients could have an initial RT-PCR false-negative result”. Thus, contrary to common belief and some preliminary studies (e.g., [11]), the accuracy of this test could be less than optimal, and this could have affected the reliability of the ground truth in this study as well (as in any other study using this test for ground truthing, unless cases are annotated after multiple tests). However, besides being a limitation, this is also a further motivation to pursue alternative ways to diagnose SARS-CoV-2 infection, such as our methods.

Future work will be devoted to the inclusion of more hematochemical parameters, including those from arterial blood gas (ABG) assays, to evaluate their predictiveness with respect to COVID-19 positivity, and to the inclusion of cases whose probability of being COVID-positive is almost 100%, as they resulted positive to two or more swabs or to serologic antibody tests. This would allow us to associate a higher weight with misidentifying those cases, so as, we conjecture, to further improve the sensitivity.

Moreover, we want to further investigate the interpretability of our models, by both having more clinicians validate the current Decision Tree, and possibly constructing a more accurate one, so that clinicians can use it as a convenient decision aid to interpret blood tests with regard to suspected COVID-19 cases (even off-line).

Finally, this was conceived as a feasibility study for an alternative COVID-19 test on the basis of hematochemical values. In virtue of this ambitious goal, the success of this study does not exempt us from pursuing a real-world, ecological validation of the models [6]. To this aim, we deployed an online Web-based tool (see Footnote 5) by which clinicians can test the model by feeding it clinical values, and assess the sensibleness and usefulness of the indications the model provides in return. After this successful feasibility study, we will conceive proper external validation tasks and undertake an ecological validation to assess the cost-effectiveness and utility of these models for the screening of COVID-19 infection in all the real-world settings (e.g., hospitals, workplaces) where routine blood tests are a viable test of choice.

Code and data availability

Availability of data and material

The developed web tool is available at the following address: https://covid19-blood-ml.herokuapp.com/. The complete dataset will be made available on the Zenodo platform as soon as the work is accepted for publication.

Code availability

The complete code will be made available on the Zenodo platform as soon as the work is accepted for publication.