Classification of COVID19 Patients Using Robust Logistic Regression

Coronavirus disease 2019 (COVID19) has triggered a global pandemic affecting millions of people. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) causing the COVID-19 disease is hypothesized to gain entry into humans via the airway epithelium, where it initiates a host response. The expression levels of genes at the upper airway that interact with the SARS-CoV-2 could be a telltale sign of virus infection. However, gene expression data have been flagged as suspicious of containing different contamination errors via techniques for extracting such information, and clinical diagnosis may contain labelling errors due to the specificity and sensitivity of diagnostic tests. We propose to fit the regularized logistic regression model as a classifier for COVID-19 diagnosis, which simultaneously identifies genes related to the disease and predicts the COVID-19 cases based on the expression values of the selected genes. We apply a robust estimating methods based on the density power divergence to obtain stable results ignoring the effects of contamination or labelling errors in the data and compare its performance with respect to the classical maximum likelihood estimator with different penalties, including the LASSO and the general adaptive LASSO penalties.


Introduction
Coronaviruses (CoVs) are a group of enveloped, single, positive-stranded RNA viruses causing mild to severe respiratory illnesses in humans. Coronavirus disease 2019 , caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV2), has lead to a global pandemic affecting millions of people and causing high mortality rates worldwide. Nonetheless, the actual knowledge about COVID19 is limited, and numerous studies have been carried out to identify genes involved in the host response to the SARS-CoV-2 infection, so as to determine mechanisms of pathogenicity and potential therapeutic targets (see, e.g., [20,22,24,29] among many others). Viral infections of human cells lead to the production of interferons (IFNs) as an antiviral mechanism. In majority of cases, patients are asymptomatic or exhibit mild symptoms, whereas in more severe cases, patients may develop severe lung injury and death from respiratory failure. Moreover, SARS-CoV-2 is able to achieve high viral load even in the absence of symptoms, increasing its contagiousness.
Upper airway gene expression analysis can be performed for identification of transcriptional regulatory mechanisms involved in the host response to infection by SARS-CoV-2 and consequently help to distinguish between patients suffering from COVID19 and other viral or non-viral acute respiratory illness (ARIs). Genetic variation may contribute to disease largely through misregulation of gene expression. Metagenomic Next-Generation Sequencing (mNGS) is an useful tool providing clinically actionable information for predicting causes of an infection, evaluating infectious disease risk and successful diagnosing. Therefore, genetic information may be used to build novel respiratory diagnostics that integrate host transcriptional signatures of infection. Conversely, gene expression profiling involves a large number of features, often much larger than the sample size. High feature dimensionality and paucity of samples possess a challenge for predictive classification and marker identification methodologies. Therefore, techniques for high dimensional data analysis need to be applied.
Despite the potential of mNGS, it presents some crucial barriers, including data cleanliness. Contamination of samples during specimen collection is a large concern given the increased analytical sensitivity of mNGS in comparison with standard culture methods. Accordingly, robust statistical analysis appears appropriate for the classification of COVID19 patients and identification of genes involved in patient's response to the infection using mNGS data.
Among the existing high dimensional statistical techniques, the regularized logistic regression model provides simultaneous gene identification and patient classification through a likelihood of suffering from the disease. The low dimensional logistic regression model has been widely used as a powerful classifier, but classical estimation design is ill-posed in the high dimensional set-up and regularized methods need to be applied. Regularization techniques assume that only a few number of explanatory variables are actually involved in the true model underlying the data, so they perform variable selection and parameter estimation by combining a model-based loss function with a penalization on the absolute value of the model parameters. Several penalties have been explored in the literature. The LASSO (Least Absolute Shrinkage and Selection Operator) procedure [27] stands within regularization methods, as it performs remarkably well as both a selector of important variables and as a prediction engine with computational feasibility. Later, Shevade and Keerthi [25] proposed sparse logistic regression based on the LASSO penalty and Cawley and Talbot [9] investigated sparse logistic regression with Bayesian penalty. However, it has been criticized for its biasedness, as it tends to select many noisy features with high probability, and consequently Zou [31] proposed the Adaptive-LASSO (referred as Ad-LASSO in the following) as an alternative to overcome this weakness. Huang et al. [17] applied adaptive LASSO to the logistic regression model and showed convenient asymptotic properties of the resulting estimators. Wide literature applies the regularized logistic regression model for gene identification and diagnosis of a disease though gene expression profiling. Some examples are Wu et al. [28], Jacob et al. [18] and Ghosh and Chinnaiyan [12]. In contrast, both LASSO and Adaptive LASSO procedures are based on the logistic likelihood function and hence, inherits severe lack of robustness; so both methods are sensitive to contamination in the sample.
While classical estimating methods are based on the maximum likelihood estimator (MLE), recent literature has shown the advantage of using divergence-based methods in terms of robustness, with an unavoidable (but often not quite significant) loss of efficiency. Robust methods for logistic regression based on bounded deviances have been introduced in Bianco and Yohai [4]. Cantoni and Ronchetti [8] studied robust M-estimators for generalized linear models and later Avella-Medina and Ronchetti [3] extended the theory for general penalized M-estimators in shrinking neighborhoods. Recently, Bianco et al. [5] studied penalized weighted M-estimators for the logistic regression model with random penalties. Basu et al. [6] introduced the minimum density power divergence (DPD) estimators for general statistical models, which are indeed robust against outliers and leverage points, fisher-consistent and enjoy asymptotic properties derived under much simpler conditions compared to the general M-estimators. The DPD has the interpretation of being a natural generalization of the likelihood-based loss function, so that the MLE is included as a particular case of the DPD-based family. Ghosh and Basu [13] studied the minimum DPD estimator (MDPDE) for generalized linear regression models, including the logistic regression, in the low-dimensional set-up, and later Basu et al. [7] extended the methodology for the high-dimensional logistic regression model, yielding the penalized MDPDE. They considered several penalty functions, including the LASSO and the Ad-LASSO penalties, as well as more general weighted adaptive LASSO (AW-LASSO) penalties. This last work [7] is particularly motivated from the excellent performances, both in terms of estimation accuracy and variable selection optimality, of the penalized DPDbased procedures observed under the high-dimensional linear regression model [14,15].
In this paper, we propose to develop a COVID19 patients classifier through their upper airway gene expression using the penalized logistic regression model, which simultaneously carries out important gene selection. We apply different estimation methods, namely the classical MLE and the MDPDEs penalized with the LASSO, Ad-LASSO and AW-LASSO penalties (specific to the SCAD penalty of Fan and Li [10]) as developed in [7]. While adaptive penalties enhance the variable selection property, robust procedures based on the DPD have been shown to perform competitively with non-robust ones in the absence of contamination, and to improve the estimation accuracy and robustness in a contaminated scenario. Further, AW-DPD-LASSO estimator based on nonconcave penalties acquires many of its advantageous properties with less computational burden.
The outline of the paper is as follows. Section 2 introduces the minimum DPD estimator family and the corresponding penalized estimators. Section 3 describes the real dataset containing the upper airway host transcriptional response of patients suspected of suffering from COVID19 disease. Section 4 studies two classification problems, diagnosis of COVID19 and differentiation between COVID19 and other viral ARIs and discusses the performance of the logistic regression model fitted with different robust and non-robust estimators in both situations. In Sect. 5, some final conclusions are drawn.

Robust Regularized Logistic Regression
Let us consider dichotomous and independent response variables Y 1 , ..Y n , each independently following a Bernoulli distribution, as where π i ∈ [0, 1]. The logistic regression model assesses the Bernoulli probabilities, π i , that are related to a fixed or random k-dimensional vector of explanatory variables, x i , through a common regression parameter β ∈ R k , for each i = 1, . . . , n, satisfying where the function logit( p) = log p 1− p . For simplicity, here, we have assumed that the intercept term is included within the covariate vector x (in its first component). Applying the inverse logit function, the logistic model gives the partnership class probability. In the following, we denote π i = π x T i β , the probability of success of the response Y i , emphasizing its dependence of the observation x i and the regression parameter vector β. Therefore, to fit the logistic regression model it suffices to estimate the common parameter β from the observed data.
As discussed in Sect. 1, classical estimation methods based on the likelihood function for the logistic regression model yield the MLE, known to be asymptotically efficient (is a BAN estimator) but not robust. Given the observed data (y 1 , x 1 ), .., (y n , x n ), the MLE, β, is defined by being the likelihood function. Equivalently, the MLE can be obtained by minimizing the negative log-likelihood, − log (L(β)). We will adopt this last formulation, where the estimator is computed as the minimum of a so-called loss function. Then, to achieve robustness in the estimation, an alternative loss function must be used. In this line, Ghosh and Basu [13] presented a robust family of estimators for the generalized linear models based on the DPD approach and proved robustness its properties. In particular, given the observed data (y 1 , x 1 ), .., (y n , x n ), the DPD for the logistic regression model yields where the tuning parameter α ≥ 0 controls the trade-off between efficiency and robustness. The minimum DPD estimator, β α , (MDPDE) is defined as the minimizer of the loss function given in (3), Furthermore, the DPD loss function can be defined at α = 0 taking continuous limits, and the resulting MDPDE coincides with the MLE. That is, the proposed family of MDPDPE can be considered as a generalization of the MLE. The MDPDEs, β α , demonstrably enjoy great asymptotic properties although they entail an unavoidable loss of efficiency. Conversely, the gain in robustness is in many cases cost-effective.
On the other hand, dealing with high dimensional data requires additional assumptions on the model parameters. In particular, we assume that the true regression vector is assumed to be sparse, that is, having few non-null elements. Explanatory variables with zero regression coefficient are not significant for the model. Thus, variable selection needs to be performed jointly with the parameter estimation. Regularization methods are characterized by combining a loss function (from the model) and a penalty function that induces zero estimation of many coefficients.
The classical penalized (regularized) LASSO estimator of β couples the negative loglikelihood loss function and the popular LASSO penalty, where λ is a regularization parameter controlling the shrinkage of the regression vector β. For more details, see Hastie et al. [16]. The choice of λ then determines the sparsity of the model; the greater is λ, the greater the weight of the penalty in the objective function is. Several criteria for the election of the regularization parameter have been proposed in the literature, including cross-validation or information criteria adapted to the high dimensional set-up. Fokianos [11], Park and Hastie [21], Plan and Vershynin [23], Zhu and Hastie [30] and Sun and Wang [26] are interesting papers based on the LASSO estimator for the logistic regression model. Basu et al. [7] extended the LASSO procedure with DPD-based loss function for the logistic regression model, producing more robust estimators. The so-called LASSO penalized MDPDE (DPD-LASSO) is then given by One of the major drawbacks of the LASSO penalty is that the estimators obtained with such penalty are not consistent, i.e., they lack the oracle property (Fan and Li [10]). Since LASSO function equally penalizes all the coefficients, it over-penalizes coefficients of irrelevant variables leading to a biased estimator. To overcome the bias deficiency, Zou [31] proposed the adaptive LASSO procedure in which adaptive weights are applied to different coefficients. Then, the adaptive LASSO objective function is given by where β = β 1 , ..., β k is a consistent estimator of β. The initial estimator β weights the penalty to which each element of the estimated vector is subjected. For zero initially estimated elements, we can simply define a sufficiently great penalty bound. Therefore, lower elements in β entail a greater penalty, inducing the sparsity in the adaptive LASSO estimator and conversely lower weights are assigned to large initially estimated coefficients. This adaptive penalty reduces the bias problem of the standard LASSO. Some interesting results in relation to the adaptive LASSO estimator in logistic regression models can be seen in Algamal and Lee [1], Araveeporn [2], Bianco et al. [5] and references therein. The idea of weighting the LASSO penalization can be extended to a more general framework, yielding the adaptive weighted LASSO estimator with objective function An interesting proposal for the weighted function is the first derivative of the nonconcave SCAD penalty given by with a > 2, where I and (·) + denote the indicator and positive part functions, respectively. The resulting weighted adaptive penalty is a linear approximation of the SCAD, and hence, it is expected to work as a substitute for this nonconcave penalty, improving unbiasedness, continuity and sparsity properties of the LASSO estimator. The weighted adaptive penalized estimator with this weight function will be referred to as the AW-DPD-LASSO. The adaptive and weighted adaptive LASSO procedure can be easily adapted to the DPD-based loss function leading to an objective function of the form The minimization of the objective (9) produces robust adaptively weighted DPD-LASSO estimators, which includes the DPD-LASSO estimator for w(·) = 1. The resulting penalized MDPDEs are indeed robust for all positives values of α when the initial estimator β is also robust, as proved in Basu et al. [7], and non-robust at α = 0 corresponding to the MLE. Moreover, they are consistent and asymptotically normal in the high dimensional data set-up with non polynomial order, i.e., when log(k) = O(n s ) for some s ∈ (0, 1), under some regularity conditions. Conversely, the gain in robustness entails an efficiency loss. Basu et al. [7] empirically compared the performance of the MDPPE for different values of α with high ultra-dimensional data, concluding that MDPDEs stand competitive in the absence of contamination and improve the model selection and classification rate in a contaminated scenario. The optimal value of the tuning parameter α directly depends on the data, as larger values of α produce more robust estimators which are preferable for high data contamination rate. Moderately large values of α, over 0.3 − 0.5, have been recommended in the literature for worthwhile trade-off between robustness and efficiency.

Data Description and Pre-processing
We study the upper airway host transcriptional response in patients with COVID19 (n = 93), other viral (n = 41) and non-viral (n = 100) ARIs so as to identify genes involved in the host response on host and build a classifiers capable of pairwise differentiate between classes. The data were first considered in Mick et al. [20] who conducted an observational cohort study at the University of California, San Francisco (UCSF) and Zuckerberg San Francisco General Hospital. They evaluated leftover RNA extracted from clinical swab specimens processed at the UCSF Clinical Microbiology Laboratory and performed a clinician-ordered test for SARS-CoV-2 using reverse transcription-polymerase chain reaction (RT-PCR). For negative PCR patients, the presence of other pathogenic respiratory virus was detected by mNGS.
Mick et al. [20] performed pairwise differential expression (DE) analysis between the three patient groups, gene set enrichment analyses (GSEA) on the genes differentially expressed and constructed parsimonious classifiers by combining the LASSO procedure for variable selection and random forest algorithm. They concluded that COVID19 is characterized by markedly attenuated activation of innate immune and pro-inflammatory pathways early in the course of disease compared to other viral ARIs. Human gene counts and metadata are publicity available at https://github. com/czbiohub/covid19-transcriptomics-pathogenesis-diagnostics-results, and IDSeq metagenomic analysis reports are available at https://idseq.net/ under project name "covid19_transcriptomics_pathogenesis_diagnostics". Before fitting the model, gene counts were variance-stabilizing transformed and patients labels were marked using RT-PCR results for COVID19 patients and mNGS results to distinguish between viral and non-viral ARIs. For more details about the preprocessing step, see Mick et al. [20]. The original set of 15900 features was reduced to the k = 2187 most correlated genes with the class distinction using Pearson correlation coefficient.
As discussed in Sect. 1, mNGS data are very sensitive to contamination during collection and may lead to contaminated observations of the explanatory variables. Moreover, standard RT-PCR risk of false-positive or false-negative outcomes, and therefore, some observations may be mislabelled. In order to evaluate the performance of the logistic classifier under contamination both in gene expression profiling (leverage points) and mislabelled observations, we introduce both types of data contamination. For the first, we flag a subset of significant variables using standard LASSO and we randomly select a 5% of the observations. For each selected COVID19 observation, we add twice the mean of the variable across all data to significant variables with negative regression coefficient and subtract the same amount to variables with positive regression coefficient. Then, we apply the inverse transformation to the rest of outliers observations. Finally, to generate mislabelled observations, we randomly select a 10% of the sample and switch its label.

Experiments and Results
We compare the performance of the DPD-based methods with the classical MLE through different accuracy measures, namely sensitivity (true positive rate, TP), specificity (true negative rate, TN) and correct classification rate (CCR). The explicit formulas are We also report the number of genes selected (model size (MS)) with each of the methods. We fit the preprocessed data to the logistic regression model and use the LASSO and adaptive LASSO (Ad-LASSO) methods to estimate the regression parameters, jointly with our proposed MDPDE for the values α = 0.1, 0.3, 0.5, 0.7 and 1, and two different weights functions; w(s) = 1/s yielding to the adaptive DPD-based (Ad-DPD-LASSO) estimator and with a = 3.7, for the AW-DPD-LASSO estimator. Moreover, we apply the highdimensional adaptation of the Generalized Information Criterion (HGIC), introduced in Konishi and Kitagawa [19], to select the optimal value of λ in (9), given by where L( β λ ) is the logistic loglikelihood function. Since the loss function associated with the logistic regression model is bounded, the penalized estimators are very sensitive to the choice of the penalty parameter. Larger choices of λ induce very shrunk estimators. Therefore, λ is chosen over a pre-defined bounded grid of values.
To compute the LASSO and Ad-LASSO methods, we use the R package glmnet, AW-LASSO is fitted using ncvreg package and we fit the DPD-based estimators with our own implemented code available at https://github.com/MariaJaenada/ awDPDlasso. Finally, to examine the robustness of the methods, the logistic regression model is fitted with original and contaminated data and then, evaluated with the original data (without outliers) in both settings.
Further, to assess the dependence of the model on the data, we fit the logistic model with the whole dataset and with 5 subsamples containing all observations except for 5 predefined folds, and we report the accuracy measures of the model fitted with the whole data, and the mean of the measures produced by the 5 different models fitted with each fold. The last one allows us to better assess the estimation dependency of the data, since some subsamples would contain more outlier observations than others.

Diagnosis of COVID-19
We first examine the performance of the different methods when differentiating COVID19 patients from patients suffering other viral and non-viral disease. Table 1 shows the results for the logistic model with two classes, COVID19 patients (Y = 1) and the rest of the patients (Y = 0) without and with contaminated observations, respectively, with a cut-off of 0.5. It is straightforward to see that adaptive methods select more parsimonious models, but remain competitive to the LASSO and Ad-LASSO in all accuracy measures. The proposed robust methods perform similarly to the LASSO and adaptive LASSO in the absence of contamination, so the penalized DPD-based estimators are competitive to penalized MLEs in a contamination-free scenario. Conversely, DPD-based robust methods maintain high classification rates when contamination is introduced into the data, unlike least squares-based methods, whose sensitivity drops considerably. That is, likelihood methods encounter more difficulties in correctly diagnosing the disease. The effect of contamination is even more pronounced when the fivefold cross-validation dataset is used, as the percentage of outlier observations is higher depending on the fold with lower sample size. We also study the accuracy by fitting a Receiver Operating Characteristic (ROC) curve of the model and reporting its area under the curve (AUC). The AUC is a robust overall measure to evaluate the performance of score classifiers because its calculation relies on the complete ROC curve and thus involves all possible classification thresholds. Figure 1 shows the AUC for the different methods with uncontaminated (top) and contaminated (bottom) datasets. All methods have a similar performance in the absence of contamination, but when outliers are introduced the AUC of the classical penalized MLEs decreases more than the robust method's AUC.
Complementary to the accuracy study of the model, it is also interesting to examine common genes selected in each method, and the stability in the selection, in the AUC with uncontaminated data and using all available observations. AUC with uncontaminated data and averaging by groups.
AUC with contaminated data and using all available observations. AUC with contaminated data and averaging by groups.  Figure 3 shows Venn diagrams of the gene sets selected by different methods under pure and contaminated data. It is striking that all methods select over 4-5 common genes, and almost all genes selected by Ad-DPD-LASSO are also selected by the AW-DPD-LASSO method. Following Mick et al. [20], over 10 genes may be enough to construct a competitive classifier for COVID19 diagnosis, and RT-PCR assays usually employ some of the four gene targets, namely ORF1ab/RdRp, E (envelope), N (nucleocapsid), and S (spike) genes for SARS-CoV-2 detection. Conversely, selected genes changed when contaminating the data even when fitting the model with the same estimating method, as our contamination scheme uses the set of important variables to introduce leverage points. Nonetheless, both sets of genes    Fig. 4. In particular, almost all genes selected under contaminated data are highly correlated with (at least) one gene selected in the absence of contamination. One may be also interested in determining the constant effect of a gene on the likelihood that one outcome will occur. Odds ratios (OR) may be used to compare the relative odds of the occurrence of the disease given the expression level of a certain gene. The Odds can be interpreted as the risk or importance of a gene in the diagnostic, so they allow comparison of the magnitude of the risk entailed by different genes for the COVID19 disease. Accordingly, each regression coefficient associated with a gene can be interpreted as the estimated relative increase in the log odds of the outcome per unit increase in the level of that gene. Then, the exponential function of the regression coefficient is the odds ratio associated with a one-unit increase in the expression level. Of course, zero-estimated coefficients, resulting in unit OR, imply that these genes do not affect to the diagnose. Table 2 reports the estimated coefficients and associated OR of the selected genes with the different DPDbased methods. Estimated coefficient and associated OR for the penalized MLE are presented in the Appendix for the seek of briefly. When the model is fitted using penalized MLEs, the OR associated with the selected variables are generally very close to the unit, implying a low importance in the diagnosis. Genes IFI6 and IF44L have Table 2 Estimated coefficients and OR associated with the selected genes with adaptive penalized DPD-based methods

Gene name
Coef.   the greatest OR value in adaptive methods and standard LASSO, respectively. Those genes are two of the most statistically significant genes upregulated by SARS-CoV-2, according to Mick et al. [20]. In contrast, ORs associated with coefficients obtained using DPD-based methods are generally more distant from the unit, suggesting genes with greater relevance in the diagnosis, including IFI6 and IF44L. In addition, DPDbased methods find some other important genes in the classification. Besides IFI6 and IF44L genes, TIMP1, FAM83A TRO and WDR74 have been flagged to be specifically upregulated in COVID-19 patients compared to both other viral and non-viral ARIs according to [20,29]. In turn, TIMP1 has been shown to be related to SARS-CoV-2 infection with lower expression level during pathogenesis ( [24]), which is translated in a low OR of the associated coefficient.
The identified genes in our analysis mostly coincide with some biomarkers discussed in the literature. In particular, 20 of the 23 genes identified using LASSO penalized MLE were also selected in Mick et al. [20] classifier, which uses a total of 27 genes. However, we have further explored the importance given by the classifier to each gene, studied the stability of the model when varying the penalty and, in the case of DPD-based estimators, the tuning parameter α. Adaptive methods fit more parsimonious models, while DPD-based method gives more importance of the selected genes. These two properties can be of great use when diagnosis new patients. Conversely, the results should be understood with caution, as the sample size of the data is not large enough and the conclusions may not be generalizable. Nonetheless, we can draw from the study the usefulness of penalized DPD-based methods, which perform well in the absence of contamination and increase the accuracy of the model when the data are contaminated, which is quite common when dealing with mNGS data.

Differentiating Between Viral ARIs
The previous results were calculated with the aggregated data, where the class NO-COVID19 included viral and non-viral ARIs. We now study the performance of the logistic model in differentiating COVID19 from other viral diseases. In this case, only n = 135 observations are available. We contaminate the data using the same methodology as described in Sect. 4, and we fit the model using all available observations and a fivefolds separately. Table 3 shows the accuracy measures produced by different methods under uncontaminated and contaminated scenario, respectively. Again, the proposed DPD-based estimators perform competitively in the absence of contamination and clearly improves the stability of the estimators when using the weighted adaptive LASSO. The decrease in specificity in the contaminated scenario stands out when dividing the sample in fivefolds. In this case, the sample size decreases to n = 108 observations, which increases sensitivity to outliers. Classical likelihood-based methods diagnose mostly all patients having COVID19 and is unable to differentiate it from other ARIs. On the contrary, our DPD-based robust adaptive weighted methods maintain a sufficiently high specificity and sensitivity when training with the whole dataset, proving their ability to differentiate between different viruses. However, when a too reduced training set is used, the adaptive DPD-LASSO method is highly dependent on the initial value and performs worse than the adaptively weighted method, which in turn maintain competitive classification rates. This classification problem illustrates the advantage of the robust procedure for high dimensional classification. Table 4 presents the estimated coefficients and associated OR of the selected genes with DPD-based methods. The results for penalized MLEs are reported in the Appendix. Again, robust methods give more importance to the selected genes, which is shown by their associated OR, and most of the selected genes with the adaptive and weighted adaptive penalties match. Genes selected by DPD-based methods for differentiating between ARIs were mostly identified when distinguishing COVID 19 from other diseases. In particular, TIMP1, TRO, WDR74, AL928654.3, ICAM4, and IGLL5 were also identified by at least one of the DPD-based methods, and LGR6, SMARCA1 and PCSK5 were identified by the classical LASSO estimator.

Conclusions
Robust penalized logistic regression is specially convenient when dealing with gene-based classification problems. From our results, DPD-based robust methods outperform classical ones when training data are contaminated with leverage points or mislabelled observations, which is quite common in genetic datasets. In particular, when the MLE is used, correct diagnosis of the COVID19 is highly affected by this data contamination and only a 50% of sensitivity is achieved. Thus, the model is useless for the diagnosis of new patients. Conversely, the robust DPD-based methods achieve highest sensitivity rates in contaminated scenarios and are competitive in the absence of contamination, presenting a compelling proposal. Besides, gene selection stability is shown within the same DPD-based penalized estimators family, and identi-fied genes with robust methods play a more important role in the diagnosis than genes selected by non-robust estimators. The accuracy loss of the non-robust methods under data contamination is emphasized when differentiating between viral ARIs. In this scenario, penalized MLEs lose their ability to detect viral diseases other than COVID, whereas robust estimators manage to maintain sufficiently high specificity. Weighted adaptive DPD-based estimators show the best performance in this case. All these results presented in this paper illustrate the benefit of using the robust DPD-based procedures which can be used routinely in any future real-life analysis of high-dimensional gene expression data and associated classification problems.

A Additional Results
Tables 5 and 6 present the estimated coefficients and associated OR of the penalized MLEs with LASSO, Ad-LASSO and AW-LASSO penalties, for the diagnosis of COVID19 patients and differentiating between ARIs classification problems, respectively.