Common, low-frequency, rare, and ultra-rare coding variants contribute to COVID-19 severity.

The combined impact of common and rare exonic variants in COVID-19 host genetics is currently insufficiently understood. Here, common and rare variants from whole exome sequencing data of about 4,000 SARS-CoV-2-positive individuals were used to define an interpretable machine learning model for predicting COVID-19 severity. Firstly, variants were converted into separate sets of Boolean features, depending on the absence or the presence of variants in each gene. An ensemble of LASSO logistic regression models was used to identify the most informative Boolean features with respect to the genetic bases of severity. The Boolean features selected by these logistic models were combined into an Integrated PolyGenic Score that offers a synthetic and interpretable index for describing the contribution of host genetics in COVID-19 severity, as demonstrated through testing in several independent cohorts. Selected features belong to ultra-rare, rare, low-frequency, and common variants, including those in linkage disequilibrium with known GWAS loci. Noteworthly, around one quarter of the selected genes are sex-specific. Pathway analysis of the selected genes associated with COVID-19 severity reflected the multi-organ nature of the disease. The proposed model might provide useful information for developing diagnostics and therapeutics, while also being able to guide bedside disease management.


Introduction
For almost two years COVID-19 has demonstrated itself to be a disease having a broad spectrum of clinical presentations: from asymptomatic patients to those with severe symptoms leading to death or persistent disease ("long COVID") [1][2][3]. While developing vaccination programmes and other preventive measures to significantly dampen infection transmission and reduce disease expression, a much deeper and more precise understanding of the interplay between SARS-CoV-2 and host genetics is required to support the development of treatments for new virus variants as they arise. Furthermore, advances in modelling the interplay between SARS-CoV-2 and host genetics hold significant promise for addressing other complex diseases. In this study, we demonstrate the value of genetic modelling with its direct translatability into drug development and clinical care in the context of a severe public health crisis.
The identification of host genetic factors modifying disease susceptibility and/or disease severity has the potential to reveal the biological basis of disease susceptibility and outcome as well as to subsequently contribute to treatment amelioration [4]. From a scientific point of view, COVID-19 represents a particularly interesting and accessible complex disorder for modeling host genetic data because the environmental factor (SARS-CoV-2) can be readily identified by a PCR-based swab test. The still moderate viral genome variability has thus far been shown to have relatively low impact on disease severity [5] where currently age, sex, and comorbidities are the major factors predicting disease susceptibility and outcome [6]. While these factors certainly have significant value for prediction, they provide limited insights into disease pathophysiology and are of limited relevance for drug development.
Common variants in the human genome affecting the susceptibility to SARS-CoV-2 infections and COVID-19 severity have been successfully identified by Genome-Wide Association Studies (GWASs) [7][8][9]. However, these variants only explain a small fraction of trait variability and, as it is well documented, GWASs are difficult to interpret because they often associate non-coding variants  Whole genome sequencing at mean coverage of 30x was performed on the Illumina NovaSeq6000 platform, then analyzed using the McGill Genome Center bioinformatics pipeline (https://doi.org/10.1093/gigascience/giz037), in accordance with GATK best practice guidelines.

PC analysis
The standard analysis of Principal Components was performed and the first principal components turned out to be connected with the patient's ethnicity collected in the medical records. Therefore, the genetic ancestry of the patients was estimated using a random forest classifier trained on samples from the 1000 genomes project and using as input features the first 20 principal components computed from the common variants by PLINK [27]. In order to avoid bias in the analysis due to the different ethnicity, only patients of genetic European ancestry were retained for further analyses.

Definition of the Boolean features
Variants were converted into 12 sets of Boolean features to better represent the variability at the gene-level. Firstly, any variant not impacting on the protein sequence was discarded. Then the remaining variants were classified according to their minor allele frequency (MAF) as reported in gnomAD for the reference population as: ultra-rare, MAF<0.1%; rare, 0.1%<=MAF<1%; low-frequency, 1%<=MAF<5%; and common, MAF>=5%. Non-Finnish European (NFE) was used as a reference population. SNPs with MAF not available in gnomAD were treated as ultra-rare.
INDELs with frequency not available in gnomAD were treated as ultra-rare when present only once in the cohort and otherwise discarded as possible artefacts of sequencing. For the ultra-rare variants, 3 alternative Boolean representations were defined, which are designed to capture the autosomal dominant (AD), autosomal recessive (AR), and X-linked (XL) model of inheritance, respectively.
The AD and AR representations included a feature for all the genes on autosomes. These features were equal to 1 when the corresponding gene presented at least 1 for the AD model, or 2 for the AR model, and variants in the ultra-rare frequency range and 0 otherwise. The XL representation included only genes belonging to the X chromosome. These features were equal to 1 when the corresponding gene presented at least 1 variant in the ultra-rare frequency range and 0 otherwise.
The same approach was used to define AD, AR, and XL Boolean features for the rare and low-frequency variants. Common variants were represented using a different approach that is designed to better capture the presence of alternative haplotypes. For each gene, all the possible combinations of common variants were computed. For instance, in the case of a gene belonging to an autosome with 2 common variants (named A and B), 3 combinations are possible (A, B, and AB), and (consequently) 3 Boolean features were defined both for the AD and AR model. In the AR model each of these 3 features was equal to 1 if all the variants in that particular combination were present in the homozygous state and 0 otherwise. The same rule was used for the AD model, but setting the feature to 1 even if the variants in that particular combination are in the heterozygous state. In both models, AD and AR, a further feature was defined for each gene to represent the absence of any of the previously defined combinations. In the AD model this feature was equal to 1 if no common variant is present and 0 otherwise; in the AR model, it is equal to 1 if no common variant is present in the homozygous state and 0 otherwise. The same approach was used to define the set of Boolean features for common variants in genes belonging to the X chromosome. The full list of Boolean representations is reported in Supplementary Table 2.

Model Training
The dataset was divided into a training set and a testing set (90/10), and the entire procedure described in this section was performed using only samples in the training set. A bootstrap approach with 100 iterations was adopted to train the model. At each bootstrap iteration, 90% of the samples were selected (without replication), and the following two steps were performed: ( The IPGS is then used, together with age and sex, for training a model that predicts the COVID-19 severity (step 4). These 4 steps of the training procedure are described in detail in the next subsections. Because the model is based on a combination of rare and common variants, the training procedure should be performed using a dataset with homogeneous ancestry.
Step 1: Features' Selection The subsets of the most relevant features were identified using logistic regression models with Least Absolute Shrinkage and Selection Operator (LASSO) regularization. Separate logistic models were trained for each of the 12 sets of Boolean features. The predicted outcome variable for each of these models was a re-classified phenotype adjusted by age and sex. In order to compute these re-classified phenotypes as adjusted by age and sex, the patients were first divided into males and females. Then for each sex an ordinal logistic regression model was fitted by using the age to predict the WHO phenotype classification into 6 grades. The ordinal logistic regression model was chosen as: it imposes a simple monotone relation between input feature and target variable; and it provides easily interpretable thresholds between the predicted classes. The patients with a predicted grading equal to the actual grading were excluded. The remaining patients were divided into two classes depending on whether their actual phenotype was milder or more severe than the one expected for a patient of that age and sex. This procedure has the benefit of isolating patients whose genetic factors are most important for predicting COVID-19 severity. This binary trait, i.e. phenotype more/less severe than expected, was used as the outcome variable for the 12 LASSO logistic models based on the 12 separate Boolean representations. For each LASSO model, the regularization strength was optimized by 10-folds cross-validation with 50 equally spaced values in the logarithmic scale in the range [10 -2 , 10 1 ]. The optimal regularization strength was selected as the one with the best trade-off between the simplicity of the model and the cross-validation score, i.e. as the highest regularization strength providing an average score closer to the highest average score than 0.5 standard deviations. Once the regularization strength was defined, the LASSO model was re-trained using all the samples in that particular bootstrap iteration. The features with non-null coefficients are the ones selected for the next step. In summary, for each bootstrap iteration, this procedure returns 12 lists of features (one for each Boolean representation) that are expected to be the most important features for predicting the phenotype adjusted by age and sex (in that particular bootstrap iteration).
Step 2: Weights of the Integrated Polygenic Score (IPGS) In the previous step, the Boolean representations are considered isolated from each other. The aim of the IPGS is to combine information from different representations (Equation 1). In order to reach this goal, it is necessary to compute the relative weights of the different contributions. For each bootstrap iteration, the list of relevant features extracted as described in the previous section are used to compute the number of features that are associated with mildness or severity for the different frequency ranges. For instance, corresponds to the number of features associated with the mild phenotype coming from Boolean features computed for variants in the frequency range [0.1%, 1%]. A feature is considered associated with the mild phenotype when its coefficient in the LASSO model estimated in step 1 is negative, i.e. it contributes to the prediction of the phenotype adjusted by age and sex in the direction of a phenotype less severe than what expected at that particular age and sex. The same rule, applied to the corresponding Boolean representation, is used to define the other feature-counts appearing in Equation 1. The weighting factors in Equation 1 were estimated as the ones that maximize the Silhouette coefficient of the separation between the clusters of patients more/less severe than expected. The minimization was performed with weighting factors restricted to the following ranges: ,

Model Testing
The training procedure returned 2 logistic models to be compared: one using as input features only age and sex, and the other one using as input features age, sex, and the IPGS. These models were tested, without any further adjustment, using other cohorts of European ancestry. The performances of the two models, with and without IPGS, were evaluated and compared in terms of accuracy, precision, sensitivity, and specificity. The increases of the performances are evaluated with respect to the performances of a model where the values of the IPGS feature have been shuffled, by computing the p-value on the empirical null distribution. In addition, the empirical probability All rights reserved. No reuse allowed without permission. perpetuity.
preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted September 14, 2021. ; https://doi.org/10.1101/2021.09.03.21262611 doi: medRxiv preprint density function of IPGS has been estimated for the severe and non-severe patients of the cohort including both train and test sets and a t-test is carried out to evaluate whether the means of the two distributions were significantly different. As a further evaluation of the importance of the IPGS score on the severity prediction, univariate logistic regression models using as independent variables age (continuous represented in decades), sex, and IPGS were fitted to the dataset that combines both the training set and the testing sets for a total of 2,240 patients. These models were used to estimate the odds ratios and the p-values of the association with the severe phenotype.
Furthermore, a multivariable logistic regression was fitted using IPGS, age, and sex together.
Finally, a multivariable logistic regression was performed using as predictor variables: IPGS, age, sex and comoribidities (congestive/ischemic heart failure; asthma/COPD/OSAS; diabetes; hypertension; cancer). This latter model has been fitted in the training set, where the information on comorbidities was available.

Pathway analysis
Pathway analysis was made using a ranked GSEA approach [28][29], modified according to the specificity of our data. The metrics for gene ranking was calculated on the basis of the results of feature selection models, weighting in each Boolean feature both beta values and the number of bootstrap iterations where it was found significantly associated with severity/mildness (Supplementary Tables 3-6). All the Boolean features that were found significant in at least one of the models were included in the list. As the sign of beta depends on which allele is taken as reference (which is relative for common variants), we decided to use absolute beta values for all the features. To also weight the importance according to variant frequency, we used the F values from the IPGS score for the four categories (Ultra-rare 5, Rare 4, Low Frequency 2, Common 1). Finally, we summed all the weights of each Boolean feature by gene. Briefly: W geneA = ∑ FeaturesGeneA ABS(meanβ) * count * F All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in

Website and data distribution
The coordination of international partners has been possible through the Host Genetics Initiative (HGI) (https://www.covid19hg.org/projects/).

Code availability
Data analyses were performed using Python with the Scipy ecosystem [33], and the scikit-learn library [34]. Statistical association was done with the statsmodel Python library. The code is freely available at the github repository: https://github.com/gen-covid/pmm. All rights reserved. No reuse allowed without permission. perpetuity.
preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted September 14, 2021.

Results
The post-Mendelian paradigm for COVID-19 modelization for combining interpretability with predictivity based on ultra-rare, rare, low-frequency, and common variants.
The aim of the present study was to develop an easily interpretable model that could be used to predict the severity of COVID-19 from host genetic data. Patients were considered severe when hospitalized and receiving any form of respiratory support. The focus on this target variable is motivated by the practical importance of rapidly identifying which patients are more likely to require oxygen support, in an effort to prevent further complications. Interpretability has been a guiding principle in the definition of the machine learning model, as only a readily interpretable model can provide useful and reliable information for clinical practice while also contributing significantly to diagnostic, and therapeutic targeting. The high dimensionality of host genetic data poses a serious challenge to evident and reliable interpretability. So far, the development of a robust predictive model able to make a direct association between single variants and disease severity grading based on an accurate analysis of the vast number of host genetic variants compared to a much smaller number of individual patients has proven to be too complex and ultimately unreliable.
In order to address the complexity with predictive reliability, an enriched gene-level representation of host genetic data was modeled in a machine learning framework. The complexity of COVID-19 immediately suggests that both common and rare variants are expected to contribute to the likelihood of developing a severe form of the disease. However, the contribution of common and rare variants to the severe phenotype is not expected to be the same. A single rare variant that impairs the protein function might cause a severe phenotype by itself after viral infection, while this is not so probable for a common polymorphism, which is likely to have a less marked effect on protein functionality. These observations led to the definition of a score, named , that includes data regarding the variants at different frequencies: All rights reserved. No reuse allowed without permission. perpetuity.
preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted September 14, 2021. can be interpreted as the contributions of common, low-frequency, rare, and ultra-rare variants to a score that represents the genetic propensity of a patient to develop a severe form of COVID-19.

Feature selection and gene discovery
The definition of the single terms of the IPGS formula requires 4 separate steps ( Fig. 1): (1) the definition of a severity phenotype adjusted by age and sex; (2) (Fig. 1B). The conversion of genetic variants into Boolean All rights reserved. No reuse allowed without permission. perpetuity.
preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in  Table 2). The set of input features "ultra-rare_autosomal dominant" (UR_AD) is designed to represent in a binary way an autosomal dominant hereditary model associated with variants with MAF lower than 0.1%; i.e. these Boolean features are equal to 1 for genes presenting at least one variant in this frequency range. Similarly, the set of input features "ultra-rare_autosomal recessive" (UR_AR) and "ultra-rare_X-linked" (UR_X) were designed to describe the autosomal recessive and X-linked models of inheritance of ultra-rare variants. Analogous principles were used for rare and low-frequency variants. In the case of common variants, the same 3 sets of Boolean features representing the autosomal dominant, autosomal recessive, and X-linked models of inheritance were used. However, instead of simply defining the binary variables as "absence/presence of variants", the absence/presence of variant combinations was tested (Fig. 1C).
The Boolean representation of the genetic variability described in the previous paragraph significantly reduces the dimensionality of the problem. However, the number of input features is still orders of magnitude higher than the number of patients that can be reasonably collected for training the model. In order to reduce the number of input features, a feature selection strategy based on logistic models with the validated Least Absolute Shrinkage and Selection Operator (LASSO) regularization was employed (Fig. 1D). The aim of LASSO regularization is to minimize (shrink) the number of coefficients of the model, consequently minimizing the number of input features used for predicting outcomes. Separate logistic models with LASSO regularization were trained for the 12 sets of Boolean features for predicting COVID-19 severity, allowing us to identify the relevant features for each set. About 4% of the cumulative tested features were found to contribute to COVID-19 variability in severity (Fig. 1D).

Biological interpretability of extracted genetic features
All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted September 14, 2021. ; Selected genes contributed by ultra-rare, rare, low-frequency variants, or/and common variants ( Fig. 2A-D and Supplementary Tables 3a-g). Specifically, 54% contributed by only one, 29% by two, 11% by three, and 6% by four types of variants. Around 25% of the genes were sex-specific. The latter group includes either genes located on the X chromosome, such as TLR7 and TLR8 in males, or genes regulated in opposite directions by androgens and estrogens when contributing with less penetrant common variants, such as p.L412F in TLR3 and p.D603N in SELP gene ( Fig. 2A-D).
Among the extracted ultra-rare variants there was a group of genes, such as TLR3, TLR7 and TICAM1, already shown to be directly involved in the Mendelian-like forms of COVID-19 ( Fig. 2A and Supplementary Table 3a-b). Furthermore, another group of genes are natural candidates because of their function: these include the ACE2 shedding protein ADAM17, CFTR-related genes, genes involved in glycolipid metabolism, genes expressed by cells of the innate immune system , and genes involved in the coagulation pathway. Finally, a group of genes led by ACE2 (if affected by ultra-rare variants) confers protection from the severe disease. This group includes several genes whose mutations are responsible for auto-inflammatory disorders.
Among the rare variants extracted, we identified some genes as candidates for COVID-19 severity, including TLR5 and SLC26A9 as well as other genes involved in the inflammatory response Among the low-frequency variants extracted, we identified some genes associated with either severity or protection from severe COVID-19 that are linked to the CFTR pathway (e.g., PSMA6) as well as specific genes involved in the immune response (e.g., NOD2) ( Fig. 2C and Supplementary Table 5a-b).
The model was also able to identify a group of extracted common variants already shown to be linked to either severe or mild COVID-19 ( Fig. 2D and Supplementary Table 6a-b). Among them are the L412F TLR3 and D603N SELP polymorphisms, already reported to be associated with the All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted September 14, 2021. ; https://doi.org/10.1101/2021.09.03.21262611 doi: medRxiv preprint severe disease [17][18] and several coding polymorphisms in Linkage Disequilibrium (LD) with already reported genomic SNP, such as the ABO blood group, OAS1-3 genes, PPP1R15A gene and others [4]. In conclusion, considering their functions, genes involved in the immune and inflammatory responses, or those involved in the coagulation pathway and NK and T cell receptor, are to be considered natural candidates for severe or mild COVID-19.

Integrated PolyGenic Score definition
The Boolean features selected by the LASSO logistic models were used to calculate the variables in Equation 1 (Fig. 3A-B). The corresponding weights ( variables) were defined by optimizing the separation between severe and mild cases as offered by the IPGS formula. The optimization was measured using the Silhouette coefficient, and the optimal values were computed using a grid-search approach over a predefined grid ( , and ).

Pathway analysis
In order to understand the biological mechanisms underlying the variability of disease, we performed a pathways analysis of the genes carrying variants discovered in the feature selection described above. The features obtained with this approach do not have the same predicted impact and are not discovered with the same confidence. Therefore, we decided to perform a rank-based pathway analysis, with genes with the highest impact and confidence ranking highest in the list, rather than a simple over-representation approach. We ranked all the genes that were found to be significantly associated with severity/mildness in at least one bootstrap repetition, based on a score that takes into account three parameters: average coefficient in the LASSO models selecting the All rights reserved. No reuse allowed without permission. perpetuity.
preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted September 14, 2021. ; https://doi.org/10.1101/2021.09.03.21262611 doi: medRxiv preprint feature, number of significant bootstrap results, and the F correction factor for the frequency category used in the IPGS (detailed in the Methods section below). For genes with more than one significant Boolean feature, we summed up the scores of each feature. Gene Set Enrichment Analysis (GSEA) was then performed using two separate ranked gene lists (Supplementary Table   7) for females and males, followed by the generation of similarity networks using EnrichmentMap (Fig. 4A). The usage of rank-based search method allows to identify statistically significant pathways starting from extensive list genes, as each gene is associated with its specific importance.
Although no pathways satisfied the 0.25% FDR threshold normally required for standard GSEA analyses, the set of pathways considered significant using more relaxed thresholds on p values were shown to group in meaningful modules, providing useful information on pathogenetic mechanisms and on the genes that could explain how they can be affected. The network of all the pathways significantly enriched in both females and males ranked gene lists (p<0.01, n=25) is depicted in  Supplementary Table 8.

COVID-19 post-Mendelian model predictivity
The functional interpretation of the variants identified by the feature selection approach, complemented by the strong link between the involved human biological pathways and COVID-19 pathogenicity, support the hypothesis that the IPGS equation developed here may contribute significantly to predicting the severity of COVID-19. This hypothesis was tested by using a logistic regression model that predicts COVID-19 severity based on age, sex and the the IPGS (after All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted September 14, 2021. ; https://doi.org/10.1101/2021.09.03.21262611 doi: medRxiv preprint percentile normalization). The training set is composed of 466 patients not included in the training set previously exploited for the IPGS feature engineering. The model's performance was then tested using three independent cohort sets of European ancestry (Fig. 5A). The model exhibited an overall accuracy of 0.73, precision equal to 0.78, with a sensitivity and specificity of 0.72 and 0.75, respectively. Noteworthy, all the aforementioned metrics are higher than the corresponding values obtained using a logistic model that adopted as input features only age and sex. The increase in performances of the model with IPGS suggests that this score indeed confers significant additional (genetic) information for predicting COVID-19 severity compared to only age and sex. The increase of the performances is statistically significant (p-value < 0.05 for accuracy, precision, sensitivity, specificity) with respect to the distribution of performances for an ensemble of models where the IPGS feature has been randomized ( Fig. 5C lower left). A third logistic regression model fitted with IPGS alone, shows performances well above the random guess. Furthermore, the empirical probability density function of IPGS scores (Fig. 5C right) has been estimated for the severe and non-severe patients of the cohort including both training and testing sets. It is worth noting the shift on the right of the IPGS distribution for the severe patients, with significant p-value (<0.001) for the t-test of mean difference. This difference between severe and non-severe cases is preserved for the male and female cohorts when analyzed separately (p-values <0.001 and 0.024, respectively).
In line with the results obtained using the overall test set, the model including IPGS, age, and sex performed better than the model considering only sex and age as inputs, in each of the testing cohorts, separately (Fig. 5D). The increase in performance was systematically observed throughout all the cohorts: on average +1.33% for accuracy, +1% for precision, +1.33% for sensitivity, +1.67% for specificity. Considering the difference in phenotype classification inherent to a comparison among various international cohorts, and the genetic variability among different European sub-populations, the consistent increase in performances observed for the model with IPGS All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in

Advantages of IPGS and clinical interpretability of connected features
We then wanted to compare the clinical outcome with the probability of severity obtained from three different models: IPGS alone, sex-age alone or combined model (represented as heatmap in Fig. 6). It appears evident that in a subset of patients, the 2 models based on sex-age alone and IPGS alone have a discordant prediction (left and right end of dendrogram in Fig. 6A). In these cases IPGS appears to be a relevant predictor of severity (Fig. 6A). This is in accordance with the above presented logistic regression analysis (Fig. 5E) that shows IPGS having an OR of 2.32 for severity. Moreover, the list of features on which the IPGS score is built, represent a biological handle for pathophysiological mechanisms and possible personalized adjuvant treatments.
For example, three male patients, within two distinct age ranges (46-50, 51-55) (panels B, C and D) with severe outcome (intubation and CPAP) are imperfectly represented by the sex-age model All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted September 14, 2021. ; https://doi.org/10.1101/2021.09.03.21262611 doi: medRxiv preprint (probability of severity from 0,52 to 0,66) and better represented by the IPGS model (probability of severity from 0,91 to 0,95). The detected genetic variants that would allow to clinically consider putative personalized treatments in similar cases are: i) TLR7 ultra rare mutation indicating to consider possible adjuvant treatment with IFN gamma administration [12]; ii) homozygosity 603Asn in SELP gene suggesting putative adjuvant treatment with anti selectin P autoantibodies (e.g. Crizanlizumab) [18] and iii) polyQ longer than 23 in AR gene suggesting to consider possible adjuvant treatment with testosterone [16].
In a female patient, within age range 31-35, the sex-age model showed a probability of severity perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in

Discussion
The importance of combination of rare and low frequency variants has already been demonstrated to contribute to the prediction of susceptibility in other complex disorders [37-38].
Here we further expand this approach while demonstrating that ultra-rare, rare, low frequency, as well as common variants contribute to the likelihood of developing a severe form of COVID-19.
Furthermore, we included in our analyses a calibration of the relative weight of the variants vis-a-vis their impact on disease severity: a single ultra-rare variant might well by itself cause a severe phenotype of COVID-19, while this is less probable for a common polymorphism, one that is likely to have a markedly less direct effect on protein functionality. We performed a first modellization of COVID-19 genetics using both rare and common variants [20]. Because feature selection methodologies are generally sensitive to allele frequency, the extraction was performed separately for rare (MAF <1/100) and common (MAF >1/100) variants. However, the methodology revealed the insight that low-frequency variants (MAF from 1% to 5%) were disadvantaged if selected together with common ones. Furthermore, for extracting Mendelian-like genes a threshold of MAF < 0.1% (ultra-rare variants) appeared more effective than MAF< 1% and all mutations in the TLR7 gene that proved to have loss of function had indeed MAF< 1/1000 [12]. The model we arrived at, now considers separately ultra-rare, rare, low-frequency, and common variants.
Similar to the classical PRS (Polygenic Risk Score), the proposed IPGS (Integrated PolyGenic Score) may prove reliable for assessing the probability of severe COVID-19 following infection by SARS-CoV-2 [21]. While PRS is based on common polymorphisms found at the genomic level with the majority of loci potentially conferring risk being not easily interpretable due to the uncertainty of linked genes, IPGS allows immediate biological interpretation because it only includes coding variants. Furthermore, as opposed to PRS, IPGS relies on both polymorphisms and rare variants is capable of differentially weighting features in an indirectly proportional way in respect to frequency and therefore to protein impact. Each patient indeed is assigned both a number All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted September 14, 2021. ; and the list of her/his common and low-frequency polymorphisms relevant to COVID-19 supported by medically actionable information and of rare and ultra-rare variants conferring either risk of severity or protection from severe disease. Drawing on the entire picture presented through IPGS analysis, personalized adjuvant therapy could be envisaged. At the time of writing, a platform trial based on genetic markers is being discussed with the Italian Medicines Agency (EudraCT Number: 2021-002817-32).
Within 25 reported genomic SNPs demonstrably related to COVID-19 susceptibility/severity, 5 were reported to be in LD with coding variants [9,39]. The model presented here might provide useful information for uncovering the identity of the gene/coding variants responsible for COVID-19 susceptibility/severity linked to these genomic SNPs (Fig. 2D). For example, on chromosome 12, the genes mapping to the locus tagged by rs10774671 [9] are both OAS1 and OAS3. In OAS3 the coding variant is an Arginine to Lysine substitution (rs1859330) in high LD This innovative approach allowed us to better select genes located on the X chromosome related to COVID-19 that affect males and females in opposite ways ( Fig. 2A and Supplementary Tables   3 and 6). Interestingly, many of these genes were previously confirmed or hypothesized to escape the X chromosome inactivation. With respect to these genes, females produce twice the levels of protein in comparison with males. Mutations in hemizygous state in males and heterozygous state in females appear silent until SARS-CoV-2 infection occurs. For example, TLR7 and TLR8 are All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted September 14, 2021. ; selected for ultra-rare and associated with severity in males and with protection from severe disease in heterozygous females. We know that the activation of TLR7/8 induces the production of type 1 and type 2 IFN as well as pro-inflammatory cytokines, where the production defect in hemizygous males leads to severe COVID-19. However, an excess of the sensor can also lead to damage from hyperinflammation. Therefore, the condition of carrier females is the more favorable state and has in fact been associated with mild COVID-19 [41] .
Pathway analysis pointed to the relevance of obvious actors in COVID-19 pathology, such as immune cells and interferon signaling, but also to the important role of specific organs (brain, digestive tract, kidney, reproductive system) and functions (metabolism of lipids and steroids). The Furthermore, the IPGS is significantly associated with severity, showing an OR of 2.46 after adjustment for age, sex, and comorbidities. This indicates that IPGS is a novel prognostic factor that should be considered in the management of COVID-19 patients.
All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in Modelling precisely the role of the entire range of host genomics affecting disease susceptibility and severity in COVID-19 is critical to obtaining a complete biological understanding of the aetiology and pathogenicity of COVID-19 as well as other severe complex diseases. The application of IPGS based on Machine Learning principles within a post-Mendelian model allows us to more precisely identify the gene variants at play in COVID-19 as well as their specific roles, individually and in combination. This deep dive into the genetic architecture that allows for, contributes to, or even helps prevent diseases while increasing or decreasing their impact is critical for, and directly translatable into, (personalized) medicines development as well as prevention and treatment protocols. An integrated modelling of genetic variants based on a limited patient cohort, even limited in its geographical spread, may be sufficient for the development of diagnosis, and therapeutics across a wider range of populations. The advantage of this IPGS post-Mendelian model is that it learns and continues to learn as well as being a model from which we can obtain insights on the fundamental architecture of human genomics when confronted with severe and complex diseases.
All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in

2825-2830 (2011).
All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in  Fibros. 18(1),127-134 (2019    = contributing to COVID-19 mildness. Pink faces= contributing to females only; Blue faces = contributing to males only; Pink/Blue faces = contribution in both sexes. In parentheses: AD=autosomal dominant inheritance; AR=autosomal recessive inheritance; XL=X-linked recessive inheritance. A) ultra-rare mutations in the RNA sensor TLR7, TLR3, and TICAM1 (encoding TRIF protein), already reported associated with XL, AR and AD inheritance [10][11][12][13] impair interferon (IFNs) production in innate immune system cells. Mutations in TLR8, as well as of the signal transducer IRAK1 also impair interferon production. The specific location of TLR7/8 and IRAK1 (on the X chromosome) as well as X-inactivation escaping are responsible for opposite effects in males and females. Mutation in RNASEL impair the antiviral effect of the gene. In lung epithelial cells, ACE2 ultra-rare variants (on the X chromosome) exert protective effects (probably) due to lowering virus entrance, while ultra-rare variants in ADAM17 (might) reduce the shedding of ACE2 and induce a severe outcome. The same is true for CFTR and SCNN1A (encoding ENaCA protein and involved in a CFTR-related physiological pathway), and the lipid transporter ABCA3 [44]. Mutations of ADAMTS13 in vessels reduce the cleavage of the multimeric von Willebrand Factor (VWF), leading to thrombosis; B) Rare variants of the estrogen regulated TLR5 are associated with severity in females. Rare variants of the CFTR-related SLC26A9 are associated with severity in both sexes. This ion transporter has three discrete physiological modes: nCl(-)-HCO(3)(-) exchanger, Cl(-) channel, and Na(+)-anion cotransporter. Other examples of rare mutations associated with severity are the NK and T cell receptor FCRL6, IFN signal transducer IRAK2, and the actin depolymerization MICAL2; C) low-frequency variants in another CFTR-related gene, SCNN1D (encoding for ENaCD protein) are associated with mildness, while rare variants in the following genes are associated with severity: cargo protein SPMA6, vesicle formation PEX1, inflammatory protein NOD2 (CARD15); D) A number of coding polymorphisms, indicated with an asterisk, are in LD with genomic SNPs already associated with COVID-19 (The complete list is presented in Supplementary Table 11) [8,37]. In some cases, such as the case of All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted September 14, 2021. ; SFTDP, the genomic SNP is the coding polymorphism itself. Of note are the genes of surfactant proteins associated with severe disease: SFTDP gene encoding for SP-D protein and SFTPA1 gene encoding for SP-A protein; the signal transducer, PPP1R15A gene encoding for GADD34 protein. OAS1 and OAS3 related to RNA clearance of RNASEL (reported in panel A as having ultra-rare mutations; included here should also be the already reported TLR3412 [17]; the already reported SELP603 related to thrombosis [18]. Note: OAS1 haplotype A= c.1039-1G>A, (p. (Gly162Ser)), (p. (Ala352Thr)), (p. (Arg361Thr)), (p. (Gly397Arg)), (p. (Thr358Profs*26)). OAS1 haplotype B = haplotype without the variant combination in haplotype A.

A)
Workflow of the analysis. Genes corresponding to Boolean features found to be associated at least once were ranked based on a composite score and subjected to Gene Set Enrichment Analysis. Two separate ranked gene lists for females (7,317 genes, weight range 3x10-5-561) and males (7,325 genes, weight range 7x10-5-452) were used. The list of significant pathways was analysed and presented as a similarity network: B) Similarity network of the pathways with a significant enrichment both in females and males (p<0.01). The size of the circles is proportional to the pahway size. Significance above threshold is indicated by the red color. C) Similarity network of the pathways with a significant enrichment either in females (red left half of the circles) or males (red right half of the circles) (p<0.005). D) Heatmaps of the genes belonging to a selection of pathways of interest. The color gradient represents the weight of each gene, calculated and described in methods. Please note high ranking of TLR genes (TLR5, TLR8, TLR3 and TLR7) in the pathway of Response to Mechanical Stimulus, CFTR gene in Recognition for Clathrin-mediated endocytosis, RNASEL, TYK2, OAS1 and OAS3 genes in Interferon alpha-beta signaling. Note also the presence of the relevant pathway of Exhaust vs Memory CD8 T cell Up that also includes TLR7 gene.

Figure 5. Model predictivity
A) The post-Mendelian model was trained using a sample of 466 patients from the GEN-COVID cohort n.2 and Swedish cohort (having cases only) and tested with three additional European cohorts from UK, Germany and Canada. B) A logistic regression model was used for severity prediction. Severity was defined mainly on the basis of hospitalisation versus not hospitalisation. Hospitalised cases without respiratory support were included in controls. TN=True Negative; TP=True Positive; FN=False Negative; FP=False Positive. C) When the IPGS is added to age and gender as a regressor, the performances of the model increase: accuracy +1%, precision +1%, sensitivity +2%, specificity +1%. These increases are statistically significant (p-value <0.05 for accuracy, precision, sensitivity and specificity) with respect to the null distribution obtained by All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted September 14, 2021. ; randomizing the IPGS. The performances of the model built with IPGS alone are all above the random guess. In addition, on the right we reported the distributions of the IPGS for severe and non-severe patients. D) In the three tested cohorts, when the IPGS is added to age and sex as a regressor, all the performances increase: the accuracy up to +2%, the precision up to +1%, the sensitivity up to +3%, and the specificity up to +2%. We conclude that IPGS is able to improve prediction of clinical outcome in addition to the well-established powerful factors of age and sex. E) The univariate logistic regression models fitted on the cohort including both train and test, confirmed that the IPGS is associated with severity with an odds-ratio (OR) of 2.32, while age (continuous in decades) and sex have an OR of 1.89 and 2.99, respectively.

Figure 6. Clinically interpretability of IPGS.
Panel A shows the GEN-COVID cohort dendrogram and heatmaps of the probabilities of severity based on the 3 different models: sex-age alone, IPGS alone and combined model. In the extreme ends of dendrogram (left and right) the probability of severity based on sex-age alone and IPGS alone is highly discordant (different colours). Selected examples corresponding to the arrows are illustrated in panels B-G. In each panel IPGS score, probabilities of severity and key features useful for bedside clinical management are shown. B) Male patient, in the 46-50 age range, treated with CPAP ventilation, tocilizumab, enoxaparin, hydroxychloroquine and lopinavir/ritonavir; no comorbidities except for asthma have been reported. The patient presented a rare TLR7 mutation that leads to an impaired production of interferon gamma [12]. C) Male patient, in the 51-55 age range, treated with invasive mechanical ventilation, steroids and enoxaparin. He had among comorbidities obesity, anxiety, hypertension and cerebral ischemia. He was found to be homozygous for the SELP rs6127 (p.Asp603Asn). Homozygosity of Asparagine in position 603 of Selectin P makes this endothelial proteine more prone to clot formation and male patients more prone to COVID-19 thrombosis [18]. Hence the rationale for considering as putative adjuvant therapy in the management of similar cases the anti Selectin P antibodies, a drug already approved for vascular events of sickle cell anemia. D) Male patient, in the 51-55 age range, treated with CPAP ventilation, tocilizumab, steroids, enoxaparin, hydroxychloroquine and lopinavir/ritonavir; no comorbidities except for diabetes. He was found to have the androgen receptor polyQ repeats >23. The regular function of the androgen receptor is correlated with a beneficial immunomodulatory effect in those male patients in whom the increase in testosterone levels may overcome the receptor resistance. The rationale is to consider giving testosterone to those male subjects who cannot, on their own, raise the levels enough to overcome the receptor resistance due to poly-glutamine stretch longer than 23 repeats [16]. E) Female patient, in the 31-35 age range, treated with CPAP ventilation and steroids, enoxaparin and azithromycin; no comorbidities except for hypothyroidism. She was a carrier of an ultra rare mutation in ADAMTS13. Impaired function of ADAMTS13 leads to reduced cleavage of von Willebrand factor (vWF) and enhanced clot formation. The effect is enhanced in females and responsible for SARS-CoV-2 related thrombosis. Anti vWF immunoglobulines would be a putative therapeutic option to consider in similar cases. F-G) examples of low IPGS and related key features. F) Male patient, in the 81-85 age range, treated with low-flow oxygen. No information regarding pharmacological therapy during hospitalization is present. Among comorbidities: diabetes mellitus, congestive heart failure and bowel cancer and steroids. He presented an ultra rare mutation in ACE2. G) Male patient, in the 86-90 age range, treated with low-flow oxygen, steroid, enoxaparin and ceftriaxone plus azithromycin. Among his comorbidities: colon diverticulosis with constipation?, benign prostatic hyperplasia?, anxious-depressive syndrome, sideropenic anemia. He was a carrier of an ultra rare mutation in AGTR2. All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in Heatmaps of the genes belonging to representative pathways significant in either females or males, p<0.005. The color gradient represents the weight of each gene, calculated as described in methods.  perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Supplementary
preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in