Introduction

Since its outbreak in late 2019, the coronavirus disease 2019 (COVID-19) pandemic has seen a considerable improvement, with a decrease of detected cases and mortality worldwide (Coronavirus Disease (COVID-19) Situation Reports n. d.). However, patient management can still be challenging since the clinical course can be highly heterogeneous, with a wide spectrum of biological responses and clinical manifestations, going from mild respiratory symptoms, to lung injury and pneumonia, and, in more severe and critical cases, to multiple organ failure and death (Osuchowski et al. 2021).

From an immunological point of view, severe COVID-19 and poor prognosis associate with a specific profile showing an exaggerated immune response, characterized by a “cytokine storm”, with high levels of IL-6 and IL-10 among others, and by a dramatic changes in blood cell sub-populations, with decreased lymphocytes, T-cell subsets, eosinophils, and platelets, and increased neutrophils and neutrophils-to-lymphocytes ratio (Chen et al. 2020; Carissimo et al. 2020; Martín-Sánchez et al. 2021; Hadjadj et al. 2020; Mann, et al. 2020; Laing et al. 2020; Ahern et al. 2022). Longitudinal studies analysing the leukocyte transcriptome of COVID-19 patients have described a molecular profile characterized by robust overrepresentation of interferon-related gene expression, marked decrease of transcriptional levels of genes contributing to general protein synthesis and bioenergy metabolism, and dysregulated expression of genes associated with coagulation, platelet function, complement activation, and TNF/IL-6 signalling (Ahern et al. 2022; Gill et al. 2020; Daamen et al. 2022; Yan et al. 2021; McClain et al. 2021).

Early identification of patients at risk of progressing to severe disease is essential for patient management and for the administration of specific and individualized therapies, in order to improve patient outcome and to optimize the allocation of healthcare resources. Several laboratory biomarkers, including parameters related to increased inflammatory response (such as lymphopenia, neutrophilia, raised C-reactive protein), have been associated with COVID-19 severity, hospitalization, intensive care unit admission, and mortality (Zakeri et al. 2021). Moreover, a higher risk of severe progression and mortality is related to individual factors, including older age, male gender and the presence of diverse comorbidities, such as overweight and obesity, chronic lung disease, immune depression, diabetes, hypertension and malignancy (Li et al. 2021). Attempts to build a clinical predictor score to estimate an individual hospitalized patient’s risk of developing critical illness has been made, yet with limited accuracy (Lombardi et al. 2021; Liang et al. 2020). Up to now, it is not clear whether other markers, such as molecular classifiers built on blood-based features, can improve the early prediction of this risk. Indeed, longitudinal studies on blood transcriptome and proteome have identified strong signatures associating with COVID-19 severity (Buturovic et al. 2022; Ng, et al. 2021; Kwan et al. 2021; Lee et al. 2021). Whether these markers can be used for early prognostic prediction is not known.

Here, we performed an ancillary cross-sectional study within the prospective longitudinal COVIDeF cohort, focusing on the time of first clinical evaluation of patients with mild COVID-19 pneumonia referred to Assistance Publique-Hôpitaux de Paris hospitals (Paris, France) during the first pandemic wave. We analysed the whole blood transcriptome in order to identify an early signature predicting severe progression of the disease. A control cohort of patients with non-COVID-19 pneumonia was included to warrant the specificity of the COVID-19 prognostic signature.

Results

Cohort presentation

A total of 159 samples, collected from patients enrolled during the first COVID-19 outbreak (April-June 2020) and presenting mild pneumonia at the moment of first clinical evaluation, were analysed. They included 115 patients with COVID-19 diagnosis and 44 patients with pneumonia not related to COVID-19. Among the 115 patients diagnosed with COVID-19, 11 evolved towards severe pneumonia, 51 towards intermediate pneumonia and 53 towards mild pneumonia (Supplementary Table S1).

In line with established risks factors (Li et al. 2021; Bergantini et al. 2021), the risk of evolution towards severe pneumonia was associated with age, diabetes, temperature, C-reactive protein (CRP), procalcitonin, fibrinogen, neutrophils and lymphocytes at inclusion, oxygen saturation and CT scan findings (Table 1).

Table 1 Cohort presentation, descriptive statistics and group comparisons. *Comparison of COVID-19 patients and a group of controls presenting non-COVID-19 pneumonia, using Wilcoxon and Fisher’s tests for quantitative and qualitative variable comparisons, respectively; **Comparison of COVID-19 patients depending on their outcome (towards mild, moderate or severe pneumonia), using Kruskal–Wallis’s and Fisher’s tests for quantitative and qualitative variable comparisons, respectively. §Neutrophils and lymphocytes scores, reflecting the scaled relative fractions inferred from transcriptome data. #For hospitalisation duration, patients with fatal outcome were excluded. BMI body mass index, CRP C-reactive protein, CT scan computed tomography scan, ECMO extracorporeal membrane oxygenation, HFNO high-flow nasal oxygen, IMV Invasive Mechanical Ventilation, NIV non-invasive ventilation, PCR Polymerase Chain Reaction. Median values were reported (range values in brackets)

Unsupervised transcriptome-based classification of samples

Unsupervised principal component analysis of the whole transcriptome dataset discriminated patients with COVID-19 from controls (first principal component PC1), and patients who would later develop a severe COVID-19 pneumonia from those with mild or intermediate evolution (second principal component PC2, Fig. 1a). Over-representation analysis of the top genes most contributing to PC1 showed an enrichment in signalling pathways mainly related to neutrophil activation, while the top 100 genes most contributing to PC2 were enriched in signalling pathways related to immune response to virus infection, including complement activation, regulation of humoral immune response, response to type I interferon, and regulation of viral genome replication (Supplementary Table S2). Consistently, the most contributing gene to PC1 was CD177, a marker of neutrophil activation. For PC2, the most contributing genes were IFI27, involved in type I interferon cell response, and OTOF, both over-expressed in COVID-19 patients and associating with the severity of evolution (Fig. 1b).

Fig. 1
figure 1

Global blood transcriptome discriminates patients depending on the type of pneumonia and the evolution of COVID-19 pneumonia. a) Sample projections based on the combination of the first two principal components (PC1, PC2) of unsupervised PCA performed on the whole dataset (n = 16001 genes, n = 159 samples). The center of each group is indicated by the larger circles. b) Boxplot of CD177, IFI27 and OTOF gene expression in the different group of analysis. *Student’s T-test p-value < 0.05, **Student’s T-test p-value < 0.001; ***Student’s T-test p-value < 10e-6

Considering the variability possibly due to blood cell composition, we inferred the score of different blood cell subtypes for each sample (Supplementary Table S3), then we evaluated the impact of cell composition on sample classification. Globally, compared to controls, COVID-19 samples showed lower neutrophils and higher lymphocytes and lymphocyte subtypes (Table 1, Supplementary Table S4). Indeed, inverse correlation was observed between neutrophils and global lymphocytes proportion, and between neutrophils and lymphocyte T CD8 + and CD4 + memory resting subtypes in particular (Supplementary Figure S1a). However, the variability due to the global blood formula alone couldn’t properly discriminate COVID-19 patients in terms of pneumonia evolution (Supplementary Figure S1b).

Blood early transcriptome signature of COVID-19 pneumonia

By comparing COVID-19 samples (n = 115) to controls (n = 44), and after adjustment on age and blood cell composition, we identified 68 differentially expressed genes (Benjamin-Hochberg adjusted p-value < 0.05 and a logFC > 1.5; Supplementary Table S5), mostly over-expressed in COVID-19 (n = 52/68). Gene ontology analysis of these over-expressed genes in the COVID-19 samples showed an enrichment in pathways related to virus response mainly involving type I interferon signalling (Fig. 2a and b; Supplementary Figure S2; Supplementary Table S6).

Fig. 2
figure 2

Differentially expressed genes in early COVID-19 pneumonia. a) Volcano plot of the top differentially expressed genes in COVID-19 (n = 115) versus controls (n = 44). b) Dot plot of the 10 most GO enriched signalling pathways of the differentially over-expressed genes in COVID-19 samples versus controls

Blood early transcriptome signature of future severe COVID-19 pneumonia

In patients with early COVID-19 pneumonia, blood samples were compared between those evolving towards severe pneumonia (n = 11) and those remaining mild (n = 53). After adjustment on age and blood cell composition, we identified 345 differentially expressed genes (Benjamin-Hochberg adjusted p-value < 0.05 and a logFC > 1.5; Supplementary Table S7). The enriched signalling pathways were represented by a response to virus infection involving a response to type I interferon, as assessed by GSEA analysis (Fig. 3a and b; Supplementary Figure S3; Supplementary Table S8).

Fig. 3
figure 3

Differentially expressed genes in patients with early COVID-19 pneumonia evolving towards severity. a) Volcano plot of the differentially expressed genes in patients with severe (n = 11) versus mild (n = 53) evolution. b) Dot plot of the top 10 activated and the top 10 suppressed signalling pathways enriched for the differentially expressed genes in patients with future severe COVID-19 pneumonia. c) Venn diagram representation of differentially expressed genes in patients with severe COVID-19 pneumonia in our study (in red) and in the studies of Wang et al. and Jackson et al. (in orange) (Jackson et al. 2022; Wang et al. 2022). d) Venn diagram representation of differentially expressed genes in patients with severe COVID-19 pneumonia in our study (in red) and in patients with severe Influenza infection in the studies of Zerbib et al. and Dunning et al. (in green) (Zerbib et al. 2020; Dunning et al. 2018)

We then tested how similar our early transcriptome signature of future severe COVID-19 pneumonia was with the longitudinal signature of evolution from mild towards severe COVID-19 pneumonia. For that aim, we analysed the overlap of our signature with published signatures reflecting the longitudinal evolution of blood transcriptome towards severe COVID-19 pneumonia (Jackson et al. 2022; Wang et al. 2022). Similarities were observed, including increased expression of CD177, IFI27 and OTOF (Fig. 3c; Supplementary Table S9). Of note, CD177 was also common when studying the overlap between our early signature and differentially expressed genes associated with the evolution towards severe Influenza infection (Fig. 3d; Supplementary Table S10), another virus infection, underlining the importance neutrophil induction beyond interferon activation in virus infections (Zerbib et al. 2020; Dunning et al. 2018).

Early prediction of severe forms of COVID-19

To select a limited set of genes predicting severity of COVID-19, we trained an Elastic Net-penalized linear model on the sub-cohort of COVID-19 mild pneumonia patients with severe or mild evolution of the disease (n = 11 and n = 53, respectively), starting from the 2500 most variable genes. Forty-eight genes were selected (Supplementary Table S11), properly discriminating severe from mild evolution of COVID-19 pneumonia in the training cohort. Of note, patients with intermediate COVID-19 pneumonia outcome—not used for the training—were scattered between patients with mild and severe COVID-19 pneumonia outcome and were referred to as a grey zone. Using receiver operated characteristic (ROC) curve analysis, optimal thresholds (-0.02 and 7.69) were identified on the first component of the 48-genes principal component analysis projection (Supplementary Figures S4-6). Using an independent validation cohort of 77 patients (28 with severe outcome, 23 with intermediate outcome and 26 with mild outcome), we could confirm the classification performance (Fig. 4). Sensitivity, specificity and accuracy for predicting severe outcome were 0.64, 0.91 and 0.81 respectively (Supplementary Table S12). In a multivariate model combining the 48-genes predictor, age, sex and blood cell composition, the 48-genes predictor remained highly significant of severe outcome against mild outcome (logistic regression p-value < 0.001; Table 2). Of note, this signature was not discriminant between Covid and control patients (data not shown).

Fig. 4
figure 4

Classification of samples based on the 48 selected genes discriminating COVID-19 patients depending on pneumonia evolution. Samples projection based on the two principal components (PC1, PC2) of unsupervised PCA performed using the 48 genes selected by Elastic net regression on the training cohort. In faint circles are presented the samples from the training cohort, on which the optimization of gene selection was operated. In bright squares are presented the samples from the external independent validation cohort

Table 2 Multivariate model combining transcriptome, age, sex, and neutrophils predictors on COVID-19 pneumonia severity. Training and validation cohorts were combined. The evolution towards severe COVID-19 pneumonia was considered. OR = Odds ratio, CI = Confidential Interval

Post hoc analyses showed a positive correlation between the 48-genes predictor and the following biochemical variables at admission: C-reactive protein (r = 0.65, p = 2.603e-07), procalcitonin (r = 0.70, p-value = 2.073e-06) and fibrinogen (r = 0.62, p-value = 4.364e-05). Of note, diabetes, a well-established risk factor for the evolution towards severe COVID-19 pneumonia, was weakly correlated with the 48-genes predictor (r = 0.37, p-value = 0.0256).

Ability of the 48-gene predictor of severe outcome to monitor longitudinal evolution towards severe COVID-19 pneumonia

To explore the longitudinal performance of our transcriptomic signature, we computed our 48-gene predictor in a published cohort (Supplementary Fig. 7) (Wang et al. 2022). Globally, we found a positive correlation between the 48-genes predictor values and COVID-19 severity assessed by the WHO severity level (r = 0.53, p-value = 6.945e-16; Supplementary Fig. 8). For patients with 2 to 4 COVID-19 severity levels, the longitudinal evolution of the 48-genes predictor showed a decrease, while for patients with 6 to 9 COVID-19 severity levels, the longitudinal evolution was more variable (Fig. 5). Finally, among the 8 patients with a change in the COVID-19 severity level, with our 7.69 threshold, the 48-genes predictor was globally discriminating the patients with a worsening pneumonia (Fig. 5).

Fig. 5
figure 5

Longitudinal ability of the 48-genes predictor to monitor evolution of COVID-19 pneumonia towards severity. The red and blue curves represent the mean value of the 48-genes predictor over time in 13 patients with 6–9 and 14 patients with 2–4 WHO COVID-19 severity levels respectively. For 8 patients with a change of severity level during follow-up, individual values are provided (broken lines), with colours reflecting the severity level at each time point during follow-up. The dashed horizontal lines indicate the 48-genes predictor thresholds established on the training cohort

Discussion

In this study focusing on patients with early-stage COVID-19 pneumonia, we identified a blood transcriptome signature predicting the risk of evolving towards a severe pneumonia. This signature could help to improve patients’ management, by proposing specific surveillance and treatments. This signature corresponds to a differential inflammatory profile between patients with severe and mild outcome, with the implication of humoral immune response, complement activation and interferon signalling pathway. This signature includes the over expression of inflammation markers previously reported in patients with severe COVID-19 pneumonia, such as IFI27 (Shojaei et al. 2023) or CD177 (Jackson et al. 2022; Wang et al. 2022; Lévy et al. 2019; An et al. 2021; An et al. 2022).This signature shows a gradient of expression from mild to intermediate and severe forms. Thus, the inflammatory signature observed in patients with severe COVID-19 pneumonia seems to be present early in the course of the disease, and to reflect the risk of developing a severe outcome.

Based on this observation we designed a predictor, with optimal selection of transcriptome biomarkers able to classify patients depending on their COVID-19 pneumonia evolution. This predictor could be validated on an independent validation cohort, where patients were evaluated longitudinally. However, with an accuracy of 0.81 the prediction of evolution towards a severe COVID-19 pneumonia is not absolute. This relates to the limited sensitivity in the validation cohort, with some COVID-19 patients with severe outcome grouped with patients with intermediate outcome. Nevertheless, taking into account the importance of both identifying early patients at risk of evolving towards a severe pneumonia and optimizing healthcare resources, in the validation cohort, none of the patients with severe COVID-19 pneumonia outcome were predicted as mild and only one patient with an actual mild COVID-19 pneumonia outcome was predicted as severe. In addition, the clinical criteria for classifying COVID-19 patients in the validation cohort, and the time of sampling during the course of the disease may have impacted the evaluation of accuracy on the validation cohort. Another limitation of the prediction of outcome is the broad distribution of patients evolving towards intermediate forms of pneumonia, with some overlap with patients evolving towards mild pneumonia -mainly in the training cohort-, and with those evolving towards severe pneumonia -mainly in the validation cohort. Though being important, the proper evaluation of intermediate patients is difficult for two reasons. Firstly, no clear clinical definition is available, with variable diagnostic criteria and thresholds to define intermediate Covid pneumonia (Jackson et al. 2022; Wang et al. 2022). Including these intermediate patients would have increased the size of the training cohort, but also the risk of misclassification. Limiting the inclusion to “mild” and “severe” classes warranted well-defined labels for training the predictor. Secondly, the cohort size is not large enough to assess the existence of statistically significant thresholds in the 48-genes predictor in patients with intermediate COVID-19 pneumonia. However, in our study, these intermediate patients indisputably fall in-between patients with mild and severe outcome as a continuum and they probably represent patients who deserve a closely clinical surveillance thus globally strengthening the validity of transcriptome prediction. This grey zone may also reflect a biological variability, and thus a certain limitation of the ability to predict outcome with this technique. Another potential issue in this study is the risk of overfitting due to the limited number of patients and the high number of features. This risk was mitigated by the cross-validation strategy in the training cohort, and the validation in two independent cohorts. Finally, a post hoc association between the 48-genes predictor and clinical and biochemical variables showed several significant associations. However, the prognostic value of these associated variables could not be tested in the final multivariate model due to the limited cohort size, and to the limited data available in the independent validation cohort.

This prognostic signature appears specific to COVID-19 patients, compared to controls (patients with non-COVID-19 pneumonia). Compared to controls, COVID-19 patients present a transcriptome signature reflecting pathways related to virus response mainly involving type I interferon response. Type 1 interferon activation has been well established when COVID-19 progresses towards severe pneumonia in longitudinal series, with several markers reported including IFI27, SIGLEC1, OAS1/2, IFI44, IFI44L, ISG15 (Shaath et al. 2020; Krämer et al. 2021; Masood et al. 2021; Khorramdelazad et al. 2022; Xu et al. 2022). Another potentially relevant gene identified here is OTOF, associated with inflammation and described as a type I IFN-induced effector (Roberson et al. 2022; Ding et al. 2022). Our results show that interferon type 1 activation occurs early in the course of COVID-19 pneumonia. This signature could contribute to diagnose the SARS-CoV-2 infection.

In addition, beyond COVID-19 pneumonia, to which extent this type 1 interferon signature is systematically present in SARS-CoV-2 infected patients remains to be established, especially in patients with mild or asymptomatic forms of the disease.

This study comes after several publications showing the implication of type I interferon in the evolution towards severe COVID-19 pneumonia. However, contrasting with a vast majority of previous works, this study is focusing on early-stage patients, when COVID-19 pneumonia is still mild. The clinical characterization is quite extensive, and the follow-up well documented, enabling a proper classification in terms of outcome. This original design was required for demonstrating the existence of a signature predicting the outcome. Of note, the inclusion of patients is restricted to the first outbreak wave in the training cohort. To which extent do these signatures stand with the new SARS-COV2 variants remains to be established. In addition, the relative proportion of severe pneumonia in the subsequent outbreak waves decreased, in the context of the progressive immunisation of population through vaccines and history of COVID-19 infections. However, severe pneumonia still occurs, and disease complications are still challenging to predict at individual levels. The early transcriptome signature proposed here may help improving this challenging detection.

In conclusion, whole blood transcriptome is able to early predict the outcome of COVID-19 pneumonia. This discrimination mainly relies on type 1 interferon activation, along with other immune alterations, which are already present at an early stage of the disease in patients later developing a severe pneumonia.

Methods

Patients and samples

A total of 159 patients presenting an early-stage pneumonia at the moment of first clinical evaluation at the hospital were recruited prospectively between April and June 2020 in Assistance Publique-Hôpitaux de Paris hospitals (Paris, France), as part of the multicentre longitudinal COVIDeF cohort (NCT04352348). Early-stage COVID-19 pneumonia was defined as confirmed SARS-Cov-2 infection, requiring at the admission supplemental oxygen but not ≥ 6 L/minute, or characterized by oxygen saturation < 95% or by the presence of one or more pneumonia morphologic criteria (CT scan or chest X-ray). COVID-19 diagnosis was made for 115 patients, based on positive SARS-CoV-2 PCR and/or serology test (n = 79), and/or on the presence of typical clinical symptoms with CT findings (n = 36). Whole blood samples for transcriptome analysis were collected at the moment of first clinical evaluation and only patients with an early-stage pneumonia were included. Disease evolution was evaluated on a time lap of 14 days after inclusion, and patients’ status was classified as mild (n = 53), intermediate (n = 51) or severe (n = 11) pneumonia, according to the onset and evolving severity of COVID-19 complications, including hospitalization duration, need for oxygen supply, mechanical ventilation, or extra corporeal oxygenation, death. Specifically: i) patients who didn’t need oxygen supply, either hospitalized or not, were classified as having mild progression COVID-19 pneumonia; ii) patients who were hospitalized with standard oxygen supply were classified as having intermediate progression COVID-19 pneumonia; iii) patients who were hospitalized and received mechanical ventilation, or extra corporeal oxygenation, or deceased, were classified as having severe progression COVID-19 pneumonia. Cases of pneumonia of other etiology (n = 44), used as controls, corresponded to patients with negative PCR and not diagnosed as COVID-19 by the physician based on CT findings.

RNA collection and extraction

Whole blood samples were collected into PAXgene tubes (PreAnalytiX, Hombrechtikon, Switzerland), following the manufacturer’s instruction. Total RNA was extracted on a QIAcube extractor, following the manufacturer’s instruction (Qiagen, Hilden, Germany), at the CRB platform (Saint Antoine hospital, Paris). Quantification and quality control of RNA were performed on a 2100 Bioanalyser System (Agilent Technologies, Inc., Santa Clara, CA, US). All samples passed the integrity quality control (RIN > 7).

Transcriptome data generation

Quantification and quality control of nucleic acids was performed by capillary migration on a Fragment Analyzer (Agilent Technologies, Inc.). Starting from 100 ng total RNA, mRNAs poly(A) were selected using oligo dT magnetic beads (NEBNext® Poly(A) mRNA Magnetic Isolation Module, New England Biolabs, Ipswich, MA, US), fragmented at around 300 bp and converted to oriented DNA (NEBNext® Ultra™ II RNA First Strand Synthesis Module & Directional RNA Second Strand Synthesis Module, New England Biolabs). Size selection and purification were performed using magnetic beads (Sera-Mag magnetic beads, GE Healthcare, Chicago, IL, US), and libraries were prepared (NEBNext® Ultra™ II End repair/A-tailing Module & Ligation Module, New England Biolabs), amplified by PCR (KAPA Hifi HotStart ReadyMix, Roche, Basil, Switzerland), quantified by qPCR (NEBNext® Custom 2X Library Quant Kit Master Mix, New England Biolabs; QuantStudio 6 Flex Real-Time PCR System, Life Technologies, Carlsbad, CA, US) and the related size profile was analyzed by capillary migration on a Fragment Analyzer (Agilent Technologies, Inc.). Paired-end sequencing (twice 100 cycles) was performed by “sequencing-by-synthesis” technology on a Flow Cell S2 NovaSeq 6000 platform (Illumina, San Diego, CA, US). Transcript quantification was done using the Salmon tool (Patro et al. 2017) (v.1.4.0) on transcriptome reference from GENCODE (Frankish et al. 2021) (release 33—GRCh38.p13).

Bioinformatics analyses

Quality control was performed on raw count matrix. All samples passed this control. Counts were aggregated for transcripts corresponding to the same gene, and only genes with a count sum > 0 in all samples were further considered. Globin genes were also discarded, as previously published (Harrington et al. 2020).

Counts were normalized with DESeq2 (Love et al. 2014) (v.1.24.0). From gene counts, blood cell composition was inferred using the online CIBERSORTx tool (Stanford University 2022) (Newman et al. 2019), with the following parameters: B-mode batch correction, disabled quantile normalization, absolute mode, n = 500 permutations. For each cell types, a score is generated, that reflects the abundance of each cell type in a mixture. Given the high correlation of each blood cell type proportion with neutrophils proportion (Supplementary Figure S1a), neutrophils proportion was chosen as a unique proxy of cell blood composition.

The edgeR package (Robinson et al. 2010) (v.3.26.8) was used to read and pre-process the data before analysis: raw counts were converted to counts per million (CPM), and lowly expressed genes were removed using a CPM > 1 in at least 3 samples as cut-off, obtaining a final dataset of n = 16,001 genes and n = 159 samples. Normalization was then performed by using the trimmed mean of M-values (TMM) method (Robinson and Oshlack 2010), as implemented in the edgeR package. The same packages and method were used to process and normalize data from the validation cohort (Ahern et al. 2022) and the additional longitudinal cohort (Wang et al. 2022). Global data structure was assessed on log2-CPM of normalized data by unsupervised principal component analysis (PCA). This method was chosen for the interpretability of PCA axes for deciphering the biological meaning of the observed variability. Over-representation analysis of genes most contributing to PCA components was performed by using the clusterProfiler package (Wu et al. 2021) (v.3.12.0). Of note, the edgeR normalisation did not significantly modify the normalized expression levels compared to CIBERSORTx (gene expression correlation r = 0.999, p-value < 2.2e-16).

To remove heteroscedasticity of counts data, normalized data were transformed using the voom function (Liu et al. 2015) implemented in the edgeR package. Differential expression analysis was performed by applying linear modelling using the limma package (Ritchie et al. 2015) (v. 3.40.6), including the estimated neutrophils count and age as covariates in the model matrix. Differentially expressed genes were selected using a Benjamin-Hochberg adjusted p-value < 0.05 and a logFC > 1.5 as cut-offs. Gene set enrichment analysis (GSEA) of differentially expressed genes was performed using the clusterProfiler package.

For predicting COVID-19 severity from transcriptome, gene selection was performed on the sub-cohort of severe and mild cases, by fitting an Elastic Net regularized regression (α = 0.5) on the most variable genes (n = 2500), with a tenfold cross-validation, using the glmnet package (Friedman et al. 2010) (v. 4.1–1). The predictive model, combining 48 discriminating genes, was assessed on an independent validation cohort from the whole blood RNAseq dataset recently published by Ahern et al9. Seventy-seven samples were selected (26 mild, 23 intermediates and 28 severe COVID-19 pneumonia, based on classification criteria similar to those we used in our cohort), using the following criteria: confirmed COVID-19 diagnosis and last sample for the same patient with multiple sampling time points.

Another cohort was used to explore the ability of the 48-genes predictor to monitor the longitudinal evolution of patients (Wang et al. 2022). PCA using the 48-genes predictor was performed using the PCA weights from the training cohort.

Statistical analyses

Quantitative variable correlations were performed using Pearson’s test. Quantitative and qualitative variable comparisons between groups were performed using Kruskal–Wallis’s test and Fisher’s test, respectively. ROC curve analysis was performed and the optimal thresholds predictive of COVID-19 outcome were defined using Youden’s J index. A multivariate analysis was performed on the combined samples of the training and the external independent cohorts, using a logistic regression model on 4 variables: the 48-genes predictor (based on the PC1 coordinates of the 48-genes principal component analysis projection), the neutrophils score, and the patients’ age and sex. All p-values were two-sided and adjusted for multiple comparisons using Benjamini-Hochberg’ method. The level of significance was set at adjusted p-value < 0.05. All tests were computed in R software environment.