Background

In the wake of the COVID-19 pandemic, a remarkably high number of individuals worldwide have been affected. While the pathophysiology and immune response to SARS-CoV-2 (the virus causing COVID-19) infection have been extensively studied to predict acute disease progression and prognosis [1,2,3,4,5,6,7], there is limited knowledge about the long-term effects of the infection on the host. The persistence of clinical symptoms after the acute phase of infection is common to many viruses, e.g., Epstein–Barr virus (EBV) [8, 9], cytomegalovirus (CMV) [10, 11], and human herpesvirus 6 (HHV-6) [12], but in the case of COVID-19 this condition takes on great significance as it appears to affect a large percentage of individuals. The persistence of COVID-19 clinical symptoms for at least 12 weeks is referred to as post-acute sequelae of SARS-CoV-2 infection (PASC) or long-COVID [13], and it affects both individuals with severe acute symptoms and those with mild or asymptomatic disease progression [14]. These symptoms encompass organ (e.g., fatigue, post-exertional malaise, headache, insomnia, tachycardia) and neurological (e.g., brain fog, memory/speech/language issues, sleeping problems) manifestations [15, 16]. Accurately estimating the prevalence of this condition is highly challenging, primarily because the relevant studies are shaped by a wide array of variables, including age, gender, ethnicity, the severity of acute symptoms, the type of SARS-CoV-2 variant, follow-up duration, viral load, the presence of concurrent medical conditions, vaccination history, and preexisting social, economic, and medical factors, among others [17]. Among the most recent meta-analyses, the prevalence is reported between 30 and 60% [18,19,20,21,22]. Another meta-analysis revealed that COVID-19 long-term symptoms are slightly associated with older age and strongly associated with female sex and preexisting comorbidities (e.g., diabetes and obesity) [23].

The exact causes of long-COVID are still being investigated, but some hypotheses have emerged over time. SARS-CoV-2 can reach (via hematogenous spreading) and infect cells of the central nervous system (CNS), producing neuroinflammation [24]. It has also been hypothesized that the SARS-CoV-2 virus may persist in specific tissues long after the acute phase [13], leading to potential long-term health complications [25]. Disease risk factors include cell death and immune dysfunction after SARS-CoV-2 infection [26, 27], uncontrolled and persistent release of cytokines [28, 29], multiple cell fusion in infected organs (syncytia) [30, 31], autoantibodies causing immunodeficiency (against type I IFN) [32, 33] or microclots [33, 34], and persistent viral infection [35, 36].

The possibility that these dysfunctions are mediated over time by epigenetic changes has been explored through epigenome-wide association study (EWAS) approaches in a limited number of studies that identified specific epigenetic signatures in small cohorts of post-/long-COVID-19 patients [37,38,39]. Lee and colleagues [38] investigated DNA methylation changes in immune response-associated genes in post-COVID-19 patients (3 months after the acute phase), identifying the gene IFI44L (interferon-induced protein 44 like) as the primary target. IFI44L plays a critical role in antiviral and antibacterial activity. The study in [37] compared methylation changes in the acute phase and after one year, highlighting the persistence of pathways related to viral response and inflammation. The study performed by Nikesjo and colleagues [39] found a unique DNA methylation (DNAm) signature in PASC patients involving modified pathways related to angiotensin II and muscarinic receptor signaling and mitochondrial function.

Here, we present a genome-wide study using Illumina 850K EPIC BeadChips of a large cohort of ninety-six individuals whose blood samples were collected 6 months after COVID-19 infection. The follow-up examination has highlighted the presence of suggestive long-term clinical features in twenty-eight patients. This study aims to assess potential epigenetic changes 6 months after COVID-19 exposure.

Results

Characteristics of the sample population

Demographic and clinical features, including age, sex, and cellular components of subjects sampled 6 months after the initial SARS-CoV-2 infection (cases) and healthy subjects with no history of SARS-CoV-2 infection (references), are summarized in Table 1. The age and sex distributions of the sample population were analyzed to understand the demographic profile of the participants: The two cohorts differ statistically in terms of age distribution (Mann–Whitney test, p < 0.05) and sex ratio (Fisher's exact test, p < 0.05). The estimation of cellular components of peripheral blood obtained from the EpiDISH package [40] was evaluated to explore any variations in the composition of blood samples between the reference and the post-COVID-19 groups. The analysis showed no significant differences in immune cell composition.

Table 1 Clinical characteristics of study cohorts

Exploratory

We conducted an exploratory approach to investigate the epigenetic differences between the two groups (post-COVID-19 vs. reference) through a principal component analysis (PCA) both at the CpG site and region levels (genes, CpG islands, and promoter). The analysis (Fig. 1) did not reveal distinct patterns of epigenetic variation between the two groups since the patterns observed were almost overlapping, indicating a lack of solid differences in DNA methylation at CpG sites and regions between the two groups.

Fig. 1

Scatter plots of principal component analysis (PCA). Scatter plot distribution of samples and the first two principal components at a sites, b genes, c promoters, and d CpG islands

Differential methylation and over-representation (ORA) analysis

The differential methylation analysis between the two groups (post-COVID-19 vs reference) was performed using linear models for microarray data (LIMMA) [41]. Confounding factors (sex, chronological age, and cellular component estimations) [40, 42] were considered by adjusting for their effects in the analysis (see Methods for details). Limma results are available as Supplementary File 1. At the site level, the study revealed a set of 42 CpG sites overlapping 53 genes exhibiting significant differential methylation between the two groups (FDR < 0.05): 24 hypo-methylated (log2FC < 0; ∆β < 0) and 18 hyper-methylated (log2FC > 0; ∆β > 0) (Fig. 2).
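The two significance thresholds used at the site level (FDR and Bonferroni) can be illustrated with a short Python sketch. This is a minimal sketch under stated assumptions: the Benjamini-Hochberg procedure is assumed here because the text reports FDR without naming the procedure, the function names are hypothetical, and the p-values are toy numbers rather than study data. The number of tests in the Bonferroni example is the count of probes retained after quality control (see Materials and methods).

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff that controls the family-wise error at alpha."""
    return alpha / n_tests


# Toy p-values (hypothetical, for illustration only).
toy_pvalues = [0.01, 0.04, 0.03, 0.005]
toy_fdr = benjamini_hochberg(toy_pvalues)

# Bonferroni cutoff for 734,334 retained probes at alpha = 0.05.
cutoff = bonferroni_threshold(0.05, 734_334)
```

A site passes the FDR criterion when its adjusted value is below 0.05, and passes the stricter Bonferroni criterion when its unadjusted p-value is below the cutoff.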

Fig. 2

Circos plot visualizes the genomic distribution of differentially methylated sites throughout the human genome. The blue (hyper-methylated) and red (hypo-methylated) dots represent the genomic position of sites that have exceeded the statistical significance threshold (FDR < 0.05) and are spatially arranged according to the -log10 (unadjusted p-value). The solid red line indicates the FDR significance threshold, while the dashed red line represents the Bonferroni significance threshold. X and Y chromosomes are omitted from the analysis.

Genomic localization of the 42 deregulated CpG sites showed a significant enrichment at the level of CpG islands (OR = 3.426, p-value = 1.58 × 10⁻⁴). In comparison, 18 out of 42 sites (43%) are functionally related to promoter regions (TSS200, TSS1500, 5'UTR, and 1stExon) (data not shown).

Differentially methylated cytosines were annotated with their corresponding genes for functional annotation and prioritization (Supplementary File 1).

Over-representation analysis (methylGSA) revealed KEGG pathways associated with amino acid metabolism, including "Alanine, aspartate, and glutamate metabolism" (hsa00250) and "D-glutamine and D-glutamate metabolism" (hsa00471), albeit with nominal significance (Supplementary File 2).

Next, we focused on differentially methylated genes to determine their potential relevance to COVID-19 infection and to symptoms associated with post-acute sequelae of SARS-CoV-2 infection (PASC). Using relevant keywords (e.g., COVID-19, long-COVID, post-acute sequelae of SARS-CoV-2 infection, neuronal, inflammation, virus infection), we ranked genes based on their most suitable phenotype associations (VarElect) (Supplementary File 3). The top-rated genes include GLUD1 (glutamate dehydrogenase 1) (cg00167275; log2FC = 1.5, ∆β = 0.31, FDR = 1.66e-35), ATP1A3 (alpha 3 subunit of the Na+/K+ ATPase) (cg13628106; log2FC = -0.32, ∆β = -0.16, FDR = 3.2e-4), RNASEH2C (C subunit of ribonuclease H) (cg25294185; log2FC = -0.82, ∆β = -0.04, FDR = 1.96e-12), SMAD2 (SMAD family member 2) (cg05100634; log2FC = 0.51, ∆β = 0.02, FDR = 9.4e-7), TNIP1 (TNFAIP3 interacting protein 1) (cg22178392; log2FC = -0.18, ∆β = -0.03, FDR = 1.5e-2), PRKCI (protein kinase C iota) (cg18139307; log2FC = 0.18, ∆β = 0.31, FDR = 3.4e-6), and ARRB2 (arrestin beta 2) (cg10047026; log2FC = 0.24, ∆β = 0.025, FDR = 6e-3).

Considering the top 200 nominally significant genes, we found no significant enrichment of KEGG/PANTHER pathways (Supplementary File 4). However, at the gene ontology level, we identified a significant enrichment among cellular component (CC) terms, primarily involving GO terms related to the proper function and structure of the Golgi apparatus (GO:0000139, "Golgi membrane"; GO:0098791, "Golgi subcompartment"; GO:0044431, "Golgi apparatus part"; GO:0005794, "Golgi apparatus"; GO:0031984, "organelle subcompartment").

At the regional level, the differential analysis revealed significant epigenetic changes limited to a specific CpG island region on chromosome 6 (chr6:41,068,476–41,069,343) (adjusted p-value = 0.006, diff.meth = +1%). This region encompasses the 3' terminal portion of NFYA (nuclear transcription factor Y subunit alpha) and the first exon of the pseudogene ADCY10P1 (ADCY10 pseudogene 1).

Meta-analysis

To enhance the robustness and validation of our findings, we conducted a comprehensive literature search for analogous studies. After identifying the sole study with a similar design, we used a meta-analytical method to integrate the nominal p-values from our differential analysis of CpG sites with those obtained by Lee and colleagues [38], who compared COVID-19 positive and negative cases. The meta-analysis (Supplementary File 5) detected 13 significant (FDR < 0.05) CpG sites with identical directions of effect/deregulation across the two datasets. This analysis was conducted with the caveat that the two studies differ in the timing of sample collection post-infection (6 months vs 3 months). Twelve of the 13 CpG sites map to the following genes (in order of significance): DYRK2 (dual-specificity tyrosine phosphorylation-regulated kinase 2), ATP5PF (ATP synthase peripheral stalk subunit F6), B4GAT1 (beta-1,4-glucuronyltransferase 1), PRKXP1 (PRKX pseudogene 1), SLC25A21-AS1 (SLC25A21 antisense RNA 1), NRDC (nardilysin convertase), SMAD2 (SMAD family member 2), OSBPL6 (oxysterol-binding protein like 6), IFI44L (interferon-induced protein 44 like), PATJ (PATJ Crumbs cell polarity complex component), PRH1-PRR4 (PRH1-PRR4 readthrough), and UBAC2 (UBA domain containing 2). The thirteenth probe is located in an intergenic region and aligns simultaneously with two genes, ACAN (aggrecan) and ISG20 (interferon-stimulated exonuclease gene 20).
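The idea of combining nominal p-values from two studies while requiring concordant effect directions can be sketched as follows. The text does not name the meta-analytical method, so Fisher's combined probability test is used here purely as an illustrative stand-in, not as the method actually applied. The function names and the toy inputs are hypothetical.

```python
import math


def fisher_combined_p(p1, p2):
    """Combine two independent p-values with Fisher's method.

    The statistic -2 * (ln p1 + ln p2) follows a chi-square distribution
    with 4 degrees of freedom under the null hypothesis; for 4 degrees of
    freedom the upper-tail probability has the closed form
    exp(-x / 2) * (1 + x / 2).
    """
    half_stat = -(math.log(p1) + math.log(p2))
    return math.exp(-half_stat) * (1.0 + half_stat)


def same_direction(effect_a, effect_b):
    """True when both studies report the same sign of methylation change."""
    return effect_a * effect_b > 0


# Toy example (hypothetical values): a CpG that is hypo-methylated in both
# studies, with nominal p-values of 0.05 in each.
if same_direction(-0.8, -0.3):
    combined = fisher_combined_p(0.05, 0.05)
```

In a full analysis the combined p-values would then be corrected for multiple testing before applying the FDR < 0.05 criterion.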

Age acceleration

Methylation profiles were evaluated through Horvath's epigenetic clock tool to assess whether exposure to SARS-CoV-2 impacted biological age estimates. We observed a slight but significant epigenetic age acceleration (EAA) in post-COVID-19 patients across all available clocks [42,43,44,45] (Horvath, p-value = 3 × 10⁻³; Hannum, p-value << 0.05; PhenoAge, p-value = 8.8 × 10⁻⁴; SkinBlood, p-value = 2.7 × 10⁻³) (Fig. 3 panels A, C, D, E). In contrast, despite an increasing trend, no significant difference was found for the DNAm GrimAge predictor of lifespan (linear regression adjusted for covariates, p-value = 0.42) (Fig. 3B).

Fig. 3

Boxplots showing the distribution of age acceleration differences for different epigenetic clocks: A Horvath, B GrimAge, C Hannum, D PhenoAge, and E SkinBlood. The thick horizontal line in each box represents the median of the distribution, while the box itself represents the interquartile range (IQR). In the "ggplot" boxplot function, the whiskers extend to the data points located within 1.5 times the IQR from the box by default. Dots represent outliers (single values exceeding 1.5 IQRs). F Telomere length evaluation (DNAmTLadjAge)

Moreover, we assessed DNAmTLadjAge, a DNA methylation-based estimator of telomere length [46]. DNAmTLadjAge corresponds to the age-adjusted parameter that relates DNAmTL to chronological age. Negative values of DNAmTLadjAge indicate a DNAmTL that is shorter than expected based on age, while positive values indicate the opposite. Our analysis revealed that DNAmTLadjAge was significantly lower in post-COVID-19 patients compared to controls (linear regression adjusted for covariates, p-value << 0.05).
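The two age-related quantities discussed in this section can be sketched in a few lines of Python. This is a minimal sketch, assuming that age acceleration is taken as the difference between DNAm age and chronological age, and that the age-adjusted telomere measure is the residual from a simple linear regression of DNAmTL on chronological age. The function names and the toy numbers are hypothetical and are not the clock software's API or study data.

```python
def linear_fit(x, y):
    """Ordinary least-squares slope and intercept for y ~ x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def age_adjusted_residuals(age, measure):
    """Residuals of measure ~ age: negative means lower than expected for age."""
    slope, intercept = linear_fit(age, measure)
    return [m - (intercept + slope * a) for a, m in zip(age, measure)]


def age_acceleration_diff(dnam_age, age):
    """Difference between epigenetic (DNAm) age and chronological age."""
    return [d - a for d, a in zip(dnam_age, age)]


# Toy values (hypothetical): chronological ages and DNAm telomere estimates.
ages = [30, 40, 50, 60]
dnam_tl = [7.4, 7.0, 6.9, 6.3]
tl_adj_age = age_adjusted_residuals(ages, dnam_tl)
accel = age_acceleration_diff([52.0, 38.5], [50, 40])
```

Group differences in these quantities would then be tested with a covariate-adjusted linear regression, as described in the text.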

Epigenetic drift

The assessment of epigenetic drift was carried out through the identification of stochastic epigenetic mutations (SEMs) as described in [3, 47,48,49,50,51]. To detect SEMs, we first examined the distribution and variability of methylation levels in the control population for all the probes. A reference methylation range for each probe was generated using the formulas lower limit = Q1 − (3 × IQR) and upper limit = Q3 + (3 × IQR), where Q1 and Q3 are the first and third quartiles and IQR is the interquartile range. Methylation levels of cases falling outside this interval were identified as SEMs. Differences between cases and controls were then investigated using two distinct metrics of epigenetic drift: one examining the broader influence of SEMs (global epigenetic mutation load, Global-EML) and one assessing the burden of SEMs at the gene level (Gene-EML).
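The SEM detection rule described above can be sketched in Python as follows. This is an illustrative sketch rather than the pipeline used in the study: the function names, the dictionary-based data layout, the probe identifiers, and the quartile interpolation method are all assumptions, and the β-values are toy numbers.

```python
from statistics import quantiles


def sem_limits(control_betas, k=3.0):
    """Reference range for one probe: Q1 - k*IQR to Q3 + k*IQR in controls."""
    # The "inclusive" interpolation method is an assumption; other quartile
    # definitions give slightly different limits.
    q1, _, q3 = quantiles(control_betas, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def count_sems(sample_betas, control_betas_by_probe, k=3.0):
    """Count hyper- and hypo-methylated SEMs for one sample.

    sample_betas: dict mapping probe id to the sample's beta value.
    control_betas_by_probe: dict mapping probe id to a list of control betas.
    """
    hyper = hypo = 0
    for probe, beta in sample_betas.items():
        lower, upper = sem_limits(control_betas_by_probe[probe], k)
        if beta > upper:
            hyper += 1
        elif beta < lower:
            hypo += 1
    return {"hyper": hyper, "hypo": hypo, "total": hyper + hypo}


# Toy data with hypothetical probe ids.
controls = {
    "cg_a": [0.10, 0.12, 0.11, 0.13, 0.09],
    "cg_b": [0.80, 0.82, 0.81, 0.79, 0.83],
}
case_sample = {"cg_a": 0.60, "cg_b": 0.81}
load = count_sems(case_sample, controls)
```

The per-sample total corresponds to the global load, while summing SEMs over the probes annotated to a gene would give a gene-level burden.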

In the case cohort, the median number of SEMs (Global-EML) was 488.5 (IQR 373–922), whereas in the reference group it was 369 (IQR 314–520.5). A multiple regression model that accounted for sex and principal component covariates, which included age and cellular components, confirmed a significant increase of SEMs in the post-COVID-19 group compared to the reference group (estimated increment of the log10-transformed value = 0.1667, p-value = 3.74 × 10⁻⁶), as shown in Fig. 4. The increase is confirmed even when SEMs are split into hyper-methylated and hypo-methylated categories (data not shown).

Fig. 4

Boxplot showing the distribution of SEMs in Reference and post-COVID-19 groups. The thick horizontal line in the box represents the median of the distribution, while the box represents the interquartile range. By default, in the "ggplot" boxplot function, the whiskers extend to the data points located within 1.5 times the interquartile range (IQR) from the box. Dots represent outliers (single values exceeding 1.5 IQRs)

Differences in Gene-EML between cases and controls were investigated using a sequence kernel association test (SKAT) [52]. After correction for appropriate covariates (sex and principal components, as described for the differential methylation step), the analysis identified 790 SEMs-enriched genes with statistically significant associations with sample group variable (Perm P value < 0.05) (Supplementary File 6). The roster of genes exhibiting increased drift in cases underwent an over-representation analysis (Fig. 5) (Supplementary File 7).

Fig. 5

Bar chart showing enrichment ratio of a KEGG and b PANTHER pathways

ORA identified several enriched pathways, some remaining significant even after multiple testing corrections. Notably, the "Insulin resistance" pathway (hsa04931) remained significant (FDR = 0.04), alongside the "VEGF signaling" pathway (hsa04370) and the "Apoptosis signaling pathway" (P00006) (FDR = 0.005). Other pathways, while not significant after correction, are noteworthy, including "Hypoxia response via HIF activation," "Axon guidance mediated by netrin," "Relaxin signaling pathway," "T-cell activation," and the "Endothelin signaling pathway."

Discussion

This study aimed to investigate potential epigenetic changes 6 months after COVID-19 exposure. For this post-exposure timeframe, it currently stands out among epigenetic studies for its statistical power, analyzing the methylation profiles of nearly a hundred individuals.

The role of DNA methylation in developing long-term COVID-19 symptoms has been examined in three previously published studies [37,38,39]. However, these studies utilized diverse approaches, primarily focusing on post-acute sequelae of SARS-CoV-2 infection (PASC), and employed study designs with reduced sample sizes. We examined peripheral blood samples collected from individuals 6 months post-infection, regardless of persistent COVID-19 symptoms, with a primary focus on investigating whether the virus induced significant epigenetic remodeling or reprogramming in the host organism.

The initial suggestive discovery reveals no discernible variance in the immune system landscape between the two groups, as evidenced by comparable estimates of blood cellular composition (CD8T, CD4T, NK, Bcell, Mono, Gran) (Supplementary File 8).

Although not directly comparing the same populations at the same time point (3 months post-infection), [38] confirmed the absence of an inflammatory state, as they observed no significant differences in the various cell types under consideration.

The principal component analysis (PCA) showed no distinct patterns of epigenetic variation between the two groups, indicating a lack of solid differences in DNA methylation at CpG sites and regions (genes, promoters, CpG islands) level. However, the differential methylation analysis between the two groups revealed 42 CpG sites exhibiting significant differential methylation. The over-representation analysis highlighted a pathway related to glutamate/glutamine metabolism. The dysregulation of this pathway has been reported as significant in COVID-19 since studies have shown that glutamine and glutamate metabolism play a crucial role in COVID-19 severity, with elevated glutamate levels associated with an increased risk of infection and severe disease. In contrast, elevated glutamine levels are linked to a decreased risk of infection and severe COVID-19 [53]. Moreover, the functional prioritization analysis enabled the identification of genes, including GLUD1, ATP1A3, RNASEH2C, SMAD2, TNIP1, PRKCI, and ARRB2, which are particularly intriguing for their potential involvement in the neurological and immunological processes associated with post-COVID symptoms.

Among the genes with the highest prioritization scores, we identified GLUD1 (glutamate dehydrogenase 1), which is interesting because it plays a role in maintaining glutamate levels. The hyper-methylation of GLUD1 observed in this study aligns with the elevated levels of glutamate resulting from systemic inflammation caused by SARS-CoV-2 infection. This finding might contribute to explaining a number of neurotoxic effects, contributing to neuronal dysfunctions such as altered learning, memory, and neuroplasticity highlighted in post-COVID patients [54, 55].

Additionally, after prioritization analysis, ATP1A3 (alpha 3 subunit of the Na+/K+ ATPase) emerged as another intriguing gene. Although not directly linked to COVID-19, it has been associated with various neurological disorders [56] and cardiac abnormalities [57], suggesting a potential indirect role in the neurological manifestations and cardiovascular complications observed in some patients.

Emerging findings indicate that the other genes, RNASEH2C, SMAD2, TNIP1, PRKCI, and ARRB2, may also play a role in the pathophysiology of long-COVID by influencing various aspects of the immune response, inflammation, or cellular processes contributing to prolonged symptoms following COVID-19 infection. Mutations in the C subunit of ribonuclease H (RNASEH2C), for example, have been reported to affect the immune response and potentially result in severe COVID-19 outcomes [58, 59]. SMAD2 (SMAD family member 2) is a protein that plays a role in the TGF-β signaling pathway, which is involved in the regulation of cell growth, differentiation, and immune response [60]. Research has shown that SMAD2, along with other genes like SMAD1 and SMAD3, plays a role in modulating T-cell immunity and viral infection responses, contributing to symptoms such as chronic inflammation and immune dysregulation observed in long-COVID [61].

TNFAIP3 interacting protein 1 (TNIP1) is a hub protein associated with autoimmune diseases [62] and plays a role in COVID-19. Research suggests that TNIP1 is involved in the immune response and inflammation regulation, making it a potential target for therapeutic interventions in COVID-19 patients [63].

The remaining genes of interest are PRKCI and ARRB2. The PRKCI gene encodes protein kinase C iota (PKCi), which regulates cellular functions such as cell proliferation, division, differentiation, survival, migration, and polarization [64, 65]. ARRB2, a member of the arrestin/beta-arrestin protein family, is involved in desensitizing G-protein-coupled receptors and regulating signaling pathways related to cell proliferation, migration, and inflammation. Mutations in ARRB2 have been linked to neurodegenerative diseases [66, 67], cardiovascular alterations [68, 69], and cancer [70]. Additionally, β-arrestin 2 promotes the production of IFN-β and virus clearance in macrophages, although some viruses may degrade it to evade the immune response [71]. Both ARRB2 and PRKCI regulate toll-like receptor (TLR) signaling, which is critical for inducing inflammation in response to microbes and host molecules.

An additional analysis aimed to boost the over-representation analysis (ORA) sensitivity by incorporating the first 200 nominally significant genes into the gene list. This effort was made to capture potentially relevant genetic factors that may enhance ORA accuracy. The results revealed considerable enrichment in gene ontology (GO) terms associated with "Golgi apparatus functionality" under "cellular components." This finding is noteworthy as it aligns with previous research indicating that SARS-CoV-2 infection induces Golgi fragmentation, which aids in viral trafficking and release [72]. Additionally, Golgi fragmentation is commonly observed in brain samples from individuals with Alzheimer's disease and can be triggered by excessive neuronal activation [73].

After concluding the differential methylation analysis, the results were compared with those of a previous study using a meta-analytical approach to produce a more reliable list of genes. This comparative analysis identified 13 genes showing notably consistent epigenetic differences among studies. Despite the differing study designs, this gene list allows for a focus on more robust results.

Furthermore, the study explored additional facets of epigenetic regulation and investigated whether exposure to the SARS-CoV-2 virus affected biological age and epigenetic drift.

We observed a slight but significant age acceleration (AgeAccelerationDiff) and telomere shortening in post-COVID-19 patients, suggesting that SARS-CoV-2 virus exposure might accelerate aging. The effects of viral infections, specifically COVID-19, and of immune responses on biological aging are currently a topic of debate. Several studies have explored this aspect among individuals with acute-phase COVID-19, comparing healthy controls with COVID-19 cases and comparing mild with severe cases within the COVID-19 patient group. While some studies have shown no differences between chronological and biological age [3, 74], others have observed an acceleration [75]. In partial contrast to our findings, Lee et al., 2022 [38] observed no significant age acceleration; however, the sampling time points of our study (6 months after infection) and of Lee's [38] (3 months after infection) differ considerably. Telomere shortening is a widely observed and confirmed aspect in the context of COVID-19, as demonstrated in other studies involving subjects with severe forms of the disease [75,76,77] and in post-COVID-19 survivors [78].

Another critical issue in understanding the biological effects of COVID-19 virus infection after 6 months is the results obtained by evaluating epigenetic drift. Epigenetic drift refers to the changes in DNA methylation patterns that occur over time, contributing to aging. It can be influenced by genetics and environmental exposure, including viral infections, exerting an influence on individual health by increasing genomic instability and promoting abnormal gene expression [79]. Stochastic epigenetic mutations (SEMs) can be considered a reliable measure of epigenetic drift [3, 50, 80]. For example, the burden of SEMs was recently found to be associated with Parkinson's disease (referred to as epigenetic mutation load) [81] or with amyotrophic lateral sclerosis [80]. Two different metrics have been used to assess epigenetic drift: one that analyzed the broader impact of SEMs (Global-EML) and another that concentrated on evaluating the burden of SEMs at the gene level (Gene-EML). Interestingly, our analyses showed a significantly increased Global-EML in the post-COVID-19 group compared to controls (Fig. 4).

Moreover, to assess epigenetic drift at the gene level (Gene-EML), we employed a sequence kernel association test (SKAT). Designed initially for rare variant studies, this method has recently found applications in other areas such as copy number variations (CNVs) and epigenetic modifications. It has been widely used in studies aiming to identify genetic associations with diseases such as Alzheimer's disease, schizophrenia, autism spectrum disorder, and amyotrophic lateral sclerosis [80, 82,83,84]. This analysis identified a list of genes that exhibit significantly different epigenetic drift between the two study groups. The ORA performed on this list of genes revealed several significantly enriched biochemical pathways, most of which may directly or indirectly relate to COVID-19.

The "VEGF signaling pathway" and "Hypoxia response via HIF activation" pathways have been associated with COVID-19 due to their involvement in vascular dysfunction and inflammation observed during disease progression [85, 86]. Vascular endothelial growth factor (VEGF) plays a crucial role in angiogenesis and regulates various activities such as vascular permeability, cell migration, proliferation, and survival [87]. Hypoxia, or low oxygen conditions, can activate the hypoxia-inducible factor (HIF), a key regulator in the response to hypoxia. The concurrent activation of HIF and proinflammatory signaling leads to the upregulation of VEGF, which is elevated in COVID-19 patients compared to healthy controls [88, 89] and significantly higher in patients with severe outcomes compared to survivors [86, 90], suggesting that extensive activation of endothelial cells significantly contributes to disease progression. Unfortunately, we cannot speculate on whether these pathways are up- or down-regulated in our analysis; as for the SKAT analysis, we considered overall epigenetic dysregulation by aggregating hyper- and hypo-methylated SEMs.

Moreover, pathways like "Insulin Resistance" and the "Insulin/IGF pathway-protein kinase B signaling cascade" could become relevant due to COVID-19's association with various metabolic alterations, such as impacts on insulin sensitivity and glucose metabolism. Studies have shown that COVID-19 patients, even those with mild cases, may experience increased insulin resistance, which can persist long after the acute phase of the infection [91].

Another attractive deregulated pathway is represented by the "Apoptosis signaling pathway"; this pathway plays a central role in the pathogenesis of COVID-19, and its dysregulation may also contribute to disease severity. However, the role of apoptosis is very complex, and both the induction and inhibition of apoptosis have been suggested as potential therapeutic targets at different stages of the disease [92].

The "T-cell receptor signaling pathway" and "T-cell activation" pathways are highly suggestive concerning COVID-19. These pathways are relevant because the SARS-CoV-2 virus can directly impact the immune response of T-cells. During COVID-19 infection, a significant impact on T-cell populations and their activation has been observed, as T-cells play a crucial role in the immune response against the virus. Longitudinal studies [93, 94] show that immune abnormalities may persist after a severe COVID-19 progression, with sustained activation of myeloid cells, the presence of proinflammatory cytokines, and consistently activated T-cells still detectable between 8 and 12 months after the initial COVID-19 infection.

In conclusion, these results provide comprehensive insights into the epigenetic consequences of SARS-CoV-2 exposure after 6 months, emphasizing potential associations with aging, SEM accumulation, and dysregulation in critical pathways linked to insulin resistance, immune response, and vascular function.

We emphasize, for completeness of information, that an additional epigenetic analysis was conducted considering the subcohort of 28 samples that exhibited long-COVID symptoms. However, although these results confirm the findings already highlighted regarding the SARS-CoV-2 exposure factor, they should be taken and interpreted with caution due to the low sample size. The results of this additional analysis are reported in Supplementary File 9.

The study has some limitations. Due to experimental constraints, cases and controls were not perfectly balanced on each BeadChip, necessitating careful evaluation and correction for any potential batch effects. To validate the results obtained from an EWAS (epigenome-wide association study), a biological replicate of the experiment with a different validation cohort would be necessary. Furthermore, although the sample size may be considered large, it may still be insufficient to detect very subtle differences between cohorts that could prove significant in a genomic framework or large genetic association study. However, it is essential to note that the analysis framework developed in this study applies not only the most commonly adopted methods in the field of DNA methylation, such as the study of age predictors and the analysis of differential methylation, but also less common analytical approaches, integrating the study of epigenetic drift and of pathways enriched in genes carrying a significant burden of stochastic epigenetic mutations.

Materials and methods

Patient recruitment and study design

This study, conducted across multiple centers, intended to examine the long-term effects of COVID-19 by assessing ninety-six individuals 6 months after they contracted the virus (years 2020 and 2021). Inclusion criteria encompassed hospitalization during the infection period, with a subgroup of patients requiring intensive care unit admission and the use of forced ventilation. As a reference methylation profile, a control group of 191 selected individuals from an unrelated study with no history of COVID-19 (confirmed through serological testing) was included. At the time of sampling, control individuals did not exhibit any symptoms suggestive of COVID-19. The ethics committees of the participating hospital centers approved the study protocol. Informed consent was obtained from all participants before their inclusion in the study. All data collected were treated with strict confidentiality and adhered to relevant data protection regulations.

DNA extraction

DNA was isolated from the peripheral blood of all patients using automated equipment and a commercial kit based on magnetic bead separation. Total genomic DNA was quantified using an N60 Implen Nanophotometer. Samples showing aberrant protein (260/280) or organic compound (230/260) absorbance ratios were either purified or discarded.

Methylation assay

Following the manufacturer's instructions, 900 ng of high-quality genomic DNA was bisulfite converted using the EZ DNA methylation kit (D5001, Zymo Research Corporation). Illumina incubation conditions were used to increase the efficiency and reproducibility of the bisulfite conversion. Quality control and quantification of bisulfite-converted DNA (bsDNA) were performed using an N60 Implen Nanophotometer. Approximately 200 ng/µl of bisulfite-converted DNA was hybridized on Illumina Infinium Methylation EPIC BeadChips. Fluorescent signals were detected using the Illumina iScan scanner (two-color laser, 532 nm/660 nm) and saved as intensity data files (*.idat). The methylation level of each CpG site is represented as a β-value, computed from the ratio between the fluorescent intensities of the methylated and unmethylated probes. β-values range between 0 (non-methylated) and 1 (completely methylated).
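As a minimal sketch of how a β-value follows from the two probe intensities, the function below uses the commonly cited form β = M / (M + U + offset). The function name and the offset of 100 are assumptions for illustration and are not taken from this study; the array software and the ChAMP pipeline perform this step internally.

```python
def beta_value(methylated, unmethylated, offset=100.0):
    """Return the beta-value of one CpG site from its two probe intensities.

    `offset` is an assumed stabilising constant (100 is a common convention);
    it keeps the ratio defined when both intensities are close to zero.
    """
    if methylated < 0 or unmethylated < 0:
        raise ValueError("intensities must be non-negative")
    total = methylated + unmethylated + offset
    if total == 0:
        return float("nan")  # no signal and no offset: beta is undefined
    return methylated / total


# Hypothetical intensities for a mostly methylated site
print(beta_value(3000.0, 900.0))
```

With the offset set to zero the value reduces to the plain intensity ratio M / (M + U), which is bounded by 0 and 1 as described above.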

Quality control and differential methylation analysis

The β-value dataset of all samples was generated using the ChAMP package [95]. During the quality control/preprocessing step, the following probes were filtered out: 12,777 probes with a detection p-value above 0.01, 5179 probes with a bead count < 3 in at least 5% of samples, 2875 non-CpG probes, 94,318 probes with potential SNPs [96], 11 probes that align to multiple locations, and 16,424 probes located on the X and Y chromosomes. A total of 734,334 probes and 287 samples were retained. After BMIQ normalization, the variability due to the batch effect was corrected using the ComBat function [97], which performs parametric empirical Bayesian adjustments. To infer the proportions of a priori known cell types (e.g., CD8+ and CD4+ T cells, natural killer (NK) cells, B cells, monocytes, and granulocytes (basophils, eosinophils, and neutrophils)) present in blood samples, we used the EpiDISH package [40] in the R environment.
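The probe counts reported above can be cross-checked with simple bookkeeping. The sketch below assumes that the filters were applied sequentially, so that the removed sets do not overlap; the variable names are illustrative only.

```python
# Probes removed at each filtering step, as reported in the text
removed = {
    "detection p-value > 0.01": 12_777,
    "bead count < 3 in >= 5% of samples": 5_179,
    "non-CpG probes": 2_875,
    "potential SNPs": 94_318,
    "multi-mapping probes": 11,
    "X/Y chromosome probes": 16_424,
}
retained = 734_334

total_removed = sum(removed.values())
# Starting probe count implied by the reported numbers
starting_probes = retained + total_removed

print(f"removed: {total_removed:,}")
print(f"implied starting probes: {starting_probes:,}")
print(f"fraction retained: {retained / starting_probes:.3f}")
```

Under this assumption, the six filters remove 131,584 probes in total, which implies 865,918 probes before filtering.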

Differential methylation analysis at the CpG site level was conducted between groups by computing p-values using the “limma” package [41]. At the region level (genes, promoters, CpG islands, tiling regions), the “RnBeads” differential methylation (limma) module was used instead [98].

To adjust for potential confounding factors, a principal component analysis (PCA) was performed to evaluate the association of age and blood cell estimations with both dependent (disease groups) and independent (methylation values) variables. The principal components (PCs) summarizing 80% of the variability were used as covariates in the differential methylation step; PCs were used to avoid collinearity among covariates. Biological age (Horvath, Hannum, PhenoAge, and DNAm skin and blood clocks) [43,44,45] and telomere length (surrogate marker DNAmTL) [46] were estimated using the DNA methylation age calculator analysis tool [42] (https://dnamage.genetics.ucla.edu/). Enrichment analysis on differentially methylated sites was performed using the "methylGSA" package [99] (methylglm function) in the R environment.
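The selection of PCs covering 80% of the variability can be sketched as follows. This is an illustrative example, not the authors' code: it assumes that the PCA is run on a standardized samples-by-covariates matrix (e.g., age and estimated cell fractions), and the function name `pcs_for_variance` is hypothetical.

```python
import numpy as np


def pcs_for_variance(covariates, threshold=0.80):
    """Return PC scores and explained-variance ratios for the smallest set
    of leading components whose cumulative variance reaches `threshold`."""
    X = np.asarray(covariates, dtype=float)
    X = X - X.mean(axis=0)                 # centre each covariate
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0                    # leave constant columns unscaled
    X = X / std                            # put covariates on a common scale
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    explained = s**2 / np.sum(s**2)        # variance ratio per component
    cumulative = np.cumsum(explained)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    k = min(k, len(s))                     # guard against rounding at 1.0
    scores = U[:, :k] * s[:k]              # sample coordinates on the PCs
    return scores, explained[:k]


# Hypothetical covariate matrix: 40 samples, 5 covariates
rng = np.random.default_rng(0)
demo = rng.normal(size=(40, 5))
scores, explained = pcs_for_variance(demo)
print(scores.shape, explained.sum())
```

Because the resulting score columns are mutually orthogonal, they can enter a linear model together without the collinearity that correlated raw covariates (such as cell-type fractions) would introduce.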

Individual sample analyses were carried out by identifying stochastic epigenetic mutations (SEMs) as described in [3, 47,48,49,50,51]. SEMs represent extreme aberrant methylation data points and were identified for each CpG site by comparing the methylation profile of each case to a reference methylation range, calculated from a control population as follows: upper value = Q3 + (k × IQR), lower value = Q1 − (k × IQR), where Q1 represents the first quartile, Q3 corresponds to the third quartile, IQR (interquartile range) equals Q3 − Q1, and k is set at 3. Outlier values were then classified as hyper-methylated or hypo-methylated with respect to the median values of the controls' corresponding probes. Gene annotation of SEMs was obtained using the web tool wANNOVAR [100]. To test associations between cases and SEM-enriched genes, a method for rare variant analysis was applied, using the SKAT‐O method implemented in the RVTESTS package [52]. Results were organized and investigated according to clinical characteristics/keywords using the available prioritization tools, including the WEB-based gene set analysis toolkit (WebGestalt) [101], which uses, among others, the gene ontology, KEGG (Kyoto Encyclopedia of Genes and Genomes), and PANTHER (protein analysis through evolutionary relationships) databases, and VarElect (the next-generation sequencing phenotyper) [102]. Data and results were visualized using the "ggplot2" package for PCA charts and boxplots, the "pheatmap" package for the heatmaps, and the "CMplot" package for the Manhattan plots. Linear regressions (after checking assumptions) or the Mann–Whitney test were used to evaluate statistical differences in age, cell-type composition, and burden of SEMs between cases and controls. Unless otherwise stated, the statistical significance threshold was set to 0.05.
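The reference-range rule for SEMs described above can be illustrated with a short sketch. It assumes β-values arranged as samples-by-CpG matrices and uses NumPy's default (linear interpolation) quartile definition; the function name `sem_calls` and the toy values are hypothetical and do not come from the study data.

```python
import numpy as np


def sem_calls(control_betas, case_betas, k=3.0):
    """Flag stochastic epigenetic mutations (SEMs) in cases.

    Both inputs are (samples x CpGs) arrays of beta-values. Returns two
    boolean arrays (hyper, hypo) with the same shape as `case_betas`.
    """
    controls = np.asarray(control_betas, dtype=float)
    cases = np.atleast_2d(np.asarray(case_betas, dtype=float))
    q1, q3 = np.percentile(controls, [25, 75], axis=0)
    iqr = q3 - q1
    upper = q3 + k * iqr                   # upper value = Q3 + (k x IQR)
    lower = q1 - k * iqr                   # lower value = Q1 - (k x IQR)
    outlier = (cases > upper) | (cases < lower)
    median = np.median(controls, axis=0)
    hyper = outlier & (cases > median)     # outlier above the control median
    hypo = outlier & (cases < median)      # outlier below the control median
    return hyper, hypo


# Toy example: 5 controls and 2 cases measured at 2 CpG sites
controls = np.array([[0.48, 0.48], [0.49, 0.49], [0.50, 0.50],
                     [0.51, 0.51], [0.52, 0.52]])
cases = np.array([[0.60, 0.50], [0.40, 0.55]])
hyper, hypo = sem_calls(controls, cases)
burden = hyper.sum(axis=1) + hypo.sum(axis=1)  # SEM count per case
print(burden)
```

Summing the flags across CpG sites gives a per-sample SEM burden, the quantity compared between cases and controls.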

Meta-analysis

The meta-analysis was performed using the computational tool METAL [103], specifically designed for (epi)genome-wide analysis. Only the overlapping site-specific p-values (n = 723,656) derived from the differential methylation analyses were integrated into the meta-analysis.
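To illustrate how site-specific p-values from separate analyses can be combined, the sketch below implements a sample-size-weighted Z-score scheme of the kind METAL offers. The text does not state which METAL scheme or settings were used, so the weighting, the effect directions, and the sample sizes here are assumptions; `combine_site` is a hypothetical helper, not part of METAL.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def combine_site(p_values, directions, sample_sizes):
    """Combine two-sided p-values for one CpG site across studies.

    Each p-value (0 < p <= 1) is converted to a signed Z-score using the
    direction of effect (+1 or -1) and weighted by sqrt(sample size).
    Returns the combined Z-score and its two-sided p-value.
    """
    numerator = 0.0
    weight_sq_sum = 0.0
    for p, direction, n in zip(p_values, directions, sample_sizes):
        z = -_STD_NORMAL.inv_cdf(p / 2.0)   # magnitude from two-sided p
        if direction < 0:
            z = -z                          # restore the sign of the effect
        w = sqrt(n)
        numerator += w * z
        weight_sq_sum += w * w
    z_meta = numerator / sqrt(weight_sq_sum)
    p_meta = 2.0 * _STD_NORMAL.cdf(-abs(z_meta))
    return z_meta, p_meta


# Hypothetical site: two analyses agreeing in direction
print(combine_site([0.05, 0.05], [+1, +1], [100, 100]))
```

Two concordant results reinforce each other (a smaller combined p-value), whereas equally weighted results in opposite directions cancel out.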