Introduction

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused a global pandemic resulting in over 569 million confirmed cases and more than 6 million deaths, as reported by the Johns Hopkins University Coronavirus Resource Center on July 23rd, 2022 (https://coronavirus.jhu.edu/map.html). With the growing body of research on SARS-CoV-2 (Roshandel et al. 2020; Balaky et al. 2020; Zhu et al. 2020), more high-throughput multi-omics sequencing data are becoming available in databases such as Gene Expression Omnibus (GEO), enabling integrative analyses of DNA methylation and gene expression data to identify epigenetically regulated modules or potential biomarkers. Thair et al. (2021) performed RNA-Seq on samples infected with six viruses, including SARS-CoV-2, and identified a series of differentially expressed genes. Castro de Moura et al. (2021) analyzed the DNA methylation status of peripheral blood samples from 407 confirmed COVID-19 patients and identified 44 CpG sites associated with the severity of COVID-19. Finally, Balnis et al. (2021) compared differentially methylated regions (DMRs) between COVID-19 patients and healthy individuals, finding that the DMRs were enriched in gene promoter regions and hypomethylated in COVID-19 samples.

To our knowledge, no study has yet focused on integrating DNA methylation and gene expression datasets to identify functional epigenetic modules and potential biomarkers for COVID-19. However, a supervised algorithm called FEM (Jiao et al. 2014) can be used to identify gene modules in which a significant number of genes are simultaneously differentially methylated and differentially expressed. FEM has already been applied to module discovery in many studies (Teschendorff et al. 2016; Cancer Genome Atlas Research Network et al. 2017; Ding et al. 2020a; Wang et al. 2020) and is commonly used to integrate DNA methylation and gene expression datasets (Ding et al. 2020b).

In this study, we first conducted differential expression and methylation analysis, then identified two functional epigenetic modules using the FEM algorithm and performed gene set enrichment analysis for the genes in the identified modules. Interestingly, we found that the SKA1 module is associated with virus replication and transcription, while the WSB1 module is related to the activity of ubiquitin-protein and ubiquitin-protein ligase. We also observed that two genes in the SKA1 module, CENPM and KNL1, were significantly hypomethylated and upregulated in COVID-19 samples compared with healthy individuals. To validate the associations between these two epigenetically activated genes and virus infections, we performed differential expression and survival analysis in cervical squamous cell carcinoma (CESC), liver hepatocellular carcinoma (LIHC), and oropharyngeal squamous cell carcinoma (OPSCC) tumor samples annotated as positive or negative for human papillomavirus (HPV) or hepatitis B virus (HBV). As expected, the two genes were upregulated in the HPV- or HBV-positive groups compared with the negative groups, and their expression or methylation was significantly associated with the survival of the corresponding tumor samples. Finally, we built diagnostic models based on the expression and methylation values of the genes in the two modules; the area under the ROC curve (AUC) was greater than 0.98. Our results suggest that the FEM modules and the identified epigenetically activated genes may play an important role in the replication and transcription of SARS-CoV-2 and could serve as potential biomarkers and therapeutic targets for COVID-19.

Materials and Methods

Datasets and Preprocessing

The datasets used in this study were downloaded from GEO (https://www.ncbi.nlm.nih.gov/geo/). For the RNA-Seq data, read counts for 62 COVID-19 patients and 24 healthy controls were obtained under accession GSE152641 (Thair et al. 2021) and normalized with edgeR (Robinson et al. 2010). The processed Infinium MethylationEPIC DNA methylation dataset, comprising whole blood samples from 102 COVID-19 patients and 26 non-COVID-19 patients, was obtained under accession GSE174818 (Balnis et al. 2021).

Differential Analysis and Identification of Functional Epigenetic Modules

The protein-protein interaction (PPI) network was obtained from the InBio (Li et al. 2017) and BioPlex (Huttlin et al. 2015) databases. Given the PPI network and the gene expression and DNA methylation matrices as input, the FEM algorithm was used to perform differential expression and methylation analysis and to identify functional epigenetic modules. Genes with |stat(mRNA)| ≥ 1.5 and P(mRNA) ≤ 0.05 were regarded as significantly differentially expressed, and genes with |stat(DNAm)| ≥ 1.5 and P(DNAm) ≤ 0.05 were defined as differentially methylated.
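The significance cut-offs above can be illustrated with a minimal Python sketch. This is not the FEM implementation, only a hedged illustration of the filtering rule for one data type; the `GeneStat` record, the function name, and the example values are hypothetical.

```python
from typing import NamedTuple


class GeneStat(NamedTuple):
    """Per-gene result for one data type (mRNA or DNAm); hypothetical record."""
    gene: str
    stat: float      # test statistic, stat(mRNA) or stat(DNAm)
    p_value: float   # corresponding P(mRNA) or P(DNAm)


def significant_genes(rows, stat_cutoff=1.5, p_cutoff=0.05):
    """Return gene names with |stat| >= stat_cutoff and P <= p_cutoff."""
    return [
        r.gene
        for r in rows
        if abs(r.stat) >= stat_cutoff and r.p_value <= p_cutoff
    ]


# Made-up values, purely for illustration (not results from this study)
example = [
    GeneStat("GENE_A", 2.3, 0.01),    # passes both cut-offs
    GeneStat("GENE_B", -1.8, 0.04),   # passes both cut-offs (negative direction)
    GeneStat("GENE_C", 1.2, 0.001),   # fails the statistic cut-off
    GeneStat("GENE_D", 3.0, 0.20),    # fails the P-value cut-off
]
selected = significant_genes(example)
```

The same function would be applied once to the expression statistics and once to the methylation statistics.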

Gene Set Enrichment Analysis

The hypomethylated genes and the genes in the identified FEM modules were submitted to the DAVID web server (Huang et al. 2009a, b) to perform gene ontology (GO) and KEGG enrichment analysis. Terms with FDR ≤ 0.05 were considered significantly enriched; for better visualization, −log10(FDR) was calculated and used to draw the dot plot (Fig. 3).
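As a small illustration of the transformation used for plotting, the sketch below filters terms at FDR ≤ 0.05 and computes −log10(FDR). The function name, the term names, and the FDR values are hypothetical, and terms with an FDR of exactly zero are skipped here as an assumption, since their transformed value would be infinite.

```python
import math


def neg_log10_fdr(terms, fdr_cutoff=0.05):
    """Keep terms with 0 < FDR <= fdr_cutoff and return (term, -log10(FDR)).

    The result is sorted so that the most significant term comes first.
    """
    kept = [
        (name, -math.log10(fdr))
        for name, fdr in terms
        if 0 < fdr <= fdr_cutoff
    ]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Made-up enrichment results, purely for illustration
example_terms = [("term_1", 0.001), ("term_2", 0.05), ("term_3", 0.2)]
scores = neg_log10_fdr(example_terms)
```

Larger values correspond to smaller FDRs, which is why the transformed scale is convenient for a dot plot.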

Diagnostic Model

The logistic regression model was built using the Python package scikit-learn (https://scikit-learn.org/) based on the expression values or the methylation beta values of the genes in the SKA1 and WSB1 FEM modules. All samples were randomly split into training and test sets at a 4:1 ratio, and the test set was used to evaluate the model.
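The procedure described above could look roughly like the following scikit-learn sketch. It runs on synthetic data rather than the study's matrices, and the stratified split, the random seeds, and `max_iter` are assumptions that the text does not specify.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 100 samples x 5 "genes" (NOT the study's data).
# Class 1 samples are shifted so that the two groups are separable.
n_samples, n_genes = 100, 5
y = np.array([0] * 50 + [1] * 50)
X = rng.normal(size=(n_samples, n_genes)) + y[:, None] * 1.0

# A 4:1 split corresponds to holding out 20% of the samples.
# Stratifying by class is an assumption, not stated in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# AUC is computed from the predicted probability of the positive class
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Test AUC: {test_auc:.2f}")
```

In practice, `X` would hold the expression or beta values of the module genes (samples in rows) and `y` the COVID-19 versus control labels.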

Analysis of the Epigenetically Activated Genes in Tumors with HPV and HBV

To validate the association of the two epigenetically activated genes, CENPM and KNL1 (significantly hypomethylated and upregulated in COVID-19 samples), with virus infection, we queried The Cancer Genome Atlas (TCGA) cancer types and virus types in the OncoDB database (http://oncodb.org) (Tang et al. 2022) using default parameters. Only comparisons with at least three samples and a p-value ≤ 0.05 are shown in Fig. 5.

Results

Differential Expression and Methylation Analysis

We first performed differential expression and methylation analysis, as shown in Fig. 1A. Our analysis revealed that there were 2168 upregulated, 2110 downregulated, 105 promoter hypermethylated, and 531 promoter hypomethylated genes. We also identified 143 epigenetically activated genes and 17 epigenetically silenced genes.

Fig. 1

Differential analysis. A Distribution of the differentially expressed and methylated genes: the epigenetically activated genes (hypomethylated and upregulated, stat(DNAm) < −1.5 and stat(mRNA) > 1.5 and p-value ≤ 0.05) and the epigenetically silenced genes (hypermethylated and downregulated, stat(DNAm) > 1.5 and stat(mRNA) < −1.5 and p-value ≤ 0.05) are shown as purple dots. The size of each point represents min(−log10(P(DNAm)), −log10(P(mRNA))). B GO enrichment analysis result for the hypomethylated genes
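The two categories defined in the legend of Fig. 1A, together with the point-size rule, can be written as a short Python sketch. The function names are hypothetical, and requiring both P(DNAm) and P(mRNA) to be ≤ 0.05 is an assumed reading of the single "p-value" condition in the legend.

```python
import math


def classify_gene(stat_dnam, stat_mrna, p_dnam, p_mrna,
                  stat_cutoff=1.5, p_cutoff=0.05):
    """Label a gene as 'activated', 'silenced', or 'other'.

    Assumption: both P(DNAm) and P(mRNA) must be <= p_cutoff.
    """
    if p_dnam > p_cutoff or p_mrna > p_cutoff:
        return "other"
    if stat_dnam < -stat_cutoff and stat_mrna > stat_cutoff:
        return "activated"   # hypomethylated and upregulated
    if stat_dnam > stat_cutoff and stat_mrna < -stat_cutoff:
        return "silenced"    # hypermethylated and downregulated
    return "other"


def point_size(p_dnam, p_mrna):
    """Point size: min(-log10(P(DNAm)), -log10(P(mRNA)))."""
    return min(-math.log10(p_dnam), -math.log10(p_mrna))


# Made-up values, purely for illustration
label = classify_gene(stat_dnam=-2.0, stat_mrna=2.5, p_dnam=0.01, p_mrna=0.001)
size = point_size(0.01, 0.001)
```

Taking the minimum means a point is drawn large only when the gene is strongly significant in both data types.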

Moreover, we found that the hypermethylated genes were enriched in response to lipopolysaccharide. In contrast, the hypomethylated genes were enriched in functions such as protein binding, protein homodimerization activity, focal adhesion, and so on, as depicted in Fig. 1B.

Our findings were consistent with a previous study (Thair et al. 2021), which showed that upregulated genes were enriched in G-protein coupled receptor signaling pathway, neuroactive ligand-receptor interaction, inflammatory response, and DNA replication-dependent nucleosome assembly. In contrast, downregulated genes were enriched in rRNA processing, viral transcription, viral process, and RNA transport, among others (Supplementary Table S1).

Functional Epigenetic Modules

To further investigate the relationship between gene expression and DNA methylation, we used the FEM algorithm (Jiao et al. 2014) to integrate the two data types. Our analysis identified two significant functional epigenetic modules, SKA1 (p-value = 0.003) and WSB1 (p-value = 0.043), comprising 79 and 60 genes, respectively (Fig. 2 and Supplementary Table S2).

Fig. 2

Functional epigenetic modules. Module SKA1 (A) and WSB1 (B). The node color illustrates the DNA methylation difference (blue means high methylation, and yellow indicates low methylation), while the edge color shows differential expression (red represents genes with elevated expression levels in COVID-19, and green represents genes with low expression). The size of each node is correlated with its degree in the network

After performing enrichment analysis for the genes in these two modules, we unexpectedly found that the genes in the SKA1 module were enriched in biological processes related to viral DNA replication, including sister chromatid cohesion, cell division, mitotic nuclear division, chromosome segregation, cell cycle, and viral transcription (Fig. 3A). These results suggest that the SKA1 module may play a role in the replication and transcription of SARS-CoV-2.

Fig. 3

Enrichment results for the genes in module SKA1 (A) and WSB1 (B)

In contrast, the genes in the WSB1 module were found to be enriched in protein polyubiquitination and protein ubiquitination-related biological processes and KEGG pathways (Fig. 3B). This indicates that the WSB1 module may be involved in protein regulation and signaling pathways related to ubiquitination.

Development of a Diagnostic Model for COVID-19 and Validation of Potential Markers

After identifying two modules that may be associated with the replication and transcription of SARS-CoV-2 and with protein ubiquitination, we built a diagnostic model to test whether the genes in those two modules (SKA1 and WSB1) can distinguish COVID-19 samples from healthy controls. We randomly split the whole dataset into training and test sets, trained a logistic regression classifier on the training set, and validated it on the test set. The AUC was 1 and 0.79 for the gene expression-based and DNA methylation-based models, respectively (Fig. 4A and B). The heatmaps (Fig. 4C and D) showed a clear difference in the expression and DNA methylation of SKA1 module genes between COVID-19 and healthy control samples. Similarly, the classifiers for the WSB1 module achieved AUCs of 0.83 and 0.99 (Supplementary Figure S1). The learning curves show that the training and validation accuracy converge as the training size (number of samples) increases (Supplementary Figure S2), suggesting that our model was not overfitting.

Fig. 4

Diagnostic model for COVID-19 based on the genes in the SKA1 module. The ROC curves for the diagnostic models based on gene expression (A) and DNA methylation (B) data of the genes in the SKA1 module. Heatmaps of the gene expression (C) and DNA methylation (D) of genes in the SKA1 module in COVID-19 and healthy control samples. Gene expression and DNA methylation values were scaled from 0 to 1 by rows (genes). Scatter plots of the differentially methylated and differentially expressed genes in the SKA1 (E) and WSB1 (F) modules. The labeled genes (CENPM, KNL1, RBCK1, CCNF, and UNKL) are significantly epigenetically activated genes (hypomethylated and upregulated genes)
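The row-wise scaling mentioned in the legend can be sketched as a min-max transformation applied to each gene separately. This is an assumed reading of "scaled from 0 to 1 by rows"; the function name, the example values, and the handling of constant rows (mapped to 0) are choices made here for illustration.

```python
def scale_rows(matrix):
    """Min-max scale each row (gene) of a genes x samples matrix to [0, 1]."""
    scaled = []
    for row in matrix:
        lo, hi = min(row), max(row)
        span = hi - lo
        # A constant row has no spread; map it to zeros (assumption)
        scaled.append([(v - lo) / span if span else 0.0 for v in row])
    return scaled


# Made-up values: two genes measured in three samples
example = [[2.0, 4.0, 6.0], [0.5, 0.5, 0.5]]
scaled_example = scale_rows(example)
```

Scaling per gene makes the color scale comparable across rows, so the heatmap shows each gene's relative pattern across samples rather than its absolute level.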

We noticed five epigenetically activated genes in the SKA1 and WSB1 modules, namely CENPM, KNL1, RBCK1, CCNF, and UNKL, which were significantly hypomethylated and overexpressed in COVID-19 samples. Since the SKA1 module was enriched in processes related to viral replication and transcription, we hypothesized that its two epigenetically activated genes, CENPM and KNL1, might also be associated with the replication and transcription of viruses. To investigate this hypothesis, we examined the differential expression status of CENPM and KNL1 with respect to six major oncoviruses across TCGA cancer types using OncoDB (see “Methods” section).

The OncoDB analysis showed that, among all TCGA cancer types, differential expression of CENPM was observed in HPV-positive CESC and OPSCC and in HBV-positive LIHC tumor samples. Specifically, compared with virus-negative samples, both CENPM and KNL1 were significantly overexpressed in virus-positive samples of the corresponding virus type (Fig. 5, p-value < 0.01). In addition, Xiao et al. (2019) reported overexpression of CENPM in hepatitis B virus (HBV)-related liver tissues compared with normal tissues, which is consistent with our findings.

Fig. 5

Associations between HPV, HBV in CESC, OPSCC, LIHC, and genes CENPM and KNL1. In each TCGA cancer type, the expression of CENPM and KNL1 in virus-positive and virus-negative groups is shown as boxplots (upper panel). The lower panel shows the Kaplan–Meier curves of overall survival based on the gene expression (CENPM on CESC: HPV, LIHC: HBV, and KNL1 on OPSCC: HPV) or DNA methylation (CENPM on OPSCC: HPV and KNL1 on CESC: HPV) of CENPM or KNL1

Finally, we examined the association between the gene expression or DNA methylation of CENPM and KNL1 and the overall survival probability of virus-positive tumor samples in the corresponding TCGA cancer types. As anticipated, we found that CENPM and KNL1 are significantly associated with the survival of HPV-positive CESC and OPSCC, as well as HBV-positive LIHC tumor samples (p-value ≤ 0.05). Our results suggest that CENPM and KNL1 may be involved in the replication and transcription of SARS-CoV-2, as well as in HPV- and HBV-related processes in tumors.

Discussion

DNA methylation is a critical biomarker in many diseases, including cancer (Ding et al. 2019). Various studies have investigated the DNA methylation or gene expression profiles in COVID-19. However, to date, no research has combined the analysis of gene expression and DNA methylation to identify functional epigenetic modules that are both differentially methylated and differentially expressed.

Here we identified two significant functional epigenetic modules by integrating MethylationEPIC DNA methylation array and RNA-Seq datasets using FEM. The SKA1 module was found to be closely associated with the cell cycle, DNA replication, and transcription of SARS-CoV-2, while the WSB1 module is related to protein ubiquitination. Ubiquitin modifications can regulate the innate immune response by affecting the related regulatory proteins, altering their stability via the ubiquitin–proteasome pathway, or directly regulating their activity. It has been reported that viruses, including coronaviruses, often modulate ubiquitin and ubiquitin-like modifiers to evade the host cell's immune response (Lin and Zhong 2015; Tang et al. 2018). Recent research indicated that deubiquitinating enzymes play an essential role in coronavirus pathogenesis, as they are involved in the production of non-structural proteins required for coronavirus replication (Clemente et al. 2020). The SARS-CoV-2 protein ORF9b interrupts the K63-linked polyubiquitination of NEMO upon viral stimulation, thereby inhibiting the canonical IκB kinase alpha (IKKα)/β/γ-NF-κB signaling and subsequent interferon production, which contributes to the viral pathogenesis and development of COVID-19 (Wu et al. 2021). These studies indicate that protein ubiquitination is associated with coronavirus replication and with the pathogenesis and development of COVID-19, which suggests that both the SKA1 and WSB1 modules may play essential roles in the replication of coronavirus.

Then, we built logistic regression models using only the expression or DNA methylation values of genes from the SKA1 or WSB1 module. Our results showed that the genes in these two modules could be used to distinguish COVID-19 samples from controls. The AUC was 1 and 0.79 for the gene expression and DNA methylation of the SKA1 module, respectively.

Finally, we screened out two potential marker genes, CENPM and KNL1, from the SKA1 module. These two genes are epigenetically activated in COVID-19 samples. Surprisingly, they are also significantly overexpressed in HPV-positive CESC and OPSCC tumor samples, as well as in HBV-positive LIHC tumor samples. In addition, the expression and DNA methylation profiles of CENPM and KNL1 are significantly associated with the overall survival of HPV- or HBV-positive CESC, OPSCC, or LIHC tumor samples.

To conclude, we identified two functional epigenetic modules, SKA1 and WSB1, and potential biomarkers, CENPM and KNL1, that are associated with the replication process of coronavirus and may be used as potential therapeutic targets for COVID-19 after further verification.