Background

The liver is the greatest digestive and endocrine organ in the human body, and primary HCC, which is characterized by easy metastasis and strong invasion, is one of the most commonly seen malignant tumours in the clinic, with a low 5-year survival rate [1]. HCC has become the second major malignant tumour that threatens human life [2]. In China, the leading aetiology of HCC is viral hepatitis, among which hepatitis B and hepatitis C are the most common, while alcoholic liver disease (ALD) is dominant in western countries [3]. In recent years, an increasing number of HCC cases with unknown reasons has been diagnosed, showing a younger trend [4]. In addition, with the advancement of HCC research, it has been discovered that epigenetic changes [5], including DNA methylation, histone modification and aberrant miRNA expression, play core roles in HCC genesis, wherein HCC genesis is related to abnormal DNA methylation [6]. A large amount of literature has indicated that DNA methylation represents a series of early and most frequent molecular behaviours during the HCC genesis process [7]. DNA methylation accumulates with disease progression and is currently regarded as an indication of malignant HCC [8]. Therefore, obtaining samples from HCC patients after diagnosis, carrying out specific analyses, and mining the specific biomarkers are the key to prognostic assessment, classification determination, recurrence judgement, and early selection of appropriate therapeutics and treatments.

DNA methylation refers to the process in which methyl groups are transferred onto the 5′-position carbon atom in the cytosine of the DNA CpG sequence to form 5-methylcytosine, with S-adenosylmethionine as the donor under the catalysis of DNA methyltransferases (DNMTs include three types, namely, DNMT1, DNMT2 and DNMT3) [9]. Hypermethylation of CpG islands is related to the silencing of tumour suppressor gene expression, which plays a key role during the early tumourigenesis process [10]. Additionally, the universally low methylation level of the tumour genome is closely associated with the activation of oncogenes, changes in chromatin structure, and the loss of basic groups; moreover, it is also closely correlated with tumour invasion, metastasis and prognosis [11]. Currently, research has verified that genome-wide low DNA methylation is an important mechanism responsible for early HCC formation and is also an important cause of early chromatin structural changes in non-cirrhosis HCC. The low methylation of genes such as AKT3, CD147, LINE1 and SFRP1 has been proven to be significantly correlated with the survival, tumour volume and malignant grade of HCC patients [12,13,14]. Moreover, the silencing of tumour suppressor genes and the loss of DNA repair function in HCC are closely correlated with the high methylation of promoter CpG islands, which prevents the timely repair of DNA damage, thus resulting in abnormal cell proliferation; for instance, the CpG islands of genes such as SOX1, VIM, BMP4 and CDKN2A show high methylation levels [15,16,17]. However, it is still unknown whether specific methylation patterns in these gene promoter regions display clinical significance compared with tumour classification, survival and prognosis in large-sample HCC patient data.

Therefore, this study aimed to establish a classification method that integrates several DNA methylation markers in order to help clinicians effectively assess HCC patient prognosis and select therapeutic strategies.

Methods

Pre-processing of preliminary sample data and initial screening of HCC DNA methylation sites

The latest clinical follow-up information was downloaded using the TCGA GDC API on July 31st, 2018 (as shown in Table S1.txt) and contained a total of 377 samples (as shown in Table S11.txt); the RNA-Seq data (as shown in Table S2.txt) included 424 samples. Illumina Infinium HumanMethylation450 data [18] were downloaded using the UCSC Cancer Browser [19] and contained 429 samples.

Subsequently, the samples with a follow-up period of over 30 days were screened from the clinical data for further methylation profile matching, and a total of 348 samples detected for methylation were screened. Moreover, the CpG sites with an NA ratio of over 70% were removed from all samples, and the cross-reactive CpG sites in the genome were also removed according to the cross-reactive sites. In addition, the missing values of methylation profiles were filled using the KNN method with the impute R package, and the unstable genomic methylation sites were further removed in order to remove the CpG sites and single nucleotide sites from the sex chromosomes. A total of 208,022 methylation sites were finally obtained. Second, the 348 samples were divided into a training set (n = 174) and test set (n = 174). The obtained clinical follow-up information is shown in Table S3.txt and the clinical follow-up information is shown in Table S4.txt.

Univariate analysis and multivariate survival analysis of the training set methylation sites

First, a univariate Cox proportional hazards regression model was applied to analyse each methylation site and the survival data using the coxph function of the survival R package [20], with p < 0.05 as the significant threshold. Subsequently, the univariate Cox proportional hazards regression model was also employed to analyse age, T, N, and M stages, grade, sex and survival data. The results suggested that the T, N and M stages showed significant differences in prognosis prediction, with log-rank p-values of 0.002446, 0.034366 and 0.00259, respectively. Then, the univariate Cox model was used to select the significant methylation sites for multivariate Cox proportional hazards regression model analysis, with the T, N and M stages as the covariants in the model and P < 0.05 as the significant threshold. Finally, the methylation sites of the training set samples that were significant in both the univariate and multivariate analyses were selected as characteristic biomarkers for further analysis.

Molecular subtyping of HCC, screening of intra-group-specific methylation sites, and enrichment analysis of signalling pathways and transcription factors

First, the ConsensusClusterPlus R software package [21] was used for the consistent clustering of characteristic methylation sites that were significant in both the univariate and multivariate analyses, and the molecular subtypes were screened for subgroup classification. The similarity distance between samples was calculated by the Euclidean distance [22], K-means was used for clustering, 80% samples were sampled 100 times by adopting the resampling program, and the optimal cluster number was determined by the cumulative distribution function (CDF) [23]. On this basis, methylation expression profile cluster analysis and clinical characteristic analysis of each subgroup were also performed. Subsequently, the EpiDiff software [24] was also employed to identify the specific methylation sites. For each cluster, the average value of each methylation level of the methylation sites that were significant in both the univariate and multivariate analyses was calculated, the obtained matrix was used as the input data of EpiDiff software, the threshold was set at 2.099, and the cluster-specific methylation sites were screened for genomic annotations. Additionally, the correlation of these specific methylation sites with the gene expression in the subgroups was explored. A total of 172 corresponding samples that had been detected with RNA-Seq were identified in the training set samples, gene expression profiles were extracted from these 172 samples to plot the expression profile heat map, and the correlation and consistency of the gene DNA methylation level with gene expression was observed. Finally, the corresponding genes in the promoter regions of these specific methylation sites were subjected to signalling pathway and transcription factor enrichment analysis using cluster Profiler [25] and g:profiler [26], respectively.

Construction and testing of the prognosis prediction model for HCC patients

The Bayesian network classifier was constructed using the above identified specifically expressed methylation sites in each subtype, and the model performance was judged through 10-fold cross-validation. Subsequently, the expression profile data of specific CpG methylation sites were extracted from both the training set and test set and were substituted into the model for calculation. In addition, the prediction results were calculated, and the prediction accuracy of the prognosis classification model as well as the identification stability of the methylation features were verified and analysed. (A study flowchart has been added, see Figure S1.tif).

Results

Mining characteristic DNA methylation sites based on the survival and prognosis data of HCC patients

First, a series of data downloaded from TCGA was pre-processed, including filling in the missing values, removing the CpG sites and single nucleotide sites from the sex chromosomes, removing the CpG sites with cross-reactivity from the genome, excluding samples with incomplete data, and randomly dividing the samples into a training set and test set (see Materials and methods). Afterwards, the methylation sites and survival data were subjected to univariate Cox proportional hazards regression model analysis (using the coxph function in the survival R package, see Table S5.txt), with the significance threshold set as P < 0.05. A total of 26,208 significantly different sites regarding prognosis were discovered, and the top 20 are shown in Table 1. Similarly, our findings revealed that the T, N and M stages were also significantly correlated with prognosis, with log-rank P values of 0.002446, 0.034366 and 0.00259, respectively. In addition, the univariate Cox model was used to select the significant methylation sites for multivariate Cox proportional hazards regression model analysis, with the T, N and M stages as the covariants in the model. Finally, a total of 10,699 significant methylation sites were obtained (see Table S6.txt). As a result, the methylation sites (n = 10,699) from the training set samples that were significant in both the univariate and multivariate analyses were selected as characteristic biomarkers for further analysis.

Table 1 The top 20 most significant methylation sites

Using the characteristic DNA methylation sites for the consistent clustering of HCC molecular subgroups

The Consensus Cluster Plus R software package was used for the consistent clustering of characteristic methylation sites that were significant in both the univariate and multivariate analyses, and the molecular subtypes were screened for subgroup classification. The similarity distance between samples was calculated by the Euclidean distance, K-means was used for clustering, 80% samples were sampled 100 times by adopting the resampling program, and the optimal cluster number was determined by the CDF. As shown in Fig. 1a, stable clustering results could be obtained when the cluster number was 6 or 7. Further observation of the CDF delta area curve (Fig. 1b) suggested that stable clustering results could be acquired when the cluster number was set as 7. Finally, k = 7 was selected to obtain 7 molecular subtypes.

Fig. 1
figure 1

The criteria for the selection of the number of categories. a The consensus among clusters for each category number k. b Delta area curve of consensus clustering, which indicates the relative change in area under the cumulative distribution function (CDF) curve for each category number k compared to that of k-1. The horizontal axis represents the category number k, and the vertical axis represents the relative change in area under the CDF curve

Methylation expression profile cluster analysis and clinical characteristic analysis based on the HCC molecular subgroups

The stable clustering results at k = 7 were selected according to consistent clustering. As presented in Fig. 2a, 174 tumour samples were assigned to these 7 subgroups. Furthermore, 10,699 methylation profiles were used for cluster analysis, and the distance between methylation sites was calculated by the Euclidean distance. Figure 2b shows that most methylation sites had low abundance in each sample, while the methylation expression profiles of these 7 clusters of samples were also greatly different. In particular, the methylation levels of the samples in Cluster5 and Cluster6 were markedly lower than those of the other groups.

Fig. 2
figure 2

The consensus matrix for DNA methylation classification with the corresponding heat map. a The colour-coded heatmap corresponding to the consensus matrix for k = 7 obtained by applying consensus clustering. The colour gradients from 0 to 1 represent the degree of consensus, with white corresponding to 0 and dark blue corresponding to 1. b The heatmap corresponding to the dendrogram in Figure a, which was generated using the pheatmap function with DNA methylation classifications, TNM stage, clinicopathological stage and histological type as the annotations

The distribution of the samples according to the 7 molecular subtypes regarding prognosis, T, N, and M stage, grade and age was further analysed, as displayed in Fig. 3. Figure 3a shows that the samples in Cluster5 and Cluster6 had the highest invasion degrees; Fig. 3b suggested that M1 samples were mainly distributed in Cluster4 and Cluster5; Fig. 3c indicated that the samples in Cluster5 and Cluster6 had high stages; Fig. 3d suggested that Cluster6 had the largest number of G4 samples; Fig. 3e presented the great difference in the age distribution among the different clusters; and Fig. 3f revealed a significant difference in prognosis among these 7 clusters of samples, among which the samples in Cluster2 had the best prognosis, while the samples in Cluster5 and Cluster6 had the poorest prognosis.

Fig. 3
figure 3

Comparison of prognosis, TNM stage, grade and age between each DNA methylation cluster. a The survival curves of each DNA methylation subtype in the training set. The horizontal axis represents the survival time (days), and the vertical axis represents the probability of survival. The metastasis b, stage score c, grade score d, age e and survival probability f distributions of each DNA methylation subtype in the training set. The horizontal axis represents the DNA methylation clusters

Screening of intra-group-specific methylation sites and enrichment analysis of transcription factors based on the HCC molecular subtype classification

To identify the methylation-based molecular subtypes of HCC, the EpiDiff software was used to identify specific methylation sites. For each cluster, the average value of the methylation levels of the 10,699 methylation sites was calculated, and the obtained 10,699 × 6 matrix was used as the input data of EpiDiff software, as presented in Table S7.txt. The threshold was set at 2.099, and 119 methylation sites were considered cluster-specific methylation sites, as displayed in Table S8.txt. The heat map is presented in Fig. 4a, which suggested that Cluster6 had the most specific methylation sites, and most of them were lowly methylated sites, while a small number of specific methylation sites also existed in the remaining clusters, which were mostly hypermethylated. In addition, there were no specific methylation sites in Cluster1. The modification level heat map of the specific methylation modification sites is illustrated in Fig. 4b. These 119 methylation sites were then subjected to genomic annotations, which obtained 105 genes near these methylation sites, as shown in Table S9.

Fig. 4
figure 4

The specific hyper/hypomethylated CpG sites for each DNA methylation cluster. a The display of specific CpG sites for each DNA methylation prognostic subtype. The red bars and blue bars represent hypermethylated CpG sites and hypomethylated CpG sites, respectively. b The heat map for the specific sites among the seven DNA methylation clusters. c The heatmap for the expression of specific related genes among the seven DNA methylation clusters

Moreover, the correlation of these specific methylation sites with gene expression in the subgroups was explored. A total of 172 corresponding samples that had been detected with RNA-Seq were identified in the training set samples, 105 gene expression profiles were extracted from these 172 samples (as presented in Table S10), and the expression profile heat map was further plotted. As shown in Fig. 4c, there were different expression patterns in these subgroups at the expression profile level, suggesting partial consistency between the DNA methylation levels of these genes and their expression.

Furthermore, to observe the mechanism of action of these specific methylation sites, KEGG enrichment analysis was carried out on the corresponding genes in the promoter regions of these specific methylation sites using the cluster Profiler R software package, as shown in Fig. 5a. It could be seen from the figure that these genes were enriched in multiple cancer-related pathways, especially the HCC pathway. Additionally, g: profiler was further used for transcription factor enrichment analysis, which revealed that these genes were enriched in the transcription factor KROX (TF: M00982), with a significant FDR value of 8.46e-03. The corresponding genes are shown in Fig. 5b, which suggested that 48 genes were enriched in KROX and involved 67 methylation sites.

Fig. 5
figure 5

Transcription factor and KEGG pathway enrichment analyses of the specific CpG sites. a KEGG enrichment analysis was carried out on the corresponding genes in the promoter regions. b The corresponding genes were enriched by transcription factor enrichment analysis

Construction of the prognosis prediction model for HCC patients and evaluation in the test set

To verify the differential ability of the identified specific methylation sites, the 119 specific methylation sites identified using EpiDiff software were used to construct a Bayesian network classifier, and the model performance was judged through 10-fold cross-validation. The classification accuracy of the model constructed based on the training set was 70.68%, and the area under the ROC curve (AUC) was 0.923 (Fig. 6a), suggesting that this classifier had a favourable classification performance.

Fig. 6
figure 6

Construction of the prognosis prediction model for HCC patients. a The ROC curve displays the sensitivity and specificity of the prognosis prediction model. The area under the curve (AUC) reached 0.923. b The colour-coded heatmap corresponding to the consensus matrix for k = 7 obtained by applying consensus clustering from the test set. c Survival curves of the seven clusters predicted from the test set using the prognosis model. The log-rank test was used to assess the statistical significance of the difference

Furthermore, to verify the stability and reliability of the model, the expression profile data of these 119 CpG methylation sites were extracted from the test set and incorporated into the model for model verification. The predicted results are shown in Table 2, which shows that the cluster statistic number of the training set was similar to that of the predicted samples. Furthermore, the methylation patterns of these 7 clusters of samples were plotted using the pheatmap R package (Fig. 6b). The heatmap showed the presence of markedly distinct methylation patterns among these 7 clusters of samples. Moreover, the differences in prognosis among these 7 clusters of samples were also analysed (Fig. 6c), and significant differences were found in the prognosis of these 7 clusters of samples (p = 0.00068). In addition, the Cluster2 samples showed markedly better prognoses over the other clusters of samples, which were consistent with the findings in the training set data.

Table 2 Statistics of various samples predicted in the test set

Finally, the clinical features of these 7 clusters of samples in the test dataset were analysed, and Fig. 7 shows that the clinical feature distribution in each cluster was consistent with that in the training set. Overall, the prognosis prediction model constructed using these 119 methylation profiles showed high prediction accuracy and stability in identifying the methylation features.

Fig. 7
figure 7

Verification of the stability and reliability of the prognosis prediction model for HCC patients in the test set. The topography score (a), lymphocyte infiltration (b), metastasis (c), stage score (d), grade score (e) and age (f) distributions of the DNA methylation clusters in the test set

Discussion

HCC is one of the most commonly seen malignant tumours and is the third leading the global digestive system tumour in terms of its mortality; the morbidity and mortality of HCC also show increasing trends [27, 28]. The leading causes of HCC are chronic hepatitis viral infection (such as HBV and HCV), liver cirrhosis, malnutrition, toxin invasion and metabolic disturbance. Currently, great achievements have been attained in the diagnosis and treatment of HCC, but its pathogenesis remains incompletely elucidated.

Studies have shown that tumours are highly heterogeneous [29], and heterogeneity is the main cause of the inconsistent clinical treatment of tumours. Finn et al. showed different clinical outcomes of different molecular subtypes of HCC, and different molecular subtypes in tumours respond differently to clinical therapeutic drugs [30]. At present, epigenetic abnormalities have been considered to be characteristic of many types of cancer [31]. The epigenome map can be used to identify the molecular subtypes of a particular cancer and is related to clinical outcomes [32]. For example, the CpG island methylation phenotype is found in many types of cancer, abnormal DNA methylation patterns can be used to detect the presence of precancerous lesions or malignant cells in tissue sections [33,34,35]. Abnormal DNA methylation can also be used to detect free tumour DNA in the blood of some cancer patients [36]. Seven different molecular subtypes were found in our study. They not only have different epigenetic patterns but also have significant differences in clinical characteristics such as age, tumour stage, histological type, and prognosis. These results indicate that the individuals in these clusters may have different responses to clinical drugs and provide a reference for personalized therapy and drug development. We further identified subtype-specific methylation sites that can be used as markers to predict patient outcomes.

At present, a large amount of literature has reported the role of DNA methylation in HCC diagnosis, treatment and prognostication. First, compared with genetic mutations, DNA methylation changes are earlier events during the cell carcinogenesis process, while the DNA decomposed after tumour tissue necrosis will enter the peripheral blood (plasma or serum) or other body fluids (such as saliva or sputum). Therefore, detecting the methylation status of the isolated tumour-related genes in body fluids may be applied in the early diagnosis of HCC. For instance, the promoter methylation status of two factors, TFPI2 and IGFBP7, in serum is markedly higher than that in normal subjects and hepatitis patients and may thereby become a non-invasive molecular biomarker for the early diagnosis of HCC in the clinic [37, 38]. Second, DNA methylation is mediated by methyltransferase, and it has become a novel approach in the field of HCC treatment to explore therapeutic strategies based on changes in DNMT activity and DNA methylation patterns. In addition, preclinical studies have verified that drugs targeting the DNMT action substrate (azacitidine) and the DNMT cofactor SAM (sinefungin) as well as the anti-SAM metabolite neplanosin, and the non-competitive non-nucleotide transmethylase inhibitor procainamide can reverse the abnormal methylation in tumour cells, re-express the inactivated genes, and suppress cancer cell proliferation and invasion [39,40,41]. Finally, DNA methylation also plays an important role in the prognosis of HCC patients. The CpG island methylation phenotype (CIMP) refers to the presence of methylation in the CpG island of multiple gene promoter regions. Some articles have suggested that patients with high tumour CIMP are associated with a markedly lower survival rate than those with low or no CIMP, while patients with various CIMP degrees also have evidently different tumour metastasis and TNM stages [42]. In addition, the correlation between the methylation of multiple gene promoter sequences and patient prognosis has also been verified, such as Tip30, RASSF1A and CD147 [43].

Nevertheless, the specific methylation sequences in the promoter regions of these gene remain unclear, and it is still unknown whether the methylation of these genes displays clinical significance compared with tumour classification, survival and prognosis in large-sample HCC patient data. This study attempted to establish a classification method that integrates several DNA methylation markers based on solving these problems in order to help clinicians effectively assess the prognosis of HCC patients and select the appropriate therapeutic strategies for HCC patients.

Conclusions

This study identified specific methylation sites based on the TCGA database and serial bioinformatics methods at the prognosis level and constructed a prognosis prediction model for HCC patients. This model contributes to clinically identifying novel HCC markers, providing multiple targets for the precision medicine of HCC, and more accurately subtyping HCC patients at the molecular subtype level. More importantly, this model is promising for providing certain assistance and guidance for clinicians regarding prognosis judgement, clinical diagnosis and medication for HCC patients with different epigenetic subtypes.