DNA methylation-based subtypes of acute myeloid leukemia with distinct prognosis and clinical features

Acute myeloid leukemia (AML) is a malignancy of the stem cell precursors of the myeloid lineage. DNA methylation is an important DNA modification that regulates gene expression. Investigating AML heterogeneity based on DNA methylation could be clinically informative for improving clinical diagnosis and prognosis. The AML subtypes based on DNA methylation were identified by unsupervised consensus clustering. The association of these subtypes with gene mutation, copy number variations, immune infiltration and clinical features were further explored. Finally, univariate, LASSO and multivariate cox regression analyses were used to identify prognosis-associated genes and construct risk model for AML patients. In addition, we validated this model by using other datasets and explored the involved biological functions and pathways of its related genes. Three CpG island methylator phenotypes (CIMP-H, CIMP-M and CIMP-L) were identified using the 91 differential CpG sites. Overall survival, morphology, macrophages M0 and monocytes were distinct from each other. The most frequently mutated gene in CIMP-L was DNMT3A while which in CIMP-M that was RUNX1. In addition, the TIDE scores, used to predict the response to immune checkpoint inhibitors, were significantly different among CIMPs. The CIMP-associated prognosis risk model (CPM) using 32 key genes had convinced accuracy of prediction to forecast 0.5-year, 1-year, 3-year and 5-year survival rates. Moreover, the risk score-related genes were significantly enriched in pattern specification process, regionalization, embryonic organ morphogenesis and other critical cancer-related biological functions. We systematically and comprehensively analyzed the DNA methylation in AML. The risk model we constructed is an independent predictor of overall survival in AML and could be used as prognostic factor for AML treatment. Supplementary Information The online version contains supplementary material available at 10.1007/s10238-022-00980-4.


Introduction
Acute myeloid leukemia (AML) is a malignancy of the stem cell precursors of the myeloid lineage [1].It is featured by abnormal immature blast cells expansion, which leads to bone marrow failure and ineffective erythropoiesis [2][3][4].Symptoms of AML include anemia, bleeding, infection, fever, and enlargement of liver, spleen and lymph nodes, and bone pain [1].A numerous studies have shown that radiation, chemical benzene, viruses and genetic factors could induce AML.It is rare in children but most commonly occurs among old people, accounting for about 90% of all leukemia in adults [5].
DNA methylation is one of the most common epigenetic modifications.It is a biological process by which methyl groups are added to the DNA molecule.Methylation can alter the activity of a DNA segment without changing the sequence.Typically, DNA methylation located in a gene promoter acts to repress the gene transcription.DNA methylation has been recognized as an important component of cancer development.In cancer, gene promoter CpG islands acquire abnormal hypermethylation leading to transcriptional silencing [6].Hypomethylation generally is related to chromosomal instability and loss of imprinting, while hypermethylation is associated with promoters and arise secondary to gene (oncogene/suppressor) silencing.Typically, there is hypermethylation of tumor suppressor genes and hypomethylation of oncogenes [7].
Abnormal methylation in AML is mainly caused by the methylation of CpG islands of key genes, which causes the gene silencing and affects the normal function of cells.Methylation research at gene level provides a new insight for elucidating the occurrence and development of AML.Here, we systematically studied the DNA methylation of AML by integrating TCGA and GEO datasets.We used unsupervised consensus clustering to identify the molecular subtypes of AML based on DNA methylation, and fully studied the association of these subtypes with gene mutation, copy number variations, immune infiltration and clinical features.Finally, we used univariate, LASSO and multivariate cox regression analyses to identify prognosis-associated genes and employed them to construct risk model for AML patients.

Preprocess of DNA methylation data from TCGA
The DNA methylation data (CpG sites) were obtained from TCGA according to the following criteria: removal of probes with low quality; removal of probes located in X or Y chromosome; removal of probes associated with single nucleotide polymorphisms (SNPs); and removal probes mapping not uniquely to the human reference genome.

Identification of prognosis-related DNA methylation sites
The differential DNA methylation sites were identified as follow: the standard deviation (SD) of beta value in AML samples > 0.2 and the mean of beta value in normal samples < 0.05.Kaplan Meier (KM) survival curve and log rank test were further used to screen these differential CpG sites that were significantly related to prognosis (P < 0.05).

Unsupervised clustering of DNA methylation in AML
The prognosis-related DNA methylation sites identified above were then analyzed by unsupervised K-means consensus clustering in ConsensusClusterPlus [12] R package.The best number (K = 3) of consensus clustering was evaluated based on the consensus matrix.K-means clustering of

Association analysis of CIMP and clinical characteristics
The overall survival (OS) among different CIMP was assessed using KM survival curve and log rank test to investigate the association between CIMP and clinical prognostic in AML.Chi-square or Fisher exact test were used to determine the statistical significance of other clinical features' distribution (age, gender, etc.) in different CIMP subtypes.

Gene set enrichment analysis (GSEA)
The gene expression profiles corresponding to the differential DNA methylation sites were first selected.GSEA analysis [13] was then performed by GSEA software using msigdb.v7.0.symbols.gmtgene set [14].

Association analysis of CIMP with gene mutation, CNV and immune infiltration
The maftools [15] R package was used to retrieve the gene mutations and calculate the tumor mutation burden (TMB) for each AML patient.The GISTIC [16,17] module in GenePattern [18] was used to predict the significantly CNV regions.Single sample GSEA (ssGSEA) was carried out by GSVA [19] package in R to calculate the infiltration level of 24 different immune cells.TIDE [20] was used to evaluate the response to immune checkpoint inhibitors for all AML patients.Wilcox test and Kruskal-Wallis were used to determine the statistical significance of immune infiltration cell proportion and response to immune checkpoint inhibitors in different CIMP subtypes.

Construction of CIMP-associated prognosis risk model (CPM)
The differentially expressed genes (DEGs) were firstly identified by Deseq2 [21] with using raw counts calculated by HTSeq [22], based on the CIMP subtypes.Univariate cox regression analysis was then carried out to screen DEGs that were significantly associated with prognosis (P < 0.05).
The survival-related genes were finally filtered by LASSO regression using glmnet [23] package in R to select key genes.LASSO regression is a popular variable selection method by fitting generalized linear models that reduces variable number by using penalty algorithms to effectively avoid overfitting and finally obtain more widely used models.A risk score was then established and formulated as below: where N is the number of genes, Exp(i) is the gene expression of each gene, while Coe(i) is the coefficient of LASSO Cox regression.The risk score is equal to the sum of LASSO Cox regression of each gene multiplied by the gene expression levels of each gene.Based on the median risk score, patients were divided into two groups: high-and low-risk groups.KM survival curves were drawn with the R package survminer.Log-rank test was performed to evaluate the statistical significance.The accuracy of risk score to predict 0.5-year, 1-year, 3-year, and 5-year OS rates of AML patients was evaluated using the pROC package in R.P value < 0.05 was considered as statistically significant.

Validation of CIMP-associated prognosis risk model
GSE12417 dataset with gene expression profiles and corresponding clinical data was used to validate the CIMPassociated prognosis risk model.Other AML-related key genes from previous study were also used to validate our risk model [24].

Gene ontology and KEGG pathway enrichment analysis
The correlation between risk score and differentially expressed genes was calculated by Pearson correlation analysis.The genes with correlation coefficient > 0.5 and P value < 0.05 were regarded as the risk model-related genes.These genes were further used for Gene ontology [25,26] and Kyoto Encyclopedia of Genes and Genomes (KEGG) [27][28][29][30] pathway enrichment analyses, with using cluster-Profiler [31] package in R.

Nomogram construction based on risk score and clinical characteristics
Univariate and multivariable cox regression were used to identify the prognostic-related features in clinical.The nomogram using prognostic-related features was built to predict 1-year, 3-year, and 5-year OS rates of AML patients using rms R package.A receiver operating characteristic (ROC) curve and a calibration curve were constructed to evaluate the accuracy of the nomogram.

Workflow
The workflows of the analytical procedures are indicated in Fig. 1 in this study.

Prognosis role of DNA methylation in AML
Developing new prognostic markers to improve therapeutic effects in AML is quite urgent.In this study, we identified differential DNA CpG sites between AML samples (from TCGA) and normal samples (from GES105420).The 91 differential CpG sites, which were significantly related to prognosis (P < 0.05), were finally obtained via survival analysis.Then, unsupervised consensus clustering of AML was performed using ConsensusCluster-Plus R package.The best K of consensus clustering was 3, based on the consensus matrix (Fig. 2A).Moreover, K-means clustering of all AML samples was performed to identify CpG island methylator phenotype (CIMP).Three CIMPs (CIMP-H、CIMP-M and CIMP-L) were obtained, which were corresponding to high, medium and low DNA methylation levels, respectively (Fig. 2B).
We next investigated the overall survival (OS) among the 3 CIMPs and found CIMP-L and CIMP-H had the best and worst clinical outcomes, respectively (Fig. 2C), indicating DNA methylation sties as excellent prognosis of AML.Moreover, we explored the distribution of clinical features in different CIMP subtypes, including age, gender, morphology and race.Only morphology was found to be significantly associated with CIMP subtypes (Table 1, P < 0.05).

Gene set enrichment analysis (GSEA) and CIMP-associated phenotype difference
GSEA analysis results showed that CIMP subtypes were all significantly enriched in DNA binding transcription factor activity, animal organ morphogenesis and other pathways.(Supplementary Figs.1-3).
The distribution of different immune cells in three CIMP subtypes is shown in Supplementary Fig. 4. Of those, the infiltration levels of eosinophils, macrophages, monocytes and neutrophils were significantly different among CIMP subtypes (Fig. 3A).Macrophages M0 and monocytes were significantly different among CIMP-L, CIMP-M and CIMP-H comparisons (Fig. 3B).
There were 78 patient samples possessing both gene mutation and DNA methylation information.Of them, 53, 22 and 3 patients were assigned to CIMP-L, CIMP-M and CIMP-H subtype groups, respectively.We investigated the top 10 mutated genes according to the CIMP subtypes and found gene DNMT3A mutated in 8% of 78 samples, followed by FLT3 (7%) and NPM1 (6%) (Fig. 4A).In order to further analyze the specific mutation of the three CIMP subtypes, each subtype was divided into two groups based on the median value of the TMB(Fig.4B).Surprisingly, we found the most frequently mutated gene in CIMP-L was DNMT3A, while RUNX1 was the most frequently mutated gene in CIMP-M.Moreover, we used GISTIC module in GenePattern to investigate the association of CNV and CIMP subtypes, which is demonstrated in Fig. 4C.There were 78 patients possessing both gene mutation and DNA methylation information.A total of 137 patients were finally obtained, of which 94 were CIMP-L, 34 were CIMP-M and 9 were CIMP-H.
When predicting the response to immune checkpoint inhibitors, we found TIDE scores were significantly different among different CIMP subtypes (Fig. 5A, P = 0.016).In addition, it was found that CIMP subtypes could distinguish well the effect of immune checkpoint treatment via using pROC R package (AUC = 0.6582, Fig. 5B).

Construction of CIMP-associated prognosis risk model (CPM)
We identified 32 key genes and used them to construct the CIMP-associated prognosis risk model.Briefly, the risk score is equal to the sum of LASSO regression coefficient multiplied by the gene expression levels of each gene (Table 2).
Based on the median risk score, we divided the patients into two groups: high-and low-risk groups.We found the high risk group was significantly related to better survival (Fig. 6A).The AUC values of risk score to predict 0.5-year, 1-year, 2-year, and 3-year OS rates of AML patients were 0.5265, 0.6182, 0.6496 and 0.6567, respectively(Fig.6B 3-year AUC = 0.6567).Moreover, we validated this finding by using GSE12417 dataset.As expectedly, low risk group was significantly related to poor survival in GSE12417 and the AUC values of survival was larger than 0.67 (Fig. 6C,  D).Moreover, we employed other AML-related key genes including NOTCH1, NOTCH2, NOTCH3, NOTCH4, JAG-GGED2 and DLL-3, to validate our risk model and got similar results (Fig. 6E, AUC > 0.6).
We found these genes were significantly enriched in pattern specification process, regionalization, embryonic organ morphogenesis and other critical cancer-related biological functions (Fig. 6F).There were no significantly enriched KEGG pathways of these genes.

Nomogram construction based on risk score and clinical characteristics
We firstly used univariate and multivariable cox regression to identify the prognostic-related features in clinical, and found age was significantly associated with survival (Fig. 7A).Next we constructed the nomogram consisting of age, CIMP subtype and CPM to predict 1-year, 2-year, and 3-year OS rates of AML patients (Fig. 7B).The results of the nomogram and COX regression analysis were similar.The impact of CPM on the survival rate was small, and basically did not lead to a significant decrease in the survival rate; while CIMP type basically had no impact on the survival rate.However, age was extremely correlated with the survival rate.With the increase in age, the 1-year, 2-year and 3-year survival rates decreased significantly.Calibration curves were applied to evaluate performance of the nomogram, and it was shown that the nomogram had significant accuracy of prediction (Fig. 7C).

Discussion
AML is a malignant disease of the hematopoietic stem cells' disorder.Although advancements in therapeutic regimens have been achieved, the 5-year OS is still very poor (about 40%), especially in elderly population older than 60 years old.Therefore, it is quite urgent and necessary to develop new diagnostic and prognostic markers to improve therapeutic effects in AML.DNA methylation plays an important role in cancer occurrence and development by regulating gene expression.Abnormal DNA methylation provide potential biomarkers for the early diagnosis and prognosis of cancer [32][33][34][35].Here, we systematically investigated the DNA methylation and their prognostic roles in AML via integrative analyses by using TCGA and GEO datasets.There were 91 differential CpG sites in total that were significantly related to prognosis (P < 0.05) in AML by integrating TCGA and GES105420 data.Moreover, we identified 3 CpG island methylator phenotypes (CIMP-H、CIMP-M and CIMP-L) through unsupervised consensus clustering and K-means clustering.The survival curves of those CIMPs were distinct from each other.Interestingly, we found morphology was significantly related to CIMP subtypes (P < 0.05).All the 3 CIMPs were significantly enriched in DNA binding transcription factor activity and animal organ morphogenesis pathways.Particularly, dysregulated transcription factors regulate abnormal gene expression, including cell death program, differentiation blockade and hallmarks of cancer [36], which represent a unique class of drug targets for future cancer treatment.
By predicting infiltration levels of the 24 immune cells in AML, we discovered macrophages M0 and monocytes were significantly different among CIMP subtypes.Yuan et al. have reported that M0 could increase cancer invasion ability in lung cancer [37].Moreover, monocytes can influence multiple aspects of cancer development, including anti-tumor immunity, recurrence and metastasis [38].Elevated tumorassociated monocytes/macrophages could dampen antitumor immune response, leading to poor survival [39].This is coincidently consistent with our findings.What's more, we found the most frequently mutated gene in CIMP-L was DNMT3A while in CIMP-M that was RUNX1.DNMT3A mutations in AML were associated with poor event free and overall survival [40].RUNX1 is a relatively infrequent mutational target in AML [41].Inhibitors of DNMT3Aand RUNX1 could be potentially used for CIMP-Land CIMP-M patients, respectively, in AML to improve their clinical outcomes [42,43].In addition, the TIDE scores, used to predict the response to immune checkpoint inhibitors, were significantly different among CIMPs.
Next, we constructed the CIMP-associated prognosis risk model (CPM) using 32 key genes.Impressively, we found the high risk group was significantly related to better survival.The AUC values of risk score to predict 0.5-year, 1-year, 3-year, and 5-year OS rates of AML patients were 0.5265, 0.6182, 0.6496 and 0.6567, respectively.Moreover, we validated this finding by using GSE12417 dataset and other AML-related key genes.These finding indicated the CPM we constructed had significant accuracy of prediction to forecast 0.5-year, 1-year, 3-year and 5-year survival rates.Moreover, we the risk score-related genes were significantly enriched in pattern specification process, regionalization, embryonic organ morphogenesis and other critical cancerrelated biological functions.Finally, we built the nomogram consisting of age, CIMP subtype and CPM and the calibration curves demonstrated the nomogram had significant accuracy of prediction of 1-year, 2-year, and 3-year survival rates.
In conclusion, our study provided a systematic and comprehensive view of the DNA methylation in AML.We identified 3 CIMP subtypes based on DNA methylation.Moreover, the CIMP-associated prognosis risk model we constructed using the 33 genes is an independent predictors of OS in AML and could be used as prognostic factor for AML treatment.

Fig. 2
Fig. 2 Clustering analysis of DNA methylation in AML samples.A Unsupervised consensus clustering of AML samples over prognosisrelated DNA methylation sites (K = 3).1, 2 and 3 represent CIMP-H、CIMP-M and CIMP-L, respectively.B Unsupervised K-means clustering of prognosis-related CIMPs in AML.Red and blue indicate

Fig. 3
Fig. 3 Association between CIMP and immune infiltration levels.The infiltration levels of 24 different immune cells were predicted by ssGSEA.A Distribution of 24 immune cells' proportion among 3 CIMPs.The infiltration levels of eosinophils, macrophages, monocytes and neutrophils were significantly (P < 0.05) different among

Fig. 4
Fig. 4 Association between CIMP and genomic alterations.A. Mutation profiles of the top 10 mutated genes according to CIMP subtypes.The top and right panel represent the mutation burden and fraction mutation, respectively.Among the 78 patient samples with both gene mutation and DNA methylation information.53,22 and 3

Fig. 5
Fig. 5 Association between CIMP and response to immune checkpoint inhibitors.A Boxplot representation of TIDE scores among different CIMP subtypes.B ROC curve of prediction of response to

Fig. 6
Fig. 6 Construction of CIMP-associated prognosis risk model (CPM).Patients were divided into high-and low-risk groups based on the median risk score.A Kaplan-Meier OS survival plot of highrisk versus low-risk score groups.B ROC curve and AUC values of prediction to forecast 0.5-year, 1-year, 3-year and 5-year survival

Fig. 7
Fig. 7 Nomogram for prediction of 1-, 2-and 3-year OSrates of AML patients.A Forest plots of age, CPM score, morphology and gender.B Nomogram for the prediction of 1-year, 2-year, and 3-year OS rates of AML patients.Find the value on the variable axis and draw a vertical line upward to the 'Points' axis and determine the corresponding point for the variable.Add up the points obtained for each variables and locate this sum on the 'Total Points' axis.Draw a vertical line

Table 1
Clinical characteristics of AML samples in different CIMPs

Table 2
LASSO regression coefficients of the 32 key genes