Introduction

Chronic myeloid leukemia (CML) is a clonal myeloproliferative disorder of a pluripotent stem cell, caused mainly by disordered differentiation and maturation of hematopoietic stem cells. The annual incidence is about 1.3 per 100,000, and the disease is slightly more common in males than in females. Its hallmark is the Philadelphia chromosome, which results from the balanced translocation t(9;22)(q34;q11) [1]. At present, ABL kinase inhibitors (e.g. imatinib) used to treat CML effectively inhibit BCR-ABL kinase activity, suppress the malignant proliferation of leukemia cells, and significantly extend patient survival. During treatment, the drug response of CML reaches a stable point [2]: the patient’s condition eases gradually before this point and stabilizes after it. The stable point is difficult to identify by clinical observation alone. Therefore, there is an urgent need to discover and validate stable points in CML drug therapy through bioinformatics.

Increasing evidence suggests that mathematical models can help elucidate mechanisms and provide quantitative predictions in cancer research [3], and the combination of models with clinical information has yielded useful suggestions for treatment [4]. Sasaki K et al. used a robust linear regression model to define the best-fit average molecular response. The probability of reaching the optimal values predicted by the model was then used to decide whether to continue treatment [5]. In addition, traditional biomarkers cannot identify the critical state that precedes disease deterioration. To address this, Chen LN et al. [6] proposed the theory of dynamic network biomarkers (DNB), which analyzes the dynamic signals of a DNB when the system is at the critical point, and put forward three universal properties of DNB [7, 8]. Markus AD et al. showed that a system at the critical point enters the disease state quickly under certain triggering factors, so the critical point can be treated as an early warning signal for complex diseases [9]. By comparing dynamic biomarkers with static biomarkers of complex diseases, Lesterhuis WJ et al. found that dynamic network biomarkers can identify critical points in the state of the system [10]. Combined with the advantages of high-throughput gene expression data, many studies have shown that DNB is a promising candidate biomarker for clinical trials and clinical detection of complex diseases [11].

Advanced high-throughput technology makes it possible to obtain gene or protein expression data with dynamic measurements. To detect the therapeutic effect of CML medication from a small amount of high-throughput data, we propose a DNB-based strategy for recognizing the therapeutic effect from the gene expression data of CML patients. In this study, the datasets are divided into a treatment group and a control group, and differentially expressed genes (DEGs) are selected by t-test. The DEGs are clustered into 60 categories by hierarchical clustering. Then, according to the three criteria for DNB identification proposed by Chen, a group of 250 genes is selected as the DNB. A therapeutic effect index (TEI) is then constructed to observe the dynamic change of the DNB and to predict and determine when the system is in the pre-stable state. Finally, functional enrichment analysis is performed on the DNB, and the role of the DNB in CML is studied by KEGG enrichment analysis and literature mining.

Materials and methods

Datasets

Three datasets from the Gene Expression Omnibus (GEO) database of the National Center for Biotechnology Information, GSE33075, GSE12211, and GSE24493, are used to analyze treatment time. The raw CEL files are first normalized by Robust Multichip Averaging (RMA) as implemented in the affy package, which returns log2-transformed intensities [12], and the probe sets are mapped to unique gene symbols by averaging. Probe sets without a corresponding gene symbol are not considered. Because experimental data are limited, the GEO datasets are combined to obtain 39 chips. The dataset information is shown in Table 1. In this study, samples of diagnosed CML are defined as the control group. Keeping only the genes shared by all of the GEO datasets yields 8927 genes. The COMBAT method is used to adjust for batch effects [13]. The experimental information of the datasets is shown in Table 2. Figure 1 shows box plots of the expression distribution before and after removing batch effects.
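The probe-to-gene averaging and the restriction to genes shared by all datasets can be illustrated with a minimal sketch. It is not the code used in the study: the function names, probe identifiers and intensity values below are hypothetical, and RMA normalization and batch correction are assumed to happen elsewhere.

```python
from statistics import mean


def collapse_probes(expr, probe_to_gene):
    """Average log2 intensities of probe sets that map to the same gene symbol.

    expr: dict probe_id -> list of log2 intensities (one value per chip)
    probe_to_gene: dict probe_id -> gene symbol (missing or empty = unannotated)
    """
    grouped = {}
    for probe, values in expr.items():
        gene = probe_to_gene.get(probe)
        if not gene:  # drop probe sets without a gene symbol
            continue
        grouped.setdefault(gene, []).append(values)
    # average chip by chip over all probe sets of the same gene
    return {g: [mean(col) for col in zip(*rows)] for g, rows in grouped.items()}


def merge_on_common_genes(datasets):
    """Keep only genes present in every dataset and concatenate their chips."""
    common = set(datasets[0])
    for d in datasets[1:]:
        common &= set(d)
    return {g: [v for d in datasets for v in d[g]] for g in sorted(common)}


# Toy data (hypothetical probe IDs and intensities)
ds1 = collapse_probes(
    {"p1": [8.0, 9.0], "p2": [10.0, 11.0], "p3": [5.0, 5.5], "p4": [7.0, 7.0]},
    {"p1": "ABL1", "p2": "ABL1", "p3": "BCR", "p4": ""},
)
ds2 = collapse_probes(
    {"q1": [6.0], "q2": [4.0]},
    {"q1": "ABL1", "q2": "CRK"},
)
merged = merge_on_common_genes([ds1, ds2])
```

In this toy case only ABL1 is present in both datasets, so the merged matrix has one gene and three chips.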

Fig. 1

The box plots of data expression. The combined dataset is visually displayed by the gene box plot. On the left side, the three datasets are merged without any transformation. On the right side, the three datasets are merged with the COMBAT method. After removing batch effects, the distribution of genes is more similar than before

Table 1 The information of dataset
Table 2 The experiment information of dataset

Student’s t-test is used to assess the significance of expression differences between the control group and the treatment group when selecting DEGs. The p-value calculated by the t-test is used directly for the subsequent filtering analysis with multiple testing corrections. The thresholds are a p-value of 0.05 and a fold change of 1.5. The volcano plot is shown in Fig. 2.

Fig. 2

The volcano plot of DEGs
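The DEG filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: it assumes the expression values are already log2-transformed (so a fold change of 1.5 corresponds to an absolute difference of log2(1.5) between group means), uses the pooled-variance form of Student’s t-test, omits any multiple testing correction, and obtains the p-value by numerically integrating the t density. All function names and toy values are hypothetical.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson integration of the density on [0, |t|].

    steps must be even; math.gamma limits this sketch to moderate df.
    """
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps
    s = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        s += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = s * h / 3  # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * area)


def student_t_test(a, b):
    """Pooled-variance two-sample t-test; returns (t statistic, p-value)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, two_sided_p(t, df)


def select_degs(control, treat, p_cut=0.05, fc_cut=1.5):
    """Return genes with p < p_cut and |log2 fold change| >= log2(fc_cut)."""
    degs = []
    for gene in control:
        log_fc = mean(treat[gene]) - mean(control[gene])  # data assumed log2
        _, p = student_t_test(treat[gene], control[gene])
        if p < p_cut and abs(log_fc) >= math.log2(fc_cut):
            degs.append(gene)
    return degs


# Toy log2 expression values (hypothetical)
control = {"G1": [5.0, 5.2, 4.9, 5.1], "G2": [7.0, 7.1, 6.9, 7.0]}
treat = {"G1": [7.0, 7.3, 6.8, 7.1], "G2": [7.05, 7.0, 7.1, 6.95]}
degs = select_degs(control, treat)
```

Here G1 passes both thresholds, while G2 is rejected because its mean shift is far below log2(1.5).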

Identify pre-stable state based on DNB

We assume the reference sample data is C(t), an n-dimensional vector of observed values or molecular concentrations (e.g. gene expression or protein expression) at time t (t=0, 1,...), where t may be measured in, e.g., minutes, hours or days. The Pearson correlation coefficient (PCC) [14] between two genes x and y in the reference sample data can then be calculated as

$$\begin{array}{*{20}l} PCC(x,y)=\frac{\sum_{i=1}^{n} (x_{i}-\bar{x})(y_{i}-\bar{y})}{\sqrt{\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}} \end{array} $$
(1)

where \(x_{i}\) and \(y_{i}\) represent the expression of gene x and gene y in the ith sample of the reference data, respectively, and \(\bar {x}\) and \(\bar {y}\) represent the average expression of gene x and gene y in the reference sample, respectively.
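As a small worked illustration of Eq. (1), the following sketch computes the PCC for toy expression vectors. The function name and the values are hypothetical and serve only to show the arithmetic.

```python
import math


def pcc(x, y):
    """Pearson correlation coefficient of two equally long vectors, as in Eq. (1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


# Toy expression vectors over samples (hypothetical values)
r_same = pcc([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])      # perfectly co-varying
r_opposite = pcc([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0])  # perfectly opposed
r_partial = pcc([1.0, 2.0, 3.0], [1.0, 3.0, 2.0])             # partially correlated
```

For the third pair the numerator is 1 and the denominator is 2, so the PCC is 0.5.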

The reference sample data C(t) can be divided into two groups, the control group \(C_{control}(t)\) and the treatment group \(C_{treat}(t)\), as follows

$$\begin{array}{*{20}l} C_{control}(t)=(C_{control}^{1}(t),...,C_{control}^{n}(t)) \end{array} $$
(2)
$$\begin{array}{*{20}l} C_{treat}(t)=(C_{treat}^{1}(t),...,C_{treat}^{n}(t)) \end{array} $$
(3)

There are \(S_{t}\) samples at time t for each gene or protein (see Fig. 3). Because the expression values of different genes or proteins differ widely, the expression data are standardized as follows

$$\begin{array}{*{20}l} \tilde{C}=\frac{C_{treat}-mean(C_{control})}{SD(C_{control})} \end{array} $$
(4)
Fig. 3

Sampling time and samples for the measured high throughput data

where \(\tilde {C}\) represents the standardized expression data for a gene or protein in each sample, and \(mean(C_{control})\) and \(SD(C_{control})\) are the mean and standard deviation in the control samples, respectively. The standardized matrix is then

$$\begin{array}{*{20}l} \tilde{C}= \left[ \begin{array}{cccc} \tilde{c_{11}}& \tilde{c_{12}}&...&\tilde{c_{1t}}\\ \tilde{c_{21}}&\tilde{c_{22}}&...&\tilde{c_{2t}}\\...&...&...&...\\ \tilde{c_{n1}}&\tilde{c_{n2}}&...&\tilde{c_{nt}} \end{array} \right] \end{array} $$
(5)

where \(\tilde {c_{nt}}\) denotes the standardized data of the nth reference sample at time t.
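The standardization in Eq. (4) can be sketched per gene as below. This is an illustration under assumptions: the sample standard deviation (n-1 denominator) is used because the text does not state which form is meant, and the function name and values are hypothetical.

```python
from statistics import mean, stdev


def standardize(treat, control):
    """Apply Eq. (4) gene by gene: (treat - mean(control)) / SD(control).

    treat, control: dict gene -> list of expression values.
    Uses the sample SD (n - 1 denominator), which is an assumption.
    """
    out = {}
    for gene, values in treat.items():
        mu = mean(control[gene])
        sd = stdev(control[gene])
        out[gene] = [(v - mu) / sd for v in values]
    return out


# Toy values (hypothetical): control mean is 10 and control SD is 2
control = {"ABL1": [8.0, 10.0, 12.0]}
treat = {"ABL1": [14.0, 10.0, 6.0]}
z = standardize(treat, control)
```

With these values the treated samples map to 2, 0 and -2 control standard deviations from the control mean.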

Potential DNB modules can be detected from the gene expression of specific samples. For these samples, DEGs are clustered by hierarchical clustering analysis. According to the three criteria for DNB identification proposed by Chen [15], the optimal group of genes or proteins is selected as the DNB and labeled \(C_{DNB}\); the remaining groups are labeled \(C_{other}\). During disease treatment, the pre-stable state is defined as the key point before which the state of the DNB changes sharply and after which it remains relatively stable. After identifying the DNB, the TEI at each time can be constructed based on the following three criteria:

(i) As the system approaches the pre-stable state, the average coefficient of variation (CV) of the molecules in the DNB group decreases rapidly and then approaches the CV value of the healthy state.

(ii) The average PCC of the molecules in the DNB group decreases rapidly in absolute value and then approaches the PCC value of the healthy state.

(iii) The average PCC between molecules in the DNB group and molecules outside it (OPCC) increases rapidly in absolute value and then approaches the OPCC value of the healthy state.

Therefore, the TEI at each time can be constructed as:

$$\begin{array}{*{20}l} TEI_{t}=\frac{CV_{t} \times{PCC_{t}}}{OPCC_{t}} \end{array} $$
(6)

where

$$\begin{array}{*{20}l} CV_{t}=\frac{SD(C_{DNB}(t))}{mean(C_{DNB}(t))} \end{array} $$
(7)
$$\begin{array}{*{20}l} PCC_{t}=\frac{cov(c_{i_{1}t},c_{i_{2}t})}{\sigma(c_{i_{1}t})\sigma(c_{i_{2}t})} \end{array} $$
(8)
$$\begin{array}{*{20}l} OPCC_{t}=\frac{cov(c_{it},c_{jt})}{\sigma(c_{it})\sigma(c_{jt})} \end{array} $$
(9)

with i = 1, 2,..., the number of molecules in the DNB and j = 1, 2,..., the number of molecules outside the DNB. Here \(PCC_{t}\) is the average PCC within the DNB group at time t in absolute value, \(OPCC_{t}\) is the average PCC between the DNB group and the molecules outside it at time t in absolute value, and \(CV_{t}\) is the coefficient of variation of the DNB group at time t. According to the characteristics of the treatment, the TEI changes slowly at the beginning of treatment, decreases rapidly to its lowest value (i.e., reaches the pre-stable state) after a period of treatment, and then approaches the TEI value of the healthy state.
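Equations (6) to (9) can be illustrated with a minimal sketch. Several details are assumptions rather than statements from the text: the CV is taken as the mean of per-gene SD divided by the absolute per-gene mean, the PCC and OPCC are averaged over all gene pairs in absolute value, and the pre-stable state is taken as the time point with the smallest TEI. The function names and toy values are hypothetical.

```python
import math
from itertools import combinations, product
from statistics import mean, stdev


def pcc(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def tei(dnb, other):
    """TEI = CV * PCC / OPCC for one time point.

    dnb, other: dict gene -> expression values over the samples at that time.
    """
    # average coefficient of variation of the DNB genes (assumed per-gene form)
    cv = mean(stdev(v) / abs(mean(v)) for v in dnb.values())
    # average |PCC| over all pairs of DNB genes
    pcc_in = mean(abs(pcc(dnb[a], dnb[b])) for a, b in combinations(dnb, 2))
    # average |PCC| between DNB genes and genes outside the DNB
    pcc_out = mean(abs(pcc(dnb[a], other[b])) for a, b in product(dnb, other))
    return cv * pcc_in / pcc_out


# Toy data for two time points (hypothetical genes and values)
series = {
    "t1": ({"g1": [1.0, 2.0, 3.0], "g2": [2.0, 4.0, 6.0]}, {"o1": [1.0, 3.0, 2.0]}),
    "t2": ({"g1": [9.0, 10.0, 11.0], "g2": [10.0, 12.0, 11.0]}, {"o1": [1.0, 2.0, 3.0]}),
}
tei_by_time = {t: tei(dnb, other) for t, (dnb, other) in series.items()}
pre_stable = min(tei_by_time, key=tei_by_time.get)  # time with the lowest TEI
```

In this toy series the CV and the within-DNB correlation both drop at the second time point while the correlation with outside genes rises, so the TEI falls from 1.0 to about 0.064 and the second time point is reported as pre-stable.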

Result

Based on the gene expression of the control group and the treatment group, 321 DEGs are selected by t-test and clustered into 60 categories by correlation analysis. A group of 250 genes is identified as the DNB (Additional file 1), of which 43 genes are closely related to CML (Additional file 2). To clarify the course of treatment, Fig. 4 shows the changes of the four indices in detail. During imatinib treatment of CML patients, the CV of the DNB decreases gradually (Fig. 4a) and is lowest and closest to the healthy value at time 3 (i.e., after 1 month of imatinib treatment). The PCC is also lowest at time 3 (Fig. 4b), indicating that the correlations within the DNB weaken gradually during imatinib treatment as the condition eases. Although the change in the OPCC is not obvious (Fig. 4c), the TEI is lowest at time 3 and closest to the TEI value of the healthy state (Fig. 4d). Therefore, the most significant physiological effect occurs at time 3, indicating that the condition of CML patients is significantly relieved and becomes close to normal after 1 month of imatinib treatment.

Fig. 4

The therapeutic effect index of CML. The abscissa represents time t. On the timeline, 1 represents imatinib for 16 h, 2 represents imatinib for 3 days, 3 represents imatinib for 1 month, and 4 represents normal. a The average coefficient of variation (CV) of the DNB. b The average PCC of the DNB. c The average PCC between the DNB group and the genes outside the DNB group. d The TEI of the DNB

To analyze the DNB dynamics, we discuss the molecular mechanism of the disease from a systems perspective using the protein-protein interaction (PPI) network in Fig. 5. Most genes in the DNB interact strongly, and most of the 43 DNB genes associated with CML are among the most highly interactive. We also illustrate the dynamic changes in the DNB at the 4 sampling points in Fig. 6, which clearly shows the significance of the DNB in terms of expression variation and network structure near the pre-stable point (1 month).

Fig. 5

Protein-protein interaction (PPI) network for part of the DNB. The PPI network is used to discuss the molecular mechanism of the disease from a systems perspective. A PPI network is built for the 250 DNB genes with an interaction score of 0.7, and genes not in the network are deleted. The resulting PPI network contains 228 genes; most genes in the DNB interact strongly, and most of the 42 genes associated with CML are among the most highly interactive

Fig. 6

Dynamic changes in the DNB (250 genes) subnetwork (43 genes) at the 4 sampling points. For CML, we show the dynamic evolution of the network structure of the identified DNB subnetwork at the 4 sampling points. (a) DNB at 16 h: 43 genes, 631 lines. (b) DNB at 7 days: 43 genes, 413 lines. (c) DNB at 1 month (the pre-stable state): 43 genes, 385 lines. (d) DNB in normal: 43 genes, 457 lines. Each point represents a gene, colored on a gradient according to the standard deviation of the gene. Lines between genes indicate correlations calculated by PCC, and lines with weak correlation (|PCC|≤0.4) are deleted. These charts show that the DNB group provides important signals as the system approaches the pre-stable point: after 1 month of treatment the standard deviation of the DNB genes becomes smaller and tends to be stable, the correlation among DNB genes gradually weakens, and the condition eases and stabilizes. A strongly correlated observable subnetwork is thus also formed in terms of expression variations and network connections
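The edge rule stated in the Fig. 6 caption, where a line is kept only when |PCC| exceeds 0.4, can be sketched as follows. The function names, gene labels and expression values are hypothetical, and the sketch does not reproduce the plotting or the node coloring.

```python
import math
from itertools import combinations


def pcc(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def correlation_edges(expr, cutoff=0.4):
    """Return (gene_a, gene_b, pcc) for every pair with |PCC| above the cutoff."""
    edges = []
    for a, b in combinations(sorted(expr), 2):
        r = pcc(expr[a], expr[b])
        if abs(r) > cutoff:  # weakly correlated pairs get no line
            edges.append((a, b, r))
    return edges


# Toy expression values at one sampling point (hypothetical)
expr = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],
    "C": [1.0, -1.0, -1.0, 1.0],
}
edges = correlation_edges(expr)
n_lines = len(edges)  # corresponds to the number of lines reported per panel
```

In the toy data gene C is uncorrelated with A and B, so only the A-B line survives the cutoff.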

To further analyze the biological function of the DNB, Gene Ontology (GO) analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis are performed with the bioinformatics database DAVID [16]. GO analysis is divided into three parts: molecular function, biological process and cellular component. Some enriched GO functions of the identified DNB genes are listed in Table 3. Some of these genes have been shown to be associated with CML. For example, at the cellular level, CML is associated with a specific chromosomal abnormality: the reciprocal translocation t(9;22) forms the Philadelphia (Ph) chromosome, which involves the c-ABL proto-oncogene on chromosome 9 and the BCR (breakpoint cluster region) gene on chromosome 22. The fusion of c-ABL and BCR is considered the main cause of CML. CRK is considered the major tyrosine-phosphorylated protein identified in CML neutrophils. PI3K is a heterodimer of regulatory and catalytic subunits, and the protein encoded by PIK3R2 is a regulatory component of PI3K. The protein encoded by TGFBR2 is a transmembrane protein that has a protein kinase domain, forms a heterodimeric complex with TGF-β receptor type-1, and binds TGF-β. This receptor/ligand complex phosphorylates proteins, which then enter the nucleus and regulate the transcription of genes related to cell proliferation, cell cycle arrest, wound healing, immunosuppression, and tumorigenesis [17]. The genes mentioned are associated with the pathogenicity of CML and may also regulate the process of CML treatment and provide an early warning signal for it.

Table 3 Functional enrichment of GO for part of DNB

Functional enrichment analysis shows that the DNB genes are involved in biological processes such as the cell surface receptor signaling pathway, immune response, cell adhesion and the apoptotic process. The specific immune responses of CML contribute to the control of the disease. For example, the low expression of antigens recognized by CD247 leads to an impaired immune response [12], and CD247 is also associated with T cell co-stimulation and cell surface receptor signaling pathways. The TNF receptor family member CD27 is expressed on CML stem/progenitor cells in the bone marrow of CML patients. CD27 signaling promotes the growth of BCR/ABL+ leukemia cells by activating the Wnt pathway. Therefore, adaptive immunity contributes to leukemic progression. Targeting CD27 on leukemia stem cells (LSCs) may represent an attractive therapeutic approach for blocking the Wnt/β-catenin pathway in CML [13]. Changes in LGALS1 expression trigger changes in MDR1 expression and resistance to cytotoxic drugs, and MDR1 shows high efficacy in the treatment of BCR-ABL-positive CML, so LGALS1 may be considered a novel target for combination therapy to improve the efficacy of imatinib in the treatment of CML [18]. It is also involved in the process of apoptosis. TGFBR2 regulates cell proliferation and participates in apoptotic processes.

According to KEGG pathway enrichment analysis, at least 50% of the DNB genes are closely related to hematopoietic cell lineage, cytokine-cytokine receptor interaction, apoptosis, chronic myeloid leukemia, the MAPK signaling pathway, the PI3K-Akt signaling pathway and other pathways. The CML pathway in Table 4 indicates that BCR, TGFBR2, ABL1, CRK, and PIK3R2 play a decisive role in the pathogenesis of CML. As shown in Fig. 7, hematopoietic cell lineage, apoptosis, the MAPK signaling pathway, and the PI3K-Akt signaling pathway play a key role in the process of CML treatment. The PI3K-Akt signaling pathway is activated by a variety of cellular stimuli or toxic insults and regulates basic cellular functions such as transcription, translation, proliferation, growth, and survival. The mitogen-activated protein kinase (MAPK) cascade is a highly conserved module involved in a variety of cellular functions, including cell proliferation, differentiation, and migration. Apoptosis is a genetically programmed process for the elimination of damaged or redundant cells by activation of caspases (aspartate-specific cysteine proteases).

Fig. 7

Key biological pathways with DNB genes in CML pathway. By splitting the KEGG pathway map, a portion of the genes associated with DNB are extracted and finally the sub-pathway is obtained, as shown in the above figure. Among them, blue represents DNB, red represents genes in the CML pathway, and yellow represents genes of CML pathway’s pathways. Lines between genes represent various relationships between genes

Table 4 Functional enrichment of KEGG pathways for part of DNB

Literature mining shows that the chemokine receptor CCR5 plays a role in determining the malignant properties of blasts and the localization of extramedullary infiltrations in acute myeloid leukemia (AML) [19]. The cell surface target CD52 is expressed on neoplastic stem cells (NSCs) in a group of patients with AML. CD52 is a novel prognostic NSC marker and a potential NSC target in patients with AML and may have clinical significance [20]. GATA3 is a sensitive and specific marker for diagnosing acute leukemia with T-cell differentiation and may be a useful complement to the panel of immunophenotypic markers for the diagnostic evaluation of acute leukemia [21]. In addition, genes such as CEBPD, FUT4, LILRB1 and MVP play a role in the cure, treatment, and clinical drug resistance of AML [22], providing theoretical directions for the treatment of CML and for finding new therapeutic targets in the future.

Discussion

At present, most research on CML focuses on treatment, and little addresses the progression of patients after drug treatment. Traditional disease biomarkers can only distinguish the normal state from the disease state and cannot recognize the pre-stable state after drug treatment. CML patients are often resistant to conventional chemotherapeutic agents and tyrosine kinase inhibitors. Therefore, the key to treatment is to monitor and control how the disease progresses under treatment. To detect the therapeutic effects of imatinib from a small amount of high-throughput data, a DNB-based strategy for recognizing the therapeutic effect is provided for the gene expression data of CML patients. In this study, Student’s t-test is used to assess the significance of expression differences between the control group and the treatment group when selecting DEGs. The DEGs are clustered into 60 categories by hierarchical clustering, and a group of 250 genes satisfies the three criteria of DNB. The values of CV, PCC, and OPCC are then calculated to construct the TEI, which is used to detect the pre-stable state of CML. The TEI over the course of treatment shows that 1 month is the time of best curative effect. In the pre-stable state, the change in the OPCC is not obvious, whereas the other three indices behave as the theory predicts. After 1 month of treatment, the CV of the DNB genes becomes smaller and closer to the CV value of the healthy state. The correlation between genes gradually weakens, and the condition is relieved and tends to be stable.

Among the 250 genes of the DNB, 43 genes appear in pathogenesis maps of CML, and BCR, TGFBR2, ABL1, CRK, and PIK3R2 may be the key genes leading to the progression of CML. The remaining genes have also been found in other types of leukemia, such as AML. This provides a theoretical direction for the future search for target genes. In clinical practice, complete recovery from CML is difficult to achieve with imatinib treatment. Most patients continue medication after the condition is relieved, which allows long-term survival. Only a small number of patients are cured and can discontinue the drug.

Conclusions

The results of this study are intended to provide a theoretical direction and basis for medical personnel treating CML patients and for finding new therapeutic targets in the future. Biomarkers of CML can help patients receive prompt treatment and minimize drug resistance, treatment failure and relapse, which could significantly reduce the mortality of CML. Because the data are limited, only a few sampling points are available, and the pre-stable state cannot be predicted fully. In future work we will focus on this issue and continue to refine the algorithm.