1 Introduction

Leukaemia as a common cancer type, nowadays still requires improvement in the domain of diagnostics and classification. Currently, modern molecular biology techniques are being assessed for their adequacy toward the detection and distinction between leukaemia subtypes. This task has been undertaken in several attempts. According to Andreeff et al. [1] in 1980 flow cytometric analysis of DNA and RNA has been used to recognize acute lymphoblastic leukaemia subtypes. Along with the creation of microarray technologies new opportunities emerged and in 1999 Golub et al. [2] discriminated between acute lymphoblastic and myeloid leukaemia (ALL, AML) types using expression data. Furthermore, gene expression profiling was used to classify paediatric ALL subtypes by Yeoh et al. [3]. Recently, apart from microarray technology for questions such as AML subtype determination [4], interest has been also turned towards searching for leukaemia biomarkers with miRNA [510] and lncRNA [11] analysis. Despite the existence of all those studies, there remains only one exceptional study which was conducted on a large scale to discriminate between all of the leukaemia subtypes [12].

This data set has been established with a great understanding of the importance of experimental design methods developed within the scientific community nowadays. The principles of control, replication and randomisation are commonly known and implemented throughout laboratories and research institutions regardless of the study field. This enables planning of complex experiments, such as the Microarray Innovations in leukaemia (MILE) [12]. It has been carried out with comprehensive state-of-the-art protocols and strict control procedures during the experimental stage. This was expected to lead to higher power of statistical testing, and thus a better chance of obtaining meaningful novel results. Still, the rich data set offers possibilities for further conclusions if deeper attention is directed towards the preprocessing and downstream analysis pipelines.

On the basis of this study, devoted to biomarker discovery, the presented work has the objective of demonstrating how considerate and custom data preprocessing is essential to the inference by reducing the chance of false discoveries. It has a substantial impact on the final conclusions, which proves how it should be commonly unthinkable to neglect this indispensable step in biomedical data mining.

2 Materials and Methods

2.1 Data Sets

The Microarray Innovations in Leukaemia (MILE) study [12] was designed to assess the clinical accuracy of gene expression profiles, originating from microarray experiments, compared to standard leukaemia laboratory methods (gold standard) for 16 acute and chronic leukaemia subclasses, myelodysplastic syndromes (MDSs) and control group that included non-malignant disorders and normal bone marrow. The leukaemia subclasses may be divided into four main groups: acute and chronic myeloid leukaemia (AML, CML) and acute and chronic lymphoblastic leukaemia (ALL, CLL). The investigation was performed in 11 laboratories across three continents and included a total of 3,334 patients. The study was very carefully designed to eliminate main problems, which occur when many experiments are carried out in various laboratories in diverse conditions—so called batch effect [13]. The experiments consisted of four phases: two main phases (Stage I and Stage II), each of them preceded by a pre-phase [14]. The goals of the pre-phases were to assure intra laboratory reproducibility and inter laboratory comparability. Each laboratory operator was trained on an identical sample preparation protocol. Additionally, each laboratory was provided with the same laboratory equipment and also kits and reagents for sample preparation and microarray analysis were taken from the same source.

In this analysis microarray data from Stage I of the MILE study were investigated, where 2096 bone marrow samples of acute and chronic leukaemia patients were hybridized to Affymetrix HG-U133 Plus 2.0 GeneChips. Summary of the MILE datasets Stage I is presented in Table 1.

Table 1 Summary of the MILE datasets (STAGE I)

Three comparison studies were accomplished following the same signal analysis pipeline. Two of three analyses were performed on main classes of leukaemia and, therefore, merging samples from the appropriate subclasses was needed. The summary of merged data is presented in Table 2.

Table 2 Samples in main classes of leukaemia after subclass merging using the MILE datasets in STAGE I

2.2 Analysis Pipeline

Taking into account the specific nature of the data set, the pipeline of analysis was designed as presented in Fig 1. It includes the use of state of the art methods for preprocessing, a technique for removing variability caused by external influence (unrelated to the analysed case), adaptive filtering for noise and uninformative features removal, statistical analysis with the aim of biomarker selection.

Fig. 1
figure 1

Summary of pipeline analysis, which includes preprocessing data, batch effect adjustment, adaptive filtering for noise and uninformative feature filtration, statistical analysis and biomarker selection

The three comparative analyses performed gradually take into account more and more details about the leukaemia. The first one is carried out on the main types of leukaemia, and in terms of statistical analysis a commonly used approach is chosen, which compares the mean gene expression level in each main type of leukaemia with the mean expression level among healthy donors from control group (here: non-leukaemia and healthy bone marrow). This is an example of case-control approach widely used in observational studies.

In the second analysis, an extension is performed relying on the cross-comparison of transcriptomic profiles of main leukaemia types between themselves. From this analysis a biomarker identification step is added as it is possible to set an appropriate condition. In this case only such features are taken into account and labelled as biomarkers, which differentiate one and only one main class from the rest.

The last analysis was performed on all of the leukaemia subgroups. It allows for the most profound analysis of leukaemia diseases. As mentioned earlier, this is a unique study, which was conducted on a large scale to discriminate between all of the leukaemia subtypes and in this final study information for all subclasses of leukaemia is taken under analysis.

2.3 Data Preprocessing

The intensity data from microarray experiments has been subjected to fRMA normalisation [15] with background correction, quantile normalisation and median polish summarisation. This method has been chosen to merge the advantages of classic RMA normalisation with the ability to include additional samples if need in the future. Probe reannotation was accomplished with custom CDF files available through the BrainArray repository [16].

The next step was to ensure data coherence, i.e. verify if the unification procedures applied in the study successfully dealt with the issue of bias introduced by batch effect. In this case Principal Component Analysis was performed and the outcome suggests that nonetheless a batch effect due to sample preparation in different laboratories may be observed (Fig. 2).

Fig. 2
figure 2

Principal component graphs demonstrating the existence of batch effect in the data with regard to sample preparation research centre

Therefore, the data were adjusted for batch effects with the use of ComBat algorithm [17], available through the SVA R package [18]. The results of Kruskall-Wallis test for differentially expressed genes among research centre batches proved a significant removal of batch effect (Table 3).

Table 3 Results of two-way ANOVA for gene differentiation among research centres participating in sample preparation and leukaemia subgroups (\(\alpha = 0.05\))

The final step consisted of gene filtration to remove features with signal close to background level. There are various techniques available for this purpose such as the commonly used method of removing \(50\%\) of the genes with lowest expression value or variance. However, in the studied case of 18 subtypes of disease this approach seems excessively strict and implies the search of an adaptive threshold rather than fixed. For this reason, the adaptive filtering based on Gaussian mixture decomposition has been selected [19]. The filtration was conducted in two steps: in the first step the signal was decomposed in terms of signal intensity amplitude, and the three components with the highest signal amplitude remained. Second, the data were considered variance-wise and the component with lowest variance was rejected (Fig. 3). A total of 9941 genes remained for further statistical analysis.

Fig. 3
figure 3

Decomposition into Gaussian components as a method of filtration of genes with signal intensity close to background values and low variance

2.4 Statistical Analysis and Biomarker Selection

To search for class enhanced differentially expressed genes (CE-DEGs) across types or subtypes of leukaemia, a set of statistical tests was carried out, independently for each comparative analysis. The CE-DEGs in this case are genes which differentiate a considered group from all the other groups in the manner of pairwise comparisons. At the beginning the conditions on normality and homogeneity of variances were verified and, accordingly, the appropriate parametric or non-parametric test was chosen.

During the first analysis, initially, Analysis of Variance (ANOVA) was conducted to filter out the genes, which do not differentiate among groups at all. Next, the mean gene expression level of each main type of leukemia was compared with the mean expression within reference group, therefore, Dunnetts test was used in post hoc comparisons to control the experimental event rate (EER).

For the remaining two analyses the same set of statistical tests was performed. It included non-parametric Kruskal-Wallis analysis of variance test, because of the violation of the assumptions for parametric ANOVA in several experimental groups. After this step features, which differentiate at least one leukaemia type from the rest types of diseases, were selected. Furthermore, as means of conducting post-hoc pairwise comparison tests, the Games-Howell method was chosen. Restrictive feature selection was then used to filter out the genes which differentiate solely one group from all of the other types or subtypes of leukaemia. The combination of the data preprocessing steps and statistically supported biomarker selection method form an innovative pipeline for comprehensive expression data analysis.

2.5 Cross Validation

With respect to the works presented in  [12] a similar cross validation scheme was executed for data processed in the original study and data from the proposed preprocessing and statistical testing analysis pipeline. Namely, 30-fold cross validation with three repetitions was carried out on the leukaemia subgroups using a Support Vector Machine (SVM) classifier. As a common practice to account for regularisation, the minimum error rate criterion was used in the differentiating feature selection process. Moreover, separability was measured using SVM on the entire data set for original data and processed with the proposed pipeline. The former feature set consisted of the union of top 100 differentially expressed genes from t test pairwise comparisons, whereas in the latter case the total number of CE-DEGs identified in the Games-Howell post-hoc test. The feature selection step was completed with the condition that genes which are incorporated into the model cannot be correlated in the sense of a large effect size value.

3 Results

3.1 Case–Control Approach: Leukaemia Versus Healthy Controls

The first analysis consisted of a common approach of examining differentiation between gene expression profile in samples collected from patients diagnosed with one of the main leukaemia groups and the control group. In this case the control samples are treated somewhat as a baseline and the insight is being driven towards up and down regulated genes. The summary of these findings is presented in Figs. 4 and  5. The Venn diagrams (http://bioinformatics.psb.ugent.be/webtools/Venn) present similarity among the four main leukaemia groups in terms of the sets of differentiating genes in total and taking into account the division of up and down regulated. The total number of genes differentially expressed between leukaemia and controls per each leukaemia type is presented in Table 4. As expected, the lowest number of CE-DEGs is observed for MDS cases, while ALLs, AMLs and CLLs present the similar number of CE-DEGs. There are no significant differences in the number of up and down regulated genes for ALL, AML, and CLL leukaemia type (50.27, 50.76, and 50.13\(\%\) of up regulated genes), while for CML type down regulated genes overdominate the system response (\(60.20\%\)). A similar trend is observed for MDS samples. The complete list of differentially expressed genes with regard to the healthy controls is given in Supplementary File 1 .

The CE-DEGs have been verified through literature research for the presence of key genes present in molecular mechanisms of the studied leukaemia types. In all of the investigated diseases these principal features appeared to be significantly altered in terms of gene expression. Hence, the lists of CE-DEGs included:

  • ALL: EBF1, LMO2, CDKN2A, PTEN, RB1, BTLA, CD200, TOX, NR3C1, TBL1XR1, ETV6, ERG genes reported to be linked with acute lymphoblastic leukaemia [20, 21]

  • AML: FLT3, IDH1, DNMT3A, CEBPA, KIT, NRAS, NPM1 genes connected with acute myeloid leukaemia [22]

  • CLL: ATM, GPI, BSG, LGALS1, PARVB, VIM, NOTCH1, BIRC3, MYD88, CD38 associated with chronic lymphoblastic leukaemia as in [23, 24]

  • CML: has been confirmed to have, among others, a significantly differentially expressed BCR-ABL gene, which is the leading oncoprotein involved in chronic myeloid leukemia [25, 26]

Fig. 4
figure 4

Venn diagram for differentially expressed genes in the main leukaemia subgroups with regard to their control

Fig. 5
figure 5

Venn diagrams for up and down regulated genes in the main leukaemia subgroups with regard to their control

Table 4 The number of statistically significant differentiating genes for each of the main groups of leukaemia with regard to the control sample group

The similarity has been further determined by means of the Dice coefficient [27] (DSC) with its 95% confidence intervals [28] (Table 5). These statistics show that the most substantial resemblance is within the genes differentially expressed in ALL and CLL, although a powerful similarity is also present between the AML and ALL groups. The least important closeness may be seen in the case of each main leukaemia group when compared to MDS. Detailed analysis of DSC values between MDS and leukaemia types reveals that MDS is the most similar to AML in systemic response to disease, having significantly the highest value of Dice similarity coefficients (0.259; \(95\%\) CI from 0.245 to 0.272), which is in compliance with the findings of other authors [29].

Table 5 Dice coefficients (DSC) with confidence intervals for main groups of leukaemia comparison

3.2 Comparison Among Leukaemia Main Types

In the second analysis, the main leukaemia types have been investigated using pairwise comparison testing to identify possible biomarkers among main groups of leukaemia. In this case, apart from being differentially expressed, the gene had to be uniquely statistically significant for only the one leukaemia type in order to be recognized as a potential biomarker (in contrast to CE-DEGs, which could differentiate several groups from each other). It cannot be differentially expressed among remaining leukaemia types. The findings have been summarised in Table 6 and on Fig. 6, while the complete lists of genes are available in Supplementary File 2.

Table 6 Results of Games-Howell post-hoc pairwise comparisons
Fig. 6
figure 6

Overlap of class enhanced differentially expressed genes among main leukaemia groups

Thereafter, the biomarkers were subjected to functional analysis for gaining knowledge of biological processes, in which they may be involved. Therefore, they have been checked for links to biological process terms in the Gene Ontology database [30]. The biomarker lists were submitted for Gene Ontology overrepresentation assessment using Fishers exact test. Nearly complete dissimilarity of the discovered overrepresented GO terms points to an apparent specificity of biological processes triggered by genes differentially expressed in the forenamed leukaemia types. The complete lists of ontology terms are gathered in Supplementary File 3.

Moreover, the biomarker genes have been juxtaposed with regard to their gene family for a more complete information set on the connection between their function and potential leukaemia-related processes. INGENUITYTM Pathway Analysis software by QIAGEN was used for this purpose. The summary of the outcome is presented in Table 7. The findings point to a few notable indications, i.e. the presence of growth factors only in acute leukaemia and phosphatases in myeloid leukaemia. Furthermore, the occurrence of the G-protein coupled receptor family is specific for ALL, peptidase for AML, transmembrane receptors for CLL and microRNA for CML.

Table 7 List of gene family characteristics for main leukaemia type candidate biomarkers

3.3 Searching for Leukaemia Subtype Biomarkers

Having the required measurements for gaining insight into the individual leukaemia subgroups, the data were investigated in a deeper manner and the analysis pipeline (Fig. 1) was repeated for all of the eighteen leukemia subtypes. Differentiation testing results demonstrate that an overwhelming majority of the genes remaining for analysis present statistical significance between the studied subgroups of leukemia (Table 3). After adequate gene filtration it is highly probable that at least one type will vary from the others significantly. Thus, pairwise comparisons were carried out between the subgroups and the final results (Table 8) pointed out to merely four genes differentiating a subgroup from all the others. The genes mentioned are (Fig. 7): (1) ASIC2 acid sensing ion channel 2, (2) GABRE—gamma-aminobutyric acid A receptor, epsilon, (3) LINC00525—long intergenic non-protein coding RNA 525, (4) CTNNA3—catenin alpha 3. The CTNNA3 gene has been shown to be linked to the Shwachman-Diamond syndrome which is characterized by a high risk of leukaemia [31]. In terms of relation to the bone marrow processes the GABRE gene which is a gammaaminobutyric acid receptor has proved to play a role during bone marrow stromal cell transplantation in the injured spinal cord in mice [32].

Table 8 Results of Games-Howell post-hoc pairwise comparisons
Fig. 7
figure 7

Boxplots illustrating the biomarker gene expression value distributions for subclasses of leukaemia

3.4 Classification Study

The cross validation results in detail are presented in Tables 9 and 10. The prediction for all of the leukaemia subclasses is given along with classification sensitivity. Furthermore, overall weighted average sensitivity with \(95\%\) confidence intervals is presented in Table 11. It is visible that features selected through the proposed analysis pipeline have higher average specificity than those chosen with the top 100 DEG original approach.

In terms of separability, there were 39 genes in the model for data processed with the original approach and 41 in the novel pipeline approach. Two of these features were common and the remaining were correlated with effect size at least at a medium level. the results are on a similar level. The identified novel pipeline signature is driven by leukaemia known MEIS1, CBFB, FOXO1, SETBP1 genes with the support, among the others, of KIAA0101, GPX1, INSR HCCS and THOC5 genes. The complete list of genes in the signature is available in Supplementary File 4. The majority of them has been previously reported to be linked to leukaemia related processes. Using the novel pipeline 0.998 accuracy was reached with the minimum error rule, versus 0.972 for the original MILE approach. However, less iterations for the procedure were required in case of the novel pipeline, as the considered feature space was smaller (2316 CE-DEGs) than in the original approach (3555 genes).

Table 9 Prediction table with cross validation results for the original MILE analysis pipeline data for leukaemia subgroups
Table 10 Prediction table with cross validation results for the novel proposed analysis pipeline data for leukaemia subgroups
Table 11 Weighted average cross validation sensitivity with \(95\%\) CI for the original MILE data and the novel processing pipeline

4 Discussion

The analysed data originate from one of the main phases of MILE study and contain 2096 samples prepared by 11 research centres from around the world. This may be the cause of impairment of the quality of data by the impact of technical factors related to each research centre. However, the whole experiment was very well designed, which means every laboratory was provided with the same equipment, kits, reagents coming from a common manufacturer or source. Likewise, the technicians were prepared in terms of using identical sample preparation protocol. As a result, the data should not have been greatly affected by bias.

The analysis adapted to the specific nature of the analysed data revealed that despite a well designed experiment, variability exists in the data associated with sample preparation by particular research institutes. This prompted batch effect adjustment of which the effects are presented both in the illustration of PCA components and also by analysis of variance using two-way ANOVA for research institutions, before and after batch effect correction. The presented study indicates that batch effect correction should be an indispensable element of the microarray analysis protocol, as often it is impossible to exclude the impact of all external factors.

The three comparative immersing analyses provide advancing knowledge on the potential mechanisms of particular leukaemia types and subtypes. The first one supports findings such as an important similarity of changes in gene expression between the same tissue type (ALL and CLL). Moreover, the acute leukaemia types (AML and ALL) also appear to have multiple shared molecular responses given their number of common CE differentially expressed genes. Additionally, the MDS studied subtype seems to have the least similar gene expression set with regard to the main leukaemia subtypes, of which AML was the mostly targeted by the same genes. This, together with a relatively small number of CE-DEGs in total, may point toward the suggestion that MDS is in its mechanisms more related to a healthy response than to any of the leukaemia types.

The second main leukaemia type comparative analysis supplies further evidence toward the similarity of ALL vs. CLL and AML vs. ALL gene expression wise. The abundance of differentiating features lead to the formulation of a biomarker definition such that only genes significant for a unique type are considered candidates. This implied a reduction in number of examined genes and investigating corresponding overrepresented gene ontology terms guides more towards a conclusion that a majority of the biological processes involved in leukaemia are specific to the aforementioned main types. Furthermore, the investigation of gene families presents some guidance toward inferring that while there are gene types specific for each of the main leukaemia groups, growth factors seem to be a linking factor for acute leukaemia, whereas phosphatases for myeloid leukaemia.

The final study involving deep data analysis of all of the subtypes of leukaemia allowed the extraction of important information. Four genes were discovered (ASIC2, GABRE, LINC00525, CTNNA3) as candidate biomarkers for four subtypes of leukaemia (ALL with t(1;19), AML with t(15;17), CML, CLL). Two of them are already described in the literature. Information, which has been found in the course of literature research, coincides to some extent with information about CTNNA3 and GABRE gene involvement in branches of diseases associated with leukaemia. However, the discovered ASIC2 and LINC00525 biomarkers are not mentioned in the literature in this context and would require experimental confirmation to contribute final proof for the utility of these biomarkers.

Cross validation comparison of the original approach versus tailored preprocessing and statistical testing reveal that adequate gene set selection yields supreme results in terms of classification sensitivity. Additionally, comparative separability assessment demonstrates that with a similar level of separability is possible to obtain with a smaller gene set, which, apart from reducing the chance of finding false positives, diminishes the number of iterations that need to be performed in a classification scheme. This may be considered as significant in terms of computational resources necessary for performing analyses.

5 Conclusions

The presented research confirms the significance of careful data preprocessing including batch effect adjustment and adaptive filtration for inference in a well designed large study of gene expression data in leukaemia patients. The above has been confirmed through statistical and functional analysis supported by bioinformatics repository information and literature survey of the biological conclusions. The obtained outcome produced four candidate biomarkers which imply further investigation through data mining procedures. The unique candidate biomarkers that have not been previously described in literature require experimental assessment to ultimately validate their suitability as auxiliary indicators of disease subtypes in leukaemia.

The contribution of the study is the original design of the data analysis pipeline tailored to large, multiclass, bioinformatic data. Compared to standard techniques the proposed design includes two-fold modifications. The first modification is in the preprocessing stage, more careful and elaborate, which allows for better reducing of measurement artifacts in the data while keeping the useful information. The second modification is the procedure for choice of the differentially expressed genes. We point out that in the multiclass experiments the concept of DEGs becomes more complex than in the two-class case. We introduce the definition of the class enhanced DEG and biomarkers. CE-DEG is a feature, which shows differential expression between the given class and all remaining classes grouped together. A biomarker is a CE-DEG, which additionally has a property that it does not show differential expression between any pair of the remaining classes. We apply the proposed data analysis pipeline to the MILE dataset and we demonstrate that the list of the obtained CE-DEGs, while comparable in size, is different than the list of DEGs computed in the MILE study. We also prove that our CE-DEG signature leads to significantly better classification power of the multi-class data classifiers.

In conclusion, the provided deep data analysis pipeline (Fig. 8) proves to be an advantageous tool for screening high-throughput molecular biology data sets.

Fig. 8
figure 8

Expression data analysis pipeline with gradual gene reduction