Comprehensive Analysis of MILE Gene Expression Data Set Advances Discovery of Leukaemia Type and Subtype Biomarkers

Large collections of data in studies on cancer such as leukaemia provoke the necessity of applying tailored analysis algorithms to ensure supreme information extraction. In this work, a custom-fit pipeline is demonstrated for thorough investigation of the voluminous MILE gene expression data set. Three analyses are accomplished, each for gaining a deeper understanding of the processes underlying leukaemia types and subtypes. First, the main disease groups are tested for differential expression against the healthy control as in a standard case-control study. Here, the basic knowledge on molecular mechanisms is confirmed quantitatively and by literature references. Second, pairwise comparison testing is performed for juxtaposing the main leukaemia types among each other. In this case by means of the Dice coefficient similarity measure the general relations are pointed out. Moreover, lists of candidate main leukaemia group biomarkers are proposed. Finally, with this approach being successful, the third analysis provides insight into all of the studied subtypes, followed by the emergence of four leukaemia subtype biomarkers. In addition, the class enhanced DEG signature obtained on the basis of novel pipeline processing leads to significantly better classification power of multi-class data classifiers. The developed methodology consisting of batch effect adjustment, adaptive noise and feature filtration coupled with adequate statistical testing and biomarker definition proves to be an effective approach towards knowledge discovery in high-throughput molecular biology experiments.


Introduction
Leukaemia as a common cancer type, nowadays still requires improvement in the domain of diagnostics and classification. Currently, modern molecular biology techniques are being assessed for their adequacy toward the detection and distinction between leukaemia subtypes. This task has been undertaken in several attempts. According to Andreeff et al. [1] in 1980 flow cytometric analysis of DNA and RNA has been used to recognize acute lymphoblastic leukaemia subtypes. Along with the creation of microarray technologies new opportunities emerged and in 1999 Golub et al. [2] discriminated between acute lymphoblastic and myeloid leukaemia (ALL, AML) types using expression data. Furthermore, gene expression profiling was used to classify paediatric ALL subtypes by Yeoh et al. [3]. Recently, apart from microarray technology for questions such as AML subtype determination [4], interest has been also turned towards searching for leukaemia biomarkers with miRNA [5][6][7][8][9][10] and lncRNA [11] analysis.
Despite the existence of all those studies, there remains only one exceptional study which was conducted on a large scale to discriminate between all of the leukaemia subtypes [12].
This data set has been established with a great understanding of the importance of experimental design methods developed within the scientific community nowadays. The principles of control, replication and randomisation are commonly known and implemented throughout laboratories and research institutions regardless of the study field. This enables planning of complex experiments, such as the Microarray Innovations in leukaemia (MILE) [12]. It has been carried out with comprehensive state-of-the-art protocols and strict control procedures during the experimental stage. This was expected to lead to higher power of statistical testing, and thus a better chance of obtaining meaningful novel results. Still, the rich data set offers possibilities for further conclusions if deeper attention is directed towards the preprocessing and downstream analysis pipelines.
On the basis of this study, devoted to biomarker discovery, the presented work has the objective of demonstrating how considerate and custom data preprocessing is essential to the inference by reducing the chance of false discoveries. It has a substantial impact on the final conclusions, which proves how it should be commonly unthinkable to neglect this indispensable step in biomedical data mining.

Data Sets
The Microarray Innovations in Leukaemia (MILE) study [12] was designed to assess the clinical accuracy of gene expression profiles, originating from microarray experiments, compared to standard leukaemia laboratory methods (gold standard) for 16 acute and chronic leukaemia subclasses, myelodysplastic syndromes (MDSs) and control group that included non-malignant disorders and normal bone marrow. The leukaemia subclasses may be divided into four main groups: acute and chronic myeloid leukaemia (AML, CML) and acute and chronic lymphoblastic leukaemia (ALL, CLL). The investigation was performed in 11 laboratories across three continents and included a total of 3,334 patients. The study was very carefully designed to eliminate main problems, which occur when many experiments are carried out in various laboratories in diverse conditions-so called batch effect [13]. The experiments consisted of four phases: two main phases (Stage I and Stage II), each of them preceded by a pre-phase [14]. The goals of the prephases were to assure intra laboratory reproducibility and inter laboratory comparability. Each laboratory operator was trained on an identical sample preparation protocol. Additionally, each laboratory was provided with the same laboratory equipment and also kits and reagents for sample preparation and microarray analysis were taken from the same source.
In this analysis microarray data from Stage I of the MILE study were investigated, where 2096 bone marrow samples of acute and chronic leukaemia patients were hybridized to Affymetrix HG-U133 Plus 2.0 GeneChips. Summary of the MILE datasets Stage I is presented in Table 1.
Three comparison studies were accomplished following the same signal analysis pipeline. Two of three analyses were performed on main classes of leukaemia and, therefore, merging samples from the appropriate subclasses was needed. The summary of merged data is presented in Table 2.

Analysis Pipeline
Taking into account the specific nature of the data set, the pipeline of analysis was designed as presented in Fig 1. It includes the use of state of the art methods for preprocessing, a technique for removing variability caused by external influence (unrelated to the analysed case), adaptive filtering for noise and uninformative features removal, statistical analysis with the aim of biomarker selection.
The three comparative analyses performed gradually take into account more and more details about the leukaemia. The first one is carried out on the main types of leukaemia, and in terms of statistical analysis a commonly used approach is chosen, which compares the mean gene expression level in each main type of leukaemia with the mean expression level among healthy donors from control group (here: non-leukaemia and healthy bone marrow). This is an example of case-control approach widely used in observational studies.
In the second analysis, an extension is performed relying on the cross-comparison of transcriptomic profiles of main leukaemia types between themselves. From this analysis a biomarker identification step is added as it is possible to set an appropriate condition. In this case only such features are taken into account and labelled as biomarkers, which differentiate one and only one main class from the rest.
The last analysis was performed on all of the leukaemia subgroups. It allows for the most profound analysis of leukaemia diseases. As mentioned earlier, this is a unique study, which was conducted on a large scale to discriminate between all of the leukaemia subtypes and in this final study information for all subclasses of leukaemia is taken under analysis.

Data Preprocessing
The intensity data from microarray experiments has been subjected to fRMA normalisation [15] with background correction, quantile normalisation and median polish summarisation. This method has been chosen to merge the advantages of classic RMA normalisation with the ability to include additional samples if need in the future. Probe reannotation was accomplished with custom CDF files available through the BrainArray repository [16].
The next step was to ensure data coherence, i.e. verify if the unification procedures applied in the study successfully   1 Summary of pipeline analysis, which includes preprocessing data, batch effect adjustment, adaptive filtering for noise and uninformative feature filtration, statistical analysis and biomarker selection dealt with the issue of bias introduced by batch effect. In this case Principal Component Analysis was performed and the outcome suggests that nonetheless a batch effect due to sample preparation in different laboratories may be observed (Fig. 2). Therefore, the data were adjusted for batch effects with the use of ComBat algorithm [17], available through the SVA R package [18]. The results of Kruskall-Wallis test for differentially expressed genes among research centre batches proved a significant removal of batch effect ( Table 3).
The final step consisted of gene filtration to remove features with signal close to background level. There are various techniques available for this purpose such as the commonly used method of removing 50% of the genes with lowest expression value or variance. However, in the studied case of 18 subtypes of disease this approach seems excessively strict and implies the search of an adaptive threshold rather than fixed. For this reason, the adaptive filtering based on Gaussian mixture decomposition has been selected [19]. The filtration was conducted in two steps: in the first step the signal was decomposed in terms of signal intensity amplitude, and the three components with the highest signal amplitude remained. Second, the data were considered variance-wise and the component with lowest variance was rejected (Fig. 3). A total of 9941 genes remained for further statistical analysis.

Statistical Analysis and Biomarker Selection
To search for class enhanced differentially expressed genes (CE-DEGs) across types or subtypes of leukaemia, a set of statistical tests was carried out, independently for each comparative analysis. The CE-DEGs in this case are genes which differentiate a considered group from all the other groups in the manner of pairwise comparisons. At the beginning the conditions on normality and homogeneity of variances were verified and, accordingly, the appropriate parametric or non-parametric test was chosen.
During the first analysis, initially, Analysis of Variance (ANOVA) was conducted to filter out the genes, which do not differentiate among groups at all. Next, the mean gene expression level of each main type of leukemia was compared with the mean expression within reference group, therefore, Dunnetts test was used in post hoc comparisons to control the experimental event rate (EER).
For the remaining two analyses the same set of statistical tests was performed. It included non-parametric   Kruskal-Wallis analysis of variance test, because of the violation of the assumptions for parametric ANOVA in several experimental groups. After this step features, which differentiate at least one leukaemia type from the rest types of diseases, were selected. Furthermore, as means of conducting post-hoc pairwise comparison tests, the Games-Howell method was chosen. Restrictive feature selection was then used to filter out the genes which differentiate solely one group from all of the other types or subtypes of leukaemia. The combination of the data preprocessing steps and statistically supported biomarker selection method form an innovative pipeline for comprehensive expression data analysis.

Cross Validation
With respect to the works presented in [12] a similar cross validation scheme was executed for data processed in the original study and data from the proposed preprocessing and statistical testing analysis pipeline. Namely, 30-fold cross validation with three repetitions was carried out on the leukaemia subgroups using a Support Vector Machine (SVM) classifier. As a common practice to account for regularisation, the minimum error rate criterion was used in the differentiating feature selection process. Moreover, separability was measured using SVM on the entire data set for original data and processed with the proposed pipeline. The former feature set consisted of the union of top 100 differentially expressed genes from t test pairwise comparisons, whereas in the latter case the total number of CE-DEGs identified in the Games-Howell post-hoc test. The feature selection step was completed with the condition that genes which are incorporated into the model cannot be correlated in the sense of a large effect size value.

Case-Control Approach: Leukaemia Versus Healthy Controls
The first analysis consisted of a common approach of examining differentiation between gene expression profile in samples collected from patients diagnosed with one of the main leukaemia groups and the control group. In this case the control samples are treated somewhat as a baseline and the insight is being driven towards up and down regulated genes. The summary of these findings is presented in Figs. 4 and 5. The Venn diagrams (http://bioinformatics.psb.ugent.be/webtools/Venn) present similarity among the four main leukaemia groups in terms of the sets of differentiating genes in total and taking into account the division of up and down regulated. The total number of genes differentially expressed between leukaemia and controls per each leukaemia type is presented in   NOTCH1, BIRC3, MYD88, CD38 associated with chronic lymphoblastic leukaemia as in [23,24] • CML: has been confirmed to have, among others, a significantly differentially expressed BCR-ABL gene, which is the leading oncoprotein involved in chronic myeloid leukemia [25,26] The similarity has been further determined by means of the Dice coefficient [27] (DSC) with its 95% confidence intervals [28] (Table 5). These statistics show that the most substantial resemblance is within the genes differentially expressed in ALL and CLL, although a powerful similarity is also present between the AML and ALL groups. The least important closeness may be seen in the case of each main leukaemia group when compared to MDS. Detailed analysis of DSC values between MDS and leukaemia types reveals that MDS is the most similar to AML in systemic response to disease, having significantly the highest value of Dice similarity coefficients (0.259; 95% CI from 0.245 to 0.272), which is in compliance with the findings of other authors [29].

Comparison Among Leukaemia Main Types
In the second analysis, the main leukaemia types have been investigated using pairwise comparison testing to identify possible biomarkers among main groups of leukaemia. In this case, apart from being differentially expressed, the gene had to be uniquely statistically significant for only the one leukaemia type in order to be recognized as a potential biomarker (in contrast to CE-DEGs, which could differentiate several groups from each other). It cannot be differentially expressed among remaining leukaemia types. The findings have been summarised in Table 6 and on Fig. 6,  Thereafter, the biomarkers were subjected to functional analysis for gaining knowledge of biological processes, in which they may be involved. Therefore, they have been checked for links to biological process terms in the Gene Ontology database [30]. The biomarker lists were submitted for Gene Ontology overrepresentation assessment using Fishers exact test. Nearly complete dissimilarity of the discovered overrepresented GO terms points to an apparent specificity of biological processes triggered by genes differentially expressed in the forenamed leukaemia types. The complete lists of ontology terms are gathered in Supplementary File 3.
Moreover, the biomarker genes have been juxtaposed with regard to their gene family for a more complete information set on the connection between their function and potential leukaemia-related processes. INGENUITY TM Pathway Analysis software by QIAGEN was used for this purpose. The summary of the outcome is presented in Table 7. The findings point to a few notable indications, i.e. the presence of growth factors only in acute leukaemia and phosphatases in myeloid leukaemia. Furthermore, the occurrence of the G-protein coupled receptor family is specific for ALL, peptidase for AML, transmembrane receptors for CLL and microRNA for CML.

Searching for Leukaemia Subtype Biomarkers
Having the required measurements for gaining insight into the individual leukaemia subgroups, the data were investigated in a deeper manner and the analysis pipeline ( Fig. 1) was repeated for all of the eighteen leukemia subtypes. Differentiation testing results demonstrate that an overwhelming majority of the genes remaining for analysis present statistical significance between the studied subgroups  0.703 0.743 0.483 0.240 0.655 0.489 0.259 0.499 0.228 0. . 6 Overlap of class enhanced differentially expressed genes among main leukaemia groups of leukemia (Table 3). After adequate gene filtration it is highly probable that at least one type will vary from the others significantly. Thus, pairwise comparisons were carried out between the subgroups and the final results (Table 8) pointed out to merely four genes differentiating a subgroup from all the others. The genes mentioned are (Fig. 7): (1) ASIC2 acid sensing ion channel 2, (2) GABRE-gamma-aminobutyric acid A receptor, epsilon, (3) LINC00525-long intergenic non-protein coding RNA 525, (4) CTNNA3-catenin alpha 3. The CTNNA3 gene has been shown to be linked to the Shwachman-Diamond syndrome which is characterized by a high risk of leukaemia [31]. In terms of relation to the bone marrow processes the GABRE gene which is a gammaaminobutyric acid receptor has proved to play a role during bone marrow stromal cell transplantation in the injured spinal cord in mice [32].

Classification Study
The cross validation results in detail are presented in Tables 9 and 10. The prediction for all of the leukaemia subclasses is given along with classification sensitivity. Furthermore, overall weighted average sensitivity with 95% confidence intervals is presented in Table 11. It is visible that features selected through the proposed analysis pipeline have higher average specificity than those chosen with the top 100 DEG original approach. In terms of separability, there were 39 genes in the model for data processed with the original approach and 41 in the novel pipeline approach. Two of these features were common and the remaining were correlated with effect size at least at a medium level. the results are on a similar level. The identified novel pipeline signature is driven by leukaemia known MEIS1, CBFB, FOXO1, SETBP1 genes with the support, among the others, of KIAA0101, GPX1, INSR HCCS and THOC5 genes. The complete list of genes in the signature is available in Supplementary File 4. The majority of them has been previously reported to be linked to leukaemia related processes. Using the novel pipeline 0.998 accuracy was reached with the minimum error rule, versus 0.972 for the original MILE approach. However, less iterations for the procedure were required in case of the novel pipeline, as the considered feature space was smaller (2316 CE-DEGs) than in the original approach (3555 genes).

Discussion
The analysed data originate from one of the main phases of MILE study and contain 2096 samples prepared by 11 research centres from around the world. This may be the cause of impairment of the quality of data by the impact of technical factors related to each research centre. However,  Enzyme  8  10  13  11  G-Protein coupled receptor  1  0  0  0  Growth factor  2  1  0  0  Transcription regulator  6  3  0  1  Cytokine  1  0  0  1  Ion channel  2  1  1  1  Transporter  2  6  6  0  Kinase  0  4   the whole experiment was very well designed, which means every laboratory was provided with the same equipment, kits, reagents coming from a common manufacturer or source. Likewise, the technicians were prepared in terms of using identical sample preparation protocol. As a result, the data should not have been greatly affected by bias. The analysis adapted to the specific nature of the analysed data revealed that despite a well designed experiment, variability exists in the data associated with sample preparation by particular research institutes. This prompted batch effect adjustment of which the effects are presented both in the illustration of PCA components and also by analysis of variance using two-way ANOVA for research institutions, before and after batch effect correction. The presented study indicates that batch effect correction should be an indispensable element of the microarray analysis protocol, as often it is impossible to exclude the impact of all external factors.
The three comparative immersing analyses provide advancing knowledge on the potential mechanisms of particular leukaemia types and subtypes. The first one supports findings such as an important similarity of changes in gene expression between the same tissue type (ALL and CLL). Moreover, the acute leukaemia types (AML and ALL) also appear to have multiple shared molecular responses given their number of common CE differentially expressed genes. Additionally, the MDS studied subtype seems to have the least similar gene expression set with regard to the main leukaemia subtypes, of which AML was the mostly targeted by the same genes. This, together with a relatively small number of CE-DEGs in total, may point toward the suggestion that MDS is in its mechanisms more related to a healthy response than to any of the leukaemia types.
The second main leukaemia type comparative analysis supplies further evidence toward the similarity of ALL vs. CLL and AML vs. ALL gene expression wise. The abundance of differentiating features lead to the formulation of a biomarker definition such that only genes significant for a unique type are considered candidates. This implied a reduction in number of examined genes and investigating corresponding overrepresented gene ontology     terms guides more towards a conclusion that a majority of the biological processes involved in leukaemia are specific to the aforementioned main types. Furthermore, the investigation of gene families presents some guidance toward inferring that while there are gene types specific for each of the main leukaemia groups, growth factors seem to be a linking factor for acute leukaemia, whereas phosphatases for myeloid leukaemia. The final study involving deep data analysis of all of the subtypes of leukaemia allowed the extraction of important information. Four genes were discovered (ASIC2, GABRE, LINC00525, CTNNA3) as candidate biomarkers for four subtypes of leukaemia (ALL with t(1;19), AML with t(15;17), CML, CLL). Two of them are already described in the literature. Information, which has been found in the course of literature research, coincides to some extent with information about CTNNA3 and GABRE gene involvement in branches of diseases associated with leukaemia. However, the discovered ASIC2 and LINC00525 biomarkers are not mentioned in the literature in this context and would require experimental confirmation to contribute final proof for the utility of these biomarkers.
Cross validation comparison of the original approach versus tailored preprocessing and statistical testing reveal that adequate gene set selection yields supreme results in terms of classification sensitivity. Additionally, comparative separability assessment demonstrates that with a similar level of separability is possible to obtain with a smaller gene set, which, apart from reducing the chance of finding false positives, diminishes the number of iterations that need to be performed in a classification scheme. This may be considered as significant in terms of computational resources necessary for performing analyses.

Conclusions
The presented research confirms the significance of careful data preprocessing including batch effect adjustment and adaptive filtration for inference in a well designed large study of gene expression data in leukaemia patients. The above has been confirmed through statistical and functional analysis supported by bioinformatics repository information and literature survey of the biological conclusions. The obtained outcome produced four candidate biomarkers which imply further investigation through data mining procedures. The unique candidate biomarkers that have not been previously described in literature require experimental assessment to ultimately validate their suitability as auxiliary indicators of disease subtypes in leukaemia.
The contribution of the study is the original design of the data analysis pipeline tailored to large, multiclass, bioinformatic data. Compared to standard techniques the proposed design includes two-fold modifications. The first modification is in the preprocessing stage, more careful and elaborate, which allows for better reducing of measurement artifacts in the data while keeping the useful information. The second modification is the procedure for choice of the differentially expressed genes. We point out that in the multiclass experiments the concept of DEGs becomes more complex than in the two-class case. We introduce the definition of the class enhanced DEG and biomarkers. CE-DEG is a feature, which shows differential expression between the given class and all remaining classes grouped together. A biomarker is a CE-DEG, which additionally has a property that it does not show differential expression between any pair of the remaining classes. We apply the proposed data analysis pipeline to the MILE dataset and we demonstrate that the list of the obtained CE-DEGs, while comparable in size, is different than the list of DEGs computed in the MILE study. We also prove that our CE-DEG signature leads to significantly better classification power of the multi-class data classifiers.
In conclusion, the provided deep data analysis pipeline (Fig. 8) proves to be an advantageous tool for screening high-throughput molecular biology data sets.