The AML/ALL leukaemia dataset
The utility of coXpress is demonstrated using gene expression data from the leukaemia microarray study of Golub et al. [21]. This dataset comprises gene expression measurements from 38 tumour mRNA samples: 27 acute lymphoblastic leukaemia (ALL) cases and 11 acute myeloid leukaemia (AML) cases. The HU6800 Affymetrix array was used, which contains 6800 probesets. The dataset was filtered to remove every gene with a negative value in any sample, leaving 2568 genes.
Using coXpress, the genes were first clustered according to their expression levels in the 27 ALL samples, using the cluster.gene function. The distance measure used was 1 - r, where r is the Pearson correlation coefficient. The resulting tree was cut at a distance of 0.4, which corresponds to a correlation coefficient of 0.6, using the cutree function.
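The clustering step can be illustrated with a minimal sketch in Python. This is not the coXpress code: the function names, the toy data and the choice of average linkage are assumptions made for illustration, since the linkage method is not stated here. It shows only how a 1 - r distance and a cut height of 0.4 yield groups.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def cluster_at_height(profiles, height):
    """Agglomerative clustering on 1 - r, stopped at `height`.

    profiles: dict mapping gene name -> list of expression values.
    Average linkage is an assumption of this sketch.
    Returns a list of sets of gene names.
    """
    dist = {frozenset(p): 1 - pearson(profiles[p[0]], profiles[p[1]])
            for p in combinations(profiles, 2)}
    clusters = [frozenset([g]) for g in profiles]

    def linkage(a, b):
        # Mean of all gene-to-gene distances between the two clusters.
        total = sum(dist[frozenset((g, h))] for g in a for h in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: linkage(*ab))
        if linkage(a, b) > height:
            break  # every remaining merge lies above the cut height
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return [set(c) for c in clusters]

# Hypothetical toy data: g1/g2 rise together, g3/g4 share another pattern.
toy = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 10],
    "g3": [5, 1, 4, 2, 3],
    "g4": [10, 2, 8, 4, 6],
}
print(cluster_at_height(toy, height=0.4))
```

Stopping the merges at a height of 0.4 plays the role of cutting the tree at that height when the linkage is monotone, as average linkage is.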
These groups were then examined in both the ALL and AML cases using the coXpress function. The observed t statistics in all cases were compared with the t statistics generated by randomly resampling the dataset 10,000 times for each group size. The resulting table contains one row for each group.
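The comparison against resampled statistics can be sketched as follows. This is an illustrative outline only: `statistic` stands for any function that scores a group of genes (the coXpress t statistic is not reimplemented here), and the function names and the add-one correction are assumptions of this sketch.

```python
import random

def resampled_null(gene_names, group_size, statistic, n_resamples=10_000, seed=0):
    """Score `n_resamples` randomly drawn gene groups of a fixed size.

    gene_names: list of all gene names in the dataset.
    statistic:  callable taking a list of gene names and returning a number.
    """
    rng = random.Random(seed)
    return [statistic(rng.sample(gene_names, group_size))
            for _ in range(n_resamples)]

def empirical_p(observed, null_stats):
    """Add-one estimate of the probability that a null statistic
    is at least as large as the observed one."""
    exceed = sum(1 for s in null_stats if s >= observed)
    return (exceed + 1) / (len(null_stats) + 1)  # add-one correction (assumption)
```

Because the null distribution depends on the group size, it is generated once per size and then reused for every group with that number of members.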
To test the robustness of the method to outliers, a bootstrapping approach was used. Each group was re-tested 1000 times, each time randomly selecting 75% of the observations for each leukaemia subtype (20 ALL cases and 8 AML cases). The number of times each group was found to be differentially co-expressed by the coXpress method was recorded.
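A minimal sketch of this robustness check is given below. It assumes sampling without replacement and rounding down (75% of 27 and 11 samples gives 20 and 8), and `is_differential` is a hypothetical callable standing in for a full coXpress test of one group on the selected samples.

```python
import random

def subsample(samples, fraction, rng):
    """Select floor(fraction * n) samples without replacement."""
    k = int(len(samples) * fraction)
    return rng.sample(samples, k)

def robustness(is_differential, cases_a, cases_b,
               rounds=1000, fraction=0.75, seed=0):
    """Fraction of rounds in which the group is still called
    differentially co-expressed on the subsampled cases."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(rounds):
        sub_a = subsample(cases_a, fraction, rng)
        sub_b = subsample(cases_b, fraction, rng)
        if is_differential(sub_a, sub_b):
            hits += 1
    return hits / rounds

# Hypothetical sample labels matching the sizes of the two subsets.
all_cases = ["ALL_%d" % i for i in range(1, 28)]   # 27 cases
aml_cases = ["AML_%d" % i for i in range(1, 12)]   # 11 cases
```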
Table 1 shows the results filtered for groups that are non-random in the ALL subset, random in the AML subset, and with more than 6 members. As can be seen, there are 10 groups, varying in size from 7 to 34 members. The mean pairwise correlations for the groups are all above 0.6 in the ALL cases, yet show little or no correlation in the AML cases, with mean values ranging from -0.093 to 0.144. The robustness resampling method provides evidence that the groups found are robust to outliers, with nine out of ten groups being found in over 90% of the resampled data sets, and the other being found in 76%.
Table 1 Differentially co-expressed groups from the Golub dataset
Figure 2 demonstrates the method of coXpress. These graphs show data from the largest of the groups, group 3, which has 34 members. Fig. 2A compares the distribution of pairwise correlation coefficients in the ALL subset with two random distributions. The blue graph is the distribution of observed correlation coefficients in the ALL subset for group 3, the red graph is the distribution of pairwise correlation coefficients from data generated by the random uniform distribution, and the green graph is the distribution of pairwise correlation coefficients from a group of genes randomly selected from the dataset. As can be seen, the observed distribution for this group in the ALL subset is very different from the two random distributions. Fig. 2B is an identical graph for the group based on the AML subset. This time, the observed distribution shows no difference compared to the two random distributions. The t-statistics for each distribution are shown on these graphs. Fig. 2C shows the observed t-statistic for group 3 in the ALL subset compared to the distribution of 10,000 randomly generated t-statistics, and Fig. 2D is the equivalent graph for the AML subset. Again, it is clear that this group in the ALL subset is non-random, yet is no different to random in the AML subset.
Figure 3 shows the top 3 groups in Table 1 graphically. Fig. 3A is the largest of the groups, with 34 members. These 34 genes have a mean pairwise correlation of 0.70 in the ALL subset, but only 0.003 in the AML subset. Fig. 3B shows a smaller group, with 7 members, with a mean pairwise correlation of 0.72 in the ALL subset and -0.09 in the AML subset. Finally, Fig. 3C shows a group with 11 members, with a mean pairwise correlation coefficient of 0.679 in the ALL subset and only 0.086 in the AML subset. These graphs were produced using the plot.compare.group and plot.cluster.genes functions.
Figure 4 shows the same three groups in a different way. Here, each plot is a representation of the correlation matrix of the group of genes in either the ALL or the AML subsets. Each coefficient in the correlation matrix is represented as a square, with the colour of the square representing the amount of correlation. The colour scale used is green to red, with green representing -1 (negative correlation), red representing +1 (positive correlation) and black representing 0 (no correlation). In all three groups, the correlation matrices are red for the ALL subset, yet are a mixture of black, green and red in the AML subset. This view of the data is more useful than simply considering the average pairwise correlation, as it shows all of the values in an intuitive way. These graphs were produced using the show.cor.matrices function.
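The colour scale described above can be sketched as a simple mapping from a correlation coefficient to an RGB triple. The linear interpolation and the function name are assumptions of this sketch and not the show.cor.matrices implementation.

```python
def correlation_to_rgb(r):
    """Map r in [-1, 1] to (red, green, blue) with 0-255 channels.

    -1 -> pure green, 0 -> black, +1 -> pure red; values in between
    are interpolated linearly (an assumption of this sketch).
    """
    r = max(-1.0, min(1.0, r))      # clamp to the valid range
    level = int(255 * abs(r))       # brightness grows with |r|
    return (level, 0, 0) if r > 0 else (0, level, 0)
```

With such a scale, a group that is tightly co-expressed appears as a uniformly red matrix, whereas a group with no consistent correlation appears as a mixture of dark, green and red squares.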
In each of the differentially co-expressed groups, not all pairwise correlation coefficients will have decreased or changed. To examine which pairs of genes have changed, the inspect.group function should be used. Table 2 shows the ten pairwise correlation coefficients that have changed the most between the ALL and AML subsets in group 62. As can be seen, these pairs of genes are all positively correlated in the ALL subset but are negatively correlated in the AML subset. Table 3 shows the ten pairwise correlation coefficients that have changed the least between the ALL and AML subsets in group 62. Many of these pairs of genes are still positively correlated in the AML subset, but not to the same extent. It is important that each differentially co-expressed group is examined in this way to determine which of the pairs of correlated genes have changed and which have not.
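The idea of ranking gene pairs by how much their correlation changes between two subsets can be sketched as follows. This is an illustration only, not the inspect.group function: the names and the toy data are hypothetical, and ranking by the absolute difference is an assumption.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def rank_pair_changes(profiles_a, profiles_b, genes):
    """List (gene1, gene2, r_a, r_b, |r_a - r_b|) for every pair in
    `genes`, ordered from most changed to least changed."""
    rows = []
    for g, h in combinations(genes, 2):
        r_a = pearson(profiles_a[g], profiles_a[h])
        r_b = pearson(profiles_b[g], profiles_b[h])
        rows.append((g, h, r_a, r_b, abs(r_a - r_b)))
    return sorted(rows, key=lambda row: row[4], reverse=True)

# Hypothetical toy data: only g2 behaves differently in condition B.
cond_a = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [1, 3, 2, 4]}
cond_b = {"g1": [1, 2, 3, 4], "g2": [8, 6, 4, 2], "g3": [1, 3, 2, 4]}
for row in rank_pair_changes(cond_a, cond_b, ["g1", "g2", "g3"]):
    print(row)
```

The head of the ranked list corresponds to a "most changed" table and the tail to a "least changed" table.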
Table 2 Most changed pairwise correlation coefficients between the ALL and AML subsets in group 62
Table 3 Least changed pairwise correlation coefficients between the ALL and AML subsets in group 62
The GOHyperG function of the GOstats package [22] was used to find GO terms over-represented in the differentially co-expressed groups. Group 3, with 34 members, is enriched for GO terms for lymph node development, cell organisation and biogenesis, and protein biosynthesis and transport. Group 62, which has 7 members, is enriched for GO terms for methyltransferase activity, DNA modification, protein transport and DNA and protein methylation. Group 121, with 11 members, is enriched for GO terms for nucleotidase activity, and RNA splicing, processing and metabolism.
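Over-representation of a GO term is commonly assessed with a hypergeometric test, and the calculation can be sketched in a few lines. The function below is an illustration of that test, not the GOHyperG code; the parameter names are assumptions, and the choice of gene universe (for example, all genes on the filtered array that carry a GO annotation) is left to the caller.

```python
from math import comb

def hypergeom_upper_tail(k, annotated, group_size, universe):
    """P(X >= k) when `group_size` genes are drawn without replacement
    from `universe` genes, of which `annotated` carry the GO term."""
    upper = min(annotated, group_size)
    favourable = sum(comb(annotated, i) * comb(universe - annotated, group_size - i)
                     for i in range(k, upper + 1))
    return favourable / comb(universe, group_size)
```

A small value indicates that the group contains more genes with the term than would be expected from a random draw of the same size.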
The ALL subtype dataset
This dataset is from the acute lymphoblastic leukaemia study of Yeoh et al. [23]. It comprises 248 cases representing six subtypes of ALL: T-ALL, E2A-PBX1, BCR-ABL, TEL-AML1, MLL rearrangement, and hyperdiploid >50. The HG_U95Av2 Affymetrix microarray was used, which contains 12,600 probesets. The dataset was filtered to remove every gene with a negative value in any sample, leaving 1516 genes.
Using coXpress, the genes were first clustered according to their expression levels in the BCR-ABL samples, using the cluster.gene function. The distance measure used was 1 - r, where r is the Pearson correlation coefficient. The resulting tree was cut at a distance of 0.5, which corresponds to a correlation coefficient of 0.5, using the cutree function. These groups were then examined in both the BCR-ABL and T-ALL subsets.
Those groups of size two were analysed using the cox.pairs function. Table 4 shows three pairs of genes that are significantly positively correlated in the BCR-ABL subset, and significantly negatively correlated in the T-ALL subset.
Table 4 Differentially co-expressed pairs in the ALL subtype dataset
Groups of N ≥ 3 were analysed in the BCR-ABL and T-ALL subsets using the coXpress function. The observed t statistics in all cases were compared with the t statistics generated by randomly resampling the dataset 10,000 times for each group size. The resulting table contains one row for each group.
To test the robustness of the method to outliers, a bootstrapping approach was used. Each group was re-tested 1000 times, each time randomly selecting 75% of the observations for each leukaemia subtype. The number of times each group was found to be differentially co-expressed by the coXpress method was recorded.
Table 5 shows the results filtered for groups that are non-random in the BCR-ABL cases, random in the T-ALL cases, and with more than 10 members. Figure 5 shows the top 3 groups in Table 5 graphically. Figure 5A shows a group of 16 genes that have a mean pairwise correlation coefficient of 0.669 in the BCR-ABL subset, yet only 0.06 in the T-ALL subset. Figure 5B shows a group of 10 genes that have a mean correlation of 0.65 in the BCR-ABL subset and only 0.08 in the T-ALL data. Finally, Figure 5C shows a group of 13 genes that have an average correlation of 0.64 in the BCR-ABL data, yet only 0.04 in the T-ALL data. The robustness resampling method provides evidence that the groups found are robust to outliers, with twelve out of thirteen groups being found in over 80% of the resampled data sets, and the other being found in 68.6%.
Table 5 Differentially co-expressed groups from the ALL subtype dataset
The GOHyperG function of the GOstats package [22] was used to find GO terms over-represented in the differentially co-expressed groups. Group 47, with 16 members, is enriched for GO terms for hormone catabolism, glucocorticoid receptor signalling and glucocorticoid catabolism. Group 31, with 10 members, contains two probes for a gene in the RAS oncogene family, and is enriched for GO terms for oxidoreductase activity and ubiquitin activating enzyme activity. Finally, group 89, with 13 members, contains genes annotated as B-cell lymphoma and cancer susceptibility genes, as well as genes enriched for GO terms for endothelial cell migration, regulation of cell motility and migration, angiostatin binding and regulation of blood vessel endothelial cell migration.