## Abstract

### Background

Large gene expression studies, such as those conducted using DNA arrays, often provide millions of different pieces of data. To address the problem of analyzing such data, we describe a statistical method, which we have called 'gene shaving'. The method identifies subsets of genes with coherent expression patterns and large variation across conditions. Gene shaving differs from hierarchical clustering and other widely used methods for analyzing gene expression studies in that genes may belong to more than one cluster, and the clustering may be supervised by an outcome measure. The technique can be 'unsupervised', that is, the genes and samples are treated as unlabeled, or partially or fully supervised by using known properties of the genes or samples to assist in finding meaningful groupings.

### Results

We illustrate the use of the gene shaving method to analyze gene expression measurements made on samples from patients with diffuse large B-cell lymphoma. The method identifies a small cluster of genes whose expression is highly predictive of survival.

### Conclusions

The gene shaving method is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worth further investigation.

### Similar content being viewed by others

## Background

Through the use of recently developed DNA arrays, it is now possible to obtain accurate, quantitative (relative) measurements of a large proportion of the mRNA species present in a biological sample. DNA arrays have been used to monitor changes in gene expression during important biological processes (for example, cellular replication and the response to changes in the environment), and to study variation in gene expression across collections of related samples (such as tumor samples from patients with cancer). A major challenge in interpreting these results is to understand the structure of the data produced by such studies, which often consist of millions of measurements. A variety of clustering techniques have been applied to such data, and have proved useful for identifying biologically relevant groupings of genes and samples [1,2,3,4,5,6,7,8,9,10,11,12,13]. Although the underlying principles and computational details of these methods differ, they share the goal of organizing the elements under consideration (such as genes) into groups (clusters) with coherent behavior across relevant measurements (such as samples). Generally absent is any consideration of the nature of the coherent variation. For example, one might want to identify groups of genes that have coherent patterns of expression with large variance across samples, or groups of genes that optimally separate samples into predefined classes (such as different clinical response groups in tumor samples). Here, we introduce a new statistical method, which we call gene shaving, that attempts to identify groups of elements (genes) that have coherent expression and are optimal for various properties of the variation in their expression.

Figure 1 shows the dataset used in our study, which consisted of 4673 gene expression measurements on 48 patients with diffuse large B-cell lymphoma (DLCL). These data have been described in detail previously [14]. The column labels refer to different patients, and the rows correspond to genes. The order of rows and columns is arbitrary.

Some authors have recently explored the use of clustering methods to arrange the genes in some systematic way, with similar genes placed close together (see [2] for developments and [15] for an overview). In Figure 2, we have applied hierarchical clustering to the genes and samples separately. Each clustering produces a (non-unique) ordering, one that ensures that the branches of the corresponding dendrogram do not cross. Figure 2 displays the original data, with rows and columns ordered accordingly.

Some structure is evident in Figure 2, and this method can be used to recognize relationships among the genes and samples.With any method that reduces the dimension of the data, however, finer structure can be lost. For example, suppose the expression of some subset of genes divides the samples in an informative way, correlating with the rate of proliferation of tumor cells, for example, whereas another subset of genes divides the samples a different way, representing the immune response, for example. Then methods such as two-way hierarchical clustering, which seek a single reordering of the samples for all genes, cannot find such structure.

The method of gene shaving we describe here is designed to extract coherent and typically small clusters of genes that vary as much as possible across the samples. Figure 3 shows three gene clusters for the DLCL data, found using shaving. Some of the genes within each cluster lie close to each other in the hierarchical clustering of Figure 2, but others, and the clusters themselves, are quite far apart.

In Figure 3 the samples have been ordered by values of the average gene expression. This average gene is a good representative of the cluster, as all the members are so similar. The variance measures at the top of each cluster are discussed in more detail later. The clusters are all of different sizes. We use an automatic method for determining the size of the clusters, based on a randomization procedure that protects us from looking too hard in the large sea of genes and finding spurious structure. The three cluster-average genes, one from each cluster, are reasonably uncorrelated (see below and Figure 6). This is another aspect of the shaving process - it seeks different clusters, where difference is measured by correlation of the cluster mean. Figure 4 shows the results of a hierarchical clustering applied to the three column-average genes. Whereas hierarchical clustering suggests two main gene groupings, the shaving process may suggest more useful groupings.

This article is organized as follows. In the section 'Gene shaving' we describe the method itself. The section entitled 'The gap estimate of cluster size' outlines the gap test for choosing the cluster size. In the section 'Predicting patient survival' we try to predict patient survival from gene cluster averages. 'Supervised shaving' is discussed in the following section. Finally, in the 'Conclusions' we propose some further generalizations. A more statistical treatment of gene shaving is given in [16].

## Results

### Gene shaving

In this section we describe in detail our technique for finding clusters like the example in Figure 3. A gene expression matrix is an *N* × *p* matrix of real-valued measurements *X* = x_{
ij
}. The rows are genes, the columns are samples, and x_{
ij
} is the measured (log) expression, relative to a baseline. Typically there are missing entries in *X.* We use a technique described in [17], an iterative algorithm based on the singular value decomposition, for imputing missing expression values;our analysis here assumes that *X* has no missing values.

Let *S*_{
k
} be the indices of a cluster of *k* genes, and

be the collection of *p* column averages of the expression values for this cluster. Then for each cluster size *k,* gene shaving seeks a cluster *S*_{
k
} having the highest variance of the column averages:

The important question of how to choose the cluster size *k* is addressed in the next section.

Our procedure generates a sequence of nested clusters *S*_{
k
}, in a top-down manner, starting with *k = N,* the total number of genes, and decreasing down to *k* = 1 gene. At each stage the largest principal component of the current cluster of genes is computed. This eigen gene is the normalized linear combination of genes with largest variance across the samples. We then compute the inner product (essentially the correlation) of each gene with the eigen gene, and discard ('shave off') a fraction of the genes having lowest (absolute) inner product. The process is repeated on the reduced cluster of genes. The shaving algorithm is depicted in Figure 5 and given in detail in Box 1.

There are a number of duplicate genes in the dataset. In some cases the sequence for a given gene appears on the microarray more than once, either by design or by accident. In other cases, more than one different EST (expressed sequence tag) is present for the same gene. Gene shaving can be affected by duplicate genes, since they are highly correlated with each other. We therefore averaged expression profiles for the duplicate genes, leaving 3624 unique gene profiles.

The sequence of operations 1-5 in Box 1 gives the first cluster of rows - the first ribbon in Figure 3. Step 6 orthogonalizes the data to encourage discovery of a different (uncorrelated) second cluster. Note that although we work with the orthogonalized matrix in the shaving process for the second and subsequent clusters, the derived clusters and their averages involve the original genes.

The shaving process requires repeated computation of the largest principal component of a large set of variables. If naively implemented, this requires the construction of a *N* × *N* sample covariance matrix ? of the genes, and the computation of its largest eigenvector. We can avoid the computational burden by working in the dual space, where the matrices have dimension *p* × *p.* Furthermore, as we require only the largest eigenvector, the computations can be reduced even further by using the power method, using the eigenvector of the previous cluster as a starting value.

The three resulting clusters are shown in Figure 3 and again in Figure 4. Figure 6 shows the pairwise scatterplots of each of the three column averages ('super genes') from the clusters. The absolute correlations range from 0.27 to 0.68. The full gene names for the members of the first three clusters are given in Table 1.

It is useful to evaluate how much of the dimensionality of the gene expression variation is captured by the clusters derived from gene shaving. We can approximate the expression profile for each gene in the complete dataset as a linear combination of the three super genes from each cluster (which are simple averages of the genes in each cluster). The percent variance explained by the first *j =* 1,2, ... 10 super genes is shown in Figure 7.

Thus the three gene clusters, involving a total of 33 genes, explain about 20% of the variation. The percent variance explained by the first *j* principal components (broken curve) exceeds that from gene shaving. Each principal component gives a non-zero weight to all 3624 genes, however.

### The gap estimate of cluster size

Each shaving sequence produces a nested set of gene clusters *S*_{
k
}, starting with the entire expression matrix and then proceeding down to some minimum cluster size (typically 1). If we applied this procedure to null data, in which the rows and columns were independent of each other, we could still find some interesting-looking patterns in the resulting blocks. Hence, we need to calibrate this process so that we can differentiate real patterns from spurious ones. Even in the case of real structure, it is unlikely that a distinct set of genes is correct for a cluster, and the rest not. More likely there is a graduation of membership eligibility, and we have to decide where to draw the line. Here we describe a procedure based on randomization that helps us select a reasonable cluster size.

Our method requires a quality measure for a cluster. We favor both high-variance clusters, and high coherence between members of the cluster. As the generation of the cluster sequence was driven strongly by the former, we focus on the latter in selecting a good cluster. By analogy with the analysis of variance for grouped data, we define the following measures of variance for a cluster *S*_{
k
}:

The between variance is the variance of the (signed) mean gene. The within variance measures the variability of each gene about the cluster average, also averaged over samples. As this can be small if the overall variance is small, a more pertinent measure is the between-to-within variance ratio *V*_{
B
}/*V*_{
W
}, or alternatively, the percent variance explained

A large value of *R*^{2} implies a tight cluster of coherent genes. This is the quality measure we use to select a cluster from the shaving sequence *S*_{
k
}.

Let *S*_{
k
} index the clusters of a given shaving sequence (with *k* being the number of genes). Let *D*_{
k
} be the *R*^{2} measure for the *k*th member of sequence. We wish to know whether *D*_{
k
} is larger than we would expect by chance, if the rows and columns of the data were independent.

Let *X*^{*b} be a permuted data matrix, obtained by permuting the elements within each row *of X.* We form *B* such matrices, indexed by *b* = 1,2,... *B*. Let *D*_{
k
}^{*b} be the *R*^{2} measure for cluster S_{
k
}^{*b}. Denote by _{
k
}^{*} the average of *D*_{
k
}^{*b} over *b*. The *Gap* function is defined by

We then select as the optimal number of genes that value of *k* producing the largest gap:

The idea is that at the value

the observed variance is the most ahead of expected. Multiple clusters are produced for the randomized data just like for the original data, and the gap test is used repeatedly to select the cluster size at each stage.

For the DLCL data, the maximum for the first cluster occurs at eight genes. Figure 8 shows the percent-variance curves, *D*_{
k
}, for both the original and randomized tumor data as a function of size, and the gap curves used to select the specific cluster sizes in Figure 3.

### Predicting patient survival

One important motivation for developing gene shaving was the wish to identify distinct sets of genes whose variation in expression could be related to a biological property of the samples. In the present example, finding genes whose expression correlates with patient survival is an obvious challenge. Group factors *g*_{1}, *g*_{2}, *g*_{3} were created by splitting each gene cluster in Figure 3 into two groups of 24 patients. We used each of these groupings as a factor in Cox's proportional hazards model for predicting overall survival [18]. Of the group factors only *g*_{2} was significant, at the 0.05 level (*p* = 0.04).

In [14], a cluster of 380 genes was chosen on the basis of their large variation over samples, and their 'germinal center B-like' or 'activated B-like' expression profiles. Using these 380 genes, a hierarchical clustering produced two groups of patients which were (just) statistically different in survival. Close inspection shows that 18 of the 23 genes in the second cluster above also fall into this cluster of 380 genes. Hence, gene shaving can find clinically and biologically relevant subdivisions in gene expression data in an unsupervised fashion.

It may be fortuitous that one of these groupings correlates with survival, as the clusters were not chosen with survival in mind. We next describe a modification of gene shaving that explicitly looks for clusters that are related to patient survival.

### Supervised shaving

The methods discussed so far have not used information about the columns to 'supervise' the shaving of rows. Here we generalize gene shaving to incorporate full or partial supervision.

As in Equation (1), we consider a cluster of genes *S*_{
k
} having column average vector . Let *y* = (*y*_{1}, *y*_{2}, ... *y*_{p}) be a set of auxiliary measurements available for the samples. For example each *y*_{
j
}, might be a survival time for the patient corresponding to sample *j* or a class label for each sample, such as a diagnosis category. Supervised shaving maximizes a weighted combination of column variance and an information measure *J*(, *y*):

for fixed 0 = a = 1. The value a = 1 gives full supervision; values between o and 1 provide partial supervision.

Choice of the measure *J*(, *y*) depends on the nature of the auxiliary information *y*. If the *y* codes class labels, *J*(, *y*) can be taken as the sum of squared differences between the category averages . For censored survival times *y*, think of as a covariate in a Cox (proportional hazards) model. If the score vector from this model is *g,* we set *J*(, *y*) = *gg*^{T}, a *p* × *p* matrix. Computationally we first scale the genes so that the within-risk set variance is 1.

When fully supervised, the shaving procedure reduces to simply ranking the genes from largest to smallest Cox model score test. Thus there is no role for principal components or top-down shaving in this case. However, to encourage coherence within the gene clusters, it can be better to use a partially supervised criterion, which does use principal components and top-down shaving. This is demonstrated in the example below. One can also include other prognostic factors in the model, and direct shaving to find a gene cluster whose column average is a strong predictor in the model. This can be done with other models, for example a linear regression model for a quantitative measurement. Operationally, all of these choices for *J* are quadratic functions of the column averages , and gene shaving can be carried out via principal components of a suitably modified variance matrix.

We applied supervised shaving to the set of 3624 genes from the DLCL samples. Figure 9 examines the effect of different values of the supervision weight a, showing the average (absolute) gene correlation and Cox model *p* value for each choice. From this plot we chose the value a = 0.10, which gives a good mix of high gene correlation and low *p* value. Partially supervised gene shaving produced a cluster with 234 genes, shown in Figure 10 and in Table 2.

Some of the genes are close together in the hierarchical clustering order (indicated by the first number in Table 2), many are not. Some genes have a negative sign, and others have no sign. We will call these 'negative' and 'positive' genes respectively. The negative genes have their expression values flipped before being averaged with other gene expression profiles. Figure 11a shows the gap curve, suggesting a cluster size of 35. However, further analysis below suggests the better cluster size of 234.

The cluster of 234 genes contains many of the strongest individual genes for predicting survival. For example, 130 of the strongest 200 genes are in the cluster. Some weaker genes are, however, also in the cluster, the weakest being the 1332nd strongest gene among the full list of 3624. Figure 11b shows the survival curves stratified by the low and high expression of this gene cluster, using the median of the cutoff point. The two resulting groups are shown in Figure 12.

Using this grouping as a predictor in the Cox model for survival gave a highly significant *p* value (0.00042). However, this *p* value must be scrutinized. Figure 13a,b shows the Cox model *p* value as a function of the cluster size, for both partially and fully supervised shaving. We will call these the 'training set *p* values'. As the gene shaving process was supervised by the survival times, the training set *p* values will be biased downward, and it is important to adjust them appropriately. We permuted the survival times, re-ran the shaving process and computed the corresponding *p* values. This was repeated 100 times, and for each cluster size we computed the proportion of times the permutation *p* values were less than or equal to the training set *p* values. These proportions are shown in Figure 13c,d, and are unbiased estimates of the true *p* values. Fully supervised shaving is too aggressive, and the upward adjustment of the *p* values is large. As a result the *p* value is around 0.05 for the full set of genes, but none of the smaller clusters is significant at the 0.05 level. For partially supervised shaving, many of the *p* values are below 0.05, and from this we chose the cluster size of 234 near the minimum.

Using the full set of genes, flipping each to have positive correlation with survival, averaging their expression values and finally cutting at the median, gave a grouping nearly the same as Groups 1 and 2 in Figure 12. The only change was a swap between DLCL-oo14 and DLCL-oo18, and these two samples are right at the median cutpoint between the two groups in Figure 10.

The international prognostic index (IPI) A score was also available for these patients. Components of the IPI include age, level of the enzyme lactate dehydrogenase (LDH) and the number of extranodal sites. As in [14], we dichotomized IPI scores into low (0, 1 or 2) and high (3, 4 or 5). The resulting grouping seems to be about as predictive as the IPI index, and is quite independent from it, as Table 3 indicates.

When added to a Cox model containing IPI, this grouping had a training set *p* value of 0.0006. Figure 14 shows the Kaplan-Meier survival curves for each group, stratified by low and high IPI.

In [14], two patient groups were defined from a hierarchical clustering tree grown from a 38o-gene cluster. As a predictor, the grouping was just significant in the low IPI group only, at the 0.05 level. Table 4 gives a cross-tabulation of that grouping with the one used in this paper in Figure 10.

Thus 25/36 = 69% of the patients are classified the same way by both groupings. The patient grouping of Alizadeh *et al.* [14] was based on a cluster of 380 genes, chosen for their large variation over the samples. Our cluster of 234 genes has 38 genes in common with this cluster of 380, and they are indicated by an asterisk in Table 2. Five of the 234 genes also appear in the unsupervised clusters found earlier, in the second of the three clusters.

## Conclusions

We have proposed a set of 'shaving' methods for isolating interesting clusters of genes from a set of DNA microarray experiments. The methods may be unsupervised, or may be supervised - that is, use information available about the samples such as a class label or survival time. The proposed shaving methods search for clusters of genes showing both high variation across the samples, and coherence (correlation) across the genes. Both of these aspects are important and cannot be captured by simple clustering of the genes, or thresholding of individual genes based on the variation over samples.

With our model-based approach for supervised shaving, one can incorporate other prognostic factors in the search for interesting gene clusters. If an outcome such as survival time is available for each sample, the method searches for a gene cluster whose column average gene

has a significant effect, possibly the presence of other prognostic factors, for predicting the outcome.

The microarray data *x*_{
ij
} we have considered are real-valued expression levels. However, other kinds of arrays produce different kinds of data. In particular, some arrays detect the presence or absence of single-nucleotide polymorphisms (SNPs), so that the *x*_{
ij
} values take on one of *k* = 2 unordered values. The shaving methods described can be easily modified to handle this kind of data. In detail, we construct *k* data matrices *X*_{1}, *X*_{2} ... *X*_{
k
}, each of size *n* × *m*. The *ij*th element of *X*_{
j
} is 1 if x_{
ij
} falls in class *j,* and zero otherwise. Letting ?_{
j
}*j* = 1, 2, ... *k* be the *n* × *n* covariance matrices of the genes in each *X*_{
j
}, we simply apply principal component shaving, using ?_{
j
}, as the variance matrix for the penalty. This can be done unsupervised, or a supervision term can also be added.

## References

Weinstein J, Myers T, O'Connor P, Friend S, Fornace A, Kohn K, Fojo T, Bates S, Rubinstein L, Anderson N, et al: An information-intensive approach to the molecular pharmacology of cancer. Science. 1997, 275: 343-349. 10.1126/science.275.5298.343.

Eisen M, Spellman P, Brown P, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998, 95: 14863-14868. 10.1073/pnas.95.25.14863.

Roth FP, Hughes J, Estep P, Church G: Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole genome mRNA quantitation. Nat Biotechnol. 1998, 16: 939-945.

Golub T, Slonim D, Tamayo P, Huard C, Gassenbeek M, Mesirov JP, Coller H, Loh M., Downing J, Caligiuri M, Bloomfield C, Lander E: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286: 531-538. 10.1016/S0378-4371(00)00404-0.

Perou C, Jeffrey S, van de Rijn M, Rees C, Eisen M, Ross D, Perga-menschikov A, Williams C, Zhu S, Lee J, et al: Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc Natl Acad Sci USA. 1999, 96: 9212-9217. 10.1073/pnas.96.16.9212.

Alon U, Barkai N, Notterman D, Gish K, Ybarra S, Mack D, Levine A: Broad patterms of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA. 1999, 96: 6745-6750. 10.1073/pnas.96.12.6745.

Brazma A, Jonaxsen I, Vilo J, Ukkonen E: Predicting gene regulatory elements in silico on a genomic scale. Genome Res. 1998, 8: 1202-1215.

D'haeseleer P, Wen X, Fuhrman S, Somogyi R: Inferring gene relationships from large-scale gene expression data. In Information Processing in Cells and Tissues. Edited by Holcombe M, Paton R. New York: Plenum;. 1998, : 203-212.

Ewing R, Kahla A, Poirot O, Lopez F, Audic S, Claverie J: Large-scale statistical analyses of rice ESTs reveal correlated patterns of gene expression. Genome Res. 1999, 10: 950-959. 10.1101/gr.9.10.950.

Mjolsness E, Mann T, Castano R, Wold B: From co-expression to co-regulation: an approach to inferring transcriptional regulation among gene classes from large scale expression data. Technical report JPL-ICTR-99-4. Pasadena: Jet Propulsion Laboratory, Section 365;. 1999

Niehrs C, Pollet N: Synexpression groups in eukaryotes. Nature. 1999, 402: 483-487. 10.1038/990025.

Walker M, Volkmuth W, Sprinzak E, Hodgson D, Klinger T: Prediction of gene function by genome scale expression analysis: prostate cancer associated genes. Genome Res. 9: 1198-1203. 10.1101/gr.9.12.1198.

Tamayo P, Slonim T, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E: Interpreting patterns of gene expression with self-organizing maps: Methods and applications to hematopoietic diferentation. Proc Natl Acad Sci USA. 1999, 96: 2907-2912. 10.1073/pnas.96.6.2907.

Alizadeh A, Eisen M, Davis RE, Ma C, Losses I, Rosenwal A, Boldrick J, Sabet H, Tran T, Yu X, et al: Identification of molecularly and clinically distinct substypes of diffuse large B-cell lymphoma by gene expression profiling. Nature. 2000, 403: 503-511. 10.1038/35000501.

Tibshirani R, Hastie T, Eisen M, Ross D, Botstein D, Brown P: Clustering methods for the analysis of DNA microarray data. Technical report. Stanford: Department of Statistics, Stanford University; . 1999

Hastie T, Tibshirani R, Eisen M, Brown P, Scherf U, Weinstein J, Alizadeh A, Staudt L, Botstein D: Gene shaving: a new class of clustering methods for expression arrays. Technical report. Stanford: Stanford University;. 2000

Hastie T, Alter O, Sherlock G, Eisen M, Tibshirani R, Botstein D, Brown P: Imputation of missing values in DNA microarrays. Technical report. Stanford: Stanford University

Cox D: Regression models and lifetables. J Roy Statist Soc B. 1972, 74: 187-220.

## Acknowledgements

We thank Mark Segal for helpful discussions, and Nick Fisher and Jerome Friedman for their work on 'bump-hunting' which inspired some of the ideas in this paper. The first two authors were supported by grants from The National Science Foundation and the National Institutes of Health. W.C.C. was supported by NIH grant UOI-CA 84967. R.L and collaborators were supported by LPPG grant CA 34233.

## Author information

### Authors and Affiliations

### Corresponding author

## Rights and permissions

## About this article

### Cite this article

Hastie, T., Tibshirani, R., Eisen, M.B. *et al.* 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns.
*Genome Biol* **1**, research0003.1 (2000). https://doi.org/10.1186/gb-2000-1-2-research0003

Received:

Revised:

Accepted:

Published:

DOI: https://doi.org/10.1186/gb-2000-1-2-research0003