Yeast cell cycle data
To demonstrate the utility of the algorithms proposed here, microarray data related to the cell cycle of Saccharomyces cerivisiae described by Spellman et al.  were employed. Specifically, the subset of data related to the α-factor block release experiment were used. The data in the α-factor release subset consisted of microarray measurements for 6178 open reading frames at 7 minute intervals from 0 to 119 minutes, for a total of 18 experiments. This data set was further screened to exclude any genes for which there were more than 4 missing measurements. This data set will be referred to as "Alpha-full" here and consisted of 6044 genes. In addition to this, a smaller set of 696 pre-selected genes from this group was used as well. These genes, identified as exhibiting cell cycle-dependent changes in m RNA expression levels, were the same as those employed by Lu et al. . This data set will be referred to as "Alpha-696" in this work. For both data sets, a corresponding measurement standard deviation matrix was constructed by assuming proportional errors of 20% of the measurement. This value is consistent with observations we have made on other microarrays, but is not critical since the absolute magnitude of the proportional error weighting will not affect the results. In addition, missing measurements (2042 in Alpha-full, 0 in Alpha-696) were set to zero and the corresponding error standard deviation was set to a value much greater than the largest proportional error value in the data (a value of 100 was used in this work).
The cell cycle data set was used because it has been widely studied and exhibits some temporally structured patterns of gene expression. In addition, a number of genes postulated to be associated with cell cycle regulation have been identified. It was hoped confirmation of these patterns could be established through curve resolution methods, although a one-to-one correspondence between the underlying regulating factors and the genes related to these cycles is not necessarily required. It should be emphasized, however, that the objective of this work is to demonstrate the utility of the curve resolution method and not to conduct an extensive analysis of the cell cycle using this tool.
Curve resolution of Alpha-696 data
Initially the MCR-WALS algorithm was applied to the Alpha-696 data set since this had been prescreened to select for genes with cell-cycle related expression patterns and therefore was thought to be more amenable to successful curve resolution analysis. The data were analyzed by specifying 4, 5, 6, 7, 8 and 9 components (factors) and the extracted time profiles (normalized to unit length) are shown in Figure 2 for each case. Different numbers of components were used because it was not known a priori how many underlying factors would be present, although it was suspected that a lower limit would be related to the number of different phases in the cell cycle. Furthermore, more confidence in the results can be achieved if the profiles remain consistent as the number of components is increased. If extraneous components are extracted, these can often be identified by an inconsistent or irregular pattern, or because they correlate with only a few genes that may represent outliers. The profiles extracted by MCR-WALS are not obtained in any particular order, but for the purposes of representation and comparison, they have been arranged in Figure 2 by the order of appearance of the first significant peak value.
The first and most significant feature to notice about Figure 2 is that all of the profiles extracted are consistent with the dynamics of the system being studied. In some cases, a unimodal profile compatible with a unique set of conditions is observed, while in other cases a cyclical pattern suggesting a relationship with the cell cycle is apparent. Most patterns are clear and smooth, with well defined maxima and minima that fall close to zero. These features alone indicate that the results of curve resolution are meaningful and suggest the potential utility of the method. While it is not essential that these underlying regulatory profiles correlate directly with stages in the cell cycle, since the latter are only required to be linear combinations of the former, there is a natural expectation that this will be the case. Because of this, these relationships warrant further investigation.
Considering first the four-component analysis shown in Figure 2a, the relationships with stages of the cell cycle were investigated in two ways. First, the 292 genes classified by Spellman et al. (ref ; Fig. 7) into five categories related to the yeast cell cycle were extracted. The expression profiles of these genes were then normalized and plotted as shown in Figure 3. This allowed a visual comparison between the profiles extracted and those observed for genes reported to be representative of that stage of the cell cycle. Note that, in this plot, missing measurements were interpolated between the surrounding measurements. A second, more quantitative comparison was made through a correlation study, the results of which are presented in Table 1. This was conducted by first finding the genes in the entire set (Alpha-full, 6044 genes) that correlated most strongly with each of the profiles extracted. Correlation coefficients were calculated around the origin (rather than around the mean) and a cutoff of >0.8 was arbitrarily chosen to indicate substantial similarity with the profiles. The number of genes meeting this criterion for each profile is listed as the parameter "Ntot" in Table 1. Within this group, the number of genes which were on Spellman's list of 292 genes is was also determined and is given as "(Nmatch)" in the table. For each profile, correlated genes that were found on this list are given in the table in descending order of correlation, along with the cell cycle stages to which they were assigned. Also shown in parentheses is the rank of that gene (among all genes) in the correlation list and the corresponding correlation coefficient. In order to conserve space, a maximum number of 20 genes from each list was allowed in the table.
The first of the four curves in Figure 2a (curve 1, as labeled on the left-hand side) is characterized by a maximum at time zero that falls relatively quickly to baseline levels. This curve was initially thought to be associated with genes that are down-regulated after release from α-factor arrest as they do not appear to be part of the regular cell cycle, although there is another small increase around one hour. In Table 1, only 5 of the genes classified in Spellman's list correlate highly with this profile (r>0.8) and only 35 genes in the set of 6044 in Alpha-full show this level of correlation. Interestingly, the most highly correlated profile of the 6044 genes is MFA1, which has been classified as being associated with G1. The other four classified genes are associated with M/G1. The fact that only 35 genes exhibit an expression profile with a correlation of >0.8 suggests that this profile is rarely observed in its pure form, but is likely to be a component of many genes. The significance of this is discussed in more detail in the context of curve 4 below.
The second curve in Figure 2a has a much clearer interpretation. Comparison with Figure 3a readily suggests an association of this profile with G1. This is further supported by the data in Table 1. All of the top five correlated genes have been classified by Spellman as G1, as well as 20 of the top 30. A total of 82 of a possible 119 genes classified as G1 by Spellman have a correlation coefficient of >0.8 with curve 2, while 3 were classified as M/G1 and 7 were classified as S, for a total of 92 genes from Spellman's list. In order to make this classification more quantitative, each profile was given a score for each of the five cell cycle classes. This score was calculated by:
This equation describes the score for profile p on class c, where c represents G1, S, G2, M or M/G1. The term r
represents the correlation coefficient between profile p and the expression profile of gene i, where the summation is over all N
genes in class c (as given by Spellman) that have a correlation coefficient greater than 0.8 with profile p. The quantity N
is the total number of genes in that as class classified by Spellman (G1 = 119, S = 37, G2 = 34, M = 61, M/G1 = 41). As an example, if all of the 119 G1 genes in Spellman's list had a correlation coefficient of unity with curve 2, the score would be 1, which is the highest value attainable. In Table 1, the score for curve 2 on G1 is 0.539, while the next highest is 0.154 for S, supporting the classification of G1.
Curve 3 in the four-component case also exhibits a cyclical pattern that is shifted later in time from Curve 2. This would be consistent with S, G2 or M and all of these classes give high scores with Curve 3, with 31/37 of the designated S genes, 27/34 of the designated G2 genes, and 38/61 of the designated M genes giving correlation coefficients above 0.8. Visually (see Figure 3b) and by score, the best match appears to be S, but it is likely that these three groups have merged together in this profile since an inadequate number of components were specified to capture all of the elements of the cell cycle. This is not a very specific group, since more than half of the 6044 genes give a correlation of 0.8 or better.
The fourth curve in Figure 2a gives a poor score with all of the designated classes except for M/G1. Of the 15 highly correlated genes with designated classes, 8 of these are M/G1, 6 are M, and 1 is G1. The score and number of correlated M/G1 genes is not especially high and the reason for this becomes clear on visual inspection of Figure 3. In addition to the peak around 70 minutes, the M/G1 designated genes also exhibit a peak just before the first peak of Curve 2, which has been designated G1. Thus, it would appear that the M/G1 phase is a composite of the two underlying functions shown in curves 1 and 4. Furthermore, close examination of the expression patterns for the designated M/G1 genes in Figure 3e reveals that some of these gene expression profiles are dominated by the first peak, some by the second peak, and some show a distinctive rapid decay from time zero. This suggests the presence of two or more processes underlying these genes. This does not mean that all of these genes cannot be considered to belong to the M/G1 classification, only that there may be more than one driving force behind the expression of these genes.
This initial four-component analysis was promising, but suggested (as anticipated) that there were too few components to adequately model the cell cycle. It was expected that extension to five components would allow better resolution of the S, G2 and M phases. While this did happen, other unexpected observations were made. The results of this analysis are shown in Figure 2b and Table 2 (the correlation coefficients have been removed to save space). In this case, curves 3 and 4 exhibit clear and excellent matches to designated S and M genes, respectively, although curve 3 also exhibits some strong G1 and M character, as might be expected. Also, as for the four-component model, the M/G1 phase seems to be a combination of the first and last curves and therefore gives only a moderate score in each case. What is particularly interesting is that the strongly correlated G1 pattern that was so apparent in the four-component model has disappeared in the five-component case. Instead, in this case, curve 2 represents the first peak of the G1 profile, but because the second cycle is absent, strong correlations with the designated genes are not observed. It is our interpretation that the second cycle of the G1 phase is now represented in curve 5 and what is seen in curves 2 and 5 are two separate regulating profiles for a mix of genes designated as G1 and M/G1. Note that in the five-component case relative to the four-component case, curve 1 falls off more sharply, curve 2 is shifted to an earlier time, and curve 5 (corresponding to curve 4 in Figure 2a) is shifted to a later time, all of which would be consistent with a blending of G1 and M/G1. Another notable feature of the five-component analysis is that there is no clearly defined curve for G2. However, the profiles for the designated G2 genes shown Figure 3c do not show distinctive features and could easily be obtained through linear combinations of the five curves presented.
The six-component model, presented in Figure 2c shows essentially the same features as the five-component model, with the addition of one new profile characterized by a single spike in expression levels at the second time point at t = 7 minutes. Because of the transient nature of this peak, it is not clear whether this represents a real profile, or whether it is simply an artifact of outliers in the data. Only one of the genes in Spellman's classification (PHD1 - M) shows a strong correlation with this profile (rank = 8, r = 0.84) and there are only 10 genes with a correlation above 0.8. This does not, however, exclude it from being a component of other expression profiles. Because the remaining profiles were so similar to the five-component model, a full table of correlation data is not included.
The profiles of the seven-component model shown in Figure 2d includes many of the same patterns as were seen in the five- and six-component models, but also brings a refinement that validates some of the hypotheses extended for the simpler models. To conserve space, the full table of results for the seven-component model is included as supplementary data (see additional file 1: TableA) and only summarized here. Curves 1, 2 and 3 are essentially unchanged from the six-component model. Curve 4, which is associated with the S phase, has a notable change in that the second cycle is considerably reduced in its magnitude. The score on the S phase genes is reduced to 0.281, compared to 0.743 and 0.721 for the five- and six-component models, respectively. More importantly, however, this profile is now much more specific for the S phase genes. In the five- and six-component models, the scores of this curve on G1 were 0.326 and 0.305, while the scores on G2 were both 0.443. For the corresponding curve in the seven-component model, the G1 score is only 0.041 and the G2 score is 0. Furthermore, the number of correlated genes has dropped from 1398 in the five-component model and 1294 in the six-component model to only 45 in the seven-component model. This clearly indicates that the model expansion has permitted this curve to become much more specific in representing the S phase.
Curve 5 for the seven-component model, which is representative of the M phase, has remained essentially the same as in the five- and six-component models, but curves 6 and 7 represent a further refinement of the last profile in the previous models. In earlier models, it was postulated that the G1 and M/G1 phases were driven by two distinct underlying functions, each correlated with one of the two peaks in the cycles. For the five-component model, it appeared that the last profile was a blending of the second cycle of these two phases. Now, in the seven-component model, it appears that this mixing has been resolved in curves 6 and 7. Curve 6, which has been shifted to shorter times, is representative of the second cycle of M/G1 and correlates with four designated genes in that group (see additional file 1: TableA). Curve 7 is shifted to longer times and correlates well with the second cycle of the G1 phase, again matching four genes in that group. As before, neither of these curves by itself gives a strong correlation with a large number of designated genes in the corresponding phases, but this is because they only represent half of the cycle. Even so, the genes that are correlated represent a statistically meaningful group of the overall population. When the time profile is divided into two and each half is considered separately, the G1 score for curve 3 increases from 0.007 to 0.405 (first half) and the G1 score for curve 7 increases from 0.028 to 0.870 (second half). Likewise, the M/G1 score for curve 1 increases from 0.130 to 0.161 (first half) and that for curve 6 increases from 0.087 to 0.303 (second half). This is further evidence of two independent driving functions for G1 and M/G1, one which is active on release from α-factor arrest and another which becomes activated in the second cycle.
Extensions of the model to 8 and 9 components, as shown in Figures 2e and 2f, retain essentially the same features as the earlier models, but some more subtle changes are evident. In particular, the profile associated with the S phase (curve 4 in Figures 2c–f) seems to follow the same pattern as the G1 and M/G1 genes, with the two parts of the cycles separating from one another. For the eight-component model, the second half of the S phase is likely modeled by the newly appearing curve 5, which has a high score on the genes classified as S (0.529) as well as a high score on the G2 genes (0.270). For the nine-component model, curve five develops more G2 character and for the first time a profile is classified in this group with a score of 0.470, although it also exhibits similarity to S and M (scores of 0.199 and 0.304, respectively). In this case, some of the second half of the S cycle has likely been blended into the second half of the G1 cycle represented as curve 9. The nine-component model is also characterized by a new profile in curve 6 (although it is arguable which is "new") that appears to be a combination of G2 and M.
At this point, the capabilities of curve resolution are approaching their limits and become increasingly speculative. As more components are added to the model, the algorithm diverts its efforts to modeling more subtle changes in the expression patterns and eventually artifacts that may be more related to noise than biological change creep into the profiles. This is evidenced by some of the later models which show more sporadic variations than the earlier ones. Also, as the changes being modeled in the cell approach finer and finer resolution in the time domain, our confidence in the results becomes eroded. For example, we can question whether curve 2 in Figures 2c–f is real or is an artifact of that particular time point. The limits of the curve resolution approach can be extended in several ways that include the acquisition of better quality data, more frequent sampling of the system, and the inclusion of reliable uncertainty information in the measurements.
It is important to remember that this modeling was performed with no prior assumptions about the components present, only assumptions of bilinearity and non-negativity were used. To further illustrate the effectiveness of curve resolution for extracting cell cycle related information, Figure 4 shows the normalized expression profiles of five genes used by Lu et al  to represent the phases of the cell cycle. These are shown as solid lines. Superimposed on these (dashed lines) are selected curves from the eight-component model with close matches. For G1, two curves are shown because it is postulated here that this cycle is driven by two different underlying processes. (The same is proposed for M/G1, but only one profile is evident in the selected gene.) No model profile is compared to the selected curve for G2, since there was no definitive match. Overall, the synchronicity of the extracted profiles with the independently selected genes is very good and supports this method of analysis.
As further evidence of the legitimacy of the profiles extracted by curve resolution, Figure 5 shows each of the curves extracted from the eight-component model (dashed lines) plotted with the 40 most highly correlated expression profiles (normalized to unit length) from Alpha-full. This plot confirms the presence of the predominantly unimodal profiles (curves 1,2,3, 4, 7 and 8), supporting the case for separate underlying regulatory factors for G1 and M/G1.
Curve resolution for Alpha-full data
In the analysis of the subset of genes represented by the Alpha-696 data set, it could be argued that the original microarray data had already been screened for genes that are known to be associated with cell cycle regulation. In other studies where such a subset selection may not be possible, the utility of the curve resolution method remains to be demonstrated. In other words, can curve resolution be used to extract original information from microarray data sets with no prior knowledge of gene association? To answer this question, the curve resolution algorithm was applied to the entire Alpha-full data set (6044 genes) without any prior information other than the proposed number of components and the non-negativity constraints normally applied. The results of this analysis are presented in Figure 6. The correlation and classification analysis for the four-component model is given in Table 3, and that for the five-component model is included as supplementary data to conserve space (see additional file 2: TableB).
On initial examination of Figure 6, it is immediately clear that the profiles extracted have some association with the cell cycle. Furthermore, comparison with Figure 2 reveals strong similarities in the profiles, although the profiles extracted from the Alpha-696 data set are generally more cleanly defined. This was anticipated, since the cell cycle genes in the expanded data set are diluted by other genes that may be unrelated or noisy. Another difference between the two sets of results is that, while the profiles exhibited are similar, they do not always appear in the same sequence. For example, in the Alpha-696 set, the two cycles of G1 separate into different components by the five-component model, while for the Alpha-full set this does not happen until six components are employed. Likewise, in the full data set, the sharp transient at 7 minutes does not appear until the seven-component model, where it was evident in the six-component model for the reduced data set. This is expected, since the distribution of gene expression profiles will be different in the full data set compared to the reduced set.
A complete analysis of the Alpha-Full profiles will not be carried out here, since the treatment is similar to the reduced data set. The tables show correlation data with the genes classified by Spellman for the four- and five-component models, respectively. For the four-component model (Table 3), the classifications for G1, S, M, and M/G1 (early expression) are clear. For the five-component data (see additional file 2: TableB) the classifications are less definitive, but the profiles still show a strong association with cell cycle regulated genes. In general, the correlation scores become more ambiguous as the number of components increases. This is due to several factors, including the blending of similar profiles, the resolution of profiles into early and late components, and the noise in the profiles resulting from noisy data. Nevertheless, the trends are clear and support the contention that this method can be used to extract underlying information in an unbiased way with no prior knowledge about the data.
Uniqueness of MCR-WALS solutions
An important consideration in the application of MCR is the uniqueness of the solutions it produces. In the work presented here, one set of solutions was presented for each model/data set combination. This is a common practice in the presentation of MCR results, but it is not very realistic. While it is hoped that the reported solution is representative, a range of equivalent or nearly equivalent solutions is usually possible. Reasons for this include: (1) the possible existence of mathematically degenerate solutions to Eq. (1) (rotational ambiguity), (2) computational and numerical limitations of the method used, and (3) noise in the data.
In their original work on self-modeling curve resolution for two-component systems, Lawton and Sylvestre  recognized that a set of solutions was possible, even in the presence of constraints. Therefore, the profiles (absorbance spectra in their case) were presented in the form of allowed boundaries ("feasible solutions") where the constraints could be satisfied. Although a number of attempts have been made to extend the analytical solution provided by Lawton and Sylvestre to more than two components [33–35], this has proven to be very difficult, especially in the presence of noisy data. Moreover, the notion of profile "boundaries" is not very meaningful in higher dimensions, since all of the profiles in a degenerate set are linked together and this is not reflected in the presentation of profile boundaries; i.e. it is not possible to mix solutions for the components arbitrarily . Some attempts have been made to attach boundaries to modern MCR methods such as ALS with mixed success [36–38]. Fortunately, many experimental situations lead to solutions that are unique or tightly bounded, so a single solution is often acceptable. This is because the nature of the data may lead to measurement points (e.g. times or genes) that are unique or highly selective for one component. Unfortunately, this is difficult to assess a priori.
Another source of multiple solutions is computational limitations. It is possible (and likely) that different starting points will yield different solutions, not only because of rotational ambiguities, but also because of local minima or premature termination. The ALS algorithm is quite stable in its convergence properties (although it can be slower than other minimization strategies) and this is one of the reasons for its popularity, but it is not immune from numerical problems. In this work, the use of the SIMPLISMA algorithm  removed the random element of initialization, although it should be noted that this method does not work well in the presence of large amounts of noise.
Finally, measurement noise plays an important role in the solutions obtained. Clearly, the data represents a single realization of the experimental results and MCR solutions for replicate experiments are not expected to be identical. Without the availability of replicate data, this contribution to the variability is difficult to assess directly, but it can be inferred through re-sampling methods.
To gain insight into the reproducibility of the solutions presented in Figures 2 and 6, two approaches, "random initialization" and "random sub-sampling" were adopted and new solutions were generated from ten replicate runs of the MCR-WALS algorithm in each case. In the random initialization approach, different initial estimates of P were obtained each time the program was run by randomly selecting individual gene profiles that were used as starting values. In the random sub-sampling approach, ten subsets of data, each half the size of the original data set, were obtained by randomly selecting gene expression profiles from the original data. These were then analyzed by MCR-WALS with initial estimates obtained via SIMPLISMA. Figure 7 shows some of the results of these studies. Although models with fewer components generally produce more reproducible results, the profiles in Figure 7 are for the more demanding six-component model. For simplicity, only two components were chosen for display and these were picked on the basis of their consistency among different models and between the Alpha-696 and Alpha-full data sets (components 4/4 and 3/2 in Figures 2/6).
The results in Figure 7 show good reproducibility among the profiles extracted in terms of the major features that they exhibit. As expected, the range of solutions is generally narrower for the Alpha-696 data, since these genes were preselected on the basis of cell-cycle association, but good results were still obtained for the full set of genes. For the bimodal curves in the right-hand panels, there is some shifting of the relative contributions of the two peaks and some small changes in their positions, but the association of these profiles with the cell cycle is unmistakable and is clearly not a statistical aberration. In the left-hand panels, a feature that was not apparent in either of the original analyses but was evident in the reproducibility studies is the second peak that occurs around 80 min. The presence of this peak is consistent with the second cycle of G1-regulated genes that were associated with the unimodal peak in the original analysis, so its appearance is not surprising and supports the original classification.
Clearly this type of reproducibility analysis is useful in assessing the reliability of the profiles extracted by MCR-WALS. Although there might be an inclination to report the average of these solutions as an overall solution, this is not recommended since the average profiles do not generally define an acceptable solution set. It should be noted that some of the profiles in the higher component models showed significant variability, but this was expected and is intimated at by the shape of the profiles themselves. In addition, obtaining consistent matching of the profiles from replicate runs can be a challenge, since the correlations are not always obvious.