The analysis of gene expression temporal profiles is a topic of increasing interest in functional genomics. Model-based clustering methods are particularly interesting because they can capture the dynamic nature of these data and identify the optimal number of clusters. We have defined a new Bayesian method that addresses three issues left unsolved by currently available approaches: time dislocations in gene expression, non-stationarity of the processes generating the data, and data collected on an irregular temporal grid. Our method, which is based on random walk models, requires only mild a priori assumptions about the nature of the processes generating the data and explicitly models inter-gene variability within each cluster. It was first validated on simulated datasets and then applied to a dataset on serum-stimulated fibroblasts. In all cases the results were promising, suggesting that the method can be helpful in functional genomics research.
This work was in part supported by the Progetto di Ricerca di Interesse Nazionale (PRIN) 2003 grant ‘Dynamic modelling of gene expression profiles’ from the Italian Ministry of Education.
The authors have no conflicts of interest that are directly relevant to the content of this article.
Appendix A: Conditional Maximisation
The conditional maximisation method consists of the following steps.
1. Provide an initial estimate for the model parameters θ (a possible choice is to take the mean value of the prior distributions);
2. define an update order of parameter estimates;
3. for each parameter, update the estimate by maximising the marginal posterior distribution given the data and the current estimates of the other parameters;
4. repeat step 3 until convergence.
Convergence is reached when, for each parameter, the relative difference between the new and the old estimate becomes smaller than a fixed tolerance. Convergence of the conditional maximisation algorithm is very quick, almost regardless of the parameters’ order and of their initial estimates. In the following, we apply the conditional maximisation method to the estimation of the cluster parameters (see subsection titled Cluster Parameters Estimation).
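The steps above can be sketched in a few lines of Python. This is a minimal illustration on a toy model, not the cluster model of the article: the data are assumed to be i.i.d. normal with unknown mean `mu` and precision `tau`, with a normal prior on `mu` and a Gamma (shape, rate) prior on `tau`. The function name, the parameter names and the default hyperparameters are all hypothetical. The toy model is chosen because its two conditional posteriors are a normal and a Gamma distribution, so each update has a closed form.

```python
def rel_diff(new, old):
    """Relative difference between successive estimates (guarded against old == 0)."""
    return abs(new - old) / max(abs(old), 1e-12)


def conditional_maximisation(y, m0=0.0, p0=0.01, a0=2.0, b0=1.0,
                             tol=1e-8, max_iter=1000):
    """Conditional maximisation for a toy normal model (illustrative assumption).

    y_i ~ Normal(mu, 1/tau), mu ~ Normal(m0, 1/p0), tau ~ Gamma(shape=a0, rate=b0).
    Returns the estimates of mu and tau and the number of iterations used.
    """
    n = len(y)
    total = sum(y)

    # Step 1: initial estimates taken as the prior means.
    mu = m0
    tau = a0 / b0

    # Step 2: the update order is fixed as (mu, tau).
    for iteration in range(1, max_iter + 1):
        # Step 3a: the conditional posterior of mu given tau is normal,
        # so its mode coincides with its mean.
        mu_new = (p0 * m0 + tau * total) / (p0 + n * tau)

        # Step 3b: the conditional posterior of tau given mu is Gamma;
        # its mode is (shape - 1) / rate, valid here because shape > 1.
        shape = a0 + n / 2.0
        rate = b0 + 0.5 * sum((yi - mu_new) ** 2 for yi in y)
        tau_new = (shape - 1.0) / rate

        # Step 4: stop when every relative change is below the tolerance.
        converged = (rel_diff(mu_new, mu) < tol and rel_diff(tau_new, tau) < tol)
        mu, tau = mu_new, tau_new
        if converged:
            return mu, tau, iteration

    raise RuntimeError("conditional maximisation did not converge")


if __name__ == "__main__":
    estimates = conditional_maximisation([1.0, 2.0, 3.0, 4.0])
    print("mu = %.4f, tau = %.4f, iterations = %d" % estimates)
```

Each update uses the most recent value of the other parameter, so the joint posterior density cannot decrease from one sweep to the next; this is the property that makes the stopping rule on relative differences meaningful.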
The marginal densities to be maximised are proportional to the joint posterior distribution (equation 15). In fact, taking ω̅ as an example, it is possible to write (equation 20): where the denominator is a constant once σ⁻², λ⁻² and y are known.
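The typeset form of equation 20 is not reproduced in this text. A plausible reading, assuming it is simply the definition of a conditional density applied to the joint posterior, is the following identity; the notation p(· | ·) is an assumption about the authors' symbols.

```latex
p(\bar{\omega} \mid \sigma^{-2}, \lambda^{-2}, y)
  = \frac{p(\bar{\omega}, \sigma^{-2}, \lambda^{-2} \mid y)}
         {p(\sigma^{-2}, \lambda^{-2} \mid y)}
  \propto p(\bar{\omega}, \sigma^{-2}, \lambda^{-2} \mid y)
```

The denominator does not depend on ω̅, so maximising the left-hand side over ω̅ is equivalent to maximising the joint posterior over ω̅ with the other two parameters held fixed.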
Substituting the distributions in equations 16 and 18 into equation 15, we obtain (equation 21): Therefore, in the maximisation step it is sufficient to consider only the terms in equation 21 that depend on the parameter to be estimated.
Estimate of ω̅
If we consider only the terms in equation 21 that contain ω̅, then we have (equation 22): This distribution is multivariate normal. In fact, we can write (equation 23): where (equation 24): and (equation 25)
The MAP estimate of ω̅ is therefore (equation 26)
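Equations 22 to 26 are not reproduced in this text, but the argument they carry is the standard completing-the-square step. As a generic sketch, suppose the terms containing ω̅ can be collected into a quadratic form with a symmetric positive definite matrix A and a vector b; both symbols are placeholders introduced here, not the quantities defined in equations 24 and 25.

```latex
\exp\!\left( -\tfrac{1}{2}\, \bar{\omega}^{\top} A\, \bar{\omega} + b^{\top} \bar{\omega} \right)
  \propto
\exp\!\left( -\tfrac{1}{2}\, (\bar{\omega} - A^{-1} b)^{\top} A\, (\bar{\omega} - A^{-1} b) \right)
```

The right-hand side is the kernel of a multivariate normal with mean A⁻¹b and covariance A⁻¹. Because a normal density peaks at its mean, the maximiser under these assumptions is A⁻¹b.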
Estimate of σ⁻²
It is possible to repeat the steps followed for ω̅, thus finding (equation 27)
This is a Gamma distribution with parameters (equation 28): and (equation 29)
The MAP estimate of σ⁻² is (equation 30)
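The closed form behind a MAP estimate of this kind follows from the mode of a Gamma density. As a hedged sketch, assume a shape α and a rate β; whether equations 28 and 29 use the rate or the scale parameterisation is not recoverable from this text, so α and β are placeholders.

```latex
p(x) \propto x^{\alpha - 1} e^{-\beta x}, \qquad
\frac{d}{dx} \log p(x) = \frac{\alpha - 1}{x} - \beta = 0
\;\Longrightarrow\;
\hat{x}_{\mathrm{MAP}} = \frac{\alpha - 1}{\beta} \quad (\alpha > 1)
```

Under the scale parameterisation, with scale θ = 1/β, the same mode reads (α − 1)θ.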
Estimate of λ⁻²
Following the same steps as for the other two parameters, we have (equation 31)
This is another Gamma distribution with parameters (equation 32): and (equation 33)
In this case, the MAP estimate for λ⁻² is (equation 34)
Appendix B: Pseudocode for Algorithm
The pseudocode of the algorithm is given in figure A1.
Cite this article
Ferrazzi, F., Magni, P. & Bellazzi, R. Random Walk Models for Bayesian Clustering of Gene Expression Profiles. Appl-Bioinformatics 4, 263–276 (2005). https://doi.org/10.2165/00822942-200504040-00006
- Cluster Model
- Bayesian Cluster
- Random Walk Model
- Random Walk Process
- Virtual Grid