Random Walk Models for Bayesian Clustering of Gene Expression Profiles

Abstract

The analysis of temporal gene expression profiles is a topic of increasing interest in functional genomics. Model-based clustering methods are particularly attractive because they can capture the dynamic nature of these data and identify the optimal number of clusters. We have defined a new Bayesian method that addresses some important issues that remain unsolved in currently available approaches: the presence of time dislocations in gene expression, the non-stationarity of the processes generating the data, and the presence of data collected on an irregular temporal grid. Our method, which is based on random walk models, requires only mild a priori assumptions about the nature of the processes generating the data and explicitly models inter-gene variability within each cluster. It was first validated on simulated datasets and then applied to a dataset on serum-stimulated fibroblasts. In all cases the results were promising, showing that the method can be helpful in functional genomics research.

Acknowledgements

This work was supported in part by the Progetto di Ricerca di Interesse Nazionale (PRIN) 2003 grant ‘Dynamic modelling of gene expression profiles’ from the Italian Ministry of Education.

The authors have no conflicts of interest that are directly relevant to the content of this article.

Author information

Corresponding author

Correspondence to Mrs Fulvia Ferrazzi.

Appendices

Appendix A: Conditional Maximisation

The conditional maximisation method consists of the following steps.

  1. Provide an initial estimate for the model parameters θ (a possible choice is to take the mean value of the prior distributions);

  2. define an update order of parameter estimates;

  3. for each parameter, update the estimate by maximising the marginal posterior distribution given the data and the current estimate of the other parameters;

  4. repeat step 3 until convergence.

Convergence is reached when, for each parameter, the relative difference between the new and the old estimate becomes smaller than a fixed tolerance. Convergence of the conditional maximisation algorithm is very quick, almost regardless of the parameters’ order and of their initial estimates. In the following, we apply the conditional maximisation method to the estimation of the cluster parameters (see subsection titled Cluster Parameters Estimation).
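The four steps and the stopping rule above can be sketched in Python on a toy problem. This is a minimal illustration under stated assumptions, not the paper's cluster model: the data are assumed to be y_i ~ N(mu, 1/tau), with a normal prior on mu and a Gamma (shape, rate) prior on the precision tau. The function name `conditional_maximisation` and the hyperparameters `m0`, `p0`, `a`, `b` are hypothetical. As in the appendix, one conditional density is normal and the other is Gamma, so each update has a closed form.

```python
import numpy as np


def conditional_maximisation(y, m0=0.0, p0=1e-2, a=2.0, b=1.0,
                             tol=1e-8, max_iter=100):
    """Coordinate-wise MAP estimation for a toy model (an assumption,
    not the paper's model): y_i ~ N(mu, 1/tau), mu ~ N(m0, 1/p0),
    tau ~ Gamma(shape=a, rate=b)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    # Step 1: initialise each parameter at its prior mean.
    mu, tau = m0, a / b
    # Step 2: the update order is fixed as (mu, tau).
    for it in range(1, max_iter + 1):
        mu_old, tau_old = mu, tau
        # Step 3a: mu | tau, y is normal, so its mode equals its mean.
        mu = (p0 * m0 + tau * y.sum()) / (p0 + n * tau)
        # Step 3b: tau | mu, y is Gamma(shape, rate); mode = (shape - 1) / rate.
        shape = a + 0.5 * n
        rate = b + 0.5 * np.sum((y - mu) ** 2)
        tau = (shape - 1.0) / rate
        # Step 4: stop when every relative change falls below the tolerance.
        rel_change = max(
            abs(mu - mu_old) / max(abs(mu_old), 1e-12),
            abs(tau - tau_old) / max(abs(tau_old), 1e-12),
        )
        if rel_change < tol:
            break
    return mu, tau, it


rng = np.random.default_rng(0)
y = rng.normal(3.0, 0.5, size=200)
mu_hat, tau_hat, n_iter = conditional_maximisation(y)
print(mu_hat, tau_hat, n_iter)
```

On this toy problem the loop stops after only a few sweeps, in line with the quick convergence described above.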

The marginal densities to be maximised are proportional to the joint posterior distribution (equation 15). In fact, considering for example ω̅, it is possible to write (equation 20):

[Equation 20]

where the denominator is a constant, once σ⁻², λ⁻² and y are known.
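Equation 20 is not reproduced in this text. Judging only from the sentence that introduces it, a plausible reconstruction (an assumption, not the authors' typeset formula) is the standard identity relating a conditional density to the joint posterior:

```latex
p(\bar{\omega} \mid \sigma^{-2}, \lambda^{-2}, y)
  = \frac{p(\bar{\omega}, \sigma^{-2}, \lambda^{-2} \mid y)}
         {p(\sigma^{-2}, \lambda^{-2} \mid y)}
  \propto p(\bar{\omega}, \sigma^{-2}, \lambda^{-2} \mid y)
```

The denominator does not involve ω̅, so maximising the left-hand side over ω̅ is equivalent to maximising the joint posterior over ω̅ with the other two parameters held fixed.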

Substituting the distributions (equation 16) and (equation 18) into equation 15, we obtain (equation 21):

[Equation 21]

Therefore, in the maximisation step it is possible to consider only the terms in equation 21 that depend on the parameter to be estimated.

Estimate of ω̅

If we consider only the terms in equation 21 that contain ω̅, then we have (equation 22):

[Equation 22]

This distribution is a multivariate normal. In fact, we can write (equation 23):

[Equation 23]

where (equation 24):

[Equation 24]

and (equation 25):

[Equation 25]

Since the mode of a normal distribution coincides with its mean, the MAP estimate of ω̅ is therefore (equation 26):

[Equation 26]

Estimate of σ⁻²

It is possible to repeat the steps followed for ω̅, thus finding (equation 27):

[Equation 27]

This is a Gamma distribution with parameters (equation 28):

[Equation 28]

and (equation 29):

[Equation 29]

The MAP estimate of σ⁻² is (equation 30):

[Equation 30]
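Equations 28 to 30 are not reproduced in this text. As a reminder of the general fact they rely on, and assuming a shape-rate parameterisation with generic symbols α and β (the paper's own symbols and parameterisation cannot be recovered here), the mode of a Gamma density is:

```latex
p(x \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\beta x},
\qquad
\hat{x}_{\mathrm{MAP}} = \arg\max_{x > 0}\, p(x \mid \alpha, \beta) = \frac{\alpha - 1}{\beta}
\quad (\alpha \ge 1)
```

Under a shape-scale parameterisation with scale θ = 1/β, the same mode reads (α − 1)θ.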

Estimate of λ⁻²

Following the same steps as for the other two parameters, we have (equation 31):

[Equation 31]

This is another Gamma distribution with parameters (equation 32):

[Equation 32]

and (equation 33):

[Equation 33]

In this case, the MAP estimate for λ⁻² is (equation 34):

[Equation 34]

Appendix B: Pseudocode for Algorithm

The pseudocode of the algorithm is given in figure A1.

Fig. A1. Pseudocode for the agglomerative search strategy employed by the Random Walk Bayesian Clustering method (see subsection titled Agglomerative Bayesian Clustering).
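The pseudocode in figure A1 is not available in this text. The following Python sketch shows only the general shape of a greedy agglomerative search driven by a marginal likelihood; it is not the authors' algorithm. The scoring function `log_marginal` is a hypothetical stand-in (independent Gaussian noise around a shared cluster profile with a normal prior at each time point), not the random walk model, and the values of `s2` and `v0` are illustrative assumptions.

```python
from itertools import combinations

import numpy as np


def log_marginal(profiles, s2=0.1, v0=10.0):
    """Stand-in cluster score (an assumption, not the random walk model):
    at each time point, x_i = m + noise, noise ~ N(0, s2), m ~ N(0, v0).
    Returns the log marginal likelihood summed over time points."""
    n = profiles.shape[0]
    sum_x = profiles.sum(axis=0)
    sum_x2 = (profiles ** 2).sum(axis=0)
    # Quadratic form and log-determinant of s2*I + v0*11^T in closed form.
    quad = (sum_x2 - v0 * sum_x ** 2 / (s2 + n * v0)) / s2
    logdet = (n - 1) * np.log(s2) + np.log(s2 + n * v0)
    return float(np.sum(-0.5 * (n * np.log(2 * np.pi) + logdet + quad)))


def agglomerative_clustering(data):
    """Greedy search: start from singletons and repeatedly merge the pair
    of clusters whose merge most increases the total log marginal
    likelihood; stop when no merge increases it."""
    clusters = [[i] for i in range(len(data))]
    scores = [log_marginal(data[c]) for c in clusters]
    while len(clusters) > 1:
        best_gain, best_pair, best_score = 0.0, None, None
        for i, j in combinations(range(len(clusters)), 2):
            merged = log_marginal(data[clusters[i] + clusters[j]])
            gain = merged - scores[i] - scores[j]
            if gain > best_gain:
                best_gain, best_pair, best_score = gain, (i, j), merged
        if best_pair is None:
            break  # no merge improves the score
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        scores[i] = best_score
        del clusters[j], scores[j]
    return clusters


data = np.array([
    [1.0, 2.0, 3.0, 2.0],
    [1.1, 2.1, 2.9, 2.0],
    [0.9, 1.9, 3.1, 2.1],
    [-2.0, -1.0, 0.0, -3.0],
    [-2.1, -0.9, 0.1, -3.1],
])
print(agglomerative_clustering(data))
```

Because the search stops as soon as no merge increases the score, the number of clusters is chosen by the data instead of being fixed in advance.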

Ferrazzi, F., Magni, P. & Bellazzi, R. Random Walk Models for Bayesian Clustering of Gene Expression Profiles. Appl-Bioinformatics 4, 263–276 (2005). https://doi.org/10.2165/00822942-200504040-00006

Keywords

  • Cluster Model
  • Bayesian Cluster
  • Random Walk Model
  • Random Walk Process
  • Virtual Grid