High Dimensional Modelling

  • Søren Højsgaard
  • David Edwards
  • Steffen Lauritzen
Part of the Use R! book series (USE R)

Abstract

This chapter describes methods suitable for high-dimensional graphical modeling. Recent years have seen intense interest in applying graphical modeling techniques to data of high dimension, by which we mean from hundreds to tens of thousands of variables. Such data arise routinely in fields such as molecular biology. We first describe two typical datasets: one from a study of gene expression in breast cancer patients, and the other from the HapMap project, in which a large number of genomic markers and gene expression measurements are recorded for 90 individuals. We then compare the computational efficiency of several model selection algorithms, as applied to one of the example datasets. Of these, an extension of the Chow-Liu algorithm that finds the minimal BIC forest, implemented in the gRapHD package, is found to be the most efficient. The glasso algorithm and a stepwise decomposable search algorithm are also highly efficient. We describe these algorithms in more detail and illustrate their use on the example datasets. Finally, as a more advanced example, we illustrate how a Bayesian equivalent of the minimal BIC forest algorithm for high-dimensional discrete data may be obtained. Assuming a hyper-Dirichlet prior, the maximum a posteriori forest is derived by using the extended Chow-Liu algorithm with appropriate user-defined edge weights. This is illustrated using a subset of the HapMap data.
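To give a rough idea of how a minimal BIC forest can be found, the following Python sketch applies a Kruskal-style greedy search to purely discrete data. It is an illustration under stated assumptions, not the gRapHD implementation (which is an R package and also handles mixed data): the function names `bic_edge_weight` and `min_bic_forest` and the toy data are hypothetical, and each edge weight is assumed to be the likelihood-ratio deviance against independence minus log(N) times the degrees of freedom. Edges with non-positive weight are never added, which is why the result can be a forest rather than a spanning tree.

```python
import math
from collections import Counter
from itertools import combinations


def bic_edge_weight(x, y):
    """Penalized weight for a candidate edge between two discrete variables.

    Assumed form: deviance against independence (2 * N * empirical mutual
    information) minus log(N) * (levels_x - 1) * (levels_y - 1).
    """
    n = len(x)
    cx, cy, cxy = Counter(x), Counter(y), Counter(zip(x, y))
    deviance = 2.0 * sum(
        c * math.log(c * n / (cx[a] * cy[b])) for (a, b), c in cxy.items()
    )
    df = (len(cx) - 1) * (len(cy) - 1)
    return deviance - math.log(n) * df


def min_bic_forest(columns):
    """Greedy (Kruskal-style) search over edges with positive penalized weight.

    `columns` maps a variable name to a list of observed levels.
    Returns the selected edges as (name, name) tuples.
    """
    names = list(columns)
    candidates = []
    for u, v in combinations(names, 2):
        w = bic_edge_weight(columns[u], columns[v])
        if w > 0:  # an edge that does not improve BIC is never considered
            candidates.append((w, u, v))
    candidates.sort(reverse=True)  # heaviest edges first

    # Union-find structure to reject edges that would close a cycle.
    parent = {name: name for name in names}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    forest = []
    for _, u, v in candidates:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((u, v))
    return forest


# Hypothetical toy data: B copies A exactly, C is empirically independent of both.
toy = {
    "A": [0] * 20 + [1] * 20,
    "B": [0] * 20 + [1] * 20,
    "C": [0, 1] * 20,
}
forest = min_bic_forest(toy)
print(forest)
```

In this toy example only the A-B edge has a positive penalized weight, so C remains an isolated vertex. Supplying different edge weights to the same greedy search is the kind of substitution the abstract alludes to for the maximum a posteriori forest, although the specific hyper-Dirichlet weights are not reproduced here.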

Keywords

Covariance Lasso 

References

  1. Chickering DM (1996) Learning Bayesian networks is NP-complete. In: Fisher D, Lenz HJ (eds) Learning from data: artificial intelligence and statistics V. Springer, New York, pp 121–130
  2. Chow CK, Liu CN (1968) Approximating discrete probability distributions with dependence trees. IEEE Trans Inf Theory 14:462–467
  3. Dawid AP, Lauritzen SL (1993) Hyper Markov laws in the statistical analysis of decomposable graphical models. Ann Stat 21:1272–1317
  4. Edwards D, de Abreu GCG, Labouriau R (2010) Selecting high-dimensional mixed graphical models using minimal AIC or BIC forests. BMC Bioinform 11:18
  5. Friedman J, Hastie T, Tibshirani R (2008) Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(3):432–441
  6. Fruchterman T, Reingold EM (1991) Graph drawing by force-directed placement. Softw Pract Exp 21:1129–1164
  7. Kirshner S, Smyth P, Robertson AW (2004) Conditional Chow-Liu tree structures for modeling discrete-valued vector time series. In: Proceedings of the 20th conference on uncertainty in artificial intelligence, UAI ’04, AUAI Press, Arlington, pp 317–324. http://portal.acm.org/citation.cfm?id=1036843.1036882
  8. Kruskal J (1956) On the shortest spanning subtree of a graph and the traveling salesman problem. Proc Am Math Soc 7:48–50
  9. Miller LD, Smeds J, George J, Vega VB, Vergara L, Ploner A, Pawitan Y, Hall P, Klaar S, Liu ET, Bergh J (2005) An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc Natl Acad Sci USA 102(38):13550–13555. http://dx.doi.org/10.1073/pnas.0506230102

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Søren Højsgaard (1)
  • David Edwards (2)
  • Steffen Lauritzen (3)

  1. Department of Mathematical Sciences, Aalborg University, Aalborg, Denmark
  2. Centre for Quantitative Genetics and Genomics, Department of Molecular Biology and Genetics, Aarhus University, Aarhus, Denmark
  3. Department of Statistics, University of Oxford, Oxford, UK
