Abstract
In this paper, we study a parametric modeling approach to gene set enrichment analysis. Existing methods have largely relied on nonparametric approaches employing, e.g., categorization, permutation, or resampling-based significance analysis. These methods have proven useful but may lack statistical power. By formulating the enrichment analysis as a model comparison problem, we adopt a likelihood ratio-based testing approach to assess the significance of enrichment. Through simulation studies and an application to gene expression data, we illustrate the competitive performance of the proposed method.
Acknowledgements
This research was supported in part by a Biomedical Informatics and Computational Biology research grant from the University of Minnesota-Rochester, and by National Institutes of Health grants CA134848 and GM083345. We are grateful to the University of Minnesota Supercomputing Institute for assistance with the computations. We would like to thank the associate editor and two anonymous referees for their constructive comments, which have dramatically improved the presentation of the paper.
Appendix
1.1 EM Algorithm for Estimating the Finite Mixture Model
We begin with the finite mixture model in (1) given \((\hat{\theta}_{0},\hat{\mu}_{0},\hat{\sigma}^{2}_{0})\) and \(K\). Define indicators \(w_{ik}\in\{0,1\}\) following a multinomial distribution, \(\Pr(w_{ik}=1)=\theta_{k}\), \(\sum_{k=0}^{K} w_{ik}=1\), and conditionally we assume \(z_{i}\mid w_{ik}=1\sim f_{k}\). The complete data likelihood function for \((z_{i},w_{ik})\) can be written as
In the E-step, the conditional probabilities can be checked to be
In the M-step, the conditional expected log likelihood can be checked to be proportional to
which can be easily verified to be maximized by
Given only \((\hat{\mu}_{0},\hat{\sigma}^{2}_{0})\) with \(\theta_{0}\) also being a parameter, we have
We can easily check that
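The EM iteration described in this subsection can be sketched in code. The following is a minimal illustration, not the authors' implementation. It assumes that every component \(f_{k}\) is Gaussian with its own mean and variance, that component 0 is a null component with fixed \((\hat{\mu}_{0},\hat{\sigma}^{2}_{0})\), and that \(\theta_{0}\) is either held fixed (when supplied) or estimated (when omitted). The function name `em_fixed_null`, the quantile-based initialization, and the variance floor are hypothetical choices made for this example only.

```python
import numpy as np


def normal_pdf(z, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at z (broadcasts over arrays)."""
    return np.exp(-0.5 * (z - mu) ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)


def em_fixed_null(z, K, mu0, sigma2_0, theta0=None, n_iter=200):
    """EM for a (K+1)-component Gaussian mixture with a fixed null component.

    Assumption (not from the source): all components are Gaussian.
    If theta0 is given it is held fixed; if None it is estimated as well.
    Returns (theta, mu, sigma2, log-likelihood).
    """
    z = np.asarray(z, dtype=float)
    n = z.size
    fix_theta0 = theta0 is not None
    th0 = theta0 if fix_theta0 else 0.5

    # Initial values: equal weights for non-null components, means at
    # evenly spaced sample quantiles, variances at the sample variance.
    theta = np.concatenate(([th0], np.full(K, (1.0 - th0) / K)))
    mu = np.concatenate(([mu0], np.quantile(z, np.arange(1, K + 1) / (K + 1))))
    sigma2 = np.concatenate(([sigma2_0], np.full(K, z.var())))

    for _ in range(n_iter):
        # E-step: conditional probability that gene i belongs to component k.
        dens = theta[None, :] * normal_pdf(z[:, None], mu[None, :], sigma2[None, :])
        tau = dens / dens.sum(axis=1, keepdims=True)

        # M-step: weighted updates; the null mean and variance are never touched.
        nk = tau.sum(axis=0)
        if fix_theta0:
            # Non-null weights share the remaining mass 1 - theta0.
            theta[1:] = (1.0 - th0) * nk[1:] / nk[1:].sum()
        else:
            theta = nk / n
        mu[1:] = (tau[:, 1:] * z[:, None]).sum(axis=0) / nk[1:]
        resid2 = (z[:, None] - mu[None, 1:]) ** 2
        # Small floor guards against a collapsing component (a safeguard
        # added for this sketch only).
        sigma2[1:] = np.maximum((tau[:, 1:] * resid2).sum(axis=0) / nk[1:], 1e-8)

    dens = theta[None, :] * normal_pdf(z[:, None], mu[None, :], sigma2[None, :])
    loglik = float(np.log(dens.sum(axis=1)).sum())
    return theta, mu, sigma2, loglik
```

Calling `em_fixed_null(z, K, mu0, sigma2_0, theta0=...)` corresponds to the case where \(\hat{\theta}_{0}\) is given, and omitting `theta0` corresponds to the case where \(\theta_{0}\) is also a parameter. The returned log-likelihood is the kind of quantity a likelihood ratio comparison would use.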
1.2 EM Algorithm for Estimating the Gene Set Model
The complete data likelihood function for a gene set A given \((\hat{\theta}_{0},\hat{\mu}_{0},\hat{\sigma}_{0}^{2},\hat{\mu}_{k})\) is
The conditional expected log likelihood can easily be checked to be
We can easily verify that
1.3 EM Algorithm for Estimating the Model Under no Enrichment
The complete data likelihood can be written as
where \(\nu_{0}+\sum^{K}_{k=1}\nu_{lk}=1\), \(l=1,2\). The conditional expected log likelihood can be easily checked to be
where
To maximize the conditional log likelihood, we use the Lagrange multiplier method
Setting the gradient vector ∇Q=0 yields the following equations:
From the first three equations we can obtain
When plugging these into the last two equations, we obtain
and
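As a worked sketch of the Lagrange multiplier step in this subsection, suppose that only the mixing proportions are updated, with the component densities held fixed. Write \(\tau_{lik}\) for the E-step conditional probability that gene \(i\) in group \(l\) belongs to component \(k\), \(n_{l}\) for the number of genes in group \(l\), \(T_{0}=\sum_{l}\sum_{i}\tau_{li0}\), \(T_{lk}=\sum_{i}\tau_{lik}\), and \(S_{l}=\sum_{k=1}^{K}T_{lk}\). These symbols and the restriction to the proportions are assumptions made for illustration and may differ from the paper's own notation and equations.

```latex
\begin{align*}
\mathcal{L} &= \sum_{l=1}^{2}\sum_{i=1}^{n_{l}}
  \Bigl[\tau_{li0}\log\nu_{0}+\sum_{k=1}^{K}\tau_{lik}\log\nu_{lk}\Bigr]
  -\sum_{l=1}^{2}\lambda_{l}\Bigl(\nu_{0}+\sum_{k=1}^{K}\nu_{lk}-1\Bigr),\\
\frac{\partial\mathcal{L}}{\partial\nu_{0}}=0
  &\;\Rightarrow\; \nu_{0}=\frac{T_{0}}{\lambda_{1}+\lambda_{2}},
\qquad
\frac{\partial\mathcal{L}}{\partial\nu_{lk}}=0
  \;\Rightarrow\; \nu_{lk}=\frac{T_{lk}}{\lambda_{l}},\quad l=1,2,\\
\nu_{0}+\sum_{k=1}^{K}\nu_{lk}=1
  &\;\Rightarrow\; \lambda_{l}=\frac{S_{l}}{1-\nu_{0}},\quad l=1,2,\\
\hat{\nu}_{0} &= \frac{T_{0}}{T_{0}+S_{1}+S_{2}}=\frac{T_{0}}{n_{1}+n_{2}},
\qquad
\hat{\nu}_{lk}=(1-\hat{\nu}_{0})\,\frac{T_{lk}}{S_{l}}.
\end{align*}
```

The last line uses \(T_{0}+S_{1}+S_{2}=n_{1}+n_{2}\), which holds because the conditional probabilities for each gene sum to one. Under these assumptions the shared null proportion is the pooled average null probability, and each group's non-null proportions split the remaining mass in proportion to \(T_{lk}\).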
Cite this article
Lee, S.M., Wu, B. & Kersey, J.H. Likelihood-Based Approach to Gene Set Enrichment Analysis with a Finite Mixture Model. Stat Biosci 6, 38–54 (2014). https://doi.org/10.1007/s12561-012-9076-3
DOI: https://doi.org/10.1007/s12561-012-9076-3