Likelihood-Based Approach to Gene Set Enrichment Analysis with a Finite Mixture Model

Statistics in Biosciences

Abstract

In this paper, we study a parametric modeling approach to gene set enrichment analysis. Existing methods have largely relied on nonparametric approaches employing, e.g., categorization, permutation or resampling-based significance analysis. These methods have proven useful but may lack statistical power. By formulating enrichment analysis as a model comparison problem, we adopt a likelihood ratio-based testing approach to assess the significance of enrichment. Through simulation studies and an application to gene expression data, we illustrate the competitive performance of the proposed method.

Acknowledgements

This research was supported in part by a Biomedical Informatics and Computational Biology research grant from the University of Minnesota-Rochester, and by National Institutes of Health grants CA134848 and GM083345. We are grateful to the University of Minnesota Supercomputing Institute for assistance with the computations. We would like to thank the associate editor and two anonymous referees for their constructive comments, which have substantially improved the presentation of the paper.

Author information

Corresponding author

Correspondence to Baolin Wu.

Appendix

1.1 EM Algorithm for Estimating the Finite Mixture Model

We begin with the finite mixture model in (1) given \((\hat{\theta}_{0},\hat{\mu}_{0},\hat{\sigma}^{2}_{0})\) and \(K\). Define indicators \(w_{ik}\in\{0,1\}\) following a multinomial distribution, \(\Pr(w_{ik}=1)=\theta_{k}\), \(\sum_{k=0}^{K} w_{ik}=1\), and conditionally we assume \(z_{i}\mid w_{ik}=1\sim f_{k}\). The complete data likelihood function for \((z_{i},w_{ik})\) can be written as

$$\prod_{i=1}^m \bigl\{\hat{ \theta}_0f_0\bigl(z_i;\hat{ \mu}_0,\hat{\sigma}_0^2\bigr)\bigr \}^{w_{i0}} \prod_{k=1}^K\bigl\{ \theta_kf_k\bigl(z_i;\mu_k, \hat{\sigma}_0^2\bigr)\bigr\}^{w_{ik}}. $$

In the E-step, the conditional probabilities can be checked to be

$$T_{0,i} = \frac{\hat{\theta}_0f_0(z_i;\hat{\mu}_0, \hat{\sigma}_0^2)}{\hat{\theta}_0f_0(z_i;\hat{\mu}_0, \hat{\sigma}_0^2)+\sum_{k=1}^K\theta_kf_k(z_i;\mu_k,\hat{\sigma}_0^2)}, $$
$$T_{k,i} = \frac{\theta_kf_k(z_i;\mu_k,\hat{\sigma}_0^2)}{\hat{\theta}_0 f_0(z_i;\hat{\mu}_0, \hat{\sigma}_0^2) +\sum_{j=1}^K \theta_jf_j(z_i;\mu_j,\hat{\sigma}_0^2)}, \quad k\ge 1. $$
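The E-step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the components \(f_k\) are taken to be normal densities with a common variance, and the names `normal_pdf` and `e_step` are hypothetical, not taken from the paper.

```python
import math

def normal_pdf(z, mu, var):
    """Density of N(mu, var) at z. The Gaussian form of f_k is an assumption."""
    return math.exp(-(z - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def e_step(z, theta, mu, var0):
    """Return T[k][i], the conditional probability that z[i] belongs to component k.

    theta[0] and mu[0] hold the fixed null estimates; theta[k] and mu[k] for
    k >= 1 hold the current values of the non-null parameters.
    """
    T = [[0.0] * len(z) for _ in theta]
    for i, zi in enumerate(z):
        # Numerators theta_k * f_k(z_i); their sum is the shared denominator.
        weighted = [th * normal_pdf(zi, m, var0) for th, m in zip(theta, mu)]
        total = sum(weighted)
        for k, w in enumerate(weighted):
            T[k][i] = w / total
    return T
```

For statistics far from every component mean the densities can underflow to zero, so a production version would likely work on the log scale.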

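In the M-step, the conditional expected log likelihood is, up to terms that do not involve \((\theta_k,\mu_k)\), \(k\ge 1\), equal to

$$\sum_{i=1}^m \sum_{k=1}^K T_{k,i}\bigl\{\log\theta_k+\log f_k\bigl(z_i;\mu_k,\hat{\sigma}_0^2\bigr)\bigr\}, $$

which, subject to the constraint \(\sum_{k=1}^K\theta_k=1-\hat{\theta}_0\), can be verified to be maximized by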

$$\hat{\theta}_k = (1-\hat{\theta}_0)\frac{\sum_{i=1}^m T_{k,i}}{\sum_{j=1}^K \sum_{i=1}^mT_{j,i}}, \quad\quad \hat{\mu}_k = \frac{\sum_{i=1}^m T_{k,i}z_i}{\sum_{i=1}^m T_{k,i}}, \quad k\ge 1. $$
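A minimal sketch of this M-step, assuming the conditional probabilities are stored as a nested list `T[k][i]` with `k = 0` for the null component. The function name `m_step_fixed_null` and the data layout are illustrative choices, not the authors' implementation.

```python
def m_step_fixed_null(z, T, theta0_hat):
    """Update (theta_k, mu_k) for k = 1..K while the null proportion stays fixed.

    z          : list of gene-level statistics z_i
    T          : T[k][i] for k = 0..K (row 0 is the null component)
    theta0_hat : fixed estimate of the null proportion
    """
    K = len(T) - 1
    # Sum of conditional probabilities for each non-null component.
    sums = [sum(T[k]) for k in range(1, K + 1)]
    total = sum(sums)
    # Non-null proportions share the remaining mass 1 - theta0_hat.
    theta = [(1.0 - theta0_hat) * s / total for s in sums]
    # Each mean is a T-weighted average of the statistics.
    mu = [sum(t * zi for t, zi in zip(T[k], z)) / sums[k - 1]
          for k in range(1, K + 1)]
    return theta, mu
```

By construction the returned proportions sum to `1 - theta0_hat`, which is the constraint imposed by fixing the null proportion.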

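Given only \((\hat{\mu}_{0},\hat{\sigma}^{2}_{0})\), with \(\theta_0\) also treated as a parameter, the conditional expected log likelihood is, up to a term free of the parameters,

$$\sum_{i=1}^m \Biggl[T_{0,i}\log\theta_0+\sum_{k=1}^K T_{k,i}\bigl\{\log\theta_k+\log f_k\bigl(z_i;\mu_k,\hat{\sigma}_0^2\bigr)\bigr\}\Biggr], $$

where the \(T_{k,i}\) are computed as before with \(\hat{\theta}_0\) replaced by the current value of \(\theta_0\). Maximizing under the constraint \(\sum_{k=0}^K\theta_k=1\), we can easily check that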

$$\hat{\theta}_k = \frac{1}{m}\sum _{i=1}^m T_{k,i}, \quad k\ge 0, \quad\quad \hat{ \mu}_k = \frac{\sum_{i=1}^m T_{k,i}z_i}{\sum_{i=1}^m T_{k,i}}, \quad k>0. $$
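The full iteration with a free null proportion can be sketched as follows. The Gaussian components, the equal starting proportions, the fixed iteration count, and the names `normal_pdf` and `em_free_theta0` are all assumptions made for illustration.

```python
import math

def normal_pdf(z, mu, var):
    """Density of N(mu, var) at z. The Gaussian form of f_k is an assumption."""
    return math.exp(-(z - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_free_theta0(z, mu0, var0, mu_init, n_iter=200):
    """EM with (mu0, var0) fixed and all proportions theta_0..theta_K estimated."""
    K, m = len(mu_init), len(z)
    mu = [mu0] + list(mu_init)          # mu[0] is never updated
    theta = [1.0 / (K + 1)] * (K + 1)   # assumed starting values
    for _ in range(n_iter):
        # E-step: conditional component probabilities T[k][i].
        T = [[0.0] * m for _ in range(K + 1)]
        for i, zi in enumerate(z):
            w = [theta[k] * normal_pdf(zi, mu[k], var0) for k in range(K + 1)]
            s = sum(w)
            for k in range(K + 1):
                T[k][i] = w[k] / s
        # M-step: proportions are averages of T, means are weighted averages.
        theta = [sum(T[k]) / m for k in range(K + 1)]
        for k in range(1, K + 1):
            mu[k] = sum(t * zi for t, zi in zip(T[k], z)) / sum(T[k])
    return theta, mu
```

A convergence check on the observed-data log likelihood would normally replace the fixed `n_iter`; it is omitted here to keep the sketch short.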

1.2 EM Algorithm for Estimating the Gene Set Model

The complete data likelihood function for a gene set A given \((\hat{\theta}_{0},\hat{\mu}_{0},\hat{\sigma}_{0}^{2},\hat{\mu}_{k})\) is

$$\prod_{i\in A} \prod_{k=0}^K \bigl\{\nu_kf_k\bigl(z_i;\hat{ \mu}_k,\hat{\sigma}^2_0\bigr)\bigr \}^{w_{ik}}. $$

The conditional expected log likelihood can easily be checked to be

$$\sum_{i\in A} \sum_{k=0}^KT_{k,i} \log\nu_k, \quad T_{k,i}=\frac{\nu_{k}f_k(z_i;\hat{\mu}_k, \hat{\sigma}_0^2)}{ \sum_{j=0}^K \nu_jf_j(z_i;\hat{\mu}_j, \hat{\sigma}_0^2)}, \quad k\ge 0. $$

We can easily verify that

$$\hat{\nu}_{k} = \frac{\sum_{i\in A} T_{k,i}}{m_A}, \quad k\ge 0. $$
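Because the component densities are held fixed here, the iteration only re-weights the mixing proportions within the gene set. The following is a sketch under the assumption of Gaussian components; `normal_pdf` and `fit_gene_set_proportions` are hypothetical names.

```python
import math

def normal_pdf(z, mu, var):
    """Density of N(mu, var) at z. The Gaussian form of f_k is an assumption."""
    return math.exp(-(z - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_gene_set_proportions(z_A, mu_hat, var0, n_iter=200):
    """Estimate nu_0..nu_K for the genes in set A with component means fixed.

    z_A    : statistics z_i for the genes in A
    mu_hat : fixed means, mu_hat[0] for the null component
    """
    n_comp, m_A = len(mu_hat), len(z_A)
    nu = [1.0 / n_comp] * n_comp  # assumed starting values
    # The densities never change, so evaluate them once.
    dens = [[normal_pdf(zi, mu, var0) for mu in mu_hat] for zi in z_A]
    for _ in range(n_iter):
        sums = [0.0] * n_comp
        for row in dens:
            w = [nu[k] * row[k] for k in range(n_comp)]
            s = sum(w)
            for k in range(n_comp):
                sums[k] += w[k] / s   # accumulate T_{k,i} over i in A
        nu = [s / m_A for s in sums]  # nu_k = sum_i T_{k,i} / m_A
    return nu
```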

1.3 EM Algorithm for Estimating the Model Under no Enrichment

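The complete data likelihood can be written as

$$\prod_{i\in A}\Biggl[\bigl\{\nu_0f_0\bigl(z_i;\hat{\mu}_0,\hat{\sigma}_0^2\bigr)\bigr\}^{w_{i0}} \prod_{k=1}^K\bigl\{\nu_{1k}f_k\bigl(z_i;\hat{\mu}_k,\hat{\sigma}_0^2\bigr)\bigr\}^{w_{ik}}\Biggr] \prod_{j\in A^c}\Biggl[\bigl\{\nu_0f_0\bigl(z_j;\hat{\mu}_0,\hat{\sigma}_0^2\bigr)\bigr\}^{w_{j0}} \prod_{k=1}^K\bigl\{\nu_{2k}f_k\bigl(z_j;\hat{\mu}_k,\hat{\sigma}_0^2\bigr)\bigr\}^{w_{jk}}\Biggr], $$

where \(\nu_{0}+\sum^{K}_{k=1}\nu_{lk}=1\), \(l=1,2\). The conditional expected log likelihood can be easily checked to be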

$$\sum_{i\in A}\Biggl\{ T_{0,i}\log \nu_0+\sum_{k=1}^KT_{k,i} \log\nu_{1k}\Biggr\} + \sum_{j\in A^c}\Biggl\{ T_{0,j}\log\nu_0+\sum_{k=1}^KT_{k,j} \log\nu_{2k}\Biggr\}, $$

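where, for \(i\in A\) and \(k\ge 1\),

$$T_{0,i} = \frac{\nu_0f_0(z_i;\hat{\mu}_0,\hat{\sigma}_0^2)}{\nu_0f_0(z_i;\hat{\mu}_0,\hat{\sigma}_0^2)+\sum_{l=1}^K\nu_{1l}f_l(z_i;\hat{\mu}_l,\hat{\sigma}_0^2)}, \quad\quad T_{k,i} = \frac{\nu_{1k}f_k(z_i;\hat{\mu}_k,\hat{\sigma}_0^2)}{\nu_0f_0(z_i;\hat{\mu}_0,\hat{\sigma}_0^2)+\sum_{l=1}^K\nu_{1l}f_l(z_i;\hat{\mu}_l,\hat{\sigma}_0^2)}, $$

and \(T_{0,j}\), \(T_{k,j}\) for \(j\in A^c\) are defined in the same way with \(\nu_{2l}\) in place of \(\nu_{1l}\).

To maximize the conditional expected log likelihood under the two constraints, we use the Lagrange multiplier method with multipliers \(\lambda_1,\lambda_2\):

$$Q = \sum_{i\in A}\Biggl\{ T_{0,i}\log \nu_0+\sum_{k=1}^KT_{k,i} \log\nu_{1k}\Biggr\} + \sum_{j\in A^c}\Biggl\{ T_{0,j}\log\nu_0+\sum_{k=1}^KT_{k,j} \log\nu_{2k}\Biggr\} - \lambda_1\Biggl(\nu_0+\sum_{k=1}^K\nu_{1k}-1\Biggr) - \lambda_2\Biggl(\nu_0+\sum_{k=1}^K\nu_{2k}-1\Biggr). $$

Setting the gradient vector \(\nabla Q=0\) yields the following equations:

$$\frac{\sum_{i\in A}T_{0,i}+\sum_{j\in A^c}T_{0,j}}{\nu_0} = \lambda_1+\lambda_2, \quad\quad \frac{\sum_{i\in A}T_{k,i}}{\nu_{1k}} = \lambda_1, \quad\quad \frac{\sum_{j\in A^c}T_{k,j}}{\nu_{2k}} = \lambda_2, $$
$$\nu_0+\sum_{k=1}^K\nu_{1k} = 1, \quad\quad \nu_0+\sum_{k=1}^K\nu_{2k} = 1. $$

From the first three equations we can obtain

$$\nu_0 = \frac{\sum_{i\in A}T_{0,i}+\sum_{j\in A^c}T_{0,j}}{\lambda_1+\lambda_2}, \quad\quad \nu_{1k} = \frac{\sum_{i\in A}T_{k,i}}{\lambda_1}, \quad\quad \nu_{2k} = \frac{\sum_{j\in A^c}T_{k,j}}{\lambda_2}. $$

Plugging these into the last two equations gives \(\lambda_1=\sum_{k=1}^K\sum_{i\in A}T_{k,i}/(1-\nu_0)\) and \(\lambda_2=\sum_{k=1}^K\sum_{j\in A^c}T_{k,j}/(1-\nu_0)\). Because the conditional probabilities sum to \(m\) over all genes and components, we obtain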

$$\hat{\nu}_{0} = \frac{\sum_{i\in A}T_{0,i}+\sum_{j\in A^c}T_{0,j}}{m}, $$

and

$$\hat{\nu}_{1k} = (1-\hat{\nu}_{0})\frac{ \sum_{i\in A} T_{k,i}}{\sum_{l=1}^K \sum_{i\in A} T_{l,i}}, \quad \quad \hat{\nu}_{2k} = (1-\hat{\nu}_{0})\frac{\sum_{j\in A^c} T_{k,j}}{\sum_{l=1}^K \sum_{j\in A^c} T_{l,j}}. $$
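These closed-form updates can be sketched as a single function. The nested-list layout of the conditional probabilities and the name `m_step_no_enrichment` are assumptions for illustration only.

```python
def m_step_no_enrichment(T_A, T_Ac):
    """Update (nu_0, nu_1k, nu_2k) under the model with a shared null proportion.

    T_A[k][i]  : conditional probabilities for genes i in A,  k = 0..K
    T_Ac[k][j] : conditional probabilities for genes j outside A, k = 0..K
    """
    K = len(T_A) - 1
    m = len(T_A[0]) + len(T_Ac[0])  # total number of genes
    # The shared null proportion pools the null probabilities of all genes.
    nu0 = (sum(T_A[0]) + sum(T_Ac[0])) / m
    sums_A = [sum(T_A[k]) for k in range(1, K + 1)]
    sums_Ac = [sum(T_Ac[k]) for k in range(1, K + 1)]
    # Within each group, the non-null mass 1 - nu0 is split in proportion to T.
    nu1 = [(1.0 - nu0) * s / sum(sums_A) for s in sums_A]
    nu2 = [(1.0 - nu0) * s / sum(sums_Ac) for s in sums_Ac]
    return nu0, nu1, nu2
```

Both constraints hold by construction: `nu0 + sum(nu1)` and `nu0 + sum(nu2)` each equal one, so the null proportion is the same inside and outside the gene set while the non-null proportions are free to differ.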

About this article

Cite this article

Lee, S.M., Wu, B. & Kersey, J.H. Likelihood-Based Approach to Gene Set Enrichment Analysis with a Finite Mixture Model. Stat Biosci 6, 38–54 (2014). https://doi.org/10.1007/s12561-012-9076-3

  • DOI: https://doi.org/10.1007/s12561-012-9076-3
