Likelihood-Based Approach to Gene Set Enrichment Analysis with a Finite Mixture Model

Lee, Sang Mee; Wu, Baolin; Kersey, John H.

doi:10.1007/s12561-012-9076-3

Published: 21 November 2012

Statistics in Biosciences, Volume 6, pages 38–54 (2014)

Sang Mee Lee¹,²,
Baolin Wu¹ &
John H. Kersey³

Abstract

In this paper, we study a parametric modeling approach to gene set enrichment analysis. Existing methods have largely relied on nonparametric approaches employing, e.g., categorization, permutation or resampling-based significance analysis methods. These methods have proven useful yet might not be powerful. By formulating the enrichment analysis into a model comparison problem, we adopt the likelihood ratio-based testing approach to assess significance of enrichment. Through simulation studies and application to gene expression data, we will illustrate the competitive performance of the proposed method.

Acknowledgements

This research was supported in part by a Biomedical Informatics and Computational Biology research grant from the University of Minnesota-Rochester, and by National Institutes of Health grants CA134848 and GM083345. We are grateful to the University of Minnesota Supercomputing Institute for assistance with the computations. We would like to thank the associate editor and two anonymous referees for their constructive comments, which have dramatically improved the presentation of the paper.

Author information

Sang Mee Lee
Present address: Department of Health Studies, University of Chicago, Chicago, IL, 60637, USA

Authors and Affiliations

Division of Biostatistics, School of Public Health, University of Minnesota, A460 Mayo Building MMC 303, 420 Delaware St SE, Minneapolis, MN, 55455, USA
Sang Mee Lee & Baolin Wu
Masonic Cancer Center, University of Minnesota, Minneapolis, MN, 55455, USA
John H. Kersey

Corresponding author

Correspondence to Baolin Wu.

Appendix

1.1 EM Algorithm for Estimating the Finite Mixture Model

We begin with the finite mixture model in (1) given $(\hat{\theta}_{0},\hat{\mu}_{0},\hat{\sigma}^{2}_{0})$ and $K$. Define indicators $w_{ik}\in\{0,1\}$ following a multinomial distribution, $\Pr(w_{ik}=1)=\theta_k$, $\sum_{k=0}^{K} w_{ik}=1$, and conditionally we assume $z_i\mid w_{ik}=1\sim f_k$. The complete data likelihood function for $(z_i,w_{ik})$ can be written as

$$\prod_{i=1}^m \bigl\{\hat{ \theta}_0f_0\bigl(z_i;\hat{ \mu}_0,\hat{\sigma}_0^2\bigr)\bigr \}^{w_{i0}} \prod_{k=1}^K\bigl\{ \theta_kf_k\bigl(z_i;\mu_k, \hat{\sigma}_0^2\bigr)\bigr\}^{w_{ik}}. $$

In the E-step, the conditional probabilities can be checked to be

$$T_{0,i} = \frac{\hat{\theta}_0f_0(z_i;\hat{\mu}_0, \hat{\sigma}_0^2)}{\hat{\theta}_0f_0(z_i;\hat{\mu}_0, \hat{\sigma}_0^2)+\sum_{k=1}^K\theta_kf_k(z_i;\mu_k,\hat{\sigma}_0^2)}, $$

$$T_{k,i} = \frac{\theta_kf_k(z_i;\mu_k,\hat{\sigma}_0^2)}{\hat{\theta}_0 f_0(z_i;\hat{\mu}_0, \hat{\sigma}_0^2) +\sum_{j=1}^K \theta_jf_j(z_i;\mu_j,\hat{\sigma}_0^2)}. $$

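In the M-step, the conditional expected log likelihood can be checked to equal, up to terms that do not involve $(\theta_k,\mu_k)$,

$$\sum_{i=1}^m \sum_{k=1}^K T_{k,i}\bigl\{\log\theta_k+\log f_k\bigl(z_i;\mu_k,\hat{\sigma}_0^2\bigr)\bigr\}, \quad \sum_{k=1}^K\theta_k = 1-\hat{\theta}_0, $$

which can be easily verified to be maximized by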

$$\hat{\theta}_k = (1-\hat{\theta}_0)\frac{\sum_{i=1}^m T_{k,i}}{\sum_{j=1}^K \sum_{i=1}^mT_{j,i}}, \quad\quad \hat{\mu}_k = \frac{\sum_{i=1}^m T_{k,i}z_i}{\sum_{i=1}^m T_{k,i}}, \quad k\ge 1. $$

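Given only $(\hat{\mu}_{0},\hat{\sigma}^{2}_{0})$, with $\theta_0$ also being a parameter, the E-step has the same form with $\hat{\theta}_0$ replaced by the current value of $\theta_0$, and the constraint becomes $\sum_{k=0}^K\theta_k=1$.

We can easily check that the M-step updates are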

$$\hat{\theta}_k = \frac{1}{m}\sum _{i=1}^m T_{k,i}, \quad k\ge 0, \quad\quad \hat{ \mu}_k = \frac{\sum_{i=1}^m T_{k,i}z_i}{\sum_{i=1}^m T_{k,i}}, \quad k>0. $$
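The two sets of updates above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes that every $f_k$ is a normal density with common variance $\hat{\sigma}_0^2$ (which is what makes the weighted mean the maximizer for $\mu_k$), and the names `normal_pdf` and `em_mixture`, the starting values, and the fixed iteration count are all hypothetical choices. No log-space safeguards against underflow are included.

```python
import math

def normal_pdf(z, mu, sigma):
    # Assumption of this sketch: each f_k is a normal density.
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def em_mixture(z, mu0, sigma0, mu_init, theta0=None, n_iter=200):
    """EM for a mixture whose null component (mu0, sigma0) is known.

    theta0 given  -> theta_0 is held fixed (first set of updates).
    theta0 = None -> theta_0 is estimated as well (second set of updates).
    Returns (theta_0, [theta_1..theta_K], [mu_1..mu_K]).
    """
    m, K = len(z), len(mu_init)
    mu = list(mu_init)
    fix_null = theta0 is not None
    th0 = theta0 if fix_null else 0.5           # arbitrary start when theta_0 is free
    theta = [(1.0 - th0) / K] * K
    for _ in range(n_iter):
        # E-step: accumulate sum_i T_{k,i} and sum_i T_{k,i} z_i for k = 0..K
        sum_T = [0.0] * (K + 1)
        sum_Tz = [0.0] * (K + 1)
        for zi in z:
            w = [th0 * normal_pdf(zi, mu0, sigma0)]
            w += [theta[k] * normal_pdf(zi, mu[k], sigma0) for k in range(K)]
            tot = sum(w)
            for k in range(K + 1):
                t = w[k] / tot
                sum_T[k] += t
                sum_Tz[k] += t * zi
        # M-step: weighted means for mu_k, k >= 1
        mu = [sum_Tz[k] / sum_T[k] for k in range(1, K + 1)]
        if fix_null:
            alt = sum(sum_T[1:])
            theta = [(1.0 - th0) * sum_T[k] / alt for k in range(1, K + 1)]
        else:
            th0 = sum_T[0] / m
            theta = [sum_T[k] / m for k in range(1, K + 1)]
    return th0, theta, mu
```

Calling `em_mixture(z, 0.0, 1.0, mu_init=[2.0])` estimates $\theta_0$ together with the non-null parameters, while passing `theta0=0.8` keeps the null proportion fixed and rescales the remaining proportions to sum to $1-\hat{\theta}_0$.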

1.2 EM Algorithm for Estimating the Gene Set Model

The complete data likelihood function for a gene set A given $(\hat{\theta}_{0},\hat{\mu}_{0},\hat{\sigma}_{0}^{2},\hat{\mu}_{k})$ is

$$\prod_{i\in A} \prod_{k=0}^K \bigl\{\nu_kf_k\bigl(z_i;\hat{ \mu}_k,\hat{\sigma}^2_0\bigr)\bigr \}^{w_{ik}}. $$

The conditional expected log likelihood can easily be checked to be

$$\sum_{i\in A} \sum_{k=0}^KT_{k,i} \log\nu_k, \quad T_{k,i}=\frac{\nu_{k}f_k(z_i;\hat{\mu}_k, \hat{\sigma}_0^2)}{ \sum_{j=0}^K \nu_jf_j(z_i;\hat{\mu}_j, \hat{\sigma}_0^2)}, \quad k\ge 0. $$

We can easily verify that

$$\hat{\nu}_{k} = \frac{\sum_{i\in A} T_{k,i}}{m_A}, \quad k\ge 0. $$
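Because the component means and variance are held fixed here, only the mixing proportions $\nu_k$ are updated. A minimal sketch of this iteration follows, again assuming normal component densities; the names `normal_pdf` and `em_gene_set`, the uniform starting values, and the iteration count are hypothetical.

```python
import math

def normal_pdf(z, mu, sigma):
    # Assumption of this sketch: each f_k is a normal density.
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def em_gene_set(z_A, mus, sigma0, n_iter=200):
    """Estimate nu_0..nu_K for the genes in set A.

    mus = [mu_0, mu_1, ..., mu_K] are the fixed component means,
    sigma0 is the fixed common standard deviation.
    """
    n_comp = len(mus)
    m_A = len(z_A)
    # The densities never change, so they are computed once.
    dens = [[normal_pdf(zi, mu, sigma0) for mu in mus] for zi in z_A]
    nu = [1.0 / n_comp] * n_comp                 # arbitrary uniform start
    for _ in range(n_iter):
        sum_T = [0.0] * n_comp
        for d in dens:
            w = [nu[k] * d[k] for k in range(n_comp)]
            tot = sum(w)
            for k in range(n_comp):
                sum_T[k] += w[k] / tot           # E-step: T_{k,i}
        nu = [s / m_A for s in sum_T]            # M-step: average of T_{k,i} over A
    return nu
```

For example, `em_gene_set([2.8, 3.1, 2.9, 3.3, 0.1], mus=[0.0, 3.0], sigma0=1.0)` assigns most of the mass to the non-null component, since four of the five statistics lie near 3.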

1.3 EM Algorithm for Estimating the Model Under no Enrichment

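The complete data likelihood can be written as

$$\prod_{i\in A}\Biggl[\bigl\{\nu_0f_0\bigl(z_i;\hat{\mu}_0,\hat{\sigma}_0^2\bigr)\bigr\}^{w_{i0}}\prod_{k=1}^K\bigl\{\nu_{1k}f_k\bigl(z_i;\hat{\mu}_k,\hat{\sigma}_0^2\bigr)\bigr\}^{w_{ik}}\Biggr] \prod_{j\in A^c}\Biggl[\bigl\{\nu_0f_0\bigl(z_j;\hat{\mu}_0,\hat{\sigma}_0^2\bigr)\bigr\}^{w_{j0}}\prod_{k=1}^K\bigl\{\nu_{2k}f_k\bigl(z_j;\hat{\mu}_k,\hat{\sigma}_0^2\bigr)\bigr\}^{w_{jk}}\Biggr], $$

where $\nu_{0}+\sum^{K}_{k=1}\nu_{lk}=1$, $l=1,2$. The conditional expected log likelihood can be easily checked to be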

$$\sum_{i\in A}\Biggl\{ T_{0,i}\log \nu_0+\sum_{k=1}^KT_{k,i} \log\nu_{1k}\Biggr\} + \sum_{j\in A^c}\Biggl\{ T_{0,j}\log\nu_0+\sum_{k=1}^KT_{k,j} \log\nu_{2k}\Biggr\}, $$

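where, for $i\in A$, $j\in A^c$ and $k\ge 1$,

$$T_{k,i}=\frac{\nu_{1k}f_k(z_i;\hat{\mu}_k,\hat{\sigma}_0^2)}{\nu_0f_0(z_i;\hat{\mu}_0,\hat{\sigma}_0^2)+\sum_{l=1}^K\nu_{1l}f_l(z_i;\hat{\mu}_l,\hat{\sigma}_0^2)}, \quad\quad T_{k,j}=\frac{\nu_{2k}f_k(z_j;\hat{\mu}_k,\hat{\sigma}_0^2)}{\nu_0f_0(z_j;\hat{\mu}_0,\hat{\sigma}_0^2)+\sum_{l=1}^K\nu_{2l}f_l(z_j;\hat{\mu}_l,\hat{\sigma}_0^2)}, $$

and $T_{0,i}$, $T_{0,j}$ have the same denominators with numerators $\nu_0f_0(z_i;\hat{\mu}_0,\hat{\sigma}_0^2)$ and $\nu_0f_0(z_j;\hat{\mu}_0,\hat{\sigma}_0^2)$, respectively.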

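To maximize the conditional log likelihood, we use the Lagrange multiplier method with

$$Q = \ell + \lambda_1\Biggl(1-\nu_0-\sum_{k=1}^K\nu_{1k}\Biggr) + \lambda_2\Biggl(1-\nu_0-\sum_{k=1}^K\nu_{2k}\Biggr), $$

where $\ell$ denotes the conditional expected log likelihood above.

Setting the gradient vector $\nabla Q=0$ yields the following equations:

$$\frac{\sum_{i\in A}T_{0,i}+\sum_{j\in A^c}T_{0,j}}{\nu_0}=\lambda_1+\lambda_2, \quad \frac{\sum_{i\in A}T_{k,i}}{\nu_{1k}}=\lambda_1, \quad \frac{\sum_{j\in A^c}T_{k,j}}{\nu_{2k}}=\lambda_2, \quad k\ge 1, $$

$$\nu_0+\sum_{k=1}^K\nu_{1k}=1, \quad\quad \nu_0+\sum_{k=1}^K\nu_{2k}=1. $$

From the first three equations we can obtain

$$\nu_0=\frac{\sum_{i\in A}T_{0,i}+\sum_{j\in A^c}T_{0,j}}{\lambda_1+\lambda_2}, \quad \nu_{1k}=\frac{\sum_{i\in A}T_{k,i}}{\lambda_1}, \quad \nu_{2k}=\frac{\sum_{j\in A^c}T_{k,j}}{\lambda_2}. $$

When plugging these into the last two equations, we obtain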

$$\hat{\nu}_{0} = \frac{\sum_{i\in A}T_{0,i}+\sum_{j\in A^c}T_{0,j}}{m}, $$

and

$$\hat{\nu}_{1k} = (1-\hat{\nu}_{0})\frac{ \sum_{i\in A} T_{k,i}}{\sum_{l=1}^K \sum_{i\in A} T_{l,i}}, \quad \quad \hat{\nu}_{2k} = (1-\hat{\nu}_{0})\frac{\sum_{j\in A^c} T_{k,j}}{\sum_{l=1}^K \sum_{j\in A^c} T_{l,j}}. $$
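The closed-form updates above, with the null proportion $\nu_0$ shared between $A$ and $A^c$, can be sketched as follows. As before, this is only an illustration under the assumption of normal component densities; `normal_pdf`, `posterior_sums`, `em_no_enrichment`, the starting values, and the iteration count are hypothetical.

```python
import math

def normal_pdf(z, mu, sigma):
    # Assumption of this sketch: each f_k is a normal density.
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def posterior_sums(z, nu0, nu_alt, mus, sigma0):
    """E-step for one group: return [sum T_0, sum T_1, ..., sum T_K] over its genes."""
    K = len(nu_alt)
    sum_T = [0.0] * (K + 1)
    for zi in z:
        w = [nu0 * normal_pdf(zi, mus[0], sigma0)]
        w += [nu_alt[k] * normal_pdf(zi, mus[k + 1], sigma0) for k in range(K)]
        tot = sum(w)
        for k in range(K + 1):
            sum_T[k] += w[k] / tot
    return sum_T

def em_no_enrichment(z_A, z_Ac, mus, sigma0, n_iter=200):
    """Shared nu_0, separate non-null proportions nu_1k (set A) and nu_2k (complement)."""
    K = len(mus) - 1
    m = len(z_A) + len(z_Ac)
    nu0 = 0.5                                    # arbitrary start
    nu1 = [(1.0 - nu0) / K] * K
    nu2 = [(1.0 - nu0) / K] * K
    for _ in range(n_iter):
        S_A = posterior_sums(z_A, nu0, nu1, mus, sigma0)
        S_Ac = posterior_sums(z_Ac, nu0, nu2, mus, sigma0)
        # M-step: pooled null proportion, then rescaled non-null proportions
        nu0 = (S_A[0] + S_Ac[0]) / m
        alt_A, alt_Ac = sum(S_A[1:]), sum(S_Ac[1:])
        nu1 = [(1.0 - nu0) * s / alt_A for s in S_A[1:]]
        nu2 = [(1.0 - nu0) * s / alt_Ac for s in S_Ac[1:]]
    return nu0, nu1, nu2
```

By construction each group satisfies $\nu_0+\sum_{k}\nu_{lk}=1$ after every M-step, which mirrors the constraint imposed through the two Lagrange multipliers.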

About this article

Cite this article

Lee, S.M., Wu, B. & Kersey, J.H. Likelihood-Based Approach to Gene Set Enrichment Analysis with a Finite Mixture Model. Stat Biosci 6, 38–54 (2014). https://doi.org/10.1007/s12561-012-9076-3

Received: 04 May 2012
Accepted: 05 November 2012
Published: 21 November 2012
Issue Date: May 2014
DOI: https://doi.org/10.1007/s12561-012-9076-3
