Abstract
Set enrichment analysis methods have become commonplace tools for the analysis and interpretation of biological data. These statistical techniques identify categorical biases within lists of genes, proteins, or metabolites, with the goal of discovering functions or properties shared by the biological items in the lists. Applying these methods can provide considerable biological insight, such as participation in the same biological activity or pathway, shared interacting genes or regulators, common cellular compartmentalization, or association with disease. The methods require ordered or unordered lists of biological items as input, an understanding of the reference set from which the lists were selected, categorical classifiers describing the items, and a statistical algorithm to assess the bias of each classifier. Because most algorithms are complex and involve a large number of calculations, computer software is almost always used to execute the algorithm and to present the results.
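As an illustration of how the four ingredients fit together (item list, reference set, classifier, and statistical algorithm), the following minimal sketch scores one classifier against an unordered list. It assumes a one-sided hypergeometric test, which is one common choice and not necessarily the algorithm any particular tool uses; the function name `enrichment_p_value` and the gene identifiers are hypothetical.

```python
from math import comb


def enrichment_p_value(reference, gene_list, category):
    """Return (overlap, p) for over-representation of `category` in `gene_list`.

    Assumption: a one-sided hypergeometric test, i.e. the probability of
    seeing at least `overlap` annotated items when drawing the list at
    random, without replacement, from the reference set.
    """
    reference = set(reference)
    # Items outside the reference set cannot be scored, so drop them first.
    selected = set(gene_list) & reference
    annotated = set(category) & reference
    overlap = len(selected & annotated)

    n_ref, n_cat, n_sel = len(reference), len(annotated), len(selected)
    total = comb(n_ref, n_sel)
    # Sum the upper tail: overlap, overlap + 1, ..., largest possible overlap.
    tail = sum(
        comb(n_cat, k) * comb(n_ref - n_cat, n_sel - k)
        for k in range(overlap, min(n_cat, n_sel) + 1)
    )
    return overlap, tail / total


# Toy data with made-up identifiers: 20 reference genes, 5 of them annotated.
reference = {f"gene{i}" for i in range(1, 21)}
category = {"gene1", "gene2", "gene3", "gene4", "gene5"}
gene_list = {"gene1", "gene2", "gene3", "gene10"}

overlap, p = enrichment_p_value(reference, gene_list, category)
print(overlap, p)
```

Intersecting both the list and the classifier with the reference set before counting keeps every count on the same footing; a reference set that is too broad or too narrow changes the p-value even when the overlap is unchanged.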
This chapter provides an overview of the statistical methods used to perform an enrichment analysis. Guidelines for assembling the requisite information are presented, with a focus on careful definition of the sets used by the statistical algorithms. The need for multiple test correction when working with large libraries of classifiers is emphasized, and several options for performing the corrections are outlined. Finally, interpretation of the results of such analyses is discussed, along with examples of recent research utilizing the techniques.
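To make the multiple test correction step concrete, the sketch below adjusts a list of per-classifier p-values in two ways. The abstract does not say which corrections the chapter outlines, so the Bonferroni and Benjamini-Hochberg procedures are used here only as two widely known examples; the function names and the input p-values are illustrative assumptions.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (false discovery rate control)."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values, one per classifier tested.
raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

Bonferroni controls the chance of any false positive and becomes very strict as the classifier library grows, while Benjamini-Hochberg controls the expected fraction of false positives among the reported classifiers and is typically less conservative.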
Acknowledgments
We would like to thank Roumyana Yordanova for her thoughts and advice on statistical matters.
Copyright information
© 2009 Humana Press, a part of Springer Science+Business Media, LLC
About this protocol
Cite this protocol
Tilford, C.A., Siemers, N.O. (2009). Gene Set Enrichment Analysis. In: Nikolsky, Y., Bryant, J. (eds) Protein Networks and Pathway Analysis. Methods in Molecular Biology, vol 563. Humana Press. https://doi.org/10.1007/978-1-60761-175-2_6
DOI: https://doi.org/10.1007/978-1-60761-175-2_6
Publisher Name: Humana Press
Print ISBN: 978-1-60761-174-5
Online ISBN: 978-1-60761-175-2
eBook Packages: Springer Protocols