Extracting the Strongest Signals from Omics Data: Differentially Expressed Pathways and Beyond

  • Galina Glazko (email author)
  • Yasir Rahmatallah
  • Boris Zybailov
  • Frank Emmert-Streib
Part of the Methods in Molecular Biology book series (MIMB, volume 1613)


The analysis of gene sets (in the form of functionally related genes or pathways) has become the method of choice for extracting the strongest signals from omics data. The motivation behind using gene sets instead of individual genes is twofold. First, this approach incorporates pre-existing biological knowledge into the analysis and facilitates the interpretation of experimental results. Second, it employs a statistical hypothesis-testing framework. Here, we briefly review the main Gene Set Analysis (GSA) approaches for testing differential expression of gene sets, as well as several GSA approaches that test statistical hypotheses beyond differential expression and thereby extract additional biological information from the data. We distinguish three major types of GSA approaches, which test for (1) differential expression (DE), (2) differential variability (DV), and (3) differential co-expression (DC) of gene sets between two phenotypes. We also present a comparative analysis of power and Type I error rates for different approaches within each major type of GSA on simulated data. Our evaluation provides a concise guideline for selecting the GSA approaches that perform best under particular experimental settings. The value of the three major types of GSA approaches is illustrated with a real data example. When applied to the same data set, the major types of GSA approaches yield complementary biological information.
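To make the distinction between the three hypothesis types concrete, the following is a minimal sketch of self-contained permutation tests on simulated data. It does not reproduce the specific tests reviewed in the chapter: the function names (`gene_set_permutation_test`, `de_stat`, `dv_stat`, `dc_stat`) and the three statistics are illustrative assumptions chosen for brevity.

```python
import numpy as np

def gene_set_permutation_test(x, y, statistic, n_perm=500, seed=0):
    """Permutation p-value for one gene set compared between two phenotypes.

    x, y: arrays of shape (samples, genes), restricted to the gene set.
    statistic: function (x, y) -> float, larger means stronger evidence.
    """
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    n_x = x.shape[0]
    observed = statistic(x, y)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(pooled.shape[0])  # shuffle phenotype labels
        if statistic(pooled[idx[:n_x]], pooled[idx[n_x:]]) >= observed:
            exceed += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)

def de_stat(x, y):
    # DE (illustrative): squared distance between the two mean vectors.
    return float(np.sum((x.mean(axis=0) - y.mean(axis=0)) ** 2))

def dv_stat(x, y):
    # DV (illustrative): squared differences of per-gene log variances.
    log_ratio = np.log(x.var(axis=0, ddof=1)) - np.log(y.var(axis=0, ddof=1))
    return float(np.sum(log_ratio ** 2))

def dc_stat(x, y):
    # DC (illustrative): squared differences between correlation matrices.
    diff = np.corrcoef(x, rowvar=False) - np.corrcoef(y, rowvar=False)
    return float(np.sum(diff ** 2))

# Simulated gene set: 40 samples per phenotype, 10 genes (assumed sizes).
rng = np.random.default_rng(1)
n, p = 40, 10
x = rng.normal(0.0, 1.0, (n, p))            # reference phenotype
y_de = rng.normal(1.0, 1.0, (n, p))         # shifted means
y_dv = rng.normal(0.0, 3.0, (n, p))         # inflated variances
factor = rng.normal(size=(n, 1))            # shared factor induces correlation
y_dc = 0.9 * factor + np.sqrt(1 - 0.81) * rng.normal(size=(n, p))

p_de = gene_set_permutation_test(x, y_de, de_stat)
p_dv = gene_set_permutation_test(x, y_dv, dv_stat)
p_dc = gene_set_permutation_test(x, y_dc, dc_stat)
print(p_de, p_dv, p_dc)
```

Each simulated alternative alters only one property of the gene set (means, variances, or correlations), which is why the three types of test can return complementary information on the same data. In practice, a DV or DC statistic of this kind would typically be applied after removing mean differences, since label permutation does not separate a mean shift from a change in variance or correlation.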

Key words

Omics data; Gene set analysis approaches; Hypotheses testing; Self-contained; Competitive; Differential expression; Differential co-expression; Differential variability



We would like to thank Bárbara Macías Solís for proofreading the manuscript. Support has been provided in part by the Arkansas INBRE program, with grants from the National Center for Research Resources (P20RR016460) and the National Institute of General Medical Sciences (P20 GM103429) from the National Institutes of Health. Large-scale computer simulations were implemented using the High Performance Computing (HPC) resources at the UALR Computational Research Center, supported by the following National Science Foundation grants: CRI CNS-0855248, EPS-0701890, MRI CNS-0619069, and OISE-0729792.



Copyright information

© Springer Science+Business Media LLC 2017

Authors and Affiliations

  • Galina Glazko (1), email author
  • Yasir Rahmatallah (1)
  • Boris Zybailov (2)
  • Frank Emmert-Streib (3)

  1. Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, USA
  2. Department of Biochemistry and Molecular Biology, University of Arkansas for Medical Sciences, Little Rock, USA
  3. Computational Medicine and Statistical Learning Laboratory, Tampere University of Technology, Tampere, Finland
