Computational Statistics Approaches to Study Metabolic Syndrome

  • Ilkka Huopaniemi
  • Samuel Kaski


In this chapter, we review key research problems and methods in the analysis of ‘omics’ data: gene expression, proteomics, metabolomics, and lipidomics. We start with the common systems biology approach to studying metabolic syndrome, or any other disease: the comparative case-control setting. This setting is usually an over-simplification, because other covariates also affect the concentrations of molecules, for instance drug treatments, gender, body mass index (BMI), and time in time-series experiments. Once these covariates are taken into account, the setting becomes a multi-way experimental design. When multiple data sources are available, such as several ‘omics’ types, multiple tissues, or multiple species, each forms a different data space with different molecules or variables, which raises the problem of data integration. We first give a brief tutorial on the commonly used basic univariate and multivariate statistical approaches, which are applicable when the problem is simplified by stratifying to a case-control design. We then focus on multi-way setups of the Analysis of Variance (ANOVA) type, and in particular on their main difficulty for ‘omics’ data: the large number of variables compared to the small number of observations. Finally, we introduce a recent family of Bayesian methods that can deal with multi-way, multi-source data sets and translate biomarkers between multiple species. The approach handles small sample sizes combined with high dimensionality, and it allows rigorous estimation of the uncertainty of the results.
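To make the multi-way, "many variables, few observations" setting concrete, the following is a minimal sketch of the classical two-way ANOVA effect decomposition applied separately to each variable. It is not the Bayesian method introduced in the chapter. The function name `two_way_effects`, the factor labels (disease and treatment), the simulated data, and the effect size are all assumptions made for illustration.

```python
import numpy as np


def two_way_effects(X, a, b):
    """Classical two-way decomposition of cell means, one variable per column.

    X: (n_samples, n_variables) data matrix (hypothetical 'omics' profile).
    a, b: integer-coded factor labels per sample (e.g. disease, treatment).
    Assumes a balanced design with at least one sample in every cell.
    """
    levels_a, levels_b = np.unique(a), np.unique(b)
    # cell_means[i, j, :] = mean profile of samples with a == i and b == j
    cell_means = np.stack([
        np.stack([X[(a == i) & (b == j)].mean(axis=0) for j in levels_b])
        for i in levels_a
    ])
    grand = cell_means.mean(axis=(0, 1))        # overall mean per variable
    alpha = cell_means.mean(axis=1) - grand     # main effect of factor a
    beta = cell_means.mean(axis=0) - grand      # main effect of factor b
    # interaction = what remains after removing the mean and main effects
    inter = cell_means - grand - alpha[:, None, :] - beta[None, :, :]
    return grand, alpha, beta, inter, cell_means


# Simulated example: 12 samples (3 per cell) but 200 variables.
rng = np.random.default_rng(0)
n_per_cell, p = 3, 200
a = np.repeat([0, 0, 1, 1], n_per_cell)   # hypothetical: 0 = control, 1 = case
b = np.repeat([0, 1, 0, 1], n_per_cell)   # hypothetical: 0 = untreated, 1 = treated
X = rng.normal(size=(a.size, p))
X[a == 1, :10] += 2.0                     # assumed disease effect on 10 variables

grand, alpha, beta, inter, cell_means = two_way_effects(X, a, b)
```

With only three samples per cell, each of the 200 per-variable effect estimates is noisy. This is the difficulty that motivates methods which share statistical strength across groups of correlated variables.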


Keywords: ANOVA; Data integration; Multi-way experimental design; Omics data; Statistical analysis



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. The Charles Bronfman Institute for Personalized Medicine, The Icahn School of Medicine at Mount Sinai, New York, USA
  2. Helsinki Institute for Information Technology (HIIT), Department of Information and Computer Science, Aalto University, Espoo, Finland
