Data Mining and Knowledge Discovery, Volume 19, Issue 2, pp 261–276

Two-way analysis of high-dimensional collinear data

  • Ilkka Huopaniemi
  • Tommi Suvitaival
  • Janne Nikkilä
  • Matej Orešič
  • Samuel Kaski


We present a Bayesian model for two-way ANOVA-type analysis of high-dimensional, small sample-size datasets with highly correlated groups of variables. Modern cellular measurement methods are a main application area; typically the task is differential analysis between diseased and healthy samples, complicated by additional covariates that require a multi-way analysis. The main complication is the combination of high dimensionality and low sample size, which renders classical multivariate techniques useless. We introduce a hierarchical model that performs dimensionality reduction by assuming that the input variables come in similarly-behaving groups, and then performs an ANOVA-type decomposition on the resulting reduced-dimensional latent variables. We apply the method to lipidomic profiles from a recent large-cohort human diabetes study.
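
The structure described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' Bayesian model or inference procedure. It assumes a known hard assignment of variables to three groups, a 2x2 design (factor A by factor B), and made-up effect sizes, and it replaces posterior inference with plain within-group averages and cell-mean contrasts. All names (`simulate`, `estimate_effects`, `ALPHA`, `BETA`, `INTER`, `GROUP_SIZES`) are hypothetical and chosen for illustration only.

```python
import numpy as np

# Hypothetical setup: 100 observed variables in 3 collinear groups.
GROUP_SIZES = (20, 30, 50)
# Hypothetical effects on the 3 latent variables (baseline cell has mean 0).
ALPHA = np.array([1.0, 0.0, -1.0])  # main effect of factor A (e.g. disease)
BETA = np.array([0.0, 0.5, 0.0])    # main effect of factor B (e.g. a covariate)
INTER = np.array([0.0, 0.0, 0.8])   # A-by-B interaction


def simulate(rng, n_per_cell=10, noise_sd=0.5):
    """Draw samples from a toy two-way model with grouped variables."""
    # Each observed variable belongs to exactly one latent group.
    assign = np.repeat(np.arange(len(GROUP_SIZES)), GROUP_SIZES)
    V = np.zeros((assign.size, len(GROUP_SIZES)))
    V[np.arange(assign.size), assign] = 1.0

    # Balanced 2x2 design: levels of factors A and B for every sample.
    a = np.repeat([0, 0, 1, 1], n_per_cell)
    b = np.repeat([0, 1, 0, 1], n_per_cell)

    # ANOVA-type mean structure on the latent variables.
    mean_lat = np.outer(a, ALPHA) + np.outer(b, BETA) + np.outer(a * b, INTER)
    z = mean_lat + rng.standard_normal(mean_lat.shape)

    # Observed data: each variable copies its group's latent value plus noise.
    x = z @ V.T + noise_sd * rng.standard_normal((a.size, assign.size))
    return x, a, b, assign, (ALPHA, BETA, INTER)


def estimate_effects(x, a, b, assign):
    """Plug-in estimates: average within groups, then contrast cell means."""
    k = assign.max() + 1
    # Dimensionality reduction: one averaged score per variable group.
    z_hat = np.stack([x[:, assign == g].mean(axis=1) for g in range(k)], axis=1)

    def cell(i, j):
        return z_hat[(a == i) & (b == j)].mean(axis=0)

    m00, m01, m10, m11 = cell(0, 0), cell(0, 1), cell(1, 0), cell(1, 1)
    alpha_hat = m10 - m00
    beta_hat = m01 - m00
    inter_hat = m11 - m10 - m01 + m00
    return alpha_hat, beta_hat, inter_hat


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, a, b, assign, _ = simulate(rng, n_per_cell=10)
    for name, est in zip(("alpha", "beta", "interaction"),
                         estimate_effects(x, a, b, assign)):
        print(name, np.round(est, 2))
```

With only ten samples per cell the estimates are noisy, which is the small sample-size regime the abstract describes. A hierarchical Bayesian treatment would additionally infer the group assignments and quantify uncertainty in the effects, neither of which this sketch attempts.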


Keywords: ANOVA, Factor analysis, Hierarchical model, Metabolomics, Multi-way analysis, Small sample-size



Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Ilkka Huopaniemi (1)
  • Tommi Suvitaival (1)
  • Janne Nikkilä (1, 2)
  • Matej Orešič (3)
  • Samuel Kaski (1)

  1. Department of Information and Computer Science, Helsinki University of Technology (TKK), Espoo, Finland
  2. Department of Basic Veterinary Sciences (Division of Microbiology and Epidemiology), Faculty of Veterinary Medicine, University of Helsinki, Helsinki, Finland
  3. VTT Technical Research Centre of Finland (VTT), Espoo, Finland
