Two-way analysis of high-dimensional collinear data

Data Mining and Knowledge Discovery

Abstract

We present a Bayesian model for two-way ANOVA-type analysis of high-dimensional, small sample-size datasets with highly correlated groups of variables. Modern cellular measurement methods are a main application area; typically the task is differential analysis between diseased and healthy samples, complicated by additional covariates requiring a multi-way analysis. The main complication is the combination of high dimensionality and low sample size, which renders classical multivariate techniques useless. We introduce a hierarchical model which does dimensionality reduction by assuming that the input variables come in similarly-behaving groups, and performs an ANOVA-type decomposition for the set of reduced-dimensional latent variables. We apply the methods to study lipidomic profiles of a recent large-cohort human diabetes study.
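The structure described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' model or inference procedure. It assumes a simplified generative form in which each observed variable belongs to exactly one group, each group shares one latent variable, and the latent variables carry additive two-way effects (for example, disease status and a second covariate) plus an interaction. The function names, the hard group assignment, the unit-variance latent noise, and the plug-in group-average estimator are all hypothetical choices made for illustration; the paper itself uses Bayesian inference.

```python
import numpy as np

def simulate(n_per_cell=5, n_vars=200, n_groups=3, noise_sd=0.5, seed=0):
    """Simulate a 2x2 two-way design whose effects act on grouped latent variables."""
    rng = np.random.default_rng(seed)
    # Assumption: hard clustering, every variable belongs to exactly one group.
    assign = rng.permutation(np.arange(n_vars) % n_groups)
    V = np.zeros((n_vars, n_groups))
    V[np.arange(n_vars), assign] = 1.0
    # Hypothetical effect sizes for the two factors and their interaction.
    alpha = rng.normal(size=n_groups)
    beta = rng.normal(size=n_groups)
    inter = rng.normal(size=n_groups)
    # Factor labels for the four cells of the design.
    a = np.repeat([0, 0, 1, 1], n_per_cell)
    b = np.repeat([0, 1, 0, 1], n_per_cell)
    # ANOVA-type decomposition of the latent means.
    mean_z = np.outer(a, alpha) + np.outer(b, beta) + np.outer(a * b, inter)
    z = mean_z + rng.normal(size=mean_z.shape)
    # Observed variables: copies of their group's latent variable plus noise.
    X = z @ V.T + noise_sd * rng.normal(size=(a.size, n_vars))
    return X, a, b, V, alpha

def estimate_alpha(X, a, b, V):
    """Plug-in estimate of the first factor's effect (not Bayesian inference)."""
    # Reduce dimensionality by averaging the variables within each group.
    z_hat = (X @ V) / V.sum(axis=0)
    base = z_hat[(a == 0) & (b == 0)].mean(axis=0)
    return z_hat[(a == 1) & (b == 0)].mean(axis=0) - base
```

Calling `simulate()` with its defaults gives 20 samples of 200 variables, which mimics the high-dimensional, small-sample setting; averaging within groups reduces the problem to three latent dimensions before any effect is estimated. With so few samples the plug-in estimate is noisy, which is one motivation for a full hierarchical treatment.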

Author information

Corresponding author

Correspondence to Ilkka Huopaniemi.

Additional information

Responsible editors: Aleksander Kołcz, Wray Buntine, Marko Grobelnik, Dunja Mladenic, and John Shawe-Taylor.

About this article

Cite this article

Huopaniemi, I., Suvitaival, T., Nikkilä, J. et al. Two-way analysis of high-dimensional collinear data. Data Min Knowl Disc 19, 261–276 (2009). https://doi.org/10.1007/s10618-009-0142-5

  • DOI: https://doi.org/10.1007/s10618-009-0142-5
