Sparse multi-block PLSR for biomarker discovery when integrating data from LC–MS and NMR metabolomics

Abstract

The objective of this study was to implement a multivariate method that simultaneously analyzes multi-block metabolomics data and performs variable selection in order to discover potential biomarkers. We call this method sparse multi-block partial least squares regression (Sparse MBPLSR). To develop it, we first defined a nonlinear iterative partial least squares (NIPALS) algorithm for Sparse PLSR and then extended it to Sparse MBPLSR. Since over-fitting is an issue when variable selection is involved, we implemented cross model validation (CMV) to assess the reliability and stability of the selected variables. The performance of the method was evaluated using a simulated data set and a multi-block data set from a dietary intervention study in which pigs were used as a model for humans. The objective of that intervention study was to investigate the biochemical effects in plasma after dietary intervention with breads varying in types of dietary fiber and to identify potential biomarkers. By introducing Sparse MBPLSR, we aimed to identify biomarkers by analyzing data from LC–MS and NMR instruments simultaneously and, in addition, to explore the relationships among the measurement variables of this multi-block data set. The results showed that Sparse MBPLSR with CMV is a useful tool for analyzing multi-block metabolomics data with good prediction and for identifying potential biomarkers.
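
As an illustration of the general idea only, and not the authors' algorithm, the following minimal Python sketch shows one common way to obtain sparse PLS weights: a NIPALS-style loop for a single response in which the weight vector is soft-thresholded before each deflation step. The function names (`soft_threshold`, `sparse_pls1`, `concat_blocks`), the `sparsity` tuning parameter, and the choice to scale each block to unit Frobenius norm before concatenation are all assumptions made for this example.

```python
import numpy as np

def soft_threshold(w, lam):
    """Shrink every weight toward zero by lam; weights below lam become 0."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def sparse_pls1(X, y, n_components=2, sparsity=0.5):
    """Sparse PLS1 sketch: NIPALS-style deflation with soft-thresholded weights.

    sparsity in [0, 1) is a hypothetical tuning parameter; the threshold used
    for each component is sparsity * max|w|.
    Returns the (variables x components) matrix of sparse weight vectors.
    """
    X = X - X.mean(axis=0)               # column-centre X
    y = y - y.mean()                     # centre y
    W = np.zeros((X.shape[1], n_components))
    for a in range(n_components):
        w = X.T @ y                      # covariance of each variable with y
        w = soft_threshold(w, sparsity * np.abs(w).max())
        norm = np.linalg.norm(w)
        if norm == 0.0:                  # nothing left to model
            break
        w = w / norm
        t = X @ w                        # scores
        tt = t @ t
        p = X.T @ t / tt                 # X loadings
        q = (y @ t) / tt                 # y loading
        X = X - np.outer(t, p)           # deflate X
        y = y - q * t                    # deflate y
        W[:, a] = w
    return W

def concat_blocks(blocks):
    """Centre each block, scale it to unit Frobenius norm, and concatenate."""
    scaled = []
    for B in blocks:
        B = B - B.mean(axis=0)
        scaled.append(B / np.linalg.norm(B))
    return np.hstack(scaled)

# Usage on synthetic data: two blocks standing in for LC-MS and NMR variables.
rng = np.random.default_rng(0)
lcms = rng.standard_normal((500, 10))
nmr = rng.standard_normal((500, 8))
y = 3.0 * lcms[:, 0] + 3.0 * nmr[:, 0]   # one informative variable per block
W = sparse_pls1(concat_blocks([lcms, nmr]), y, n_components=1, sparsity=0.5)
selected = np.flatnonzero(W[:, 0])       # indices of variables with non-zero weight
print(selected)
```

Variables with non-zero weights are the ones the model selects. In a cross model validation setting, the whole procedure, including the choice of `sparsity`, would be repeated inside each outer cross-validation segment, so that held-out samples never influence which variables are selected.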

Acknowledgments

The authors are grateful for financial support from the Nordic Centre of Excellence on Food, Nutrition and Health “Systems biology in controlled dietary interventions and cohort studies” (SYSDIET; 070014), funded by NordForsk. In addition, this work was supported by Grant 203699 (New statistical tools for integrating and exploiting complex genomic and phenotypic data sets), financed by the Research Council of Norway.

Conflict of interest

The authors declared no competing interests.

Author information

Correspondence to İbrahim Karaman.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DOCX 109 kb)

Supplementary material 2 (DOCX 78 kb)

About this article

Cite this article

Karaman, İ., Nørskov, N.P., Yde, C.C. et al. Sparse multi-block PLSR for biomarker discovery when integrating data from LC–MS and NMR metabolomics. Metabolomics 11, 367–379 (2015). https://doi.org/10.1007/s11306-014-0698-y

  • DOI: https://doi.org/10.1007/s11306-014-0698-y

Keywords

  • Sparse PLSR
  • Multi-block
  • Cross model validation
  • Variable selection
  • Biomarker discovery