, 15:63

Two data pre-processing workflows to facilitate the discovery of biomarkers by 2D NMR metabolomics

  • Baptiste Féraud
  • Justine Leenders
  • Estelle Martineau
  • Patrick Giraudeau
  • Bernadette Govaerts
  • Pascal de Tullio
Original Article



The pre-processing of analytical data in metabolomics must be considered as a whole to allow the construction of a global and unique object for any further simultaneous data analysis or multivariate statistical modelling. For 1D 1H-NMR metabolomics experiments, best practices for data pre-processing are well defined, but this is not yet the case for 2D experiments (for instance COSY, as considered in this paper).


By considering the added value of a second dimension, the objective is to propose two workflows dedicated to 2D NMR data handling and preparation (the Global Peak List and Vectorization approaches) and to compare them, both with each other and with 1D standards. This comparison makes it possible to determine which methodology is best in terms of the amount of metabolomic content and to explore the advantages of the selected workflow in distinguishing among treatment groups and identifying relevant biomarkers. This paper therefore explores the necessity of novel 2D pre-processing workflows, the evaluation of their quality and the evaluation of their performance in the subsequent determination of accurate (2D) biomarkers.
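As an illustration of the Vectorization idea only, the sketch below unfolds each 2D intensity matrix into one row of a samples-by-variables table. It assumes that all spectra share the same 2D grid; the function names `vectorize` and `build_data_matrix` are hypothetical, and any bucketing, alignment or normalization steps of the actual workflow are not represented here.

```python
from typing import List

# Assumption: a 2D spectrum is a rectangular matrix of intensities
# (rows along one frequency axis, columns along the other).
Spectrum2D = List[List[float]]


def vectorize(spectrum: Spectrum2D) -> List[float]:
    """Unfold a 2D intensity matrix row by row into one flat vector."""
    return [value for row in spectrum for value in row]


def build_data_matrix(spectra: List[Spectrum2D]) -> List[List[float]]:
    """Stack the unfolded spectra: one row per sample, one column per 2D point."""
    n_rows, n_cols = len(spectra[0]), len(spectra[0][0])
    for spectrum in spectra:
        same_grid = len(spectrum) == n_rows and all(
            len(row) == n_cols for row in spectrum
        )
        if not same_grid:
            raise ValueError("all spectra must share the same 2D grid")
    return [vectorize(spectrum) for spectrum in spectra]
```

The resulting table has the same samples-by-variables layout as a conventional 1D spectral matrix, so the same multivariate tools can in principle be applied to it.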


To select the most informative data source, MIC (Metabolomic Informative Content) indexes are used; these are based on clustering and on inertia measures of quality. Then, to highlight biomarkers or critical spectral zones, the PLS-DA model is used, along with more advanced sparse algorithms (sPLS and L-sOPLS).
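To give a feel for what an inertia-based quality measure can look like, the sketch below computes the share of total inertia explained by known groups (between-group inertia divided by total inertia). This is a generic illustration under that assumption and not the definition of the MIC indexes themselves; the function name `inertia_ratio` is hypothetical.

```python
from typing import Dict, Hashable, List, Sequence


def centroid(rows: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of a set of observations."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inertia_ratio(X: Sequence[Sequence[float]], labels: Sequence[Hashable]) -> float:
    """Between-group inertia over total inertia, a value in [0, 1].

    Values close to 1 mean that the groups are well separated in the
    variable space; values close to 0 mean that they overlap.
    """
    global_centroid = centroid(X)
    total = sum(squared_distance(row, global_centroid) for row in X)
    groups: Dict[Hashable, List[Sequence[float]]] = {}
    for row, label in zip(X, labels):
        groups.setdefault(label, []).append(row)
    between = sum(
        len(rows) * squared_distance(centroid(rows), global_centroid)
        for rows in groups.values()
    )
    return between / total if total > 0 else 0.0
```

Computing such a ratio on each candidate data set (for instance 1D versus 2D) gives a single number per data source that can be compared directly.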


Results are discussed for two different experimental designs (one unsupervised and based on human urine samples, the other controlled and based on spiked serum media). MIC indexes are reported and guide the choice of the most relevant workflow for the subsequent analyses. Finally, biomarkers are provided for each case, and the predictive power of each candidate model is assessed with cross-validated RMSEP measures.
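The cross-validated RMSEP (root mean squared error of prediction) mentioned above can be sketched as follows. To keep the example self-contained, a one-variable least-squares line stands in for the PLS-type models; that substitution, the interleaved fold assignment and the function names `fit_line` and `cv_rmsep` are assumptions made for illustration only.

```python
import math
from typing import List, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return slope, mean_y - slope * mean_x


def cv_rmsep(x: Sequence[float], y: Sequence[float], k: int) -> float:
    """k-fold cross-validated RMSEP.

    Each sample is predicted exactly once by a model fitted without it;
    the squared errors of all held-out predictions are then averaged.
    """
    n = len(x)
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= number of samples")
    squared_errors: List[float] = []
    for fold in range(k):
        test_idx = list(range(fold, n, k))  # deterministic interleaved folds
        held_out = set(test_idx)
        train_x = [x[i] for i in range(n) if i not in held_out]
        train_y = [y[i] for i in range(n) if i not in held_out]
        slope, intercept = fit_line(train_x, train_y)
        squared_errors.extend(
            (y[i] - (slope * x[i] + intercept)) ** 2 for i in test_idx
        )
    return math.sqrt(sum(squared_errors) / n)
```

A lower value indicates better predictive power, which is how candidate models can be ranked against each other on the same data.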


In conclusion, it is shown that no solution is universally the best, but that 2D experiments make it possible to clearly identify relevant cross-peak biomarkers even when the initial separability between groups is poor. The MIC measures associated with the candidate workflows (2D GPL, 2D vectorization and 1D, each with specific parameters) help to visualize which data set should be used as a priority to find biomarkers more easily. The diversity of data sources, mainly 1D versus 2D, may often lead to complementary or confirmatory results.


Keywords: 2D NMR; 1H-NMR; COSY spectra; Pre-processing workflows; Metabolomic informative content (MIC); Biomarker discovery; PLS; sPLS; L-sOPLS



Support from the IAP Research Network P7/06 of the Belgian State (Belgian Science Policy) is gratefully acknowledged. Support from the CORSAIRE metabolomics platform (Biogenouest network) is also acknowledged. Pascal de Tullio is Research Director of the Fonds de la Recherche Scientifique (FNRS).

Author contributions

BF, BG, PG and PT conceived and designed research. EM, JL and PT collected and supplied the data. BF analyzed data and wrote the manuscript. All authors read and approved the manuscript.

Compliance with ethical standards

Conflict of interest

All authors declare that they have no conflict of interest.

Ethical approval

This study analyzes collected data which involved human participants. The studies were approved by our local Ethics Committee (CHR Citadelle, Liège, Number B412201215082-1267) and all subjects gave their informed consent.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Software availability statement

The raw data were processed with the Bruker Topspin 3.5 software. Peak lists were extracted using ACD/Labs 12.00 (ACD/NMR processor). The R software environment was used exclusively for statistical purposes, either via existing packages (pls, spls, ropls) or via ad hoc code (PepsNMR package; MIC indexes and L-sOPLS functions, which are available here:



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Institute of Statistics, Biostatistics and Actuarial Sciences (ISBA), Université Catholique de Louvain (UCL), Louvain-la-Neuve, Belgium
  2. Machine Learning Group, Université Catholique de Louvain (UCL), Louvain-la-Neuve, Belgium
  3. Center for Interdisciplinary Research on Medicines (CIRM), Metabolomics group, Université de Liège (ULg), Liege, Belgium
  4. EBSI Team, Chimie et Interdisciplinarité, Synthèse, Analyse, Modélisation (CEISAM), CNRS, UMR 6230, Université de Nantes, Nantes, France
  5. Spectromaîtrise, CAPACITES SAS, Nantes, France
  6. Institut Universitaire de France, Paris Cedex 5, France
