Two data pre-processing workflows to facilitate the discovery of biomarkers by 2D NMR metabolomics
The pre-processing of analytical data in metabolomics must be considered as a whole, so that a single, global data object can be built for any subsequent simultaneous data analysis or multivariate statistical modelling. For 1D 1H-NMR metabolomics experiments, best practices for data pre-processing are well defined, but this is not yet the case for 2D experiments (for instance COSY, the experiment considered in this paper).
By considering the added value of a second dimension, the objective is to propose two workflows dedicated to 2D NMR data handling and preparation (the Global Peak List and Vectorization approaches) and to compare them, both with each other and with 1D standards. This comparison makes it possible to determine which methodology yields the most metabolomic content and to explore the advantages of the selected workflow in distinguishing among treatment groups and identifying relevant biomarkers. This paper therefore addresses the need for novel 2D pre-processing workflows, the evaluation of their quality, and the evaluation of their performance in the subsequent determination of accurate (2D) biomarkers.
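As a rough illustration of the vectorization idea, the sketch below unfolds each 2D spectrum, stored as a grid of intensities, into one row of a samples-by-variables matrix. It is a minimal sketch and not the authors' workflow: the function names, the nested-list representation and the toy intensities are assumptions made for this example, and it presumes that all spectra share the same chemical-shift grid.

```python
def spectrum_shape(spectrum):
    """Return (n_rows, n_cols) of a 2D spectrum stored as a list of rows."""
    n_cols = len(spectrum[0])
    if any(len(row) != n_cols for row in spectrum):
        raise ValueError("spectrum rows must all have the same length")
    return (len(spectrum), n_cols)


def vectorize_spectrum(spectrum):
    """Unfold a 2D spectrum into one flat list, row by row."""
    return [value for row in spectrum for value in row]


def build_data_matrix(spectra):
    """Stack vectorized spectra into a samples-by-variables matrix."""
    shapes = {spectrum_shape(s) for s in spectra}
    if len(shapes) != 1:
        raise ValueError("all spectra must be sampled on the same grid")
    return [vectorize_spectrum(s) for s in spectra]


# Toy data (hypothetical): two samples, each a 2 x 3 grid of intensities.
spectra = [
    [[0.0, 1.0, 0.0], [2.0, 0.0, 0.5]],
    [[0.1, 0.9, 0.0], [1.8, 0.0, 0.7]],
]
X = build_data_matrix(spectra)  # 2 samples x 6 variables
```

Each column of `X` then corresponds to one fixed position in the 2D plane, which is what allows standard multivariate models to be applied to the unfolded data.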
To select the most informative data source, MIC (Metabolomic Informative Content) indexes, based on clustering and inertia measures of quality, are used. Then, to highlight biomarkers or critical spectral zones, the PLS-DA model is used, along with more advanced sparse algorithms (sPLS and L-sOPLS).
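To give an intuition for an inertia-based quality measure, the sketch below computes the share of total inertia explained by the separation between known groups (between-group inertia divided by total inertia). This is an illustrative assumption about what such an index can look like, not the MIC implementation distributed with the paper; the function names and the toy data are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def between_inertia_ratio(X, groups):
    """Between-group inertia divided by total inertia (value in [0, 1])."""
    center = mean_vector(X)
    total = sum(squared_distance(row, center) for row in X)
    if total == 0:
        raise ValueError("total inertia is zero: all samples are identical")
    by_group = {}
    for row, label in zip(X, groups):
        by_group.setdefault(label, []).append(row)
    between = sum(
        len(rows) * squared_distance(mean_vector(rows), center)
        for rows in by_group.values()
    )
    return between / total


# Toy data (hypothetical): four samples, two variables.
X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
well_separated = between_inertia_ratio(X, ["A", "A", "B", "B"])
poorly_separated = between_inertia_ratio(X, ["A", "B", "A", "B"])
```

A ratio close to 1 means that most of the variability lies between groups, so a data source with a higher ratio would be the more promising one for finding group-related biomarkers.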
Results are discussed according to two different experimental designs (one which is unsupervised and based on human urine samples, and the other which is controlled and based on spiked serum media). MIC indexes are reported and guide the choice of the most relevant workflow for the subsequent analyses. Finally, biomarkers are provided for each case and the predictive power of each candidate model is assessed with the cross-validated RMSEP (root mean squared error of prediction).
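The cross-validated RMSEP mentioned above can be sketched as follows. To keep the example self-contained, an ordinary least-squares line on a single predictor stands in for the PLS-type models of the paper; this substitution, the fold assignment and the toy data are assumptions made purely for illustration.

```python
import math


def fit_line(x, y):
    """Least-squares slope and intercept for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cv_rmsep(x, y, n_folds):
    """Root mean squared error of prediction over k-fold cross-validation."""
    n = len(x)
    squared_errors = []
    for fold in range(n_folds):
        # Every n_folds-th sample, starting at `fold`, is held out.
        test_idx = set(range(fold, n, n_folds))
        train_x = [x[i] for i in range(n) if i not in test_idx]
        train_y = [y[i] for i in range(n) if i not in test_idx]
        slope, intercept = fit_line(train_x, train_y)
        for i in test_idx:
            prediction = slope * x[i] + intercept
            squared_errors.append((y[i] - prediction) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy data (hypothetical): an exact linear response and a perturbed one.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y_exact = [2.0 * xi + 1.0 for xi in x]
y_noisy = [1.0, 3.5, 4.5, 7.5, 8.5, 11.5]

rmsep_exact = cv_rmsep(x, y_exact, n_folds=3)
rmsep_noisy = cv_rmsep(x, y_noisy, n_folds=3)
```

Because every prediction is made on samples left out of the fit, a lower cross-validated RMSEP indicates better predictive power rather than a better fit to the training data.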
In conclusion, it is shown that no solution is universally the best in every case, but that 2D experiments make it possible to clearly find relevant cross-peak biomarkers even when the initial separability between groups is poor. The MIC measures associated with the candidate workflows (2D GPL, 2D vectorization and 1D, each with specific parameters) show which data set should be used first in order to find biomarkers more easily. The diversity of data sources, mainly 1D versus 2D, may often lead to complementary or confirmatory results.
Keywords: 2D NMR; 1H-NMR; COSY spectra; Pre-processing workflows; Metabolomic informative content (MIC); Biomarker discovery; PLS; sPLS; L-sOPLS
Support from the IAP Research Network P7/06 of the Belgian State (Belgian Science Policy) is gratefully acknowledged. Support from the CORSAIRE metabolomics platform (Biogenouest network) is also acknowledged. Pascal de Tullio is Research Director of the Fonds de la Recherche Scientifique (FNRS).
BF, BG, PG and PT conceived and designed research. EM, JL and PT collected and supplied the data. BF analyzed data and wrote the manuscript. All authors read and approved the manuscript.
Compliance with ethical standards
Conflict of interest
All authors declare that they have no conflict of interest.
This study analyzes collected data which involved human participants. The studies were approved by our local Ethics Committee (CHR Citadelle, Liège, Number B412201215082-1267) and all subjects gave their informed consent.
Informed consent was obtained from all individual participants included in the study.
Software availability statement
The raw data were processed with the Bruker Topspin 3.5 software. Peak lists were extracted using ACD/Labs 12.00 (ACD/NMR processor). The R software environment (http://www.R-project.org) was used exclusively for statistical purposes, either via existing packages (pls, spls, ropls) or via code written ad hoc (the PepsNMR package, and the MIC index and L-sOPLS functions, which are available at https://github.com/ManonMartin/MBXUCL).
- Chung, D., Chun, H., & Keles, S. (2012). spls: Sparse partial least squares (SPLS) regression and classification. R package version 2.1-1.
- Chun, H., & Keles, S. (2007). Sparse partial least squares regression with an application to genome scale transcription factor analysis. Madison: Department of Statistics, University of Wisconsin.
- Feraud, B. (2019). Statistical contributions to the analysis of 2D NMR spectra in metabolomics studies: From pre-processing workflows to 2D biomarker discovery. http://hdl.handle.net/2078.1/214124.
- MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1, University of California Press, pp. 281–297.
- Martin, M., Legat, B., Leenders, J., Vanwinsberghe, J., Rousseau, R., et al. (2017). PepsNMR for the 1H-NMR metabolomic data pre-processing. ISBA Discussion Paper, 2017/22, http://hdl.handle.net/2078.1/187159.
- Murtagh, F., & Legendre, P. (2011). Ward’s hierarchical clustering method: clustering criterion and agglomerative algorithm, arXiv preprint arXiv:1111.6285.
- Ravanbakhsh, S., Liu, P., Bjorndahl, T., Mandal, R., Grant, J. R., Wilson, M., & Greiner, R. (2014). Accurate, fully-automated NMR spectral profiling for metabolomics. arXiv:1409.1456.
- Rousseau, R. (2011). Statistical contribution to the analysis of metabonomic data in 1H-NMR spectroscopy, PhD Thesis, UCL, http://hdl.handle.net/2078.1/75532.
- Thevenot, E. A., Roux, A., Xu, Y., Ezan, E., & Junot, C. (2015). Analysis of the human adult urinary metabolome variations with age, body mass index, and gender by implementing a comprehensive workflow for univariate and OPLS statistical analyses. Journal of Proteome Research, 14(8), 3322–3335.