ROIMCR: a powerful analysis strategy for LC-MS metabolomic datasets
The analysis of LC-MS metabolomic datasets is a challenging task in a wide range of disciplines, since it demands extensive processing of a vast amount of data. Different LC-MS data analysis packages have been developed in the last few years to facilitate this analysis. However, most of these strategies involve chromatographic alignment and peak shaping and often associate each “feature” (i.e., chromatographic peak) with a unique m/z measurement. Thus, the development of an alternative data analysis strategy that is applicable to most types of MS datasets and properly addresses these issues is still a challenge in the metabolomics field.
Here, we present an alternative approach called ROIMCR to: i) filter and compress massive LC-MS datasets while transforming their original structure into a data matrix of features without losing relevant information through the search of regions of interest (ROIs) in the m/z domain and ii) resolve compressed data to identify their contributing pure components without previous alignment or peak shaping by applying a Multivariate Curve Resolution-Alternating Least Squares (MCR-ALS) analysis. In this study, the basics of the ROIMCR method and its implementation are described in detail. Data were analyzed using the MATLAB (The MathWorks, Inc., www.mathworks.com) programming and computing environment. The application of the ROIMCR methodology is illustrated with an example of LC-MS data generated in a lipidomic study and with other examples of recent applications.
The methodology presented here combines the benefits of data filtering and compression based on the searching of ROI features, without the loss of spectral accuracy. The method has the benefits of the application of the powerful MCR-ALS data resolution method without the necessity of performing chromatographic peak alignment or modelling. The presented method is a powerful alternative to other existing data analysis approaches that do not use the MCR-ALS method to resolve LC-MS data. The ROIMCR method also represents an improved strategy compared to the direct applications of the MCR-ALS method that use less-powerful data compression strategies such as binning and windowing. Overall, the strategy presented here confirms the usefulness of the ROIMCR chemometrics method for analyzing LC-MS untargeted metabolomics data.
Keywords: LC-MS; Data analysis; Data compression; Data resolution; Regions of interest (ROI); MCR-ALS; Metabolomics; Lipidomics; Chemometrics; Untargeted analysis
Abbreviations
CWT: Continuous wavelet transformations
HPLC: High performance liquid chromatography
LC-MS: Liquid chromatography coupled to mass spectrometry
MCR-ALS: Multivariate Curve Resolution-Alternating Least Squares
ROI: Regions of interest
UHPLC: Ultra-high performance liquid chromatography
The challenge of analyzing data is one of the main concerns of metabolomic liquid chromatography coupled to mass spectrometry (LC-MS) studies. Several software packages exist for MS-based metabolomic data analysis, including proprietary commercial, open-source, and online workflows. Some commercial tools provided by major vendors of MS and omics high-throughput analytical instruments include MassHunter (Agilent Technologies), SIEVE (Thermo Scientific) and Progenesis QI (Waters). Some of the most frequently used open-source software packages include XCMS [3, 4] (and the XCMS-based Metabox and metaX), CAMERA, MAIT, MetaboAnalyst, Workflow4Metabolomics, MZmine and MetAlign. However, none of these approaches has emerged as the best strategy, and the analysis of LC-MS data remains an unresolved problem in the bioinformatics field due to the methodological discrepancies among these approaches.
The analysis of high-resolution LC-MS-based metabolomic datasets usually begins with filtering and compression, which are required to reduce the data to a size that computers can handle (without compromising the original information) and to prevent errors linked to restricted memory capacity. In addition to compressing data, this first step must also convert the raw data into a matrix representation to obtain a set of well-structured variables (features) to analyze. The generated data matrices (x, y) are arranged with retention times in the rows (x-direction) and m/z values in the columns (y-direction). A classical procedure used for data compression and matrix transformation is binning. Using the binning procedure, high-resolution raw mass spectra are converted into a matrix representation by dividing the m/z axis into parts with a specific bin size that is generally set to a multiple of the mass accuracy of the mass spectrometer. However, a significant disadvantage of binning is the difficulty of choosing a proper bin size for a specific dataset, since the selected m/z bin size strongly affects the recovery of the proper peak shape of the elution profiles. If the selected bin size is excessively small, chromatographic peaks fluctuate between bins and cannot be determined because the chromatographic shape of the peak is not visible. If the bin size is excessively large, various peaks may occur in the same bin, and tiny peaks might disappear due to the elevated noise level. Moreover, peak splitting might occur for equidistant binning, regardless of the bin size. One major drawback of binning is the reduction in spectral accuracy originating from the compression of data in the m/z dimension, which hinders the final identification of metabolites.
Moreover, in most cases, the compression performed with binning is not sufficient and further windowing (i.e., independently selecting continuous regions in the rows (time) or the columns (m/z) to be analyzed) is necessary. Nevertheless, when performing windowing, the whole process is more tedious and time-consuming, since one sample must be analyzed in several parts.
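The peak splitting produced by equidistant binning can be illustrated with a minimal Python sketch. The function name bin_spectrum, the bin size and the three m/z readings are hypothetical values chosen for illustration and are not taken from the study; the sketch only shows how readings of one ion that fall on either side of a bin edge end up in different columns.

```python
import math
from collections import defaultdict

def bin_spectrum(mz_values, intensities, bin_size):
    """Sum the intensities that fall into equidistant m/z bins of width bin_size."""
    binned = defaultdict(float)
    for mz, inten in zip(mz_values, intensities):
        binned[math.floor(mz / bin_size)] += inten
    return dict(binned)

# Hypothetical readings of the same ion in three consecutive scans: (m/z, intensity)
scans = [(500.099, 1000.0), (500.101, 1200.0), (500.098, 900.0)]
BIN_SIZE = 0.1  # Da, illustrative value only

# Collect, for every bin index, the scans in which that bin received signal
trace = defaultdict(list)
for scan_idx, (mz, inten) in enumerate(scans):
    for bin_idx, value in bin_spectrum([mz], [inten], BIN_SIZE).items():
        trace[bin_idx].append((scan_idx, value))

split_bins = sorted(trace)  # bin indices that received signal from this single ion
```

In this toy case the single ion is spread over two neighbouring bins, so neither bin contains the complete chromatographic trace.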
A better alternative to binning and windowing is based on the assumption that analyte signals form regions with a high density of data points surrounded by a “data void”, as first presented by Stolt et al. These regions where analytes are found are called regions of interest (ROIs) and are searched according to specific criteria (i.e., a particular threshold intensity, admissible mass error and minimum number of occurrences). Overall, the ROI strategy consists of considering data included in these regions while rejecting the other data. This strategy has already been implemented in the centWave algorithm of the XCMS software. The result of the search for ROIs in a sample is a set of mass traces with distinct dimensions that must ultimately be reorganized into a data matrix. In contrast to the binning procedure, no reduction in spectral resolution occurs as a result of the application of the ROI searching procedure, since the bin size is not fixed. Thus, the ROI strategy allows researchers to take full advantage of all the benefits of high-resolution MS techniques. Many of the current metabolomic data analysis software tools use ROI compression as a preliminary step for peak detection and/or integration.
Following the ROI search, data filtering and compression, the next crucial step in LC-MS-based metabolomic data analysis is data resolution. Most of the existing LC-MS data analysis approaches require two steps (i.e., chromatographic peak modelling and alignment) before peak resolution. Alignment methods search for matching peaks over various chromatographic runs and peak modelling methods force peaks to have a delimited and more regular shape, typically through the application of continuous wavelet transformations (CWT) and optional Gaussian fitting. Therefore, preliminary peak modelling and alignment appear as indispensable steps in most of the currently available data analysis packages and are often linked to an unknown number of sources of error. In contrast, neither of the two corrections (i.e., peak modelling and alignment) is required when using Multivariate Curve Resolution-Alternating Least Squares (MCR-ALS) methods, since no modelling of elution profiles (peaks) is required (see below) and the data only need to be aligned in the spectral direction or mode. MCR methods are particularly powerful for mixture analysis and resolution in the simultaneous analysis of multirun chromatographic data.
The main goal of MCR-ALS methods is to resolve spectra arising from mixtures of the chemical constituents present in a sample into contributions from the individual components in the mixtures. Namely, MCR-ALS seeks to model the underlying physical processes that generate the data in terms of the composition of a sample. MCR-ALS-resolved MS spectra profiles are then used directly to identify the chemical identities of metabolites through a comparison with standards or by searching a library. In the last few years, MCR-ALS methods have emerged as highly effective tools to overcome the lack of instrumental selectivity and coelution problems in different application areas, particularly in LC-MS-based metabolomic datasets.
Moreover, in this study, we provide an example of the performance of the ROIMCR strategy in analyzing a lipidomic LC-MS dataset. The illustrated lipidomic dataset was generated in an experiment performed in a previous study by the authors in which a human placental choriocarcinoma cell line (JEG-3) was exposed to the endocrine disruptor chemical tributyltin (TBT). Examples of other applications to more complex systems have been recently published [17, 18, 19, 20, 21, 22] and are briefly described in the “Applications of the ROIMCR procedure” section of this manuscript. Researchers interested in the ROIMCR procedure can test this strategy using the example data and the MATLAB functions for ROI compression, both of which are provided in a protocol written by the authors. That protocol, which is available at https://www.nature.com/protocolexchange/protocols/4347, provides a step-by-step description of the implementation of the ROIMCR procedure. In the present study, a detailed description of the basics and fundamentals of the methodology is presented.
A description of the ROI methodology is provided here. In addition, a brief description of the MCR-ALS method is presented below to facilitate the understanding of the whole ROIMCR procedure. MCR-ALS solves the MCR bilinear model (see Eqs. (1) and (2) below) using an alternating least squares optimization algorithm. The MCR-ALS method is already a well-established chemometric method and its principles and basis have been described in previous studies [23, 24, 25]. Its software implementation in the MATLAB computing and visualization language (The MathWorks Inc., https://www.mathworks.com) and other details are found on its official webpage: www.mcrals.info.
ROI search in one LC-MS sample
1. Search for m/z values associated with MS intensities greater than a signal threshold value (e.g. 0.1–1% of the mean/maximum signal intensity) in the first scan.
2. Search for clusters of m/z values enclosed within a specific mass error tolerance in the same scan.
3. Calculate the mean mass (or alternatively the median mass) of all the m/z values classified inside the same cluster (mzroi).
4. Arrange mean mass values from the lowest to highest values.
5. Repeat steps 1–4 for the remaining scans, merge the resulting clusters within the mass error tolerance and update the calculated mean mass values.
6. Select clusters having a minimum number of occurrences of m/z values.
7. Fill the empty positions in the final MSROI matrix with low random values, such as values around 1% of the threshold intensity value used in step 1.
The ROI search yields three outputs: a vector containing the final mean m/z values of the ROIs (“mzroi” in Fig. 3b), a newly arranged data matrix containing the MS spectra of every scan in its rows and the chromatograms of every ROI in its columns (“MSROI” in Fig. 3b), and a cell array (“roicell” in Fig. 3b) containing information about the m/z values, retention times, MS intensities, scan numbers and the calculated mean/median m/z value for each ROI.
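The search steps and outputs described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions and not the authors' MATLAB ROIpeaks function: the function name roi_search, its arguments, the running-mean cluster matching, the uniform noise used to fill empty positions and the toy data are all hypothetical.

```python
import random

def roi_search(scans, thresh, mzerror, minroi, seed=0):
    """Simplified ROI search; each scan is a list of (mz, intensity) pairs."""
    rois = []  # one dict per candidate ROI (a stand-in for the roicell output)
    for scan_idx, peaks in enumerate(scans):
        for mz, inten in sorted(peaks):
            if inten <= thresh:                      # step 1: intensity threshold
                continue
            # steps 2 and 5: join an existing cluster within the mass tolerance
            roi = next((r for r in rois if abs(r["mean"] - mz) <= mzerror), None)
            if roi is None:
                roi = {"mz": [], "scan": [], "int": [], "mean": mz}
                rois.append(roi)
            roi["mz"].append(mz)
            roi["scan"].append(scan_idx)
            roi["int"].append(inten)
            roi["mean"] = sum(roi["mz"]) / len(roi["mz"])  # step 3: update mean mass
    rois = [r for r in rois if len(r["mz"]) >= minroi]     # step 6: minimum occurrences
    rois.sort(key=lambda r: r["mean"])                     # step 4: sort by mean mass
    mzroi = [r["mean"] for r in rois]
    msroi = [[None] * len(rois) for _ in scans]            # scans in rows, ROIs in columns
    for j, r in enumerate(rois):
        for s, inten in zip(r["scan"], r["int"]):
            msroi[s][j] = inten if msroi[s][j] is None else msroi[s][j] + inten
    rng = random.Random(seed)
    for row in msroi:                                      # step 7: fill empty positions
        for j, value in enumerate(row):
            if value is None:
                row[j] = rng.uniform(0.0, 0.02 * thresh)   # assumed noise, mean 1% of thresh
    return mzroi, msroi, rois

# Toy data: one ion in all four scans, one in three scans, one in a single scan,
# plus a weak signal below the threshold.
scans = [
    [(200.010, 500.0), (300.100, 150.0), (350.200, 300.0), (410.500, 5.0)],
    [(200.012, 800.0), (300.102, 250.0)],
    [(200.008, 600.0), (300.098, 180.0), (410.500, 4.0)],
    [(200.011, 200.0)],
]
mzroi, msroi, roicell = roi_search(scans, thresh=100.0, mzerror=0.05, minroi=3)
```

With these settings only the two ions seen in at least three scans are kept, and the single empty position in the second column is filled with a low random value.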
ROI search in more than one LC-MS sample
Since the main purpose of metabolomics is to study the differences in metabolic profiles between multiple samples (e.g., controls vs. exposed), the final data analysis must consider all samples simultaneously. In fact, an MCR-ALS analysis of multiple samples requires the construction of column-wise augmented data matrices (see Simultaneous MCR-ALS analysis of multiple samples section). The construction of these matrices is only possible when dimensions in the m/z mode of all individual data matrices are the same. However, data compression using the ROI strategy produces data matrices with m/z mode dimensions equal to the number of ROIs, which can vary between samples. Thus, a final unification of ROIs among samples, considering both common and uncommon mzroi values, must be performed.
1. Check mzroi values between the two data matrices within the mass error tolerance, +/− mzerror. Consider the new mzroi to be the average of these values.
2. Build the new column-wise augmented data matrix with MS intensity values of the coincident mzroi values (if more than one mzroi value is coincident, then consider the sum of the MS intensity values).
3. Examine non-matching mzroi values; these values are accepted if their MS intensity is greater than the preselected threshold value. For the non-coincident mzroi values, replace empty values with random values at a low percentage (e.g., 1%) of the threshold intensity value.
4. Eliminate the non-coincident mzroi values whose MS intensity is less than the threshold.
5. Reorganize the columns of the new augmented data matrix according to the new mzroi values, from lower to higher mzroi values.
6. Store output variables and plot ROI augmented matrices.
Thus, the required input information to perform ROI augmentation consists of the arrays of samples to be augmented, including m/z values (mzroi matrices) and MS intensities (MSROI matrices), the admissible mass deviation, the threshold intensity value and the vector containing the retention times. The output variables consist of a vector containing final mean m/z values of common and uncommon ROIs, the final augmented ROI matrix containing compressed data of all the input files and a vector containing the total number of scans (i.e., sum of the number of retention times of individual samples).
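The unification of ROIs between two samples can be sketched as follows. This is a minimal Python illustration and not the authors' MSROIaug function: the name roi_augment, its arguments, the rule of testing the maximum intensity of a column against the threshold and the uniform noise fill are assumptions made for the example.

```python
import random

def column(matrix, j):
    """Return column j of a matrix stored as a list of rows."""
    return [row[j] for row in matrix]

def roi_augment(mzroi1, msroi1, mzroi2, msroi2, mzerror, thresh, seed=0):
    """Column-wise augmentation of two MSROI matrices (scans in rows, ROIs in columns)."""
    rng = random.Random(seed)
    n1, n2 = len(msroi1), len(msroi2)

    def noise(n):
        # assumed fill for missing ROIs: uniform values with mean 1% of the threshold
        return [rng.uniform(0.0, 0.02 * thresh) for _ in range(n)]

    used2 = set()
    cols = []  # (new mzroi value, stacked column of length n1 + n2)
    for j1, mz1 in enumerate(mzroi1):
        hits = [j2 for j2, mz2 in enumerate(mzroi2)
                if j2 not in used2 and abs(mz1 - mz2) <= mzerror]
        if hits:                                   # coincident ROIs: average m/z, sum signal
            used2.update(hits)
            mz_new = (mz1 + sum(mzroi2[j] for j in hits)) / (1 + len(hits))
            lower = [sum(msroi2[i][j] for j in hits) for i in range(n2)]
            cols.append((mz_new, column(msroi1, j1) + lower))
        elif max(column(msroi1, j1)) > thresh:     # non-coincident but intense: keep
            cols.append((mz1, column(msroi1, j1) + noise(n2)))
    for j2, mz2 in enumerate(mzroi2):
        if j2 not in used2 and max(column(msroi2, j2)) > thresh:
            cols.append((mz2, noise(n1) + column(msroi2, j2)))
    cols.sort(key=lambda c: c[0])                  # order columns by increasing m/z
    mzroi_aug = [mz for mz, _ in cols]
    msroi_aug = [[c[1][i] for c in cols] for i in range(n1 + n2)]
    return mzroi_aug, msroi_aug

# Toy example: sample 1 has 2 scans, sample 2 has 3 scans.
mzroi1, msroi1 = [200.010, 300.100], [[500.0, 150.0], [800.0, 250.0]]
mzroi2 = [200.020, 410.300, 520.000]
msroi2 = [[400.0, 900.0, 30.0], [700.0, 50.0, 20.0], [300.0, 20.0, 10.0]]
mzroi_aug, msroi_aug = roi_augment(mzroi1, msroi1, mzroi2, msroi2,
                                   mzerror=0.05, thresh=100.0)
```

The augmented matrix has as many rows as the total number of scans of both samples, and the weak non-coincident ROI at m/z 520 is discarded.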
Multivariate curve resolution-alternating least squares (MCR-ALS)
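The bilinear model referred to as Eq. (1) is not reproduced in this text. Based on the matrix names and dimensions given in the following paragraph, it can be reconstructed in the usual MCR notation as:

```latex
\mathbf{D} = \mathbf{C}\,\mathbf{S}^{T} + \mathbf{E} \qquad (1)
```

Here D and E have dimensions I x J, C has dimensions I x N and S^T has dimensions N x J, as described in the text.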
In Eq. (1), matrix D (I x J) represents the spectral dataset derived from the output of a mass spectrometer. For LC-MS data, matrix D includes the MS spectra measured at all chromatographic retention times (i = 1, … I) in its rows and the elution profiles at the complete range of m/z channels (j = 1, … J) in its columns. This matrix is decomposed into the product of two small factor matrices, C and S^T. The C (I x N) matrix contains column vectors that correspond to the concentration elution profiles of the N (n = 1, …, N) pure chemical constituents or components of matrix D. In the S^T (N x J) matrix, row vectors correspond to the MS spectra of these N pure components. The fraction of D that is not described by the bilinear model constitutes the residual matrix E (I x J). MCR-ALS methods presume that the measured variance in all samples in the raw dataset is explained using a combination of a relatively small number of chemically significant profiles compared to the number of measured variables (in this case, the number of ROIs). For LC-MS datasets, the variance observed in the investigated data matrices is explained by the combination of a number of components defined by their pure mass spectra (row profiles in the S^T matrix) weighted by their concentration profiles (elution profiles in the C matrix), as given in Eq. (1). Every component resolved by MCR-ALS is characterized by its unique MS spectrum and its elution profile, which are interpreted directly. The C and S^T solutions of Eq. (1) are obtained using an alternating least squares (ALS) optimization under preselected constraints [1, 3, 22, 23, 24, 25]. In the case of LC-MS data, due to the sparsity of the MS data, non-negativity constraints on the elution and mass spectra profiles of the resolved components already provide good solutions for C and S^T, although other constraints may be applied to the profiles of the resolved components, such as unimodality and local rank or selectivity constraints.
The MCR-ALS method has been described in previous studies and applied to different types of datasets [1, 3, 22, 23, 24, 25].
The number of metabolites/lipids that is ultimately resolved by the proposed procedure will depend on different experimental parameters, such as the efficiency of metabolite extraction, the suitability of the chromatographic column, the resolution power and signal-to-noise ratio of the mass spectrometer, and the size of the elution time window analyzed. The number of selected components in the ROIMCR procedure, N, should be sufficiently large to capture all data features related to metabolites. Unavoidably, in addition to the metabolites, other MS signal contributions (background, solvent, etc.) are simultaneously resolved and yield extra components. Therefore, the recommendation is to select a number of components that is sufficiently large to explain most of the variance in the experimental data. The total number of components resolved using MCR-ALS is limited by the intrinsic mathematical structure of the dataset analyzed. MCR-ALS uses linear algebra operations to solve (using a least squares method) the system of linear equations involved in the assumed bilinear model (Eq. (1)) used to analyze the experimental data. The solution of this model implies the (pseudo)inversion of matrices C and S^T, and therefore implies that their columns and rows, respectively, are linearly independent. This solution is also related to the rank of the experimental data matrix D. Different datasets will enable the resolution of a different number of components. If the number of components proposed is too large, the inversion of the C and S^T matrices is not possible due to rank deficiency problems. Occasionally, the precise definition of the best number of components is difficult to obtain due to the experimental noise; nevertheless, extra components that are only related to noise will yield elution and spectra profiles whose shapes are unfeasible from a chemical perspective and that explain very little data variance.
Additional components should only be added if they produce a significant increase in the explained data variance, and they should have well-shaped single-peak elution profiles and sparse MS spectra signals. Once the results are obtained, every resolved component is examined to confirm its reliability and for its identification (MS) and relative quantitation (elution profiles). This output examination is performed individually, component by component. Residuals are also examined to determine whether some well-shaped chromatographic peak signals are still present. In some cases, minor components with a very low contribution that is very close to the noise level cannot be distinguished from background noise in the residuals. This situation is a possible limitation of untargeted metabolomic approaches. However, most untargeted metabolomic studies focus on changes in the concentrations of the metabolites caused by the investigated stress conditions, not on their absolute concentrations. Another possible alternative, in some cases, is to subdivide the whole chromatographic run into different time windows and submit each of them to a deeper MCR-ALS analysis, where the presence of minor components is analyzed more extensively.
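The alternating least squares idea can be sketched in plain Python as below. This is a toy illustration and not the MCR-ALS toolbox: non-negativity is imposed here by simply setting negative values to zero after each unconstrained least squares step (dedicated non-negative least squares solvers are a common alternative), the spectra are normalized to unit maximum, and the two-component data, the helper names and the choice of initial spectra from two pure scans are assumptions made for the example.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def solve(A, B):
    """Solve A X = B for a square A by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            raise ValueError("rank deficiency: too many components for these data")
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

def clip(A):
    """Crude non-negativity constraint: set negative values to zero."""
    return [[max(v, 0.0) for v in row] for row in A]

def update_C(D, St):
    # C^T = (S^T S)^-1 S^T D^T, followed by non-negativity
    Ct = clip(solve(matmul(St, transpose(St)), matmul(St, transpose(D))))
    return transpose(Ct)

def mcr_als(D, St0, n_iter=20):
    St = [list(row) for row in St0]
    for _ in range(n_iter):
        C = update_C(D, St)
        Ct = transpose(C)
        # S^T = (C^T C)^-1 C^T D, followed by non-negativity
        St = clip(solve(matmul(Ct, C), matmul(Ct, D)))
        St = [[v / (max(row) or 1.0) for v in row] for row in St]  # unit-maximum spectra
    return update_C(D, St), St

def lack_of_fit(D, C, St):
    R = matmul(C, St)
    sse = sum((d - r) ** 2 for dr, rr in zip(D, R) for d, r in zip(dr, rr))
    ssd = sum(d ** 2 for dr in D for d in dr)
    return 100.0 * (sse / ssd) ** 0.5

# Toy data: two coeluting components (10 scans) measured at 4 m/z channels
c1 = [0, 1, 3, 5, 3, 1, 0, 0, 0, 0]
c2 = [0, 0, 0, 0, 2, 4, 6, 4, 2, 0]
s1 = [1.0, 0.0, 0.5, 0.0]
s2 = [0.0, 1.0, 0.2, 0.3]
D = [[a * x + b * y for x, y in zip(s1, s2)] for a, b in zip(c1, c2)]

# Initial spectra estimates taken from two scans where only one component elutes
C, St = mcr_als(D, St0=[D[3], D[6]])
lof = lack_of_fit(D, C, St)
```

In this noise-free example the two elution profiles and spectra are recovered although the peaks overlap in scans 4 and 5, without any alignment or peak shape model. Requesting more components than the data support would trigger the rank deficiency error, which mirrors the limitation discussed above.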
Simultaneous MCR-ALS analysis of multiple samples
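The extension of the bilinear model to the column-wise augmented data matrix, referred to as Eq. (2), is not reproduced in this text. For K samples it can be reconstructed from the surrounding definitions as follows, where the symbol E_aug for the augmented residual matrix is an assumed notation:

```latex
\mathbf{D}_{aug} =
\begin{bmatrix} \mathbf{D}_{1} \\ \mathbf{D}_{2} \\ \vdots \\ \mathbf{D}_{K} \end{bmatrix}
=
\begin{bmatrix} \mathbf{C}_{1} \\ \mathbf{C}_{2} \\ \vdots \\ \mathbf{C}_{K} \end{bmatrix}
\mathbf{S}^{T}
+
\begin{bmatrix} \mathbf{E}_{1} \\ \mathbf{E}_{2} \\ \vdots \\ \mathbf{E}_{K} \end{bmatrix}
= \mathbf{C}_{aug}\,\mathbf{S}^{T} + \mathbf{E}_{aug} \qquad (2)
```

Each block D_k holds the ROI-compressed data of one chromatographic run, so all blocks must share the same m/z columns.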
In this case, resolved pure mass spectra (S^T) are the same for all simultaneously analyzed chromatographic runs or experiments, while elution profiles (Caug) can vary from run to run.
In the MCR-ALS method, the bilinear models described in Eq. (1) (single data matrix) or Eq. (2) (augmented data matrix) are resolved using an alternating least squares optimization approach under constraints. In both cases, when considering metabolomic LC-MS data, the minimum constraints to apply consist of non-negativity for the concentration (elution), C or Caug, and spectra, S^T, profiles, and normalization for the latter. Due to the sparse nature of the MCR-resolved elution profiles, and particularly of the MS spectra profiles, no additional constraints are required to achieve reliable results.
In the proposed ROIMCR procedure, individual or augmented MSROI data matrices (D or Daug) are submitted for MCR-ALS analysis. The application of this method will provide the concentration/elution, C (or Caug), and MS spectra, S^T, profiles of the resolved components. Notably, in the MCR-ALS procedure, elution profiles in Caug are not required to be aligned or shape modelled among different samples (chromatographic runs), and spectra profiles are the filtered MSROI-compressed spectra with the full instrument mass accuracy. Peak areas are calculated by integrating (by numerical summation) the values in the concentration (elution) profiles resolved using MCR-ALS. These profiles are located in the columns of the Caug matrix (Eq. (2)) for every simultaneously analyzed sample. Depending on the acquisition rate of the LC-MS instrument, a peak profile is digitized with a different number of values, usually at least 5 and in many cases more than 10 intensity values. If the concentration profile does not have a peak shape, it is discarded and not considered. Most, but not all, of the elution profiles resolved using MCR-ALS have a good peak shape. For instance, background, solvent, and other spurious signals do not display a good peak shape and are not further considered. The number of components in the analysis of the Daug matrix (simultaneous analysis of multiple samples or datasets) is selected in a similar manner as described above for the analysis of a single dataset, after considering the increased complexity of the augmented data matrix Daug compared to the individual Dk matrices (see Fig. 4). Again, a more detailed description of the MCR-ALS method and the implementation of different constraints is presented in previous publications [1, 3, 22, 23, 24, 25].
The dataset used to illustrate the performance of the current methodology was obtained from a previous study performed by the authors [16, 17], in which LC-MS data were acquired for lipids extracted from human placental choriocarcinoma cells (JEG-3) that were exposed to DMSO (vehicle controls) and to a non-lethal dose of the chemical endocrine disruptor TBT (exposed samples) for 24 h. Both groups (i.e., controls and exposed) contain three replicates. These raw datasets are available in CDF format at http://cidtransfer.cid.csic.es/descarga.php?enlace1=5792320ab8143eca122f4cf7dbb68cd40e2cf7.
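The peak area calculation by numerical summation can be sketched as follows. The function name peak_areas and the small Caug matrix are hypothetical; the sketch assumes that the number of scans of each sample is known, so that the augmented elution profiles can be split back into per-sample blocks.

```python
def peak_areas(C_aug, scans_per_sample):
    """Sum every resolved elution profile within each sample block of C_aug.

    Returns areas[k][n], the area of component n in sample k.
    """
    areas, start = [], 0
    for n_scans in scans_per_sample:
        block = C_aug[start:start + n_scans]
        areas.append([sum(col) for col in zip(*block)])
        start += n_scans
    return areas

# Hypothetical augmented elution profiles: 2 samples x 3 scans, 2 components
C_aug = [
    [0.0, 1.0],
    [2.0, 3.0],
    [0.0, 1.0],   # end of sample 1
    [0.0, 0.0],
    [4.0, 1.0],
    [0.0, 0.0],   # end of sample 2
]
areas = peak_areas(C_aug, scans_per_sample=[3, 3])
```

The resulting table of areas (samples in rows, components in columns) is the kind of input on which relative quantitation between sample groups can be based.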
Thus, the interested reader can use the data to test the ROIMCR procedure presented here. For details regarding the characteristics of the data, readers are advised to consult: https://www.nature.com/protocolexchange/protocols/4347.
Results of the application of the ROIMCR procedure to other datasets from recent studies [16, 17, 18, 19, 20, 21, 22, 26, 27, 28] are briefly described in “Applications of the ROIMCR procedure” section.
Implementation of the ROIMCR procedure
The ROI compression procedure presented in this study has been implemented as command line functions in the MATLAB environment available at http://cidtransfer.cid.csic.es/descarga.php?enlace1=298348e5b34daf9e844835352bafa645250ee1 and at www.mcrals.info.
A new user-friendly graphical interface for ROI compression is currently being developed and will be freely available at the same site. The provided MATLAB functions for ROI searching, filtering and compression are related to: a) ROI searching in one sample (ROIpeaks function); b) the evaluation of ROI profiles (ROIplot function), and c) the generation of augmented ROI data matrices (MSROIaug function). In addition, a statistical evaluation of the concentration profiles obtained after the MCR-ALS analysis may be performed (plot_profiles_table function). Regarding the implementation of MCR-ALS, its graphical user interface is also available at www.mcrals.info.
Although the dataset used as an example in the present study was already used in previous studies by the authors [16, 17], the results presented here were not presented in the previous publications and are specifically selected to show the key features of the ROIMCR methodology. These results include ROI searching of individual datasets, ROI data matrix augmentation and MCR-ALS analysis of the obtained augmented ROI matrix. Readers interested in the LC-MS data conversion and MATLAB import procedure are advised to consult https://www.nature.com/protocolexchange/protocols/4347.
ROI searching procedure
Optimization of ROI parameters
Table 1 Number of ROIs and computation time resulting from ROI searches performed with three different values of the input parameters (signal threshold in absolute units, a.u., mass error tolerance in Da/e, and minimum number of occurrences). The optimum values of the parameters are indicated in italics. The results shown were obtained by varying one parameter while the other two remained fixed at their optimum values.
[Table body not available. Columns: Parameters of the ROI search (signal threshold, a.u.; mass error tolerance, Da/e); Number of ROIs; Computational time (s).]
In the second case (see Table 1), which examined the effect of the admissible mass deviation on the ROI search, the three options tested corresponded to mzerror values of 0.5, 0.05 and 0.005 Da/e. The optimum mass deviation value should be halfway between an excessive and an insufficient mass accuracy. In this example case, with an mzerror value of 0.005 Da/e, peaks corresponding to the same ion were divided into distinct parts, whereas for a value greater than 0.5 Da/e, the opposite situation occurred, and peaks corresponding to distinct ions collapsed into the same chromatographic signal. Thus, the optimum mzerror value was set to 0.05 Da/e. The higher and lower values tested (0.5 and 0.005 Da/e, respectively) were again selected to easily visualize their effects on the final ROI selection. Similar to the threshold parameter, a decrease in the mzerror value increased the number of ROIs. In this case, however, the increase in ROI number was not as pronounced as for the threshold parameter, and the elapsed computation time was fairly constant for all calculations (see Table 1). In the third case (see Table 1), which evaluated the effect of the minimum number of occurrences on the ROI search, the three values tested corresponded to 100, 10 and 1. The minimum number of occurrences is directly related to the range of peak widths and to the detector speed, which differ between high-performance liquid chromatography (HPLC) (20–50 s) and ultra-high-performance liquid chromatography (UHPLC) (5–12 s) systems. In the current representative case, the system used to analyze the sample was an Acquity UHPLC system, and thus the optimum number of occurrences should correspond to a peak width in the range of 5–12 s. In particular, with this instrumentation, the interval between each occurrence was 0.63 s, and thus we selected 10 occurrences (i.e., 6.3 s) as the optimum value.
When considering the results obtained for the three values tested, the same trend observed for the other parameters was again detected: higher numbers of ROIs were obtained when the minimum number of occurrences decreased and lower numbers of ROIs were observed when it increased. As with the mzerror parameter, the increase in ROI number observed at a lower minimum number of occurrences was less substantial than for the threshold parameter, and the elapsed computational time was similar in the three calculations (see Table 1). The example presented here clearly illustrates the importance of the proper optimization of ROI parameters before the application of the method. It also highlights the influence of the particular instrumental specifications (e.g., mass accuracy) on these parameters.
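The relation between peak width, scan interval and the minimum number of occurrences can be checked with a short calculation. The helper occurrence_range is hypothetical and only converts the expected peak width range into a number of scans; the values 5–12 s, 0.63 s and 10 occurrences are those quoted in the text.

```python
import math

def occurrence_range(min_width_s, max_width_s, scan_interval_s):
    """Number of consecutive scans spanned by the narrowest and widest expected peaks."""
    low = math.ceil(min_width_s / scan_interval_s)
    high = math.floor(max_width_s / scan_interval_s)
    return low, high

SCAN_INTERVAL_S = 0.63                      # interval between occurrences (from the text)
low, high = occurrence_range(5.0, 12.0, SCAN_INTERVAL_S)
chosen = 10                                 # minimum number of occurrences selected
chosen_width_s = chosen * SCAN_INTERVAL_S   # time span covered by 10 occurrences
```

Under these assumptions a UHPLC peak of 5–12 s spans roughly 8 to 19 scans, so the selected value of 10 occurrences (6.3 s) falls inside the expected range.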
Evaluation of ROI profiles
Once the optimum parameters for the ROI search were selected, the augmentation was performed and a final augmented ROI matrix was generated. The dimensions of that matrix were (11,394 × 481), the x-dimension corresponding to six times the number of retention times of one sample (i.e., 1899) and the y-dimension corresponding to the total number of common and uncommon ROIs among the six samples.
ROI profiles versus feature profiles of XCMS
Various forms (X) of chromatography and mass spectrometry (XCMS) is a popular data analysis software package used by the metabolomics community that enables the automatic processing of large full-scan LC-MS datasets and the prediction of candidate metabolites using mass identification and retention time algorithms. It is based on feature detection, where a “feature” is defined as a single m/z measurement of the mass spectrometer. A general XCMS analysis starts with the application of the centWave data processing algorithm, which first identifies features using the ROI approach and then models the obtained chromatographic peaks using a wavelet transformation and a Gaussian shape curve fitting strategy. In the last step, alignment algorithms (such as obiwarp) are used to align the chromatographic peaks of the same feature among distinct samples.
Comparison of ROI search results obtained using our MATLAB routines and the centWave algorithm of the XCMS package of Workflow4Metabolomics (W4W)
[Table body not available. Columns: m/z error tolerance; Number of ROIs (ROI search using home-made MATLAB routines; ROI search using the centWave algorithm of XCMS of W4W); Number of coincident ROIs.]
In addition to the comparison performed here, other recent studies comparing the performance of XCMS software to an ROI search followed by MCR resolution are presented in the literature. Recently, the proposed procedure was tested in different studies, where the complexity of the analyzed samples was considerably greater and the number of samples larger (see Navarro-Reig et al. and other citations listed above [16, 17, 18, 19, 20, 21]). We have also validated the procedure for quantitative purposes in Dalmau et al. All these results have confirmed the adequacy of the proposed ROIMCR strategy to analyze metabolomic data, leading to very similar conclusions in both cases (XCMS and ROIMCR).
A peak alignment strategy is not required (needed in XCMS).
The shape of chromatographic peaks/elution profiles does not need to be modeled (needed in XCMS).
All features in the mass spectrum of one metabolite/lipid are directly resolved in the same MCR component. The assignment of different features to the same component spectrum is unnecessary (needed in XCMS). The CAMERA procedure is not required (needed in XCMS).
The full mass accuracy of MS measurements is preserved (the ROI searching procedure is similar to the one used in XCMS).
Signal filtering and data compression properties are also derived from the ROIMCR procedure.
The open source code is available (see the links below).
Data resolution using the MCR-ALS analysis
Once the augmented data matrix of ROI compressed data from the six samples has been constructed, the next required step is the MCR-ALS analysis.
The selection of the number of pure components is the first step in the MCR-ALS analysis. As described in “Multivariate curve resolution-alternating least squares (MCR-ALS)” section, the optimum number of MCR-ALS components should be sufficiently large to explain all the chromatographic peaks, the background (e.g., solvent), and contributions from other unknown signals. Any increase in the number of components should produce a significant reduction in the lack of fit and a corresponding increase in the explained variance. Otherwise, no other components should be added to the calculation. In the example presented here, the number of components was proposed to be 50 for the MCR-ALS analysis of the augmented matrix, resulting in a less than 7% lack of fit and 96.5% of the variance was explained. A larger number of components did not significantly improve the lack of fit or model new chromatographic peaks. The difference between the larger number of ROIs (481) compared to the smaller number of MCR components resolved (50) has two potential explanations. The first explanation is that not all ROIs will produce different MCR components within an elution profile and a mass spectrum profile characteristic of a metabolite or lipid. In addition, another important explanation is in MCR-ALS, various ROI (features) are grouped into the same component. ROIs grouped into the same MCR-ALS component generally include isotope and adduct peaks. In fact, the capability of MCR-ALS to group features corresponding to the same component (metabolite/lipid) is one of the most distinguishing and advantageous aspects of our ROIMCR methodology compared to other tools such as XCMS that associate each feature with a unique m/z. For this reason, another package, CAMERA [7, 31], has been developed to search for features that correspond to the same compound. 
In the present study, we used the CAMERA package of W4W to search for ROIs obtained with the centWave algorithm that corresponded to the same compound. The CAMERA search grouped the initial 300 ROIs into 194 components. This number is larger than the number of components resolved with MCR, which might be attributed to the fact that not all of the 194 components correspond to distinct chromatographic peaks; further grouping would be needed, which is a laborious task. The final lists of lipids or metabolites obtained using the two methods should ultimately be comparable, which requires their identification based on their exact mass or another analytical strategy (see Biomarker discovery section).
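The idea that isotope and adduct features of one compound share the same elution profile, and can therefore be collapsed into a single component, can be illustrated with a minimal sketch. This is neither the CAMERA algorithm nor MCR-ALS: it is a simple greedy grouping by Pearson correlation of elution traces, and the trace values, the labels and the 0.95 threshold are hypothetical choices made only for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long intensity traces."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat trace carries no shape information
    return sxy / sqrt(sxx * syy)

def group_coeluting(traces, threshold=0.95):
    """Greedy grouping: a feature joins the first group whose first member
    has a correlated elution trace; otherwise it starts a new group."""
    groups = []
    for label, trace in traces.items():
        for group in groups:
            if pearson(traces[group[0]], trace) >= threshold:
                group.append(label)
                break
        else:
            groups.append([label])
    return groups

# Hypothetical elution traces (intensity versus scan) for three ROIs.
traces = {
    "ROI 872.77": [0, 1, 5, 9, 5, 1, 0, 0, 0],            # monoisotopic peak
    "ROI 873.77": [0, 0.6, 3, 5.4, 3, 0.6, 0, 0, 0],      # same shape, scaled (isotope-like)
    "ROI 760.58": [0, 0, 0, 0, 1, 4, 8, 4, 1],            # a later-eluting compound
}

groups = group_coeluting(traces)
for group in groups:
    print(group)
```

In this toy case the two co-eluting traces end up in one group and the later-eluting trace in another, which mirrors how several ROIs can map onto one resolved component.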
Importantly, because MS spectra are sparse and highly selective, their resolution has little ambiguity [32, 33], and a possible underestimation of the number of MCR-ALS components will not cause a misinterpretation of the results but only a small loss of information. In that case, the final interpretation will only be provided for the components that are ultimately resolved. As previously explained in the “Multivariate curve resolution-alternating least squares (MCR-ALS)” section, another possibility to resolve metabolites with very low signal contributions in LC-MS untargeted studies is to divide the whole dataset into shorter elution time windows.
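The two figures of merit used above to decide on the number of components can be sketched as follows. The sketch assumes the commonly used definitions, lack of fit = 100·sqrt(Σe²/Σd²) and explained variance = 100·(1 − Σe²/Σd²), where d are the elements of the data matrix and e the residuals of the bilinear reconstruction C·Sᵀ; the software used in the study may compute them differently. The matrices are hypothetical toy values, and the function names are illustrative only.

```python
from math import sqrt

def reconstruct(C, S_T):
    """Multiply C (rows x components) by S_T (components x columns) using nested lists."""
    n_comp, n_cols = len(S_T), len(S_T[0])
    return [[sum(row[k] * S_T[k][j] for k in range(n_comp)) for j in range(n_cols)]
            for row in C]

def fit_quality(D, D_hat):
    """Return (lack of fit in %, explained variance in %) of D_hat as a model of D."""
    ss_res = sum((d - h) ** 2 for rd, rh in zip(D, D_hat) for d, h in zip(rd, rh))
    ss_tot = sum(d ** 2 for rd in D for d in rd)
    lof = 100.0 * sqrt(ss_res / ss_tot)
    explained = 100.0 * (1.0 - ss_res / ss_tot)
    return lof, explained

# Hypothetical 3 x 2 data matrix modelled with a single component.
D = [[2.0, 4.0], [1.0, 2.0], [0.0, 1.0]]
C = [[2.0], [1.0], [0.0]]      # elution (concentration) profile
S_T = [[1.0, 2.0]]             # spectrum profile

lof, explained = fit_quality(D, reconstruct(C, S_T))
print(f"lack of fit = {lof:.2f} %, explained variance = {explained:.2f} %")
```

In practice these two values would be recomputed after each increase in the number of components, and components would stop being added once neither value changes appreciably.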
Concentrations and spectral profiles of the resolved MCR-ALS components are finally used for biomarker assessments. However, a subsequent statistical analysis is required to identify the most relevant MCR-ALS components (i.e., the components that vary significantly between control and stressed samples). Distinct statistical tests have been used for this evaluation, such as the classical Student’s t-test, which was used in the present study. This test, together with other statistical tests, may be performed using the functions and protocol available at https://www.nature.com/protocolexchange/protocols/4347.
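As an illustration of this step, the following minimal sketch applies a two-sample Student’s t-test with pooled variance to the peak areas of one resolved component and also reports the fold change. It is not the protocol's MATLAB code. The peak areas are hypothetical, the split into three controls and three exposed samples is an assumption, and significance is judged against tabulated two-sided critical values at the 0.05 level rather than by computing an exact p-value.

```python
from math import sqrt
from statistics import mean, variance

# Two-sided critical values of Student's t at alpha = 0.05, by degrees of freedom.
T_CRIT_05 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}

def student_t(group_a, group_b):
    """Pooled-variance (equal variance) t statistic of group_b versus group_a."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    t = (mean(group_b) - mean(group_a)) / sqrt(pooled * (1 / n_a + 1 / n_b))
    return t, df

def fold_change(group_a, group_b):
    """Ratio of the mean of group_b to the mean of group_a."""
    return mean(group_b) / mean(group_a)

# Hypothetical peak areas of one MCR-ALS component (arbitrary units).
controls = [1.0, 1.2, 0.8]
exposed = [3.4, 3.6, 3.5]

t, df = student_t(controls, exposed)
significant = abs(t) > T_CRIT_05[df]
fc = fold_change(controls, exposed)
print(f"t = {t:.2f}, df = {df}, significant at 0.05: {significant}, fold change = {fc:.1f}")
```

The same function would be applied to every resolved component in turn; when many components are tested, the significance threshold should be adjusted for multiple comparisons.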
A classical Student’s t-test was performed on each component, using a p-value below 0.05 as the criterion to evaluate the significance of these changes. The results of the test revealed significant changes in the peak heights of three components between the two groups (i.e., controls and exposed), suggesting that they represent potential biomarkers of TBT exposure. When needed, multiple comparison procedures (MCPs) can be applied to avoid false positives. These statistical procedures are intended to control a shared or joint measure of erroneous inferences when many tests are performed simultaneously. Alternatively, ANOVA and its multivariate extensions for well-designed data have been applied [35, 36] to better ascertain the reliability of the observed effects of TBT exposure. Additionally, the fold-changes for the three components were calculated (Fig. 6a), resulting in 3.5-fold, 4.5-fold and 4.0-fold changes for components A, B and C, respectively. The MS spectra profiles were evaluated to identify the lipid species corresponding to these MCR-ALS components. As shown in Fig. 6b, the exact masses of components A, B and C were 872.7702, 874.7857 and 902.8171 Da/e, respectively. Further identification using MS databases such as Lipid Maps is also possible. As shown in the same figure, components A, B and C corresponded to triacylglycerol species 52:4, 52:3 and 54:3, respectively. Notably, this identification was made possible to a large extent because no loss of mass spectral information occurred after ROI compression.
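The assignment of these exact masses to triacylglycerol (TAG) species can be checked with a short calculation. The sketch below assumes that the ions are ammonium adducts, [M+NH4]+, which is an assumption not stated in this section, and uses the general TAG formula C(n+3)H(2(n+3)−4−2d)O6 for n acyl carbons and d double bonds. The function names are illustrative, and the atomic masses are rounded monoisotopic values.

```python
# Rounded monoisotopic masses (u) of the most abundant isotopes, and the electron mass.
MASS = {"C": 12.0, "H": 1.007825032, "N": 14.003074004, "O": 15.994914620}
ELECTRON = 0.000548580

def tag_formula(acyl_carbons, double_bonds):
    """Elemental composition of a neutral triacylglycerol, e.g. TAG 52:4 -> C55H98O6."""
    carbons = acyl_carbons + 3  # three glycerol carbons
    hydrogens = 2 * carbons - 4 - 2 * double_bonds
    return {"C": carbons, "H": hydrogens, "O": 6}

def ammonium_adduct_mz(formula):
    """m/z of the singly charged [M+NH4]+ ion (assumed adduct type)."""
    neutral = sum(MASS[element] * count for element, count in formula.items())
    return neutral + MASS["N"] + 4 * MASS["H"] - ELECTRON

def ppm_error(measured, theoretical):
    """Relative mass error in parts per million."""
    return 1e6 * (measured - theoretical) / theoretical

# Exact masses reported for components A, B and C and their assigned TAG species.
reported = {(52, 4): 872.7702, (52, 3): 874.7857, (54, 3): 902.8171}

for (carbons, bonds), measured in reported.items():
    theoretical = ammonium_adduct_mz(tag_formula(carbons, bonds))
    print(f"TAG {carbons}:{bonds}  theoretical m/z = {theoretical:.4f}  "
          f"error = {ppm_error(measured, theoretical):+.2f} ppm")
```

Under the ammonium adduct assumption, the theoretical values agree with the three masses quoted above to within about 0.0002, which is consistent with the proposed TAG assignments.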
Applications of the ROIMCR procedure
In previous sections of this study, the different methodological aspects of the ROIMCR procedure have been described in detail using a single dataset as an example. In previous and simultaneously performed recent studies [16, 17, 18, 19, 20, 21, 22, 26, 27], the ROIMCR procedure has been applied to diverse datasets and scenarios, such as a recent investigation of the rice metabolome using LCxLC-MS/MS. In that study, the ROIMCR procedure was applied to the different modulations of the second LC column to analyze several samples arranged in a super column-wise augmented data matrix. The number of components resolved in the MCR-ALS analysis of this super augmented data matrix was 250. The ROIMCR method determined which of the mass traces belonged to every resolved metabolite, and the resolution of the sample metabolites in the entire dataset was much faster with this method than with other traditional strategies based on the analysis of each component separately. In another recent study, the effects of different endocrine disruptors (EDC) on zebrafish (Danio rerio) embryos were investigated using an untargeted LC-HRMS (Exactive Orbitrap) metabolomic analysis. In this case, 25 zebrafish embryos (5 replicates for each of the 5 applied chemical doses) were simultaneously analyzed using the ROIMCR method for every EDC treatment. Eighty-six to 110 MCR-ALS components were resolved, depending on the EDC used, and the corresponding changes in the metabolite concentrations suggested the presence of similar underlying zebrafish responses to the different investigated EDCs. The underlying metabolomic and lipidomic patterns linked to thermal acclimation in Saccharomyces cerevisiae were investigated in another study using a combination of 1H NMR and LC-MS. In this example, the application of the ROIMCR procedure allowed for more than a 100-fold reduction in the computer storage requirement while maintaining the highest possible experimental mass accuracy.
Twenty-four yeast samples cultured at different temperatures were simultaneously analyzed and produced 80 tentative lipid candidates in the ESI+ mode and another 50 lipids in the ESI- mode of MS. In another recent study, the proposed ROIMCR LC-MS approach facilitated an assessment of the effects of acute and chronic UV irradiation on the phenotype and lipidomic profiles of keratinocytes. Finally, a similar ROIMCR strategy was applied to the simultaneous analysis of multiple mass spectra from plants to investigate the changes in lipid composition induced by the application of the chlorpyrifos pesticide. MS data from 20 samples receiving each treatment (4 doses with 5 replicates) at different growth stages were simultaneously analyzed and provided information about the changes in the spatial composition and distribution of different lipids on the surface of the investigated samples, which were also identified.
Finally, as an additional confirmation of the advantages of the ROIMCR procedure, the results obtained in the analysis of a new dataset are provided here to complete the assessment of this method. This dataset was obtained in a previous study in which three tissues (brain, gonads and gastrointestinal tract) were obtained from male and female zebrafish exposed to low dietary doses of four different carbon-based nanoparticles (CBNs): C60 fullerene (C60), single-walled carbon nanotubes (SWCNT), short multi-walled carbon nanotubes (ShWCNT), and long multi-walled carbon nanotubes (LWCNT). The lipid extracts of these samples were analyzed using LC-MS. The data produced in these LC-MS analyses were processed using the ROIMCR procedure (see Additional file 2: Figure S1, Additional file 3: Figure S2 and Additional file 4: Figure S3). One hundred fifty components were simultaneously resolved, explaining 99.2% of the total data variance. Each of these components was described by one elution profile and one mass spectrum profile. In the upper panel of Additional file 5: Figure S4, an example of MCR-ALS output results is provided for the resolution of the 150 components in the simultaneous analysis of the 80 zebrafish samples treated with the different carbon-based nanoparticles. The lower panel of Additional file 5: Figure S4 shows an example of the MCR-ALS-based resolution of component 2. This lipid component was identified from its mass spectrum as TAG 50:3, C53H100NO6, with an m/z value of 846.7474. Additional file 6: Figure S5 and Additional file 7: Figure S6 show the resolution of other MCR-ALS components, after their proper identification using lipid databases. In Additional file 6: Figure S5, the MCR-ALS components correspond to glycerolipid species, whereas in Additional file 7: Figure S6, they correspond to glycerophospholipid species.
In all cases, the selected components are the most representative biomarkers of the treatments with the distinct carbon nanoparticles (i.e., C60, SWCNT, ShWCNT and LWCNT) in the distinct zebrafish tissues (i.e., brain, gonads and gastrointestinal tract). As observed in these figures, the differences in the concentrations of MCR-ALS components between control and treated samples were very significant in some cases. For instance, in Additional file 6: Figure S5, concentrations of the TAG 54:3 and TAG 54:5 lipid species were up to 5-fold higher in the gonads of female controls compared to SWCNT-treated samples, as evaluated using MCR-ALS. In some other cases, however, no significant effect of the treatment was observed. An example is the brain tissue from females presented in Additional file 7: Figure S6, which showed very similar LC-MS elution profiles for the resolved MCR-ALS components in controls and in zebrafish treated with the distinct carbon nanoparticles. More details, and specifically a discussion of the results obtained in the study of this system, are provided in a published study available at https://doi.org/10.1093/mutage/gew050. Again, the valuable contribution of the MCR-ALS methodology to the evaluation of LC-MS omic profiles in target organisms exposed to environmental contaminants has been validated.
In summary, based on these previous studies, the applicability of the ROIMCR method has been confirmed for diverse metabolomic and lipidomic studies, and the method presents some advantages compared to other strategies, as explained in the “ROI searching procedure” and “Data resolution using the MCR-ALS analysis” sections of this paper. Moreover, XCMS and MCR-ALS data analysis strategies were also compared in other studies, such as the previously published LC-MS metabolomics data analysis protocol, the review article describing data analysis strategies for targeted and untargeted LC-MS metabolomics studies, the article describing the LC-MS investigation of the changes in the rice metabolome induced by Cd and Cu exposure, and finally, the recent validation of the ROIMCR method for untargeted qualitative and quantitative LC-MS lipidomic analyses. All these previous studies have confirmed the suitability of the ROIMCR method in the MS omics data analysis field.
The chemometric LC-MS data analysis strategy proposed in this study based on the ROIMCR procedure (ROI searching, filtering and compression followed by MCR-ALS analysis) has been shown to be a powerful approach to analyze LC-MS metabolomic datasets. On one hand, the principal benefit of performing the ROI filtering and compression steps is the capacity to minimize the primary dimensions of the data (gigabytes of storage) while preventing any loss of spectral accuracy. On the other hand, the main advantages attributed to the MCR-ALS analysis include: i) the possibility of immediate chemical identification of the metabolites based on the MS information provided in the analysis, ii) the high degree of interpretability of the results, iii) the flexibility in the structure and nature of the datasets that are potentially able to be analyzed and iv) the added value as a preprocessing method that does not require peak modelling or chromatographic alignment for the simultaneous analysis of multiple samples.
The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP/2007–2013) / ERC Grant Agreement n. 320737. The first author acknowledges the Spanish Government (Ministerio de Educación, Cultura y Deporte) for a predoctoral FPU scholarship. The authors acknowledge support of the publication fee by the CSIC Open Access Publication Support Initiative through its Unit of Information Resources for Research (URICI). Grant support from Generalitat de Catalunya 2017-SGR-753 and Spanish Ministry of Economy, Industry and Competitiveness (project CTQ2015–66254-C2–1-P) is also acknowledged.
Funding bodies did not play any role in the design of the study and collection, analysis, and interpretation of data and in writing this manuscript.
Availability of data and materials
ROIMCR has been implemented as a MATLAB package. ROIMCR MATLAB functions are available at http://cidtransfer.cid.csic.es/descarga.php?enlace1=298348e5b34daf9e844835352bafa645250ee1 and on our webpage at http://www.mcrals.info, and are described at https://www.nature.com/protocolexchange/protocols/4347.
Data examples to test the ROIMCR procedure are available as raw CDF data at: http://cidtransfer.cid.csic.es/descarga.php?enlace1=5792320ab8143eca122f4cf7dbb68cd40e2cf7.
EG wrote the manuscript, acquired the data used to test the methodology and participated in the evaluation of the efficacy of the strategy. JJ made substantial contributions in the development of the algorithm and revised the manuscript critically for important intellectual content. RT wrote and revised the manuscript, designed the ROIMCR algorithm and provided guidance on the implementation and the design of experiments. All the authors read, contributed to and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 5. Metabox by kwanjeeraw. https://kwanjeeraw.github.io/metabox/. Accessed 22 May 2018.
- 6. Welcome to metaX homepage! http://metax.genomics.cn/. Accessed 22 May 2018.
- 12. http://www.metalign.nl. Accessed 22 May 2018.
- 17. Navarro-Reig M, Jaumot J, Baglai A, Vivó-Truyols G, Schoenmakers PJ, Tauler R. Untargeted comprehensive two-dimensional liquid chromatography coupled with high-resolution mass spectrometry analysis of Rice metabolome using multivariate curve resolution. Anal Chem. 2017;89:7675–768.
- 22. Gorrochategui E, Jaumot J, Tauler R. A protocol for LC-MS metabolomic data processing using chemometric tools. Protoc. Exch; 2015.
- 27. Dalmau N, Bedia C, Tauler R. Validation of the regions of interest multivariate curve resolution (ROIMCR) procedure for untargeted LC-MS lipidomic analysis. Anal Chim Acta. 2018;1025:80–91.
- 28. Gorrochategui E, Li J, Fullwood NJ, Ying G, Tian M, Cui L, Shen H, Lacorte S, Tauler R, Martin FL. Diet-sourced carbon-based nanoparticles induce lipid alterations in tissues of zebrafish (Danio rerio) with genomic hypermethylation changes in brain. Mutagenesis. 2017;32:91–103. https://doi.org/10.1093/mutage/gew050.
- 30. Guitton Y, Tremblay-Franco M, Le Corguillé G, Martin J-F, Pétéra M, Roger-Mele P, Delabrière A, Goulitquer S, Monsoor M, Duperier C, et al. Create, run, share, publish, and reference your LC–MS, FIA–MS, GC–MS, and NMR data analysis workflows with the Workflow4Metabolomics 3.0 galaxy online infrastructure for metabolomics. Int J Biochem Cell Biol. 2017;93:89–101.
- 31. Patti GJ, Tautenhahn R, Rinehart D, Cho K, Nikolskiy I, Johnson C, Siuzdak G. A View from Above: The Cloud Plot for Visualizing Global Metabolomic Data. Anal Chem. 2013;85(2):798–804.
- 34. Hochberg Y, Tamhane AC, editors. Multiple comparison procedures. Wiley series in probability and statistics. Hoboken: Wiley; 1987.
- 37. Metabolomics for metabolomic analysis and metabolome study | Bruker. https://www.bruker.com/applications/life-sciences/metabolomics.html.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.