Functionality of MetaboShiny demonstrated using a test dataset
To demonstrate the functionality of MetaboShiny, we used a previously published LC–MS dataset of urine samples from 1005 patients with and without lung cancer. That study identified four m/z values at MSI level 1 that were predictive of cancer status and were confirmed through targeted mass spectrometry: 264 m/z (△ in figures), 308 m/z (□), 441 m/z ( ) and 561 m/z (◯). One of these, 264 m/z, was a novel compound called creatine riboside (Members and MSI Board Members 2007; Mathé et al. 2014).
As an initial analysis, we ran a t-test in MetaboShiny (Fig. 1a, b), which returned a list of the m/z values that differed most significantly between the control and disease groups. When we used MetaboShiny to search for potential annotations for the m/z value at the top of this list, a search including the HMDB (Fig. 1c, d) uncovered creatine riboside (M + H adduct) as a putative hit (Fig. 1f). The built-in HMDB compound description page (Fig. 1e) reports that this m/z feature was first discovered by Mathé et al. (2014) (Fig. 1g), which is consistent with the original finding. All this information can be retrieved with a few mouse clicks.
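The annotation step described above, matching an observed m/z against database compounds under an M + H adduct, can be illustrated with a minimal Python sketch. This is not MetaboShiny's implementation (MetaboShiny is an R package); the function names, the 5 ppm tolerance, and the candidate table with its neutral masses are hypothetical and chosen only for illustration.

```python
# Minimal sketch of [M+H]+ adduct matching within a ppm tolerance.
# All names, the tolerance, and the candidate masses are illustrative assumptions.

PROTON_MASS = 1.007276  # mass of a proton in Da


def mz_for_m_plus_h(neutral_mass):
    """Expected m/z of the singly charged [M+H]+ ion of a neutral compound."""
    return neutral_mass + PROTON_MASS


def ppm_error(observed_mz, expected_mz):
    """Signed mass error of the observation in parts per million."""
    return (observed_mz - expected_mz) / expected_mz * 1e6


def match_candidates(observed_mz, candidates, tol_ppm=5.0):
    """Return (name, expected m/z, ppm error) for candidates within tolerance,
    sorted by absolute ppm error (best match first)."""
    hits = []
    for name, neutral_mass in candidates.items():
        expected = mz_for_m_plus_h(neutral_mass)
        err = ppm_error(observed_mz, expected)
        if abs(err) <= tol_ppm:
            hits.append((name, expected, err))
    return sorted(hits, key=lambda hit: abs(hit[2]))


# Hypothetical database entries: compound name -> neutral monoisotopic mass (Da).
candidates = {
    "compound_A": 263.1117,
    "compound_B": 263.1500,
    "compound_C": 180.0634,
}

hits = match_candidates(264.1190, candidates)
for name, expected, err in hits:
    print(f"{name}: expected m/z {expected:.4f}, error {err:+.2f} ppm")
```

Only the candidate whose M + H m/z falls inside the tolerance window is reported; a real search would repeat this over several adducts and isotopes.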
Additionally, we explored PLS-DA. This yielded a model that achieved significant separation between the population and lung cancer groups (p < 0.03; Fig. 1h). The loading for PC1 contains the four m/z values of interest, ranked #1, #2, #24 and #86 respectively. These results indicate that the four compounds are not only significant in univariate t-tests but can also be used to train predictive models that stratify samples by cancer status.
MetaboShiny can also be used to perform more routine analyses such as fold-change analysis, volcano plots and correlation analyses. Furthermore, heatmaps and Venn diagrams are available.
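As a rough illustration of the univariate statistics behind such analyses, the following Python sketch computes a log2 fold change and a Welch t statistic per feature and ranks features by the absolute t statistic. It is a sketch under stated assumptions, not MetaboShiny's code: the feature names and intensities are hypothetical, and it stops at the t statistic rather than deriving the p-values that a volcano plot would display.

```python
# Minimal sketch: per-feature log2 fold change and Welch t statistic.
# Feature names and intensity values are hypothetical illustration data.
import math
import statistics


def log2_fold_change(case, control):
    """log2 of the ratio of group means (case over control)."""
    return math.log2(statistics.mean(case) / statistics.mean(control))


def welch_t(case, control):
    """Welch's t statistic, which does not assume equal group variances."""
    var_case = statistics.variance(case) / len(case)
    var_control = statistics.variance(control) / len(control)
    diff = statistics.mean(case) - statistics.mean(control)
    return diff / math.sqrt(var_case + var_control)


# Feature -> (case intensities, control intensities)
features = {
    "mz_A": ([8.0, 9.0, 10.0, 9.0], [2.0, 3.0, 2.0, 1.0]),
    "mz_B": ([5.0, 6.0, 5.0, 4.0], [5.0, 4.0, 6.0, 5.0]),
}

# Rank features by the magnitude of the t statistic, largest first.
ranked = sorted(features, key=lambda k: abs(welch_t(*features[k])), reverse=True)

for name in ranked:
    case, control = features[name]
    print(name, round(log2_fold_change(case, control), 3), round(welch_t(case, control), 3))
```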
Lastly, Mathé et al. trained a random forest model and combined the results of multiple classifiers to find compounds that were good predictors in multiple metadata groups (sex, race, and smoking status) (Mathé et al. 2014).
Using MetaboShiny, this analysis can be completely reproduced within an hour, demonstrating the utility of the software for rapid hypothesis generation and biomarker discovery.
Two of the four metabolites that Mathé et al. identified rank within the top 20 in terms of variable importance (#1 and #3 for 264 and 308 m/z respectively), with 264 m/z having the highest predictive value. Aside from m/z values, smoking status was a strong predictor for disease (ranked #2 by the random forest model, Fig. 1j). Additional analyses exploring further functionality such as subsetting, formula prediction and PubMed text mining are available in Supplemental Section S2.
Taken together, this demonstrates the power of MetaboShiny to achieve rapid hypothesis generation, which can be followed up with additional experiments.
Performance quantification of MetaboShiny
To quantify MetaboShiny’s speed, we benchmarked its database search performance. For an informative comparison, we created 300 databases spanning the range of real-world database sizes. For each database size, a random pool of 100 m/z values, evenly distributed over the 60–600 m/z range, was used, with each value searched individually.
The machine used was a MacBook Pro (13-inch, 2016) with 16 GB RAM and a 2.9 GHz Intel Core i5 processor. Searches for m/z values with more matches in the database tend to take longer, as MetaboShiny fetches additional information (description, molecular formula, name) when a match is found. Furthermore, SQLite occasionally performs self-maintenance (indexing, for example), and the test system may have been running background tasks at the same time; either is likely to account for some of the slower outliers. Even so, a single search at the current maximum database size did not exceed 0.8 s and generally stayed below half a second (Fig. 2a).
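The kind of measurement described above can be approximated with a short Python sketch that times indexed m/z range queries against an SQLite database. This is only an assumed setup for illustration: the table and column names, the in-memory database, the 5 ppm window, and the database size of 10,000 entries are hypothetical and do not reflect MetaboShiny's actual schema or benchmark code.

```python
# Minimal sketch: timing m/z range searches in SQLite.
# Schema, database size, and tolerance are illustrative assumptions.
import random
import sqlite3
import time


def build_database(n_compounds, seed=0):
    """Create an in-memory database of random m/z values with an index on mz."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE compounds (name TEXT, mz REAL)")
    rows = ((f"compound_{i}", rng.uniform(60.0, 600.0)) for i in range(n_compounds))
    conn.executemany("INSERT INTO compounds VALUES (?, ?)", rows)
    conn.execute("CREATE INDEX idx_mz ON compounds (mz)")
    conn.commit()
    return conn


def timed_search(conn, mz, tol_ppm=5.0):
    """Return the rows within tol_ppm of mz and the elapsed wall-clock time in seconds."""
    delta = mz * tol_ppm / 1e6
    start = time.perf_counter()
    rows = conn.execute(
        "SELECT name, mz FROM compounds WHERE mz BETWEEN ? AND ?",
        (mz - delta, mz + delta),
    ).fetchall()
    return rows, time.perf_counter() - start


conn = build_database(10_000)

# 100 query values drawn over the same 60-600 m/z range, each searched individually.
query_rng = random.Random(1)
queries = [query_rng.uniform(60.0, 600.0) for _ in range(100)]
timings = [timed_search(conn, mz)[1] for mz in queries]

print(f"median search time: {sorted(timings)[len(timings) // 2]:.6f} s")
print(f"slowest search:     {max(timings):.6f} s")
```

Repeating this over a range of database sizes would give a timing curve comparable in spirit to Fig. 2a, although absolute numbers will differ from those reported because no additional compound information is fetched here.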
Furthermore, we tested the time it takes to import and normalize a dataset, after which analysis can be started. The test dataset took five minutes to reach this point. We also generated synthetic datasets consisting of combinations of 10 to 10,000 samples and 10 to 2000 m/z values. Using the same normalization settings as for the test dataset, importing and normalization generally took less than 10 min, with the time increasing as more m/z values and/or samples were added to the dataset. Missing values were imputed using Random Forest; other normalization methods are expected to run substantially faster (Fig. 2b).
Post-annotation considerations
Putative identities of m/z values will still require validation through LC–MS/MS. MetaboShiny guides users to this step by prioritizing compounds based on their isotope and adduct status, and by using database compound descriptions to prioritize compounds based on what is biologically known (Coley et al. 2019).
Further plans to enhance MetaboShiny include refining in silico compound annotation and prioritization, expanding the number of available databases, and further streamlining data integration.