FELLA: an R package to enrich metabolomics data
Pathway enrichment techniques are useful for understanding experimental metabolomics data. Their purpose is to give context to the affected metabolites in terms of the prior knowledge contained in metabolic pathways. However, the interpretation of a prioritized pathway list is still challenging, as pathways show overlap and cross talk effects.
We introduce FELLA, an R package to perform a network-based enrichment of a list of affected metabolites. FELLA builds a hierarchical representation of an organism's biochemistry from the Kyoto Encyclopedia of Genes and Genomes (KEGG), containing pathways, modules, enzymes, reactions and metabolites. In addition to providing a list of pathways, FELLA reports intermediate entities (modules, enzymes, reactions) that link the input metabolites to them. This sheds light on pathway cross talk and on potential enzymes or metabolites as targets for the condition under study. FELLA has been applied to six public datasets (three from Homo sapiens, two from Danio rerio and one from Mus musculus) and has reproduced findings from the original studies and from independent literature.
The R package FELLA offers an innovative enrichment concept starting from a list of metabolites, based on a knowledge graph representation of the KEGG database that focuses on interpretability. Besides reporting a list of pathways, FELLA suggests intermediate entities that are of interest per se. Its usefulness has been shown at several molecular levels on six public datasets, including human and animal models. The user can run the enrichment analysis through a simple interactive graphical interface or programmatically. FELLA is publicly available in Bioconductor under the GPL-3 license.
Keywords: Metabolomics, Pathways, Network analysis, Data mining, Knowledge representation
Abbreviations
FCS: Functional class scoring
GPL-3: General public license version 3
iTRAQ: Isobaric tags for relative and absolute quantitation
KEGG: Kyoto encyclopedia of genes and genomes
ORA: Over representation analysis
Metabolomics is the science that measures lightweight molecules in living organisms and stands as a valuable source of biomarkers and biological knowledge. The preprocessing of such data can be achieved through pipelines like MeltDB or MAIT. Once metabolite abundances are available, pathway analysis tools ease data interpretation by framing the affected metabolites in terms of contextual knowledge. Databases like the Kyoto Encyclopedia of Genes and Genomes (KEGG) are sources of curated pathway data. The classification of enrichment techniques used here follows the review in .
Over representation analysis (ORA) approaches are based on testing the proportion of a list of affected metabolites inside a pathway. ORA is available in tools like the web server MetaboAnalyst and the R package clusterProfiler. Functional class scoring (FCS) approaches use quantitative data instead and seek subtle but coordinated changes in the metabolites belonging to a pathway. MSEA in MetaboAnalyst and IMPaLA contain implementations of FCS for metabolomics. Pathway topology-based (PT) approaches further include topological measures of the metabolites in the statistic, accounting for their inequivalence within the pathway. PT analyses can be performed using MetaboAnalyst.
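The ORA idea described above can be illustrated with a one-sided hypergeometric tail. The following is a minimal Python sketch under stated assumptions, not the code used by MetaboAnalyst, clusterProfiler or FELLA; the function name ora_p_value and the toy counts are hypothetical.

```python
from math import comb


def ora_p_value(background, in_pathway, affected, overlap):
    """Probability of seeing at least `overlap` pathway metabolites
    when `affected` metabolites are drawn without replacement from a
    background of `background` metabolites, `in_pathway` of which
    belong to the pathway (one-sided hypergeometric tail)."""
    total = comb(background, affected)
    upper = min(in_pathway, affected)
    tail = sum(
        comb(in_pathway, k) * comb(background - in_pathway, affected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical toy numbers: 100 background metabolites, a pathway with 10
# of them, and 5 affected metabolites of which 3 fall inside the pathway.
p = ora_p_value(background=100, in_pathway=10, affected=5, overlap=3)
```

A small value of p indicates that the pathway contains more affected metabolites than expected by chance. Note that this test treats each pathway separately, which is why overlapping pathways can be reported together.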
Here, we introduce the R package FELLA, available in Bioconductor [9], for metabolomics data interpretation that combines pathway enrichment with network analysis. The list of affected metabolites and the reported pathways are connected through intermediate entities (reactions, enzymes, modules) in a heterogeneous network layout. This suggests how the perturbation spreads at the pathway level and how pathways cross talk, enhancing the interpretability of the output.
In order to report a sub-network, nodes are ranked according to a scoring function based on network propagation, and only the top scoring nodes are returned. Two algorithms are supported for propagating the labels from the affected metabolites: a classical heat diffusion approach and the PageRank web ranking algorithm [11]. Further details on the network propagation settings can be found in  and in Additional file 2. The main difference between the two algorithms is that heat diffusion is undirected whereas PageRank is directed upwards. In practice, unlike PageRank, heat diffusion will frequently report new metabolites because heat is allowed to propagate back to compounds from the upper levels. This behaviour can ease the discovery of intermediate metabolites that lie close to the input metabolites and tend to connect them. An example of its usefulness can be found in the gilt-head bream study.
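The effect of edge direction on propagation can be sketched on a toy graph. The Python snippet below runs a personalised PageRank by power iteration twice: once on upward-directed edges and once on the same edges made symmetric. The symmetric run is only a stand-in for undirected propagation and is not the heat diffusion model used by FELLA. The node names, the damping value and the handling of nodes without outgoing edges (their mass returns to the input metabolites) are assumptions made for illustration.

```python
def personalized_pagerank(edges, seeds, damping=0.85, iterations=200):
    """Personalised PageRank by power iteration on a directed edge list.
    Random restarts, and mass from nodes without outgoing edges, go back
    to the seed nodes."""
    nodes = sorted({n for edge in edges for n in edge} | set(seeds))
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(restart)
    for _ in range(iterations):
        received = {n: 0.0 for n in nodes}
        dangling = 0.0
        for n in nodes:
            if out_links[n]:
                share = scores[n] / len(out_links[n])
                for target in out_links[n]:
                    received[target] += share
            else:
                dangling += scores[n]
        scores = {
            n: damping * (received[n] + dangling * restart[n])
            + (1.0 - damping) * restart[n]
            for n in nodes
        }
    return scores


# Hypothetical hierarchy: compounds (C) -> reactions (R) -> enzyme (E) -> pathway (P)
upward_edges = [
    ("C1", "R1"), ("C2", "R1"), ("C2", "R2"), ("C3", "R2"),
    ("R1", "E1"), ("R2", "E1"), ("E1", "P1"),
]
symmetric_edges = upward_edges + [(t, s) for s, t in upward_edges]
inputs = {"C1", "C3"}  # affected metabolites; C2 is not in the input

upward = personalized_pagerank(upward_edges, inputs)
undirected = personalized_pagerank(symmetric_edges, inputs)
```

With upward edges, the compound C2 cannot receive any score because nothing points to it. Once scores may flow back down, C2 obtains a positive score and becomes a candidate intermediate metabolite, which mirrors the behaviour described for the undirected approach.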
As shown in , ranking nodes according to their raw diffusion scores suffers from a strong bias related to the node level and topological features. This is addressed by normalising the diffusion score of every node using its background distribution under input permutations. Permutations can be simulated through Monte Carlo trials to obtain an empirical p-value, labelled as p-score. Alternatively, a parametric z-score can be obtained without requiring Monte Carlo trials; in that case, the p-score is obtained by transforming the z-score to lie in the [0,1] interval through the cumulative distribution function of a standard normal distribution. Under both statistical approximations, nodes with the lowest p-scores are reported as the suggested sub-network. Note that p-scores are used as a ranker rather than for testing hypotheses.
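The two normalisations can be sketched for a single node as follows. This Python example is illustrative only and makes several assumptions: the raw score of the node is taken as the sum of hypothetical per-compound weights over the input (a linear score), permutations are modelled as drawing the same number of compounds uniformly without replacement, the empirical p-score uses an add-one estimator, and the z-score is mapped to a p-score as one minus the standard normal cumulative distribution function so that high scores give low p-scores.

```python
import random
from math import fsum
from statistics import NormalDist, mean, pstdev


def raw_score(weights, chosen):
    """Linear score of one node: sum of its weights over the chosen compounds."""
    return fsum(weights[i] for i in chosen)


def p_score_simulation(weights, chosen, trials=2000, seed=0):
    """Empirical p-score from Monte Carlo permutations of the input."""
    rng = random.Random(seed)
    observed = raw_score(weights, chosen)
    hits = 0
    for _ in range(trials):
        permuted = rng.sample(range(len(weights)), len(chosen))
        if raw_score(weights, permuted) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)


def p_score_normality(weights, chosen):
    """Parametric p-score: z-score from the exact null mean and variance of
    a sum sampled without replacement, mapped through the normal CDF."""
    n, k = len(weights), len(chosen)
    null_mean = k * mean(weights)
    null_var = k * pstdev(weights) ** 2 * (n - k) / (n - 1)
    z = (raw_score(weights, chosen) - null_mean) / null_var ** 0.5
    return 1.0 - NormalDist().cdf(z)


# Hypothetical weights linking ten compounds to one node of the graph
weights = [0.9, 0.8, 0.7, 0.1, 0.1, 0.1, 0.05, 0.05, 0.02, 0.01]
top = [0, 1, 2]   # input made of the compounds closest to the node
low = [7, 8, 9]   # input made of the compounds farthest from the node

p_sim_top = p_score_simulation(weights, top)
p_norm_top = p_score_normality(weights, top)
p_norm_low = p_score_normality(weights, low)
```

Both approaches compare the observed score with what random inputs of the same size would produce for that node, so nodes that score highly for any input are not favoured. The parametric version avoids the simulation cost at the price of assuming approximate normality.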
An optional filter allows the removal of small connected components from the reported sub-network. When building the database, a number of random sub-networks are sampled to characterise how infrequent a connected component of order at least r is when k nodes are uniformly sampled. The assumption behind this filter is that meaningful inputs encompass metabolites relatively close to each other within the knowledge graph, prone to be reported in large connected components involving most of them.
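The sampling behind this filter can be sketched as a Monte Carlo estimate. The Python snippet below is a simplified illustration, not FELLA's implementation: it samples k nodes uniformly, measures the largest connected component of the induced sub-network, and estimates how often its order reaches r. The toy path graph and the function names are hypothetical.

```python
import random


def largest_component_order(adjacency, nodes):
    """Order of the largest connected component induced by `nodes`."""
    remaining = set(nodes)
    best = 0
    while remaining:
        stack = [remaining.pop()]
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    stack.append(neighbour)
        best = max(best, size)
    return best


def component_p_value(adjacency, k, r, trials=1000, seed=0):
    """Estimated probability that k uniformly sampled nodes contain a
    connected component of order at least r."""
    rng = random.Random(seed)
    population = sorted(adjacency)
    hits = sum(
        largest_component_order(adjacency, rng.sample(population, k)) >= r
        for _ in range(trials)
    )
    return hits / trials


# Hypothetical knowledge graph: a path with 30 nodes
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 30] for i in range(30)}

p_r1 = component_p_value(path, k=5, r=1)
p_r2 = component_p_value(path, k=5, r=2)
p_r5 = component_p_value(path, k=5, r=5)
```

Large components are rare under uniform sampling, so a reported component whose order has a low estimated probability is unlikely to arise by chance, while small components can be filtered out.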
FELLA relies on two classes: FELLA.DATA for the internal knowledge representation, based on the igraph R package [13], and FELLA.USER for the user analysis (see Fig. 1). These classes contain subclasses, invisible to the user and described in Additional file 2. The functions to manipulate both classes are described below, following the three blocks from Fig. 1.
Block I: local database
The function buildGraphFromKEGGREST() retrieves the tabular KEGG data for the desired organism and builds the knowledge graph as described in . Then, a database can be built from the graph and stored in a local folder using buildDataFromGraph(). Databases are needed for the enrichment and should be loaded through the function loadKEGGdata().
Block II: enrichment analysis
Table. Scoring methods offered in FELLA, chosen by the enrich function arguments method and approx.
Descriptions of the listed methods:
- Included for reference
- Heat diffusion scores followed by z-scores
- Heat diffusion scores followed by permutations
- PageRank scores followed by z-scores
- PageRank scores followed by permutations
Block III: exporting results
Finally, the best scoring KEGG entries can be visualised through plot(), exported as a sub-network with generateResultsGraph(), or exported in tabular format with generateResultsTable(). A dedicated table with the reported enzymes and their associated genes can be obtained with generateEnzymesTable(). Alternatively, exportResults() allows writing such objects directly to files.
This tab contains a general description of the interface and a handle to submit the input metabolite list as a text file. Examples are provided as well. The right panel shows the mapped and the mismatching compounds with regard to the current database.
Widgets from this tab adjust the main function arguments for customising the enrichment procedure. They ease database choice from the internal package directory, method and approximation definition and parameter tweaking. They also allow the semantic similarity analysis on the reported enzymes, using the R package GOSemSim with the Gene Ontology annotations.
Results and discussion
The results section mainly consists of an interactive network plot with the top k KEGG entries. Nodes can be moved, selected, queried and hovered to reveal the original KEGG entry. An interactive table lies below the plot and expands the data on the nodes.
The last tab offers several options to download the reported sub-network (tabular format or R object) and enzymes (tabular format).
The algorithmic part of FELLA has already been discussed and validated in . The usage of FELLA is demonstrated here on three public human studies on epithelial cells, ovarian cancer cells and febrile illnesses. The examples guide the user on how to build the database, format the input data, complete the enrichment and export its results (see Additional file 2). FELLA reproduces findings from the original publications, not only in the form of pathway hits but also as newly suggested enzymes and metabolites. Additional file 5 shows further details on the metabolites in each input and the reported sub-networks.
Summary of the FELLA.DATA objects used for the three human and the three non-human datasets
Epithelial cells dataset
The activation of the “glycerophosphocholine synthesis” rather than the “carnitine” response is a main result in the original work. FELLA highlights the related pathway “choline metabolism in cancer” and the “choline” metabolite as well. Another key process is the “O-linked glycosylation”, which is close to the KEGG module “O-glycan biosynthesis, mucin type core” and to the KEGG pathway “Mucin type O-glycan biosynthesis”. Finally, FELLA reproduces the finding of “UAP1” by reporting the enzyme “2.7.7.23”, named “UDP-N-acetylglucosamine diphosphorylase”. “UAP1” is a key protein in the study, pinpointed by iTRAQ (Isobaric Tags for Relative and Absolute Quantitation) and validated via western blot.
Ovarian cancer cells dataset
The second dataset has been extracted from the study on metabolic responses of ovarian cancer cells. OCSCs are isogenic ovarian cancer stem cells derived from the OVCAR-3 ovarian cancer cells. The abundances of 6 metabolites are affected by the exposure to several environmental conditions: glucose deprivation, hypoxia and ischemia. Of those, 5 metabolites map to the FELLA.DATA object. The sub-network is obtained by keeping the default parameters and setting a limit of nlimit = 150 nodes.
Several “TCA cycle”-related entities are highlighted, also found by the authors and by previous work. The sub-network also includes “sphingosine degradation”, closely related to the “sphingosine metabolism” reported in the original work. Enzymes that have previously been related to cancer are suggested within the TCA cycle, like “fumarate hydratase” [22, 23, 24], “succinate dehydrogenase” [22, 25] and “aconitase”. Another suggestion is the “lysosome”, as lysosomes are known to undergo changes in cancer cells and directly affect apoptosis. Finally, the graph contains several “hexokinases”, potential targets to disrupt glycolysis, a fundamental need in cancer cells.
The metabolites in this example are related to the distinction between malaria and other febrile illnesses. Specifically, the list of 11 KEGG identifiers (9 in the FELLA.DATA object) has been extracted from the original supplementary data spreadsheet, using all the possible KEGG matches for the “non malaria” patient group. The sub-network is obtained by keeping the default parameters and setting a limit of nlimit = 50 nodes.
In this case, the depicted sub-network contains the modules “C21-Steroid hormone biosynthesis, progesterone => corticosterone/aldosterone” and “C21-Steroid hormone biosynthesis, progesterone => cortisol/cortisone”, related to the “corticosteroids”, a main pathway reported in the original text. These modules are part of the pathway “Aldosterone synthesis and secretion”, which is reported as well; aldosterone is known to show changes related to fever as a metabolic response to infection. Another plausible hit in the sub-network is “linoleic acid metabolism”, as erythrocytes infected by various malaria parasites can be enriched in linoleic acid. In addition, the pathway “sphingolipid metabolism” can play a role in the immune response [31, 32]. As for the enzymes, “3alpha-hydroxysteroid 3-dehydrogenase (Si-specific)” and “Delta4-3-oxosteroid 5beta-reductase” are related to three input metabolites each and might be candidates for further examination.
Oxybenzone exposure on gilt-head bream datasets
A study of the consequences of the oxybenzone contaminant on gilt-head bream found five dysregulated KEGG metabolites in their liver and eleven in their plasma. The study justified its findings through literature and complemented them with insights provided by FELLA. Here, both metabolite lists are used to build suggested sub-networks with the default parameters and fixing nlimit = 250. The FELLA.DATA object is built for the Danio rerio organism, a common approximation when annotations specific to gilt-head bream are not available. Further details can be found in the vignette (Additional file 3) and its workspace (Additional file 6).
The enrichment on the liver-derived metabolites links all of them within a connected component of roughly 100 nodes. It points to “Phenylalanine metabolism” as one of the key metabolic pathways, in accordance with the main results from the article. Among the suggested metabolites, “Tyrosine” is of particular help to explain the connection between the affected metabolites (see Fig. 2 from ).
Plasma metabolites involve a more complex scenario. FELLA reports ten out of the eleven metabolites in a connected component involving around 120 nodes. Seven pathways are suggested, of which “Linoleic acid metabolism”, “Biosynthesis of unsaturated fatty acids”, “alpha-Linolenic acid metabolism”, “Glycerophospholipid metabolism” and “Glycine, serine and threonine metabolism” were used to build a comprehensive picture of the metabolic changes in the original manuscript (Fig. 3 from ). That figure gives a structured overview that narrows down the core processes, also backed up by prior publications. Likewise, by drawing intermediate metabolites found through FELLA, like “Linoleic acid” and “Phosphatidylcholine”, it achieves a cohesive representation of the input metabolites.
Non-alcoholic fatty liver disease mouse model
This dataset exemplifies how FELLA can also be applied to an animal disease model. Metabolites in liver tissue from leptin-deficient ob/ob mice and wild-type mice were compared using Nuclear Magnetic Resonance, whereas several candidate genes were further investigated for differences in expression [21]. Six affected metabolites are introduced in FELLA, keeping the default parameters and nlimit = 250. The FELLA.DATA object is built for the Mus musculus organism. The vignette with the whole analysis is provided as Additional file 4, whereas its R workspace can be found in Additional file 7.
The sub-network found by FELLA involves “N,N-Dimethylglycine”, a marginally significant metabolite in the experimental data but with a relevant role within the findings from the study. Regarding the genes, FELLA is able to find the enzyme associated with Bhmt, validated and discussed in the study. The enzyme associated with Cbs, another central hit, is not directly found. However, its ranking (top 17% among enzymes) and especially that of its reaction (top 3% among reactions) are highly suggestive. We also show how (1) other related metabolites, found by leveraging the expression data, and (2) differentially expressed genes, taken from an external study, tend to have top p-scores in the prioritisation provided by FELLA.
We present FELLA, an R package for enriching metabolomics data, focused on interpretability. It can be used either programmatically or through a simple user interface. FELLA offers a comprehensive enrichment by depicting the intermediate reactions, enzymes and modules that link the input metabolites to the relevant pathways. This layout gives a biological picture with information on the pathway overlap and the connections between the entities of interest, while suggesting enzymes and possibly other metabolites for further study. The utility of FELLA has been demonstrated on six public datasets, from both human and non-human organisms, where reported entities include several original findings in addition to results from independent studies. FELLA is publicly available in the Bioconductor repository under the GPL-3 license.
Availability and requirements
We would like to thank Haizea Ziarrusta and our collaboration with the Department of Analytical Chemistry, University of the Basque Country (UPV/EHU), Leioa, for using, discussing and helping improve our software. We would also like to thank the anonymous reviewers for their valuable comments.
This work was supported by the Spanish Ministry of Economy and Competitiveness (MINECO) [BFU2014-57466-P to OY, TEC2014-60337-R and DPI2017-89827-R to AP]. OY, AP and SP thank CIBERDEM and CIBER-BBN, both initiatives of Instituto de Investigación Carlos III (ISCIII), for funding. SP thanks the AGAUR FI-scholarship programme. The funding bodies had no role in the design of the study, in the collection, analysis and interpretation of data, or in writing the manuscript.
SP, FF, MV, OY and AP conceived the software. SP implemented the software and analysed the data. SP wrote the original manuscript. FF, MV, OY and AP critically revised the original manuscript. OY and AP supervised the project. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
Francesc Fernández-Albert has been employed by Takeda Cambridge Ltd. This does not alter our adherence to BioMed Central policies. There are no patents, products in development or marketed products to declare. The commercial affiliation of Francesc Fernández-Albert did not play any role in the design, analysis and outcome of this article. The authors declare that they have no competing interests.
9. Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, Bravo HC, Davis S, Gatto L, Girke T, Gottardo R, Hahne F, Hansen KD, Irizarry RA, Lawrence M, Love MI, MacDonald J, Obenchain V, Oleś AK, Pagès H, Reyes A, Shannon P, Smyth GK, Tenenbaum D, Waldron L, Morgan M. Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods. 2015; 12(2):115–21.
11. Page L, Brin S, Motwani R, Winograd T. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab. 1999.
13. Csardi G, Nepusz T. The igraph software package for complex network research. InterJournal. 2006; Complex Systems:1695.
14. Chang W, Cheng J, Allaire J, Xie Y, McPherson J. Shiny: Web Application Framework for R. 2018. R package version 1.1.0. https://CRAN.R-project.org/package=shiny. Accessed 20 Sept 2018.
21. Gogiashvili M, Edlund K, Gianmoena K, Marchan R, Brik A, Andersson JT, Lambert J, Madjar K, Hellwig B, Rahnenführer J, et al. Metabolic profiling of ob/ob mouse fatty liver using HR-MAS 1H-NMR combined with gene expression analysis reveals alterations in betaine metabolism and the transsulfuration pathway. Anal Bioanal Chem. 2017; 409(6):1591–606.
23. Pithukpakorn M, Wei M-H, Toure O, Steinbach PJ, Glenn GM, Zbar B, Linehan WM, Toro JR. Fumarate hydratase enzyme activity in lymphoblastoid cells and fibroblasts of individuals in families with hereditary leiomyomatosis and renal cell cancer. J Med Genet. 2006; 43(9):755–62. https://doi.org/10.1136/jmg.2006.041087.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.