TumorMet: A repository of tumor metabolic networks derived from context-specific Genome-Scale Metabolic Models

Granata, Ilaria; Manipur, Ichcha; Giordano, Maurizio; Maddalena, Lucia; Guarracino, Mario Rosario

doi:10.1038/s41597-022-01702-x


Data Descriptor
Open access
Published: 07 October 2022

Volume 9, article number 607, (2022)

Scientific Data


A Publisher Correction to this article was published on 21 October 2022

This article has been updated

Abstract

Studies about the metabolic alterations during tumorigenesis have increased our knowledge of the underlying mechanisms and consequences, which are important for diagnostic and therapeutic investigations. In this scenario and in the era of systems biology, metabolic networks have become a powerful tool to unravel the complexity of the cancer metabolic machinery and the heterogeneity of this disease. Here, we present TumorMet, a repository of tumor metabolic networks extracted from context-specific Genome-Scale Metabolic Models, as a benchmark for graph machine learning algorithms and network analyses. This repository has an extended scope for use in graph classification, clustering, community detection, and graph embedding studies. Along with the data, we developed and provided Met2Graph, an R package for creating three different types of metabolic graphs, depending on the desired nodes and edges: Metabolites-, Enzymes-, and Reactions-based graphs. This package allows the easy generation of datasets for downstream analysis.

Measurement(s)	gene expression, metabolic relationships
Technology Type(s)	Genome Scale Metabolic Models; Computational network biology
Sample Characteristic - Organism	Homo sapiens


Background & Summary

Cancer is a complex disease caused by a myriad of factors and characterized by an astonishing complexity of phenotypes and traits, which determine its wide heterogeneity, even among cells of a single tissue. Nonetheless, three key processes are shared by all cancer cells: proliferation, invasion, and metastasis. To fulfill these tasks, cancer cells need to reprogram their metabolic activities and cross-talk with their neighborhood^1,2. This evidence gives the metabolism and its players a crucial role in cancer progression and, consequently, cancer research.

Among all the biological networks, the metabolic ones are particularly complex and highly interconnected. Still, they are probably the best characterized in terms of connections and those that best represent the genotype-phenotype associations³. Accordingly, the reconstruction of comprehensive networks through the integration of omics data into metabolic scaffolds is one of the preferred tools of systems biology for investigating biological phenomena from a holistic point of view. The metabolic scaffolds are given by the Genome-Scale Metabolic Models (GSMs), built from multi-omics data integration, and carrying information concerning the genes/proteins with enzymatic activity, how they interact with bioactive compounds in the context of biochemical reactions, and how the metabolic interconnections change in different cells, tissues or specific conditions⁴. There is a great interest in exploiting these models to generate condition-specific graphs at the service of machine learning approaches. In the era of precision medicine, the main goal is to develop approaches and tools to face the well-known heterogeneity of physiological and pathological manifestations and provide focused solutions for specific conditions. Considering the disease cohort as a single group, including all the diagnosed patients, is a simplistic approach that does not contemplate any inter-sample heterogeneity due to genetic and environmental factors. While modern biology has accepted the intra-sample heterogeneity of single cells, it seems anachronistic to talk about disease- instead of patient-specific conditions. Several studies address the problem of heterogeneity by exploiting network-structured approaches^5,6,7.

Metabolic networks are complex and can involve different metabolic players (i.e., metabolites, enzymes, reactions). Machine and deep learning frameworks allow extracting knowledge from the metabolic networks while dealing with their structural and relational complexity⁴. In the context of findability, accessibility, interoperability, and reusability (FAIR) principles⁸, providing benchmark datasets for comparing novel approaches and for the general advancement of a specific research domain is extremely important. Graph-structured data coupled with machine learning approaches are receiving growing interest^{9,10,11,12,13}, and many benchmark datasets have been proposed in the context of biomedical graphs, especially derived from protein-protein interaction, chemical, imaging data^{14,15,16,17,18}. To the best of our knowledge, metabolic networks based on context- and patient-specific metabolic models have not been provided so far. To fill this gap, here, we provide the TumorMet repository. TumorMet contains two main sets of networks depending on the models from which they derive: Tissue-derived networks generated starting from tissue-specific models and PDGSMMs-derived networks obtained using Patient-Derived Genome-Scale Metabolic Models (PDGSMMs). The interesting implications of using the metabolic networks are twofold, from both a computational and biological perspective. Their complexity in terms of nodes and connections, and the plasticity given by the multiple ways in which they can be generated, make them appealing for the proposal and validation of novel approaches in the context of computational graph-based research. In this work, we presented three alternatives, each focused on a specific set of metabolic players (i.e., metabolites, enzymes, and reactions). As demonstrated by¹⁹, reconstruction algorithms used to generate context-specific models present a bug which determines an underestimation of the molecular context. 
The model’s conversion into a network allows further contextualization by integrating context-specific data. Being aware that the networks we generated for TumorMet are just a portion of the possibilities, we provided the Met2Graph package to give the user the freedom to build the networks depending on specific needs. Met2Graph indeed implements a flexible process flow to build the metabolic graphs, can be easily integrated with user-customized functions, and provides several arguments to personalize the networks. Some of the networks in this dataset were used for assessing graph classification, clustering, and embedding^20,21,22,23, as well as for multimodal data analysis^24,25, demonstrating their benefits. An exciting field of biological network usage is also represented by the application of node classification approaches aimed at predicting the essential genes, namely those genes crucial for an organism’s viability. Usually, the Protein-Protein Interaction (PPI) networks are exploited to this end, based on the assumption that the topological centrality is correlated to a functional centrality. As hypothesized in²⁶, one of the reasons why PPI networks are the most used for this purpose could be their abundance compared to other types, such as metabolic networks, highlighting the importance of providing network datasets. Still, physical interactions alone, which are moreover not contextualized, are insufficient to represent the complexity of genetic connections²⁷. Modern biology extensively uses networks to integrate and analyze data in a way in which organisms, tissues, or cells are considered systems. This perspective gives a crucial role to the connections among biological components, and the network-based analyses are exploited for making relevant biological inferences.
The central role of metabolism in different aspects of pathophysiological mechanisms and its fine-tuned regulation make these networks particularly interesting for extracting knowledge and making predictions. For example, the analysis of hub nodes²⁸ and the comparison of topological properties between different context-specific networks²⁹ are valuable resources in the investigation of diagnostic and prognostic markers for precision medicine. Along with the data, we also provide an R package, Met2Graph, to create metabolic graphs starting from GSMs and gene expression data. The package can generate three types of graphs, depending on the desired nodes and edges: Metabolites-based graphs, where metabolites are nodes connected by reactant-product relationships and the edges can be weighted by expression values of the enzymes catalyzing the corresponding reactions; Enzymes-based graphs, where enzymes are nodes connected if one catalyzes a reaction producing a metabolite that is consumed in a reaction catalyzed by the other; and Reactions-based graphs, with reactions as nodes connected if the metabolite produced by one is consumed by the other. TumorMet is deposited at the figshare repository³⁰ and the Met2Graph package used to generate it is available at the Met2Graph GitHub repository (https://github.com/cds-group/Met2Graph).

Methods

The metabolism involves several players, and focusing on one or another influences the type of analysis and the knowledge that can be extracted. The metabolites and the enzymes represent the main molecular components. A biochemical reaction is a transformation process that uses/consumes some metabolites (reactants) to produce new ones (products). The enzymes can facilitate these transformations as they are particular proteins having catalytic activity and the ability to speed up the rate of a reaction binding the substrate by a lock-key or induced-fit model. Not all the reactions are catalyzed by enzymes, as some of them can occur spontaneously. The enzymes are selective; this means that one binds specifically one or few substrates and, consequently, can catalyze one or more reactions, while the same reaction can be catalyzed by more enzymes acting as complex or as mutually exclusive catalyzers. This information is crucial in defining the rules to design a metabolic network since the connections between the metabolic players can be multiple and of different nature when involving the enzymes. In order to manage this issue, we defined some simplification strategies when enzymes represent edges and give rise to multiple connections (as in the case of Metabolites-based networks) and a different consideration of complex and mutually exclusive relationships when enzymes represent the nodes (as in the case of Enzymes-based networks). Further details are provided below in the network construction sections. The repository we provide contains different types of metabolic networks, depending on the nodes and the rules behind the connections: Metabolites-, Enzymes- and Reactions-based networks. A graphical overview of the metabolic networks construction is provided in Fig. 1.

Metabolic models

Tissue-specific GSMs for 5 of the different origin sites of cancer (lung, kidney, brain, ovary, prostate)³¹ and the breast cancer INIT model³² were downloaded from the Metabolic Atlas repository (http://www.metabolicatlas.org) in the compressed Systems Biology Markup Language (SBML) format³³ to create the Metabolites-based graphs. PDGSMMs were downloaded from the BioModels repository (https://www.ebi.ac.uk/biomodels/pdgsmm/) to generate Metabolites-, Enzymes- and Reactions-based_PDGSMMs graphs for each patient. The Gene-Protein-Reaction (GPR) relationships were extracted from version 1.4.1 of the human generic GSM (https://github.com/SysBioChalmers/Human-GEM/tree/master/model).

Gene expression data

Gene expression data from 6 different tumor primary sites were used to create context-specific Metabolites-based metabolic networks. FPKM (fragments per kilobase per million reads mapped) normalized and log-transformed read counts from RNA sequencing experiments of the breast (TCGA-BRCA), lung (TCGA-LUAD and TCGA-LUSC), kidney (TCGA-KIRC and TCGA-KIRP), brain (TCGA-GBM and TCGA-LGG), ovary (TCGA-OV), and prostate (TCGA-PRAD) cancers were obtained from the Genomic Data Commons (GDC) data portal (https://portal.gdc.cancer.gov). Among the cancer projects hosted by GDC, we selected The Cancer Genome Atlas (TCGA) as the data source. Each primary site represents a dataset of the repository. Clinical annotations of the samples were also extracted from the database and included in each dataset as sample-sheets.

Metabolites-based_tissue networks construction

The metabolites are the nodes of the network, labeled by the corresponding ID, and are connected if they are involved in the same reaction, one as a reactant and one as a product. The connections have been created using the information from the relative context-specific metabolic model. Recurrent metabolites (e.g., ATP, CO2, H2O) have been removed to avoid redundant connections and an unrealistic definition of paths³⁴. Small molecules such as H2O, NH3, O2, CO2, phosphate, and cofactors are generally considered recurrent metabolites. The list of recurrent metabolites we used is provided as external data of the package Met2Graph; the argument rmMets can be set to FALSE to avoid removal, or the list can be personalized by the user. The GPR associations have been derived from the generic human GSM. Each edge is labeled by the Ensembl stable ID (in the form of ENS[species prefix][feature type prefix][a unique eleven-digit number]) of the enzyme/s catalyzing the reaction, when present, and weighted by the expression value/s of the corresponding gene/s obtained from the GDC Portal. Each resulting graph corresponds to a specific sample of the GDC tumor dataset considered. These rules create graphs where a pair of nodes can have multiple edges, since multiple enzymes can be involved in the same reaction and/or the same node pair can be present in different reactions. Multiple edges have been simplified by averaging the expression values of enzymes acting in the same reaction and then summing up the averages of the different reactions sharing the same node pair. Thus, all the graphs resulting from the same metabolic model have the same number of nodes and edges but different edge weights. The networks are thereby personalized for each patient by using the expression values and, as a consequence, the gene context mentioned by¹⁹ is met. Based on the rules defining the edges, these networks are directed. The properties of these networks are summarized in Table 1.
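The edge simplification rule described above (average within a reaction, then sum across reactions sharing the same node pair) can be sketched as follows. This is a minimal Python illustration, not the Met2Graph implementation; the input layout, the identifiers, and the choice to skip reactions without any measured enzyme are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy input: (reactant, product, reaction ID, enzyme gene IDs).
edges = [
    ("m1", "m2", "R1", ["ENSG_A", "ENSG_B"]),  # two enzymes in the same reaction
    ("m1", "m2", "R2", ["ENSG_C"]),            # same node pair, different reaction
    ("m2", "m3", "R3", ["ENSG_A"]),
]
# Hypothetical expression values for one sample.
expression = {"ENSG_A": 2.0, "ENSG_B": 4.0, "ENSG_C": 1.5}


def simplify_edges(edges, expression):
    """Collapse multi-edges: mean per reaction, then sum over reactions."""
    per_pair = defaultdict(dict)
    for src, dst, rxn, genes in edges:
        values = [expression[g] for g in genes if g in expression]
        if values:  # assumption: reactions with no measured enzyme add no weight
            per_pair[(src, dst)][rxn] = mean(values)
    return {pair: sum(means.values()) for pair, means in per_pair.items()}


weights = simplify_edges(edges, expression)
```

With these toy values, the pair (m1, m2) receives the mean of R1 plus the mean of R2, while (m2, m3) keeps the single expression value of its enzyme.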

Table 1 Properties of the Metabolites-based networks derived from tissue models.


Metabolites-based_PDGSMMs networks construction

The logic behind the generation of Metabolites-based_PDGSMMs networks is the same as that of the networks derived from tissue models described in the previous paragraph, with the only difference that here each patient-specific network is derived from the corresponding PDGSMM downloaded from the BioModels repository. The edges are weighted using the patient’s gene expression data from the GDC repository. Therefore, each patient-specific network has a different structure and different edge weights. These graphs are directed and weighted. The properties of these networks are summarized in Table 2a.

Table 2 For each tissue dataset of the Metabolites- (a), Enzymes- (b), and Reactions-based_PDGSMMs (c) networks (along the columns), we report the number of graphs (first row) and the corresponding network topological properties: number of vertices and edges, edge density, average network degree, presence of edge weights, degree assortativity, global transitivity, average local transitivity, and minimum and maximum diameter (second through eleventh rows).


Enzymes-based_PDGSMMs networks construction

These networks have enzymes as nodes, connected if one catalyzes a reaction producing a metabolite consumed in a reaction catalyzed by the other. Recurring metabolites have been removed here as well. According to the GPR, the enzymes involved in each reaction are associated by an AND or OR logical relationship, indicating an enzymatic complex or an alternative activity, respectively. Based on this, enzymes related by AND have been considered as a single node, while OR relationships have been split into different nodes. To create patient-specific networks, PDGSMMs downloaded from the BioModels repository have been used as starting models for the Metabolites-, Enzymes-, and Reactions-based_PDGSMMs datasets. Each sample graph therefore has a different structure deriving from a different model. These graphs are directed and not weighted. The properties of these networks are summarized in Table 2b.
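As an illustration of the AND/OR rule, the sketch below builds Enzymes-based edges for a toy model in Python. It is not the Met2Graph code: the dictionary layout, the `E1&E2` naming of a complex node, and the exclusion of self-loops are assumptions chosen for the example.

```python
from itertools import product

# Hypothetical toy model. An AND complex is a single node ("E1&E2");
# OR alternatives are listed as separate nodes.
reactions = {
    "R1": {"consumes": {"m1"}, "produces": {"m2"}, "enzymes": ["E1&E2"]},   # E1 AND E2
    "R2": {"consumes": {"m2"}, "produces": {"m3"}, "enzymes": ["E3", "E4"]},  # E3 OR E4
    "R3": {"consumes": {"m3"}, "produces": {"m4"}, "enzymes": []},           # spontaneous
}


def enzyme_edges(reactions):
    """Directed edge a -> b if a catalyzes a reaction whose product
    is consumed by a reaction catalyzed by b."""
    edges = set()
    for up_id, up in reactions.items():
        for down_id, down in reactions.items():
            if up_id == down_id:
                continue
            if up["produces"] & down["consumes"]:
                for a, b in product(up["enzymes"], down["enzymes"]):
                    if a != b:  # assumption: no self-loops
                        edges.add((a, b))
    return edges


enzyme_graph = enzyme_edges(reactions)
```

Here the complex node points to both alternatives of R2, and the spontaneous reaction R3 contributes no enzyme node.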

Reactions-based_PDGSMMs networks construction

The rules behind these networks are similar to those of Enzymes-based networks, with the difference of having reactions as nodes, connected if one produces a metabolite consumed by the other. Recurring metabolites have been removed as well. To obtain sample-specific graphs, we used the PDGSMMs from BioModels in this case too. The resulting graphs are unweighted and directed, and each sample has a different structure determined by the different starting models. The properties of these networks are summarized in Table 2c.
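The effect of removing recurring metabolites on a Reactions-based graph can be seen in a small Python sketch. The toy reactions and the three-element recurring set are illustrative assumptions; the actual list used for TumorMet ships with Met2Graph.

```python
# Illustrative subset only, not the list distributed with Met2Graph.
RECURRING = frozenset({"H2O", "ATP", "CO2"})

# Hypothetical toy reactions with consumed and produced metabolites.
toy_reactions = {
    "R1": {"consumes": {"A", "ATP"}, "produces": {"B", "H2O"}},
    "R2": {"consumes": {"B"}, "produces": {"C"}},
    "R3": {"consumes": {"H2O", "D"}, "produces": {"E"}},
}


def reaction_edges(reactions, recurring=frozenset()):
    """Directed edge r1 -> r2 if r1 produces a non-recurring metabolite
    that r2 consumes."""
    edges = set()
    for up_id, up in reactions.items():
        for down_id, down in reactions.items():
            if up_id == down_id:
                continue
            shared = (up["produces"] & down["consumes"]) - recurring
            if shared:
                edges.add((up_id, down_id))
    return edges


with_removal = reaction_edges(toy_reactions, RECURRING)
without_removal = reaction_edges(toy_reactions)
```

Without removal, R1 and R3 are linked only through water, which is the kind of uninformative connection the filtering is meant to avoid.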

Simplified networks construction

Given the complexity and the size of these networks, we also provided a set of Metabolites-based sub-networks of a subset of kidney and lung samples, simplified according to the approach described in²¹. Briefly, central nodes have been selected by the eigenvector centrality score, a measure of the importance of a node in a graph that depends on the importance of its neighbors. The classification tests performed to demonstrate the reliability of these sub-networks compared to the whole networks gave comparable accuracy results (see Tables 3 and 4 in²¹). For each tissue, two sets of networks with a different number (#) of resulting nodes are provided. The properties of these networks, forming the Simpl-Kidney-# and Simpl-Lung-# datasets, are summarized in Tables 3 and 4.
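The idea of keeping only the most central nodes can be sketched with a plain power iteration. This Python example is a simplified illustration on an undirected toy graph; the exact procedure of²¹ (handling of direction, weights, and the cut-off) may differ, and the function names are hypothetical.

```python
def eigenvector_centrality(adj, iterations=500, tol=1e-12):
    """Power iteration on (A + I); the identity shift leaves the leading
    eigenvector unchanged and avoids oscillation on bipartite graphs."""
    score = {n: 1.0 for n in adj}
    for _ in range(iterations):
        new = {n: score[n] + sum(score[m] for m in adj[n]) for n in adj}
        norm = sum(v * v for v in new.values()) ** 0.5
        new = {n: v / norm for n, v in new.items()}
        if max(abs(new[n] - score[n]) for n in adj) < tol:
            return new
        score = new
    return score


def top_k_subgraph(adj, k):
    """Keep the k highest-scoring nodes and the edges among them."""
    score = eigenvector_centrality(adj)
    keep = set(sorted(score, key=score.get, reverse=True)[:k])
    return {n: [m for m in adj[n] if m in keep] for n in adj if n in keep}


# Hypothetical undirected toy graph as adjacency lists.
toy = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a", "e"],
    "e": ["d"],
}
sub = top_k_subgraph(toy, 3)
```

In this toy graph the triangle a-b-c is retained, since b and c benefit from each other's score while d is attached to the peripheral node e.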

Table 3 Properties of the Simplified Networks. See the caption of Table 1 for details.


Table 4 Classes per dataset for usage validation of Metabolites-based networks through classification. Only primary tumors have been selected.


Classification

Metabolites-based_tissue datasets

In previous works, we have demonstrated the utility of the network datasets in classification and clustering tasks using subsets of some of the Metabolites-based graph datasets now included in the TumorMet repository^{20,21,35,36,37}. Here, we extend to the entire repository the usage validation introduced in²⁰, wherein we classify whole graphs sharing the same set of nodes. The basic idea is to 1) represent each graph of a dataset using probability distributions describing the topological properties of each node; 2) extract the distance matrix (Gram matrix), i.e., the symmetric square matrix containing the distances, taken pairwise, between the networks of the dataset; and 3) classify the networks based on the obtained distance vectors.

1. Based on the performance results achieved in^{20,21,35,36,37}, here we selected the Transition Matrix of order one $\mathscr{T}^{r}$ for representing each graph $\mathscr{G}^{r}$, whose generic element $\mathscr{T}_{i,j}^{r}$ is the probability of a node i to be reached in one step by a random walker located in node j. Each row $\mathscr{T}_{i}^{r}$ of this matrix includes local information on the connectivity of node i.
2. For computing the distance between two networks $\mathscr{G}^{p}$ and $\mathscr{G}^{q}$, we selected the network distance:
$$\mathscr{M}(\mathscr{G}^{p},\mathscr{G}^{q})=\frac{1}{l}\sum_{i=1}^{l} d_{JS}(\mathscr{T}_{i}^{p},\mathscr{T}_{i}^{q}),$$
obtained by averaging, over all the l graph nodes, the Jensen-Shannon distances $d_{JS}$ between the probability distributions of corresponding nodes³⁸.
3. For classification, we considered the primary tumor classes described in Table 4. In particular, for Kidney, Lung, and Brain, the Primary-Tumor diagnoses indicated in the GDC sample metadata file, downloaded along with the gene expression files, have been used to label the samples and fulfill the classification task. For Breast, the 5 subtypes have been derived from the PAM50 classification³⁹. As the Normal-like subtype has only 40 samples and is very similar to the Luminal A subtype, we performed the tests both including (Breast_5cl) and excluding (Breast_4cl) this class. For Prostate, which has only one diagnosis class, the Gleason pattern score, an indicator of different grades of malignancy, has been used. Among the possible four classes (Pattern from 2 to 5), we excluded the Pattern 2 class (not shown in Table 4), as it is made of only one sample. Moreover, we considered two different classification problems: the Prostate1 case, which aims at discriminating the Pattern 3 samples (199) from the Pattern 4 ones (249); and the Prostate2 case, which consists in discriminating the Pattern 3 samples from the samples assigned to either Pattern 4 or 5 (289). For Ovary, the subtype assignment of High-Grade Serous Ovarian Cancer (HGSOC) has been taken from⁴⁰.
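Steps 1 and 2 can be sketched in Python for two small weighted graphs sharing the same nodes. This is a minimal illustration under stated assumptions: each row is normalized over the outgoing edge weights of a node, nodes without outgoing edges get a uniform row, and $d_{JS}$ is taken as the square root of the base-2 Jensen-Shannon divergence. The conventions used in the cited works may differ.

```python
from math import log2, sqrt


def transition_rows(weights, nodes):
    """One probability distribution per node from its outgoing edge weights."""
    rows = []
    for i in nodes:
        out = [weights.get((i, j), 0.0) for j in nodes]
        total = sum(out)
        if total > 0:
            rows.append([w / total for w in out])
        else:  # assumption: sink nodes get a uniform distribution
            rows.append([1.0 / len(nodes)] * len(nodes))
    return rows


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base 2, range [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)

    return sqrt(max(0.0, 0.5 * kl(p, m) + 0.5 * kl(q, m)))


def network_distance(w_p, w_q, nodes):
    """Average of the row-wise Jensen-Shannon distances over all nodes."""
    rows_p = transition_rows(w_p, nodes)
    rows_q = transition_rows(w_q, nodes)
    return sum(js_distance(a, b) for a, b in zip(rows_p, rows_q)) / len(nodes)


# Hypothetical samples: same structure, different expression-derived weights.
nodes = ["m1", "m2", "m3"]
g_p = {("m1", "m2"): 3.0, ("m1", "m3"): 1.0, ("m2", "m3"): 2.0}
g_q = {("m1", "m2"): 1.0, ("m1", "m3"): 3.0, ("m2", "m3"): 5.0}
d_pq = network_distance(g_p, g_q, nodes)
```

Evaluating `network_distance` for every pair of samples would fill the symmetric distance matrix used in step 3.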

Metabolites-, Enzymes-, and Reactions-based_PDGSMMs datasets

The graph2vec framework⁴¹ is a neural method for learning graph-level embeddings in an unsupervised manner. It describes nodes through a recursive node relabeling algorithm that assigns to each node a label uniquely representing its rooted subgraph (neighborhood). These labels form a vocabulary of words, and graphs are represented in the form of documents. Then, the Distributed Bag of Words doc2vec approach⁴² is used to learn the graph (document) embeddings. The performance has been evaluated by means of a stratified 10-fold Cross-Validation (CV) in which an SVM classifier with a linear kernel was trained and used to make predictions on the 64-dimensional graph embeddings produced by graph2vec with a recursive depth of 3 and a training duration of 200 epochs. The class labels used for the classification task are specified in Table 5.
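The relabeling step that turns a graph into a "document" can be illustrated with a short Python sketch of Weisfeiler-Lehman style relabeling. It is not the graph2vec code: the use of node degree as initial label, the hashing scheme, and the undirected toy graphs are assumptions, and the doc2vec training that follows is not shown.

```python
import hashlib


def wl_document(adj, depth=3):
    """Return the list of rooted-subgraph labels ("words") of a graph."""
    # Assumption: initial labels are node degrees.
    current = {n: str(len(nbrs)) for n, nbrs in adj.items()}
    words = list(current.values())
    for _ in range(depth):
        new = {}
        for node, nbrs in adj.items():
            # A node's new label combines its label with its neighbors' labels.
            signature = current[node] + "|" + ",".join(sorted(current[n] for n in nbrs))
            new[node] = hashlib.sha256(signature.encode()).hexdigest()[:8]
        current = new
        words.extend(current.values())
    return words


# Two hypothetical path graphs that differ only in node names, and a triangle.
path_1 = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
path_2 = {"x": ["y"], "y": ["x", "z"], "z": ["y"]}
triangle = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}

doc_1 = wl_document(path_1)
doc_2 = wl_document(path_2)
doc_3 = wl_document(triangle)
```

Graphs with the same structure yield the same bag of words regardless of node names, which is why this representation suits networks that do not share a common node set.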

Table 5 Classes of PDGSMMs used to accomplish the classification task of Kidney and Lung PDGSMMs derived networks.


Data Records

The network files and associated metadata composing the TumorMet repository are available at the figshare repository³⁰. The file TumorMet-repository.pdf summarizes the content of the repository. For easy access to the files, the repository is organized into seven datasets, each in a separate folder, representing the six tumor tissues and the simplified networks (i.e., Prostate, Lung, Kidney, Breast, Ovary, Brain, and Simplified networks). In each main tissue dataset folder, the sample-sheet file reporting the sample metadata as downloaded from GDC (i.e., Sample sheet.tsv) and an Excel file reporting the correspondences between PDGSMM ids and TCGA ids (Dictionary_ids.xlsx) are provided. Each tissue dataset folder contains subfolders for the different types of networks, namely Metabolites-, Enzymes-, and Reactions-based, compressed in .zip format. The Metabolites-based folder is further subdivided into folders containing the Metabolites-based networks deriving from tissue models (Metabolites-based_tissue) and BioModels PDGSMMs (Metabolites-based_PDGSMMs). Enzymes- and Reactions-based networks are only derived from PDGSMMs. Simplified networks are provided for Kidney and Lung tissues. Each of these two tissue folders contains the sample-sheet file reporting the sample metadata as downloaded from GDC (i.e., Sample sheet.tsv) and two subfolders for the network files, based on the number of nodes retained after the simplification process (for Kidney, eigen_simplified_441_nodes and eigen_simplified_1034_nodes; for Lung, eigen_simplified_312_nodes and eigen_simplified_1017_nodes). All the network files are provided in GraphML format. GraphML is a flexible and convenient XML format for storing network information. It supports unweighted, weighted, undirected, and directed networks and allows for the definition of node and edge attributes (http://graphml.graphdrawing.org/). A scheme of the repository content is illustrated in Fig. 2, while a summary of the network features in terms of starting material and number of networks is provided in Table 6.
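Because GraphML is plain XML, a network file can be inspected even without a dedicated graph library. The Python sketch below parses a small in-memory GraphML document with the standard library; the sample content and the edge attribute name `weight` are assumptions for illustration and may not match the attribute names used in the TumorMet files.

```python
import xml.etree.ElementTree as ET

NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Hypothetical miniature GraphML document.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="edge" attr.name="weight" attr.type="double"/>
  <graph id="G" edgedefault="directed">
    <node id="m1"/>
    <node id="m2"/>
    <edge source="m1" target="m2"><data key="d0">4.5</data></edge>
  </graph>
</graphml>"""


def read_graphml(text):
    """Return (directed flag, node IDs, edges with their named attributes)."""
    root = ET.fromstring(text)
    # Map key IDs (e.g. "d0") to human-readable attribute names.
    keys = {k.get("id"): k.get("attr.name") for k in root.findall("g:key", NS)}
    graph = root.find("g:graph", NS)
    directed = graph.get("edgedefault") == "directed"
    node_ids = [n.get("id") for n in graph.findall("g:node", NS)]
    edge_list = []
    for e in graph.findall("g:edge", NS):
        attrs = {keys[d.get("key")]: d.text for d in e.findall("g:data", NS)}
        edge_list.append((e.get("source"), e.get("target"), attrs))
    return directed, node_ids, edge_list


is_directed, node_ids, edge_list = read_graphml(SAMPLE)
```

For real analyses, a graph library with a GraphML reader is usually more convenient; this parser only shows how the format is laid out.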

Table 6 Networks provided in the TumorMet repository.


Technical Validation

Our validation process consisted of data-type and structural validation, as well as usage validation through downstream applications.

Data-type and structural validation

The quality of the original data used to generate the networks is given by the reliability of the data sources repositories, i.e., GDC, Human Metabolic Atlas, and BioModels. Node IDs were verified to be of the same type. All edges were verified to be between nodes in the node list. All attribute data were verified to correspond to an existing node or edge. The structural integrity of the networks has been assessed by removing self-loops. Any duplicate edges were also removed. We further checked that nodes with no edges were not present in the networks.
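Checks of this kind are straightforward to reproduce on any node and edge list. The following Python sketch reports the four structural issues mentioned above for a deliberately flawed toy graph; it is an illustration with hypothetical names, not the validation script used for the repository.

```python
from collections import Counter


def structural_issues(nodes, edges):
    """Report dangling edges, self-loops, duplicate edges, and isolated nodes."""
    node_set = set(nodes)
    counts = Counter(edges)
    touched = {n for e in edges for n in e}
    return {
        "dangling": [e for e in counts if e[0] not in node_set or e[1] not in node_set],
        "self_loops": [e for e in counts if e[0] == e[1]],
        "duplicates": [e for e, c in counts.items() if c > 1],
        "isolated": sorted(node_set - touched),
    }


# Hypothetical toy graph containing one instance of each problem.
toy_nodes = ["a", "b", "c", "d"]
toy_edges = [("a", "b"), ("a", "b"), ("b", "b"), ("b", "x")]
issues = structural_issues(toy_nodes, toy_edges)
```

A network that passes all checks returns four empty lists.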

Usage validation

The tumor metabolic networks can be exploited in several downstream applications, ranging from pure network analysis to multi-level integration with other biological networks or data, to machine and deep learning approaches for unraveling the complex metabolic machinery and its role in precision medicine. In this section, we show the usage of TumorMet networks in classification of tumor samples, thus giving an idea of one of their potential applications. To furnish a baseline for comparing methods and approaches, we give several details of the two different workflows used for Metabolites-based networks derived from tissue models and Metabolites-, Enzymes-, Reactions-based networks derived from PDGSMMs.

Metabolites-based_tissue datasets

For the evaluation of classification performance, i) each of the Metabolites-based datasets was subdivided into a training and a test set; ii) a statistical validation was obtained on the training sets using a 10-fold CV, to ensure that the results were not biased to a specific training subset; iii) finally, the classification performance on the test datasets was evaluated using the models built on the training datasets.

i). In the case of Kidney, Lung, Breast, and Brain tissue datasets, the choice of the training sets was driven by our previous work³⁶, where subsets of these datasets were already adopted for classification. Therefore, those subsets have been adopted here as training sets, while the newly added samples were assigned to the test sets. For the tissues not used previously (Ovary and Prostate), we obtained the training and test sets by using a 70:30 split ratio. The sample partitioning for each tissue is reported in Supplementary Table 1, while Figs. 3–4 provide the t-distributed Stochastic Neighbor Embedding (t-SNE) plots for the test sets.
Fig. 3
t-SNE representations of the Gram matrices of the test sets of the Kidney (a), Lung (b), Brain (c), and Ovary (d) Metabolites-based_tissue datasets. The TSNE function of the sklearn.manifold library has been used to generate the plots.
Fig. 4
t-SNE representations of the Gram matrices of the test sets of the Breast_4cl (a), Breast_5cl (b), Prostate1 (c), and Prostate2 (d) Metabolites-based_tissue datasets. The TSNE function of the sklearn.manifold library has been used to generate the plots.
ii). For the statistical validation on the training sets, the data were min-max normalized and a Support Vector Machine (SVM) classifier with linear kernel was adopted using the libsvm implementation⁴³ available in scikit-learn⁴⁴. The one-vs.-rest strategy was used to classify the multi-class datasets. To account for unbalanced datasets, the “balanced” mode in sklearn was used to set the class weights; this setting penalizes more heavily the misclassification of classes with fewer instances. The 10-fold CV on the training datasets was repeated 10 times, and the averages of the CV scores are reported in Table 7 (top); these scores are also shown in the form of box plots in Fig. 5.
Fig. 5
Classification scores on the Metabolites-based_tissue datasets. The box-plots show the classification scores obtained from the 10 iterations of the evaluation procedure on the training sets of the six Metabolites-based_tissue datasets. (a–c) report Accuracy, Precision, Recall, and F1 as percentages; (d) reports MCC values.
iii). The classification performance on the test sets was computed using the same SVM classifier learned on the training sets. The obtained results are reported in Table 7 (bottom). Kidney, Lung and Brain graphs are well classified, as shown by accuracy scores both in CV on training sets and using new samples as testing data (Table 7 and Figs. 3, 5). More challenging tasks are instead given by the classification of Breast, Ovary and Prostate samples.
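Two preprocessing ingredients of step ii) can be made concrete in a few lines of Python. The sketch below computes class weights with the formula documented for scikit-learn's "balanced" mode, n_samples / (n_classes * class_count), and applies a column-wise min-max scaling. The toy labels and feature rows are hypothetical, and the SVM training itself is not reproduced here.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights as in the 'balanced' heuristic: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in counts}


def min_max_scale(rows):
    """Rescale every column of a list of feature rows to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]  # constant columns map to 0
    return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)] for row in rows]


# Hypothetical unbalanced labels and feature rows.
toy_labels = ["A"] * 6 + ["B"] * 2
class_weights = balanced_class_weights(toy_labels)
scaled = min_max_scale([[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]])
```

The minority class B receives a weight three times larger than class A, so errors on B count more during training.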

Regarding Breast, the inclusion of the Normal-like subtype in the classification does not dramatically change the results; however, compared to the tissues mentioned above, the results are worse, with an accuracy of around 80%. Looking at the t-SNE plots (Fig. 4a,b), it is evident that Basal is the best discriminated and most homogeneous subtype, while some samples of Luminal A, Luminal B, and Her2 overlap, especially the latter two. Normal-like samples, as expected, are difficult to separate from Luminal A ones. Ovary samples are completely overlapping (Fig. 3d) and lead to poor accuracy (around 70%, as reported in Table 7). Finally, the CV scores reported in Table 7 (top) and plotted in Fig. 5c, as well as the test sample validation results reported in Table 7 (bottom), indicate that Prostate samples are generally poorly discriminated and that the results are slightly better for the Prostate2 classification task (when the Gleason Pattern 5 is merged with Pattern 4). Prostate cancer is characterized by a high molecular heterogeneity⁴⁵, which is evidently not captured by considering only the Gleason score, as also highlighted by the t-SNE plots reported in Fig. 4c,d.

Metabolites-, Enzymes-, Reactions-based_PDGSMMs datasets

As detailed in the Section on Metabolic networks construction, these PDGSMMs derived graphs differ from the Metabolites-based graphs in that they do not share a common set of nodes across all patients. Therefore, we decided to accomplish the classification task on these datasets through a whole-graph embedding framework. Classification results based on these embeddings using the class labels specified in Table 5 for the Kidney and Lung PDGSMMs derived network datasets are reported in Table 8.

It is evident that the performance for these types of networks is not as good as the one obtained with Metabolites-based graphs, but it is worth pointing out that the two approaches to the classification task are completely different due to the different nature of the networks. Enzymes- and Reactions-based networks are indeed not weighted and have different structures, being generated from different models. The complexity and density of these networks surely require a deeper investigation of the most suitable approach and parameter tuning to discriminate the differences among the samples, which is not the aim of this paper. As mentioned previously, one of the interesting aspects of the metabolic networks is their plasticity, since different types of graphs can be generated depending on the desired nodes and connections. In future work, we will consider generating a unique tripartite graph for each patient to investigate the possibility of reducing the differences in classification performance. As for the networks extracted from tissue-specific models, the Metabolites-based_PDGSMMs networks are weighted by gene expression values. Comparing weighted vs. non-weighted networks in terms of classification performance, it is evident that the weights do not add any crucial information for discriminating the classes (Table 9). These networks derive from PDGSMMs reconstructed through the tINIT algorithm integrating TCGA gene expression data. Adding expression values to edges is therefore redundant, as the models are likely already well contextualized. Instead, the weights have a different role in Metabolites-based_tissue networks, where they are crucial for personalizing the networks in terms of patients. Furthermore, even if tested with different methods, the patient-specific Metabolites-based networks derived from tissue models seem to contextualize the tissue models well in terms of patients, resulting in networks that are more representative of the tumor classes and have a higher discriminative power, as highlighted by classification performances (Table 7).

Table 7 Classification scores on Metabolites-based_tissue datasets.

Table 8 Classification scores on Enzymes- and Reactions-based_PDGSMMs Kidney and Lung datasets.

Table 9 Classification scores on weighted and unweighted Metabolites-based_PDGSMMs networks of Kidney samples.

Usage Notes

The networks presented here have been generated using the Met2Graph R package we developed (see the “Code availability” section). The model in SBML format is imported and read by the Met2Graph package through the function readSBMLmod from the sybilSBML⁴⁶ package. Several checkpoints are included in the function to validate the model object before importing it, such as checks of the upper and lower bounds, the gene-protein-reaction (GPR) mapping, the reaction ids, and the presence of the lists of reactants and products. The code snippets of Listings 1–4 show the Met2Graph functions and arguments used to obtain the different networks:

Listing 1 Metabolites-based_tissue networks.

Listing 2 Metabolites-based_PDGSMMs networks.

Listing 3 Enzymes-based_PDGSMMs networks.

Listing 4 Reactions-based_PDGSMMs networks.
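The import checkpoints mentioned in this section (bounds, GPR mapping, reaction ids, reactant and product lists) might look roughly like the following sketch. It is not the Met2Graph or sybilSBML code: the dict-based reaction records, the field names, and the exact rules (for instance, treating a duplicated id as a problem) are assumptions made purely for illustration.

```python
def validate_reactions(reactions):
    """Return a list of problems found; an empty list means every check passed.

    Each reaction is a hypothetical dict with the keys
    'id', 'lower_bound', 'upper_bound', 'reactants', 'products', 'gpr'.
    """
    problems = []
    seen_ids = set()
    for index, rxn in enumerate(reactions):
        rid = rxn.get("id")
        label = rid if rid else f"reaction #{index}"

        # Reaction ids: present and (by assumption) unique.
        if not rid:
            problems.append(f"{label}: missing id")
        elif rid in seen_ids:
            problems.append(f"{label}: duplicated id")
        else:
            seen_ids.add(rid)

        # Bounds: both present and consistently ordered.
        lb, ub = rxn.get("lower_bound"), rxn.get("upper_bound")
        if lb is None or ub is None:
            problems.append(f"{label}: missing bound")
        elif lb > ub:
            problems.append(f"{label}: lower bound exceeds upper bound")

        # Reactant and product lists must exist (they may be empty,
        # e.g. for exchange reactions).
        for side in ("reactants", "products"):
            if rxn.get(side) is None:
                problems.append(f"{label}: missing list of {side}")

        # GPR mapping: the field must exist, even if the rule is empty.
        if "gpr" not in rxn:
            problems.append(f"{label}: missing GPR mapping")
    return problems


good = {"id": "R1", "lower_bound": 0.0, "upper_bound": 1000.0,
        "reactants": ["glc"], "products": ["g6p"], "gpr": "G1 and G2"}
bad = {"id": "R1", "lower_bound": 10.0, "upper_bound": 0.0,
       "reactants": ["g6p"], "gpr": ""}

report = validate_reactions([good, bad])
```

Collecting all problems in one pass, rather than stopping at the first, makes it easier to repair a model before attempting the import again.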

There are several open-source network libraries that can be used to analyze and visualize the networks provided in GraphML format. Examples of network analysis and visualization software include NetworkX, igraph, Cytoscape, yEd and Gephi.
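As a rough illustration of what a GraphML file contains, the sketch below parses a small hand-written GraphML document with the Python standard library only. The sample graph, the node ids, and the attribute name `weight` are assumptions and are not taken from the TumorMet files; in practice a library call such as NetworkX's `read_graphml` would normally be preferred.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Tiny hand-written GraphML document standing in for a repository file.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="w" for="edge" attr.name="weight" attr.type="double"/>
  <graph id="toy" edgedefault="directed">
    <node id="glucose"/>
    <node id="g6p"/>
    <node id="f6p"/>
    <edge source="glucose" target="g6p"><data key="w">2.5</data></edge>
    <edge source="g6p" target="f6p"><data key="w">1.0</data></edge>
  </graph>
</graphml>
"""


def read_graphml_text(text):
    """Return (nodes, edges); edges maps (source, target) -> weight or None."""
    root = ET.fromstring(text)

    # Find the key id that declares the edge attribute named "weight".
    weight_key = None
    for key in root.findall("g:key", GRAPHML_NS):
        if key.get("for") == "edge" and key.get("attr.name") == "weight":
            weight_key = key.get("id")

    graph = root.find("g:graph", GRAPHML_NS)
    nodes = [n.get("id") for n in graph.findall("g:node", GRAPHML_NS)]
    edges = {}
    for edge in graph.findall("g:edge", GRAPHML_NS):
        weight = None
        for data in edge.findall("g:data", GRAPHML_NS):
            if data.get("key") == weight_key:
                weight = float(data.text)
        edges[(edge.get("source"), edge.get("target"))] = weight
    return nodes, edges


nodes, edges = read_graphml_text(SAMPLE)
```

For a file on disk, `ET.parse(path).getroot()` would replace `ET.fromstring(text)`; unweighted graphs simply yield `None` for every edge weight.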

Code availability

The R package Met2Graph, developed and used to generate the TumorMet datasets, is publicly available at the Met2Graph GitHub repository (https://github.com/cds-group/Met2Graph). The package includes a detailed tutorial on generating the networks. Met2Graph implements a flexible process flow to build graphs starting from a GSM and can be easily integrated with user-customized functions. It allows the creation of the three types of graphs described, based on the selection of nodes, edges, and attributes: Metabolites-, Enzymes-, and Reactions-based graphs. It also allows gene expression data to be integrated into Metabolites-based graphs, and it provides several options and parameters to customize the resulting graphs. To name a few, it is possible to create multiple or simplified edges (simplification is possible using three different methods), to remove recurring metabolites, to consider both directions in the case of reversible reactions, to generate directed or undirected graphs, and to plot the networks. All the details and the different arguments are described in the package manual and in the “help” section of the related functions.

The code to compute the distribution-based distance measures and to obtain the simplified networks is also available at the GraphDistances GitHub repository (https://github.com/cds-group/GraphDistances).
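To give a flavour of what a distribution-based distance between two graphs can look like, the sketch below compares the degree distributions of two toy graphs with the Jensen-Shannon distance (the square root of the Jensen-Shannon divergence, which is a metric). The choice of the degree distribution and every name here are assumptions made for illustration; this is not the GraphDistances implementation.

```python
import math
from collections import Counter


def degree_distribution(edges, max_degree):
    """Probability of each degree 0..max_degree among nodes touched by an edge."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree.values())
    n = len(degree)
    return [counts.get(k, 0) / n for k in range(max_degree + 1)]


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (log base 2, range 0..1)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative round-off


# A three-node path and a four-node star, as undirected edge lists.
path = degree_distribution([("a", "b"), ("b", "c")], max_degree=3)
star = degree_distribution([("a", "b"), ("a", "c"), ("a", "d")], max_degree=3)

d = jensen_shannon_distance(path, star)
```

Because the comparison works on distributions rather than on node identities, it applies equally to graphs with different node sets.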

Change history

21 October 2022
A Correction to this paper has been published: https://doi.org/10.1038/s41597-022-01765-w

References

Jang, M., Kim, S. S. & Lee, J. Cancer cell metabolism: implications for therapeutic targets. Exp. & molecular medicine 45, e45–e45 (2013).
Pavlova, N. N. & Thompson, C. B. The emerging hallmarks of cancer metabolism. Cell metabolism 23, 27–47 (2016).
Yizhak, K., Chaneton, B., Gottlieb, E. & Ruppin, E. Modeling cancer metabolism on a genome scale. Mol. systems biology 11, 817 (2015).
Granata, I., Manzo, M., Kusumastuti, A. & Guarracino, M. R. Learning from metabolic networks: Current trends and future directions for precision medicine. Curr. Medicinal Chem. 28, 6619–6653 (2021).
Lam, S. et al. Addressing the heterogeneity in liver diseases using biological networks. Briefings Bioinforma. 22, 1751–1766 (2021).
Buphamalai, P., Kokotovic, T., Nagy, V. & Menche, J. Network analysis reveals rare disease signatures across multiple levels of biological organization. Nat. communications 12, 1–15 (2021).
Wu, H.-Y., Nöllenburg, M. & Viola, I. Graph models for biological pathway visualization: State of the art and future challenges. https://doi.org/10.48550/ARXIV.2110.04808 (2021).
Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3 (2016).
Gaudelet, T. et al. Utilizing graph machine learning within drug discovery and development. Briefings Bioinforma. 22 (2021).
Camacho, D. M., Collins, K. M., Powers, R. K., Costello, J. C. & Collins, J. J. Next-generation machine learning for biological networks. Cell 173, 1581–1592 (2018).
Liu, C. et al. Computational network biology: Data, models, and applications. Phys. Reports 846, 1–66 (2020).
Huang, W. et al. A graph signal processing perspective on functional brain imaging. Proc. IEEE 106, 868–885 (2018).
Gu, L. et al. Semi-supervised learning in medical images through graph-embedded random forest. Front. Neuroinformatics 14 (2020).
Manipur, I., Giordano, M., Piccirillo, M., Parashuraman, S. & Maddalena, L. Community detection in protein-protein interaction networks and applications. IEEE/ACM Transactions on Comput. Biol. Bioinforma. 1–1, https://doi.org/10.1109/TCBB.2021.3138142 (2022).
Zitnik, M., Sosič, R., Maheshwari, S. & Leskovec, J. BioSNAP Datasets: Stanford biomedical network dataset collection, http://snap.stanford.edu/biodata (2018).
Hu, W. et al. Open graph benchmark: Datasets for machine learning on graphs. CoRR abs/2005.00687 (2020).
Shen, K. et al. A macaque connectome for large-scale network simulations in TheVirtualBrain. Sci. data 6, 1–12 (2019).
Sugis, E. et al. HENA, heterogeneous network-based data set for Alzheimer’s disease. Sci. data 6, 1–18 (2019).
Ponce-de Leon, M., Apaolaza, I., Valencia, A. & Planes, F. J. On the inconsistent treatment of gene-protein-reaction rules in context-specific metabolic models. Bioinforma. 36, 1986 (2020).
Granata, I. et al. Supervised classification of metabolic networks. IEEE Int. Conf. on Bioinformatics and Biomedicine, BIBM 2018, Madrid, Spain, December 3-6 2018, 2688–2693 (2018).
Granata, I. et al. Model simplification for supervised classification of metabolic networks. Annals Math. Artif. Intell. 88, 91–104 (2020).
Manipur, I. et al. Netpro2vec: a graph embedding framework for biomedical applications. IEEE/ACM Transactions on Comput. Biol. Bioinforma. 19, 729–740 (2022).
Manzo, M., Giordano, M., Maddalena, L. & Guarracino, M. R. Performance evaluation of adversarial attacks on whole-graph embedding models. In Simos, D. E., Pardalos, P. M. & Kotsireas, I. S. (eds.) Learning and Intelligent Optimization 15th International Conference, LION 15, Athens, Greece, June 20-25, 2021, Revised Selected Papers, vol. 12931 of Lecture Notes in Computer Science, 219–236 (Springer, 2021).
Maddalena, L., Granata, I., Manipur, I., Manzo, M. & Guarracino, M. R. Glioma grade classification via omics imaging. In BIOIMAGING, 82–92 (2020).
Maddalena, L., Granata, I., Manipur, I., Manzo, M. & Guarracino, M. R. A framework based on metabolic networks and biomedical images data to discriminate glioma grades. In International Joint Conference on Biomedical Engineering Systems and Technologies, 165–189 (Springer, 2020).
Zhang, X., Acencio, M. L. & Lemke, N. Predicting essential genes and proteins based on machine learning and network topological features: a comprehensive review. Front. physiology 7, 75 (2016).
Nagai, J. S., Sousa, H., Aono, A. H., Lorena, A. C. & Kuroshu, R. M. Gene essentiality prediction using topological features from metabolic networks. In 2018 7th Brazilian Conference on Intelligent Systems (BRACIS), 91–96 (2018).
Mi, K. et al. Construction and analysis of human diseases and metabolites network. Front. Bioeng. Biotechnol. 8, 398 (2020).
Granata, I., Troiano, E., Sangiovanni, M. & Guarracino, M. R. Integration of transcriptomic data in a genome-scale metabolic model to investigate the link between obesity and breast cancer. BMC bioinformatics 20, 1–11 (2019).
Granata, I. et al. TumorMet. Figshare https://doi.org/10.6084/m9.figshare.c.5931130.v1 (2022).
Uhlen, M. et al. Tissue-based map of the human proteome. Sci. 347, 1260419 (2015).
Agren, R. et al. Reconstruction of genome-scale active metabolic networks for 69 human cell types and 16 cancer types using INIT. PLoS computational biology 8, e1002518 (2012).
Hucka, M. et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinforma. 19, 524–531 (2003).
Ma, H. & Zeng, A.-P. Reconstruction of metabolic networks from genome data and analysis of their global structure for various organisms. Bioinforma. 19, 270–277 (2003).
Granata, I., Guarracino, M., Maddalena, L., Manipur, I. & Pardalos, P. On network similarities and their applications. In International Symposium on Mathematical and Computational Biology, 23–41 (Springer, 2019).
Granata, I., Guarracino, M. R., Maddalena, L. & Manipur, I. Network distances for weighted digraphs. In International Conference on Mathematical Optimization Theory and Operations Research, 389–408 (Springer, 2020).
Manipur, I., Granata, I., Maddalena, L. & Guarracino, M. R. Clustering analysis of tumor metabolic networks. BMC Bioinforma. 21, 1–14 (2020).
Endres, D. M. & Schindelin, J. E. A new metric for probability distributions. IEEE Transactions on Inf. Theory 49, 1858–1860 (2003).
Bastien, R. R. et al. PAM50 breast cancer subtyping by RT-qPCR and concordance with standard clinical molecular markers. BMC medical genomics 5, 1–12 (2012).
Lawrenson, K. et al. A study of high-grade serous ovarian cancer origins implicates the SOX18 transcription factor in tumor development. Cell Reports 29, 3726–3735.e4 (2019).
Narayanan, A. et al. graph2vec: Learning distributed representations of graphs. ArXiv abs/1707.05005 (2017).
Le, Q. & Mikolov, T. Distributed representations of sentences and documents. In International conference on machine learning, 1188–1196 (2014).
Chang, C.-C. & Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intell. Syst. Technol. (TIST) 2, 1–27 (2011).
Pedregosa, F. et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Ferrari, N. et al. Adaptive phenotype drives resistance to androgen deprivation therapy in prostate cancer. Cell Commun. Signal. 15, 1–14 (2017).
Gelius-Dietrich, G., Fritzemeier, C. J., Desouki, A. A. & Lercher, M. J. sybil – efficient constraint-based modelling in R. BMC Syst. Biol. 7, 125 (2013).


Acknowledgements

This work has been partially funded by the BiBiNet project (H35F21000430002) within POR-Lazio FESR 2014–2020 and co-funded by European Union PON “Ricerca e Innovazione 2014-2020” FSC - Project PON03PE_00060_5 MEDIA. It was also carried out within the activities of the authors as members of the INdAM Research group GNCS and the ICAR-CNR INdAM Research Unit. The work of Mario R. Guarracino was conducted within the framework of the Basic Research Program at the National Research University Higher School of Economics (HSE). The early stage investigator fellowship of Ichcha Manipur was supported by the INCIPIT program co-funded by Horizon 2020 - CO-FUND Marie Sklodowska Curie Actions.

Author information

Authors and Affiliations

National Research Council, Napoli, 80131, Italy
Ilaria Granata, Ichcha Manipur, Maurizio Giordano & Lucia Maddalena
University of Cassino and Southern Lazio, Cassino, 03043, Italy
Mario Rosario Guarracino

Authors

Ilaria Granata
Ichcha Manipur
Maurizio Giordano
Lucia Maddalena
Mario Rosario Guarracino

Contributions

I.G. - conceptualization, data production, code writing, manuscript writing. I.M. - data analysis, manuscript writing. M.G. - data analysis, manuscript draft review. L.M. - supervision, manuscript draft review. M.R.G. - supervision, manuscript draft review.

Corresponding author

Correspondence to Ilaria Granata.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Table 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Granata, I., Manipur, I., Giordano, M. et al. TumorMet: A repository of tumor metabolic networks derived from context-specific Genome-Scale Metabolic Models. Sci Data 9, 607 (2022). https://doi.org/10.1038/s41597-022-01702-x

Received: 07 April 2022
Accepted: 15 September 2022
Published: 07 October 2022
DOI: https://doi.org/10.1038/s41597-022-01702-x
Springer Nature Limited
