Thermodynamic measures of cancer: Gibbs free energy and entropy of protein–protein interactions
Thermodynamics is an important driving factor for chemical processes and for life. Earlier work has shown that each cancer has its own molecular signaling network that supports its life cycle and that different cancers have different thermodynamic entropies characterizing their signaling networks. The respective thermodynamic entropies correlate with 5-year survival for each cancer. We now show that by overlaying mRNA transcription data from a specific tumor type onto a human protein–protein interaction network, we can derive the Gibbs free energy for the specific cancer. The Gibbs free energy correlates with 5-year survival (Pearson correlation of –0.7181, p value of 0.0294). Using an expression relating entropy and Gibbs free energy to enthalpy, we derive an empirical relation for cancer network enthalpy. Combining this with previously published results, we now show a complete set of extensive thermodynamic properties and cancer type with 5-year survival.
Keywords: Cancer; Signaling networks; Gibbs free energy; Entropy; Protein-protein interactions; 5-year survival
Early insights into protein–protein interaction (PPI) networks suggest that the complexity of changes in PPI network topology correlates with cancer stage [1, 2] and clinical outcomes. If validated prospectively, this would offer a powerful tool not only for prognosis but also for therapy, by providing a rational basis for personalized drug selection informed by mRNA expression data.
Several published studies appear to corroborate the above hypothesis by linking molecular data with clinical outcomes. Paliouras et al.  used mass spectrometry on prostate clinical samples to show how changes in the protein–protein interaction network architecture relate to Gleason score and prostate specific antigen (PSA). Similarly, Freije et al.  showed that gene expression profiling of gliomas correlated with patient survival. We expect any dimensionality reduction of the expression vector to correlate with cancer stage, although the correlation may be poor because of inherent noise in the data and/or from assumptions inherent in the dimensionality reduction algorithm. In order to reduce the combined uncertainty inherent to a PPI network or the expression datasets and to better reconcile disparate PPI networks, one can combine PPI networks, transcriptome, and survival data. The consolidation of these data sets into a coherent abstract model is not only likely to improve the quality of the information in each of these previously unrelated data types, but may improve the data quality sufficiently to use the information for personalized therapies.
There are several ways of measuring complexity of protein–protein interaction networks. Chung et al.  observe that loops of three to six proteins are highly prevalent in PPIs and that 96% of the proteins in these loops play a significant role in some biological function including mRNA metabolic processing and cell cycle regulation. Recent papers [5, 6] describe topological metrics of PPI cancer networks that correlate with 5-year cancer patient survival. Of particular interest are Breitkreutz et al.  and Takemoto and Kaori  who introduce a thermodynamic measure based on degree distribution. A degree distribution is essentially a Boltzmann probability distribution [9, 10], which allows us to consider real-world statistical thermodynamics as a conceptual framework within which to view cancer initiation and progression. This is easy to visualize at a molecular level because Boltzmann’s entropy is a function of the natural-log of the number of equivalent ways, or the number of energy states, for a molecule (a protein). If a protein interacts with two neighbors, it has two different energy configurations for that molecule. If a protein interacts with 20 neighbors, it has a greater number of energy configurations and hence higher entropy. Boltzmann entropy is directly related to network degree entropy in PPIs and represents a quantitative measure of the network’s complexity. A simple and ordered network will have a low value of entropy associated with it. A complex and less-ordered network will be characterized by a higher value of entropy. However, in thermodynamics, entropy reflects only one aspect of a statistical system consisting of many units, its arrangement among possible microstates. In addition, the constituent units may physically interact, which brings another aspect into the picture, the system’s enthalpy. Together, enthalpy and entropy define a function of state called the Gibbs free energy that contains both aspects of the system’s behavior.
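The link between degree distribution and entropy described above can be illustrated with a minimal sketch. It assumes the degree entropy takes the form S = -Σ p(k) ln p(k), where p(k) is the fraction of nodes with degree k. The function name `degree_entropy` and the two toy graphs are hypothetical and are not taken from the cited papers.

```python
import math
from collections import Counter


def degree_entropy(edges):
    """Entropy of the degree distribution of an undirected graph.

    edges: iterable of (u, v) node pairs. Self-loops are skipped and
    duplicate edges are counted once.
    Assumed form: S = -sum_k p(k) * ln p(k), with p(k) the fraction of
    nodes that have degree k.
    """
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    degrees = [len(nbrs) for nbrs in neighbors.values()]
    n = len(degrees)
    counts = Counter(degrees)  # number of nodes observed at each degree
    return sum(-(c / n) * math.log(c / n) for c in counts.values())


# Hypothetical toy graphs: a ring (every node has degree 2) and a star
# (one hub of degree 4 and four leaves of degree 1).
ring = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
star = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d")]

print("ring:", degree_entropy(ring))
print("star:", degree_entropy(star))
```

The fully regular ring has a single degree value and therefore zero degree entropy, while the star, with two distinct degree classes, has a positive value. This matches the qualitative statement that more heterogeneous networks carry higher entropy.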
As motivated by the statements above, the main focus of this paper is on thermodynamics of protein–protein interaction networks in cancer, with an emphasis on entropy and Gibbs free energy as two key measures describing the complexity and chemical energetics of interactions in these networks. We therefore do not discuss Shannon, Kolmogorov–Sinai, or other information-based entropies, which may be relevant to the problem in general but not particularly useful in the present context. In the present manuscript, we review some thermodynamic entropy measures of PPI networks and then describe how to compute Gibbs free energy for cancer networks and show its correlation with 5-year survival, which provides retrospective validation of these concepts. In a more general context, we suggest that these energy views of the PPI are close analogies to the Waddington epigenetic landscape.
2 Brief review of entropy measures to PPI networks
Without attempting an exhaustive review of entropy of PPI networks, we discuss a few key papers. Rashevsky  was the first to suggest degree entropy as a complexity measure for graphs. His “graphs” were aliphatic molecular structures, so by modern standards, they were small graphs. The extension of information theory to thermodynamics in networks was made by Dehmer and Mowshowitz , in their review of the application of various entropy measures to network analysis. One of the first papers to discuss an information-based entropy was by Demetrius and Manke  who studied evolution of networks as a means to understand biological fitness. Their model assumes directed links in the network and they utilize Kolmogorov–Sinai entropy along with Markov processes to describe the evolution (and thus the robustness) of the networks. They extended that work to include cellular robustness .
As we pointed out in the introduction, being able to combine PPI and transcriptome data could enable more accurate use of these inherently noisy data, and with appropriate analysis may lead to actionable insight for clinical applications.
West et al. combine a fixed PPI architecture with mRNA expression data to derive a unique weighted network for each cancer studied. They start with a PPI network from www.pathwaycommons.org and transcription data for different cancers, and they modify the network to contain weighted connections by incorporating the transcriptome data. The weights are Pearson correlation coefficients of gene expression between genes i and j, across multiple samples of the same cancer type. After computing entropy, they suggest that the best drug targets are those protein nodes with the highest robustness. Their suggested targets depend strongly on the mRNA expression levels across a population of samples. If some protein has very highly up-regulated mRNA expression, one could have deduced the importance of that node in the network without actually computing the entropy. This is how many targets are “discovered.” Benzekry et al. suggest a different approach to target discovery based on the unique architecture of each cancer PPI.
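The edge-weighting step described above (Pearson correlation of expression between genes i and j across samples) can be sketched as follows. This is only an illustration of the idea, not the implementation of West et al. The function names, gene labels, and expression values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def weight_edges(edges, expression):
    """Attach a Pearson weight to every PPI edge.

    edges: iterable of (gene_i, gene_j) pairs from the PPI network.
    expression: dict mapping gene -> list of expression values,
                one value per sample of the same cancer type.
    """
    return {(i, j): pearson(expression[i], expression[j]) for i, j in edges}


# Hypothetical toy data: three genes measured in four samples.
expression = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.0, 4.0, 6.0, 8.0],
    "G3": [4.0, 3.0, 2.0, 1.0],
}
edges = [("G1", "G2"), ("G1", "G3")]
weights = weight_edges(edges, expression)
print(weights)
```

In this toy case G1 and G2 are perfectly co-expressed and G1 and G3 are perfectly anti-correlated, so the two edges receive weights at the two extremes of the Pearson range.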
Other recent attempts have been made to combine PPI network data and RNA expression data, continuing the search for correlations between PPI networks or transcription data and survival or prognosis. In 2012, Liu et al. defined a measure called state-transition-based local network entropy (SNE). It is a Shannon information measure that depends conditionally on the previous state of a local dynamical network, that is, a Markov process. They used mRNA expression data at different stages of tumor development, overlaid it on PPI network data, and showed that SNE changes significantly with cancer progression. Others have used a Shannon entropy measure to show that gene expression patterns of melanoma and prostate cancers group according to cancer stage. Shannon entropy, unlike degree entropy, is not a thermodynamic measure.
Banerji et al.  use a slightly similar method to West et al.  to devise a different network entropy. They also use the PathwayCommons network and gene expression data. Using a mass action principle they assume a higher interaction probability if two genes are highly expressed and their protein products interact. Their main point is to show a difference in entropy between stem cells and differentiated cells. They also show entropy differences, with linear correlation, between normal tissue, cancer tissue, and cancer cell lines. This is very similar to the work of Rietman et al. , and the work we describe here.
The work we describe here is an extension of the results by [7, 8] who used unique KEGG (www.genome.jp/kegg) pathway networks for each cancer. They then computed the degree-entropy, or as we argued above, Boltzmann entropy, for the nodes in the network and the overall network. They showed a linear correlation between this entropy measure and overall 5-year cancer survival rate. Here, we describe how to calculate Gibbs free energy for the nodes in the PPI, and we show a linear correlation between Gibbs free energy and 5-year survival rate. We also derive an empirical relation between the observed entropy and the observed Gibbs free energy.
We now introduce Gibbs free energy, a thermodynamic measure encompassing both network complexity and cell thermodynamics (as represented by transcriptome), and show that it can be correlated with cancer survival. As we will see, Gibbs free energy is correlated with network complexity because it is thermodynamically a function of entropy and the network entropy is correlated with network complexity by degree distribution (Boltzmann distribution).
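For reference, the standard thermodynamic relation connecting these quantities at temperature T is given below, together with its rearrangement for enthalpy. How an effective temperature should be assigned to a PPI network is not specified in this text, so T is left as a symbol here.

```latex
G = H - TS
\qquad\Longrightarrow\qquad
H = G + TS
```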
3 Theoretical background
The homeostasis of cells is maintained by a complex, dynamic network of interacting molecules ranging in size from a few dozen Daltons to hundreds of thousands of Daltons. Any change in concentration of one or more of these molecular species alters the chemical balance, or in terms of thermodynamics, chemical potential. These changes then percolate through the network, affecting the chemical potential of other species. The end result represents perturbations in the network manifesting as concentration changes, giving rise to changes in the energetic landscape of the cell. These energetic changes can be described as chemical potential on an energetic landscape only different in kind from the Waddington epigenetic landscape.
Mutational events invariably alter the chemical potential of one or more proteins and/or other molecular species within a single cell. Yet, two neighboring cancer cells in the same microenvironment may exhibit a different energetic landscape because the chemical potential is different within the two cells. Naturally, when a bundle of cells is harvested, for example in a biopsy, and the cells are digested to extract RNA for transcription analysis, the transcriptome is essentially an average of that bundle of cells. Since any given gene is typically transcribed into multiple copies of its corresponding mRNA molecule, the transcriptome can act as a surrogate for the concentration of the proteins. To support this conjecture, several research groups have described correlations of mRNA with protein concentrations [19, 20] and found the Pearson correlation, r, to range from 0.4 to 0.8 in a large number of experiments across five different species. More recently, studies of the human proteome across multiple tissue types that included transcriptomic analysis found the average correlation between transcription signal and mass spectrometry proteomic information to be 83% [21, 22].
In this connection, Huang et al.  proposed that RNA expression data are surrogate metrics for the protein state of cells and represent the concentration of specific numbers of individual proteins exposed to either dimethylsulfoxide or all-trans-retinoic acid. Thus, the authors first introduced the concept of a chemical energy landscape for cells. Following exposure to the chemical perturbation, the gene expression data were collected at different time points, cleaned to remove low-expression genes, and a self-organizing map created. A principal component analysis was then used to produce a map showing the energetic (chemical potential) trajectory of the cells. The transcriptome has been shown to correlate with protein concentrations [21, 22], and can be generally correlated to the state of the cell. Certainly there are high-throughput protein concentration techniques , but the transcriptome provides a higher number of measurements (probes) identified with gene label and readily mapped to protein–protein interaction networks (e.g., thebiogrid.org).
The dynamics of cells are coordinated and controlled by protein–protein interactions, and the complete set of known PPIs gives rise to a network. The state-of-the-art database of these PPI networks is BioGrid (http://thebiogrid.org), described by Breitkreutz et al. It should be stressed that, even though it is state-of-the-art today, it is not complete and does not describe the full species-specific PPI networks. There are several reasons for this, including the fact that the proteome has not been fully mapped from open reading frames to genes and proteins. Consequently, calculations of the networks’ properties such as entropy or the Gibbs free energy should be taken as estimates reflecting the present state of knowledge about these networks.
Here, we report the outcomes of merging two types of data, transcriptome and PPI networks, to compute the energetic state of cancer. We show a correlation between the Gibbs free energy and 5-year patient survival for different cancers. Below, we describe the calculation of Gibbs free energy of cells, outline the data sources, and present the results and discussion.
Proteins do not interact simultaneously with large numbers of neighbors, as would be implied by the PPI network view of some hub proteins (e.g., p53). Instead, the hub protein may be interacting with one or two neighbors at a time, forming a complex nanomachine part such as a ribosome. We make the ensemble assumption that many copies of the hub protein may be located in many places in cells and that each copy may be interacting with a different protein partner. Therefore, we can treat the ensemble of the protein of interest, together with its interactions with its neighbors, as akin to an ideal gas mixture.
The transcriptome is rescaled in order to convert it to units of “concentration.” The data from TCGA are log-2 normalized and already collapsed to gene symbols. The log-2 normalization comes from preprocessing the gene probe data and represents the transcription values. These preprocessed data are typically in the range [–10, 10]. To rescale, we find the minimum value of the transcription dataset in question ($e_{\min}$) and the maximum value ($e_{\max}$). Given the range these data fall in, the maximum will be about 10 and the minimum about –10. The expression value $e_i$ for each gene in that transcriptome vector is then processed as $c_i = (e_i - e_{\min})/(e_{\max} - e_{\min})$.
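The min-max rescaling just described can be sketched in a few lines. The function name `rescale` and the gene labels and log-2 values are hypothetical and serve only to illustrate the formula.

```python
def rescale(expression):
    """Min-max rescale log-2 expression values to the range [0, 1].

    expression: dict mapping gene symbol -> log-2 normalized value.
    Returns a dict mapping gene symbol -> c_i = (e_i - e_min) / (e_max - e_min).
    """
    e_min = min(expression.values())
    e_max = max(expression.values())
    span = e_max - e_min
    if span == 0:
        raise ValueError("all expression values are identical; cannot rescale")
    return {gene: (e - e_min) / span for gene, e in expression.items()}


# Hypothetical log-2 values spanning the typical range [-10, 10].
log2_expr = {"geneA": -10.0, "geneB": 0.0, "geneC": 5.0, "geneD": 10.0}
conc = rescale(log2_expr)
print(conc)
```

The most down-regulated gene maps to 0 and the most up-regulated gene maps to 1, with all other genes placed linearly in between.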
This rescaling is justified from both a mathematical perspective and a chemical physics perspective. Negative values in the argument of the natural logarithm are undefined. The argument from a chemical physics perspective is based on concentrations. If a gene is very down regulated, it is not producing much protein. We assign the protein concentration to 0 for the gene that is the most down regulated. Whereas a gene that is highly up regulated will be producing a great deal of protein, the rescaling assigns the gene product that is most up-regulated, the highest concentration, to a value of 1.
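To make the role of the logarithm concrete, the sketch below computes a per-node Gibbs-like energy from rescaled concentrations. The functional form is an assumption, since the equation itself is not reproduced in this text: an ideal-mixture expression G_i = c_i ln(c_i / Σ_j c_j), with the sum running over node i and its network neighbors. The function names and toy network are hypothetical.

```python
import math


def node_gibbs(conc, neighbors):
    """Per-node Gibbs-like energy under an ASSUMED ideal-mixture form.

    conc: dict mapping node -> rescaled concentration c in [0, 1].
    neighbors: dict mapping node -> set of neighboring nodes in the PPI.
    Assumed form: G_i = c_i * ln(c_i / (c_i + sum of neighbor c_j)).
    """
    gibbs = {}
    for i, c_i in conc.items():
        total = c_i + sum(conc[j] for j in neighbors.get(i, ()))
        if c_i <= 0.0 or total <= 0.0:
            # x * ln(x) tends to 0 as x tends to 0, so a node with zero
            # concentration contributes nothing.
            gibbs[i] = 0.0
        else:
            gibbs[i] = c_i * math.log(c_i / total)
    return gibbs


def network_gibbs(conc, neighbors):
    """Sum of the per-node values over the whole network."""
    return sum(node_gibbs(conc, neighbors).values())


# Hypothetical three-node network: A is linked to B and to C.
conc = {"A": 1.0, "B": 0.5, "C": 0.0}
neighbors = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}}
print(node_gibbs(conc, neighbors))
print(network_gibbs(conc, neighbors))
```

Because each ratio c_i / Σ_j c_j is at most 1, every per-node value under this assumed form is zero or negative, and the rescaling to [0, 1] guarantees the logarithm never receives a negative argument.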
Summary table of the number of subjects in TCGA data sets and respective 5-year survival of individual cancer types from SEER
Kidney renal clear cell
Kidney renal papillary cell
Breast invasive carcinoma
Uterine corpus endometrial
Lung squamous cell
Before overlaying the expression data on the PPI network, the average expression vector is rescaled to the range [0, 1], effectively setting the most up-regulated gene expression to 1 and the most down-regulated to 0. The underlying assumption, based on previously established correlations, is that highly up-regulated genes result in a high protein concentration and highly down-regulated genes result in a very low protein concentration [22, 23]. This prevents any negative argument in the natural logarithm of Eq. (3) and provides consistency from a chemical physics perspective. The calculated Gibbs values are shown in Table 1.
A plot of Gibbs free energy values versus percent 5-year survival for these cancers is shown in Fig. 2. There are nine cancers shown in the graph (GBM, LUAD, LUSC, READ, COAD, OV, LGG, UCEC, BRCA), with a Pearson r correlation of –0.7181 and a p value of 0.0294. The Spearman correlation is –0.633 with a p value of 0.0671. The Kendall tau correlation is –0.555 with a p value of 0.0371. These statistics do not include KIRC (“kidney renal clear cell”) and KIRP (“kidney renal papillary cell”) abnormal tissue growths, which, even though highly proliferative and destructive, are of questionable malignant potential. If one were to include these two abnormal growths (KIRC and KIRP) in the analysis, the correlation would drop to –0.016.
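The three correlation statistics reported above can be computed with a few lines of plain Python. The sketch below is illustrative only: the data are made up and are not the Gibbs or survival values of this study, p values are not computed, and the Kendall statistic is the simple tau-a variant without tie correction, which may differ from the variant the authors used.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg
        pos = end + 1
    return result


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)


# Made-up illustrative data, NOT values from this study.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [90.0, 60.0, 70.0, 30.0, 10.0]
print(pearson_r(x, y), spearman_rho(x, y), kendall_tau(x, y))
```

Reporting a rank-based statistic alongside Pearson r is a useful check with only nine data points, because rank correlations are less sensitive to a single extreme value.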
For comparison, we used another measure of the expression data versus survival. We calculated singular values using numpy.linalg.svd(X) in Python and compared them to survival. The first three singular values versus survival gave r correlations of –0.070, +0.115, and +0.176, respectively (leaving out KIRC and KIRP). These are very poor correlations, and it is reasonable to conclude that Gibbs free energy is more effective in evaluating a real effect on survival, because it is associated with significant changes in energy of a signaling protein network in a cancer cell. An important implication of the correlation between Gibbs free energy and survival is that the higher the Gibbs free energy absolute value of a given cancer type, the more robust it is against external perturbations and the lower the probability of patient survival over a 5-year period. This is consistent with other concepts in physics where Gibbs free energy is a measure of stability of a thermodynamic system. Gibbs free energy and entropy are both thermodynamic measures, and because the observations are similar, we can compare the two thermodynamic measures. Physical systems in equilibrium have a statistical tendency to reach states of maximum entropy (when thermally isolated) or minimum Gibbs free energy (when kept at a constant ambient temperature). Although biological systems are open and far from thermodynamic equilibrium, we expect some aspects of their behavior to be driven by tendencies dictated by thermodynamics or thermodynamic-like considerations. In this paper, we show that reaching a Gibbs free energy minimum for the PPI aspect of cancer cell dynamical interactions is akin to a principle of maximum entropy (second law of thermodynamics).
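The singular-value comparison mentioned at the start of this paragraph can be sketched as follows. The layout is an assumption: one genes-by-samples expression matrix per cancer, with the k-th singular value of each matrix correlated against survival across cancers. The text does not state how X was arranged. The random matrices and survival numbers are made up for illustration.

```python
import numpy as np


def top_singular_values(X, k=3):
    """Return the k largest singular values of matrix X.

    With compute_uv=False, numpy.linalg.svd returns only the singular
    values, sorted in descending order.
    """
    s = np.linalg.svd(np.asarray(X, dtype=float), compute_uv=False)
    return s[:k]


# Made-up data: four "cancers", each with a 20-gene x 6-sample matrix,
# and made-up survival percentages. NOT data from this study.
rng = np.random.default_rng(0)
matrices = [rng.normal(size=(20, 6)) for _ in range(4)]
survival = np.array([10.0, 35.0, 60.0, 85.0])

# One row per cancer, one column per singular value.
sv = np.array([top_singular_values(X) for X in matrices])

# Pearson r between the k-th singular value and survival, for k = 1..3.
r_values = [float(np.corrcoef(sv[:, k], survival)[0, 1]) for k in range(3)]
print(r_values)
```

Singular values summarize the variance structure of the expression data alone, without any network information, which is why they serve as a network-free baseline for the Gibbs free energy correlation.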
Among other features, cancers can be viewed as severely mutated cells. The PPI networks we used do not consider mutations. In future analysis, we would expect to be able to include PPI networks that incorporate gene fusion protein products—the result of mutations. This would enhance the analysis considerably.
As information about cancer-related genomic alterations emerges and more data become available, we can begin to establish the relationships between protein–protein interaction network complexity and cancer progression. We provide Gibbs free energy, a thermodynamic measure encompassing both network complexity and protein concentration (transcriptome), and show that thermodynamics can be correlated with cancer survival. This allows us to potentially differentiate between normal and cancer cells using thermodynamic measures.
The symbol qG represents a quasi-Gibbs free energy, the symbol ξ represents the expression vector, and the little network symbol represents the PPI network. This is analogous to a product of a vector with a vector-like object that produces a scalar (a vector dot product). In these calculations, the network architecture is fixed for all expression vectors, for all cancers. To evaluate whether the architecture of the network itself may play a role, we used random networks, more specifically, random perceptrons, and found the dot product for each expression vector with this perceptron network. We computed the indicated dot product and found that these random networks did not correlate with survival (r = 0.094). Thus, the expression data and the PPI network are both needed for a meaningful Gibbs free energy. In effect, the PPI network provides a structure to the expression data.
It is worth mentioning that our approach to describe and quantify cancer cell networks in terms of statistical thermodynamics is deeply rooted in the methodology relevant to the cascades of biochemical reactions linking it to bioenergetics . In fact, we may be representing only some aspects of the cancer cell’s complexity, namely the topology of the signaling networks, protein expression levels, and protein–protein affinities. Cellular metabolism may well be an additional aspect that needs future integration , provided sufficient empirical data can be obtained. Moreover, as shown in , a complete picture may require the incorporation of time-dependence. Interestingly, the time scales of biochemical reaction rates also differ between cancer and normal cells .
6 Data sources and methods
Data for several cancers were collected from The Cancer Genome Atlas (TCGA), hosted by the National Institutes of Health (http://cancergenome.nih.gov). The Cancer Genome Atlas is described by the TCGA Research Network. More specifically, we collected a set of data that used the Agilent platform G4502A and was pre-collapsed on gene symbols. We collected a total of 11 cancers: KIRC (kidney renal clear cell, TCGA 2013b); KIRP (kidney renal papillary cell); LGG (low-grade glioma); GBM (glioblastoma multiforme, TCGA); COAD (colon adenocarcinoma, TCGA 2012a); BRCA (breast invasive carcinoma, TCGA 2012c); LUAD (lung adenocarcinoma); LUSC (lung squamous cell, TCGA 2012b); UCEC (uterine corpus endometrial, TCGA 2013a); OV (ovarian serous cystadenocarcinoma); READ (rectum adenocarcinoma).
We used the human protein–protein interaction network (Homo sapiens, 3.3.99, March 2013) from BioGrid (http://thebiogrid.org) [39, 40], which contains 9561 nodes and 43,086 edges. The entire human PPI was loaded into Cytoscape (version 2.8.1). The list of genes obtained from TCGA for a specific cancer (the full-length expression set was 17,814 genes) was selected using the Cytoscape selection functions, the selection was inverted, and the selected nodes and their edges were removed. The resulting network, which now included only those genes found in both BioGrid and TCGA, consisted of 7951 nodes and 36,509 edges. This Cytoscape network was exported as an adjacency list for processing by custom Python code (Python 2.6.4) with appropriate NetworkX functions.
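The network-restriction step described above, which was carried out interactively in Cytoscape, amounts to keeping only interactions whose two endpoints both appear in the expression gene list. A minimal plain-Python sketch of that operation follows. The function names and toy gene labels are hypothetical. Unlike the Cytoscape procedure, this version also drops nodes that are left without any edges.

```python
def restrict_network(edges, genes):
    """Keep only edges whose two endpoints are both in the gene list.

    edges: iterable of (u, v) interaction pairs from the PPI database.
    genes: iterable of gene symbols present in the expression data.
    Returns (nodes, kept_edges). Nodes left with no edges are dropped,
    which is a simplification relative to the Cytoscape procedure.
    """
    genes = set(genes)
    kept = [(u, v) for u, v in edges if u in genes and v in genes]
    nodes = {node for edge in kept for node in edge}
    return nodes, kept


def to_adjacency_list(edges):
    """Build an undirected adjacency list: node -> set of neighbors."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


# Hypothetical toy data: D is in the PPI but not in the expression set,
# and X is in the expression set but not in the PPI.
ppi_edges = [("A", "B"), ("B", "C"), ("C", "D")]
expression_genes = ["A", "B", "C", "X"]

nodes, kept_edges = restrict_network(ppi_edges, expression_genes)
adjacency = to_adjacency_list(kept_edges)
print(sorted(nodes), kept_edges)
print(adjacency)
```

The resulting adjacency list has the same shape as the input a per-node neighbor sum would need, so it can be passed directly to downstream per-node calculations.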
We used two databases for survival data: The Surveillance Epidemiology and End Results (SEER) National Cancer Institute database, which contains detailed statistical information about the 5-year survival rates of patients with cancer, and the National Brain Tumor Society database.
EAR was partly funded by the Newman Lakka Cancer Foundation and CSTS Healthcare. JAT acknowledges funding from NSERC, the Canadian Breast Cancer Foundation, and the Allard Foundation. GLK was funded by NIH NIGMS RO1 GM93050 and by philanthropic funds from the Newman Lakka Cancer Foundation, Campanelli Foundation, Jack in the Beanstalk Foundation, and Binational Science Foundation. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Cancer Institute or the National Institutes of Health.
EAR conceived the idea. JAT and EAR collaborated on the thermodynamics. JP contributed key chemical physics concepts. GLK contributed cancer biology concepts. All authors contributed to writing the manuscript.
- 1. Rietman, E., Bloemendal, A., Platig, J., Tuszynski, J., Klement, G.L.: Gibbs free energy of protein–protein interactions reflects tumor stage. http://biorxiv.org/content/early/2015/07/13/022491 (2015)
- 2. Paliouras, M., Zaman, N., Lumbroso, R., Kapogeorgakis, L., Beitel, L.K., Wang, E., Trifiro, M.: Dynamic rewiring of the androgen receptor protein interaction network correlates with prostate cancer clinical outcomes. Integr. Biol. (Camb.) 3, 1020–1032 (2011). doi:10.1039/c1ib00038a
- 25. Breitkreutz, B.J., Stark, C., Tyers, M.: The GRID: the general repository for interaction datasets. Genome Biol. 3, PREPRINT0013 (2002)
- 26. Maskill, H.: The Physical Basis of Organic Chemistry. Oxford University Press, New York (1986)