Background

Describing the compositions of physical systems, such as the mixtures in industrial chemical reactions, the bacteria in the microbiome, or the relative influences in cancer networks, is of significant practical importance. In the present work, these systems are modeled as networks of components (or nodes) and their unknown node-node interactions. The challenge of inferring these interactions, however, lies in incorporating the defining feature of such compositions: the total proportion across components must always stay fixed.

Much recent interest has been devoted to improving the statistical analysis of compositional data [1,2,3,4,5]. The typical strategies that have been employed broadly fall into three categories. First, many apply traditional statistics (such as correlational analyses). Applied to compositional data, however, such tools are known to generate spurious results [6,7,8]. A second approach considers analyses that are unaffected by data rescaling (“scale invariance”) and the addition of new components (“subcompositional coherence”) [1, 2, 9]. However, such methods cannot natively handle zeros in the data and require transformations (e.g. log ratios) that may introduce unwarranted biases into downstream analyses [10, 11]. A third approach considers more general models of the simplicial geometry, or the set of coordinates that sum to a fixed quantity, inherent to compositional data [12,13,14]. What is needed, however, is an approach for modeling compositional data that is both general and principled.

In contrast to previous approaches, we aim to infer the structure of our model from the data. The natural method for this is the principle of maximum entropy or Max Ent [15,16,17,18]. Here, one provides constraints, such as means, variances, and even the geometry of the data itself, and Max Ent provides the model. The advantage of this approach is twofold. First, as opposed to other modeling approaches, Max Ent makes minimal assumptions that are not warranted by the data itself; we simply require our principle to provide a unique, coordinate-independent answer that preserves independence of subcomponents [19]. Second, Max Ent is a widely and successfully utilized modeling framework for complex biological systems [20,21,22,23,24,25]. We provide theory and practical demonstrations of our new approach in the present work.

Results

The model

Suppose one is given several stochastic observations of the relative abundances of N different components. Each of these observations may be represented as a vector \(\Gamma =\{s_1,s_2,\ldots s_N\}\). Our goal is to infer the most likely and least-biased inter-component relationships that give rise to these observations (see Fig. 1). The unique model with this property is provided by the principle of maximum entropy, which selects the model P that both maximizes the entropy \(S=-\sum _{\Gamma }P_{\Gamma }\log P_{\Gamma }\) and satisfies known constraints from the data. Here, the standard constraints are the estimated first and second moments, \(M_i=\langle s_i\rangle\) and \(\chi _{ij}=\langle s_is_j\rangle\) [26], as well as the special compositional constraint, \(\sum _i s_i=1\) (or \(100\%\)). The resulting solution \(P^{*}\), obtained through the method of Lagrange multipliers, is given by:

$$\begin{aligned} P_{\Gamma }^{*}&=Z^{-1}\exp \Bigg [\sum _{i}\Big (h_i+\frac{1}{2}\sum _{j\ne i}K_{ij}s_j\Big )s_i\Bigg ],\\ Z&=\int _{\sum _{i}^{N}s_i=1}\exp \Bigg [\sum _{i}\Big (h_i+\frac{1}{2}\sum _{j\ne i}K_{ij}s_j\Big )s_i\Bigg ]\,d\vec{s} \end{aligned}$$
(1)

Here \(h_i\) and \(K_{ij}\) enforce, respectively, the means \(M_i\) and the covariances \(\chi _{ij}-M_iM_j\). The normalizing constant Z is defined by an intractable integral over the simplex. Thus, the model parameters are found using an adapted pseudolikelihood approximation (see Methods: The simplex pseudolikelihood method). Finally, as \(\sum _i s_i=1\), several constraints are redundant. Thus, we set \(h_N=0\) and \(K_{ii}\) (\(i=1,2,\ldots N\)) to 0 (see Methods: Refining the maximum entropy parameters).
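For concreteness, the exponent of Eq 1 is a simple quadratic form and can be evaluated directly. A minimal sketch (the function and variable names are ours, not from a released implementation):

```python
import numpy as np

def cme_log_density(s, h, K):
    """Unnormalized log-density of the CME model (exponent of Eq 1).

    s : abundances on the simplex (sums to 1)
    h : per-component influence weights (h_N fixed to 0)
    K : symmetric interaction matrix with zero diagonal
    """
    assert np.isclose(s.sum(), 1.0)
    # sum_i (h_i + 1/2 sum_{j != i} K_ij s_j) s_i
    return float(h @ s + 0.5 * s @ K @ s)

# Illustrative example: three components with one positive interaction
h = np.array([0.5, 0.2, 0.0])
K = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
s = np.array([0.4, 0.4, 0.2])
logp = cme_log_density(s, h, K)
```

Because K is symmetric with a zero diagonal, the quadratic form \(\frac{1}{2}s^{\top }Ks\) reproduces the double sum in Eq 1 exactly.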

Fig. 1

The Compositional Maximum Entropy (CME) approach. a Through maximum entropy, CME infers the unknown generative model of the observed component abundances. b \(h_i\) embodies the influence of each component i. Components with large \(h_i\) tend to have higher abundances than those with small \(h_i\). \(K_{ij}\) embodies the interaction between each pair of components. Pairs with \(K_{ij}>0\) tend to coexist, while pairs with \(K_{ij}<0\) tend to be mutually exclusive

In summary, Eq 1 provides the Compositional Maximum Entropy (CME) model subject to known means and covariances. The CME method provides interpretable influence weights \(h_i\) for each component node i as well as the interaction strengths \(K_{ij}\) between each pair of components i and j. Below, we provide two proofs of principle of the method: a model of the abundances of co-evolving species and an analysis of gene expression data in cancer.

Quantifying competition among co-evolving species

The quantification of competition, whether among bacteria in the gut, among market forces in the economy, or even among scientists, is of course of great interest. A simple and widely-used mechanism is provided by the competitive Lotka-Volterra model (cLV), which describes the population dynamics (i.e., the abundances) of different species vying for a shared resource [27,28,29]. The population (\(\tilde{s}_i\)) of each species i depends on its growth rate \(r_i\) and interaction \(\alpha _{ij}\) with each other species j. Furthermore, the population of each type stops growing as it nears its carrying capacity \(\kappa _i\), representing the complete exhaustion of resources.

$$\begin{aligned} \frac{d\tilde{s}_i}{dt}=r_i\tilde{s}_i \times \Big (1-\frac{\sum _j \alpha _{ij}\tilde{s}_j}{\kappa _i}\Big ) \end{aligned}$$
(2)

While cLV remains a powerful model for predicting population dynamics, several challenges remain in calibrating it to experimental data. First, we are often only provided with relative (normalized) species abundances. Tools handling both this information loss and the resulting compositional data remain problematic [7, 30]. In addition, we rarely have access to the full time series [31]. Bacterial abundances, for example, are typically measured sparsely but across many different conditions and environments [7].

Here we show that CME can provide accurate quantitative estimates of inter-species interactions, as predicted by cLV, using only available experimental information. The simulated cLV abundances \(\tilde{s}_i\) are first normalized to resemble experimental data:

$$\begin{aligned} s_i=\frac{\tilde{s}_i}{\sum _j \tilde{s}_j}, \quad i=1,2,\ldots N \end{aligned}$$
(3)

The time-evolving relative abundances \(s_i(t)\) are then randomly sampled to apply CME. Compared to the cLV model, our proposed approach requires fewer parameters that are thus more resolvable from the limited available data [31].

cLV models exhibit three broad classes of stable inter-species behaviors: mutualism (they coexist), neutralism (they ignore each other), and competition (only one type can exist at a time) [30]. To illustrate these behaviors, we consider a cLV model of three different species with equal interactions \(\alpha _{ij}=\alpha\). Figure 2 shows the dynamics and abundance distributions for each of three different regimes: \(\alpha =0.6\) (mutualism, Fig. 2a), \(\alpha =1.2\) (neutralism, Fig. 2b), and \(\alpha =4.0\) (competition, Fig. 2c). For simplicity, \(r_i\), \(\kappa _i\), and the self-interactions \(\alpha _{ii}\) are fixed at 1. Gaussian noise was then added to the simulated dynamics to introduce additional inter-sample variability.
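The simulation just described can be reproduced by Euler integration of Eq 2 followed by the normalization of Eq 3. In this sketch the step size, horizon, and seed are illustrative choices, not the exact values behind Fig. 2:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_clv(alpha, s0, dt=0.01, steps=5000, noise=0.0):
    """Euler integration of the cLV dynamics (Eq 2) with r_i = kappa_i = 1,
    equal cross-interactions alpha, and self-interactions alpha_ii = 1."""
    n = len(s0)
    A = np.full((n, n), alpha)
    np.fill_diagonal(A, 1.0)
    traj = np.empty((steps, n))
    s = np.array(s0, dtype=float)
    for t in range(steps):
        s = s + dt * s * (1.0 - A @ s)          # Eq 2 with r = kappa = 1
        s = np.clip(s + noise * rng.normal(size=n), 1e-12, None)
        traj[t] = s
    # relative abundances (Eq 3): each observation lies on the simplex
    return traj / traj.sum(axis=1, keepdims=True)

rel = simulate_clv(alpha=0.6, s0=[0.2, 0.3, 0.5])   # mutualistic regime
```

In the mutualistic regime (\(\alpha =0.6\)) the three species coexist, so the relative abundances converge to 1/3 each; the trajectories for Fig. 2 would additionally be subsampled at random times before applying CME.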

The cLV model exhibits a sharp qualitative transformation in its abundance distribution, from a unimodal (Fig. 2a) to a trimodal (Fig. 2c) behavior [32], as \(\alpha\) is increased above a critical value (\(\alpha \approx 1.2\), Fig. 2b). Despite requiring fewer parameters (\(h=0\) and K, compared to the original four of cLV), CME (right) captures the cLV model behavior (center) across this transformation; \(K>0\) describes mutualism while \(K<0\) describes competition.

Fig. 2

Simulated abundances of three co-evolving species under mutualism (a), neutralism (b), and competition (c). Left, the cLV simulated abundances of each of the three interacting species over time. Center, the corresponding abundance distribution (cLV). Right, the best fit maximum entropy distribution (CME)

In summary, our model provides a simple, data-driven framework for modeling inter-species relationships from limited experimental information. We next consider the more complex case involving heterogeneous interactions from gene expression data in cancer.

Revealing driving interactions in cancer networks

Cancer is a heterogeneous disease involving complex molecular interactions between many genes. Despite the wealth of information provided by modern experimental tools, the application of such molecular data, including gene expression, to identify effective drug targets continues to face two significant obstacles. First, the accuracy of experimental expression profiles differs between genes [33]. Thus influences from biologically critical but more poorly resolved genes may be overlooked. Second, genes of typical interest often interact, and their effects overlap [34].

Novel network analysis techniques have been developed to refine the genetic signatures of critical genes in cancer. These approaches have been utilized to discover feedback structures in gene interaction networks, identify hubs and bridges, and define measures of robustness and fragility [35,36,37]. The Wasserstein distance from optimal transport forms the basis of such methodologies; beyond the above references, it has also been applied directly to the stationary (normalized) measures of the networks in question to derive biological information, e.g., showing that pediatric sarcoma data forms a unique cluster [38]. We will now show that CME may provide an important tool for such problems and help point to potential driver genes and their most important interactions.

To test our method, we analyze whole-genome expression data of triple-negative breast tumors, a highly aggressive and complex type of cancer. While many genes are known to be dysregulated in this disease, the relative influence of individual genes is far from established [39]. The data consist of expression profiles from 299 disease samples in METABRIC (Methods: The METABRIC dataset) [40]. For each sample, we obtained normalized weights for each of \(N=3147\) genes using the Human Protein Reference Database (HPRD) (Methods: Network identification) [41]. As most of these genes provide no signal in the data, we renormalized these weights after retaining only the 17 highest-variability genes with known relevance to cancer (according to OncoKB, see [42] and Methods: Data preparation). Figure 3 illustrates the known connectivity of these genes, with node size and color proportional to their inferred maximum entropy node weights (\(h_i\)). We immediately notice two key details. First, our genes of interest form a tightly connected network. Second, despite being highly correlated with each other (as the topology would suggest), these genes have unequal influences on the data. The highest-ranked genes, SRC and TP53, are also known master regulators of cancer [43, 44].

Fig. 3

Maximum entropy ranking of key genes in triple-negative breast cancer. Edges correspond to protein-protein interactions obtained from HPRD. Node color and size correspond to their influence (\(h_i\))

A major strength of maximum entropy methods is identifying key node-node interactions underlying the more complex covariances measured from data. This is illustrated in Fig. 4, which compares the maximum entropy pairwise interactions \(K_{ij}\) to those inferred from a widely-used alternative statistical model, the logit-normal distribution (Methods: Implementation of the logit-normal distribution) [1]. There is an identifiable mapping between the strongest-magnitude maximum entropy interactions (Fig. 4a) and their corresponding gene-gene covariances (Fig. 4c), in contrast to those obtained from the logit-normal (Fig. 4b).

We also note that the two top maximum entropy interactions alone (SRC/TP53 and BRCA1/PTPN11) provide an intuitive explanation for some of the key features of the data. SRC and TP53 maintain the critical balance between growth (SRC) and damage repair (TP53): enhanced SRC (or repressed TP53) promotes cell survival, growth, and metastasis, while the reverse leads to accelerated aging [43,44,45]. This known and critical negative interaction between SRC and TP53 separates most of the 17 genes into two distinct (and negatively covarying) clusters. Thus, since BRCA1 and PTPN11 belong to opposing clusters, their corrected interaction, as revealed by both maximum entropy and logit-normal modeling, is much larger than expected from their weak, positive covariance. Interestingly, both BRCA1 and PTPN11, along with SRC and TP53, are involved in the JAK-STAT pathway [46, 47]. Thus, these genes may have a general and synergistic role in cancer that remains to be explored.

Yet, while the logit-normal model does appear to resolve some features (such as the subtle covariance between AKT1 and EP300) that CME neglects, the interactions predicted by this method generally appear difficult to interpret in the context of the original covariance matrix: it predicts many interactions between uncorrelated genes and fails to resolve, among others, the clear negative covariance between SRC and TP53. Overall, the CME method provides a parsimonious biological mechanism, involving known cancer drivers and only a few of their interactions, for the genetic variability in this poorly understood disease.

Fig. 4

Comparison between three breast cancer network analyses: CME (a), logit-normal (b), and the data covariances (c). Maximum entropy and logit-normal results are shown on a log-scale to reveal the most influential positive (red) and negative (blue) interactions

Discussion

We have provided CME, a probabilistic framework for inferring the behaviors of compositional systems from data. Typically, models are deduced bottom-up, starting from mathematical relationships between individual components and often combining them in complex, nonlinear ways. However, as we have described for the Lotka-Volterra model, these interactions can rarely be resolved from the available experimental data. CME, instead, takes a top-down approach – starting from the data and learning the most parsimonious model for it. As evidenced by our breast cancer analysis, CME may also provide more interpretable insights into the organization of compositional systems.

Similar to partial correlational analysis [48,49,50], maximum entropy computes direct pairwise interactions by controlling for the confounding indirect effects of the other nodes. Despite being widely used in data analysis and machine learning, partial correlations are only appropriate for linear associations or Gaussian-like data [48]. Maximum entropy methods, such as our application to compositional data [49], are, by contrast, much more general.
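The partial correlations referred to above follow directly from the precision (inverse covariance) matrix. A minimal illustration for Gaussian-like data, where they are appropriate (the chain example and all names are ours):

```python
import numpy as np

def partial_correlations(X):
    """Partial correlation of each pair of columns of X, controlling for
    all remaining columns: rho_ij = -P_ij / sqrt(P_ii P_jj), with P the
    precision matrix."""
    P = np.linalg.inv(np.cov(X, rowvar=False))
    D = np.sqrt(np.outer(np.diag(P), np.diag(P)))
    R = -P / D
    np.fill_diagonal(R, 1.0)
    return R

# Chain x -> y -> z: x and z are marginally correlated, but their
# partial correlation given y vanishes (the indirect effect is removed).
rng = np.random.default_rng(2)
x = rng.normal(size=20000)
y = x + rng.normal(size=20000)
z = y + rng.normal(size=20000)
R = partial_correlations(np.column_stack([x, y, z]))
```

The same confound-removal logic underlies the maximum entropy interactions \(K_{ij}\), but without the Gaussianity restriction.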

For simplicity, we have considered only small networks; however, our method can be easily extended to much larger networks. First, the pseudolikelihood approach at the core of our method has been successfully applied, with the proper regularization, to networks consisting of thousands of nodes [51]. Second, the implementation of our algorithm uses a scalable L-BFGS algorithm and is fully parallelized across multiple CPU cores.

The principle of maximum entropy deduces the simplex-truncated normal distribution from the given first and second moment constraints. While such models have been previously studied in compositional data analysis [13], our approach provides two key advantages. First, maximum entropy can naturally incorporate more general model constraints including higher-order moments [26], more complex geometries [52], additional types of data [53], and domain-specific assumptions [1, 2]. Second, our simplex pseudolikelihood method provides consistent [54] and asymptotically efficient [55] parameter estimates and is asymptotically equivalent to maximum likelihood estimation [54]. Furthermore, a recent study demonstrates that score matching approaches can be viewed as approximations of pseudolikelihood [56], suggesting a relationship between our approach and [13] that could be explored in a future work.

Conclusion

We proposed CME, a data-driven framework for modeling compositions in multi-species networks. We utilize maximum entropy, a first-principles modeling approach, to learn influential nodes and their network connections using only the available experimental information. Our method requires minimal assumptions and no modifications of the experimental data. Furthermore, the method can be easily generalized to incorporate new types of constraints and data that may emerge.

Methods

The simplex pseudolikelihood method

Fitting maximum entropy models to data is generally computationally intractable. Thus, to fit CME, we will adapt the widely-used pseudolikelihood approximation [51]. This method requires two pieces of information. First, we need a formula to compute the conditional distribution \(P(s_i|s_{\sim i})\), where \(s_{\sim i}\) represents all of the variables of interest \(s_j\) (\(j=1,2,\ldots N-1\)) excluding \(s_i\) and \(s_N=1-\sum _{i=1}^{N-1}s_i\). For the simplex model, we have:

$$\begin{aligned} P(s_i|s_{\sim i})&=\tilde{Z}_i(s_{\sim i})^{-1}\exp \Bigg [\Big (\tilde{h}_i+\frac{1}{2}\tilde{K}_{ii}s_i+\sum _{j\ne i}\tilde{K}_{ij}s_j\Big )s_i\Bigg ] \end{aligned}$$
(4)
$$\begin{aligned} \tilde{Z}_i(s_{\sim i})&=\int _{0}^{1-\sum _{j\ne i}^{N-1}s_j}\exp \Bigg [\Big (\tilde{h}_i+\frac{1}{2}\tilde{K}_{ii}s_i+\sum _{j\ne i}\tilde{K}_{ij}s_j\Big )s_i\Bigg ]\,ds_i \end{aligned}$$
(5)
$$\begin{aligned} \tilde{h}_i&=h_i+\frac{1}{2}(K_{iN}+K_{Ni}), \quad \tilde{K}_{ij}=K_{ij}-K_{iN}-K_{Nj} \end{aligned}$$
(6)

Unlike the partition function of Eq 1, \(\tilde{Z}_i\) is a tractable Gaussian-like integral. However, its value is sample dependent. Thus, the second required piece of information is the actual samples of the component proportions \(s^d_i\) (\(d=1,2,\ldots D\)) rather than simply the summary means and covariances. Together these enable the maximization of the pseudolikelihood functions \(\ell _{PL}^i\) (see Methods: Model implementation):

$$\begin{aligned} \ell _{PL}^i=\tilde{h}_iM_i+\frac{1}{2}\tilde{K}_{ii}\chi _{ii}+\sum _{j\ne i}^{N-1} \tilde{K}_{ij}\chi _{ij}-D^{-1}\sum _{d=1}^D\log \tilde{Z}_i(s^d_{\sim i}) \end{aligned}$$
(7)
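As an illustration, Eq 7 can be evaluated directly, with each \(\tilde{Z}_i\) (Eq 5) computed by one-dimensional quadrature. This sketch is our own, not the authors' implementation; the grid size is an arbitrary choice:

```python
import numpy as np

def pseudolikelihood_i(h_t, K_t, i, S, grid=2001):
    """Pseudolikelihood ell_PL^i of Eq 7 for component i.

    h_t : (N-1,) tilde-h parameters;  K_t : (N-1, N-1) tilde-K parameters
    S   : (D, N-1) samples s^d over the first N-1 components
    """
    D, n = S.shape
    M = S.mean(axis=0)                      # first moments M_j
    chi = S.T @ S / D                       # second moments chi_jk
    others = [j for j in range(n) if j != i]
    # data-dependent terms of Eq 7
    ell = h_t[i] * M[i] + 0.5 * K_t[i, i] * chi[i, i]
    ell += sum(K_t[i, j] * chi[i, j] for j in others)
    # one tractable Gaussian-like integral per sample (Eq 5),
    # via trapezoidal quadrature over s_i in [0, 1 - sum_{j != i} s_j]
    log_Z = 0.0
    for d in range(D):
        a = h_t[i] + K_t[i, others] @ S[d, others]
        s = np.linspace(0.0, 1.0 - S[d, others].sum(), grid)
        f = np.exp((a + 0.5 * K_t[i, i] * s) * s)
        log_Z += np.log((f[:-1] + f[1:]).sum() * (s[1] - s[0]) / 2.0)
    return ell - log_Z / D
```

Each \(\ell _{PL}^i\) is then maximized over its tilde-parameters; in practice this objective would be paired with gradients (e.g. automatic differentiation) rather than evaluated alone.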

Refining the maximum entropy parameters

One challenge in modeling compositional data is handling the parameter redundancies induced by the compositional constraint \(\sum _i s_i=1\). Specifically, \(M_N=1-\sum _{i\ne N} M_i\) and \(\chi _{iN}=\chi _{Ni}=M_i-\sum _{j\ne N} \chi _{ij}\) are entirely determined from the other data constraints. We could set the associated Lagrange multipliers to 0, but this would hide information about node N (as all of its connections would be forced to 0).

Instead, we recover interpretable model parameters with the following transformations:

$$\begin{aligned} K_{ij}=\frac{1}{2}(\tilde{K}_{ij}+\tilde{K}_{ji}-\tilde{K}_{ii}-\tilde{K}_{jj}), \quad h_i=\tilde{h}_i-K_{iN} \end{aligned}$$
(8)

By forcing \(K_{ii}\) to be 0 in Eq 1, we can resolve the interaction strengths between all pairs of nodes in the data. For simplicity, we have defined \(h_N=0\). However, we can increase or decrease all \(h_i\) by any constant and still have an equally good fit. Thus we introduce another transformation to facilitate intra-model comparison of these node weights:

$$\begin{aligned} Q_i=\frac{e^{h_i}}{\sum _{i=1}^N e^{h_i}} \end{aligned}$$
(9)

Conceptually, \(Q_i\) gives the relative probability of observing a network configuration whose influence is dominated by node i (\(P^{*}(s_i=1)\)). We posit this as a useful metric for comparing compositional systems modeled under different conditions in future studies.
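The post-processing of Eqs 8 and 9 amounts to a few lines of array arithmetic. In this sketch we pad the tilde-matrix with a zero N-th row and column, which is our reading of Eq 8 rather than a detail stated in the text:

```python
import numpy as np

def recover_parameters(h_t, K_t):
    """Recover the symmetric interactions K and node weights h (Eq 8),
    then form the comparison weights Q (Eq 9).

    h_t : (N-1,) fitted tilde-h;  K_t : (N-1, N-1) fitted tilde-K.
    The N-th tilde row/column is taken as zero (our assumption)."""
    n = len(h_t) + 1
    Kt = np.zeros((n, n))
    Kt[:n-1, :n-1] = K_t
    d = np.diag(Kt)
    # K_ij = (Kt_ij + Kt_ji - Kt_ii - Kt_jj) / 2; diagonal vanishes exactly
    K = 0.5 * (Kt + Kt.T - d[:, None] - d[None, :])
    h = np.append(h_t, 0.0) - K[:, n-1]     # h_i = h_t_i - K_iN, h_N = 0
    Q = np.exp(h) / np.exp(h).sum()         # Eq 9 (softmax of the weights)
    return K, h, Q
```

Since Eq 9 is invariant to shifting all \(h_i\) by a constant, \(Q_i\) is well defined despite the gauge freedom noted above.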

Model implementation

To provide a high-accuracy, low-overhead approximate maximization of the CME log pseudolikelihood functions, we performed convex optimization using L-BFGS [57] augmented by automatic differentiation. To validate our method, we also designed a custom Monte-Carlo scheme to simulate from CME models. This scheme takes the fitted \(h_i\) and \(K_{ij}\) parameters and numerically estimates the corresponding means \(M_i\) and covariances \(\Sigma _{ij}=\chi _{ij}-M_iM_j\). In contrast to fitting CME, such simulation is prohibitively expensive for even moderately-sized, strongly-interacting networks. However, it enabled us to confirm the high accuracy of our model on our Lotka-Volterra simulations (see Fig. 5).
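For reference, a simple Metropolis scheme that respects the simplex constraint can serve as such a validation sampler. This is an illustrative stand-in, as the custom scheme itself is not specified here:

```python
import numpy as np

rng = np.random.default_rng(1)

def logp(s, h, K):
    # exponent of Eq 1 (symmetric K with zero diagonal)
    return h @ s + 0.5 * s @ K @ s

def sample_cme(h, K, n_samples=30000, step=0.1, burn=2000):
    """Metropolis sampler on the simplex: each proposal transfers mass
    between two random components, so sum_i s_i = 1 holds exactly."""
    n = len(h)
    s = np.full(n, 1.0 / n)
    out = np.empty((n_samples, n))
    for t in range(n_samples + burn):
        i, j = rng.choice(n, size=2, replace=False)
        prop = s.copy()
        delta = rng.uniform(-step, step)
        prop[i] += delta
        prop[j] -= delta
        # symmetric proposal: accept with min(1, P*(prop)/P*(s))
        if prop[i] >= 0 and prop[j] >= 0 and \
           np.log(rng.random()) < logp(prop, h, K) - logp(s, h, K):
            s = prop
        if t >= burn:
            out[t - burn] = s
    return out

# Sanity check: with h = 0 and K = 0 the model is uniform on the
# simplex, so the sampled means should approach 1/3 for three components
samples = sample_cme(np.zeros(3), np.zeros((3, 3)))
```

The estimated means and covariances of such samples can then be compared against the data moments, as in Fig. 5.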

Fig. 5

Comparison of CME model covariances to the sample covariances of the cLV model. We observe complete agreement between our model and the data (see Fig. 2), confirming the correctness of our maximum entropy fitting algorithm. The means, not shown, were all equal to 1/3 as expected

The METABRIC dataset

Microarray gene expression data for METABRIC were downloaded from the cBioPortal database [58, 59]. The METABRIC dataset, containing 1904 samples, is one of the most extensive publicly-available breast cancer studies [40]. We utilized microarray gene expression data containing 24368 genes from the 299 triple-negative samples.

Network identification

To quantify the (normalized) influence of genes relevant to triple-negative breast cancer, we utilized the method of network Markov chains [35,36,37]. The Human Protein Reference Database (HPRD) provides a curated interaction network of most human proteins [41]. Thus, to perform our analysis, we utilized the largest connected component, consisting of 3147 genes, obtained from the intersection of HPRD with the METABRIC gene list. We then performed network analysis as in [37] using the subset of 288 genes annotated in OncoKB, a curated database of prominent cancer genes [42].

Data preparation

For each sample, we obtain a measure of the relative influence of each of 288 genes. To identify potential drivers of the variability of these influences across the data, we computed their inter-sample Pearson correlations. We identified two distinct clusters of highly correlated genes: one containing a small number of immune-adjacent genes and the other, a much larger component, containing prominent breast cancer genes such as TP53 and BRCA1. Thus, we utilized only this second component for our analysis.

Our primary goal is to identify genes and their interactions that potentially drive the variability in treatment responses observed in triple-negative breast cancer [39]. Likely genes include only those with large influence and inter-subject variability. Upon computing the variance in the influence of each gene, we found 17 candidates with markedly higher variance than the remaining bulk. We thus renormalized node influence across these 17 prime candidates before performing our maximum entropy analysis.
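The selection step above can be sketched as a variance filter followed by row renormalization. Here `W`, the per-sample influence matrix, and the function name are our own notation:

```python
import numpy as np

def top_variance_renormalize(W, k=17):
    """Keep the k columns (genes) with highest inter-sample variance and
    renormalize each row (sample) so influences again sum to 1.

    W : (samples, genes) array of per-sample normalized influence weights.
    """
    idx = np.argsort(W.var(axis=0))[::-1][:k]   # top-k variance genes
    Wk = W[:, idx]
    return idx, Wk / Wk.sum(axis=1, keepdims=True)

# Illustrative random weights: 10 samples x 30 genes
rng = np.random.default_rng(3)
W = rng.random((10, 30))
idx, Wk = top_variance_renormalize(W)
```

The renormalized matrix `Wk` is again compositional (rows on the simplex), as required by the CME model.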

Implementation of the logit-normal distribution

An alternative to CME, the logit-normal distribution is given by [1]:

$$\begin{aligned} P_{\Gamma }=Z_{LN}^{-1}\frac{1}{\prod _{i=1}^Ns_i}e^{-\frac{1}{2}\Big \{\log \Big (\frac{s_{{\tilde{N}}}}{s_N}\Big ) -M_{LN}\Big \}^{\top }\Sigma _{LN}^{-1}\Big \{\log \Big (\frac{s_{{\tilde{N}}}}{s_N}\Big )-M_{LN}\Big \}} \end{aligned}$$
(10)

where \(M_{LN}\) and \(\Sigma _{LN}\) are the means and covariances of the transformed data: \(y=\Big [\log (\frac{s_1}{s_N}),\ldots , \log (\frac{s_{N-1}}{s_N})\Big ]\). Here, the feature of interest is the precision matrix \(K^{*}_{LN}=-\Sigma _{LN}^{-1}\) which, under fairly general circumstances, has been shown to approximate maximum entropy interactions [20]. As with CME, we then utilized Eq 8 to define symmetric interactions between all pairs of nodes rather than simply the first \(N-1\).
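For concreteness, the log-ratio transform and the precision-matrix feature \(K^{*}_{LN}\) can be computed in a few lines. A sketch under the assumption of strictly positive compositions (the function name is ours):

```python
import numpy as np

def logit_normal_interactions(S):
    """Additive log-ratio transform (last component as reference) and the
    negated precision matrix K* = -inv(Sigma_LN) of the transformed data.

    S : (samples, N) strictly positive compositions, rows summing to 1.
    """
    Y = np.log(S[:, :-1] / S[:, -1:])       # y_i = log(s_i / s_N)
    Sigma = np.cov(Y, rowvar=False)         # Sigma_LN
    return -np.linalg.inv(Sigma)            # K*_LN

# Illustrative compositional data drawn from a Dirichlet distribution
rng = np.random.default_rng(4)
S = rng.dirichlet([2.0, 2.0, 2.0], size=5000)
K_star = logit_normal_interactions(S)
```

Note that the raw \(K^{*}_{LN}\) only covers the first \(N-1\) components, which is why the symmetrization of Eq 8 is applied afterwards.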