Background

Co-expression networks are graph-theoretical constructs that represent global regulatory interactions and expression patterns of genes. They are well-defined mathematical structures, amenable to systematic analysis of their global and local properties, as well as their dynamics and functionality. Co-expression networks related to complex phenotypes such as cancer have attracted considerable interest in recent years (De Craene and Berx 2013; Dang et al. 2006; de Anda-Jáuregui et al. 2016). Modular structure (Girvan and Newman 2002; Newman 2006; Palla et al. 2005; Palla et al. 2007) is a particularly relevant feature of co-expression networks, since it may provide clues about the actual biological mechanisms underlying complex phenotypes (Alcalá-Corona et al. 2016). In the case of breast cancer deregulation, functional biological organization has been shown to be related to network modularity (Alcalá-Corona et al. 2017; Alcalá-Corona et al. 2018). This community structure of gene organization is characteristic of the different breast cancer molecular subtypes (Alcalá-Corona et al. 2017), so that the particularities of the molecular phenotypes are well represented in the modular partition of the network (Alcalá-Corona et al. 2018).

Transcriptional co-expression networks can be probabilistically inferred from high-throughput gene expression data (Basso et al. 2005; Margolin et al. 2006; Hernández-Lemus and Siqueiros-García 2013; Hernández-Lemus and Rangel-Escareño 2011; Delgado and Gómez-Vela 2018; Kuzmanovski et al. 2018; Wong et al. 2018; Manem et al. 2018; Liu et al. 2018), and provide a representation of the expression landscape of a given phenotype. These types of regulatory networks consist of nodes representing genes and links representing co-expression (i.e. strong statistical dependency) between those genes. Given a co-expression network \({\mathcal {G}}\), composed of gene nodes and links representing co-expression between genes, it is possible to detect non-overlapping co-expression modules (communities) based on its topology.

The gene set of each module Mi may be tested for association with known gene sets of biological interest, such as biological functions, using enrichment analysis. These associations may be represented as a bipartite graph \({\mathcal {B}}\), with a set of module nodes M and a set of biological functions F, with links between modules and the functions in which they are enriched. With this in mind, it is possible to project \({\mathcal {G}}\) and \({\mathcal {B}}\) into two new graphs \({\mathcal {G}P}\) and \({\mathcal {B}P}\) (see Methods) in which nodes correspond to the modules detected in the original graph \({\mathcal {G}}\).

These two projections recover two distinct types of relationships between groups of genes: on the one hand, whether different groups of genes have a level of co-expression that may be driven by biological factors, such as co-regulation; and on the other hand, whether different groups of genes are involved in the control of biological functions that are necessary for a given biological context (for instance, a phenotype). An interesting case is that of modules that are co-regulated and connected through shared biological functions.

In this work, we analyze two co-expression networks derived from basal breast cancer (tumors) and healthy breast tissue (controls), and explore the two modular projections described above. We identify the differences in modular structure between the two phenotypes, and how these structures differ in terms of the two types of intermodular relationships that we have described.

Methods

Network inference

Co-expression networks were reconstructed from gene expression data. Basal breast cancer gene expression data, along with adjacent normal expression data, were obtained from the Cancer Genome Atlas (Cancer Genome Atlas Network 2012). Data acquisition and pre-processing are described in (Espinal-Enriquez et al. 2017). Briefly, we used 142 Basal-like subtype breast cancer samples, along with 101 solid-tissue adjacent normal samples. After removal of low-count transcripts (<5 per sample), 15,642 annotated genes were included in each sample. This set of unpaired data was pre-processed, normalized and bias-reduced, to obtain a comparable set of expression data for cancer and control samples.

Mutual Information (MI) was computed using an implementation of the ARACNE algorithm for all gene pairs (de Anda-Jáuregui et al. 2019). A suitable MI threshold was selected based on the following criteria:

  • At least 80% of nodes in the genome (out of 15,642) must be present in the network by being connected to at least one other gene

  • The network must have a giant connected component (i.e., the largest connected component with more than half of the nodes)

  • The highest (most restrictive) MI threshold must be selected

We evaluated different MI threshold values related to quantiles of the MI distribution. The resulting networks were imported into R as igraph objects. igraph version 0.71 and R version 3.5.1 were used.
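The selection procedure above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the igraph/R code used in this work. The function names, the input format (a list of gene pairs with their MI values) and the reading of the giant-component criterion as "more than half of the nodes present in the thresholded network" are assumptions made for the example.

```python
from collections import defaultdict


def largest_component_size(edges):
    """Size of the largest connected component induced by `edges` (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    sizes = defaultdict(int)
    for node in parent:
        sizes[find(node)] += 1
    return max(sizes.values(), default=0)


def select_threshold(mi_pairs, n_genes, candidates, min_coverage=0.8):
    """Return the highest candidate threshold whose network meets both criteria.

    mi_pairs:   iterable of (gene_a, gene_b, mi_value) tuples (assumed format)
    n_genes:    number of genes measured (15,642 in the text)
    candidates: candidate MI thresholds, e.g. quantiles of the MI distribution
    """
    mi_pairs = list(mi_pairs)
    for threshold in sorted(candidates, reverse=True):
        edges = [(a, b) for a, b, mi in mi_pairs if mi >= threshold]
        nodes = {gene for edge in edges for gene in edge}
        # Criterion 1: enough genes are connected to at least one other gene.
        covered = len(nodes) >= min_coverage * n_genes
        # Criterion 2 (assumed reading): the largest component holds more
        # than half of the nodes present in the thresholded network.
        giant = largest_component_size(edges) > len(nodes) / 2
        if covered and giant:
            return threshold  # criterion 3: first hit is the most restrictive
    return None
```

Scanning the candidates from the most to the least restrictive means that the first threshold satisfying both structural criteria is also the highest one, which is the third criterion.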

Mutual information is the maximum entropy/maximum likelihood estimate of statistical dependence between two random variables (Chow and Liu 1968). It is the Kullback-Leibler divergence between the joint probability distribution of two variables and the product of their marginals (Kullback and Leibler 1951) (i.e. the joint probability distribution under independence conditions), and it is symmetric in the two variables. Being a maximum entropy estimate, it requires the fewest assumptions about the probability distributions; indeed, the only assumption needed is that these distributions have compact support. Other correlation measures assume identically distributed variables, linearity, rank ordering among them, etc. Such assumptions are often at odds with the nature of gene expression data, which may exhibit nonlinearity, 'delays' (i.e. correlation shifts), and so on. For these reasons, mutual information has been widely used for the inference of (large) gene co-expression networks. Another advantage of using mutual information measures to deconvolute gene regulatory networks from massive gene expression data is that, in most cases (whenever Hammersley-Clifford conditions apply), the resulting graphs meet the requirements to belong to the family of Markov random fields, which may be quite useful in some scenarios (Dobruschin 1968).
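For two discrete variables, the definition above reads \(I(X;Y)=\sum_{x,y} p(x,y)\log \frac{p(x,y)}{p(x)p(y)}\). The following minimal Python sketch computes a plug-in estimate of this quantity from paired observations. It is only an illustration of the definition: it assumes that expression values have already been discretized (binned), and it is not the estimator implemented in ARACNE. The function name is ours.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y), in nats, from paired discrete observations."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    count_x = Counter(xs)         # marginal counts of x
    count_y = Counter(ys)         # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # One term of the Kullback-Leibler divergence between the joint
        # distribution and the product of the marginals:
        # p(x,y) * log( p(x,y) / (p(x) * p(y)) )
        mi += p_xy * math.log(p_xy * n * n / (count_x[x] * count_y[y]))
    return mi


# Toy example with hypothetical binned expression values for two genes
gene_a = [0, 0, 1, 1]
gene_b = [0, 0, 1, 1]
print(mutual_information(gene_a, gene_b))
```

Swapping the two arguments gives the same value, which is why a single undirected link per gene pair suffices in the co-expression network.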

The major drawback of the mutual information approach is that one needs a way to reconstruct the probability distributions from empirical data. Even under the relatively 'soft' conditions imposed by Glivenko-Cantelli convergence, this means that a reasonably large number of samples (more than approx. 100 in the case of gene expression data) is still required for the empirical distribution to be useful and to keep the number of false positives low. These conditions are fulfilled here.

Module detection and enrichment

Modules were detected using the Infomap (Rosvall and Bergstrom 2007; 2008; Rosvall et al. 2009) implementation for igraph, using 1000 iterations to achieve convergence. We chose the Infomap algorithm because it has proven to be highly efficient compared to other methods: in benchmarks, Infomap was the best-ranked method in runtime, accuracy and performance (Lancichinetti et al. 2009) when assessed on the LFR benchmark (Lancichinetti et al. 2008).

The field of enrichment analysis includes a wide variety of techniques (García-Campos et al. 2015). In this work we used Over-Representation Analysis, in which a hypergeometric (or Fisher exact) test is used to identify statistically significant associations between each module’s gene set and the sets of genes involved in biological functions as described by the Gene Ontology (GO) database (Ashburner et al. 2000).

Each module gene set was tested for enrichment of Gene Ontology (Ashburner et al. 2000) terms via hypergeometric testing using the HTSanalyzeR (Wang et al. 2011) package for R. GO terms were considered enriched if their Benjamini-Hochberg adjusted (Benjamini and Hochberg 1995) p-value was smaller than 0.05. The enrichment relationships found were represented as a bipartite network, with a layer of modules and a layer of GO terms.
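The over-representation test and the multiple-testing correction can be illustrated with the following minimal Python sketch. It uses only the standard library and does not reproduce the HTSanalyzeR implementation; the function names, the input format (GO terms as a dictionary of gene sets) and the choice of the measured genes as the universe are assumptions made for the example.

```python
from math import comb


def hypergeom_pvalue(universe_size, term_size, module_size, overlap):
    """P(X >= overlap) when module_size genes are drawn without replacement
    from a universe that contains term_size annotated genes."""
    upper = min(term_size, module_size)
    total = comb(universe_size, module_size)
    tail = sum(
        comb(term_size, k) * comb(universe_size - term_size, module_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values are monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enrich_module(module_genes, go_terms, universe, alpha=0.05):
    """Return {term: adjusted p-value} for the terms enriched in one module.

    go_terms: dict mapping a term label to a set of genes (assumed format)
    universe: set of all genes considered in the test (assumed background)
    """
    module = set(module_genes) & universe
    names = list(go_terms)
    pvalues = [
        hypergeom_pvalue(
            len(universe),
            len(go_terms[name] & universe),
            len(module),
            len(module & go_terms[name]),
        )
        for name in names
    ]
    adjusted = benjamini_hochberg(pvalues)
    return {name: q for name, q in zip(names, adjusted) if q < alpha}
```

Each (module, enriched term) pair returned by such a procedure corresponds to one link of the bipartite network.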

Figure 1 presents a pictorial abstraction of this process. Panel a represents module detection in \({\mathcal {G}}\) using Infomap. In panel b, the modules detected in panel a become nodes in the \({\mathcal {G}P}\) projection; links represent intermodule connections. Enrichment of the modules detected in panel a (i.e. the \({\mathcal {B}}\) network) is presented in panel c. The three modules are connected to turquoise diamonds, which represent biological processes associated with those modules. Panel d shows a projection \({\mathcal {B}P}\) of \({\mathcal {B}}\) in which nodes are modules, linked if they share a biological process.

Fig. 1

Graphical description of the workflow presented here. a Module detection in tumor and control networks using Infomap. In this figure, three modules are detected. b Modules detected in a become nodes in the \({\mathcal {G}P}\) projection; the links represent intermodule connections. c Enrichment of modules detected in a. In this case, the three modules are connected to light-blue diamonds, which represent biological processes associated with those modules. d Projection of c. In this final case, nodes are modules, linked if they share a biological process. Notice that the networks in b and d are not connected in the same way, despite having the same nodes

GP and BP projections

Topological and functional neighborhoods define two projections \({\mathcal {G}P}\) and \({\mathcal {B}P}\) as previously mentioned. The first projection, \({\mathcal {G}P}\), is a graph where nodes are modules M, and links exist between modules Mi and Mj if there are links in \({\mathcal {G}}\) between genes in Mi and genes in Mj: we say these modules are topologically adjacent in the original network.

The second projection, \({\mathcal {B}P}\), is a graph where nodes are modules M and links exist between Mi and Mj if there is overlap in the neighborhoods of Mi and Mj in \({\mathcal {B}}\): we say that these modules are functionally adjacent.
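The two definitions above can be sketched in a few lines of Python. This is a minimal illustration using the standard library rather than the igraph/R code used in this work; the function names and the input formats (a gene-to-module dictionary and a module-to-functions dictionary) are assumptions made for the example.

```python
from itertools import combinations


def gp_projection(gene_edges, membership):
    """Topological adjacency: modules Mi and Mj are linked if at least one
    gene-gene link of the original network joins a gene in Mi to a gene in Mj."""
    links = set()
    for u, v in gene_edges:
        module_u, module_v = membership[u], membership[v]
        if module_u != module_v:
            links.add(frozenset((module_u, module_v)))
    return links


def bp_projection(module_functions):
    """Functional adjacency: modules Mi and Mj are linked if their sets of
    enriched functions overlap. Returns {link: number of shared functions}."""
    links = {}
    for mi, mj in combinations(sorted(module_functions), 2):
        shared = module_functions[mi] & module_functions[mj]
        if shared:
            links[frozenset((mi, mj))] = len(shared)
    return links


# Toy example with hypothetical genes, modules and functions
membership = {"g1": "M1", "g2": "M1", "g3": "M2", "g4": "M3"}
gene_edges = [("g1", "g2"), ("g2", "g3"), ("g3", "g4")]
module_functions = {"M1": {"f1", "f2"}, "M2": {"f3"}, "M3": {"f2", "f4"}}

print(gp_projection(gene_edges, membership))
print(bp_projection(module_functions))
```

In this toy example the two projections share the same three nodes but not the same links: M1 and M3 are functionally adjacent without being topologically adjacent.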

Results

Co-expression networks for breast cancer and adjacent normal

Networks were generated from the tumor and control datasets. After scanning different threshold values for mutual information (see Additional File 1), the highest MI threshold that met our criteria was found at the 0.999 quantile. These networks are described in Additional File 2. Figure 2 illustrates how different the tumor and control networks are; nodes are colored by the module to which they belong. In the tumor network, modules are visible as groups of nodes of the same color, whereas in the control network modules are not discernible and colors are less separated. This is further supported by the different degree distributions (Fig. 3).

Fig. 2

Regulatory networks corresponding to the control a and tumor b phenotypes. Nodes are colored according to the module to which each node belongs. Notice that in b a visible modular structure appears, reinforced by the distribution of colors, whereas in a the network link distribution looks more homogeneous

Fig. 3

Degree distributions for both networks. Red dots form the histogram of the tumor network, whereas black dots correspond to the control network. Red dots appear to have two different regimes, with a crossover phenomenon. Black dots, on the other hand, appear to follow a power-law with a single scaling exponent

These differences are apparent in the node-degree distributions (Fig. 3; a χ2 test for differences between discrete distributions gave χ2-statistic = 1074170, p-value = 4.99e−10), as well as in the force-directed network visualization. The control network is characterized by a mono-scaled regime (the degree distribution appears to follow a power-law with a single scaling exponent over the whole range of degree values), whereas the tumor network presents an evident crossover phenomenon leading to multi-scaling, i.e. the degree distribution does not follow a power-law with a single scaling exponent, but rather seems to have several different scaling regimes, separated by so-called crossover regions that contain inflection points.
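As an illustration of how two degree distributions can be compared, the Python sketch below builds a degree histogram from an edge list and computes a Pearson χ2 statistic for two histograms. It implements one common variant (a test of homogeneity on a 2 × k table) and is not necessarily the procedure that produced the values reported here; it returns only the statistic and the degrees of freedom, not a p-value. The function names are ours.

```python
from collections import Counter


def degree_counts(edges):
    """Map each degree to the number of nodes with that degree
    (undirected edge list, each link listed once)."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


def chi_square_homogeneity(counts_a, counts_b):
    """Pearson chi-square statistic and degrees of freedom for two histograms
    given as {bin: count} dictionaries."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both histograms must contain observations")
    # Keep only bins observed in at least one of the two histograms.
    bins = sorted(
        b for b in set(counts_a) | set(counts_b)
        if counts_a.get(b, 0) + counts_b.get(b, 0) > 0
    )
    grand_total = total_a + total_b
    statistic = 0.0
    for b in bins:
        column_total = counts_a.get(b, 0) + counts_b.get(b, 0)
        for observed, row_total in (
            (counts_a.get(b, 0), total_a),
            (counts_b.get(b, 0), total_b),
        ):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic, len(bins) - 1
```

The same function can be applied to any pair of discrete histograms, for instance module-size distributions.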

Modular structure of breast cancer and healthy breast networks

Distinct modular structures were found in each network, in agreement with previous results from our group (Alcalá-Corona et al. 2018). The partition of the tumor network has a smaller description length L (Rosvall and Bergstrom 2008) (L=8.268641) than that of the control network (L=11.80941). In the control network we identified 981 modules, whereas in the tumor network we found 910 modules. Consistent with Figs. 2 and 3, the histograms of module sizes in Fig. 4 show that the largest modules are found in the control network (a χ2 test for differences between discrete distributions gave χ2-statistic = 40324.45, p-value = 4.99e−10).

Fig. 4

Histograms of module sizes in tumor and control networks. As can be observed, the largest modules correspond to the control network (black dots in the upper left part of the figure). Also notice the different concavities of the red and black curves

For each transcriptional co-expression network, we projected the modules identified in it onto a \({\mathcal {G}P}\) network in which modules that are adjacent in the original network are linked. These projections, \({\mathcal {G}P}_{t}\) (for tumors) and \({\mathcal {G}P}_{c}\) (for controls), are depicted in Fig. 5 and described in Additional file 3 (modular projection parameters). Three main differences between these networks may be observed: i) a characteristic degree distribution for each projection (Fig. 6; a χ2 test for differences between discrete distributions gave χ2-statistic = 49532.28, p-value = 4.99e−10), ii) the higher edge density in \({\mathcal {G}P}_{c}\), which is also related to iii) the higher link/node ratio in \({\mathcal {G}P}_{c}\).

Fig. 5

The modular network structure in tumor and control. In this case, nodes are modules and the connections represent links between genes of different modules in the original network. a Control module network. b Tumor module network

Fig. 6

Degree distributions for the module networks of Fig. 5. Red dots represent the tumor network, whereas black dots represent controls

Functional Enrichment

We identified a set of biological functions described as GO terms associated to modules detected in the tumor and control networks. We represented these functional associations as bipartite graphs \({\mathcal {B}}_{t}\) (for tumors) and \({\mathcal {B}}_{c}\) (for controls) that are represented in Fig. 7a and b, with parameters described in Additional file 4.

Fig. 7

Bipartite graph of GO term enrichment in network modules. a Control network. b Tumor network. In both networks, grey diamonds represent the modules that have enriched GO terms. Colored circles represent GO categories enriched for the linked modules. In some cases, GO categories are connected to more than one module. Colors of GO categories represent a higher category to which each GO term belongs. Colors in a and b are not related. Notice that the categories in a are mainly related to maintenance, whereas in b (tumor bipartite network) the majority of categories are related to immunity, a well-known hallmark of cancer

We identified 665 GO terms associated with \({\mathcal {M}}_{t}\) and 827 GO terms associated with \({\mathcal {M}}_{c}\). It is important to notice that not all modules were enriched in biological processes; in fact, only 110 enriched modules were found in \({\mathcal {B}}_{t}\) and 82 in \({\mathcal {B}}_{c}\). Furthermore, the sets of enriched GO terms in \({\mathcal {B}}_{c}\) and \({\mathcal {B}}_{t}\) are different (with a Jaccard index of 0.34).
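For reference, the Jaccard index of two sets A and B is \(|A\cap B|/|A\cup B|\). The following minimal Python sketch computes it, together with the intersection size implied by a given index. The function names are ours, and the back-calculation assumes that the index was computed on exactly the two GO term sets whose sizes are reported above.

```python
def jaccard_index(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def implied_intersection(size_a, size_b, jaccard):
    """Solve J = x / (size_a + size_b - x) for the intersection size x."""
    return jaccard * (size_a + size_b) / (1.0 + jaccard)


# Toy example with hypothetical term labels
print(jaccard_index({"t1", "t2", "t3"}, {"t2", "t3", "t4"}))

# Set sizes and index as reported in the text; the result is approximate
# because the reported index is rounded to two decimals.
print(round(implied_intersection(665, 827, 0.34)))
```

An index well below 1 means that most enriched terms are specific to one of the two phenotypes.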

The projections of modules based on functional adjacency, \({\mathcal {B}P}_{c}\) and \({\mathcal {B}P}_{t}\), are shown in Fig. 8a and b. In the figure, modules are connected if they share at least one enriched process. Node size represents the module degree. Edge width is proportional to the number of enriched processes shared between modules. In both cases there are some modules that share several enriched processes. In \({\mathcal {B}P}_{t}\) (Fig. 8b), there are clusters of modules sharing GO terms, whereas in \({\mathcal {B}P}_{c}\) (Fig. 8a) the compartmentalization is less evident.

Fig. 8

Projection of breast cancer modules linked by shared enriched GO terms. a Control projection. b Tumor projection. Modules that show enrichment but are not connected to others through shared enriched GO terms (29 in control, 40 in tumors) are not shown

Additional file 5 shows some of the relevant parameters for these projections. It may be observed that these projections are very sparse in terms of edges: only 51 modules of \({\mathcal {M}}_{c}\) are connected to other modules, whereas for \({\mathcal {M}}_{t}\) the number is 70. Importantly, there are modules in the tumor and control networks (40 and 29, respectively) that are associated with GO terms not shared with any other module.

Discussion

Most central modules in the GP projection are the largest ones

The most central modules in both the \({\mathcal {G}P}_{c}\) and \({\mathcal {G}P}_{t}\) projections are also the largest ones. In \({\mathcal {G}P}_{t}\), this central module has 231 genes and 5437 intra-modular links. It is connected to 99 other modules. The most central module in \({\mathcal {G}P}_{c}\) has 1000 genes and 17,583 intra-modular links. It is connected to 742 other modules.

Interestingly, such highly central modules are not particularly notable in terms of their functional associations. In controls, the largest module is enriched in 6 processes of nucleic acid regulation; it is linked through processes (i.e. in \({\mathcal {B}P}_{c}\)) to 6 other modules. For tumors, the largest module shows no statistically significant enrichment, and is therefore not linked to any other module in the \({\mathcal {B}P}_{t}\) projection.

Functional compartmentalization in health and disease

The bipartite graphs \({\mathcal {B}}\) are topologically similar between tumor and control; however, the enriched functions in each network are different. In both cases, the structures show star-like motifs (Fig. 7a and b), which indicate mostly unique processes associated with a given gene module. We interpret this as evidence of compartmentalization of regulation, where each module controls the activity of an independent set of biological processes.

We observe important differences in the biological processes associated with the most connected (i.e., most enriched) modules. The two most connected modules in \({\mathcal {B}}_{c}\) (with 146 and 121 neighbors, respectively) are associated with metabolism and cell cycle processes, as illustrated in Fig. 7a as well as in Additional files 6 and 7; in contrast, the two most connected modules in \({\mathcal {B}}_{t}\) (with 95 and 81 neighbors, respectively) are associated with immunity-related processes, which we illustrate in Additional files 8 and 9.

As may be observed, the processes associated with \({\mathcal {B}}_{c}\) are related to maintenance, whereas the processes associated with \({\mathcal {B}}_{t}\) are well-known hallmarks of cancer (Hanahan and Weinberg 2011). The identification of hallmark processes in breast cancer co-expression networks derived from high-throughput data is consistent with recent reports by our own group.

Most connected modules through functional adjacency are similar in health and disease

The way modules are connected through processes is similar between health and disease, even though the modules and functions are different. The most enriched modules are not, however, the ones most connected to other modules in terms of functional adjacency. The most connected modules in the \({\mathcal {B}P}\) projections for controls and tumors are of comparable sizes: 86 and 74 genes, with 356 and 361 intra-modular links, respectively. In controls, this module is enriched in 20 processes. Through these processes, it is connected to 18 modules. It is also connected through co-expression links, as seen in the \({\mathcal {G}P}_{c}\) projection, to 123 other modules. Meanwhile, the comparable module in tumors is enriched in 81 different processes, but through these is linked to only 20 other modules. Through co-expression links, it is connected to 123 other modules. Interestingly, there is again little overlap in the processes associated with these modules, which share only one function, Membrane protein complex, a general homeostatic event.

Connections between modules through functional and topological adjacency are seldom found

By comparing the sets of links in the \({\mathcal {G}P}\) and \({\mathcal {B}P}\) projections, we may observe that very few links between modules appear in both projections. In the case of tumors, \({\mathcal {G}P}_{t}\) and \({\mathcal {B}P}_{t}\) have 37 shared links (Additional file 9), whereas in controls, \({\mathcal {G}P}_{c}\) and \({\mathcal {B}P}_{c}\) have 51 shared links. As such, we may observe that, both in health and disease, the connectivity patterns among gene modules in terms of co-expression and functionality are quite different.
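Counting the links that appear in both projections amounts to intersecting two sets of undirected module pairs. A minimal Python sketch is given below; the function names and the representation of links as pairs of module labels are assumptions made for the example.

```python
def normalize(edges):
    """Treat (Mi, Mj) and (Mj, Mi) as the same undirected link
    and drop self-loops."""
    return {frozenset(edge) for edge in edges if edge[0] != edge[1]}


def shared_links(gp_edges, bp_edges):
    """Links between modules that are present in both projections."""
    return normalize(gp_edges) & normalize(bp_edges)


# Toy example with hypothetical module labels
gp_edges = [("M1", "M2"), ("M2", "M3"), ("M3", "M4")]
bp_edges = [("M2", "M1"), ("M1", "M3")]

common = shared_links(gp_edges, bp_edges)
print(len(common), "shared link(s) out of",
      len(normalize(gp_edges)), "GP links and",
      len(normalize(bp_edges)), "BP links")
```

Using unordered pairs avoids missing a shared link simply because the two projections list its endpoints in a different order.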

Conclusions

Networks of gene regulation are known to exhibit modular behavior. The co-expression of gene modules is one way in which cellular processes are regulated. In this work, we show that modules in transcriptional co-expression networks interact in different ways, either through co-expression or through jointly regulating functional processes. There are instances in which modules are connected both transcriptionally and functionally, but these are rare. Transcriptional co-expression networks of cancer have a more modular structure than those found in health. Modules found in the health network have higher degrees, whereas modules in the breast cancer network are less likely to have transcriptional relationships to other modules.

We observe that the sets of biological functions associated with gene modules are vastly different in breast cancer and health: gene modules in cancer are associated with functions that drive disease, whereas gene modules in health are linked to functions associated with the maintenance of homeostasis. However, the connectivity patterns formed by associations of gene modules and biological functions are similar in both health and disease, which indicates that compartmentalization of functional regulation through gene expression remains, even though the processes being regulated change.

The transcriptional and functional connectivity behaviors that gene modules exhibit in transcriptional co-expression networks may allow the identification of modules that are important, either transcriptionally or functionally, in biological conditions of interest, such as cancer.

Available code

All the code used for the present work is available in our repository: https://github.com/guillermodeandajauregui/ModuleEnrichmentAndProjection.