Using biological networks to integrate, visualize and analyze genomics data
Network biology is a rapidly developing area of biomedical research and reflects the current view that complex phenotypes, such as disease susceptibility, are not the result of single gene mutations that act in isolation but are rather due to the perturbation of a gene’s network context. Understanding the topology of these molecular interaction networks and identifying the molecules that play central roles in their structure and regulation is a key to understanding complex systems. The falling cost of next-generation sequencing is now enabling researchers to routinely catalogue the molecular components of these networks at a genome-wide scale and over a large number of different conditions. In this review, we describe how to use publicly available bioinformatics tools to integrate genome-wide ‘omics’ data into a network of experimentally-supported molecular interactions. In addition, we describe how to visualize and analyze these networks to identify topological features of likely functional relevance, including network hubs, bottlenecks and modules. We show that network biology provides a powerful conceptual approach to integrate and find patterns in genome-wide genomic data but we also discuss the limitations and caveats of these methods, of which researchers adopting these methods must remain aware.
Keywords: Betweenness Centrality, Network Visualization, Random Walk Algorithm, Molecular Interaction Network, Bottleneck Node
Cellular processes are controlled and coordinated at multiple levels by tightly regulated transcriptional, post-transcriptional and post-translational molecular networks. Recent advances in, and the falling costs of, technologies such as next-generation sequencing (NGS) and mass spectrometry (MS) are enabling researchers to catalogue the component molecules of these networks at a genome-wide scale and under a large number of different experimental conditions (e.g. time points, cell types, stimuli and treatments). These high-throughput approaches typically result in one or more lists of genes or proteins (or other molecules such as lipids or metabolites) that are significantly altered, in their expression for example, at a specific time-point or condition. However, without further analysis, such lists are often of relatively limited use and fail to reveal the complex inter-relationships that may exist between molecules, their coordinated functions, and the emergent properties of the system. In this review, we discuss how researchers can move from gene lists to more systems-oriented analyses of their data, with a particular focus on using experimentally-supported molecular interaction networks. We discuss how to use publicly available bioinformatics tools and molecular interaction data to construct a network from a gene/protein list and explore how to subsequently visualize and analyze these networks to reveal new insights into the phenotype of interest at the systems level. We give examples of how such approaches are being applied in the literature, focusing particularly on examples of relevance to the animal functional genomics community.
Gene ontology and pathway analysis
As discussed above, the initial output of most genome-wide ‘omics’ experiments is a list of genes (or their products) that are significantly altered in the condition of interest. Typically, the first step in the investigation of these datasets is a functional enrichment analysis, which determines whether the list of genes is statistically enriched for certain biological processes or functions. The Gene Ontology (GO) consortium, for example, provides a controlled hierarchical vocabulary of terms for describing genes and their encoded products in terms of their molecular functions, biological processes or cellular components. A GO enrichment analysis can be undertaken using one of the many publicly available tools (http://geneontology.org/page/go-enrichment-analysis). These analyses examine the query gene list for GO terms that occur more often than expected by chance; it is essential to use an appropriate background or ‘universe’ when assessing statistical significance. Such over-represented terms may highlight previously unrecognised biological processes (as opposed to individual genes) that are preferentially and differentially regulated in the condition of interest. A feature of GO that is both a strength and a limitation is its hierarchical structure. Although efforts have been made to account for this structure in GO enrichment analyses, it can still be difficult to determine which level of the hierarchy is most responsible for the statistical enrichment. Often the most enriched terms are broad functional categories that offer little new functional insight.
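The over-representation test described above is commonly framed as a one-sided hypergeometric test. The sketch below is a minimal illustration under that assumption; the function name and all counts are hypothetical and are not taken from any particular GO tool.

```python
from math import comb


def enrichment_p_value(universe_size, annotated_in_universe, list_size, annotated_in_list):
    """Probability of seeing at least `annotated_in_list` annotated genes when
    `list_size` genes are drawn without replacement from the background universe."""
    total = comb(universe_size, list_size)
    upper = min(annotated_in_universe, list_size)
    tail = sum(
        comb(annotated_in_universe, k)
        * comb(universe_size - annotated_in_universe, list_size - k)
        for k in range(annotated_in_list, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20,000 background genes, 200 carry the GO term,
# and 8 of the 100 genes in the query list carry it.
p_value = enrichment_p_value(20000, 200, 100, 8)
print(f"one-sided enrichment p-value: {p_value:.3g}")
```

The choice of `universe_size` is the background issue raised in the text: using all annotated genes rather than only those measurable in the experiment changes the result. Because many terms are tested at once, p-values obtained this way would normally also be corrected for multiple testing.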
In cells, biological pathways are the biochemical engines that are responsible for the transduction of signals (often received by receptors) into output responses (e.g. activation of a transcription factor and downstream gene expression). An enrichment analysis based on pathway annotations can therefore contain information that is more directly relevant and interpretable regarding the important processes at play in a particular condition. A wide variety of pathway analysis methods are available, including over-representation methods such as those implemented in KEGG, Reactome, WikiPathways [7, 8], InnateDB, or DAVID; more quantitative methods based on gene set enrichment; and more recent methods that attempt to account for the fact that not all genes have the same power to distinguish between different pathways.
Although powerful, pathway analysis methods also have their limitations. First, the majority of genes have not been assigned to a canonical pathway (e.g., more than 85 % of human Ensembl genes are not mapped to any KEGG pathway), and second, for those that are, there is a heavy bias towards well-studied signalling pathways. Thus, pathway analysis can tell us a lot about what we already know but less about new and unexpected relationships between genes of interest or indeed between the pathways themselves.
Network biology is a rapidly developing area of research, which recognises that biological processes are not chiefly controlled by individual proteins or by discrete, unconnected linear pathways, but rather by a complex system-level network of molecular interactions . Understanding how these molecular interaction networks give rise to emergent biological processes and identifying the important nodes and other topological features, which are key to controlling them, are crucial to understanding complex phenotypes in health and disease. Network medicine theory also proposes that disease-associated phenotypes are not the result of single gene mutations acting in isolation but are rather due to the perturbation of a gene’s network context . Therefore, the elucidation of disease mechanisms and the development of effective therapeutic targets require a deep understanding of how molecular interaction networks are pathogenically dysregulated. In practice, network analysis can also be an extremely powerful and complementary approach to traditional enrichment analysis methods . Advantages of this approach include the fact that network-based analyses are both more data driven and also less constrained by the limits of current functional annotations, as proteome-scale maps of the interactome (the complete complement of molecular interactions within a biological system) are now available for several species, including humans . Because of this, network analyses are less biased towards well-studied pathways and have a far greater coverage of known genes and proteins.
The interactome may be intuitively represented and interpreted by constructing a graph or network, in which an entity (e.g. gene, transcript, protein, miRNA, or metabolite) is represented by a node and its relationships or interactions to other entities by a series of pairwise lines or edges between these nodes. Networks are not restricted to one type of entity (node type) or relationship (edge type) and are often used to visualize and interpret several types of molecules and their molecular relationships simultaneously (physical interaction, reaction, regulation, correlation, etc.). This allows a more complete and realistic representation of a biological system. Additional information associated with the nodes (e.g. gene expression data) or edges (e.g. a confidence score) can also be easily integrated via the use of node and edge attributes. Another advantage is that network/graph theory and supporting computational methods are well established in other domains, which has allowed for the rapid re-purposing and development of software to support network visualization and analysis in biology .
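As a rough illustration of this representation, the sketch below stores a small network as plain Python dictionaries, with node attributes (an expression change) and edge attributes (interaction type and confidence score). All identifiers and values are invented placeholders rather than real data.

```python
from collections import defaultdict

# Node attributes: placeholder identifiers with invented expression values.
node_attrs = {
    "GENE_A": {"molecule": "protein", "log2_fold_change": 2.1},
    "GENE_B": {"molecule": "protein", "log2_fold_change": -0.4},
    "GENE_C": {"molecule": "protein", "log2_fold_change": 1.3},
    "MIR_X": {"molecule": "miRNA", "log2_fold_change": -1.8},
}

# Edges of more than one type, each with its own attributes.
edges = [
    ("GENE_A", "GENE_B", {"interaction": "physical", "confidence": 0.9}),
    ("GENE_A", "GENE_C", {"interaction": "physical", "confidence": 0.6}),
    ("MIR_X", "GENE_A", {"interaction": "regulatory", "confidence": 0.7}),
]

# Adjacency map; every edge is stored in both directions, i.e. treated as
# undirected here, which is a simplification for regulatory edges.
adjacency = defaultdict(dict)
for source, target, attrs in edges:
    adjacency[source][target] = attrs
    adjacency[target][source] = attrs

# Node degree is the number of distinct interaction partners.
degree = {node: len(adjacency[node]) for node in node_attrs}
print(degree)
```

Dedicated graph libraries and tools such as Cytoscape offer the same ideas with far richer functionality; the dictionaries are used here only to keep the example self-contained.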
There are two broad approaches that one can adopt when performing network analysis on a gene list of interest. The first is to overlay the genome-wide ‘omics’ data (e.g. gene expression data) on a pre-established global network of experimentally-supported interactions (e.g. public protein–protein interactions (PPI)), while the second is to infer a network directly from the data generated in the experiment (for a review of these approaches see ). In this review, we focus largely on the former integrative method and discuss both the strengths and limitations of this approach.
Constructing a molecular interaction network from a list of genes
The first consideration when constructing a molecular interaction network from publicly available data is what type of interaction data to include in the network and where to source those data. A sometimes confusing plethora of molecular interaction databases is publicly available. Researchers need to be aware that not all of these databases contain the same type or quality of interaction data. Some databases, such as those that are members of the International Molecular Exchange (IMEx) Consortium, promote painstaking manual curation of experimentally-validated interaction data directly from the peer-reviewed biomedical literature. Other so-called ‘meta’-databases integrate and repackage interaction data from these primary sources and make them available through a single portal. Some databases also supplement experimentally-validated interaction information with computationally-predicted interactions. Although this practice is useful for enriching a sparse experimental interaction network, users need to know which edges in their network are predicted rather than experimentally supported. We also suggest that researchers compare results generated using an experimentally-validated network with those generated using a network that has been supplemented with computationally-predicted edges. Researchers should also note that primary interaction databases show limited overlap in the interaction information they provide. This is partly intentional, as developers of the IMEx databases take steps to avoid duplication of effort in their very labour-intensive manual curation processes. However, this also means that a lot of additional well-supported public interaction information will be ignored if interactions are sourced from one database only. Fortunately, web-services such as PSICQUIC are available to enable researchers to query multiple databases simultaneously, although, to date, the majority of papers reporting network analyses have not been so comprehensive.
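The point about limited overlap between databases can be made concrete with a small sketch that merges interaction lists from several sources into one non-redundant set while recording which sources support each pair. The source names and protein identifiers are hypothetical, and no real database or web-service is queried.

```python
def merge_interactions(records_by_source):
    """Combine undirected interaction pairs from several sources.

    Returns a dict that maps each unordered pair to the set of sources
    reporting it, so that (A, B) and (B, A) are counted once."""
    merged = {}
    for source, pairs in records_by_source.items():
        for first, second in pairs:
            key = tuple(sorted((first, second)))
            merged.setdefault(key, set()).add(source)
    return merged


# Hypothetical records from two sources with partial overlap.
records = {
    "source_one": [("P1", "P2"), ("P2", "P3")],
    "source_two": [("P2", "P1"), ("P3", "P4")],
}
merged = merge_interactions(records)
shared = [pair for pair, sources in merged.items() if len(sources) > 1]
print(len(merged), "unique interactions;", len(shared), "reported by both sources")
```

Tracking the supporting sources for each pair also makes it straightforward to rerun an analysis on, for example, only the interactions that come from experimentally-curated sources.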
Once all the experimentally-validated interactions that involve a given gene list (or their products) have been retrieved, there are some additional points to consider before proceeding to the downstream network analysis. First, the experimentally-validated interactions retrieved may be of several types, including physical (e.g. PPI or protein-DNA), regulatory (e.g. microRNA-mRNA), or biochemical interactions (e.g. phosphorylations). Although it may be valuable to integrate many types of interactions, one must proceed with caution since the meaning of an edge in such a network will vary substantially and this needs to be taken into account during subsequent analyses of the data. Physical PPI, for example, are usually undirected edges and may capture information regarding protein complexes, whereas biochemical interactions are usually directed (e.g. A phosphorylates B) and relate to a flow of signal information. Another consideration in the case of physical protein interactions that are determined by affinity purification coupled with mass spectrometry (AP-MS) , is that these methods usually cannot distinguish between direct and indirect interactors, although they are often represented as direct binary interactions in networks that are constructed using publicly available tools.
Another important consideration is the level of confidence associated with a particular molecular interaction, which may vary considerably, depending on how that interaction was experimentally determined. On the one hand, high-throughput approaches such as Yeast 2-Hybrid (Y2H) can be used to generate large amounts of data on the interactome, which are, however, often associated with relatively high false positive and false negative rates. On the other hand, interactions that are curated from more focused low-throughput studies described in the biomedical literature may have greater confidence but they can be biased towards well-studied pathways and biological processes. Several metrics have now been developed to provide an interaction confidence score and these scores can be reflected in networks using edge weights.
Finally, one must also bear in mind that the interactome retrieved from databases is a static snapshot of all known possible interactions for the given query list. Many of these interactions will be context-specific (e.g. occurring in a particular cell-type, or under specific conditions; or for a particular isoform of a protein ). Unfortunately, there is relatively little high-throughput context-specific interactome data in the literature and, thus, in molecular interaction databases. If analyses were restricted only to interactions that were context-specific (e.g. identified in the same cell-type), most of the data would be discarded. However, researchers can integrate other forms of external contextual information, such as gene or protein expression data, to select the most likely contextual sub-network of nodes and edges.
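One simple way to act on the last two considerations is sketched below: edges are kept only if both interacting genes are expressed in the condition of interest and the interaction confidence score passes a chosen threshold. The scores, the threshold and the set of expressed genes are illustrative assumptions, not recommended values.

```python
def contextual_subnetwork(edges, expressed_genes, min_confidence=0.0):
    """Keep edges whose two endpoints are both expressed and whose
    confidence score is at least `min_confidence`."""
    return [
        (first, second, score)
        for first, second, score in edges
        if first in expressed_genes
        and second in expressed_genes
        and score >= min_confidence
    ]


# Hypothetical static interactome with confidence scores as edge weights.
static_edges = [
    ("GENE_A", "GENE_B", 0.92),
    ("GENE_A", "GENE_C", 0.35),
    ("GENE_B", "GENE_D", 0.80),
]
# Hypothetical set of genes detected in the cell type of interest.
expressed = {"GENE_A", "GENE_B", "GENE_C"}

subnetwork = contextual_subnetwork(static_edges, expressed, min_confidence=0.5)
print(subnetwork)
```

Filtering of this kind only selects a plausible contextual sub-network; it does not show that the retained interactions actually occur in that context.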
Case study: constructing an experimentally-validated molecular interaction network using InnateDB.com
A limitation of using the PSICQUIC web-service to build an interaction network is that it is not particularly accessible for most biologists. Fortunately, there are several more user-friendly web-based platforms available. Here, we provide a case study that describes how to use tools available at InnateDB.com to build and visualize a network of experimentally-validated molecular interactions from a gene list [9, 13]. InnateDB is a comprehensive database that contains more than 300,000 experimentally-validated human, mouse and bovine molecular interactions and more than 3000 pathway annotations, integrated from major public molecular interaction and pathway databases. In addition to this integrated data, the InnateDB curation team has contextually annotated more than 25,000 innate immunity-relevant molecular interactions through their review of more than 5000 biomedical articles. Interactions in InnateDB are curated to MIMIx standards  with rich contextual annotations, including the supporting publication, participant molecules, species, interaction detection method, host system, interaction type, cell, cell-line and tissue types, etc., that are associated with each interaction. For more details on InnateDB curation of the innate immunity interactome, see . InnateDB is also an analysis platform that offers seamlessly-integrated, user-friendly bioinformatics tools, including pathway and ontology analysis, network visualization and analysis, and the ability to upload and analyze user-supplied gene expression data (or other forms of quantitative data) in a network and/or pathway context.
It is important to emphasise that InnateDB does not only contain interactions of relevance to innate immunity but, as mentioned above, is also a repository for the entire human and mouse interactomes. The bovine interactome is inferred largely via orthology with human and mouse genes. The limitations of using orthology to infer interologs have been discussed in some detail elsewhere, but there are few options for researchers working on agriculturally-relevant animal species for which little or no experimentally-validated interactome data are available. It should be noted that GO and pathway analyses share the same issues, as these species-specific annotations have also been largely inferred by orthology. Researchers who work on other mammalian species must map (by orthology) gene identifiers from their species of interest to the corresponding human/mouse gene IDs prior to using InnateDB. It is generally not recommended to attempt to infer interologs from more evolutionarily-distant species, since these interactions are much less likely to be conserved.
How to build a network using InnateDB.com
Network visualization and download
Inferring biologically important properties/features from networks
Constructing a network, while important, is only the first step of any network analysis. Without further investigation of network features (e.g. node degree or network modularity) and how these features potentially deviate from statistical expectation, building a network does little more than generate a pretty (or sometimes ugly) picture. Fortunately, numerous mathematical and computational approaches have been developed to analyze large networks to identify features of interest.
The distance between two nodes in a network can be measured by determining the minimum number of steps between them. Bottleneck nodes are defined as nodes with a high betweenness centrality (i.e. network nodes that have many “shortest paths” going through them). Bottleneck nodes play key roles in mediating communication within a given network because they facilitate information flow between modules (relatively densely connected sub-networks, see next section). Such nodes are therefore like chokepoints in the network and have been described as being analogous to major bridges and tunnels on a highway map. Disruption of a bottleneck can lead to network “traffic” chaos, since there are few or no alternative routes around the bottleneck. Bottleneck nodes have been found to be more highly correlated with essentiality than hub nodes and are also preferentially targeted by pathogens [44, 45]. It should be noted that the top hub and bottleneck nodes often tend to be very similar (Fig. 3). Lawless et al., for example, constructed a network of genes that were differentially expressed in monocytes isolated from milk at 36 h post-infection with Streptococcus uberis and showed that 85 % of the top 20 hub proteins in the network were also bottleneck nodes. Thus, it can often be difficult to assess whether a node is important because it is highly connected or because it is a bottleneck.
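To make the definition of betweenness centrality concrete, the sketch below computes it for an unweighted, undirected toy network using Brandes' algorithm, which is one standard way to count shortest paths and is not necessarily what any of the tools discussed in this review implement. The toy network, two triangles joined through a single node "x", is invented for illustration.

```python
from collections import defaultdict, deque


def betweenness_centrality(adjacency):
    """Unnormalised betweenness for an unweighted, undirected graph
    (Brandes' algorithm: one breadth-first search per source node)."""
    centrality = {node: 0.0 for node in adjacency}
    for source in adjacency:
        order = []  # nodes in order of non-decreasing distance
        predecessors = {node: [] for node in adjacency}
        n_paths = {node: 0 for node in adjacency}
        n_paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[node] + 1:
                    n_paths[neighbour] += n_paths[node]
                    predecessors[neighbour].append(node)
        dependency = {node: 0.0 for node in adjacency}
        for node in reversed(order):
            for pred in predecessors[node]:
                dependency[pred] += n_paths[pred] / n_paths[node] * (1 + dependency[node])
            if node != source:
                centrality[node] += dependency[node]
    # Each unordered pair was counted from both ends, so halve the totals.
    return {node: value / 2 for node, value in centrality.items()}


toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "x"),
             ("x", "d"), ("d", "e"), ("d", "f"), ("e", "f")]
toy_adjacency = defaultdict(set)
for first, second in toy_edges:
    toy_adjacency[first].add(second)
    toy_adjacency[second].add(first)

scores = betweenness_centrality(toy_adjacency)
degrees = {node: len(neighbours) for node, neighbours in toy_adjacency.items()}
print(scores["x"], degrees["x"], scores["c"], degrees["c"])
```

In this toy network "x" has only two interaction partners but the highest betweenness, because all nine shortest paths between the two triangles pass through it, whereas "c" has a higher degree but a lower betweenness. In real interaction networks the two rankings often overlap, as noted above.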
Bioinformatics apps to identify hubs, bottlenecks and modules
A wide variety of bioinformatics tools to quickly identify network hubs and bottlenecks are available. Some examples include the aforementioned NetworkAnalyst, a tool to support network-based gene expression meta-analyses . NetworkAnalyst imports a list of user-defined genes and associated interactions from InnateDB to calculate degree, betweenness centralities and functional modules in the network (see below for further discussion of network modules). The Cytoscape platform also provides an ecosystem of mainly third party Apps that can be used to undertake these and more advanced network analyses . One such App is cytoHubba, which can be used to identify hubs and bottlenecks in networks imported into Cytoscape . This can be used in conjunction with networks that are generated by using InnateDB, which can be downloaded in XGMML format and then imported into Cytoscape.
A variety of computational tools have also been developed to identify modules in networks. For a comprehensive review, we refer the reader to . Here we introduce some useful tools that represent a good starting point for a researcher who is new to this topic. NetworkAnalyst also contains more advanced network analysis features that can be used to identify potentially functionally relevant network modules. NetworkAnalyst uses a random walk algorithm to identify modules of frequently visited nodes. It can also generate an edge weighted network, in which weights are derived from quantitative node information, such as gene expression attributes . Cytoscape also provides a number of user-friendly applications for module detection, including jActiveModules , which identifies connected regions of a network that also show significant changes in gene expression.
However, if the aim is to find disease-associated modules, other algorithms may perform better, since it was recently reported that disease-associated proteins do not reside in particularly dense local communities and that disease-related nodes may be better predicted using connectivity significance (i.e. whether the number of connections from a candidate protein to other known disease “seed” proteins is greater than statistically expected by chance). The Disease Module Detection algorithm (DIAMOnD) is a new method to detect disease modules based on connectivity significance.
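The idea of connectivity significance can be sketched with a hypergeometric tail probability: given a candidate's degree, how likely is it to have at least the observed number of links to seed proteins if its partners were chosen at random? The code below is a simplified illustration of that idea only and is not the DIAMOnD implementation; the function names and the toy network are hypothetical.

```python
from collections import defaultdict
from math import comb


def connectivity_p_value(n_nodes, n_seeds, degree, links_to_seeds):
    """P(at least `links_to_seeds` of a candidate's `degree` partners are seeds),
    assuming partners are drawn at random from the other `n_nodes - 1` nodes."""
    others = n_nodes - 1
    upper = min(degree, n_seeds)
    tail = sum(
        comb(n_seeds, k) * comb(others - n_seeds, degree - k)
        for k in range(links_to_seeds, upper + 1)
    )
    return tail / comb(others, degree)


def rank_candidates(adjacency, seeds):
    """Rank non-seed nodes by connectivity p-value (most significant first)."""
    scores = {}
    for node, neighbours in adjacency.items():
        if node in seeds:
            continue
        links = sum(1 for neighbour in neighbours if neighbour in seeds)
        scores[node] = connectivity_p_value(len(adjacency), len(seeds), len(neighbours), links)
    return sorted(scores.items(), key=lambda item: item[1])


# Toy network: "c1" touches all three seeds; "hub" has many partners but one seed.
toy_edges = [("c1", "s1"), ("c1", "s2"), ("c1", "s3"), ("hub", "s1")]
toy_edges += [("hub", f"n{i}") for i in range(1, 7)]
adjacency = defaultdict(set)
for first, second in toy_edges:
    adjacency[first].add(second)
    adjacency[second].add(first)

ranking = rank_candidates(adjacency, seeds={"s1", "s2", "s3"})
print(ranking[:2])
```

Here the low-degree candidate "c1" ranks above the highly connected "hub", which illustrates why connectivity significance differs from simply counting links to seed proteins.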
Apart from the choice of a network analysis tool, researchers need to be aware that the incompleteness of the interactome limits which disease modules can be detected, and that there is a minimum threshold for the number of known disease-associated proteins to be able to detect modules associated with a disease of interest . Finally, it should be noted that the detection of hubs, bottlenecks and modules is only the tip of the iceberg when it comes to network analyses and further analyses should be driven by the research questions that are specific to each study.
Conclusions and further discussion
In this review, we introduce network analysis and show that it is a powerful tool to assist researchers in the interpretation, visualisation, and analysis of genome-wide ‘omics’ data. However, significant challenges remain to be addressed. Unlike mapping the genome of a species (although the genome can also vary considerably between individuals), mapping the protein interactome of a species is something of a fallacy. The interactome is a highly dynamic entity that depends on the temporal, spatial, cellular and environmental contexts. Fortunately, with advances in technology, we are now moving towards an era of dynamic interactome studies. Recently, for example, researchers have mapped the Hippo signalling pathway protein–protein interaction network in the presence and absence of inhibition of serine and threonine phosphatases, and revealed how changes in phosphorylation result in a significant re-wiring of the protein interactions between members of this pathway. Similarly, Jäger et al. have systematically determined the physical interactions of all 18 HIV-1 proteins and polyproteins with host proteins in two human cell lines (HEK293 and Jurkat) and showed that only about 40 % of interactions occurred in both cell types, which provides insight into just how different the interactome is likely to be in two different cell types. PPI networks are also likely to be substantially re-wired in diseases, with the effect of any given mutation rippling through the network and causing a re-wiring of proteins that otherwise carry no defects. Indeed, a recent study has shown that perturbation of protein–protein interactions is widespread in human genetic disorders. Investigating thousands of missense mutations, Sahni et al. showed that two-thirds of disease-associated alleles perturb protein–protein interactions.
Network re-wiring in different contexts will also change which topological and functional network features are important. Network re-wiring will likely have an impact on the top hub and bottleneck proteins, e.g. a hub node in a normal network may be less central in a disease-associated network and vice versa. Such re-wiring may also have an impact on the set of network modules that are identified in a disease network or in another phenotype of interest. Thus, an important focus for network biology will be to experimentally reconstruct and compare networks in normal and disease conditions to determine network features or components that are specifically associated with disease . An interesting future direction here is the question of how to target disease-associated networks for destruction while preserving normal network function .
Similarly, it will also be of significant interest to computationally model how network re-wiring may have an impact on how signals flow through the network and alter network outputs, such as the activation of differing transcriptional responses. Several approaches have been proposed to investigate how signals flow through large biological networks, in particular protein–protein interaction (PPI) networks, for which substantial amounts of data are publicly available [73, 74]. One promising approach is information flow analysis, a computational biology method that uses random walk algorithms to model how signals flow through large networks. One example of software that performs this type of analysis is ITM Probe , which is also available as a Cytoscape App . In ITM Probe, the user can define source nodes (nodes that emit information, e.g., receptors) and sink nodes (target nodes that absorb information, e.g., transcription factors). The algorithm then models information flow in a protein interaction network through discrete time random walks, where the walker has a certain probability to dissipate (i.e. to leave the network) at each step. Edge weight and interaction direction information can also be used to assign higher probabilities to certain paths through the network than others. The more times random walkers pass through a node, the higher the information flow score for that node will be. By altering the network between the source and sink nodes, one can computationally infer the impact of network re-wiring on information flow in the network.
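The random walk model described above can be sketched as follows. This is a simplified illustration written for this purpose and not the ITM Probe implementation: walkers start at the source nodes, move to a uniformly chosen neighbour at each step, are lost with a fixed dissipation probability, and are absorbed on reaching a sink. The function name, parameters and toy network are assumptions.

```python
def expected_visits(adjacency, sources, sinks, dissipation=0.15, steps=200):
    """Expected number of visits to each node by walkers emitted at `sources`.

    At each step a walker survives with probability (1 - dissipation) and moves
    to a uniformly chosen neighbour; walkers that reach a sink stop there."""
    current = {node: 0.0 for node in adjacency}
    for source in sources:
        current[source] = 1.0 / len(sources)
    visits = dict(current)
    for _ in range(steps):
        following = {node: 0.0 for node in adjacency}
        for node, mass in current.items():
            if mass == 0.0 or node in sinks or not adjacency[node]:
                continue  # nothing to propagate, or the walker is absorbed here
            share = mass * (1.0 - dissipation) / len(adjacency[node])
            for neighbour in adjacency[node]:
                following[neighbour] += share
        for node in adjacency:
            visits[node] += following[node]
        current = following
    return visits


# Toy network: receptor -> adaptor -> tf, plus a dead-end side branch.
toy_network = {
    "receptor": ["adaptor", "side"],
    "adaptor": ["receptor", "tf"],
    "side": ["receptor"],
    "tf": ["adaptor"],
    "isolated": [],
}
flow = expected_visits(toy_network, sources=["receptor"], sinks=["tf"])
print({node: round(score, 3) for node, score in flow.items()})
```

Removing or adding edges in `toy_network` and recomputing the scores gives a crude view of how re-wiring changes the flow reaching the sink. Edge weights and directions, which the text notes can also be used, are left out of this sketch.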
While experimentally reconstructing networks under different conditions is an important goal, this will remain costly and technically challenging for most research groups well into the future. Fortunately, by overlaying dynamic data that is more readily generated (e.g. gene expression data) onto experimentally-validated networks (e.g. PPI), one can already gain insight into which network features might be preferentially associated with disease or another phenotype of interest. For example, a static map of the interactome can show some hub nodes with large numbers of connections. However, proteins have a limited number of structural interfaces with which to engage in direct protein–protein interactions and cannot interact directly with so many partners at the same time . This has led to the classification of hubs as either “party” hubs, which interact with most of their partners simultaneously, or “date” hubs, which bind their different partners at different times or locations , although this classification remains hotly debated [79, 80]. Regardless of whether this is a useful classification or not, it is clear that if one takes multiple random lists of genes and builds a network, one will find that some nodes are always or frequently identified as hubs because they are highly connected in the database and not necessarily because they are relevant to the condition of interest. Therefore, it is important for the researcher to calculate statistical significance based on this background expectation (e.g. using a hypergeometric distribution test), in a manner similar to that described previously for functional enrichment analysis.
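The background expectation for hubs can also be estimated empirically, as an alternative to the hypergeometric test mentioned above: draw many random gene lists of the same size from the background and count how often a node is at least as connected as it was in the real list. The sketch below assumes a hypothetical background and interaction table; all names and numbers are illustrative.

```python
import random


def empirical_hub_p_value(adjacency, background, node, observed_degree,
                          list_size, n_samples=1000, seed=0):
    """Fraction of random gene lists in which `node` has at least
    `observed_degree` interaction partners (with a +1 pseudo-count)."""
    rng = random.Random(seed)
    pool = [gene for gene in background if gene != node]
    partners = adjacency.get(node, set())
    at_least = 0
    for _ in range(n_samples):
        sample = rng.sample(pool, list_size)
        degree = sum(1 for gene in sample if gene in partners)
        if degree >= observed_degree:
            at_least += 1
    return (at_least + 1) / (n_samples + 1)


# Hypothetical background of 50 genes; "STICKY" interacts with 40 of them in
# the database, "RARE" with only 2.
background = [f"G{i}" for i in range(50)] + ["STICKY", "RARE"]
adjacency = {
    "STICKY": {f"G{i}" for i in range(40)},
    "RARE": {"G0", "G1"},
}

# Suppose each node had 2 partners within a query list of 10 other genes.
p_sticky = empirical_hub_p_value(adjacency, background, "STICKY", 2, 10)
p_rare = empirical_hub_p_value(adjacency, background, "RARE", 2, 10)
print(p_sticky, p_rare)
```

A node such as "STICKY" reaches the observed degree in most random lists, so its connectivity says little about the condition of interest, whereas the same degree is much less expected for "RARE".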
In conclusion, networks provide a powerful conceptual approach to integrate and find patterns in genome-wide genomics data but researchers adopting these approaches need to be conscious of their limitations and caveats. In this review, we have mainly focused on PPI networks but a wide variety of other types of networks are becoming ever more prevalent in the scientific literature, including gene co-expression networks, transcriptional regulation networks, and metabolic networks [71, 81]. The great challenge will be to integrate these various types of networks into a universal network model of the cellular interactome.
TC, KB and DJL all contributed to the writing of this review article. All authors read and approved the final manuscript.
This work was funded in part by the European Union Seventh Framework Programme (FP7/2007-2013) PRIMES project under grant agreement number FP7-HEALTH-2011-278568. The Lynn Group is also supported by EMBL Australia. T.C. is supported by the Teagasc Walsh Fellowship scheme. This paper is part of the collection ‘ISAFG2015’ (6th International Symposium on Animal Functional Genomics, 27–29 July 2015, Piacenza, Italy). The publication of the papers in this collection was partly sponsored by OECD Co-operative Research Programme: Biological Resource Management for Sustainable Agricultural Systems (CRP). David Lynn’s participation in ISAFG2015 was financed by the OECD Co-operative Research Programme. The opinions expressed and arguments employed in this paper are the sole responsibility of the authors and do not necessarily reflect those of the OECD or of the governments of its Member countries.
The authors declare that they have no competing interests.
- 26. Villaveces JM, Jimenez RC, Porras P, Del-Toro N, Duesbury M, Dumousseau M, et al. Merging and scoring molecular interactions utilising existing community standards: tools, use-cases and a case study. Database (Oxford). 2015;2015:bau131.
- 69. Jager S, Cimermancic P, Gulbahce N, Johnson JR, McGovern KE, Clarke SC, et al. Global landscape of HIV-human protein complexes. Nature. 2012;481:365–70.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.