Introduction

During the past decade, the field of human genetics has witnessed massive global improvements in the generation of high-resolution genomics data, as genotyping arrays and next-generation sequencing (NGS, whole genome [WGS] or whole exome sequencing [WES]) have become time- and cost-effective techniques to assist the genetic study of health and disease [1]. Similarly, transcriptomics and proteomics (and also epigenomics and metabolomics) studies have benefited from rapid advancements in the technologies and methods for data generation and analysis [2•]. In parallel, bioinformatics tools and pipelines that are accessible and shared throughout the wider scientific community, together with ever-improving computational environments, have supported an exponential growth in big data availability for basic and applied biomedical research [3••].

We are currently facing, probably for the first time in medical history, a paradoxical “abundance problem”, i.e. having more data at hand than we can ever interpret and effectively translate into medical practice. As with the Levinthal paradox on protein folding [4], the only way to tackle the current status quo is to move beyond the classical approach of analysing data points one by one and to change the paradigm in which biomedical research operates. Indeed, it appears that reductionist (classical) and holistic (novel) approaches can no longer be treated as separate fields and need to be considered on a convergent and cross-supportive path where systems biology, computational modelling, mathematics and informatics play a critical role [5•].

In the era of big data, it has become increasingly clear that advances can only result from collective efforts and data sharing [6]. Biomedicine has thus seen the rise of consortia, large-scale (mainly international) efforts aimed at sharing resources, maximizing both sample collection and data generation and harmonizing analytical strategies. In the field of neurodegeneration, examples include the International Genomics of Alzheimer’s Project (IGAP, http://web.pasteur-lille.fr/en/recherche/u744/igap/igap_download.php), the International Parkinson’s Disease Genomics Consortium (IPDGC, https://pdgenetics.org) and the International Frontotemporal Dementia Genomics Consortium (IFGC, https://ifgcsite.wordpress.com). These large-scale collaborative efforts are paving the way for a coherent understanding of the molecular mechanisms of complex neurodegenerative diseases. More generally, “resource” consortia, together with international working committees and open access databases, have been set up to promote international collaborations and to standardize nomenclature, data storage and sharing in line with the highest standards and best practices (Table 1).

Table 1 Open access, big data repositories

Complex Neurodegeneration and Network Analysis

In monogenic disorders, a mutation with high effect size in a specific gene acts as the pathogenic trigger, and the disease mechanism can be (directly) inferred through the functional analysis of that single mutated gene.

In the case of complex diseases, multiple genetic markers with small effect sizes contribute collectively to the trait. In complex neurodegenerative diseases, the genetic component for the majority of cases (sporadic) is indeed defined by a plethora of variants, i.e. the genetic architecture of disease, priming the individual to develop disease at a certain stage of life [30]. In a minority of complex neurodegeneration cases (familial), mutations in single genes can be isolated. Even if these mutations have strong causative effects, modifiers within the genetic architecture can modulate disease onset and progression. Reports of PSEN1 mutation carriers who are resistant to or show a delayed onset of Alzheimer’s disease (AD) due to their APOE genotype [31, 32], as well as the incomplete penetrance of LRRK2 mutations in families affected by Parkinson’s disease (PD) [33], are examples of how even seemingly monogenic cases of familial neurodegenerative diseases can indeed be classified as complex disorders. In addition to the genetic component, the environment also plays a role in complex disease pathogenesis, acting as an additional risk factor, e.g. inducing disease-relevant epigenetic changes and/or acting as a disease trigger on a receptive genetic background. The molecular mechanisms at the basis of complex neurodegenerative diseases are not straightforward to decipher, since the genetic architecture of risk is difficult to model and requires multiple causative markers to be analysed simultaneously (Fig. 1).

Fig. 1

The genetic architecture of disease can be graphically schematized by a “risk-barcode”, where each line represents a risk factor that can be either a genetic variant or an environmental exposure. Lines have different thicknesses to represent the different levels of contribution (strength or effect size) of each single component to the final disease risk. The principal problems in modelling the genetic architecture of risk with classical functional approaches are that (i) common risk factors are usually non-coding variants and thus not immediately associated with any specific gene; (ii) common risk factors have small effect sizes (strengths) that are likely to fall below the sensitivity threshold of common functional experiments; and (iii) modelling multiple risk factors concomitantly in the same model system has proven challenging and sometimes impractical

In this scenario, in silico systems biology approaches, for example network analysis, have the potential to revolutionise the translation of genetic information into a functional understanding of the molecular basis of disease. The availability of large sets of well-curated omics data and the development of bioinformatics approaches based on graph theory are opening up the possibility, for the first time, of studying complex diseases with a more holistic approach, simultaneously modelling the multiple genetic factors at play by studying networks [5•].

Networks, also called graphs, are mathematical objects that represent multiple data points as a whole. Networks are composed of nodes (the objects constituting the network) and edges (the connections between those objects). One can visualize biological networks by using freely available tools such as Cytoscape [34] (https://cytoscape.org) and yEd (https://www.yworks.com/products/yed), and study networks through the mathematical approaches offered by graph theory.
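As a minimal illustration of these concepts, the following Python sketch (assuming the networkx library; the gene names are purely illustrative and not taken from any specific dataset) builds a small graph where nodes are genes and edges are functional relationships:

```python
# A minimal sketch of a biological network as a graph object;
# gene names and edges are illustrative only.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("SNCA", "LRRK2"),   # each edge is a functional relationship
    ("LRRK2", "VPS35"),
    ("SNCA", "PRKN"),
])

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
print("Neighbours of LRRK2:", list(G.neighbors("LRRK2")))
```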

Transcriptomic or proteomic data (both steady-state and time-series) are used for building gene co-expression networks (GCNs), following the assumption that genes that are co-expressed are probably co-regulated and thus part of the same pathway [35]. The input dataset for GCNs needs to be statistically processed (different methods have been developed, such as WGCNA, CLR, ARACNe, PCIT, GENIE3, SIRENE and GeCON [36••]) to generate the co-expression information, i.e. the relationships that are essential for building edges.
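The statistical core shared by these methods can be sketched, in its simplest form, as thresholding a gene-gene correlation matrix. The toy example below (assuming numpy and networkx; simulated data and an arbitrary cutoff, not a reimplementation of WGCNA or the other published tools) shows the logic:

```python
# Toy sketch of GCN construction: connect gene pairs whose expression
# profiles correlate above a chosen cutoff. Data and cutoff are illustrative.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
expr = rng.normal(size=(100, 5))        # 100 samples x 5 genes (simulated)
genes = ["G1", "G2", "G3", "G4", "G5"]

corr = np.corrcoef(expr, rowvar=False)  # gene-gene Pearson correlations

gcn = nx.Graph()
gcn.add_nodes_from(genes)
cutoff = 0.7                            # arbitrary illustrative threshold
for i in range(len(genes)):
    for j in range(i + 1, len(genes)):
        if abs(corr[i, j]) >= cutoff:
            gcn.add_edge(genes[i], genes[j], weight=corr[i, j])

print("Co-expression edges:", list(gcn.edges))
```

Published methods differ in how they estimate and weight these relationships (e.g. soft thresholding in WGCNA, mutual information in CLR/ARACNe), but all ultimately translate statistical dependence into edges.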

Protein interaction data, derived from a wide range of cellular and biochemical model systems, can be used for building protein-protein interaction networks (PINs). Generating PINs is relatively straightforward, considering that the node-edge relationship (i.e. proteins and their interactions) directly reflects the type of information contained in the original datasets [37].

Finally, hybrid networks can be constructed by mixing different types of omics data. Gene regulatory networks (GRNs) are a type of complex network where nodes can be genes, proteins or metabolites. Pairs of nodes are connected by edges where one of the nodes in the pair influences (via inhibition or activation) the activity of the other [38]. The construction of GRNs is usually performed by applying statistical approaches based on inference algorithms (including Bayesian, artificial neural and Boolean networks, regression-based models, ordinary differential equations and information theory) [39, 40]. These methods are all aimed at extracting the probability of reciprocal regulation for all pairs of nodes within the large datasets used as input (e.g. gene expression, protein-DNA interactions, transcription factor binding), mathematically generating edges between nodes.
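To make the information-theory family of inference methods concrete (the idea underlying tools such as CLR and ARACNe), the sketch below scores each gene pair by the mutual information of their discretized expression profiles; high-scoring pairs become putative regulatory edges. This is a toy illustration assuming numpy and scikit-learn, with simulated data, not any of the published pipelines:

```python
# Toy mutual-information scoring of gene pairs for GRN edge inference.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(1)
expr = rng.normal(size=(200, 4))                     # 200 samples x 4 genes
expr[:, 1] = expr[:, 0] + rng.normal(0.3, size=200)  # gene 1 tracks gene 0

def mi(x, y, bins=8):
    """Mutual information between two expression vectors via binning."""
    cx = np.digitize(x, np.histogram_bin_edges(x, bins))
    cy = np.digitize(y, np.histogram_bin_edges(y, bins))
    return mutual_info_score(cx, cy)

edges = [(i, j, mi(expr[:, i], expr[:, j]))
         for i in range(4) for j in range(i + 1, 4)]
for i, j, score in sorted(edges, key=lambda e: -e[2]):
    print(f"gene{i} - gene{j}: MI = {score:.3f}")  # highest MI = best edge
```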

A key advantage of networks in an experimental context is that they are mathematical objects held together by connections (i.e. relationships) between the nodes. Networks therefore include multiple players (nodes) simultaneously, which are analysed by assessing their concurrent interactions within the global structure of the graph. This type of topological analysis is aimed at identifying relevant nodes and understanding how information flows throughout the entire structure of the network [41]. Relevant nodes are, for example, hubs (highly connected nodes), i.e. essential genes within the network structure [42], and bottlenecks (shortcuts), i.e. non-essential genes that can be targeted (e.g. by drugs) to modify the flow of information within the network [43]. Assuming that nodes are genes and/or proteins connected in the network through functional relationships, the information contained in the network is a powerful aid for the prediction of disease pathways, key functional players, candidate genes for rare variant discovery or sites for therapeutic intervention. In this respect, one of the underlying assumptions in network analysis is the “guilt by association” principle, whereby the function of a node is inferred from the functions of its connected nodes (neighbours) [44]. The “network parsimony principle” summarizes another important assumption used in network analysis, whereby the shortest path across (disease-)relevant nodes is taken to be indicative of the disease molecular pathway. An additional approach for identifying regions of the network relevant to understanding how the flux of information moves within the graph, and how this can be modified during disease, is the detection/analysis of both motifs (peculiar concatenations of nodes) [45] and modules (portions of the network identified as discrete clusters because of shared homogenous characteristics). This leads to another network analysis principle, the “local hypothesis”, whereby nodes involved in the same function (or disease) tend to share interactions and cluster within the same network module(s) [36••, 41].
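These topological notions map directly onto standard graph-theory measures: hubs correspond to high degree, bottlenecks to high betweenness centrality, and the parsimony principle to shortest paths. A minimal sketch (assuming networkx; the toy network and node names are illustrative):

```python
# Topological analysis sketch: hubs via degree, bottlenecks via
# betweenness centrality, parsimony via shortest paths.
import networkx as nx

G = nx.Graph([("A", "B"), ("A", "C"), ("A", "D"),
              ("D", "E"), ("E", "F"), ("F", "G"), ("F", "H")])

hubs = sorted(G.degree, key=lambda kv: -kv[1])[:2]  # most connected nodes
betweenness = nx.betweenness_centrality(G)          # information-flow control
path = nx.shortest_path(G, "B", "H")                # parsimony principle

print("Hub candidates:", hubs)
print("Top bottleneck:", max(betweenness, key=betweenness.get))
print("Shortest path B -> H:", path)
```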

Complex Neurodegenerative Diseases: Too Many Genes

As indicated above, in familial cases of complex neurodegenerative diseases, it is possible to identify mutation(s) with high effect size in so-called Mendelian gene(s). It is noteworthy that different/multiple genes can be isolated in familial cases, and that all of them contribute to the pathogenesis of the same disease. For example, familial PD strongly associates with mutations in at least 7 different genes [46]; in familial frontotemporal dementia (FTD), at least 10 different mutated genes are associated with disease (despite some of them being extremely rare within the FTD population) [47]. It follows that a number of challenging questions arise, e.g. why do many different (mutated) genes trigger a cascade of biological events that lead to the same clinical phenotype? One possibility is that, despite apparent differences, there is a limited number of common functions/pathways impacted in disease pathogenesis.

Classically, the effect of pathogenic mutations in familial genes has been investigated through knock-out/down models or in systems carrying one of the disease mutations (genetically modified models or patient-derived cells). Mutated genes have therefore mainly been studied in isolation; only rarely have mixed models been used to correlate the action of 2 or 3 genes. For example, LRRK2 (frequently mutated in familial PD) has been evaluated both in isolation (very frequently) and in hybrid models (rarely) in synergy with other familial PD genes such as SNCA [48] or VPS35 [49, 50], showing that these genes might indeed be part of communal molecular patterns of disease. It must be noted that this type of study can be expensive and technically challenging, as classical functional biology is not well equipped to model multiple genes at the same time. Similarly, there are many mouse models for AD developed by modifying only one single gene, while very few models are available as double transgenics (to study concomitant mutations in APP and PSEN1) or triple transgenics (to study concomitant mutations in APP, PSEN1 and MAPT) [51]. Network analysis has become an increasingly popular in silico approach to identify and prioritize communal pathways shared across “disease genes”, thus helping to shed light onto the molecular mechanisms of disease and assisting disease modelling. Results from network analyses still need confirmation in the functional environment; however, networks offer a time- and cost-effective approach to inform wet lab research. Specifically, networks allow for a more holistic support of disease modelling and help focus resources on the most promising functional targets.

Different network-based approaches have been developed; generally, they can be categorised into 2 major groups. The bottom-up group comprises those approaches that “build the network up” starting from the genes under investigation. Conversely, top-down methods build a larger, unbiased network in the first instance and then map the genes of interest onto it.

Our group has contributed to the bottom-up approaches by developing a pipeline named weighted protein-protein interaction network analysis (WPPINA); here, PPIs were used to build a multi-layer interactome for each of the familial genes for both FTD and PD. The single interactomes were subsequently merged into a final network (a familial network for PD and a familial network for FTD). Graph theory was applied to extract inter-interactome hubs (IIHs), i.e. those nodes responsible for maintaining graph cohesion. IIHs were then used to successfully identify communal (and discriminative) pathways to disease via functional and pathway enrichment [52, 53].
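The bottom-up logic can be sketched schematically as follows (assuming networkx; this is a toy illustration of the merge-and-overlap idea, not the published WPPINA pipeline, and the seed/protein names are hypothetical):

```python
# Schematic bottom-up sketch: merge per-seed interactomes and flag
# nodes shared across several of them as candidate inter-interactome hubs.
import networkx as nx

# Toy single-layer interactomes around two hypothetical familial seed genes
interactomes = {
    "SEED1": nx.Graph([("SEED1", "P1"), ("SEED1", "P2"), ("SEED1", "P3")]),
    "SEED2": nx.Graph([("SEED2", "P3"), ("SEED2", "P4"), ("SEED2", "P2")]),
}

merged = nx.compose_all(interactomes.values())  # the final familial network

# Candidate IIHs: non-seed nodes appearing in more than one interactome
counts = {}
for seed, g in interactomes.items():
    for node in g.nodes:
        if node not in interactomes:
            counts[node] = counts.get(node, 0) + 1
iihs = [n for n, c in counts.items() if c > 1]
print("Candidate IIHs:", iihs)
```

In the actual pipeline, such shared nodes would then be taken forward to functional and pathway enrichment.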

Dervishi et al. applied a similar bottom-up approach to the study of amyotrophic lateral sclerosis (ALS). After selecting a number of distinct seed genes associated with disease, they used protein interactions (through Ingenuity, QIAGEN Comp, LA, USA) to build an ALS network used to “suggest how different gene mutations converge into significant perturbations in protein interaction domains” [54]. Similarly, Beltran et al. applied the PIN approach through Ingenuity to an input set composed of copy number variations (CNVs) and additional genes differently associated with ALS. This was instrumental in identifying a number of core genes in ALS-associated subnetworks, disease pathways and mechanisms to be further functionally validated [55]. A top-down approach was used in AD by building a whole human PPI network to serve as background for inferring an AD-specific protein sub-network [56] for joint functional analysis of AD genes and prediction of additional AD gene candidates. Another top-down approach was investigated by Kahle et al., who first generated a PIN for ataxia genes. Subsequently, they integrated the literature-derived information with primary interaction data obtained experimentally for selected ataxia proteins. They parsed medical records of patients with ataxia to identify comorbidities and finally evaluated whether proteins implicated in the comorbid conditions were present within the ataxia interactome and how connections among these proteins were structured. Such a strategy was instrumental in shedding light onto the biological origin of the comorbidities and the mechanisms shared across diseases [57].

Ghiassian et al. investigated the concept of the “disease module”: by analysing how proteins involved in disease are typically linked together within the structure of the network, they developed a pipeline (DIAMOnD) to detect disease modules within PINs for specific disease phenotypes [58].

GCNs have been applied in the form of weighted gene co-expression network analysis (WGCNA [59]) to the study of familial FTD. Here, expression profiles from different disease-relevant regions of the brain were analysed, and gene co-expression was assessed through permutation. Clusters (modules) of highly co-expressed genes were identified, and familial genes for FTD were mapped onto those modules prior to topological and functional evaluation, highlighting the biological pathways impacted in different brain regions [60].

Gilman et al. [61] used a hybrid network approach (network-based analysis of genetic associations (NETBAG)) to simultaneously analyse all the genes affected by CNVs in autism in order to prioritize and suggest biological processes and pathways at the basis of the disorder. The hybrid network was built with the entire set of human genes as nodes, with edges (connectivity) based on shared Gene Ontology (GO) annotations, KEGG pathways, interaction partners and co-evolutionary patterns. Genes with autism-associated CNVs were then mapped onto the network and used to identify strongly connected clusters to be studied. This permitted the evaluation of the entire set of CNV alterations in one step and the assessment of their functional relevance in a genome-wide context.

Inferring Disease Genes from the Genetic Architecture of Risk

The genetic architecture at the basis of complex neurodegenerative diseases is difficult to model. Genome-wide association (GWA) analysis is, very frequently, the technique of choice to evaluate the contribution of small effect size variants (distributed across the entire genome) to a complex trait [62].

GWA findings have to be validated in model systems to provide functional information on the mechanisms leading to disease. However, the translation of genetics into a functional understanding of disease is challenging (Fig. 1). One of the issues with GWA-type studies is that signals do not necessarily pinpoint genes but rather regions of the genome (i.e. loci) that increase risk of disease; it is very difficult to determine which gene is actually modulated by the disease-associated risk variants and which biological function is altered and eventually responsible for disease. Historically, researchers have suggested the open reading frame (ORF) closest to the risk signal to be the associated causal gene; over the years, this has possibly generated type I errors in the interpretation of the identity of the actual gene(s) modulated by the variant. Therefore, more recently, many groups have been striving to establish pipelines to identify the real target genes of GWA variants.

Among the most successful approaches is the integration of genetic data with quantitative trait loci (QTL) data [63]. The most popular form of QTL is the expression QTL (eQTL), where the expression of cis-genes is assessed in relation to susceptibility markers (Fig. 2). Other QTL approaches have also been applied: splicing QTLs (sQTLs), to determine alterations in splicing induced by the risk variant; methylation QTLs (m-QTLs), to verify epigenetic changes in the methylation profiles of nearby genes; and protein QTLs (p-QTLs), where protein levels, rather than RNA levels (as per eQTLs), are evaluated. Clearly, not all GWA signals can be explained via QTL analysis. This is possibly due to the incompleteness of the omics databases needed for the analysis (tissue- and disease-specific data availability) or to methodological restrictions (e.g. in evaluating trans-QTLs [64,65,66] or QTLs resulting from a combination of multiple variants [67]). In this respect, again, network approaches have been proposed in combination with QTLs to improve the prioritization of modulated genes at GWA loci.
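The statistical core of an eQTL test can be illustrated with a simple regression of expression on genotype dosage (0, 1 or 2 copies of the risk allele), asking whether the slope differs from zero. The sketch below assumes numpy and scipy and uses simulated data with an artificial cis effect:

```python
# Toy eQTL test: regress gene expression on genotype dosage.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(2)
dosage = rng.integers(0, 3, size=300)                   # 0, 1 or 2 risk alleles
expression = 10 - 0.8 * dosage + rng.normal(size=300)   # simulated cis effect

fit = linregress(dosage, expression)
print(f"effect size (slope) = {fit.slope:.2f}, p = {fit.pvalue:.2e}")
```

Real eQTL analyses additionally adjust for covariates (ancestry, batch, hidden expression factors) and correct for the very large number of variant-gene pairs tested.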

Fig. 2

Examples of an eQTL and an sQTL. eQTL: the presence of the variant at the locus affects the amount of mRNA produced from gene A (in this case, a reduction); the variant directly affects the expression level of gene A. sQTL: the presence of the variant at the locus affects the splicing of gene C; in this case, the long isoform is no longer produced due to the presence of the variant

Our group applied PIN analysis to PD-GWA signals. We first identified pathways shared by familial genes for PD using a PIN; we then mapped the ORFs in linkage disequilibrium (LD) with the GWA risk variants onto those pathways and the PIN. The rationale was that those ORFs whose protein product was present in the network, and was involved in at least one of those pathways, were to be prioritized as candidate genes [30].
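Schematically, this prioritization reduces to a set intersection: an ORF at the locus is kept only if it is both in the network and annotated to at least one shared pathway. A toy sketch (all gene and pathway names are hypothetical, for illustration only):

```python
# Schematic candidate prioritization: intersect ORFs at a GWA locus
# with network membership and shared-pathway annotation.
network_nodes = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
shared_pathways = {
    "autophagy": {"GENE_A", "GENE_C"},
    "vesicle trafficking": {"GENE_B", "GENE_E"},
}
orfs_at_locus = {"GENE_A", "GENE_B", "GENE_X"}  # ORFs in LD with the signal

in_pathways = set().union(*shared_pathways.values())
candidates = orfs_at_locus & network_nodes & in_pathways
print("Prioritized candidates:", candidates)
```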

Alternatively, Voineagu et al. generated a GCN for autism using RNA profiling of post-mortem brain tissues, identifying 2 relevant modules, enriched in neuronal and glial markers, respectively, to be correlated with disease. Co-expression modules were then tested for enrichment of autism-associated signals, showing that GWA data converged on the neuronal module only [68]. With a similar approach, Seyfried et al. reported GCNs (obtained through WGCNA applied to proteomic profiling of human brain cortical tissue) descriptive of expression changes in both asymptomatic and symptomatic AD. GWA signals were first linked to ORFs through gene set analysis (thus generating a single p value for each ORF present at the risk loci). Then, significant ORFs were overlapped with GCN modules to infer specific pathways correlated with disease progression [69].

Future Directions

The pressing demand for approaches able to handle increasingly large (omics) and complex (multi-omics) sets of data has been the driving force behind the development of tailored network analyses in biomedicine. This is a consequence of networks being relatively simple yet powerful tools for biological data inference. Machine learning (ML) has started to support network analysis [70]. ML refers to a computational approach in which a machine is set up to recognize patterns in a dataset and to increase its accuracy by correction over process reiteration (learning) [71]. ML is used for building networks; many techniques for inferring edges in GRNs have been developed as ML approaches; for example, ML can power the identification of DNA patterns and transcription factor binding sites in large datasets. Alternatively, ML can be applied to the analysis of graphs. The potential of ML to efficiently detect recurring patterns of connections (motifs and network architecture) or to identify similarities leading to node segregation (clustering) is starting to be investigated [72]. For example, ML has been used to identify alterations in specific gene expression patterns indicative of candidate genes for cancer [73, 74] and to predict PPIs based on protein pair features [75], as well as for dimensionality reduction after GO functional enrichment.
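One simple instance of ML-driven node segregation is spectral clustering of a network's adjacency matrix into modules. The sketch below (assuming networkx, numpy and scikit-learn; the toy graph is illustrative) recovers two loosely bridged cliques as two clusters:

```python
# Toy ML-on-graphs example: spectral clustering of an adjacency matrix
# to segregate nodes into modules.
import networkx as nx
from sklearn.cluster import SpectralClustering

# Two cliques joined by a single weak bridge -> two expected modules
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (0, 2),   # module 1
                  (3, 4), (4, 5), (3, 5),   # module 2
                  (2, 3)])                  # bridge

A = nx.to_numpy_array(G)                    # adjacency as affinity matrix
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(A)
print(dict(zip(G.nodes, labels)))           # node -> module assignment
```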

Conclusions

The research community is witnessing a very productive moment in biomedicine: the amount of data being generated is growing exponentially, and many initiatives are taking place to improve the way we analyse data so as to extract biologically meaningful information to be translated for the benefit of medical practice. Of course, even if the computational power, the statistical approaches and the mathematics of graph theory are available, such a paradigm shift in basic and applied research is still in its infancy. There are still levels of complexity that need to be overcome; for example, networks are static rather than dynamic objects, whereas in the real biological context both edges and nodes can reconfigure themselves [76•], and many omics datasets still lack the critical cell-specificity information that would be necessary to draw more comprehensive functional conclusions. A specific initiative, the Dialogue for Reverse Engineering Assessment and Methodology (DREAM) challenge (http://dreamchallenges.org), was launched in 2006 as a crowdsourcing effort in which teams from all over the world compete to develop the best performing pipelines to address compelling big data problems in biomedicine. Analytical pipelines are being generated at a fast pace; however, these will need to stand the test of time. In particular, the next critical step will be validating the in silico findings, thus developing useful functional systems to model disease and highlighting efficient endpoints for therapeutic drug intervention.