1 Introduction

1.1 Genomic variations

Genomic variations (polymorphisms) play a significant role in defining phenotypic differences in any given population. These variations can be categorized as neutral, beneficial, or deleterious based on their direct effects on the phenotype. They can be further classified by the number of alleles present, i.e., bi-allelic or tri-allelic, with two or three types of nucleotides at the same position. Bi-allelic sites are usually far more abundant than tri-allelic sites and are therefore widely used in evolutionary and phylogenetic analysis (Adelson et al. 2019). The emergence of next-generation sequencing (NGS) data has revolutionized genomics studies. However, with larger genomes, even minute sequencing errors can result in false variant calls. Errors can also arise while analyzing the collected data, for example from contamination, degradation, missing private variations for priming, and machine failure (Robasky et al. 2014). One could apply Sanger sequencing (Mu et al. 2016), sample replicates (Kamps-Hughes et al. 2018), and reference sequences (Hardwick et al. 2017) to validate the variant calling. Sequencing errors are also easier to avoid in smaller genomes, such as viral genomes, most bacterial genomes, and extrachromosomal plasmids or mitochondrial genomes, because the smaller size allows higher sequencing accuracy. In humans, the nuclear genome is vast and often disintegrates during isolation. In contrast, with its high copy number and small size, the mitochondrial genome has become a favored tool for molecular, phylogenetic, and evolutionary studies. The human mitochondrial genome consists of 16,569 nucleotides and encodes 13 protein-coding genes, 2 rRNAs, and 22 tRNAs. The 13 protein-coding genes are part of the mitochondrial OXPHOS pathway (Wallace 2013). Humans originated in Africa and the first haplogroup established was the macrohaplogroup L (Wallace 2015). 
From this macrohaplogroup, L3 originated 65,000–70,000 years ago and gave rise to two major haplogroups, M and N, which later migrated out of Africa and populated the rest of the world (Wallace 2015). Various different haplogroups were founded by purifying selection of specific mitochondrial variations and enriched at regional levels (Mishmar et al. 2003). Apart from identifying the haplogroups and migration patterns, there are variations that also affect the bioenergetic system (Wallace 2013). It has been reported that about 10 to 20% of the tRNA variations, at least a few of the rRNA variations (Ruiz-Pesini and Wallace 2006), and 25% of the mtDNA protein sequence variations (Mishmar et al. 2003; Ruiz-Pesini et al. 2004) have played roles in altering mitochondrial coupling efficiency. The central role of mitochondria is to generate energy through the electron transport chain (ETC). The ETC oxidizes the reduced dietary components and, in the process, generates a proton gradient across the inner mitochondrial membrane. This proton gradient then leads to the formation of ATP with the help of complex V (ATP synthase) in the matrix. The efficiency with which the proton gradient is converted into energy production is known as coupling efficiency. The mtDNA variations that cause a reduction in this efficiency lead to the production of more heat than ATP, which is used in colder climates. One such variation is 3394C in the ND1 gene, which is enriched in high-altitude Tibetans compared with low-altitude Asians (Ji et al. 2012). The mtDNA variations also affect the pH of the mitochondrial matrix and calcium dynamics in the mitochondria. Particularly, 8701A and 10398A lead to a decrease in the pH of the matrix and the uptake of calcium by mitochondria (Kazuno et al. 2006). Along with the phenotypic effect of mutations in coding genes, the variations in the control region also affect the overall dynamics of the mitochondrial genome. 
One such variant is 295T, which has been shown to enhance the binding of TFAM (mitochondrial transcription factor A) to the L-strand promoter of mtDNA and to increase L-strand transcripts and mtDNA copy number. The mtDNA mutations are also associated with a wide range of clinical phenotypes (Wei and Chinnery 2020). Large-scale deletions in mtDNA were, for the first time, shown to be responsible for optic neuropathy and myopathies (Holt et al. 1988; Wallace et al. 1988). Since mtDNA is present in multiple copies per cell, depending upon the type of tissue, the mere presence of a mutation in a few molecules does not necessarily produce a clinical phenotype. The condition in which both mutated and wild-type mtDNA molecules are present is known as heteroplasmy. Mutated mtDNA molecules must be present at high frequency for the clinical consequences of a mutation to appear (Stewart and Chinnery 2021). Apart from the mtDNA-encoded genes, there are ∼1200 genes which, if mutated, might lead to mitochondrial dysfunction (Calvo et al. 2006). With the growing understanding of the nature of mtDNA mutations through deep sequencing studies, new therapeutic methods are also emerging that target symptomatic interventions, pharmacological therapies, ATP and nitric oxide synthesis pathways, antioxidant defense, and improvement of mitochondrial quality and apoptosis (Bottani et al. 2020). In certain conditions, however, the mere presence of a variation is not enough to impart its effect. This is the case for the mitochondrial genome: the number of mitochondria varies in human cells, and each mitochondrion harbors hundreds to thousands of copies of its genome (D’Erchia et al. 2015), so the deleterious effect on the phenotype of an individual depends on the ratio of damaged mitochondria or mutated genomes to healthy ones.

1.2 Mutual effect of variations

In the human mitochondrial genome, it is observed that a variation may have different effects depending upon the haplogroup/genome backgrounds (Ji et al. 2012). As we know, the 3394C variant confers high-altitude adaptation in Tibetans, and its haplogroup background plays a vital role in its phenotypic consequences. When 3394C is present on the M haplogroup on the M9 background in Tibetans and on the C4a4 background in the Indian Deccan plateau population, it has beneficial effects and does not affect the complex I activity. However, its presence in the N haplogroup reduces the complex I activity and associates with Leber hereditary optic neuropathy (LHON), suggesting the role of the haplogroup background in modulating bioenergetics. Similarly, the non-pathogenic missense variants cause low-penetrance LHON, as shown in a study of complete mtDNA sequences of three families from southern Italy and one family from northern Italy (Caporali et al. 2018). It was reported that the variants, otherwise polymorphic, when present in a particular combination, led to reduced complex I activity and thereby the onset of LHON. There could be multiple reasons behind such complex observations, one being the hypothesis of genetic hitchhiking, which states that a selective sweep at one position in the genome could alter the allele frequency at a nearby position (Charlesworth et al. 2000). Another phenomenon where the presence of mutual variations imparts their effect at the phenotypic level is observed in the form of epistasis, which usually deals with alterations in traits associated with those variations (Lehner 2011). Considering the fact that the relationship between phenotypic effects and the presence of variations is not direct, it becomes important to assess the collective role of variations in understanding mitochondrial genetics in the human population. 
Genetic variations were observed to impart their effect at the phenotypic level as a cohort of multiple interactions and rarely individually (Papp and Pál 2011). The heritability of complex diseases is only minutely affected by the mere presence of single nucleotide polymorphisms (SNPs) (Jakobsdottir et al. 2009). Instead, the manifestation of such diseases depends on the interactions of SNPs (Cordell 2002; Marchini et al. 2005; Phillips 2008), and therefore it is important to study the collective effect of genomic variations. There are various ways to study the collective effect of variations and the specific interactions between genes associated with specific traits (Gilbert-Diamond and Moore 2017). To select a particular cohort of variations and their interactions responsible for the manifestation of complex phenotypes, various computational methods have been developed and implemented. The most popular among them are principal component analysis, which infers groups of SNPs from linkage disequilibrium to evaluate multivariate SNP correlations for intragenic diversity coverage (Horne and Camp 2004); an integrative scoring system based on the deleterious effects of SNPs (Lee and Shatkay 2009); and the Pareto-optimal approach for identifying functionally and informatively significant SNPs (Lee et al. 2009). Other approaches are based on pair-wise interactions, such as detecting two significantly interacting variations through logic regression (Schwender and Ickstadt 2007), predictive rule inference (Wan et al. 2009), and the shrunken dissimilarity measure, in which a gene–gene similarity value is calculated and pairs are selected if the similarity value crosses a set threshold (Liu et al. 2020). These traditional methods of detecting genetic interactions based on SNPs rely on omitting interactions with minimal or no effect on the trait, which increases the possibility of false-negative results. 
To overcome this, variable site pairs are considered based on their allele frequency in the population in question. This provides quantitative as well as qualitative (as major or minor allele) information of all the variable sites for that population. Such pairs of variable sites give rise to a network in which the nodes are defined as variable sites and edges are defined as the relative frequency of occurrence of their alleles together. In this way, we can start with most of the pairs of variable sites without worrying about false-negative errors since we are not omitting any variable site or its interactions based on phenotypic information. Next, with the appropriate threshold selection method we can focus on identifying significant variable sites and their interactions to further analyze the network properties. There are several structural properties such as degree, clustering coefficient, centrality measures, etc., which can be analyzed to extract specific information on important nodes as well as their interactions in the network (Jalan and Sarkar 2017). Various graph-theoretical measures and models are extensively reviewed to understand the biological significance and hidden properties of living systems (Pavlopoulos et al. 2011). With myriad biological components such as proteins (Pellegrini et al. 2004), transcription factors (Lee et al. 2002), metabolites, and metabolic reactions (Jeong et al. 2000), the systems were studied as a complex system of networks to advance the fundamental understanding to the origin of functioning of the system. The promising role of network science can be elucidated as a spectrum of structural and spectral properties that have been extensively applied to understand the complex behavior of multiple cancers (Rai et al. 2017) and to identify the crucial role of various proteins involved in each developmental stage of C. elegans (Shinde and Jalan 2015).

1.3 Biological applications of co-occurrence/co-mutation networks

The role of compensatory mutations as co-mutations has been reported in influenza viral evolution; e.g., the presence of E375G is known to functionally compensate for the deleterious effects of R384G, while M239V enhances viral fitness in the NP gene (Berkhoff et al. 2005). With this information, co-occurrence networks were constructed for the human influenza virus (H3N2) to analyze the collective effect of all the variations on the evolution of the virus from viral genomes isolated between 1968 and 2006 (Du et al. 2008). These studies identified the role of connectivity maps between and within viral genes as a contributing factor in influenza virus evolution. Rather than focusing on individual nucleotide pairs, these studies provided a holistic picture of viral evolution by evaluating the changes in co-occurrence network topology with respect to time. In a similar line of work, entropy (for genetic diversity) and information gain (for antigenic degree) were analyzed to identify antigenically critical amino acid positions on the hemagglutinin (HA) protein of influenza virus and to distinguish between avirulent and virulent strains (Huang et al. 2009). This distinction is critical for developing new vaccines for antigenic variants. The information gain from these studies was used to identify co-mutation pairs on different epitopes which could lead to antigenic drift. Association rule mining was performed to extract co-occurring mutations in the HA gene of human influenza A/H3N2, A/H1N1, and B viruses to predict their evolution, and the need for vaccine updates was emphasized (Chen et al. 2016). Rule-based co-mutation networks identified the driver mutations during the H1N1 2009 pandemic by comparing the degree centrality of pandemic and post-pandemic networks. In another study, co-occurring mutations in the HA and neuraminidase (NA) genes of influenza A/(H1N1)pdm09 viruses were explored (Liu et al. 2020). 
The study showed that sore throat was associated with co-occurring mutations in the hemagglutinin and neuraminidase genes. Moreover, apart from the influenza virus, a change in the degree centrality of co-mutations in the Ebola virus has been attributed to the accelerated viral evolution in recent outbreaks. The lethality of the disease, in terms of case fatality rate, was also predicted by mapping the co-mutation networks (Deng et al. 2015). In recent studies, co-existing mutations have been used to classify Indian SARS-CoV-2 strains into 22 groups (Sarkar et al. 2021), whereas co-mutation modules were explored to capture the evolution and transmission patterns of SARS-CoV-2 (Qin et al. 2021). The effect of co-occurring genetic alterations on non-small-cell lung cancer (NSCLC) progression and therapy resistance was studied for the first time on the CTNNB1 and PIK3CA genes. It was identified that co-occurring alterations in the CTNNB1 and PIK3CA genes cooperatively promote cancer progression (Blakely et al. 2017). A systems biology approach was presented to extract the impact of functional interactions between mutated genes in different cancers (Cui 2010). In the study, co-occurring (and anti-co-occurring) mutations were defined based on the presence (or absence) of mutated genes in a particular cancer. By analyzing such genetic interaction networks based on co-mutations, it was reported that mutated genes that co-occur in tumors share signal transduction pathways and have functional similarities. In another study, candidate therapeutic pathways for personalized medicine were identified by utilizing the information on mutated genes in tumors of 14 different cancer types to construct co-mutation-based genetic interaction networks (Liu et al. 2020). The interaction of the WNT4 and WNT5A genes through the rs2072920 and rs11918967 SNPs was shown to be associated with obesity in the Han Chinese people (Dong et al. 2017). 
Along with DNA sequences, co-evolution has been substantially studied in protein sequences (Morcos et al. 2011; Kamisetty et al. 2013). Various methods such as direct coupling analysis (Morcos et al. 2011), statistical coupling analysis (Russ et al. 2005), and evolutionary coupling analysis (Hopf et al. 2015) were developed and applied to establish the role of amino acid interactions in the structural stability and functionality of proteins. A relatively more developed method known as deep coupling scan was proposed, which takes care of patterns of evolutionary conservation in deep mutation scan data sets (Salinas and Ranganathan 2008). In this article, as a tutorial, we provide detailed step-by-step instructions to construct such co-occurrence/co-mutation networks along with a brief discussion of the results.

2 Network construction techniques

A network consists of a set of connected nodes or units, where connections are defined by an interaction type. There exist various network models incorporating various properties of real-world complex systems, among which three are most popular (box 1). Structural properties of real-world networks are readily compared with these corresponding model networks for their deviations and to detect system specific information. The prerequisite of constructing a network from a given set of genetic sequence data is the alignment of sequences usually in a FASTA format. For aligning the DNA sequences, the online tool Clustal Omega (https://www.ebi.ac.uk/Tools/msa/clustalo/) (Sievers et al. 2011), and for offline alignment, AliView software (Larsson 2014), are readily employed. Once the sequences are aligned, the next step is to pre-process them, by replacing ambiguous characters such as ‘M’, ‘Y’, ‘S’, ‘K’, etc., with one letter, say, ‘N’. This is done to simplify the procedure of defining the variable sites. The first step to construct these nucleotide networks is identification of the variable sites. A variable site is defined as a position where more than one type of nucleotides is present. Such a site where only two types of nucleotides are present is considered a bi-allelic site, and the site with three nucleotides is considered a tri-allelic site. For our analysis, we only consider bi-allelic sites since the role of tri-allelic sites in evolution is still not clear. The allele frequencies substantially define the genetic structure of the population. However, the individual allele frequencies do not affect the construction of co-occurrence networks at all. Other than the bi-allelic or tri-allelic sites, we have sites with gaps and unknown nucleotides often represented by ‘−’ or ‘N’ or ‘?’. 
The sites with gaps usually represent indels (insertions and deletions), and so there is no meaning in taking those as variable sites; ‘N’ could be any nucleotide out of the four standard nucleotides, and hence does not give any particular biological information about that polymorphic site. Therefore, we suggest such sites be ignored since they could give rise to artifacts while calculating the co-occurrence/co-mutation frequency between variable positions. In the co-occurrence and co-mutation networks, the nodes are variable sites and the connections are defined according to the type of network we are constructing (this will be discussed further).
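As an illustration of the pre-processing and variable-site identification described above, the following Python sketch may help. It is a minimal example under stated assumptions, not the authors' implementation: the function names `preprocess` and `biallelic_sites` are hypothetical, positions are 0-based, and a column is dropped entirely as soon as it contains a gap or an unknown character, which is one possible reading of "such sites be ignored".

```python
from collections import Counter

NUCLEOTIDES = set("ACGT")
IUPAC_AMBIGUOUS = set("MRWSYKVHDB")  # ambiguity codes collapsed to 'N'


def preprocess(sequence):
    """Upper-case a sequence and replace IUPAC ambiguity codes with 'N'."""
    return "".join("N" if c in IUPAC_AMBIGUOUS else c for c in sequence.upper())


def biallelic_sites(alignment):
    """Map each bi-allelic column (0-based) to its (major, minor) alleles.

    ASSUMPTION: any column holding a gap, 'N' or '?' is skipped entirely.
    """
    sequences = [preprocess(s) for s in alignment]
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    sites = {}
    for pos in range(len(sequences[0])):
        column = [s[pos] for s in sequences]
        if any(c not in NUCLEOTIDES for c in column):
            continue  # gap or unknown nucleotide present: ignore this site
        counts = Counter(column).most_common()
        if len(counts) == 2:  # exactly two nucleotides: bi-allelic
            sites[pos] = (counts[0][0], counts[1][0])
    return sites


alignment = ["ACGTAC", "ACGTAT", "ACGTAC", "ATG-AY", "GTGTAC"]
print(biallelic_sites(alignment))
```

In this toy alignment, columns 3 and 5 are discarded because of the gap and the ambiguous 'Y', so only columns 0 and 1 are reported as variable sites.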

figure a
figure b

2.1 Co-occurrence networks

Co-occurrence networks take the positions of variable sites as nodes, and the connection between a pair of nodes is defined based on the co-occurrence of alleles in the given population. For this reason, there exists one unique network for each available sequence of the population. Because a co-occurrence network considers the allele (major or minor) present at each variable site of a particular sequence, the edges are defined with respect to the alleles and their frequencies for pairs of variable sites in the population. The co-occurrence frequency between a pair of variable sites, for the alleles present at those positions, is calculated as


where Coij is the co-occurrence frequency between the x and y alleles present at the ith and jth variable sites for a particular sequence. The numerator xiyj is the frequency with which the x and y alleles occur together at the ith and jth sites, whereas xi and yj are the total frequencies of the x allele at the ith position and the y allele at the jth position, respectively. The value of Coij gives information about the co-occurrence of two nucleotide positions in a given sample with respect to their presence in the whole population. The co-occurrence frequency lies in the range [0, 1], and we need to define a threshold value to filter out possible noise and obtain a structurally meaningful sparse network. To define a threshold, we look at two structural properties of the network: the order of the largest connected component (LCC) and the average degree 〈k〉 of the LCC. These two properties help in constructing a sparse network that is structurally more meaningful by filtering out some connections. To calculate the threshold, we first construct a network from all the pairs with Coij > 0, which gives rise to a more or less globally connected network. To get a meaningful sparse network from it, we gradually remove the links with Coij < Coth (the co-occurrence threshold), keeping only those connections at or above a particular Coth, and simultaneously calculate NcLCC and 〈k〉. Following this procedure, we obtain a threshold at which the LCC contains almost all the nodes but as few connections as possible, yielding a very low 〈k〉 or NcLCC < NLCC, where NcLCC and NLCC represent the number of connections and the number of nodes of the LCC, respectively (figure 1). We applied this method of threshold selection to all the networks generated for a given population, yielding as many sparse networks as the number of available sequences. 
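The threshold scan described above can be sketched in Python as follows. This is only an illustrative sketch: the helper names `largest_component_stats` and `scan_thresholds` are hypothetical, the input is assumed to be a dictionary mapping each pair of variable sites to an already computed co-occurrence frequency, and the final choice of threshold from the scanned values is left to the user.

```python
from collections import defaultdict


def largest_component_stats(edges):
    """Return (nodes, connections, average degree) of the largest connected component."""
    adjacency = defaultdict(set)
    for i, j in edges:
        adjacency[i].add(j)
        adjacency[j].add(i)
    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            for nb in adjacency[node]:
                if nb not in component:
                    component.add(nb)
                    stack.append(nb)
        seen |= component
        if len(component) > len(best):
            best = component
    if not best:
        return 0, 0, 0.0
    n_edges = sum(len(adjacency[n]) for n in best) // 2
    return len(best), n_edges, 2 * n_edges / len(best)


def scan_thresholds(weighted_edges, thresholds):
    """For each threshold keep edges with weight >= threshold and describe the LCC."""
    rows = []
    for th in thresholds:
        kept = [pair for pair, w in weighted_edges.items() if w >= th]
        rows.append((th, *largest_component_stats(kept)))
    return rows


# Toy co-occurrence frequencies between pairs of variable sites (made-up values).
weighted = {(1, 2): 1.0, (2, 3): 0.9, (1, 3): 0.5, (3, 4): 0.95, (5, 6): 0.2}
for row in scan_thresholds(weighted, [0.0, 0.6, 0.95, 1.0]):
    print(row)
```

Reading the rows from low to high threshold shows where the LCC still spans most nodes while its average degree has dropped, which is the region from which the threshold would be picked.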
In the next step, we constructed a master network by merging all the individual networks generated for each sequence. In this step, we obtained duplicate nodes and edges; however, we chose to perform a union operation on our networks (which can be performed by using codes in Python or Java or any other preferred programming language by the user). By performing a union operation on the edges, we obtained only one network in which all the nodes and edges of all the networks were taken into consideration only once, yielding a single undirected and unweighted co-occurrence network for all the samples. Then, various graph-theoretic structural properties were analyzed for this network, which are discussed in the results section.
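A minimal Python sketch of the per-sequence networks and their union is given below. Two points are assumptions rather than the published procedure. First, the co-occurrence frequency is computed here as n(xi, yj) / sqrt(n(xi) · n(yj)), a normalisation chosen only because it keeps the value in [0, 1]; the source equation is not reproduced in this copy, so the true denominator may differ. Second, a single threshold is applied to every sequence for brevity, whereas the text selects a threshold per network. The function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cooccurrence_edges(alignment, sites, sequence_index, threshold):
    """Edges (i, j) of the network of one sequence with Co_ij >= threshold.

    ASSUMPTION: Co_ij = n(x_i, y_j) / sqrt(n(x_i) * n(y_j)), where x and y are
    the alleles carried by the chosen sequence at sites i and j.
    """
    target = alignment[sequence_index]
    edges = set()
    for i, j in combinations(sorted(sites), 2):
        x, y = target[i], target[j]
        n_x = sum(1 for s in alignment if s[i] == x)
        n_y = sum(1 for s in alignment if s[j] == y)
        n_xy = sum(1 for s in alignment if s[i] == x and s[j] == y)
        if n_xy / sqrt(n_x * n_y) >= threshold:
            edges.add((i, j))
    return edges


def master_network(alignment, sites, threshold):
    """Union of the edge sets of all per-sequence networks (undirected, unweighted)."""
    master = set()
    for index in range(len(alignment)):
        master |= cooccurrence_edges(alignment, sites, index, threshold)
    return master


alignment = ["AAC", "AAC", "GGC", "GGT"]
print(master_network(alignment, [0, 1, 2], threshold=1.0))
```

Because edges are stored in a set, duplicate edges contributed by different sequences collapse automatically, which is the effect of the union operation.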

Figure 1

Schematic for construction of co-occurrence and co-mutation networks. Coij is the co-occurrence frequency of two nucleotides at the variable positions i and j, xiyj is the occurrence of the x and y nucleotides at the ith and jth positions together, and xi and yj are the occurrences of the x and y nucleotides at the ith and jth positions, respectively. Cmij and Cmrij represent the natural and random co-mutation frequencies, respectively, mij represents the frequency of minor alleles present at both the ith and jth positions together, and mi and mj are the minor allele frequencies at the ith and jth positions, respectively. Once Coij or Cmij is calculated, one can use either or both of the threshold selection methods.

2.2 Co-mutation networks

For co-mutation networks, we again start with the multiple aligned DNA sequences. The variable sites are isolated based on minor allele frequency. Considering the minor allele raises one additional aspect: comparison with the reference sequence. When alleles are compared with the reference sequence, an allele that is a variant with respect to the reference may turn out to be present as the major allele in the population. However, such a site would still be considered a variable site for our analysis. Just as we calculated the co-occurrence frequency in the previous section, here we define and calculate the co-mutation frequency between two variable sites based on their minor allele frequencies as


where Cmij is the co-mutation frequency, mij is the number of times the minor alleles at the ith and jth positions occur together, and mi and mj are the minor allele frequencies at the ith and jth positions, respectively. Calculating Cmij, whose range is again [0, 1], is the first step in constructing co-mutation networks. Further, we calculate a statistical significance value (Pi,j), popularly known as the p-value, for each pair and filter out the interactions that do not pass a threshold value, to get a meaningful network.

$$ P_{i,j} = \frac{\#\left[ Cm_{ij}^{r} \ge Cm_{ij} \right]}{\text{reshuffling}} $$

where Cmrij is a random co-mutation frequency, calculated after permuting the alleles at the ith and jth positions a large number of times. The numerator counts the reshufflings in which Cmrij is at least as large as the observed Cmij, and the denominator is the total number of reshufflings. As per standard practice, we retain only the interactions with Pi,j ≤ 0.05. This method yields only one network combining all the sequences, and hence the method is independent of the number of samples.
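The permutation test can be illustrated with the short Python sketch below. It is hedged in two ways: the co-mutation frequency is assumed to be mij / sqrt(mi · mj), since the source equation is not reproduced in this copy, and only the alleles at one of the two positions are shuffled, which is sufficient to randomise the pairing. The function names, the default of 1000 reshufflings, and the fixed seed are illustrative choices.

```python
import random
from math import sqrt


def comutation_frequency(minor_i, minor_j):
    """Co-mutation frequency of two sites given per-sequence minor-allele flags.

    ASSUMPTION: Cm_ij = m_ij / sqrt(m_i * m_j); the published normalisation may differ.
    """
    m_ij = sum(1 for a, b in zip(minor_i, minor_j) if a and b)
    m_i, m_j = sum(1 for a in minor_i if a), sum(1 for b in minor_j if b)
    return m_ij / sqrt(m_i * m_j) if m_i and m_j else 0.0


def permutation_p_value(minor_i, minor_j, reshufflings=1000, seed=0):
    """Fraction of reshufflings whose random Cm is >= the observed Cm."""
    rng = random.Random(seed)
    observed = comutation_frequency(minor_i, minor_j)
    shuffled = list(minor_j)
    hits = 0
    for _ in range(reshufflings):
        rng.shuffle(shuffled)  # break the pairing between the two sites
        if comutation_frequency(minor_i, shuffled) >= observed:
            hits += 1
    return hits / reshufflings


# Two sites whose minor alleles always occur in the same 5 of 20 sequences.
site_i = [1] * 5 + [0] * 15
site_j = [1] * 5 + [0] * 15
p = permutation_p_value(site_i, site_j)
print(comutation_frequency(site_i, site_j), p, p <= 0.05)
```

An edge between the two sites would be kept only when the returned value is at most 0.05.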

2.3 Network properties

Degree (k): The degree of a node is defined as the number of connections that the node has with other nodes (Barabási and Pósfai 2016). A node with a very high degree is referred to as a hub node and is known to play important functional roles in the corresponding system. In most networks, such hub nodes are very few, and hence the network is highly robust against random external attacks. On the other hand, a targeted attack on such nodes could drastically collapse the network. The degree of a node i is denoted as ki and is calculated from the symmetric adjacency matrix A as

$$ k_{i} = \sum_{j=1}^{N} A_{ij} $$

where j runs over the columns of the adjacency matrix and N is the total number of nodes. The average degree 〈k〉 of a network is the average of the degrees of all the nodes in the network and is a measure of the sparseness (or denseness) of the underlying system.

Clustering coefficient: The tendency of the nodes in a system to form triangles is captured by the clustering coefficient (Barabási and Pósfai 2016). Most real-world networks show high clustering coefficients compared with the corresponding random network. Higher clustering also provides information on the existence of modularity in the network. The clustering coefficient of a node i is calculated as

$$ C_{i} = \frac{2k_{n}}{k_{i}\left(k_{i}-1\right)} $$

where kn is the number of connections between the neighbors of i and ki is the degree of node i.
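As a small worked example of the degree and clustering coefficient, the Python sketch below computes both from an edge list. The helper names are hypothetical, and nodes with fewer than two neighbors are assigned a clustering coefficient of 0 by convention.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build a node -> set-of-neighbors map from an undirected edge list."""
    adjacency = {}
    for i, j in edges:
        adjacency.setdefault(i, set()).add(j)
        adjacency.setdefault(j, set()).add(i)
    return adjacency


def degree(adjacency, node):
    """Number of connections of a node."""
    return len(adjacency[node])


def average_degree(adjacency):
    """Mean degree over all nodes of the network."""
    return sum(len(nbs) for nbs in adjacency.values()) / len(adjacency)


def clustering_coefficient(adjacency, node):
    """2 * k_n / (k_i * (k_i - 1)), with k_n the links among the neighbors of the node."""
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention: no triangle is possible
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adjacency[a])
    return 2 * links / (k * (k - 1))


# Triangle 1-2-3 with a pendant node 4 attached to node 3.
adjacency = build_adjacency([(1, 2), (1, 3), (2, 3), (3, 4)])
print(degree(adjacency, 3), average_degree(adjacency))
print(clustering_coefficient(adjacency, 1), clustering_coefficient(adjacency, 3))
```

Node 3 has three neighbors, of which only one pair (1 and 2) is connected, so its clustering coefficient is 1/3, whereas node 1 sits in a closed triangle and scores 1.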

Hierarchy: A decrease in the clustering coefficient with an increase in the degree of nodes suggests the presence of hierarchy in the network (Barabási and Pósfai 2016). This indicates that the nodes with small degrees belong to highly interconnected small modules. To quantify hierarchy, the local reaching centrality (CR) is measured for a node i as the proportion of all the nodes that can be reached from node i. Hierarchy arises due to the fact that the modules are not completely independent in a network where a few nodes play the crucial role of cross-talking between any two or more modules. CR is calculated as

$${C}_{R}\left(i\right)=\frac{1}{N-1}\sum_{j:0<{d}_{i,j}<\infty }\frac{1}{{d}_{i,j}}$$

where di,j is the shortest path length and N is the total number of nodes. Based on CR, the hierarchy is defined as

$$h=\frac{\sum \left[{C}_{R}^{max}-{C}_{R}\left(i\right)\right]}{N-1}$$

where \({C}_{R}^{max}\) is the highest reaching centrality in the network.
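The two formulas above can be checked on small graphs with the following Python sketch, which uses a breadth-first search to obtain shortest path lengths in an undirected, unweighted network. The function names are hypothetical, and the network is assumed to have at least two nodes.

```python
from collections import deque


def shortest_path_lengths(adjacency, source):
    """Breadth-first search distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adjacency[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def local_reaching_centrality(adjacency, node):
    """C_R(i): sum of 1/d over reachable nodes j != i, divided by N - 1."""
    dist = shortest_path_lengths(adjacency, node)
    return sum(1 / d for d in dist.values() if d > 0) / (len(adjacency) - 1)


def hierarchy(adjacency):
    """h: sum over nodes of (C_R^max - C_R(i)), divided by N - 1."""
    values = [local_reaching_centrality(adjacency, v) for v in adjacency]
    c_max = max(values)
    return sum(c_max - c for c in values) / (len(adjacency) - 1)


star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}      # one hub, three leaves
triangle = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}       # all nodes equivalent
print(hierarchy(star), hierarchy(triangle))
```

The star, in which one node reaches all others in a single step, gives h = 1/3, while the fully symmetric triangle gives h = 0.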

Betweenness centrality: This is a measure of the importance of a node as a connector or bridge between two modules or communities, independent of its degree. It is defined as the fraction of shortest paths between all pairs of nodes that pass through the node i, and is calculated as

$${\beta }_{i}=\sum_{st}\frac{{n}_{st}^{i}}{{g}_{st}}$$

where \({n}_{st}^{i}\) is the number of shortest paths between nodes s and t that pass through node i, and \({g}_{st}\) is the total number of shortest paths between s and t.
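For readers who wish to compute betweenness centrality without an external library, a compact Python sketch based on Brandes' accumulation scheme is given below. It is an illustration, not the code used in this work: the function name is hypothetical, the sum is taken over unordered pairs (s, t) of an undirected network, and no normalisation is applied.

```python
from collections import deque


def betweenness(adjacency):
    """Unnormalised betweenness of every node in an undirected, unweighted network."""
    score = {v: 0.0 for v in adjacency}
    for s in adjacency:
        order = []                               # nodes in order of discovery
        preds = {v: [] for v in adjacency}       # predecessors on shortest paths
        sigma = {v: 0 for v in adjacency}        # number of shortest paths from s
        dist = {v: -1 for v in adjacency}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adjacency}      # dependency of s on each node
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was visited from both ends, so halve the totals.
    return {v: b / 2 for v, b in score.items()}


path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}     # chain 1-2-3-4
print(betweenness(path))
```

In the chain, the two inner nodes each lie on the shortest paths of two pairs, while the end nodes lie on none.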

Modularity: In biological systems, the components form groups in order to perform relatively different functions at the molecular level (Barabási and Pósfai 2016). This property of a system is referred to as modularity. A high clustering coefficient is a signature of high modularity in a network. It also gives information about the motifs, which are highly connected subgraphs. Motifs are present in almost all the real-world networks that have been examined so far. Modularity can be calculated using various algorithms such as Girvan–Newman (Girvan and Newman 2002) or Louvain (Blondel et al. 2008).
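The algorithms named above search for a partition of the nodes; the quantity they are commonly scored with is the modularity Q of Newman and Girvan, Q = Σc [Lc/L − (dc/2L)²], where L is the total number of edges, Lc the number of edges inside community c, and dc the summed degree of its nodes. This formula is not stated in the text and is given here as the standard definition. The Python sketch below evaluates Q for a partition supplied by the user; it does not detect communities, and the function name is hypothetical.

```python
def modularity(edges, communities):
    """Modularity Q of a partition; communities maps each node to a label."""
    total = len(edges)
    internal = {}     # edges with both ends in the same community
    degree_sum = {}   # summed degree of the nodes of each community
    for i, j in edges:
        ci, cj = communities[i], communities[j]
        degree_sum[ci] = degree_sum.get(ci, 0) + 1
        degree_sum[cj] = degree_sum.get(cj, 0) + 1
        if ci == cj:
            internal[ci] = internal.get(ci, 0) + 1
    return sum(
        internal.get(c, 0) / total - (d / (2 * total)) ** 2
        for c, d in degree_sum.items()
    )


# Two triangles joined by a single bridge edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_groups = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
one_group = {node: "A" for node in range(1, 7)}
print(modularity(edges, two_groups), modularity(edges, one_group))
```

Splitting the network along the bridge gives Q = 5/14, whereas placing all nodes in one community gives Q = 0, reflecting the absence of any modular structure in that partition.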

3 Pseudo-codes/algorithms used to construct networks

In this section we have provided algorithms for constructing both types of networks. Algorithm 1 is common for both network construction methods.

figure c
figure d
figure e

4 Results

4.1 Structural properties

We have discussed two different methods of threshold selection to obtain a sparse network. One could apply either of these methods or even both of them together. Further, we considered a real-world example to understand these two methods of network construction. As a practical example, we started with ∼1500 human mtDNA from hmtDB from Oceania, giving rise to ∼1500 variable sites (dependent on the updates of the database), out of which only ∼450 take part in network construction with ∼470 connections. Figure 2a shows that the order of the largest connected component (NLCC) and its average degree (kLCC) decreases with an increase in the value of the co-occurrence frequency threshold. Initially, with no threshold, all the nodes (N) take part in network construction having degree N−1. In the inset of the graph in figure 3 (left), NLCC and kLCC from 0.9 to 1.0 are plotted to show the changes in the value of threshold to be considered to get a largest possible sparse network, which in this case is 0.9988. This method of threshold selection is perceived as the network efficiency score since it is employed to generate a sparse network. Similar to the network efficiency score, the effect of the p-value-based threshold selection method can be seen (figure 3, right). Here, we took the mtDNA of a Tibetan population with ∼85 samples, giving rise to ∼420 variable sites, of which ∼400 take part in network construction and ∼3400 significant connections. The number of nodes participating in the network construction is not affected as much as the change (more than 50%) in the number of connections (Nc) and the average degree 〈k〉, thus resulting in a sparse network. For this particular example, we considered a co-mutation network in which the minor allele frequency (maf) of a variable site plays a crucial role in determining the importance of the corresponding node in that network (figure 4). 
We plotted the degree of each node against its minor allele frequency to determine the role of the minor allele frequency in defining the co-mutation tendency of a variable site (figure 4a). The maf and degree show a positive correlation (0.4) for the given population, although a few nodes with small maf have a high degree in the network, and vice versa. In this context, the mutations which are usually the drivers of cancer evolution have been shown to have high maf (Spurr et al. 2018). The figure shows that only a few nodes have very high maf, while most of the nodes have maf between 1 and 10 (figure 4a, x-axis). However, two variable sites that co-mutate do not show any particular correlation in terms of maf (figure 4b). Variable sites with high maf can also co-mutate with nodes having relatively low maf. Moreover, maf could affect the distribution of edges with high co-mutation frequency (figure 4c and d). When the maf cut-off is increased from 15 to 20, the edges with Cmij ≥ 0.6 are not affected, while edges with smaller Cmij are removed from the network. This observation suggests that pairs of nodes in which both variable sites have high maf tend to give a high co-mutation frequency (Cmij) in the network, since their association would be far less random (figure 4c and d).

Figure 2

Evolution of the co-occurrence network with change in the co-occurrence threshold. In the first graph, when no threshold is taken, an all-to-all connected network is produced. In the second graph, when the threshold is increased to 0.5, only those edges are present which have co-occurrence frequency (Co) ≥ 0.5. In the third graph, the threshold is further increased to 0.9, where we lose a lot of connections and get a sparse network. In the fourth graph, the threshold is set to 1.0, which means that only edges with Co = 1.0 will be present in the network, giving rise to complete subgraphs.

Figure 3

Left: Change in the order of the largest connected component (blue squares) and its average degree (red dots) with respect to the co-occurrence frequency threshold. Right: The effect of the p-value threshold on the order (NLCC), size (NcLCC) and average degree (kLCC) of the largest connected component of a co-mutation network.

Figure 4

(a) The maf and degree of each node showed a positive correlation. (b) The maf of nodes i and j of each pair. (c and d) The effect of maf on the distribution of co-mutation frequency.

Since the degree of a variable site reflects the extent of co-evolution of that site with other sites, it is a prime property that indicates evolutionary change in the genome of an organism. The change in the average degree between viral genomes of two different seasons accounted for the sudden evolution of viral strains (Du et al. 2008). Similarly, the co-mutation of two amino acids of the HA protein in H3N2 virus was linked with antigenic drift or high co-evolution. A co-mutation score was calculated for each amino acid pair, and it was identified that the epitope regions (the antibody binding sites) showed a higher tendency of co-mutations as compared with non-epitope regions (Huang et al. 2009). Strong selection pressure on the mutated sites could also be identified by looking at the frequently co-mutating sites, as in human influenza A/H3N2, A/H1N1, and B viruses (Chen et al. 2016). In a large-scale study of mitochondrial genomic co-occurrence networks, it was found that the average degree is conserved across all five continents (Shinde et al. 2018). This observation supports the consistent distribution of mtDNA variations across the whole world and the nature of its uniparental inheritance (Ruiz-Pesini et al. 2004; Wei and Chinnery 2020). High-degree nodes, usually considered hub nodes, were shown to be the mitochondrial haplogroup markers in three high-altitude populations (Verma et al. 2022).

Along with the degree of specific nodes, the degree distribution of all the nodes provides information about the nature of the network, which can be compared with model networks. The degree distribution (box 1) of these two types of nucleotide networks follows a binomial distribution with a pronounced peak at 〈k〉 and decays exponentially for large degrees, similar to small-world networks (Albert and Barabási 2002). However, the degree distribution could be affected by the choice of threshold for generating the final network. Additionally, these networks show very high clustering coefficients, which suggests that the nodes in these networks tend to cluster together. The presence of a high clustering coefficient has been shown to be associated with the co-evolution of mitochondrial variations as a cohort between and among mitochondrial genes, in a dependent manner rather than individually, which is in line with the mechanism of haplogroup inheritance among subpopulations (Shinde et al. 2021; Verma et al. 2022). These networks show a similar average shortest path length and a higher clustering coefficient as compared with the corresponding random networks, further adding to the evidence about the small-world-like nature of these networks. In terms of sparseness, the network constructed using the p-value threshold method is sparser than that constructed using the efficiency score method. The high clustering also suggests the presence of high modularity in these networks. The modules formed in these networks are dominated by particular haplogroups and also specific genes. Thus, these modules harbor critical information and could be further analyzed at the genetic, haplogroup, or pathogenic levels.
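The comparison of clustering with a random network can be illustrated with a small sketch. It assumes, purely for illustration, that the "corresponding random network" is an Erdős–Rényi graph with the same number of nodes N and the same average degree 〈k〉, whose expected clustering coefficient is approximately 〈k〉/(N−1). The function names and the toy adjacency below are hypothetical and are not taken from the cited studies.

```python
from itertools import combinations


def average_clustering(adjacency):
    """Mean local clustering coefficient of an undirected graph.

    adjacency: dict mapping each node to the set of its neighbours.
    Nodes with fewer than two neighbours contribute a coefficient of 0.
    """
    total = 0.0
    for node, neighbours in adjacency.items():
        k = len(neighbours)
        if k < 2:
            continue
        # Count links among the neighbours of `node`.
        links = sum(1 for u, v in combinations(neighbours, 2)
                    if v in adjacency[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adjacency)


def random_graph_clustering(adjacency):
    """Expected clustering of an Erdos-Renyi graph with the same N and <k>."""
    n = len(adjacency)
    mean_degree = sum(len(nb) for nb in adjacency.values()) / n
    return mean_degree / (n - 1)


# Toy network (assumed): two triangles joined by a single bridge edge.
toy_adjacency = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}

observed = average_clustering(toy_adjacency)
expected = random_graph_clustering(toy_adjacency)
print(f"observed C = {observed:.3f}, random expectation = {expected:.3f}")
```

An observed coefficient well above the random expectation, together with a comparable average shortest path length, is the usual signature of small-world-like structure.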

The clustering coefficient was reported to follow a negative power law with respect to degree (Ravasz et al. 2002), which provides evidence for the presence of hierarchy in these networks. The presence of hierarchy and modularity is consistent with the evolution of haplogroups over time and geographic space. It is well established that humans migrated out of Africa and settled across Eurasia and other continents over time; during this process, new variations were selected and enriched in regional populations while the earlier root variants were inherited, giving rise to haplogroups. The hierarchy captures this characteristic behavior of mitochondrial evolution.

4.2 Perfectly co-occurring sites

As discussed, we can apply a threshold value for generating a sparse network. In co-occurrence networks, if we keep the threshold of efficiency score at 1.0, and in co-mutation networks, if we consider only those pairs with Cmij = 1.0 and Pij ≤ 0.05, we get only those pairs of variable sites that are perfectly co-occurring and co-mutating in the population, respectively. Such sites form complete subgraphs (all-to-all connected) and yield disconnected components (figure 2). The peculiar case of perfectly co-occurring sites could be interpreted in two ways: first, the nucleotides at the ith and jth positions are co-occurring in just one sample and not present in any other sample in that population; second, the nucleotides are co-occurring in many samples and are not present individually. In both these cases we would get a perfect co-occurrence of the involved sites; however, in the first case, the significance of co-occurrence would be negligible for common variants but could have considerable importance for rare variants (Bomba et al. 2017). To avoid this bias, one can either define the rare variants beforehand or consider only those sites with a higher minor allele frequency. The perfectly co-occurring sites give rise to disconnected complete subgraphs or motifs of order two or more. Two- and three-order network motifs were analyzed for codon bias in the human population (Shinde et al. 2018). In this study, the codon positions of the mitochondrial genome were mapped to these network motifs given the position of the variable sites. The protein-coding gene codon positions were mapped as 1, 2, and 3, and the non-coding gene codon positions were mapped as 0. It was shown that synonymous positions (0 and 3) tend to co-occur more often than the non-synonymous positions. This suggests that protein-coding regions have played a selective role in human evolution. 
In another study comparing the low- and high-altitude populations of Asia, these perfectly co-occurring network motifs were employed to identify the role of high-altitude marker sites in defining the co-evolution patterns at high altitudes (Verma et al. 2021).
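The two interpretations of perfect co-occurrence discussed above can be made concrete with a short sketch. It rests on a simplifying assumption that is not the definition used in the cited studies: two alleles are treated as perfectly co-occurring when exactly the same set of samples carries them. Alleles sharing a carrier set then form a complete subgraph, and the size of that set (the "support") separates the single-sample case from the many-sample case. The function name, the `min_support` parameter, and the toy alignment are hypothetical.

```python
from collections import defaultdict


def perfect_cooccurrence_groups(sequences, min_support=1):
    """Group (position, allele) pairs carried by exactly the same samples.

    sequences: list of equal-length aligned strings, one per sample.
    Returns a list of (members, support) tuples, where members is a sorted
    list of (position, allele) pairs forming a complete subgraph and
    support is the number of samples carrying all of them.
    """
    carriers = defaultdict(set)
    for sample_index, sequence in enumerate(sequences):
        for position, allele in enumerate(sequence):
            carriers[(position, allele)].add(sample_index)

    n_samples = len(sequences)
    groups = defaultdict(list)
    for site_allele, samples in carriers.items():
        if len(samples) == n_samples:
            continue  # invariant position, not a variable site
        groups[frozenset(samples)].append(site_allele)

    return [
        (sorted(members), len(samples))
        for samples, members in groups.items()
        if len(members) >= 2 and len(samples) >= min_support
    ]


# Toy alignment (assumed): four samples, four positions.
toy_alignment = ["AAGT", "AAGT", "GCGT", "GCTA"]

all_groups = perfect_cooccurrence_groups(toy_alignment)
supported = perfect_cooccurrence_groups(toy_alignment, min_support=2)
for members, support in all_groups:
    print(members, "support =", support)
```

In the toy alignment, the pair T at position 2 and A at position 3 co-occurs perfectly but only in a single sample, so it is dropped once `min_support=2` is required. This is a crude stand-in for restricting the analysis to sites with a higher minor allele frequency.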

4.3 Gene mapping and genetic interaction networks

The variable sites participating in the network construction after applying a threshold could be mapped to their corresponding genes. For this purpose, the reference sequence should be mapped to the multiple sequence alignment by introducing all the indels. In this way, one can generate a genetic interaction network based on the co-mutation/co-occurrence network. Since more than one variable site might belong to the same gene, or two separate sets of variable sites might belong to the same gene pair, the resultant network would be weighted but undirected. For example, suppose there is a link between the variable sites 45 and 89, and these two sites belong to the same gene; this yields a self-loop for that gene. In another case, suppose the variable sites 35 and 102 are linked, as are 48 and 105, where sites 35 and 48 belong to gene 1, and sites 102 and 105 belong to gene 2. This will yield a link between gene 1 and gene 2 with a weight of 2. Thus, this genetic interaction network contains information on the co-evolution of genes both qualitatively and quantitatively. Such networks were constructed to compare the three high-altitude populations: Tibet, Ethiopia, and Andes. Functional enrichment analysis of the identified gene sets provided information about their role in the evolution of the human population at these high-altitude regions (Verma et al. 2022). By mapping the variable sites to genes, the co-evolution of genes could be quantified. For example, co-operative changes were observed among and between influenza genes. It was observed that hemagglutinin genes underwent connectivity changes within themselves during a particular period and the neuraminidase genes underwent a similar evolutionary pattern as other genes (Du et al. 2008). In gene–gene networks, the flow of information was investigated in co-mutating genes using a model ‘perturbed master equation’ (pME) in order to identify the gene pairs responsible for network frailness in breast cancer (Bersanelli et al. 2020).
In another study, enriched genes were identified in cancer pathways from co-mutation-based gene–gene networks of a large-scale study across 14 cancers with 2.5 million non-synonymous mutations and ∼6700 tumor exomes (Liu et al. 2020). It was predicted based on this study that interactions between BRCA2 and TP53 were related to the sensitivity/resistance to anticancer drugs.
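The mapping from site-level links to a weighted gene-level network, as in the example with sites 45–89, 35–102 and 48–105 above, can be sketched as follows. This is a minimal illustration rather than the pipeline of the cited studies. The function name is hypothetical, and the label "gene X" is a placeholder for the unnamed gene that contains both sites 45 and 89.

```python
from collections import Counter


def gene_interaction_network(site_edges, site_to_gene):
    """Collapse site-site links into weighted, undirected gene-gene links.

    site_edges: iterable of (site_i, site_j) pairs from the thresholded
        co-occurrence or co-mutation network.
    site_to_gene: dict mapping each variable site to its gene label.
    Returns a Counter keyed by sorted gene pairs; a key with two identical
    genes is a self-loop, and the count is the edge weight.
    """
    weights = Counter()
    for site_i, site_j in site_edges:
        gene_pair = tuple(sorted((site_to_gene[site_i], site_to_gene[site_j])))
        weights[gene_pair] += 1
    return weights


# Site links from the worked example in the text.
site_edges = [(45, 89), (35, 102), (48, 105)]

# Gene labels: "gene 1" and "gene 2" follow the text; "gene X" is assumed.
site_to_gene = {
    45: "gene X", 89: "gene X",
    35: "gene 1", 48: "gene 1",
    102: "gene 2", 105: "gene 2",
}

network = gene_interaction_network(site_edges, site_to_gene)
for (gene_a, gene_b), weight in sorted(network.items()):
    kind = "self-loop" if gene_a == gene_b else "link"
    print(f"{gene_a} - {gene_b}: {kind}, weight {weight}")
```

Sorting each gene pair before counting keeps the network undirected, so a link recorded as (gene 2, gene 1) adds to the same weight as (gene 1, gene 2).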

5 Conclusion and future prospects

This review described the methodology for constructing networks utilizing the information of variable sites of multiple DNA sequences. The variable sites could be defined as nodes given their allelic information. Based on this definition of allelic information of variable sites and their associations, we categorized these networks into (i) co-occurrence and (ii) co-mutation networks. Such networks provide insights into the evolutionary patterns of given species under the spectrum of external environments, specifically, fast-evolving viral genomes and mitochondrial genomes. The co-occurrence network motifs were studied in the mitochondrial genome in the human population of five continents to identify the role of codon positions (Shinde et al. 2018) and population-based biases in mitochondrial epistatic interactions of ancestral sites (Shinde et al. 2021) in shaping human evolution and migration patterns. Such motifs were also applied to analyze the human populations residing at different altitudes with respect to the Tibetan population (Verma et al. 2021), and co-mutation-based genetic networks identified the interplay of different gene sets in convergent evolution of highlanders globally (Verma et al. 2022). The identification and analysis of network modules in these networks have been shown to have potential applications in gaining information on human migration and the geographic distribution of various haplogroups. The module-based study of amino acid substitutions in human influenza virus identified the change in the antigenic structure of viral proteins which could evade recognition by antibodies (Du et al. 2008). Further, the evolution of viral genomes poses a serious threat to human health, as seen in the global COVID-19 pandemic, and the role of such networks becomes more important in predicting the evolution of potentially pathogenic strains of these viruses and similar pathogens (Qin et al. 2021; Sarkar et al. 2021).
Genetic interaction networks have been successfully applied for mitochondrial (Verma et al. 2022) and viral genomes (Du et al. 2008) to identify the specific sets of genes responsible for evolution and adaptation. These networks could be further applied to identify unknown genes and their possible role in classifying viral or bacterial strains as virulent or hyper-virulent strains. These networks could also be applied to genomes of various organisms along with more sophisticated network science techniques, such as spectral techniques, as well as the established population genomics tools in advancing the understanding of genetic evolution.