Evolution of Protein Domain Architectures
This chapter reviews current research on how protein domain architectures evolve. We begin by summarizing work on the phylogenetic distribution of proteins, as this will directly impact which domain architectures can be formed in different species. Studies relating domain family size to occurrence have shown that family sizes generally follow power law distributions, both within genomes and larger evolutionary groups. These findings were subsequently extended to multi-domain architectures. Genome evolution models that have been suggested to explain the shape of these distributions are reviewed, as well as evidence for selective pressure to expand certain domain families more than others. Each domain has an intrinsic combinatorial propensity, and the effects of this have been studied using measures of domain versatility or promiscuity. Next, we study the principles of protein domain architecture evolution and how these have been inferred from distributions of extant domain arrangements. Following this, we review inferences of ancestral domain architecture and the conclusions concerning domain architecture evolution mechanisms that can be drawn from these. Finally, we examine whether all known cases of a given domain architecture can be assumed to have a single common origin (monophyly) or have evolved convergently (polyphyly). We end with a discussion of some available tools for computational analysis or exploitation of protein domain architectures and their evolution.
Key words: Protein domain, Protein domain architecture, Superfamily, Monophyly, Polyphyly, Convergent evolution, Domain evolution, Kingdoms of life, Domain co-occurrence network, Node degree distribution, Power law, Parsimony
By studying the domain architectures of proteins, we can understand their evolution as a modular phenomenon, with high-level events enabling significant changes to take place in a time span much shorter than required by point mutations only. This research field has become possible only now in the -omics era of science, as both identifying many domain families in the first place and acquiring enough data to chart their evolutionary distribution require access to many completely sequenced genomes. Likewise, the conclusions drawn generally consider properties averaged for entire species or organism groups or entire classes of proteins, rather than properties of single genes.
We will begin by introducing the basic concepts of domains and domain architectures, as well as the biological mechanisms by which these architectures can change. The remainder of the chapter is an attempt at answering, from the recent literature, the question of which forces shape domain architecture evolution and in what direction. The underlying issue concerns whether it is fundamentally a random process or whether it is primarily a consequence of selective constraints. We end by outlining some available software tools and resources for analysis of domain architectures and their evolution.
1.2 Protein Domains
Protein domains are high-level parts of proteins that either occur alone or together with partner domains on the same protein chain. Most domains correspond to tertiary structure elements and are able to fold independently. All domains exhibit evolutionary conservation, and many either perform specific functions or contribute in a specific way to the function of their proteins. The word domain strictly refers to a distinct region of a specific protein, an instance of a domain family. However, domain and domain family are often used interchangeably in the literature.
1.3 Domain Databases
The domain families in structural domain databases such as SCOP and CATH were gathered by identifying recurring elements in experimentally determined protein 3D structures. New 3D structures are assigned to these classes by semiautomated inspection. The SUPERFAMILY database assigns SCOP domains to all protein sequences by matching them to hidden Markov models (HMMs) that were derived from SCOP superfamilies, i.e., proteins whose evolutionary relationship is evidenced structurally. The Gene3D database is similarly constructed but based on domain families from CATH.
This approach resembles the methodology used in pure sequence-based domain databases such as Pfam . In these databases, conserved regions are identified from sequence analysis and background knowledge, to make multiple sequence alignments. From these, HMMs are built that are used to search new sequences for the presence of the domain represented by each HMM. All such instances are stored in the database. The HMM framework ensures stability across releases and high quality of alignments and domain family memberships. The stability allows annotation to be stored along with the HMMs and alignments. The InterPro database  is a meta-database of domains combining the assignments from several different source databases, including Pfam. The Conserved Domain Database (CDD) is a similar meta-database that also contains additional domains curated by the NCBI . SMART  is a manually curated resource focusing primarily on signaling and extracellular domains. ProDom  is a comprehensive domain database automatically generated from sequences in UniProt . Likewise, ADDA  is automatically generated by clustering subsequences of proteins from the major sequence databases, though it has not been updated for some time. Genome3D  is a recent consensus database which brings together several domain prediction tools as well as the SCOP and CATH databases for describing representative domain arrangements in a series of trusted, well-annotated genomes.
Since the domain definitions from different databases only partially overlap, results from analyses often cannot be directly compared. In practice, however, choice of database appears to have little effect on the main trends reported by the studies described here.
1.4 Domain Architectures
The terms “domain architecture” or “domain arrangement” generally refer to the domains in a protein and their order, reported in N- to C-terminal direction along the amino acid chain. Another recurring term is domain combinations. This refers to pairs of domains co-occurring in proteins, either anywhere in the protein (the “bag-of-domains” model) or specifically pairs of domains being adjacent on an amino acid chain, in a specific N- to C-terminal order . The latter concept is expanded to triplets of domains, which are subsequences of three consecutive domains, with the N- and C-termini used as “dummy” domains. A domain X occurring on its own in a protein thus produces the triplet N-X-C .
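The three representations described above can be made concrete with a minimal sketch. It assumes an architecture is given as a plain list of domain family names in N- to C-terminal order; the names X, Y, and Z and all function names are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical architecture: domain family names in N- to C-terminal order.
architecture = ["X", "Y", "Z"]

def bag_of_domains_pairs(arch):
    """Unordered pairs of distinct families co-occurring anywhere in the protein."""
    return {frozenset(pair) for pair in combinations(sorted(set(arch)), 2)}

def adjacent_pairs(arch):
    """Ordered (N-terminal, C-terminal) pairs of directly neighboring domains."""
    return list(zip(arch, arch[1:]))

def triplets(arch):
    """Consecutive triplets, padded with dummy domains 'N' and 'C' for the termini."""
    padded = ["N", *arch, "C"]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]
```

For a single-domain protein, `triplets(["X"])` yields the one triplet N-X-C, matching the convention stated in the text.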
1.5 Mechanisms for Domain Architecture Change
In organisms that have introns, exon shuffling [23, 24] refers to the integration of an exon from one gene into another, for instance, through chromosomal crossover or gene conversion. Exons can also be carried along by mobile genetic elements such as retrotransposons [24, 25].
Two adjacent genes can be fused into one if the first one loses its transcription stop signals. Point mutations can cause a gene to lose a terminal domain by introducing a new stop codon, after which the “lost” domain slowly degrades through point mutations as it is no longer under selective pressure . Alternatively, a multi-domain gene might be split into two genes if both a start and a stop signal are introduced between the domains. Novel domains could arise, for instance, through exonization, whereby an intronic or intergenic region becomes an exon, after which subsequent mutations would fine-tune its folding and functional properties [25, 27].
Recent literature (see, e.g., ) has discussed the possibility of de novo domain creation through a variety of mutational mechanisms, with some support for this occurring more often than previously thought [29, 30]. The majority of such new domains arise as novel genes from noncoding sequence but may subsequently recombine to join with older domains. Furthermore, young domains in vertebrates tend more often to occur at the N-terminus of a protein and tend to experience higher relative rates of non-synonymous substitution than older domains, which may reflect the nature of the mechanisms through which novel domains arise. Moore, Bornberg-Bauer et al. explore the relative prevalence of domain loss, duplication, and de novo origination in arthropods and plants, suggesting that such novel domains are most frequently associated with environmental adaptations.
2 Distribution of the Sizes of Domain Families
Domain architectures are fundamentally the realizations of how domains combine to form multi-domain proteins with complex functions. Understanding how these combinations come to be requires first that we understand how common the constituent domains of those architectures are and whether there are selective pressures determining their abundances. Because of this, the body of work concerning the sizes and species distributions of domain families becomes important to us.
Comprehensive studies of the distributions and evolution of protein domains and domain architectures have become possible as genome sequencing technologies have made many entire proteomes available for bioinformatic analysis. Initial work [33, 34, 35] focused on the number of copies that a protein family, either single domain or multi-domain, has in a species. Most conclusions from these early studies appear to hold true for domains, for supra-domains (see below) and for domain architectures [36, 37, 38]. In particular, these all exhibit a dominance of the population by a select few, i.e., a small number of domain families are present in a majority of the proteins in a genome, whereas most domain families are found only in a small number of proteins.
Power law distributions arise in a vast variety of contexts, including human income distributions, connectivity of internet routers, and word usage in languages ([34, 35, 40, 41], see also , for a conflicting view). Luscombe et al. described a number of other genomic properties that also follow power law distributions, such as the occurrence of DNA “words,” pseudogenes, and levels of gene expression. These distributions fit much better than the alternative they usually are contrasted against, an exponential decay distribution. The most important difference between exponential and power law distributions in this context is that the latter has a “fat tail”: while most domain families occur only a few times in each proteome, most domains in the proteome still belong to one of a small number of families.
Later work ([39, 43], see also ) demonstrated that proteome-wide domain occurrence data fit the generalized Pareto distribution (GPD) better than the power law but also asymptotically fit a power law as X ≫ i. The deviation from strict power law behavior depends on proteome size in a kingdom-dependent manner. Regardless, it is mostly appropriate to treat the domain family size distribution as approximately (and asymptotically) power law-like, and later studies typically assume this.
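The contrast between the two candidate distributions can be illustrated with a small sketch. It assumes synthetic family-size counts generated exactly from a power law with an arbitrarily chosen exponent of 2, and it compares straight-line fits in log-log space (power law) and semi-log space (exponential decay). All names and numbers are hypothetical, and ordinary least squares on logarithms is only a crude diagnostic, not the fitting procedure used in the studies cited.

```python
import math

def least_squares(xs, ys):
    """Ordinary least squares fit; returns (slope, intercept, R^2)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Synthetic data (assumption): number of families of size i is 1000 * i^-2.
sizes = list(range(1, 51))
counts = [1000 * i ** -2.0 for i in sizes]
log_counts = [math.log(c) for c in counts]

# Power law: log(count) is linear in log(size); the slope is minus the exponent.
pl_slope, _, pl_r2 = least_squares([math.log(i) for i in sizes], log_counts)

# Exponential decay: log(count) would be linear in size itself.
exp_slope, _, exp_r2 = least_squares(sizes, log_counts)
```

On such data the log-log fit recovers the exponent, while the semi-log fit is visibly worse because an exponential decays too quickly to accommodate the large families in the tail.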
What kind of evolutionary mechanisms give rise to this kind of distribution of gene or domain family sizes within genomes? In one model by Huynen and van Nimwegen , every gene within a gene family will be more or less likely to duplicate, depending on the utility of the function of that gene family within the particular lineage of organisms studied, and they showed that such a model matches the observed power laws. While they claimed that any model that explains the data must take into account family-specific probabilities of duplication fixation, Yanai and coworkers  proposed a simpler model using uniform duplication probability for all genes in the genome and also reported a good fit with data.
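The idea of a uniform per-gene duplication probability can be sketched as a toy simulation. This is a deliberate simplification and not the published model of Yanai and coworkers: it omits gene loss, divergence, and innovation, and the function name and parameters are hypothetical. It only illustrates that when every gene copy is equally likely to duplicate, larger families automatically receive proportionally more of the new copies.

```python
import random
from collections import Counter

def simulate_uniform_duplication(n_families, n_duplications, seed=0):
    """Grow a genome by duplicating uniformly chosen gene copies.

    Each gene is represented by the index of its family. Because every copy
    is equally likely to be picked, a family's chance of gaining a member is
    proportional to its current size.
    """
    rng = random.Random(seed)
    genes = list(range(n_families))      # one founding gene per family
    for _ in range(n_duplications):
        genes.append(rng.choice(genes))  # duplicate a uniformly chosen copy
    return Counter(genes)                # family index -> family size

family_sizes = simulate_uniform_duplication(n_families=100, n_duplications=5000)
size_histogram = Counter(family_sizes.values())  # family size -> number of families
```

The resulting `size_histogram` can be inspected for skew: even though no family is favored by construction, the final sizes are far from uniform.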
Later, more complex birth-death and birth-death-and-innovation (BDIM) [29, 34, 39, 46] models were introduced to explain the observed distributions, and from investigating which model parameter ranges allow this fit, the authors were able to draw several far-ranging conclusions. First, the asymptotic power law behavior requires that the rates of domain gain and loss are asymptotically equal. Karev et al. interpreted this as support for a punctuated equilibrium-type model of genome evolution, where domain family size distributions remain relatively stable for long periods of time but may go through stages of rapid evolution, representing a shift between different BDIM evolutionary models and significant changes in genome complexity. Like Huynen and van Nimwegen, they concluded that the likelihood of fixed domain duplications or losses in a genome depends directly on family size. The family will, however, only grow as long as new copies can find new functional niches and contribute to a net benefit for survival, i.e., as long as selection favors it.
Aside from Huynen and van Nimwegen’s, none of the models discussed depend very strongly on family-specific selection to explain the abundances of individual gene families, nor do they exclude such selection. Some domains may be highly useful to their host organism’s lifestyle, such as cell-cell connectivity domains to an organism beginning to develop multicellularity. Expansion of these domain families might therefore become more likely in some lineages than in others. To what extent these factors actually affect the size of domain families remains to be fully explored. Karev et al.  suggested that the rates of domain-level change events themselves—domain duplication and loss rates, as well as the rate of influx of novel domains from other species or de novo creation—must be evolutionarily adapted, as only some such parameters allow the observed distributions to be stable. Van Nimwegen  investigated how the number of genes increases in specific functional categories as total genome size increases. He found that the relationship matches a power law, with different coefficients for each functional class remaining valid over many bacterial lineages. Ranea et al. found similar results. Also, Ranea et al.  showed that, for domain superfamilies inferred to be present in the last universal common ancestor (LUCA), domains associated with metabolism have significantly higher abundance than those associated with translation, further supporting a connection between the function of a domain family and how likely it is to expand.
Extending the analysis to multi-domain architectures, Apic et al.  showed that the frequency distribution of multi-domain family sizes follows a power law curve similar to that reported for individual domain families. It therefore seems likely that the basic underlying mechanisms should be similar in both cases, i.e., that duplication of genes, and thus their domain architectures, is the most important type of event affecting the evolution of domain architectures.
3 Kingdom and Age Distribution of Domain Families and Architectures
How old are specific domain families or domain architectures? With knowledge of which organism groups they are found in, it is possible to draw conclusions about their age and whether lineage-specific selective pressures have determined their kingdom-specific abundances. Domain families and their combinations have arisen throughout evolutionary history, presumably by new combinations of pre-existing elements that may have diverged beyond recognition or by processes such as exonization. We can estimate the age of a domain family by finding the largest clade of organisms within which it is found, excluding organisms with only xenologs, i.e., horizontally transferred genes . The age of this lineage’s root is the likely age of the family. The same holds true for domain combinations and entire domain architectures. This methodology allows us to determine how changing conditions at different points in evolutionary history, or in different lineages, have affected the evolution of domain architectures.
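The dating procedure described above amounts to finding the last common ancestor of all species that carry the family. A minimal sketch follows, assuming the species tree is given as a child-to-parent mapping and that species holding only xenologs have already been removed from the input. The tree below is a drastically simplified, hypothetical example, and the function names are illustrative.

```python
# Hypothetical, highly simplified species tree as a child -> parent mapping.
PARENT = {
    "human": "Metazoa", "fly": "Metazoa",
    "Metazoa": "Eukaryota", "yeast": "Eukaryota",
    "E_coli": "Bacteria",
    "Eukaryota": "LUCA", "Bacteria": "LUCA",
}

def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def family_origin(species_with_family, parent):
    """Root of the smallest clade containing all species that carry the family.

    The caller is assumed to have excluded species with only xenologs.
    """
    species = list(species_with_family)
    common = path_to_root(species[0], parent)  # ordered from leaf to root
    for sp in species[1:]:
        ancestors = set(path_to_root(sp, parent))
        common = [node for node in common if node in ancestors]
    return common[0]  # the first shared node is the most recent one
```

For example, a family found in human and yeast is placed at the Eukaryota node, whereas one found in fly and E_coli is placed at the root. The same function applies unchanged to domain combinations or whole architectures.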
Apic et al.  analyzed the distribution of SCOP domains across 40 genomes from archaea, bacteria, and eukaryotes. They found that a majority of domain families are common to all three kingdoms of life and thus likely to be ancient. Kuznetsov et al.  performed a similar analysis using InterPro domains and found that only about one fourth of all such domains were present in all three kingdoms, but a majority was present in more than one of them. Lateral gene transfer or annotation errors can cause a domain family to be found in one or a few species in a kingdom without actually belonging to that kingdom. To counteract this, one can require that a family must be present in at least a reasonable fraction of the species within a kingdom for it to be considered anciently present there. For instance, using Gene3D assignments of CATH domains to 114 complete genomes, mainly bacterial, Ranea et al.  isolated protein superfamily domains that were present in at least 90% of all the genomes and at least 70% of the archaeal and eukaryotic genomes, respectively. Under these stringent cutoffs for considering a domain to be present in a kingdom, 140 domains, 15% of the CATH families found in at least one prokaryote genome, were inferred to be ancient. Chothia and Gough  performed a similar study on 663 SCOP superfamily domains evaluated at many different thresholds and found that while 516 (78%) superfamilies were common to all three kingdoms at a threshold of 10% of species in each kingdom, only 156 (24%) superfamilies were common to all three kingdoms at a threshold of 90%. They also showed that for prokaryotes, a majority of domain instances (i.e., not domain families but actual domain copies) belong to common superfamilies at all thresholds below 90%.
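The effect of such presence thresholds can be sketched in a few lines. The example assumes a set of species in which a family was detected and a mapping from species to kingdom; the species identifiers, the function name, and the thresholds used in the demonstration are hypothetical and do not reproduce the exact criteria of any study cited above.

```python
def kingdoms_with_family(species_with_family, kingdom_of, threshold):
    """Kingdoms in which at least `threshold` of the species carry the family."""
    totals, hits = {}, {}
    for species, kingdom in kingdom_of.items():
        totals[kingdom] = totals.get(kingdom, 0) + 1
        if species in species_with_family:
            hits[kingdom] = hits.get(kingdom, 0) + 1
    return {k for k in totals if hits.get(k, 0) / totals[k] >= threshold}

# Hypothetical toy dataset: eight genomes and one family missing from one archaeon.
KINGDOM_OF = {
    "arc1": "Archaea", "arc2": "Archaea",
    "bac1": "Bacteria", "bac2": "Bacteria", "bac3": "Bacteria", "bac4": "Bacteria",
    "euk1": "Eukaryota", "euk2": "Eukaryota",
}
presence = {"arc1", "bac1", "bac2", "bac3", "bac4", "euk1", "euk2"}

lenient = kingdoms_with_family(presence, KINGDOM_OF, threshold=0.5)
strict = kingdoms_with_family(presence, KINGDOM_OF, threshold=0.9)
```

Under the lenient threshold the family counts as common to all three kingdoms, while under the strict one it does not, which mirrors the threshold dependence reported by Chothia and Gough.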
Extending to domain combinations, Apic et al.  reported that a majority of SCOP domain pairs are unique to each kingdom but also that more kingdom-specific domain combinations than expected were composed only of domain families shared between all three kingdoms. This would imply a scenario where the independent evolution of the three kingdoms mainly involved creating novel combinations of domains that existed already in their common ancestor.
Several studies have reported interesting findings on domain architecture evolution in lineages closer to ourselves: in metazoa and vertebrates. Ekman et al. claimed that new metazoa-specific domains and multi-domain architectures have arisen roughly once every 0.1–1 million years in this lineage. According to their results, most metazoa-specific multi-domain architectures are a combination of ancient and metazoa-specific domains. The latter category is, however, mostly found as novel single-domain proteins. Many of the novel metazoan multi-domain architectures involve domains that are versatile (see below) and exon-bordering (allowing for their insertion through exon shuffling). The novel domain combinations in metazoa are enriched for proteins associated with functions required for multicellularity: regulation, signaling, and functions involved in newer biological systems such as immune response or development of the nervous system, as previously noted by Patthy. They also showed support for exon shuffling as an important mechanism in the evolution of metazoan domain architectures. Itoh et al. added that animal evolution differs significantly from other eukaryotic groups in that lineage-specific domains played a greater part in creating new domain combinations. Nasir et al. analyzed the age and taxonomic distribution of domains drawing on species phylogenies reconstructed from domain repertoires, concluding among other things that most widespread domains are relatively old and suggesting high numbers of both domain gain and loss in the evolution of the three organismal superkingdoms. Bacterial and archaeal genes have tended to gain or lose domains encoding aspects of metabolic capacity, whereas those of eukaryotes, including multicellular ones, have gained domains enabling more elaborate extracellular processes such as immunity and regulatory capacities.
Our finding that 10% of all Pfam-A domains are present in all three main kingdoms is strikingly lower than in the earlier works and is even lower than reported by Ranea et al. , who used very stringent cutoffs. However, a direct comparison of statistics for Pfam domains/clans and CATH superfamilies is difficult. The decrease in ancient families that we observe may be a consequence of the massive increase in sequenced genomes and/or that the recent growth of Pfam has added relatively more kingdom-specific domains. We further found that only 1.5% of all domains or domain combinations are unique to archaea, suggesting that known representatives of this lineage have undergone very little independent evolution and/or that most archaeal gene families have been horizontally transferred to other kingdoms. The trend when going from domain via domain combinations to whole architectures is clear—the more complex patterns are less shared between the kingdoms. In other words, each kingdom has used a common core of domains to construct its own unique combinations of multi-domain architectures.
4 Domain Co-occurrence Networks
Table: The 20 most densely connected hubs with regard to immediate domain neighbors, according to Pfam 30.0. Column: Number of different immediate neighbors. Hubs: P-loop containing nucleoside triphosphate hydrolase superfamily; FAD/NAD(P)-binding Rossmann fold superfamily; Protein kinase superfamily; Ig-like fold superfamily (E-set); Tetratricopeptide repeat superfamily; Common phosphate-binding site TIM barrel superfamily; Ribonuclease H-like superfamily; TIM barrel glycosyl hydrolase superfamily; Peptidase clan CA; Beta propeller clan; Family A G protein-coupled receptor-like superfamily; Major facilitator superfamily
One way of evolving a domain co-occurrence network that follows a power law is by “preferential attachment” [53, 57]. This means that new edges (corresponding to proteins where two domains co-occur) are added with a probability that is higher the more edges these nodes (domains) already have, resulting in a power law distribution.
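Preferential attachment can be sketched with a short simulation. The version below is a generic growth model in which each new domain forms one co-occurrence edge with an existing domain chosen with probability proportional to its current degree. It is a textbook-style simplification rather than the procedure of any study cited here, and the function name and parameters are hypothetical.

```python
import random
from collections import Counter

def preferential_attachment(n_domains, seed=0):
    """Grow a co-occurrence network one domain (node) at a time.

    `endpoints` lists every edge endpoint seen so far, so a uniform draw from
    it picks an existing domain with probability proportional to its degree.
    """
    rng = random.Random(seed)
    edges = [(0, 1)]       # start from a single edge between domains 0 and 1
    endpoints = [0, 1]
    for new in range(2, n_domains):
        partner = rng.choice(endpoints)  # degree-proportional choice
        edges.append((new, partner))
        endpoints.extend([new, partner])
    return edges

edges = preferential_attachment(n_domains=2000)
degree = Counter(node for edge in edges for node in edge)
degree_histogram = Counter(degree.values())  # degree -> number of domains
```

Inspecting `degree_histogram` shows many domains with a single neighbor and a few highly connected hubs, the qualitative signature discussed in this section.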
Apic et al.  considered a null model for random domain combination, in which a proteome contains domain combinations with a probability based on the relative abundances of the domains only. They showed that this model does not hold and that far fewer domain combinations than expected under it are actually seen. If most domain duplication events are gene duplication events that do not change domain architecture—or at the very least do not disrupt domain pairs—then this finding is not unexpected, nor does it require or exclude any particular selective pressure to keep these domains together in proteins. There is growing support for the idea that separate instances of a given domain architecture in general descend from a single ancestor with that architecture , with polyphyletic evolution of domain architectures occurring only in a small fraction of cases [53, 59, 60].
Itoh et al.  performed reconstruction of ancestral domain architectures using maximum parsimony, as described in the next section. This allowed them to study the properties of the ancestral domain co-occurrence network and thus explore how network connectivity has altered over evolutionary time. Among other things, they found increased connectivity in animals, particularly of animal-specific domains, and suggest that this phenomenon explains the high connectivity for eukaryotes reported by Wuchty . For non-animal eukaryotes, they reported a correlation between connectivity and age, such that older domains had relatively higher connectivity, with domains preceding the divergence of eukaryotes and prokaryotes being the most highly connected, followed by early eukaryotic domains. In other words, early eukaryotic evolution saw the emergence of some key hub proteins, while the most prominent eukaryotic hubs emerged in the animal lineage. Parikesit et al.  studied the functional annotation of co-occurring domains in eukaryotes, concluding that while these may have different associated functional descriptors, these descriptors usually tend to fall within the same overall category within the gene ontology. Co-occurring domains thus tend to contribute to the same overall process type rather than have very widely divergent functional annotations. Hsu et al.  constructed a network linking domain architectures (i.e., each node is a multi-domain architecture, as opposed to in a regular domain co-occurrence network) where parsimonious reconstruction suggests evolution of one from the other, identifying “highly evolvable” architectures as hubs in this network. Proteins with such architectures were reported to be more widespread, less often essential, more often duplicated, and more often associated with gene functions involved in specific adaptation of organisms.
5 Supra-domains and Conserved Domain Order
As we have seen, whole multi-domain architectures or shorter stretches of adjacent domains are often repeated in many proteins. These only cover a small fraction of all possible domain combinations. Are the observed combinations somehow special? We would expect selective pressure to retain some domain combinations but not others, since only some domains have functions that would synergize together in one protein. Often, co-occurring domains require each other structurally or functionally, for instance, in transcription factors where the DNA-binding domain provides substrate specificity, whereas the trans-activating domain recruits other components of the transcriptional machinery . Vogel et al.  identified series of domains co-occurring as a fixed unit with conserved N- to C-terminal order but flanked by different domain architectures and termed them supra-domains. By investigating their statistical overrepresentation relative to the frequency of the individual domains in the set of nonredundant domain architectures (where “nonredundant” is crucial, as otherwise, e.g., whole-gene duplication would bias the results), they identified a number of such supra-domains. Many ancient domain combinations (shared by all three kingdoms) appear to be such selectively preserved supra-domains.
How conserved is the order of domains in multi-domain architectures? In a recent study, Kummerfeld and Teichmann built a domain co-occurrence network with directed edges, allowing it to represent the order in which two domains are found in proteins. As in other studies, the distribution of node degrees fits a power law well. Most domain pairs were only found in one orientation. This does not seem required for functional reasons, as flexible linker regions should allow the necessary interface to form also in the reversed case, but may rather be an indication that most domain combinations are monophyletic. Weiner and Bornberg-Bauer analyzed the evolutionary mechanisms underlying a number of reversed domain order cases and concluded that independent fusion/fission is the most frequent scenario. Although domain reversals occur in only a few proteins, they actually occur more often than expected from randomizing a co-occurrence network. That study also observed that the domain co-occurrence network is more clustered than expected by a random model and that these clusters are also functionally more coherent than would be expected by chance.
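A directed co-occurrence network of the kind described above can be represented simply as a set of ordered pairs. The sketch below assumes architectures are lists of domain names in N- to C-terminal order; the domain names and function names are hypothetical, and the randomization test used in the cited work is not reproduced.

```python
def directed_edges(architectures):
    """Ordered (N-terminal, C-terminal) pairs of adjacent domains, over all proteins."""
    return {(a, b) for arch in architectures for a, b in zip(arch, arch[1:])}

def pairs_in_both_orientations(edges):
    """Domain pairs observed both as A-B and as B-A (self-pairs excluded)."""
    return {frozenset((a, b)) for a, b in edges if a != b and (b, a) in edges}

# Hypothetical set of architectures.
architectures = [["A", "B"], ["B", "A", "D"], ["A", "D"]]
edges = directed_edges(architectures)
reversed_pairs = pairs_in_both_orientations(edges)
```

Here only the pair A/B occurs in both orientations, while A-D is seen in a single order.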
6 Domain Mobility, Promiscuity, or Versatility
While some protein domains co-occur with a variety of other domains, some are always seen alone or in a single architecture in all proteomes where they are found. A natural explanation is that some domains are more likely to end up in a variety of architectural contexts than others due to some intrinsic property they possess. Is such domain versatility or promiscuity a persistent feature of a given domain, and does it correlate with certain functional or biological properties of the domain?
Several ways of measuring domain versatility have been suggested. One measure, NCO, counts the number of other domains found in any architectures where the domain of interest is found. Another measure, NN, instead counts the number of distinct other domains that a domain is found adjacent to. Yet another measure, NTRP, counts the number of distinct triplets of consecutive domains where the domain of interest is found in the middle. All of these measures can be expected to be higher for common domains than for rare domains, i.e., variations in domain abundance (the number of proteins a domain is found in) can hide the intrinsic versatility of domains. Therefore, three different studies [14, 55, 66] formulated relative domain versatility indices that aim to measure versatility independently of abundance. It is worth noting that most studies have considered only immediately adjacent domain neighbors in these analyses, a restriction based on the assumption that those are more likely to interact functionally than domains far apart on a common amino acid chain. More recent work introduced a network versatility metric which can classify domains as being central or peripheral with regard to the large-scale structure of their bigram network (i.e., the network linking domains found adjacent in proteins), observing that such peripheral domains exhibit relatively higher primary sequence conservation, suggestive of adaptation to more specific functions, whereas the core domains may be more multifunctional.
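The three raw versatility counts can be written down directly from their verbal definitions. The sketch below assumes architectures are lists of domain names in N- to C-terminal order and uses 'N' and 'C' as dummy terminus domains for the triplets. Function names and the toy data are hypothetical, and details such as the handling of repeated domains are assumptions rather than the exact conventions of the original studies.

```python
def nco(domain, architectures):
    """Number of other domains co-occurring with `domain` in any architecture."""
    partners = {d for arch in architectures if domain in arch for d in arch}
    return len(partners - {domain})

def nn(domain, architectures):
    """Number of distinct other domains found immediately adjacent to `domain`."""
    neighbors = set()
    for arch in architectures:
        for i, d in enumerate(arch):
            if d == domain:
                if i > 0:
                    neighbors.add(arch[i - 1])
                if i < len(arch) - 1:
                    neighbors.add(arch[i + 1])
    neighbors.discard(domain)  # a tandem repeat is not counted as a neighbor
    return len(neighbors)

def ntrp(domain, architectures):
    """Number of distinct consecutive triplets with `domain` in the middle."""
    found = set()
    for arch in architectures:
        padded = ["N", *arch, "C"]  # dummy domains for the two termini
        for i in range(1, len(padded) - 1):
            if padded[i] == domain:
                found.add(tuple(padded[i - 1:i + 2]))
    return len(found)

# Hypothetical architectures.
architectures = [["X"], ["X", "Y"], ["Z", "X", "Y"], ["Y", "W"]]
```

In this toy set, domain Y co-occurs with three other domains (NCO) but is adjacent to only two of them (NN), which illustrates how the measures can diverge.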
The first relative versatility study was presented by Vogel et al. , who used as their domain dataset the SUPERFAMILY database applied to 14 eukaryotic, 14 bacterial, and 14 archaeal proteomes. They modeled the number of unique immediate neighbor domains as a power law function of domain abundance, performed a regression on this data, and used the resulting power law exponent as a relative versatility measure. Basu et al.  used Pfam and SMART  domains and measured relative domain versatility for 28 eukaryotes as the immediate neighbor pair frequency normalized by domain frequency. They then defined promiscuous domains as a class according to a bimodality in the distribution of the raw numbers of unique domain immediate neighbor pairs. Weiner et al.  used Pfam domains for 10,746 species in all kingdoms and took as their relative versatility measure the logarithmic regression coefficient for each domain family across genomes, meaning that it is not defined within single proteomes.
To what extent is high versatility an intrinsic property of a certain domain? Vogel et al. only examined large groups of domains together and therefore did not address this question for single domains. Basu et al. and Weiner et al. instead analyzed each domain separately and concluded that there are strong variations in relative versatility at this level. Their results are very different in detail, however, as reflected by the fact that only one domain family (PF00004, AAA ATPase family) is shared between the ten most versatile domains reported in the two studies. As they used fairly similar domain datasets, it would appear that the results strongly depend on the definition of relative versatility. Another potential reason for the different results is that Basu’s list was based on eukaryotes only, while Weiner’s analysis was heavily biased toward prokaryotes. Furthermore, the top ten lists in Basu et al. and their follow-up paper overlap by only four domains, yet the main difference is that in the latter study all 28 eukaryotes were considered, while the former study was limited to the subset of 20 animal, plant, and fungal species. The choice of species thus seems pivotal for the results when using this method. The two studies also used different methods for calculating the average value of relative versatility across many species, which may influence the results.
Does domain versatility vary between different functional classes of domains? Vogel et al.  found no difference in relative versatility between broad functional or process categories or between SCOP structural classes. In contrast to this, Basu et al.  reported that high versatility was associated with certain functional categories in eukaryotes. However, no test for the statistical significance of these results was performed. Weiner et al.  also noted some general trends but found no significant enrichment of gene ontology terms in versatile domains. This does not necessarily mean that no such correlation exists, but more research is required to convincingly demonstrate its strength and its nature. More recently, Cromar et al.  analyzed domain architectures in eukaryotic extracellular matrix proteomes, noting that these structures are organized around a set of versatile domains under the weighted bigram metric of Basu et al. .
Another important question is to what extent domain versatility varies across evolutionary lineages. Vogel et al. reported no large differences in average versatility for domains in different kingdoms. The versatility measure of Basu et al. can be applied within individual genomes, which means that according to this measure domains may be versatile in one organism group but not in another, as well as gain or lose versatility across evolutionary time. They found that more domains were highly versatile in animals than in other eukaryotes. Modeling versatility as a binary property defined for domains in extant species, they further used a maximum parsimony approach to study the persistence of versatility for each domain across evolutionary time and concluded that both gain and loss of versatility are common during evolution. Inferring ancestral domain architectures, Cohen-Gihon et al. reported an increase in versatility in many domains during eukaryotic evolution, in particular around the divergence of Bilateria. Weiner et al. divided domains into age categories based on distribution across the tree of life and reported that the versatility index is not dependent on age, i.e., domains have equal chances of becoming versatile at different times in evolution. This is consistent with the observation by Basu et al. that versatility is a fast-evolving and varying property. When measuring versatility as a regression within different organism groups, Weiner et al. found slightly lower versatility in eukaryotes, which is in conflict with the findings of Basu et al. Again, this underscores the strong dependence of the results on the method and dataset.
Further properties reported to correlate with domain versatility include sequence length: Weiner et al. found that longer domains are significantly more versatile within the framework of their study, while shorter domains are more abundant and hence may have more domain neighbors in absolute numbers. Basu et al. further reported that more versatile domains have more structural interactions than other domains. To determine which of these reported correlations genuinely reflect universal biological trends, further comprehensive studies are needed using more data and uniform procedures. This would hopefully allow the results from the studies described here to be validated and any conflicts between them to be resolved.
Basu et al.  further analyzed the phylogenetic spread of all immediate domain neighbor pairs (“bigrams”) containing domains classified as promiscuous. The main observation this yielded was that although most such combinations occurred in only a few species, most promiscuous domains are part of at least one combination that is found in a majority of species. They interpreted this as implying the existence of a reservoir of evolutionarily stable domain combinations from which lineage-specific recombination may draw promiscuous domains to form unique architectures. Later work by Hsu et al.  analyzed the domain co-occurrence networks centered on each domain family, classifying such subnetworks as being either mostly starlike, taillike, or tetragon-like, with promiscuous domains forming cores of starlike architecture networks in this representation.
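The phylogenetic spread of bigrams described above can be illustrated with a minimal Python sketch. It assumes that each species is given as a list of domain architectures (ordered lists of domain identifiers) and that a bigram is an ordered pair of immediately adjacent domains; the function name and the toy identifiers are hypothetical and do not come from the cited studies.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def bigram_species_counts(
    proteomes: Dict[str, List[List[str]]]
) -> Dict[Tuple[str, str], int]:
    """Count, for each adjacent domain pair, the number of species it occurs in.

    `proteomes` maps a species name to its list of domain architectures.
    Treating bigrams as ordered pairs is an assumption of this sketch.
    """
    species_with = defaultdict(set)
    for species, architectures in proteomes.items():
        for arch in architectures:
            # zip pairs each domain with its immediate C-terminal neighbor
            for left, right in zip(arch, arch[1:]):
                species_with[(left, right)].add(species)
    return {bigram: len(sp) for bigram, sp in species_with.items()}

# Toy data with made-up domain identifiers
toy = {
    "sp1": [["A", "B", "C"]],
    "sp2": [["A", "B"], ["C", "A"]],
    "sp3": [["A", "B"]],
}
counts = bigram_species_counts(toy)
```

In this toy input, the pair (A, B) is found in all three species while the other pairs are species-specific, mirroring the pattern of a few widespread combinations alongside many rare ones.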
7 Principles of Domain Architecture Evolution
What mutation events can generate new domain architectures, and what is their relative predominance? The question can be approached by comparing protein domain architectures of extant proteins. This is based on the likely realistic assumption that most current domain architectures evolved from ancestral domain architectures that can still be found unchanged in other proteins. Because of this, in pairs of most similar extant domain architectures, one can assume that one of them is ancestral. This agrees well with results indicating that most groups of proteins with identical domain architectures are monophyletic. By comparing the most similar proteins, several studies have attempted to chart the relative frequencies of different architecture-changing mutations.
Björklund et al. used this particular approach and came to several conclusions. First, changes to domain architecture are much more common at the N- and C-termini than internally in the architecture. This is consistent with several mechanisms for architecture changes such as introduction of new start or stop codons or mergers with adjacent genes, and similar results have been found in several other studies [15, 25, 26]. Furthermore, insertions or deletions of domains (“indels”) are more common than substitutions of domains, and the events in question mostly concern just single domains, except in cases with repeats expanding with many domains in a row. In a later study, the same group made use of phylogenetic information as well, allowing them to infer directionality of domain indels. They then found that domain insertions are significantly more common than domain deletions.
Weiner et al. performed a similar analysis on domain loss and found compatible results: most changes occur at the termini (see also discussion in ). Moreover, they demonstrated that terminal domain loss seldom involves losing only part of a domain, or rather, that such partial losses quickly progress into loss of the entire domain. However, it is important to ensure such observations are not confounded by cases where errors in gene boundary recognition make domain detection less accurate.
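To make the distinction between terminal and internal changes concrete, the following Python sketch classifies the difference between two domain architectures. It is only an illustration under simple assumptions (architectures as ordered lists of domain identifiers, a single contiguous indel, no directionality), not the procedure used in the cited studies; the function name and category labels are hypothetical.

```python
from typing import Sequence

def classify_change(a: Sequence[str], b: Sequence[str]) -> str:
    """Classify how two domain architectures differ.

    Without phylogenetic information the direction is unknown, so an
    insertion and a deletion are both reported as an "indel".
    """
    if list(a) == list(b):
        return "identical"
    short, long_ = sorted((list(a), list(b)), key=len)
    n, m = len(short), len(long_)
    if n == m:
        return "substitution or complex"
    is_prefix = long_[:n] == short       # extra domains at the C-terminus
    is_suffix = long_[m - n:] == short   # extra domains at the N-terminus
    if is_prefix and is_suffix:
        # e.g. A vs A-A: repeats make the affected end undecidable
        return "terminal indel (end ambiguous)"
    if is_prefix:
        return "C-terminal indel"
    if is_suffix:
        return "N-terminal indel"
    # Internal indel: shared prefix plus shared suffix cover the shorter one
    p = 0
    while p < n and short[p] == long_[p]:
        p += 1
    s = 0
    while s < n - p and short[n - 1 - s] == long_[m - 1 - s]:
        s += 1
    if p + s == n:
        return "internal indel"
    return "substitution or complex"
```

Applied to all pairs of most similar architectures, tallying these labels would give the kind of terminal versus internal comparison discussed above.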
There is some support [23, 74, 75] for exon shuffling to have played an important part in domain evolution, and there are a number of domains that match intron borders well, for example, structural domains in extracellular matrix proteins. While it may not be a universal mechanism, exon shuffling is suggested to have been particularly important for vertebrate evolution .
Recognizing the potential role of gene duplications in domain architecture evolution, Grassi et al.  analyzed domain architecture shifts following either whole-genome duplication (WGD) or smaller-scale gene duplication events in yeast. Surviving WGD duplicates had retained ancestral architecture in ca 95% of cases, with approximately the same chance of architecture change in WGD as under local duplication. Genes retained over time from either type of duplication were enriched for a core of commonly occurring domains but with a subset of rarer domains additionally enriched in retained WGD duplicates compared to locally duplicated genes. The former category more often was associated with housekeeping-type gene functions, whereas the latter more often involved adaptive functions. Functional change was generally larger than architectural change following duplication. Zhang et al.  similarly studied domain architecture evolution in plants, noting that lineage-specific architecture expansions largely can be explained from differential retention of genes following successive whole-genome duplications. Another form of domain duplication particularly relevant in plants is amplification of the numbers of domain repeats in proteins, discussed, e.g., by Sharma and Pandey .
8 Inferring Ancestral Domain Architectures
The above analyses, based on pairwise comparison of extant protein domain architectures, cannot tally ancestral evolutionary events nearer the root of the tree of life. With ancestral architectures, one can directly determine which domain architecture changes have taken place during evolution and precisely chart how mechanisms of domain architecture evolution operate, as well as gauge their relative frequency. A drawback is that since we can only infer ancestral domain architectures from extant proteins, the result will depend somewhat on our assumptions about evolutionary mechanisms. On the upside, it should be possible to test how well different assumptions fit the observed modern-day protein domain architecture patterns.
Attempts at such reconstructions have been made using parsimony. Given a gene tree and the domain architectures at the leaves, dynamic programming can be used to find the assignment of architectures to internal nodes that requires the smallest number of domain-level mutation events. This simple model can be elaborated by weighting loss and gain differently or by requiring that a domain or an architecture be gained at most once in a tree (Dollo parsimony).
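The dynamic programming idea can be sketched for the simplest case, the presence or absence of a single domain on a rooted gene tree, using weighted (Sankoff-style) parsimony. This is a minimal illustration, not the method of any study cited here. The tree encoding (nested tuples with leaf names as strings), the function names, and the choice to leave the root state unconstrained are all assumptions of this sketch.

```python
from typing import Dict, Tuple, Union

Tree = Union[str, tuple]  # a leaf name, or a tuple of child subtrees
INF = float("inf")

def sankoff(tree: Tree, leaf_state: Dict[str, bool],
            gain_cost: float = 1.0, loss_cost: float = 1.0) -> Tuple[float, float]:
    """Return (min cost if domain absent here, min cost if present here)."""
    if isinstance(tree, str):
        # A leaf is fixed to its observed state; the other state is impossible
        return (INF, 0.0) if leaf_state[tree] else (0.0, INF)
    cost_absent, cost_present = 0.0, 0.0
    for child in tree:
        c_abs, c_pres = sankoff(child, leaf_state, gain_cost, loss_cost)
        # Parent lacks the domain: child also lacks it, or gained it
        cost_absent += min(c_abs, c_pres + gain_cost)
        # Parent has the domain: child kept it, or lost it
        cost_present += min(c_pres, c_abs + loss_cost)
    return (cost_absent, cost_present)

def min_events(tree: Tree, leaf_state: Dict[str, bool],
               gain_cost: float = 1.0, loss_cost: float = 1.0) -> float:
    """Smallest total weighted number of gains and losses over the tree."""
    return min(sankoff(tree, leaf_state, gain_cost, loss_cost))

tree = (("a", "b"), ("c", "d"))
clustered = {"a": True, "b": True, "c": False, "d": False}
scattered = {"a": True, "b": False, "c": True, "d": False}
```

With unit costs, the clustered pattern needs a single event while the scattered pattern needs two. Raising `gain_cost` relative to `loss_cost` shifts reconstructions toward early gain followed by losses, which is the direction in which Dollo parsimony takes the model to its limit.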
An early study of Snel et al.  considered 252 gene trees across 17 fully sequenced species and used parsimony to minimize the number of gene fission and fusion events occurring along the species tree. Their main conclusion, that gene fusions are more common than gene fissions, was subsequently supported by a larger study by Kummerfeld and Teichmann , where fusions were found to be about four times as common as fissions in a most parsimonious reconstruction. Fong et al.  followed a similar procedure on yet more data and concluded that fusion was 5.6 times as likely as fission.
Buljan and Bateman  performed a similar maximum parsimony reconstruction of ancestral domain architectures. They too observed that domain architecture changes primarily take place at the protein termini, and the authors suggested that this might largely occur because terminal changes to the architecture are less likely to disturb overall protein structure. Moreover, they concluded from reconciliation of gene and species trees that domain architecture changes were more common following gene duplications than following speciation but that these cases did not differ with respect to the relative likelihood of domain losses or gains.
Recently, Buljan et al. presented a new ancestral domain architecture reconstruction study which assumed that gain of a domain should take place only once in each gene tree, i.e., Dollo parsimony. Their results also support gene fusion as a major mechanism for domain architecture change. The fusion is generally preceded by a duplication of either of the fused genes. Intronic recombination and insertion of exons are observed but relatively rarely. They also found support for de novo creation of disordered segments by exonization of previously noncoding regions. More recently still, a method for domain architecture history reconstruction using a network construct called a plexus was described. Yang and Bourne further described another parsimony-based reconstruction approach, as did Wu et al., reporting that histories of signaling and development proteins are enriched for gene fusion/fission events. Stolzer et al. presented another method for domain architecture history inference, made available through the Notung software.
9 Polyphyletic Domain Architecture Evolution
There appears to be a “grammar” for how protein domains are allowed to be combined. If nature continuously explores all possible domain combinations, one would expect that the allowed combinations would be created multiple times throughout evolution. Such independent creation of the same domain architecture can be called convergent or polyphyletic evolution, whereas a single original creation event for all extant examples of an architecture would be called divergent or monophyletic evolution. This is relevant for several reasons, not least because it determines whether or not we can expect two proteins with identical domain architectures to have the same history along their entire length.
A graph theoretical approach to answer this question was taken by Przytycka et al., who analyzed the set of all proteins containing a given superfamily domain. The domain architectures of these proteins define a domain co-occurrence network, where edges connect two domains both found in a protein, regardless of sequential arrangement. The proteins of such a set can also be placed in an evolutionary tree, and the evolution of all multi-domain architectures containing the reference domain can be expressed in terms of insertions and deletions of other domains along this tree to form the extant domain architectures. The question, then, is whether or not all leaf nodes sharing some domain arrangement (up to and including an entire architecture) stem from a single ancestral node possessing this combination of domains. For monophyly to be true for all architectures containing the reference domain, the same companion domain cannot have been inserted in more than one place along the tree describing the evolution of the reference domain. By application of graph theory and Dollo parsimony, they showed that monophyly is only possible if the domain co-occurrence network defined by all proteins containing the reference domain is chordal, i.e., it contains no chordless cycles longer than three edges (every longer cycle has an edge connecting two of its nonconsecutive nodes).
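The chordality criterion can be checked with a short Python sketch. It builds the co-occurrence network from a list of architectures and tests chordality by repeatedly removing a simplicial node (a node whose neighbors are all mutually connected), which succeeds for every node exactly when the graph is chordal. This is a simple illustration rather than the implementation of Przytycka et al.; the function names and the toy domain labels (with `R` standing for the reference domain) are hypothetical, and faster linear-time algorithms exist.

```python
from itertools import combinations
from typing import Dict, Iterable, Sequence, Set

def cooccurrence_graph(architectures: Iterable[Sequence[str]]) -> Dict[str, Set[str]]:
    """Connect every pair of domains found together in at least one protein."""
    graph: Dict[str, Set[str]] = {}
    for arch in architectures:
        domains = set(arch)  # sequential order is ignored
        for d in domains:
            graph.setdefault(d, set())
        for u, v in combinations(domains, 2):
            graph[u].add(v)
            graph[v].add(u)
    return graph

def is_chordal(graph: Dict[str, Set[str]]) -> bool:
    """Test chordality by eliminating simplicial nodes one at a time."""
    g = {v: set(nbrs) for v, nbrs in graph.items()}  # work on a copy
    while g:
        simplicial = next(
            (v for v, nbrs in g.items()
             if all(b in g[a] for a, b in combinations(nbrs, 2))),
            None,
        )
        if simplicial is None:
            return False  # a chordless cycle of length > 3 remains
        for n in g[simplicial]:
            g[n].discard(simplicial)
        del g[simplicial]
    return True

# Companion domains A-B-C-D form a chordless 4-cycle around reference domain R
cyclic = [["R", "A", "B"], ["R", "B", "C"], ["R", "C", "D"], ["R", "D", "A"]]
# One extra architecture adds the chord A-C
with_chord = cyclic + [["R", "A", "C"]]
```

Here `is_chordal(cooccurrence_graph(cyclic))` is false, so at least one companion domain must have been inserted more than once, whereas the network with the added chord passes the test and complete monophyly remains possible.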
Przytycka et al.  then evaluated this criterion for all superfamily domains in a large-scale dataset. For domains where the co-occurrence network contained fewer than 20 nodes (domains), the chordal property and hence the possibility of complete monophyly of all domain combinations and domain architectures containing that domain held. By comparing actual domain co-occurrence networks with a preferential attachment null model, they showed that far more architectures are potentially monophyletic than would be expected under a pure preferential attachment process. This finding is analogous to the observation by Apic et al.  that most domain combinations are duplicated more frequently (or reshuffled less) than expected by chance. In other words, gene duplication is much more frequent than domain recombination . However, for many domains that co-occurred with more than 20 other different domains, particularly for domains previously reported as promiscuous, the chordal property was violated, meaning that multiple independent insertions of the same domain, relative to the reference domain phylogeny, must be assumed.
A more direct approach is to do complete ancestral domain architecture reconstruction of protein lineages and to search for concrete cases that agree with polyphyletic architecture evolution. There are two conceptually different methodologies for this type of analysis. Either one only considers architecture changes between nodes of a species tree, or one considers any node in a reconstructed gene tree. The advantage of using a species tree is that one avoids the inherent uncertainty of gene trees, but on the other hand, only events that take place between examined species can be observed.
Gough  applied the former species-tree-based methodology to SUPERFAMILY domain architectures and concluded that polyphyletic evolution is rare, occurring in 0.4–4% of architectures. The value depends on methodological details, with the lower bound considered more reliable.
The latter gene-tree-based methodology was applied by Forslund et al. to the Pfam database. Ancestral domain architectures were reconstructed through maximum parsimony of single-domain phylogenies which were overlaid for multi-domain proteins. This strategy yielded a higher figure, ranging between 6% and 12% of architectures depending on dataset and whether or not incompletely annotated proteins were removed. The two different approaches thus give very different results. The detection of polyphyletic evolution is in both frameworks dependent on the data that is used: its quality, coverage, filtering procedures, etc. The studies used different datasets, which makes their results hard to compare. However, given that their domain annotations are more or less comparable, the major difference ought to be the ability of the gene-tree method to detect polyphyly at any point during evolution, even within a single species. It should be noted that domain annotation is by no means complete (only a little less than half of all residues are assigned to a domain), and this is clearly a limiting factor for detecting architecture polyphyly. The numbers may thus be adjusted considerably upwards when domain annotation reaches higher coverage. A later study by Zmasek and Godzik reports still higher rates (25–75%) of polyphyletic evolution of eukaryotic multi-domain architectures, arguing that previous datasets were too small to have the power to reveal this.
Future work will be required to provide more reliable estimates of how common polyphyletic evolution of domain architectures is. Any estimate will depend on the studied protein lineage, the versatility of the domains, and methodological factors. A comprehensive and systematic study using more complex phylogenetic methods than the fairly ad hoc parsimony approach, as well as effective ways to avoid overestimating the frequency of polyphyletic evolution due to incorrect domain assignments or hidden homology between different domain families, may be the way to go. At this point all that can be said is that polyphyletic evolution of domain architectures definitely does happen, but relatively rarely, and that it is more frequent for complex architectures and versatile domains. A detailed case study was made recently of netrin domain-containing proteins, where polyphyletic evolution in metazoa seems well supported; these authors further suggest the term merology for such polyphyletic evolution. A series of papers by Nagy and Patthy et al. [73, 89, 90] further elaborates on challenges faced within this line of research; they report strong confounding influence of gene prediction errors. They further propose the term epaktology for gene similarity resulting from the independent acquisition of the same additional domain by two proteins. The authors suggest such cases inflate both estimates of terminal domain changes and estimates of gene fusion-driven changes in domain architecture. Beyond such changes, whether correctly inferred or not, the authors describe internal domain shuffling as an important mechanism for how domain architecture evolution has occurred.
10 Conclusions
As access to genomic data and to increasing amounts of compute power has grown during the last decade-and-a-half, so has our knowledge of the overall patterns of domain architecture evolution. Still, no study is better than its underlying assumptions, and differences in the representation of data and hypotheses mean that results often cannot be directly compared. Overall, however, the current state of the field appears to support some broad conclusions.
Domain and multi-domain family sizes, as well as numbers of co-occurring domains, all approximately follow power laws, which implies a scale-free hierarchy. This property is associated with many biological systems in a variety of ways. In this context, it appears to reflect how a relatively small number of highly versatile components have been reused again and again in novel combinations to create a large part of the domain and domain architecture repertoire of organisms. Gene duplication is the most important factor to generate multi-domain architectures, and as it outweighs domain recombination, only a small fraction of all possible domain combinations is actually observed. This is probably further modulated by family-specific selective pressure, though more work is required to demonstrate to what extent. Most of the time, all proteins with the same architecture or domain combination stem from a single ancestor where it first arose, but there remains a fraction of cases, particularly with domains that have very many combination partners, where this does not hold.
Most changes to domain architectures occur following a gene duplication and involve the addition of a single domain to either protein terminus. The main exceptions to this occur in repeat regions. Exon shuffling played an important part in animals by introducing a great variety of novel multi-domain architectures, reusing ancient domains as well as domains introduced in the animal lineage.
In this chapter, we have reexamined with the most up-to-date datasets many of the analyses done previously on less data and found that the earlier conclusions still hold true. Even though we are on the brink of amassing vastly more genome and proteome data thanks to the new generation of sequencing technology, there is no reason to believe that this will alter the fundamental observations we can make today on domain architecture evolution. However, it will permit a more fine-grained analysis, and there will also be a greater chance to find rare events, such as independent creation of domain architectures. Furthermore, careful application of more complex models of evolution with and without selection pressure may allow us to determine more closely to what extent the process of domain architecture evolution was shaped by selective constraints.
11 Materials and Methods
Updated statistics were generated from the data in Pfam 30.0. All UniProt proteins in the SwissPfam set for Pfam 30.0 were included. These span 1090 bacteria, 506 eukaryotes, and 94 archaea. All Pfam-A domains regardless of type were included. However, as stretches of repeat domains are highly variable, consecutive subsequences of the same domain were collapsed into a single pseudo-domain, if it was classified as type Motif or Repeat, as in several previous works [50, 60, 66, 82].
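The collapsing of repeat stretches described above could look roughly like the following Python sketch. It assumes an architecture is an ordered list of domain identifiers and that a lookup from identifier to Pfam type is available; the function name, the `domain_type` mapping, and the example identifiers are illustrative only.

```python
from itertools import groupby
from typing import Dict, List, Sequence

def collapse_repeats(
    architecture: Sequence[str],
    domain_type: Dict[str, str],
    collapsible: Sequence[str] = ("Motif", "Repeat"),
) -> List[str]:
    """Collapse consecutive copies of Motif/Repeat-type domains into one entry."""
    collapsed: List[str] = []
    for domain, run in groupby(architecture):
        run_length = sum(1 for _ in run)
        if domain_type.get(domain) in collapsible:
            collapsed.append(domain)                  # one pseudo-domain
        else:
            collapsed.extend([domain] * run_length)   # keep every copy
    return collapsed

# Illustrative type assignments; real values would come from Pfam annotation
types = {"WD40": "Repeat", "Pkinase": "Domain"}
arch = ["Pkinase", "WD40", "WD40", "WD40", "Pkinase", "Pkinase"]
result = collapse_repeats(arch, types)
# result == ["Pkinase", "WD40", "Pkinase", "Pkinase"]
```

Only runs of the same Motif- or Repeat-type domain are merged, so tandem copies of other domain types remain visible in the architecture.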
Domains were ordered within each protein based on their sequence start position. In the few cases of domains being inserted within other domains, this was represented as the outer domain followed by the nested domain, resulting in a linear sequence of domain identifiers. As long regions without domain assignments are likely to represent the presence of as-yet uncharacterized domains, we excluded any protein with unassigned regions longer than 50 amino acids (more than 95% of Pfam-A domains are longer than this). This approach is similar to that taken in previous works [59, 60, 71]. Other studies [50, 72] have instead performed additional, more sensitive domain assignment steps, such as clustering the unassigned regions to identify unknown domains within them.
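A minimal Python sketch of the ordering and filtering steps follows. It assumes domain hits are given as `(identifier, start, end)` tuples with 1-based inclusive coordinates, and that unassigned regions at the protein termini count in the same way as internal ones; both are assumptions of this sketch, as are the function names.

```python
from typing import List, Sequence, Tuple

Hit = Tuple[str, int, int]  # (domain id, start, end), 1-based inclusive

def ordered_architecture(hits: Sequence[Hit]) -> List[str]:
    """Order domains by start position; an enclosing domain precedes a nested one."""
    return [d for d, _, _ in sorted(hits, key=lambda h: (h[1], -h[2]))]

def longest_unassigned(length: int, hits: Sequence[Hit]) -> int:
    """Length of the longest stretch of residues not covered by any domain."""
    longest = 0
    covered_to = 0  # last residue covered so far
    for _, start, end in sorted(hits, key=lambda h: h[1]):
        longest = max(longest, start - covered_to - 1)
        covered_to = max(covered_to, end)
    return max(longest, length - covered_to)

def keep_protein(length: int, hits: Sequence[Hit], max_gap: int = 50) -> bool:
    """Keep a protein only if no unassigned region exceeds max_gap residues."""
    return longest_unassigned(length, hits) <= max_gap

# Domain B is nested inside domain A; C follows after a 19-residue gap
hits = [("C", 120, 200), ("A", 10, 100), ("B", 40, 60)]
```

For a 230-residue protein with these hits, the architecture is A, B, C and the longest unassigned stretch is the 30-residue C-terminal tail, so the protein is kept. The same hits in a 300-residue protein would leave a 100-residue tail and lead to exclusion.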
Pfam domains are sometimes organized in clans, where clanmates are considered homologous. A transition from a domain to another of the same clan is thus less likely to be a result of domain swapping of any kind and more likely to be a result of sequence divergence from the same ancestor. Because of this, we replaced all Pfam domains that are clan members with the corresponding clan.
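The clan substitution amounts to a simple lookup, sketched below in Python. The mapping shown is illustrative and would in practice be taken from Pfam's clan annotation; the function name is hypothetical.

```python
from typing import Dict, List, Sequence

def to_clan_level(architecture: Sequence[str], clan_of: Dict[str, str]) -> List[str]:
    """Replace each domain that belongs to a clan with the clan identifier."""
    return [clan_of.get(domain, domain) for domain in architecture]

# Illustrative mapping: two families assumed to share one clan
clan_of = {"PF00004": "CL0023", "PF00005": "CL0023"}

arch_1 = to_clan_level(["PF00004", "PF99999"], clan_of)
arch_2 = to_clan_level(["PF00005", "PF99999"], clan_of)
# Both become ["CL0023", "PF99999"]; PF99999 is a made-up identifier without a clan
```

After the substitution, two architectures that differ only by clanmates become identical, so a transition between them is no longer counted as an architecture change.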
12 Online Domain Database Resources
A selection of protein domain databases
Automatic clustering of protein domain sequences
Based solely on experimentally determined 3D structures
Meta-database joining together domain assignments from many different sources, as well as some unique domains
Bioinformatic assignment of sequences to CATH domains using hidden Markov models
Meta-database joining together domain assignments from many different sources
Domain families are defined from manually curated multiple alignments and represented using hidden Markov models
Automatically derived domain families from proteins in UniProt
Based solely on experimentally determined 3D structures
Domain families are defined from manually curated multiple alignments and represented using hidden Markov models
Bioinformatic assignment of sequences to SCOP domains using hidden Markov models trained on the sequences of domains in SCOP
Meta-database joining together domain assignments from many different sources, operating on the architecture level for a set of selected genomes
13 Domain Architecture Analysis Software
A selection of online software applying protein domain architecture evolution analysis
Searches for proteins with similar domain architecture
Visualizes domain evolution using trees
Searches for functionally equivalent proteins by scoring domain architecture similarities
Searches Pfam for proteins with specific domain architecture patterns
Homology searching by aligning multiple domains instead of residues
Graphical web tool for analyzing domain architecture evolution using Pfam
Other tools allow different types of analyses, for instance, searching for similar domain architectures or showing taxonomic distributions. Some of the protein domain databases listed in Table 2 include variants of such analyses, while external tools typically offer more specialized functionality. For example, the Pfam website allows searching for domain content, while the Java tool PfamAlyzer allows searching Pfam for particular domain architecture patterns specified with a given domain order and spacing.
The RAMPAGE/RADS tools make use of domain assignments for rapid homology searching. DoMosaics is a software tool that can act as a wrapper for domain annotation tools, allowing detailed visualization and analysis of domain architectures, as does DomArch. The DAAC algorithm explicitly transfers functional annotation to query sequences based on domain architectural similarity to annotated homologs, as does FACT. In the same vein, similarity measures between architectures are available using the WDAC tool and in ADASS. Domain architecture similarity is used for orthology detection in the porthoDom software. The DOGMA tool makes use of domain content data to assess completeness of a proteome or transcriptome.
Which aspects of domain architecture evolution follow from properties of nature’s repertoire of mutational mechanisms, and which follow from selective constraints?
What trends have characterized the evolution of domain architectures in animals?
Discuss approaches to handle limited sampling of species with completely sequenced genomes. How can one draw general conclusions or test the robustness of the results? Apply, e.g., to the observed frequency of domain architectures that have emerged multiple times independently in a given dataset.
Describe the principle of “preferential attachment” for evolving networks. In what protein domain-related contexts does this seem to model the evolutionary process, and what distribution of node degrees does it produce?
What protein properties correlate with domain versatility? Can the versatility of a domain be different in different species (groups) and change over evolutionary time?
What protein domain-related properties differ between prokaryotes and eukaryotes?
1. Chandonia J-M, Fox NK, Brenner SE (2017) SCOPe: manual curation and artifact removal in the structural classification of proteins – extended database. Comput Res Mol Biol 429(3):348–355
28. Bornberg-Bauer E, Huylmans A-K, Sikosek T (2010) How do new proteins arise? Nucl Acids Seq Topol 20(3):390–396
53. Przytycka T, Davis G, Song N, Durand D (2005) Graph theoretical insights into evolution of multidomain proteins. In: Miyano S, Mesirov J, Kasif S, Istrail S, Pevzner PA, Waterman M (eds) Res. Comput. Mol. Biol. 9th Annu. Int. Conf. RECOMB 2005 Camb. MA USA May 14-18 2005 Proc. Springer, Berlin, pp 311–325
61. Parikesit AA, Stadler PF, Prohaska SJ (2017) Large-scale evolutionary patterns of protein domain distributions in eukaryotes. bioRxiv
83. Wiedenhoeft J, Krause R, Eulenstein O (2010) Inferring evolutionary scenarios for protein domain compositions. In: Borodovsky M, Gogarten JP, Przytycka TM, Rajasekaran S (eds) Bioinforma. Res. Appl. 6th Int. Symp. ISBRA 2010 Storrs CT USA May 23-26 2010 Proc. Springer, Berlin, pp 179–190
97. Vera-Parra N, Gutiérrez-Ramirez M, Lopez-Sarmiento D (2016) Automatic construction and graph-making of functional domain architectures. Adv Nat Appl Sci 10(12):99–106
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.