Genomic Evidence for a Simpler Clotting Scheme in Jawless Vertebrates
- Cite this article as:
- Doolittle, R.F., Jiang, Y. & Nand, J. J Mol Evol (2008) 66: 185. doi:10.1007/s00239-008-9074-8
Mammalian blood clotting involves numerous components, most of which are the result of gene duplications that occurred early in vertebrate evolution and after the divergence of protochordates. As such, the genomes of the jawless fish (hagfish and lamprey) offer the best possibility for finding systems that might have a reduced set of the many clotting factors observed in higher vertebrates. The most straightforward way of inventorying these factors may be through whole genome sequencing. In this regard, the NCBI Trace database (http://www.ncbi.nlm.nih.gov/Traces/trace.cgi) for the lamprey (Petromyzon marinus) contains more than 18 million raw DNA sequences determined by whole-genome shotgun methodology. The data are estimated to be about sixfold redundant, indicating that coverage is sufficiently complete to permit judgments about the presence or absence of particular genes. A search for 20 proteins whose sequences were determined prior to the trace database study found all 20. A subsequent search for specified coagulation factors revealed a lamprey system with a smaller number of components than is found in other vertebrates in that factors V and VIII seem to be represented by a single gene, and factor IX, which is ordinarily a cofactor of factor VIII, is not present. Fortuitously, after the completion of the survey of the Trace database, a draft assembly based on the same database was posted. The draft assembly allowed many of the identified Trace fragments to be linked into longer sequences that fully support the conclusion that lampreys have a simpler clotting scheme compared with other vertebrates. The data are also consistent with the hypothesis that a whole-genome duplication or other large scale block duplication occurred after the divergence of jawless fish from other vertebrates and allowed the simultaneous appearance of a second set of two functionally paired proteins in the vertebrate clotting scheme.
Keywords: Blood clotting, Lamprey genome, Gene duplications
Blood coagulation is known to follow a similar scheme in all vertebrates, the culminating event being the thrombin-catalyzed conversion of fibrinogen into fibrin. Interest in how the process evolved to yield the complex system that occurs in mammals has been longstanding, not only because every component seems essential, but also because some of the factors—like factors IX and X—depend on others—like factors VIII and V—for their activity. It has long been appreciated that a series of different gene duplications gave rise to many of the factors, and it was hoped that studies on early diverging vertebrates, especially jawless fish, might reveal a simpler process as it existed in earlier times.
However, it has proved difficult to assess whether the entire constellation of factors leading to thrombin generation in mammals is present in these early-diverging vertebrates. For one thing, demonstrating the presence or absence of a particular factor by biochemical assay in such creatures is handicapped by “species specificity,” which confounds classical measurements that depend on the use of standardized mammalian protein reagents or genetically defective plasmas. Although prothrombin and fibrinogen were purified long ago and the presence of tissue factor in lampreys was demonstrated biochemically (Doolittle et al. 1962; Doolittle and Surgenor 1962), it has not been realistic to attempt the purification of other coagulation factors, several of which are present in only minute amounts in mammalian plasma. Some clotting factors have been cloned from lampreys and hagfish (Strong et al. 1985; Bohonus et al. 1986; Wang et al. 1989; Banfield and MacGillivray 1992; Pan and Doolittle 1992) and several more from later-diverging teleosts like zebrafish (Jagadeeswaran et al. 2000; Sheehan et al. 2001; Hanumanthaiah et al. 2002) and puffer fish (Davidson et al. 2003a). Now, with the advent of whole-genome sequencing (WGS), it has become feasible to reconstruct the ensemble of clotting proteins that occurs in early-diverging vertebrates.
In this regard, recent studies based on the complete genome sequence of the puffer fish, Fugu rubripes, revealed that genes for most known clotting proteins (including all of the vitamin K-dependent factors and the critical cofactor proteins, factors V and VIII) are present (Davidson et al. 2003a, b; Jiang and Doolittle 2003). For the most part, only genes for a few of the more peripheral coagulation factors, like factors XI and XII, were not found (Jiang and Doolittle 2003). The puffer fish is a teleost, however, and it would be of much greater interest if such a study could be conducted on one of the jawless fish (lamprey or hagfish), which diverged 50 million to 100 million years earlier (Carroll 1988) and would be more likely to have a simpler clotting system. None of the principal clotting factors is found in the genome of the protochordate Ciona intestinalis (Jiang and Doolittle 2003).
At present, a complete genome sequence is not yet available for either lamprey or hagfish. However, a “trace database” maintained by the NCBI and EBI includes the lamprey among its numerous holdings. Trace databases are uncurated collections of DNA sequences, mostly determined by random shotgun methods at major sequencing centers around the world. In the case of the lamprey (Petromyzon marinus), the November 2006 collection contained 18,787,613 machine-generated “reads,” or “traces,” most between 300 and 1000 “letters,” amounting to 14,640,144,063 nucleotides (nt). The lamprey genome is estimated to contain between 1.6 billion and 2.2 billion nt (Gregory 2005), suggesting that the average redundancy in the database is about six- or sevenfold. The current study began with a computer search of the Trace database in an effort to identify various coagulation factors.
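The redundancy estimate can be checked with simple arithmetic. The sketch below uses only the figures quoted above; the function and constant names are illustrative, and treating every trace nucleotide as usable sequence is an assumption rather than something the text states.

```python
# Figures quoted in the text for the November 2006 lamprey Trace collection.
TOTAL_READS = 18_787_613
TOTAL_NT = 14_640_144_063
# Estimated genome size range in nucleotides (Gregory 2005, as cited in the text).
GENOME_SIZE_LOW = 1.6e9
GENOME_SIZE_HIGH = 2.2e9


def mean_read_length(total_nt, total_reads):
    """Average number of nucleotides per trace."""
    return total_nt / total_reads


def raw_redundancy(total_nt, genome_size_nt):
    """Raw fold coverage: total sequenced nucleotides divided by genome size."""
    return total_nt / genome_size_nt


print(f"mean read length: {mean_read_length(TOTAL_NT, TOTAL_READS):.0f} nt")
print(f"raw redundancy (larger genome estimate): "
      f"{raw_redundancy(TOTAL_NT, GENOME_SIZE_HIGH):.1f}x")
print(f"raw redundancy (smaller genome estimate): "
      f"{raw_redundancy(TOTAL_NT, GENOME_SIZE_LOW):.1f}x")
```

With the larger genome-size estimate the raw ratio is roughly 6.7, in line with the six- or sevenfold figure; the smaller estimate gives a higher raw ratio. Usable coverage would be lower than either raw value if some trace nucleotides are discarded as low quality, which is not modeled here.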
The degree of coverage notwithstanding, determining whether a gene is present or not in a Trace database is much more challenging than is the case when a complete and assembled genome is at hand. In the case of the puffer fish, for example, data were available in the form of 12,381 scaffolds that had been assembled from the original DNA fragments. The average scaffold size was such that in most cases all the exons of a given gene were present on a single assembly. In the lamprey Trace database, exons or parts of them were scattered over the approximately 18 million entries, minimizing chances of verification by exons from the same gene being linked. However, these collections do contain a small fraction of EST sequences (determined from cDNA), and these are usually longer and often provide overlaps. Additionally, many of the Trace sequences are available as “mate-pairs,” sequences determined from the two different “ends” of inserts in vectors, occasionally permitting estimates about positional neighborliness in the genome.
Recently, after the completion of the original set of searches, a partial assembly of the same lamprey data became available. Obviously, assembly data greatly facilitate the reconstruction of gene sequences, and it was a straightforward matter to match up the initial findings with the partially assembled data. The average size of the “contigs” (or supercontigs) that link together fragments from the original database is about 7000 nt, roughly an order of magnitude longer than the original “traces.” As a result, in several cases entire genes are found on single entities.
The most serious complication for the present study stems from most of the gene products of interest having resulted from gene duplications that took place about the same time as the appearance of vertebrates, aggravating the problem of distinguishing orthologues from highly similar paralogues. In particular, in a previous study (Jiang and Doolittle 2003) we had found that factors V and VIII in the puffer fish are only 42% identical to their human counterparts, and in either species the two factors themselves are 38% identical, indicating that the gene duplication giving rise to these two factors occurred not very long before the divergence of bony fish and the line leading to mammals and, perhaps, after the divergence of the jawless fish.
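Percent-identity figures of the kind quoted above depend on both an alignment and a counting convention. The following is a minimal sketch, assuming two sequences that have already been aligned to equal length and assuming that columns containing a gap are skipped; the function name and the convention are illustrative and are not taken from the software used in the study.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns (one common convention, assumed here)
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Toy aligned fragments (invented for illustration, not real factor V/VIII sequence).
print(round(percent_identity("ACDE-FG", "ACDQKFG"), 1))
```

Other conventions (for example, dividing by the length of the shorter sequence) give somewhat different values, so comparisons are only meaningful when the same convention is used throughout.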
The cases of factors V and VIII are even more problematic because of their being members of the ferroxidase family of proteins, which includes ceruloplasmin and hephaestin, themselves composed of three major domains that are the result of (tandem) duplications (Hellman and Gitlin 2002). In other vertebrates, factors V and VIII are distinguishable from hephaestin and ceruloplasmin in that they have “discoidin” domains at their carboxyl termini (sometimes called “fac5–8 C” domains).
The publicly accessible lamprey Trace Archive data were downloaded from the NCBI web site (http://www.ncbi.nlm.nih.gov/Traces/trace.cgi), after which they were reformatted into a form compatible with previously described software (Doolittle 1987). Each DNA entry was translated into amino acid sequences corresponding to all six frames, which were then treated separately. In addition, BLAST software (Altschul et al. 1997) was downloaded from the NCBI web site; tblastn was used extensively, with the translated sequences being searched against the raw DNA data.
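The six-frame translation step can be sketched as follows. This is an illustrative reimplementation using the standard genetic code, not the previously described in-house software; the function names and the use of "X" for codons containing ambiguous bases are assumptions.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frame_translations("ATGGCCTAA").items():
    print(label, protein)
```

Each of the six translations can then be searched separately, as described above; stop codons are kept as "*" so that open stretches within a frame remain visible.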
Searches for exact matches at least 30 codons in length were conducted on 20 proteins whose sequences had been determined before the inception of genome projects. These searches were conducted both with in-house software and with tblastn. The same dual strategy was employed in the search for coagulation factors. Verification of matches was obtained by using BLAST to search the NCBI nonredundant protein database, both before and after concatenating various translated traces to form longer sequences. Validation of a proper assignment was achieved, or not, by the nature of the top hits in this reverse search. Phylogenetic reconstructions were made by a distance-matrix method (Feng and Doolittle 1996), a parsimony procedure (Doolittle and Feng 1990), and a neighbor-joining method (Saitou and Nei 1987); the software for the latter was downloaded from PHYLIP (Felsenstein 1989). In the case of the distance-based trees, the BLOSUM62 substitution matrix was used (Henikoff and Henikoff 1996). Trees were drawn on the PHYLODENDRON web site, http://www.iubio.bio.indiana.edu/treeapp/treeprint-form.html.
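A search for exact matches of at least 30 codons can be sketched as a seed lookup against each translated frame. This is a simplified illustration and not the in-house software or tblastn; the function name, the return format, and the fixed-length seed strategy are assumptions.

```python
def exact_match_positions(query_protein, translated_frame, min_len=30):
    """Return (query_start, frame_start) pairs where min_len residues match exactly."""
    # Index every window of min_len residues in the known protein.
    seeds = {}
    for i in range(len(query_protein) - min_len + 1):
        seeds.setdefault(query_protein[i:i + min_len], []).append(i)
    # Slide the same window along the translated trace and look it up.
    hits = []
    for j in range(len(translated_frame) - min_len + 1):
        for i in seeds.get(translated_frame[j:j + min_len], ()):
            hits.append((i, j))
    return hits


# Toy data: a 40-residue "known" protein and a frame sharing 33 of its residues.
known = "ACDEFGHIKLMNPQRSTVWY" * 2
frame = "XX" + known[5:38] + "XX"
print(exact_match_positions(known, frame))
```

Consecutive hits along the same diagonal belong to one longer exact match; a frame that shares fewer than 30 consecutive residues with the query yields no hits.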
A part of the Trace archiving strategy involves determining the sequences of both “ends” of every DNA fragment, and in several cases reciprocal matches were made that allowed neighboring exons to be linked. The Trace ID numbers of all mate-pairs were compared with those of all candidates in a particular venue (i.e., factor V–VIII study, for one, and separately for the vitamin K-dependent protease study). Additionally, a perl script was written that allowed translated sequences of identified mate-pairs to be used in conjunction with blastp to learn what kind of domains they might contain. The Trace ID numbers of all fragments matching the various targets are provided in tabular form in the Supplementary Material.
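The mate-pair comparison amounts to asking which candidate traces have a mate that is also a candidate. The sketch below assumes two hypothetical inputs, a mapping from each Trace ID to its mate's ID and a mapping from candidate Trace IDs to hit-group labels; neither format is taken from the Trace Archive or from the perl script mentioned above, and the IDs are invented.

```python
def linked_hit_groups(mate_of, hit_group_of):
    """Return sorted pairs of hit-group labels joined by at least one mate-pair.

    mate_of: dict mapping a Trace ID to the Trace ID of its mate (hypothetical format).
    hit_group_of: dict mapping candidate Trace IDs to hit-group labels.
    """
    links = set()
    for trace_id, group in hit_group_of.items():
        mate = mate_of.get(trace_id)
        # A link exists only if the mate is itself a candidate in another group.
        if mate in hit_group_of and hit_group_of[mate] != group:
            links.add(tuple(sorted((group, hit_group_of[mate]))))
    return sorted(links)


# Invented IDs and labels for illustration only.
mates = {101: 102, 102: 101, 201: 202, 202: 201}
candidates = {101: "G1", 102: "G2", 201: "G3"}
print(linked_hit_groups(mates, candidates))
```

A link of this kind says only that two segments lie on the same cloned insert; their order and orientation still have to be worked out from the sequences themselves.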
The draft assembly data were produced by the Genome Sequencing Center at Washington University School of Medicine in St. Louis and were obtained from ftp://genome.wustl.edu/pub/petromyzon_marinus. The web site notes that the original WGS data were obtained from a specimen provided by M. Bronner-Fraser, which was sequenced to a total 5.9× whole-genome coverage. Because the draft assembly is based on the same Trace collection used in our study, it was a relatively straightforward matter to screen the posted list of “reads.placed” to find the contigs and supercontigs on which our original “hit-groups” are situated. In many cases it was possible to link hit-groups into consecutive strings, and in several instances (especially in the cases of the vitamin K-dependent proteases) entire genes were found to be encompassed on single supercontigs. In some cases, however, the hits are on small supercontigs or were terminal segments in the “wrong” direction with regard to expected neighbors and remain unlinked “orphans.”
Completeness of the Lamprey Trace Database
Proteins whose sequences were known in advance of the genome project and that were used to test for completeness of coverage
Factors V and VIII
When sequences of the three A domains of human factors V and VIII were used as queries in a search of the lamprey Trace database, a large number of “hits” was obtained. Unexpectedly, when subjected to “back-searching” (reciprocal matching) against NCBI data protein databases, the majority of these were more similar to hephaestin and ceruloplasmin than to factor V or VIII. Nonetheless, a long list of candidate “traces” was compiled, and those portions showing any detectable similarity to human factors V and VIII were retrieved.
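The back-searching step can be thought of as assigning each candidate to the family of its best reverse hit. The sketch below assumes the reverse search has already been reduced to a list of (protein name, score) pairs; that input format, the name-to-family table, and the function name are hypothetical and do not reflect actual BLAST output.

```python
# Hypothetical lookup from protein name to family label.
FAMILY_OF = {
    "factor V": "factor V/VIII",
    "factor VIII": "factor V/VIII",
    "hephaestin": "hephaestin/ceruloplasmin",
    "ceruloplasmin": "hephaestin/ceruloplasmin",
}


def assign_family(reverse_hits):
    """Return the family of the top-scoring reverse hit, or None if there are no hits."""
    if not reverse_hits:
        return None
    top_name, _score = max(reverse_hits, key=lambda hit: hit[1])
    return FAMILY_OF.get(top_name, "other")


# Invented scores for illustration only.
print(assign_family([("ceruloplasmin", 88.0), ("factor VIII", 61.5)]))
```

A top hit alone is a coarse criterion when paralogues are this similar, which is why exon sizes, intron positions, and linked domains are also brought to bear in the sections that follow.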
All told, searches of the six A domains (three each from human factors V and VIII) identified approximately 257 “hits” in the lamprey trace database. When these were compared with each other, many were found to contain the same (or virtually the same) inserts, and the number was subsequently reduced to 50 “hit-groups,” consistent with an approximate fivefold redundancy. The range was wide, however: the number of identical inserts in the various groups ranged from 0 to 29.
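Collapsing hits into hit-groups can be sketched as a greedy grouping of near-identical segments. The version below is a deliberate simplification: it compares equal-length translated segments and tolerates a single mismatch, whereas real traces overlap at arbitrary offsets; the threshold and the function names are assumptions.

```python
def same_insert(a, b, max_mismatches=1):
    """True if two equal-length segments differ at no more than max_mismatches positions."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= max_mismatches


def group_hits(hits, max_mismatches=1):
    """Greedily place each hit in the first group whose representative it matches."""
    groups = []
    for seq in hits:
        for group in groups:
            if same_insert(seq, group[0], max_mismatches):
                group.append(seq)
                break
        else:
            groups.append([seq])  # no existing group matched; start a new one
    return groups


# Toy segments for illustration only.
groups = group_hits(["MKVLA", "MKVLA", "MKVLG", "QQRST"])
print(len(groups), [len(g) for g in groups])

# Mean redundancy implied by the counts in the text: 257 hits in 50 hit-groups.
print(round(257 / 50, 2))
```

The ratio of hits to hit-groups is the observed redundancy for this gene family and can be compared with the genome-wide estimate.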
Exon-intron distribution of hephaestin-family genes
In humans, the hephaestin and ceruloplasmin genes each have 18 introns that demarcate 19 exons (Syed et al. 2002; Daimon et al. 1995). Factors V and VIII have 18 and 19 introns in the corresponding regions, respectively (Cripe et al. 1992). The exon sizes for the four human proteins range from 30 to 85 codons in length (Fig. 2). The fact that the majority of pairs of exons in human hephaestin and ceruloplasmin genes have slightly different lengths than occur in factors V and VIII served as a useful, if tentative, property for assigning exons from the lamprey Trace database to one set or the other. It was also helpful that the exon boundaries for the three homologous A domains differ in all four of these proteins.
Although no additional hits were found (i.e., beyond the original 257) when exons from factors V and VIII were searched individually, several more hits were added when the searches were conducted with the individual exons of human hephaestin and ceruloplasmin. Hits found only with hephaestin or ceruloplasmin query sequences have been designated differently (X1–7 in Fig. 2) in order to distinguish them from those found with factors V–VIII. The various “hit-groups” were assigned to either the hephaestin-ceruloplasmin family or the factor V–VIII family on the basis of sequence similarity and exon sizes. In seven cases it was possible to link segments using “mate-pairs” (MP in Fig. 2). In one instance, two exons were found in a single trace, out of frame and separated by a small intron.
An analysis of matching segments makes it clear that, at a minimum, there are four genes in lamprey that belong to the hephaestin-ceruloplasmin-factor V/VIII family, only one of which seems closer to the FV/VIII side of the family (Fig. 2). The question to be decided was: are genes for both factors V and VIII present, or is there only a single progenitor (preduplication) gene? If there is only one gene of the factor V–VIII type, then there would reasonably have to be an “extra” (homologue) gene for either hephaestin or ceruloplasmin. It is known that, at the very least, lamprey blood plasma contains a blue protein with all the properties of ceruloplasmin (J. Gitlin, unpublished data).
Vitamin K-Dependent Proteases
GLA domain inventory
An iterative search for sequences corresponding to GLA domains of human vitamin K-dependent proteases (prothrombin, factors VII, IX, and X, and protein C) and two nonproteases (proteins S and Z) resulted in 68 “hits,” which, after analysis for redundancy, reduced to eight unique sequences (allowing for a few single-amino acid differences). This compares with 16 GLA domains found in the (nonredundant) genome of the puffer fish (Jiang and Doolittle 2003) and 4 in the sea squirt (Jiang and Doolittle 2003; Kulman et al. 2006). If it is presumed that the lamprey genome coverage is sufficiently complete at this point, and that all related sequences have been identified, then the eight GLA sequences place an upper bound on the number of vitamin K-dependent proteins in the lamprey.
In mammals, GLA domains are also associated with proteins not involved in blood clotting, including bone proteins (Price et al. 1976), a cell growth potentiating factor (Manfioletti et al. 1993), and certain proline-rich proteins (Kulman et al. 1997). The GLA sequences involved with bone are quite dissimilar and may not have been picked up by our searches. In any event, the jawless fish do not have calcified bone and probably do not have the GLA-containing proteins associated with bone in other vertebrates. A search of the Trace database with mammalian Matrix GLA Protein and Bone GLA protein sequences (Laize et al. 2005) did not turn up any credible hits. In passing, it can be noted that in the puffer fish, only 10 of the 16 GLA domains are associated with the relevant proteases. Of the remaining six, one each occurs in proteins S and Z and the remaining four with proteins not involved in clotting.
Kringle and EGF domains
Prothrombin is unique among the vitamin K-dependent proteases in that it has two kringle domains following the GLA domain; the other proteases in this group have two EGF domains at that location (Fig. 1). It was possible to identify the first prothrombin kringle by a mate-pair trace. A second was assigned initially on the basis of similarity to kringles in human and hagfish prothrombins. The overabundance of EGF domains in the Trace database precluded making unequivocal assignments merely on the basis of resemblance to human orthologues. However, EST sequences (from cDNA) for two related gene products resembling factor X included two EGF domains. It was also possible to link another EGF domain with a putative factor VII by a mate-pair (Fig. 5). The remaining EGF domains were assigned initially on the basis of scoring matches with human or puffer fish counterparts, whichever gave the higher score (Fig. 5).
Serine protease domains
Several fragments for prothrombin, including a relatively long EST fragment, were immediately put in place by exact matching with a sequence previously determined with lamprey cDNA (Pan 1992). Some others were aided by resemblances to a published cDNA sequence for prothrombin from hagfish (Banfield and MacGillivray 1992). EST sequences were also helpful in linking peripheral domains to the main bodies of two putative factor X sequences. Coincidentally, these two entities had been cloned (cDNA) more than 10 years ago in the laboratory of S. Sommer (personal communication of unpublished material); the cDNA sequences are virtually identical to the sequences reconstructed from the Trace fragments. The strong resemblance between these two putative factor X sequences (>60% identical) implies a duplication that occurred well after the divergence of lampreys from other vertebrates. One of the gene products, denoted factor XB in Fig. 5, lacks a key residue in the activation sequence and ought not to be active as a protease.
Other Coagulation Factors
Searches were conducted for two other vitamin K-dependent factors that are not proteases. A few marginal matches were found for portions of protein S, which in other vertebrates is composed of two laminin domains and four EGF domains besides its terminal GLA domain. The large number of EGF domain sequences in the trace database made positive identification difficult. As for the GLA domain, the sequence in human protein S is most similar to GLA hit-group 6, the only one of the eight GLA hit-groups not present in the assembly (Fig. 7). The case for protein S in the lamprey genome remains equivocal.
Nor was it possible to identify any reasonable candidates for protein Z, a fast-changing vitamin K-dependent factor that contains a nonfunctional protease portion (Ichinose et al. 1990). In contrast, strong matches for plasminogen (Trace ID 1184095239) and tissue plasminogen activator (Trace ID 1483701246) were inadvertently uncovered during the searches for vitamin K-dependent proteases.
We were unable to demonstrate the presence of a tissue factor (TF) sequence—another fast-changing entity—in either the trace database or the assembly, even though tissue factor has been demonstrated biochemically in lamprey tissues (Doolittle and Surgenor 1962). However, one of the three repeats that occurs in tissue factor inhibitor (TFI) was initially found in the Trace database (Trace ID 1470228595). The same sequence is found in the assembly along with the other two repeats on supercontig-12432. The sequence is about 45% identical to human TFI, which is actually a greater similarity than was found for the puffer fish-human comparison (Jiang and Doolittle 2003).
Positive identifications were also made for thrombomodulin, initially in the Trace database and then, more convincingly, in the draft assembly. In other vertebrates (puffer fish and mammals) thrombomodulin has an intron-free sequence, but it was unlikely that the full-length sequence (575 codons in the human version) would appear in a single sheared shotgun fragment in the Trace database. Nonetheless, one trace corresponded to a homologous amino-terminal lectin domain (Trace ID 1377464678) and two others encompassed identical strings of five EGF domains (Trace IDs 1382345176 and 1206195865). The full sequence, amounting to 700 residues, was found on supercontig-5373, with no introns, and included an appropriately positioned putative membrane-spanning segment. The sequence is only 30% identical to human thrombomodulin, not unreasonable considering that puffer fish thrombomodulin is only 37% identical to the human protein. A second lamprey thrombomodulin was found on supercontig-25172. In this case, the sequence corresponding to the amino-terminal portion occurs in a nonsequenced region, and only 494 codons are present.
The presence of genes in the lamprey genome that encode clotting factors was anticipated, a number of them having been isolated and/or cloned in the past. The basic events involving the thrombin-catalyzed conversion of fibrinogen to fibrin and its subsequent cross-linking and lysis were not an issue. The detailed pathway of thrombin generation was still undetermined, however, and whether or not all of the vitamin K-dependent factors are present in the jawless fish was not known.
We are now proposing that the lamprey has a reduced set of clotting factors, corresponding to what would have been in place before the duplication of two different kinds of protein, the vitamin K-dependent proteases, for one, and the factor V-VIII family, for the other. It is not yet possible to reconstruct the full sequence for the putative preduplication factor 5/8 gene. Five of the original 13 hit-groups were not used in the assembly, 2 were found linked to hephaestin-ceruloplasmin sequences, and 4 more are on orphan contigs. However, one (the very one that had been placed at the carboxyl-terminal end) was found to be linked to a discoidin domain, a characteristic feature of factors V and VIII (Fig. 3). Importantly, none of the other hit-groups found in the assembly occur on contigs encoding discoidin domains.
The evidence for there being only four genes in the hephaestin-ceruloplasmin-factor V–VIII family is based on there never being more than four different sequences for any set of aligned segments (Figs. 2 and 3). That only one of these is related to factors V and VIII is based on the majority of fragment sequences being more similar to hephaestin and ceruloplasmin and on intron locations coinciding with those of hephaestin and ceruloplasmin in three of the sequences and with factors V and VIII in the fourth (Fig. 4). Additionally, the region occupied by a long stretch of low-complexity sequence in all known factors V and VIII is definitely not present in three of the reconstructed sequences. Finally, only one of the four reconstructed sequences was found to be associated with a discoidin sequence.
Several kinds of evidence speak for the absence of the vitamin K-dependent protease factor IX. First, and most important, none of the candidate sequences gave factor IX as a top hit when subjected to reverse searching. Moreover, five of the six unique segments corresponding to the region in vitamin K-dependent proteases where cleavage activation occurs have a brace of cysteines that is universally observed in factors VII and X but has not been found in factor IX in other vertebrates. On a more circumstantial level, of the four gene duplications that have previously been postulated as leading to the five vitamin K-dependent proteases, there is general agreement that the one giving rise to factors IX and X is the most recent (Doolittle and Feng 1987; Doolittle 1993; Davidson et al. 2003a, b; Jiang and Doolittle 2003). In that same vein, the sequence similarities of factors IX and X are only slightly less than the resemblances observed between puffer fish and human sequences for those two factors (Jiang and Doolittle 2003), again suggesting that the duplication event occurred not long before the appearance of teleost fish. Although it is more difficult to prove the absence of a gene than its presence in a sequence database, we would cautiously propose that a gene corresponding to factor IX is not present in the lamprey genome.
The Matter of Whole-Genome Duplications
Although the notion that the vertebrate blood coagulation pathway is the result of a series of gene duplications is longstanding (Doolittle 1961; Doolittle and Feng 1987), more recently it has been suggested that the gene duplications responsible for some clotting factors may be linked to whole-genome duplication events (Davidson et al. 2003a, b). The possibility of polyploidy leading to whole-genome duplications was first raised by Ohno (1970), and it has been much discussed and hotly debated ever since (Sidow 1996; Smith et al. 1999; Hughes et al. 2001; Panopoulou et al. 2003; Dehal and Boore 2005; inter alia). Examples of preduplication genes in lamprey or hagfish (i.e., genes that are duplicated in other vertebrates) are well known, beginning with the observation that lampreys have a single-chained hemoglobin (Ingram 1963). More recent reports include high-mobility-group proteins (Sharman et al. 1997) and thyroid hormone and retinoid X receptors (Escriva et al. 2002).
In summary, the genomic picture presented here suggests that lampreys have a simpler clotting scheme than later diverging vertebrates. In particular, they appear to lack the equivalents of factors VIII (or V) and IX, suggesting that the gene duplications leading to these coagulation factors, synchronous or not, occurred after their divergence from other vertebrates.
In the end, the draft assembly of the lamprey genome, taken together with other sequences found in the Trace database, supports the suggestion that the lamprey genome lacks the separate equivalents of factors V and VIII (i.e., it has a preduplication gene) and a factor IX (i.e., it has a preduplication gene corresponding to factor X). A genuine, fully assembled lamprey genome should settle the matter unequivocally. Meanwhile, given the specific nature of our proposal, it may be possible to conduct biochemical experiments on lamprey blood that could address the question directly.
The search of the factor V B domain actually detected a large number of almost-perfect tandem 27-nt repeats in the lamprey Trace database. Although both human and mouse factor V have 30 imperfect copies of these tandem nine-amino acid repeats, none occurs in the factor V sequences of chicken or puffer fish; this similarity, which may well be due to chance, does not bear directly on the problem at hand.
One of the Trace sequences (G52 in Fig. 2; Trace ID 1446326143) exhibited a remarkable 77% identity to a 34-residue segment of human factor VIII (26 identities among the 34 residues) and may be a contaminant. The corresponding region from puffer fish is only 44% identical to the human segment. The same region from chicken is coincidentally 77% identical to the human sequence, but the eight differences are not the same as observed in the lamprey sequence.
This work was supported in part by National Institutes of Health Grant HL-81553. We thank Steve Culbertson for assistance in identifying mate-pairs. We are also grateful to Jonathan Gitlin (Washington University, St. Louis) for sharing his unpublished findings on lamprey ceruloplasmin and to Steve Sommer (City of Hope) for unpublished cDNA sequences of two vitamin K-dependent proteases from lamprey.