Ancient Phylogenetic Beginnings of Immunoglobulin Hypermutation
Many structures and molecules closely related to those involved in the specific process of immunoglobulin (Ig) hypermutation existed before the appearance of primordial Ig genes. Consequently, these structures can be found even in animals and organisms distinct from vertebrates; likewise, homologues of hypermutation enzymes are present in a broad range of species, from bacteria to mammals. Our analysis, based predominantly on primary structure, demonstrates the existence of molecules similar to Ig domains, variable Ig domains (IGv), and antigen receptors (AR) in unicellular organisms, nonvertebrate metazoans, and nonvertebrate Coelomata, respectively. In addition, we deal here with some important structural properties of CDR1-like segments of the selected sponge adhesion molecule GCSAMS exhibiting chimerical Ig domain similarities, and demonstrate the occurrence of conserved regions corresponding to Ohno’s modern intact primordial building block in the C-terminal part of IGv-related segments of nonvertebrate origin. The results of our analysis are also discussed with respect to the possible phylogeny of molecules preceding the hypothetical common antigen receptor ancestor.
Keywords: BLAST; CDR1; Domain similarity; Geodia cydonium; Hypermutation; Immunoglobulin; Primordial building block; Protein kinase substrate; Template
Immunoglobulin (Ig) hypermutation is a selective, site-specific, and extremely frequent type of somatic mutation occurring almost exclusively in B cells (Gordon et al. 2003). This process is related to hypermutation of T-cell receptors in T cells (Oprea and Kepler 1999) and participates, among other processes, in the maturation of human antibodies. It also initiates isotype switch of human Ig genes (Banerjee et al. 2002; Poltoratsky et al. 2004) and gene conversion of corresponding genes of chicken origin (Di Noia and Neuberger 2004).
Two highly conserved families, i.e., the APOBEC family and the Y family of DNA polymerases (Wedekind et al. 2003; Yang 2005), include important mutator enzymes involved in Ig hypermutation. The members of these families can usually be characterized by RPS BLAST similarities with sequences of three conserved domains (cd01283 and the closely related pair, pfam00817 and COG0389) which otherwise determine superior superfamily relationships in a broad range of organisms, from bacteria (Brotcorne-Lannoye et al. 1986; Yang et al. 1992; Reuven et al. 1999; Bjedov et al. 2003) to vertebrates. Activation-induced cytidine deaminase (AID), a member of the APOBEC family (indicated by cd01283), represents a proposed main DNA mutator of Ig genes (Muramatsu et al. 2000; Petersen-Mahrt et al. 2002; Beale et al. 2004). Surprisingly, this enzyme or its familiar homologues have not yet been detected in model Urochordata genomes of genus Ciona (Azumi et al. 2004; also in accordance with our recent BLAST searches including associated domain projections). AID can also trigger accompanying Ig hypermutation effects of error-prone enzymes of the Y family of DNA polymerases (Poltoratsky et al. 2004), mentioned above. Three members of this Y family of DNA polymerases (i.e., catalytic subunits of DNA polymerases η and ι and possible recognition subunit Rev1 of polymerase ζ complex) were found to participate in the corresponding error-prone replication (Faili et al. 2002; Simpson and Sale 2003; Zeng et al. 2004). The subunits of mutator enzymes involved in Ig hypermutation can also be important for some steps of carcinogenesis (Simhadri et al. 2002; Diaz et al. 2003; Okazaki et al. 2003; Washington et al. 2003; Faili et al. 2004; Lossos et al. 2004; Malpeli et al. 2004) and possibly also in recombination (Bradshaw et al. 2002), both concerning genes different from Ig ones.
In agreement with these less specific reactions and preceding familial/superfamilial relationships, the enzymes involved in Ig hypermutation possibly evolved from ancestors of different function. On the other hand, the ancestor’s relationship to more frequently hypermutating shorter segments such as CDR1 and CDR2 remains unclear.
In contrast to the well-defined conserved domain relationship of hypermutation enzymes described above, very little is known about the possible phylogenetic relationship to their DNA “macrosubstrates.” Corresponding hypermutating DNA regions (HDR) of about 1.1-kb length (Rada et al. 1997) include not only exon sequences of Ig genes but also intron ones involved in class switch (Muramatsu et al. 2000). Though members of the immunoglobulin superfamily were most frequently detected in similar HDR, such regions also contain segments encoding members of different protein superfamilies (Morelli et al. 2002; Gordon et al. 2003). Such HDR were also found in DNA extracted from tumor cells of various cell lineages different from B cells, i.e., cells of melanoma, mesothelioma, and carcinoma origin (Morelli et al. 2002). Recent data suggest that AID, which is predominantly instrumental in the Ig-related hypermutation of short tetranucleotides (“microsubstrates”) of general motif structure DGYW/WRCH (Rogozin and Diaz 2004), binds to “G-loops” within transcribed S regions (Duquette et al. 2005). Consequently, the last interaction may restrict the occurrences of at least some macrosubstrate HDR.
In spite of the scant knowledge about the evolution of HDR, the evolution of Ig superfamily members and Ig domains (most frequently associated with HDR) has been investigated for a long time (Williams and Barclay 1988; Halaby and Mornon 1998; Gordon et al. 2003). This also holds for the evolution of Ig genes. We may distinguish four eras in the evolution of Ig genes important also for their hypermutation phylogeny: (i) prehistory (more than 1000 to 500 million years ago [MYA]) (Marchalonis et al. 2002; Du Pasquier et al. 2004), (ii) beginnings of evolution of primordial antigen receptors (AR) (Litman et al. 1999; Lee et al. 2002; Suzuki et al. 2004) (we consider a period of at least 500–480 MYA), (iii) “big bang” period in the evolution of Ig germline gene reshuffling and rearrangement (Bernstein et al. 1996a; Marchalonis and Schluter 1998; Schatz 2004) (480–450 MYA), (iv) period of gradual evolution of these processes (Sitnikova and Su 1998) (450 MYA–present) and appearance of hypermutation-dependent isotype switch (Zarrin et al. 2004).
The data presented here concern the “prehistory” period of Ig gene evolution. Like several other studies, this paper includes the sequence study of molecules closely related to the immediate phylogenetic (“prehistorical”) precursors of the hypothetical common ancestor of antigen receptors (PPCAR) possibly encoded by rearranging genes (Litman et al. 1999; Du Pasquier et al. 2004; Pancer et al. 2004; van den Berg et al. 2004). Such PPCAR perhaps participated in recognition of viruses (Du Pasquier et al. 2004). Nevertheless, a portion of this paper reports a sequence study of phylogenetically more distant nonvertebrate molecules containing conserved domain similarities to variable Ig domains (IGv) or otherwise markedly similar to AR. This means that we also look for molecules exhibiting a possible structural relationship to earlier PPCAR variants of considered involvement in cell adhesion or clearance processes (Tilson and Rzhetsky 2000; Marchalonis et al. 2002). In consequence of this and our previous studies (Kubrycht et al. 2002, 2004), we also demonstrate the existence of two interesting regions of different Ig domain location in selected molecules. In addition, some examples of bacterial and even archaeal Ig or Ig-like domains are also mentioned.
Materials and Methods
Programs, Algorithms, and Formulas
General remarks on sequence alignments, limits, and standardization of database searches
Restriction of the segment of oligonucleotide dissimilarities between the investigated sequence and the compared sequence set
The segment of major occurrence of oligonucleotide dissimilarities (DIF1) was restricted by the occurrence of DIF1OL oligonucleotides, using the algorithm: DIF1OL = (h(N;0) OR m(N;0)) AND hm(N + 1;0), where m, h, and hm indicate the origin of compared sets, i.e., human, mouse, and both mRNA sequences, respectively; N is the minimal length of oligonucleotide identities found by (standard) BLASTN; and 0 denotes oligonucleotides without the corresponding oligonucleotide identity. In our approach, the segments extended by at least half of N and N + 1 length overlaps are correlated with DIF1, which usually determines either a single or a pair of N and N + 1 related envelope regions (ER).
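The DIF1OL rule can be read as a Boolean filter over start positions in the investigated sequence. The following Python sketch is an illustration only: it assumes that "identity" means an exact substring match against a list of reference mRNA sequences (the searches described here used BLASTN), and that the N and N + 1 oligonucleotides are taken from the same start position. The names `has_identity` and `dif1ol_positions` are hypothetical.

```python
def has_identity(oligo, references):
    """True if the oligonucleotide occurs exactly in any reference sequence."""
    return any(oligo in ref for ref in references)


def dif1ol_positions(query, human, mouse, n):
    """Start positions satisfying (h(N;0) OR m(N;0)) AND hm(N+1;0).

    human and mouse are lists of mRNA sequences (strings).
    Assumption: exact matching stands in for a BLASTN identity search.
    """
    hits = []
    # Stop early enough that the N+1 window still fits in the query.
    for i in range(len(query) - n):
        short = query[i:i + n]
        longer = query[i:i + n + 1]
        h0 = not has_identity(short, human)            # h(N;0)
        m0 = not has_identity(short, mouse)            # m(N;0)
        hm0 = not has_identity(longer, human + mouse)  # hm(N+1;0)
        if (h0 or m0) and hm0:
            hits.append(i)
    return hits
```

A contiguous run of returned positions would then correspond to a candidate DIF1 segment under these assumptions.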
Amino acids (aa) of substantially different column occurrences
Let X and Z be any different aa occurring in the same sequence block column more frequently than follows from considered random probabilities where at least one of them is less frequent than the usual or reevaluated high occurrence aa (for details see WP 3.3 and 3.4.4).
Assumption 1: Let column occurrence of X determine more than one model random event (i.e., one model aa) higher model (reference) twin (sequence block) column (MTC; for details see WP5.4) than the Z occurrence does. This means that LEX − LEZ > 1, where LEX, LEZ are length equivalents of X and Z in a given column, respectively (see also WP3.3 and Kubrycht et al. 2002).
Assumption 2: Similarly, let the ratio of X and Z probabilities (relative X probability) determine a probability lower than random occurrence of X, i.e., the corresponding MTC is higher than one in accordance with the formula: log(cX/cZ) : log(aX) > 1, where cX, cZ are the given aa probabilities of X and Z in a given column, and aX is the considered aa probability (for topical values see Kubrycht et al. 2002).
Definitions: Provided that the same X and Z are in agreement with both assumptions, we say that X is an aa of superior occurrence (SO aa), whereas Z is an aa of inferior occurrence. In the cases of other relationships between X and Z, both X and Z are named residual alternative aa (RA aa). For the purpose of these definitions see Formation of Hybrid Templates, below.
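The two assumptions and the resulting definitions can be summarized as a small decision function. This is a minimal sketch under stated assumptions: the colon in the formula of Assumption 2 is read as division, the length equivalents LEX and LEZ are taken as precomputed inputs (their derivation is given in WP3.3 and is not reproduced here), and the function name and return labels are hypothetical.

```python
import math


def classify_pair(le_x, le_z, c_x, c_z, a_x):
    """Classify amino acids X and Z of one alignment column.

    le_x, le_z : length equivalents of X and Z (assumed precomputed)
    c_x, c_z   : column probabilities of X and Z
    a_x        : considered (background) probability of X, 0 < a_x < 1
    Returns a (label_for_X, label_for_Z) tuple.
    """
    assumption1 = (le_x - le_z) > 1
    # ':' in the source formula is interpreted as division (assumption).
    assumption2 = math.log(c_x / c_z) / math.log(a_x) > 1
    if assumption1 and assumption2:
        return ("SO", "inferior")
    return ("RA", "RA")
```

Because only the ratio of two logarithms enters Assumption 2, the choice of logarithm base does not affect the result.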
Preselection and Selection Procedures Involved in the Restriction of Displayed Sequence Sets
Table 1. NCBI conserved domain search for Ig and Ig-like domains in Archaea and Bacteria
Conserved domain similarities
hyp P AF2119
hyp P MM1983
bac SPC IGL
bac IGL P
Table 2. Similarities of conserved variable Ig domains in nonvertebrate Eukaryota
Conserved domain similarities
IG4, IGL, IGcam, ig, IGv4, IG9, IGc2, IGv9
Second Ig domain
IG4, IGL, IGcam, IGv4, ig, IG9, IGc2, IGv9
Second Ig domain
IGv4, IG4, IGcam, IGL
First Ig domain
IG4, IGv4, IGcam, IGL
IG4, IGv4, IGcam, IGL, IG9, ig
TyrKc, Pkinase, STKc, SPS1
IG4, IGv4, IGcam, IGL
IG4, IGL, IGv9
IGL, IG4, IGcam
IG4, IGcam, IGL, IGv4, ig
IGcam, IGc2, IG4, IGL, ig
IG4, IGL, IGcam, IGv9, IGc2
IG4, IGL, IGcam, IGc2, IG9, ig
IGcam, IGL, IG4, IGc2, IG9, ig, IGv4
OLF, IG4, collagen, IGL, IGcam, IGc2
IGL, IG4, IGv4, ig, IG9, IGcam
IG4, IGL, IGcam, IGc2, IGv9
IGL, IG4, IGcam, IG9, IGv4, IGc2
IG4, IGL, IGcam, IGv9, IGc2
IG4, IGL, IGcam, IG9, IGc2, ig
IG4, IGcam, IGL, IGc2, IGv9
Second Ig domain
IG4, IGL, IGc2, IGcam, IG9, ig, IGv4, IGv9
First Ig domain
IG4, IGL, IGv4
IGc2, IG9, IGL, IGcam
IGL, IG4, IGcam, IGc2, IG9, ig, IGv9
IGL, IGv4, IG4
IG4, IGc2, IGcam, IGL, ig, FN3
IGcam, IGL, IG4, IG9, IGc2, ig, IGv4
IGcam, IG, IGL, IGc2, FN3, fn3, ig
IG4, IGcam, IGL, IGv9, IGc2
IGcam, IG4, IGc2, IGL, ig, IG9
IGL, IGcam, IG4, IGc2, IG9, ig, IGv9
TyrKc, Pkinase, STKc, SPS1, KR, Fz
IGv4, IG4, IGv9
CBM_14, ChtBD2, IGL, IG4
IGv4, IGL, IG4, ig, IGv9, IG9
Knowledge-based approach complementary to the results of the searches described in the preceding section
Table 3. Some antigen receptor-related proteins in nonvertebrate Coelomata
Dominant domain similarities
IGL*, IGcam, FN3*
N4, N9, T
IGcam*, IG4*, FN3*, IGc2
N9, S, T
274 (sim NCAM2)
PTPc*, FN3*, IGcam*
IGcam*, IG4, FN3*, IGL, fn3
IW, N4, S, T, Vp
100 (L1 CAM)
TyrKc, KR, IGL, Fz
TyrKc, IGcam, IGL*
CBM_14, IGv4, IGL
N4, N9, T, Vp
IW, S, T, Vp
PSI-BLAST procedure further selecting IGv-related sequences
The protocol was in agreement with the recommended prealignment strategy (Simossis et al. 2005). In accordance with Table 2 and in coincidence with critical domain similarities of molecules displayed in Table 3 (except for COS2.1), the sequence of the domain smart00409 was determined as the best possible query sequence for both multistep PSI-BLAST iteration procedures. In addition, we required collocation of two types of sequence similarities with selected segments: (i) conserved IGv (domain) similarities in the case of molecules described in Table 2 and (ii) BLASTP similarities with IGv in other cases. To avoid structural redundancy, only GCSAMS (model molecule in Traces of Hypermutation Milieu in the Selected IGv-Related Sequences of a Marine Sponge [Results]) and RTK (like GCSAML and GCSAMS, RTK is also upregulated during allograft fusion [Blumbach et al. 1999; Schutze et al. 2001]) represented the sponge adhesion molecules displayed in Table 2 after the first PSI-BLAST iteration.
Restriction of conserved region of IGv-related segments derived by the procedure described in “PSI-BLAST Procedure Further Selecting IGv-Related Sequences,” above
Two different multiple sequence alignments representing two independent methods (i.e., maximum likelihood [Muscle 2.01] and neighbor joining [CLUSTAL W 1.82]) were employed. To better restrict short segments of high similarity in blocks generated by multiple sequence alignment, we looked for the region of minimal presence of gaps, where common aa or aa of superior occurrence can also be found.
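The restriction step described above can be illustrated with a short Python sketch. It assumes that the alignment is given as equal-length strings with "-" as the gap symbol, takes the longest run of gap-free columns as the "region of minimal presence of gaps," and reports only fully identical columns as common aa (aa of superior occurrence are not handled). Both function names are hypothetical.

```python
def longest_gap_free_region(alignment):
    """Return (start, end) column indices, end exclusive, of the longest
    run of columns in which no aligned sequence has a gap."""
    length = len(alignment[0])
    best = (0, 0)
    start = None
    # Iterate one column past the end so a trailing run is closed too.
    for col in range(length + 1):
        gap_free = col < length and all(row[col] != "-" for row in alignment)
        if gap_free and start is None:
            start = col
        elif not gap_free and start is not None:
            if col - start > best[1] - best[0]:
                best = (start, col)
            start = None
    return best


def common_columns(alignment, start, end):
    """Columns within [start, end) where all sequences share one residue."""
    return [c for c in range(start, end)
            if len({row[c] for row in alignment}) == 1]
```

For a toy alignment such as `["AC-DEYFC", "A-GDKYLC", "ACGDEYFC"]`, the gap-free region spans the last five columns, and three of them carry a common residue.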
Important Sequences and Employed Templates
Sequences of conserved Ig domains as PSI-BLAST query sequences
The Ig domain regions containing all the segments exhibiting RPS BLAST-derived similarities with the sequences considered for upload to PSI-BLAST were determined. These regions were employed as query sequences instead of the whole domain sequences (simplified in the following text).
Reference sequences (RfS) used in the combined search related to Table 3
In accordance with literature data, both nonrepeated (unique) available sequences of TCRL and VpBLP protein sequences were included in our RfS set. IgW and NITR representatives were selected by PSI-BLAST searches. Two IgW segments of the highest scores of similarity to both different IGv (smart00406 and cd00099) were put in these searches as query sequences. Since a similar PSI-BLAST search was not successful in the case of the SIRP family, the SIRP molecule of the best position in all individual BLASTP searches with accessible IgW query sequences represented a given family as an RfS. Complete protein sequences determined by preceding segment similarities were used as RfS in our combined search (Table 3). For a more detailed overview related to our choice of RfS see WP2.3.
Formation of hybrid templates
Table 4. Conserved regions in the selected nonvertebrate IGv-related segments
Clustal W (1.82) multiple sequence alignment
Common aa (., : ,*)
. : * * *
Muscle (2.01) multiple sequence alignment
Common aa (*)
* * *
Ig-like, IG, and IGv Domain Similarities: On-line Screening of Ig Domains from Unicellular Organisms to Nonvertebrate Craniata
Possible occurrence of Ig-like segments in Archaea proteins
Conserved similarities to typical bacterial Ig-like domains (Big_1, BID_1) can be found in predicted protein sequences of Archaea origin (Table 1). Archaeal BID_1/Big_1 similarities in molecules displayed in Table 1 are accompanied by conserved similarities with polycystic kidney disease domains (PKD), which are very frequent in archaeal surface proteins (also in agreement with three-dimensional [3D] studies of archaeal surface layer proteins [Jing et al. 2002]). Interestingly, weak cross-similarities between segments similar to bacterial Ig-like domains and given PKD can be observed when comparing MM1983 (Table 1; Deppenmeier et al. 2002) with the extensive set of intimin (Table 1; see also the following paragraph) sequences using BLASTP.
Ig-like and Ig domains in the sequences of proteins from additional unicellular organisms
The majority of proteins containing the typical bacterial Ig-like domains described in Table 1 are cell surface proteins, which mediate specific interactions. Two of these molecules were extensively investigated. Different alleles and isoforms of BID_1/Big_1- and BID_2/Big_2-related proteins of the intimin family from E. coli interact with the enterocyte cytoplasmic membrane of various mammals. Some isoforms or alleles of the intimin family participate in pathogenetic processes caused by virulent E. coli strains such as enteritis and hemorrhagic enterocolitis (China et al. 1999; Zhang et al. 2002). Invasins having the same Ig-like domain similarity and coming from Yersinia pathogens are required for efficient invasive translocation of corresponding bacterial cells through intestinal epithelium to Peyer’s patches (Dersch and Isberg 2000). In addition to typical bacterial Ig-like domains, conserved metazoan ones, e.g., Ig, IgC2, IGcam, and IgL, were also detected in the sequences of bacterial proteins (last part of Table 1) but not in our screening of protozoan and yeast proteins. Despite this result, other different structural relationships to Ig domains or Ig superfamily were described in several papers dealing with unicellular Eukaryota (Wojciechowicz et al. 1993; Chiang et al. 2001, 2002; Sheppard et al. 2004).
IGv-related molecules in metazoans
The results of our search and comparison of nonvertebrate chimerical IGv-related conserved Ig domain similarities (CICIDS) are shown in Table 2. Selected fruit fly protein GC2198 and molecule VCBP5 of Branchiostoma floridae origin are secreted proteins (Cannon et al. 2002; Vogel et al. 2003), and isoforms of turtle proteins are secreted or transmembrane proteins, whereas the membrane relationships of the anopheles molecule and fruit fly protein CG6867 are as yet unknown (Table 2). All the other molecules listed in Table 2 are (or are predicted to be) cell surface proteins (Blumbach et al. 1999; Teichmann and Chothia 2000; Cannon et al. 2002; Sato et al. 2003; Vogel et al. 2003). In contrast to indicated CICIDS and to the close phylogenetic relationship to Igs and T-cell receptors (Sato et al. 2003), five hydrophobic (potentially transmembrane) segments were detected in the sequence of VDB (last item in Table 2).
The bit score of similarities between IGv smart00406 and VDB (53.1 bits) or the minimum bit score (54.6 bits) related to NCBI protein bank accessible IgW (molecules close to primordial Igs [Bernstein et al. 1996b]) were distinct from other bit score values (interval, 33.4–43.8 bits) of displayed CICIDS. The two-tailed test determines significant differences between the highest IGv (smart00406)-related score of VDB and the scores of other IGv-related similarities (p < 0.001) or the other smart00406 ones (p < 0.01) displayed in Table 2.
CICIDS of improved linkage to IGv were selected based on the reevaluation of RPS BLAST results, and using PSI-BLAST iteration on the set of molecules described in Table 2. Only RPS BLAST derived IGv similarities of bit score limited by 40 bits or double-domain IGv similarities were passed to the sequence subset of improved RPS BLAST relationship. On the other hand, successful overlapping between CICIDS and the segments selected in PSI-BLAST procedures (generated by corresponding IGv sequence query) was required in our latter procedure (selected CICIDS are denoted by asterisks in Table 2). The final double-selected subset of CICIDS then contained only 9 of 26 original CICIDS in Table 2, i.e., CICIDS of CG14162d2, VDB, 1 representative isoform of turtle proteins, and 6 CICIDS of four sponge adhesion molecules of Geodia cydonium origin displayed in Table 2. In addition to this double-selection result, GCSAMSd1, GCSAMLd1, and a proposed single Ig domain of VDB were passed through the bit score limit, exhibited double domain similarity, and were selected in both IGv-related PSI-BLAST procedures.
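The double selection described above combines an RPS BLAST criterion with a PSI-BLAST overlap criterion. The sketch below illustrates that logic only; it assumes that "limited by 40 bits" denotes a lower bound of 40 bits, that overlap means sharing at least one residue position, and that each candidate is described by a hypothetical `Similarity` record. None of the field names come from the original analysis.

```python
from dataclasses import dataclass, field


@dataclass
class Similarity:
    name: str
    igv_bit_score: float          # best RPS BLAST bit score to an IGv domain
    double_domain: bool           # similar to both IGv domains
    region: tuple                 # (start, end) of the CICIDS, inclusive
    psi_blast_segments: list = field(default_factory=list)  # (start, end) pairs


def overlaps(a, b):
    """True if inclusive intervals a and b share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def passes_rps(s, min_bits=40.0):
    # Assumption: the 40-bit limit is a lower bound on the bit score.
    return s.igv_bit_score >= min_bits or s.double_domain


def passes_psi(s):
    return any(overlaps(s.region, seg) for seg in s.psi_blast_segments)


def double_selected(items):
    """Names of candidates satisfying both selection criteria."""
    return [s.name for s in items if passes_rps(s) and passes_psi(s)]
```

Keeping the two criteria as separate predicates makes it easy to report which candidates pass only one of them.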
IGv (domain) smart00406 similarities were more frequent in Table 2 (18 items) and in their PSI-BLAST fraction (11 items) than IGv cd00099 (13 and 6 items, respectively). In accordance with this result, smart00406 is a pluripotent conserved domain of phylogenetic linkage to seven different Ig domains, whereas unipotent (cd00096-linked) IGv cd00099 resembles a terminally diversified domain (for details see RPS BLAST web pages). Nevertheless (and perhaps in agreement with the proposed terminal status), cd00099 exhibits a higher score of similarities to the model IgW sequence set related to primordial Igs (for details see Reference Sequences [RfS] Used in the Combined Search Related to Table 3 [Materials and Methods]). Unfortunately, almost all IGv similarities presented in Table 2 are not regionally dominant. Most frequently, the Ig domain more related to constant chains [IG] smart00409 displays dominant Ig domain similarity to IGv-related segments, which may imply a closer relationship of this domain to ancestor structures of selected CICIDS and both IGv.
Traces of Hypermutation Milieu in the Selected IGv-Related Sequences of a Marine Sponge
Occurrence of hypermutation-related tetranucleotides in CDR1-like segments of GCSAMS
Our original (Kubrycht et al. 2004) and more recent searches concerned hypermutation-related tetranucleotides (i.e., hypermutation tetranucleotides [HT] and their single mutants [SM]) derived from both former and recent hypermutation motifs, i.e., RGYW/WRCY ([Rogozin and Kolchanov 1992; Dorner et al. 1998a] R = A,G) and DGYW/WRCH ([Rogozin and Diaz 2004] D = A,G,T; H = A,C,T; W = A,T; Y = C,T), respectively. These searches revealed the presence of some of these structures in GCSAMS(cdr1L.1), a longer CDR1-like segment of the first Ig domain of GCSAMS (for structural details see WP2.1; Kubrycht et al. 2004), whereas only a single occurrence of the cytosine- (and also AID)-unrelated HT TATT (Dorner et al. 1998b) was found.
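The degenerate motifs named above can be scanned for with ordinary regular expressions. The following Python sketch translates the IUPAC codes defined in the text (R, Y, W, D, H) into character classes and reports overlapping hits. It covers motif matching only, not the enumeration of single mutants, and the function names and example sequences are hypothetical.

```python
import re

# IUPAC codes used by the motifs RGYW/WRCY and DGYW/WRCH.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]",
    "D": "[AGT]", "H": "[ACT]",
}


def motif_regex(motif):
    """Compile a degenerate motif into a regex with a lookahead,
    so that overlapping occurrences are all reported."""
    body = "".join(IUPAC[base] for base in motif)
    return re.compile("(?=(" + body + "))")


def find_hits(seq, motif):
    """Return (0-based start, matched tetranucleotide) for every hit."""
    return [(m.start(), m.group(1))
            for m in motif_regex(motif).finditer(seq.upper())]
```

Scanning one strand with both members of a motif pair (for example DGYW and WRCH) covers occurrences on both strands, since the two members are reverse complements of each other.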
Because of the low power of Fisher’s test under our limiting conditions (in accordance with Lepš 1996), we used a combined two-parameter approach in our statistical evaluation of HT and their SM occurrences. We required reliability in the chi-square test with the correction for continuity, which is usually recommended to be used in the case of low mean values (Lepš 1996), and sufficient sample size in accordance with the binomial approach to short motifs (Kubrycht et al. 2004). Our approach revealed a significantly (p < 0.01) increased number of GGCA HT and SM in GCSAMS(cdr1L.1) relative to their occurrence in GCSAMS(cdr1L.2) or the expected random value. Despite the problematic sample size, the occurrence of the complementary SM pair AGTA/TACT is markedly increased. Thus both AGTA and TACT SM are present four times in two antiparallel pentadecanucleotide segments of GCSAMS(cdr1.1.com) but are absent in whole GCSAMS(cdr1L.2) as well as in the longer complementary part of GCSAMS(cdr1L.1). AGTA/TACT is also a unique pair which contains two important trinucleotide structures: the phylogenetically important motif AGY involved in hypermutation of Ig and nurse shark antigen receptors (Diaz and Flajnik 1998) and a unique cytosine-containing AGY-unrelated trinucleotide pair TAC/GTA correlated with Ig hypermutation (Dorner et al. 1998a).
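For orientation, the chi-square test with continuity correction on a 2 × 2 table can be written in a few lines. This sketch uses the standard Yates-corrected formula and the closed-form p-value for one degree of freedom; it does not reproduce the combined two-parameter approach or the binomial sample-size check described above, and the function name and example counts are hypothetical.

```python
import math


def yates_chi_square(a, b, c, d):
    """Yates-corrected chi-square statistic and p-value (1 df)
    for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    # Continuity correction: shrink |ad - bc| by n/2, not below zero.
    corrected = max(0.0, abs(a * d - b * c) - n / 2)
    chi2 = n * corrected ** 2 / denom
    # Survival function of chi-square with 1 df.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p
```

With small counts, the correction noticeably lowers the statistic compared with the uncorrected test, which is the reason it is recommended for low mean values.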
The segment almost identical to CDR1-like segment GCSAMS(cdr1.1.com) exhibits the least similarity to mammalian mRNA sequences within the preformed GF region
Since the regions investigated here are related to CDR1, we assumed that the occurrence of selected oligonucleotide mRNA dissimilarities (ORD) within GF should be first of all related to the mutation instability (see also WP2.1). Consequently, we observed the distribution of sites without sequence identities to human and mouse mRNA oligonucleotides of inferior possible lengths (IROIL) within GF. The resulting profiles are displayed in Fig. 1. ORD occur predominantly in the DIF1 segment (GP: N166–192). This local occurrence of ORD was significantly more frequent than that in the complementary part of GF (p < 0.01; chi-square evaluation for 2 × 2 tables [Lepš 1996]) or its envelope regions (ER) also mentioned in Fig. 1. Significant differences (p < 0.01) were also determined in the case of a similar evaluation of both ER of GCSAMS(cdr1.1.com) located at positions closely related to DIF1 (GP: N167–195 and N168–194). In addition, the important difference between DIF1 and the model neighbor segment of equal length, DIF2 (GP: N139–165), was also found in further comparative BLASTN searches, within the GF region. The DIF2-related odds ratios 2.6 and 4.9 (p < 0.01) suggest respective predominant dissimilarities of mouse and human mRNA sequences (encoding molecules of different names) with DIF1. The difference was related to the frequencies of the corresponding oligonucleotide identities (OI; length of 13 nucleotides and higher) in the nonvertebrate metazoan sequence set, where only a 1.2 times higher frequency of OI than in DIF1 was found in DIF2.
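The odds ratios quoted above compare two segments through a 2 × 2 table of counts. As a reminder of the arithmetic only, the sketch below computes an odds ratio from such a table; the layout of the table and the example counts are assumptions for illustration and are not the data of this study.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio (a/b) / (c/d) for the 2 x 2 table [[a, b], [c, d]].

    Hypothetical layout: rows are the two compared segments,
    columns are counts with and without an oligonucleotide identity.
    """
    if b == 0 or c == 0:
        raise ValueError("odds ratio undefined for a zero off-diagonal cell")
    return (a * d) / (b * c)
```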
Knowledge-Based Approach in the Subset of Coelomate Proteins Complementary to Preceding Ig Domain Screening
Similarities between selected and reference sequences
Six and four of ten selected molecules exhibited simultaneous BLASTP similarities with two or three different RfS, respectively (Table 3). Four and five such similarities were found in the cases of VDB and tractin, respectively. These frequent double and multiple BLASTP similarities suggest common structural features in the sequences of the presented molecules and RfS. TCRL represented the most frequently similar RfS in our BLASTP searches. Eight of ten selected molecules were similar to TCRL (for details see Table 3). These similarities differently overlapped with short IGv- and full length IGcam-related segments of TCRL located at aa positions 84–119 (TIGV) and 158–228 (TIGCAM), respectively. VDB similarity was a unique one which overlaps with the whole TIGV. The segments of leechCAM, RTPh, and apCAM partially overlapped with TIGV and fully overlapped with TIGCAM. Partial overlaps of both TIGV and TIGCAM were seen in their similarities to COS2.1 and tractin, whereas a single partial overlap of TIGV was found in the case of ror similarity. FGF receptor and the other tractin segments completely overlapped with TIGCAM.
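The distinction between full and partial overlaps with TIGV and TIGCAM can be made explicit with a small interval helper. The TCRL positions below are those given in the text; the rule that a "full" overlap means the similarity segment covers the whole reference region is an assumption, and the example segment coordinates in the tests are hypothetical.

```python
TIGV = (84, 119)      # short IGv-related segment of TCRL (aa positions)
TIGCAM = (158, 228)   # full-length IGcam-related segment of TCRL


def classify_overlap(segment, region):
    """Classify how an inclusive (start, end) similarity segment
    overlaps a reference region: 'full', 'partial', or 'none'."""
    s0, s1 = segment
    r0, r1 = region
    if s1 < r0 or s0 > r1:
        return "none"
    if s0 <= r0 and s1 >= r1:
        return "full"
    return "partial"
```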
Additional important structural relationships
Both RfS-related PSI-BLAST searches within the IgW sequence set selected the same IgW heavy chain sequence from the sandbar shark of clonal name AAB03680, which suggests an improved model importance of this RfS. This relationship appears to be important also from the point of view of construction of the hybrid template used in the following section. All the molecules without family-like similarities to vertebrate sequences displayed in Table 3 contained exclusively Ig domains (for a possible explanation see Questions and Possibilities [Discussion]).
Conserved Regions Within the Segments Related to Variable Ig Domains
Multiple sequence alignments of IGv-related molecules preselected by the double PSI-BLAST-derived procedure
Fifteen segments were selected by the procedure described in PSI-BLAST Procedure Further Selecting IGv-Related Sequences (Materials and Methods). Subsequent multiple sequence alignments performed by CLUSTAL W 1.82 and MUSCLE 2.01 (for complete record of alignments see WP3.1) enabled us to locate common aa and generate a hybrid template sequence (HTS1 in Table 4). Primary derived conserved regions (CR) were found in the C-terminal parts of selected IGv-related segments in accordance with the third section of Materials and Methods. Their positions also corresponded to the C-terminal position(s) of the compared Ig domain(s). CR contained three different common aa and did not contain any gaps.
Relationship between CR and Ohno’s primordial building block
High-density (more than 75% identity) sequence similarities to the CR segment of the model (reference sequence) IgW (CRIgW; Table 4) were found at three positions of accessible IgW sequences, i.e., 65–79/68–82, 80–94/81–95, and 98–112/99–113 (original CRIgW position was 98–112). This result was further confirmed by comparisons between Igs and symmetrically extended 25-aa- and 35-aa-long CR-containing segments of model IgW. Corresponding data permitted us to perform more reliable searches, which indicated prevailing occurrences of similarities at the second and the third alternative positions. Interestingly, the second alternative position of given CRIgW similarities in IgW corresponded simultaneously to the position (aa 83–98) of the peptide chain encoded by Ohno’s 48-base-long “modern intact primordial building block” (PBB) derived from sequences of Ig variable region genes (Ohno et al. 1982) and to the position (aa 81–95) in IgW chains corresponding to dominant BLASTP and BLASTX similarities related to PBB. For additional consistent relationships between PBB and CR sequences see Table 4. In conclusion, CRIgW (Table 4) appears to be a PBB homologue, possibly closer to the conserved sequence of AR ancestor. This possibility is also in agreement with the results of the following phylogram study.
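The idea of locating positions with more than 75% identity to a short conserved segment can be sketched as an ungapped sliding-window comparison. This is a simplification: the searches reported here relied on BLAST similarities rather than on a fixed-length window, and the function names and example sequences in the tests are hypothetical.

```python
def percent_identity(a, b):
    """Percentage of identical positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def high_density_positions(query, target, threshold=75.0):
    """1-based inclusive (start, end) positions in target where an
    ungapped window exceeds the identity threshold with the query."""
    k = len(query)
    return [(i + 1, i + k)
            for i in range(len(target) - k + 1)
            if percent_identity(query, target[i:i + k]) > threshold]
```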
Approach to phylogenetic analysis of variable and IGv-related segments
In accordance with our phylogram-based frequency analysis (chapter WP4), only conserved and PSI-BLAST derived segments (PBDS) of apCAM and MDM origin exhibited closer overall linkage to IgW than NITR2 (a molecule with a published very close phylogenetic relationship to AR [van den Berg et al. 2004]) in our 40 phylograms. Despite this, the subset of phylograms constructed based on CLUSTAL W-derived multiple sequence alignments with PBDS indicated the closest phylogram linkage between IgW and CG6867 segments, whereas only several (less frequent) close IgW phylogram linkages to apCAM, but not to MDM, were observed (WP4.3). In addition, at least comparable phylogram IgW relationships with that between the IgW and the VDB (a molecule selected three times in IGv-Related Molecules in Metazoans, above) segments were found in the cases of the ror, GCSAMSd1, VCBP5, and CG14469-PA segments (WPT1 in WP4.3).
Results of template-derived PHI BLAST searches
Hybrid template sequence HTS1 as a sequence query and two patterns, i.e., P1, including common aa Dx(3)YxC, and “single-tripled mutation-related” P2 (Dx(3)YxCx[AV]) (Table 4; see also WP3.2), enabled us to search for possible Ig-related conserved segments of nonvertebrate Metazoa origin (Formation of Hybrid Templates [Materials and Methods]). The searches with P1 resulted in 103 different segments. Except for 15 selected segments, all others contained an N-terminal leucine in addition to the selected pattern, thus forming the common structure Lx(8)Dx(3)YxC. Similarly, 33 of these segments contained N-terminal LTI. These facts led us to reevaluate the N-terminal aa of the HTS1 in accordance with the CLUSTAL W alignment and also to establish reevaluated high occurrence aa (rhoaa; Table 4).
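The patterns P1 and P2, as well as the extended structure Lx(8)Dx(3)YxC, can be checked against individual protein sequences with a simple pattern-to-regex conversion. The sketch below handles only the hyphen-free notation used in the text (residue letters, x, x(n), and bracketed alternatives); it is not a PHI-BLAST replacement, and the function names and example sequences are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Convert a pattern such as 'Dx(3)YxCx[AV]' into a compiled regex.

    x(n) becomes '.{n}', a bare x becomes '.', and bracketed
    alternatives such as [AV] are already valid regex classes.
    """
    body = re.sub(r"x\((\d+)\)", r".{\1}", pattern)
    body = body.replace("x", ".")
    return re.compile(body)


def pattern_starts(pattern, sequence):
    """0-based start positions of non-overlapping pattern matches."""
    return [m.start() for m in pattern_to_regex(pattern).finditer(sequence)]


P1 = "Dx(3)YxC"
P2 = "Dx(3)YxCx[AV]"
P1_LEU = "Lx(8)Dx(3)YxC"
```

Testing a P1 hit against `P1_LEU` as well gives a quick way to see whether a segment also carries the N-terminal leucine.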
At least 7 of the top 10 layers (T10L; i.e., working item subset in which Expect or bit score values are higher than the values of the eleventh item) determined by 12 PHI BLAST searches contained specific segments (SPSE) of tractin and CG6867 displayed in Table 4 and also hemolin segments of Lymantria dispar (gypsy moth) origin (insect hemolins are proteins induced by bacterial infection and interact with lipopolysaccharide [Lindstrom-Dinnetz et al. 1995; Yu and Kanost 2002]). On the other hand, SPSE of MDM and GCSAMSd2 were present only two times among T10L. SPSE of other molecules such as RTK and VCBP5 occurred only in items outside the T10L subset. A maximum number of identities with rhoaa (9 of the possible 10) was found in SPSE of CG6867. Four molecules selected here (CG6867, GCSAMS, MDM, VCBP5) are also mentioned in the immediately preceding section.
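The T10L definition can be expressed as a cutoff at the eleventh ranked item. The sketch below is an interpretation using bit scores only: it keeps every hit whose score is strictly higher than that of the eleventh item, so ties with the eleventh item are excluded and the subset may contain fewer than ten items. The function name and the (name, score) tuple layout are hypothetical.

```python
def top_layers(hits, rank=11):
    """Return hits scoring strictly above the item at the given rank.

    hits is a list of (name, bit_score) tuples; if there are fewer
    hits than the rank, all hits are returned in descending order.
    """
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    if len(ranked) < rank:
        return ranked
    cutoff = ranked[rank - 1][1]
    return [h for h in ranked if h[1] > cutoff]
```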
The search with P1 in the set of all available bacterial proteins revealed only ZP_00561298 from Desulfitobacterium hafniense, a molecule displayed in Table 1. The region containing the selected segment of ZP_00561298 (S1) is predominantly related to smart00408 (IGc2). S1 and two segments of the molecule COG3210 of Cytophaga hutchinsonii origin (see also Table 1) were present in the results of P2-related searches. One of the selected COG3210 segments (S2; aa positions 1625–1640) was located in the region exhibiting the highest conserved domain similarity found in the given molecule. This similarity concerned smart00409, which represents an important IGv-related domain (for details see IGv-Related Molecules in Metazoans, above). Both S1 and S2 contained seven rhoaa like the PHI BLAST-derived segment of MDM mentioned above.
Selected Molecules and Structures Related to Antigen Receptors
Instead of dominant high-score IGv similarities to N-terminal Ig domains of vertebrate AR and AR-related proteins, the sequence of IG (domain) smart00409 forms prevailing superior conserved domain similarities to nonvertebrate IGv-related protein segments (Table 2; see also IGv-Related Molecules in Metazoans [Results]). The corresponding possible relationship of smart00409 to IGv ancestor structure appears to be interesting, first of all, due to the similar general role of the IGc1 domain deduced from 3D studies of IGv (Du Pasquier et al. 2004). In addition, this possibility would also be in agreement with comparable results of parallel PSI-BLAST procedures with both IGv and single smart00409 displayed in Tables 2 and 4, respectively (provided that we disregard the consequences of antiredundant selection of sponge adhesion molecules before formation of the final set displayed in Table 4). The benefit of presented domain relationships is perhaps also in determined common aa and conserved regions, which could be interesting for future 3D reevaluations of IG and IGv architectures and interactivities. In addition, the C-terminal position of the conserved regions remarkably collocates with the AR segments undergoing recombination (for sequence comparisons see Relationship Between CR and Ohno’s Primordial Building Block [Results]).
In contrast to the highly sophisticated artificial sequence of smart00409, sequences of sponge molecules GCSAMS and GCSAML are actual AR-related sequences. Similarly to AR these highly homologous proteins participate in allograft reactions (Blumbach et al. 1999; Schutze et al. 2001) and include segments of conserved domain similarities to IGv (IGv-Related Molecules in Metazoans [Results]). GCSAMS as a representative of both these molecules is phylogenetically related to AR (see Approach to Phylogenetic Analysis of Variable and IGv-Related Segments and, also, Results of Template-Derived PHI BLAST Searches [Results]; Table 4) and its N-terminal segment also exhibits several structural relationships to CDR1 (see Traces of Hypermutation Milieu in the Selected IGv-Related Sequences of a Marine Sponge [Results], Fig. 1, and Kubrycht et al. ). In addition, the source of GCSAMS and GCSAML, the marine sponge Geodia cydonium (as well as other Demospongidae species), is possibly related to the earliest common metazoan ancestor (Urmetazoa [Muller 1998; Muller et al. 2001; Wiens et al. 2003]). Moreover, evidence was presented allowing the conclusion that marine sponge (Demospongidae) proteins are more closely related to the corresponding molecules from H. sapiens than to those of C. elegans and D. melanogaster (Muller et al. 2001). Consequently, GCSAMS and GCSAML represent possible candidates for common ancient phylogenetic origin with AR. A similar relationship to AR also concerns chordate VDB, a nonvertebrate molecule of maximal RPS BLAST similarity to IGv described here. This molecule was also selected by all our PSI-BLAST searches and passed through all criteria described in the first section of Results and in Table 3. In accordance with published results of BLASTX comparisons (Sato et al. 2003) and also with our BLASTP searches, VDB is more similar to T-cell receptors than to Igs, whereas GCSAMS, mentioned above, is more similar to Igs (Table 3). 
Despite the problematic phylogenetic relationship of corresponding species, several procedures used in Conserved Regions Within the Segments Related to Variable Ig Domains (Results) and WP4 suggest an interesting structural similarity between AR and the molluscan defense molecule (MDM). Like AR, MDM is considered to be a possible mediator of non-self recognition (Hoek et al. 1996). In contrast to the neuronal origin of all other protostomial molecules described in Table 3, this molecule is a unique one specifically expressed in granular cells located in connective tissue of mesoderm origin. In addition to preceding marked functional and structural similarity between AR and MDM, a question arises with respect to more general phylogenetic relationships among vertebrate AR, MDM, and some protostomial molecules involved in adhesion of neural cells. Hence apCAM and ror were selected in Approach to Phylogenetic Analysis of Variable and IGv-Related Segments (Results) and some additional protostomial molecules can be found in Table 4. Though the structurally based selection appears to be sufficiently exact, a broader reevaluation of protein interactivities has to be implemented in the future. We assume that particularly interactions with random peptides (Smith 1985; Rossenu et al. 1997) and recent prediction (Huang et al. 2004; Kim et al. 2004; Park et al. 2005) and experimental tools (Ito et al. 2001) of interactome analysis will be important to reevaluate recent sequence data, including also the sets of molecules described here.
Ig Domains of Unicellular Eukaryota
A recent study has revealed the existence of human molecules (putative cytoskeletal organizing protein TRIOBP [Riazuddin et al. 2006]) exhibiting conserved domain similarity to Ig/Ig superfamily-related domain Candida_ALS (pfam05792), frequently found in yeast molecules including α-agglutinin of Saccharomyces cerevisiae (Wojciechowicz et al. 1993; Sheppard et al. 2004). On the other hand, an absence of conserved domain similarities between proteins from unicellular Eukaryota and AR-related Ig domains (all Ig domains present in Tables 2 and 3 and two domains, IGc and IgGc1, astonishingly absent from the tables) is still observed. This follows not only from the database searches described under Materials and Methods, but also from the two-step (BLASTP and RPS BLAST) database screening using all not yet compared AR-related Ig domains as starting query sequences (data not shown). Only some special homologies of proteins of unicellular Eukaryota origin without unifying conserved domain similarity have been described, e.g., ICAM-L homologues were found in Leishmania (protozoan) genus (Chiang et al. 2001, 2002). Such a result need not represent a necessary contradiction. Hence new, more general (RPS-BLAST) and similar domains can appear in the future (similarly to the reevaluation of cytidine deaminase domains in the past).
The absence of conserved domain similarities between AR-related Ig domains and proteins of unicellular Eukaryota origin also contrasts with such similarities of bacterial proteins (Table 1). This unexpected difference may follow from more frequent symbiotic or parasitic interactions of bacteria with metazoans. Hence such interactions may potentiate topical convergent competitive changes of physical cell–cell interactions (including “reconstituted” or “imitating” interactions of metazoan proteins via similar segments of metazoan and bacterial Ig domains improving critical Ig domain similarities), or horizontal gene transfer of metazoan genes encoding proteins with critical conserved Ig domain similarities to bacteria (Ochman et al. 2000; Ray and Nielsen 2005), or even inverse transfer from bacteria to metazoans similarly to the postulated transfer of RAG1 genes during the “big bang” period of Ig germline gene reshuffling and rearrangement (Bernstein et al. 1996a; Marchalonis and Schluter 1998).
Questions and Possibilities
The relationship between hypermutating DNA regions (HDR) in somatic cells, i.e., the regions which also include hypermutating Ig exons (see also Introduction), and the more widely spread DNA hot-spot regions is as yet unknown. Nevertheless, some facts suggest a possible phylogenetic linkage between these nonstable DNA regions. Two types of somatic mutation structures (AGC/GCT involved also in a more specific Ig hypermutation and WAN) and GC dinucleotide related to meiotic mutation were found to be less frequently involved also in counterpart processes, i.e., in human meiotic and somatic mutations, respectively (Oprea et al. 2001). This means that both types of mutations, as well as hot spots and HDR, are still not completely separated in vertebrates. Interestingly, this fact and the assumed resistance of Ig domains composing AR-related molecules to hypermutation (and possibly also insertion/deletion changes) enable us to explain the absence of non-Ig domain similarities in the sequences of AR-related molecules less similar to vertebrate proteins described in Additional Important Structural Relationships (Results) and Table 3. Hence the gradual loss of function and elimination of the domain exons different from that encoding Ig ones can be expected in genes of more diversified sequences (less similar to recent vertebrates) included in hot spots or primitive HDR. The assumed resistance of AR-related Ig domains seems to be in accordance with the ability of antibodies to form alternative interactions (James et al. 2003) and a broad-range interactivity of molecules exhibiting IG and IGv conserved domain similarities (in addition, the existence of hypermutating AR genes). In the end, this consideration poses the question whether the genes of lower similarity to vertebrate molecules mentioned in Table 3 indeed hypermutate.
In our previous paper we hypothesized that DNA encoding some substrate, inhibitory or regulatory regions of protein kinases (PK), or a more general pattern or pattern-related DNA structure had a role in the formation of the ancestor CDR1 structure (Kubrycht et al. 2004; slightly updated). Interestingly, peptide PK substrates and inhibitors (PKSI) form an extensive set of sometimes very similar but distinctly interacting structures like hypervariable regions of antibodies CDR1 and CDR2 (Kubrycht et al. 2002). In addition, we can find the segments similar to PKSI in almost the same positions of N-terminal segments of Igs and GCSAMS (Kubrycht et al. 2002, 2004). These segments (PKSI-related regions) even overlap the N-terminus of CDR1 and the CDR1-like segment of GCSAMS (see Occurrence of Hypermutation-Related Tetranucleotides in CDR1-like Segments of GCSAMS [Results] and WP2.1). Besides the possible role of recombinant events, an interesting alternative based on convergent or combined convergent/recombinant events can also be considered, when trying to explain the given similarities (Kubrycht et al. 2004). This interesting alternative would follow from the recent concept of very ancient (more than 1000–500 million years ago) “innate immunity” including undiversified immune/clearance processes via ancient immune/preimmune molecules (Marchalonis et al. 2002). Hence the cross-similarity of PKSI-related regions mentioned above accentuates the question of their ancient inhibitory/binding cross-reactivity with the active centers of different PK (e.g., PK released by parasite organisms or dying cells). This interaction would lead to the proposed ancient clearance process, which diminishes possible disregulating effects of PK via their binding to the PKSI-related region of cell surface PPCAR (more precisely early PPCAR) and subsequent PK destruction in lysosomes. 
In this case, overlapping (perhaps weakly hypermutating) CDR1/DIF1-like segments (see Occurrence of Hypermutation-Related Tetranucleotides in CDR1-like Segments of GCSAMS [Results] and The Segment Almost Identical to CDR1-like Segment GCSAMS[cdr1.1.com] Exhibits the Least Similarity to Mammalian mRNA Sequences Within the Preformed GF Region [Results]) of PPCAR would then complete the necessary context of interaction (via additional recognition or steric hindrance), restricting and/or spreading the repertoire of recognized PK.
Since the hybrid template in Results of Template-Derived PHI BLAST Searches (Results) selects bacterial protein COG3210 belonging to a group of large exoproteins involved in heme utilization and adhesion, the question of the role of diversified heme structures in the early evolution of variable Ig domains arises. Nevertheless, a more detailed structural analysis will be necessary before such a possibility can be assumed, even in the case of the early PPCAR variant(s).
Two different procedures were demonstrated in our local search for segments related to Ig hypermutation (see Traces of Hypermutation Milieu in the Selected IGv-Related Sequences of a Marine Sponge [Results] and Kubrycht et al. ). These procedures represent only the initial stage of more extensive future mapping of such segments. In accordance with this trend, the first program predicting local occurrence of somatic mutations based on secondary structure of DNA was recently described (Wright et al. 2004). In addition to the most frequently used hypermutation tetranucleotides (Rogozin and Diaz 2004), a broader repertoire of structures was proved in their linkage to Ig hypermutation (Dorner et al. 1997, 1998a, b; Diaz and Flajnik 1998; Diaz et al. 1999; Oprea et al. 2001; Shapiro et al. 2003; Boursier et al. 2004; Duquette et al. 2005) and could be useful in future studies. Since the mechanism of Ig hypermutation was fully completed only in vertebrates, model organisms closer to the vertebrate lineage will also be necessary for further sophisticated phylogenetic research. The first example of such animal model is possibly G. cydonium (Muller 1998; Muller et al. 2001), whose molecules were successfully correlated in this paper. The recently described monitoring of hypermutation using retroviral vectors with fluorescence proteins of different color (Klasen et al. 2005) also represents a possibility for similar investigation of potentially competent or on-line predicted nonvertebrate cells (e.g., cells expressing the molecules mentioned in the first or second paragraphs of Questions and Possibilities or Selected Molecules and Structures Related to Antigen Receptors, respectively, above).
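A local search for hypermutation-related oligonucleotides of the kind discussed above can be sketched with IUPAC degenerate codes. The example below is only an illustration and not the procedure used in this paper: the function names and the toy DNA string are assumptions, RGYW is the motif named in the title of Bradshaw et al. (2002), WRCY is its reverse complement, and AGC is one of the somatic mutation structures mentioned in the first paragraph of Questions and Possibilities.

```python
import re

# IUPAC nucleotide codes used by the motifs below.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]",    # purine
    "Y": "[CT]",    # pyrimidine
    "W": "[AT]",    # weak pair
    "N": "[ACGT]",  # any base
}


def motif_regex(motif: str) -> str:
    """Expand an IUPAC motif such as 'RGYW' into a regular expression."""
    return "".join(IUPAC[base] for base in motif)


def find_motif(motif: str, dna: str):
    """Return (1-based start, matched oligonucleotide) pairs, overlaps included."""
    regex = re.compile("(?=(" + motif_regex(motif) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(dna.upper())]


# Hypothetical toy sequence, not derived from GCSAMS or any Ig gene.
TOY_DNA = "TTAGCTGGAGCACC"

for motif in ("RGYW", "WRCY", "AGC"):
    print(motif, find_motif(motif, TOY_DNA))
```

Because AGCT is palindromic, the same position is reported for both RGYW and WRCY in the toy sequence, whereas AGCA matches RGYW only.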
Abbreviations related to molecules may denote both protein and nucleotide sequences. Standardized BLAST searches and procedures are unambiguously denoted by usual abbreviations. CDR1—the first hypervariable region of Igs; GP—terminal GCSAMS positions of the observed peptide or oligonucleotide segments are marked with “aa” or “N,” respectively; GCSAM—sponge adhesion molecules of Geodia cydonium origin; GCSAMS and GCSAML—original abbreviations of cell recognition molecules from the sponge G. cydonium, also denoted GSAMS and GSAML, respectively; PPCAR—phylogenetic precursor(s) of hypothetical common ancestor of antigen receptors encoded possibly by rearranging gene; WP1.1 to WP5.5—sections of the web page http://www.papersatellitesjk.com, including also more detailed lists of abbreviations.
- Azumi K, De Santis R, De Tomaso A, Rigoutsos I, Yoshizaki F, Pinto MR, Marino R, Shida K, Ikeda M, Ikeda M, Arai M, Inoue Y, Shimizu T, Satoh N, Rokhsar DS, Du Pasquier L, Kasahara M, Satake M, Nonaka M (2003) Genomic analysis of immunity in a Urochordate and the emergence of the vertebrate immune system: “Waiting for Godot.” Immunogenetics 55:570–581
- Bradshaw PS, Condie A, Matutes E, Catovsky D, Yuille MR (2002) Breakpoints in the ataxia telangiectasia gene arise at the RGYW somatic hypermutation motif. Gene 21:483–487
- Deppenmeier U, Johann A, Hartsch T, Merkl R, Schmitz RA, Martinez-Arias R, Henne A, Weizer A, Baumer S, Jakobi C, Bruggemann H, Lienard T, Christmann A, Bomeke M, Steckel S, Bhattacharya A, Lykidis A, Overbeek R, Klenk HP, Gunsalus RP, Fritz HJ, Gottschalk G (2002) The genome of Methanosarcina mazei: evidence for lateral gene transfer between Bacteria and Archaea. J Mol Microbiol Biotechnol 4:453–461
- Diaz M, Velez J, Singh M, Cerny J, Flajnik MF (1999) Mutational pattern of the nurse shark antigen receptor gene (NAR) is similar to that of mammalian Ig genes and to spontaneous mutations in evolution: the translesion synthesis model of somatic hypermutation. Internat Immunol 11:825–833
- Kabat EA, Wu TT, Perry HM, Gottesman KS, Foeller C (1991) Sequences of proteins of immunological interest. NIH publication No. 91-3242. NIH, Bethesda, MD
- Kubrycht J, Borecký J, Sigler K (2002) Sequence similarities of protein kinase peptide substrates. Comparison of their primary structures with immunoglobulin repeats. Folia Microbiol 47:319–358
- Kubrycht J, Borecký J, Souček P, Ježek P (2004) Sequence similarities of protein kinase substrates and inhibitors with immunoglobulins and model immunoglobulin homologue: cell adhesion molecule from the living fossil sponge Geodia cydonium. Mapping of coherent database similarities and implications for evolution of CDR1 and hypermutation. Folia Microbiol 49:219–246
- Lepš J (1996) Biostatistics. University of Southern Bohemia, Ceske Budejovice, Czech Republic
- Morelli C, Karayianni E, Magnanini C, Mungall AJ, Thorland E, Negrini M, Smith DI, Barbanti-Brodano G (2002) Cloning and characterization of the common fragile site FRA6F, harboring a replicative senescence gene and frequently deleted in human tumors. Oncogene 21:7266–7276
- Notredame C (2003) Recent progress in multiple sequence alignments: a survey. Available at: http://www.isrec.isb-sib.ch/~cschmid/DEA/Module5/lectures/4.2.msa_algorithms.pdf
- Riazuddin S, Khan SN, Ahmed ZM, Ghosh M, Caution K, Nazli S, Kabra M, Zafar AU, Chen K, Naz S, Antonellis A, Pavan WJ, Green ED, Wilcox ER, Friedman PL, Morrel RJ, Riazuddin S, Friedman TB (2006) Mutations in TRIOBP, which encodes a putative cytoskeletal-organizing protein, are associated with nonsyndromic recessive deafness. Am J Hum Genet 78:137–143
- Wiens M, Mangoni A, D’Esposito M, Fattorusso E, Korchagina N, Schroder HC, Grebenjuk VA, Krasko A, Batel R, Muller IM, Muller WEG (2003) The molecular basis for the evolution of the metazoan bodyplan: extracellular matrix-mediated morphogenesis in marine demosponges. J Mol Evol 57:S60–S75
- Zvárová J (2001) Biomedical statistics. I. The fundamentals of statistics for biomedical fields. Karolinum, Prague, Czech Republic