Background

MBL (mannose-binding lectin) is an important component of innate immunity and a central recognition molecule of the lectin pathway of complement, which probably represents the most ancient pathway of complement activation [1]. It binds to an array of carbohydrates such as D-mannose and N-acetyl-D-glucosamine on the surface of pathogens and either directly opsonizes the microorganism for phagocytosis or activates the complement system via interaction with MBL-associated serine proteases (MASP-1, -2, -3 and MAp19). Complement activation kills the pathogen by the membrane-attack complex or by complement-mediated phagocytosis through increased deposition of opsonic C3 fragments. MBL is also able to recognize altered self structures present on apoptotic cells, promoting their clearance, and to modulate the release of various pro-inflammatory cytokines [2, 3].

The MBL2 genetic polymorphism is responsible for the very common and widespread variation of circulating levels of MBL oligomers and of functional activity of the protein in the human species. This variation is mainly caused by three single nucleotide polymorphisms (SNPs) in the first exon of the gene: MBL2*D (Arg52Cys), *B (Gly54Asp) and *C (Gly57Glu). These mutations have a profound effect on the assembly and stability of the protein, which leads to an increase of low-molecular-mass MBL that has reduced capacity of activating complement and of ligand binding [4, 5]. The D, B and C SNPs have been collectively labeled O, whereas the major alleles at these loci have been called A. The concentration of the protein in serum is further modulated by at least three SNPs in the promoter region: MBL2*H, L (located 550 bp before the transcription start site), X, Y (located 221 bp before the transcription start site) and P, Q (non-coding SNP located 4 bp after the transcription start site) [6, 7]. The combination of structural and promoter polymorphisms results in a dramatic variation in the concentration of high-order MBL oligomers in apparently healthy individuals of up to 1,000-fold (European: range <20-10,000 ng/ml) [8]. Because of linkage disequilibrium between the SNPs, only eight haplotypes are commonly found (as opposed to the 64 theoretically possible), associated with increasingly lower MBL serum concentration: MBL2*HYPA = LYQA = LYPA > LXPA > HYPD = LYPB = LYQC = LYPD [7, 9-13]. Using a haplotyping strategy developed by one of us, we recently defined 14 additional allelic haplotypes, most of them similar to LYQA or LYPA [2]. Genotypes carrying two copies of either HYPD, LYPB, LYQC or LYPD, or one of them and LXPA, are particularly associated with the susceptibility and severity of many diseases, as well as with protection against intracellular infections such as tuberculosis, leprosy and leishmaniasis [14-16].
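The figure of 64 theoretically possible haplotypes follows from six biallelic SNPs (three in the promoter, three in exon 1), i.e. 2^6 combinations. The short Python sketch below enumerates them. The pairing of each exon 1 variant with the major allele A at its own codon, and the helper name `four_letter_label`, are illustrative assumptions rather than part of the original analysis.

```python
from itertools import product

# Six biallelic sites: three promoter SNPs and three exon 1 SNPs.
# "A" stands for the major allele at each exon 1 codon (52, 54, 57).
SITES = [("H", "L"), ("X", "Y"), ("P", "Q"), ("A", "D"), ("A", "B"), ("A", "C")]

# Every combination of one allele per site.
ALL_COMBINATIONS = list(product(*SITES))

# The eight haplotypes commonly reported in the literature.
COMMON = ["HYPA", "LYQA", "LYPA", "LXPA", "HYPD", "LYPB", "LYQC", "LYPD"]


def four_letter_label(combo):
    """Return the conventional four-letter name, or None when more than one
    exon 1 variant sits on the same chromosome (no four-letter name exists)."""
    promoter = "".join(combo[:3])
    variants = [allele for allele in combo[3:] if allele != "A"]
    if len(variants) > 1:
        return None
    return promoter + (variants[0] if variants else "A")


LABELS = {four_letter_label(c) for c in ALL_COMBINATIONS} - {None}

if __name__ == "__main__":
    print(len(ALL_COMBINATIONS), "possible combinations")
    print(len(LABELS), "of them have a four-letter name")
    print(all(name in LABELS for name in COMMON))
```

Only a small fraction of the combinatorial space is observed, which is the signature of strong linkage disequilibrium across the region.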

In this work, we aimed to improve our former analysis by sequencing and haplotyping larger samples of European- and African-derived populations. In order to standardize and simplify comparisons between future association studies, we propose a nomenclature based on the evolutionary convergence of the identified MBL2 haplotypes [17]. We tested our samples for the hypothesis of selective neutrality and suggest that stochastic evolutionary factors have had a major role in shaping this polymorphism worldwide.

Results

To uncover the selective role diseases could have exerted on the MBL2 polymorphism, we evaluated the MBL2 promoter and exon 1 region from 856 chromosomes of Gabonese adults (this work) and children [2], as well as from 470 chromosomes belonging to individuals of European descent, and compared them with previously published data. Genotype frequencies were in Hardy-Weinberg equilibrium.

MBL2 haplotypes identified in this study are listed in Table 1. They were named according to their evolutionary divergence [17] from a hypothetical ancient sequence probably related to LYQA and LYPA [11, 18]. According to the nomenclature system we adopted, the first clades to diverge are numbered with Arabic numerals. The 26 identified haplotypes are divided into two major phylogenetic branches by six polymorphisms (P1, Q1 or g.396A > C; P2, Q2 or g.474A > G; P3, Q3 or g.487A > G; P4, Q4 or g.495_500del6; P5, Q5 or g.753C > T, all in strong linkage disequilibrium with the commonly investigated P6, Q6 polymorphism or g.826C > T) (Figure 1). Clade *1 is represented by LYPA and other haplotypes with P variants. Clade *4 is represented by LYQA and other haplotypes with Q variants. Other clades are represented by the intermediate rare haplotypes previously found by our group in Gabon (*2 and *3) [2]. Sublineages of each clade are subsequently designated with capital letters (e.g. LYQA-derived haplotypes = *4A and LYQC-derived haplotypes = *4F), and individual present-day haplotypes are given Arabic numerals (e.g. LYQA = *4A1), following the schema numerals/letters/numerals, if they diverge further (e.g. the LYQC-derived haplotype with the g.797C > A SNP, associated with severe malaria = *4F2A). This system is flexible enough to accommodate new haplotypes. For example, we added the LYPA-similar haplotypes H16 and H19, found by others exclusively in Pygmy populations [19], as *1K1 and *1L1, and added the HYPG haplotype, described by us in another study [16], as *1B4. It is, however, not suited for recombinant haplotypes. In this case, we chose to call them by the names of the parental haplotypes, separated by a dot. LYPD, for example, is most probably the product of a recent intragenic recombination event between HYPD (*1B2) and LYPA (*1A1) or LYPB (*1F1) [20]. Since the recombination between HYPD and LYPB would have generated HYPB, which has not been found, we arbitrarily chose to call this haplotype *1A1.1B2 (equivalent to LYPA × HYPD). We also wished to incorporate reported associations of haplotypes with MBL concentration. To do this, we added a dash followed by a lowercase "h" or "l", referring to "high" or "low" MBL levels in serum, respectively (e.g. LYQA = *4A1-h).
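The naming scheme described above (clade numeral, sublineage letter, haplotype numeral, optional further letters or numerals, parental names joined by a dot for recombinants, and an optional "-h" or "-l" suffix) is regular enough to be parsed mechanically. The following Python sketch shows one possible parser; the class and function names are hypothetical, and the grammar is inferred only from the examples given in the text, so it may not cover every name in Table 1.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# One haplotype designation: clade numeral, sublineage letter, haplotype numeral,
# then any further letters/numerals (e.g. the trailing "A" of 4F2A).
_PART = re.compile(r"(?P<clade>\d+)(?P<sublineage>[A-Z])(?P<number>\d+)(?P<suffix>[A-Z0-9]*)")

_LEVELS = {"h": "high", "l": "low"}


@dataclass
class Designation:
    clade: int
    sublineage: str
    number: int
    suffix: str


@dataclass
class ParsedName:
    parts: List[Designation]
    mbl_level: Optional[str]  # "high", "low", or None when no level is given

    @property
    def recombinant(self) -> bool:
        # Recombinants carry the names of both parental haplotypes.
        return len(self.parts) > 1


def parse_mbl2_name(name: str) -> ParsedName:
    """Parse names such as '*4A1-h', '*4F2A-l' or '*1A1.1B2-l'."""
    text = name.strip().lstrip("*")
    level = None
    if "-" in text:
        text, flag = text.rsplit("-", 1)
        if flag not in _LEVELS:
            raise ValueError(f"unknown MBL level flag: {flag!r}")
        level = _LEVELS[flag]
    parts = []
    for chunk in text.split("."):
        match = _PART.fullmatch(chunk)
        if match is None:
            raise ValueError(f"not a valid haplotype designation: {chunk!r}")
        parts.append(Designation(int(match["clade"]), match["sublineage"],
                                 int(match["number"]), match["suffix"]))
    return ParsedName(parts, level)


if __name__ == "__main__":
    for example in ["*4A1-h", "*4F2A-l", "*1A1.1B2-l", "*1C2"]:
        print(example, parse_mbl2_name(example))
```

Keeping the clade and sublineage as separate fields makes it straightforward to group haplotypes by branch (for instance, all *1B-derived haplotypes) in downstream association analyses.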

Table 1 Nucleotide changes and haplotypes of MBL2.
Figure 1

Maximum parsimony tree with suggested phylogenetic nomenclature (see text). The recombinant haplotype *1A1.1B2-l (LYPD) was excluded. Bootstrap values are given at nodes of the tree.

We identified in this and in other studies 14 haplotypes belonging to clade *1 and 9 haplotypes belonging to clade *4 and added data from others for comparison (Table 2). Eight of the first 14 and 6 of the last 9 haplotypes were polymorphic in at least one population. Among the rare haplotypes, we found three previously unknown in the European population: *1B3, a rare HYPA-similar haplotype; *1C2, the only LXPA-similar haplotype; and *4C1-h, a LYQA-similar haplotype with a g.456G > T SNP found in three heterozygotes (the first two haplotypes were singletons). The g.456G > T SNP was assigned by others to an otherwise HYPA haplotype reconstructed from unphased genotypic data of one Sardinian heterozygote [19]. Maximum likelihood phasing of our own data with the EM and ELB algorithms generated 1-2% erroneously assembled haplotypes in the Gabonese and European samples. In the Gabonese sample alone, seven spurious "new" haplotypes were generated with the EM and eleven with the ELB algorithm (Table 3). To verify the effect of sample size on frequency estimates, we compared the haplotype distribution between some populations investigated by us and by others [19]. Although there were no significant differences with the exact population differentiation test, differences between individual haplotype frequencies were significant, even between samples with similar ancestry (Table 2).

Table 2 MBL2 haplotype frequencies (%) in diverse populations.
Table 3 Performance of haplotyping algorithms.

With the exception of LYQA (*4A1-h), *4 haplotypes are well represented only in the African population. In contrast, HYPA (*1B1-h) and LYPB (*1F1-l) are among the *1 haplotypes that reach high frequencies in the European, Asian and Native American, but not in the African population (Figure 2). The uneven haplotype distribution around the world is reflected by the average Fst value among all segregating sites (0.1831, P < 0.00001), which indicates great genetic differentiation between the analysed populations. One of the lowest individual significant Fst values corresponded to the X/Y SNP, whereas the highest values corresponded to the H, L and P, Q segregating sites (Figure 3). The time to the most recent common ancestor (TMRCA) of the MBL2 alleles was inferred at 73,251 years ago [95% CI 5,220 - 214,440]. The mean coalescence time implies that the ancestors of the group *1 and *4 alleles separated before the modern human dispersal from Africa [21]. The TMRCA of groups *1 and *4 was estimated to be ca. 55,000 years ago, which also indicates that the presence of alleles of African populations in both clades is a result of ancient ancestry.

Figure 2

Mutational network of MBL2 haplotypes. The size of each node is proportional to the haplotype frequency in the pooled sample (this work and [18, 28]). Variant nucleotide positions are indicated in red. In black: African; red: European; orange: North Chinese; blue: Guarani; green: Kaingang.

Figure 3

Global Fst values distributed along the sequence. Fsts were calculated using the data of Table 2 and of [19]. Nucleotide positions corresponding to variant sites are shown on the x-axis. *** p < 0.001, ** p < 0.01.

*1B-derived haplotypes, even those found using maximum-likelihood phasing by others [19], seem to be restricted to Eurasian populations. Beyond those described in this work, we recently identified *1B4 in the Euro-Brazilian population. This haplotype is similar to HYPA but with a synonymous substitution in codon 44 (also called HYPG) [16]. To our knowledge, LXPA (*1C1-l) has only one rare similar haplotype (*1C2), identified in one European individual. We also found only one LYPB-similar haplotype (*1F2), but others cite another four [19]. Each occurs with frequencies around 2% in Asian/Amerindian groups (Ashkenazi Jewish, Japanese, Chinese and Kaingang), but three were defined by SNPs upstream of the region analysed in this study. The *1H1-h haplotype has a global distribution similar to that of the commonly investigated haplotypes and is well represented in African, Asian and Amerindian(-derived) populations, being less frequent in European groups. We found a similar haplotype (*1H2-h) once in a Gabonese and once in a Euro-Brazilian individual. All other clade *1 haplotypes are concentrated in African groups. *1E1-h has a rare coding mutation found only once in the Gabonese, as was *1G1-h [2]. The *1D1-h haplotype, which we found with 3% frequency in this population, was found by others with comparable frequencies (1.6 - 4.2%) in the Mbuti Pygmy, Nigerian Yoruba and Somali populations [19]. *1J1-h was also found with 1.6% and 0.8% frequencies in Tanzanian Chagga and in the Somali groups, respectively. *2A1-h and *3A1-h are intermediate between P- and Q-containing haplotypes and most probably reminiscent of the ancient original MBL2 haplotype [2]. The LYQA-similar *4B1-l haplotype carries a coding mutation and was found only once in the Gabonese, as was the LYQC-similar haplotype *4F2B-l. In addition to the Gabonese, *4D1-h was found by others at a frequency of 1.6% in the Tanzanian Chagga [19]. *4E1-h has a SNP within a glucocorticoid responsive element and seems to be well distributed in Africa, except in the Mbuti and Baka Pygmies [19]. *4F2A-l was formerly found associated with severe malaria [2] and has a similar distribution, except for the fact that it is also present in South-West Asian and European(-derived) groups with 11.9% (Ashkenazi Jewish [19]) to 0.5% (Germans, this work) frequencies. *4F3-l was also found in the Biaka Pygmy (2.1%), Nigerian Yoruba (1.6%) and Tanzanian Chagga (4.7%) groups [19], as well as in Afro-Americans [18] and in one individual of the Kaingang Amerindian population, known to be of mixed ancestry [22].

Tajima's D was significant in those regions containing five of the six P, Q segregating sites in the Gabonese population (Figure 4A). Fu and Li's D*, in contrast, was significant in regions with rare SNPs: the LXPA-derived *1C2 haplotype in Europeans and the LYPA-derived *1E1-h haplotype in the Gabonese (also called LYPF due to a non-synonymous SNP in the exon 1 region) (Figure 4B). The highest nucleotide diversity was registered in the same windows as the Tajima's D peaks (Figure 4C). None of the neutrality tests employed for the whole sequence or parts of it yielded significant results (Table 4).

Table 4 MBL2 sequence diversity parameters of several populations.
Figure 4

Sliding window plot of (A) Tajima's D values, (B) Fu and Li's D* values and (C) nucleotide diversity for the entire sequenced region. Statistics were calculated for overlapping windows of 60 bp, placed at 15 bp intervals along the sequence. * P < 0.05.

Discussion

Both circulating levels of MBL oligomers and functional activity have been correlated with common MBL2 genetic variants. There are at least 28 segregating sites in the MBL2 promoter and exon 1 sequence [23], and 26 allelic haplotypes were physically defined in this study. Nucleotide diversity in Afro-derived populations reached five times the average value of chromosome 10 (8.25 × 10⁻⁴) [24], where the MBL2 gene resides (10q11.2→q21). This is still two times lower than the lowest values found for polymorphic MHC regions (1%) [25], indicating that the MBL2 promoter-exon 1 nucleotide diversity is intermediate among immune protein coding genes.

Several of the newly identified haplotypes are polymorphic and of interest for disease association studies. Nevertheless, besides the A/B/C/D system adopted for exon 1 alleles since 1991 [26] and the H, L, X, Y and P, Q names used for promoter SNPs since 1998, no other nomenclature has been suggested. We adopted a phylogenetic approach that easily accommodates new haplotypes following a logical order, and suggested a way to name possible recombinant haplotypes, incorporating knowledge about MBL serum levels.

Nevertheless, haplotypes generated with the EM and ELB haplotyping algorithms should be included with caution, especially when containing singletons. In our comparison, the EM and ELB algorithms allowed for 1-2% errors in populations with high nucleotide diversity (π). The pseudo-Bayesian ELB performed worse in groups with very high π values, such as Africans, generating more spurious "new" haplotypes. We did not find six of the haplotypes reconstructed by others using the Bayesian method implemented in the PHASE software [19]. Two were recombinant (LYQB and HXPA), one presented a SNP that we haplotyped to LYQA and three were LYPA-similar haplotypes that seemed to be restricted to Pygmy populations, with SNPs presenting high Fst values. To avoid the inclusion of false haplotypes in the nomenclature system, we followed the approach of a group which only analysed haplotypes having a minimal frequency of 10% [27]. Two of the Pygmy haplotypes fulfilled this requirement, but all other haplotypes should ideally be phased by a physical haplotyping technique before inclusion.

Others used sample sizes at least four times smaller than ours [18, 19]. This caused discrepant frequency results, especially for the most common haplotypes. Since rare variants are not easily detected in small population samples, we also found considerable differences between our Fu and Li's D* and F* results and those of others [18]. Indeed, two singletons caused significant D* values in regions with very low nucleotide diversity levels, specifically in our European and Gabonese samples.

We added data from other studies [2, 18, 19, 22, 28] to calculate the Fst statistic. This approach resulted in much higher Fst values for the whole gene (0.18) than those found previously by others (0.06 [18]) and by us using only the Amerindian and Chinese samples (0.12, [22]). The same was true for the H/L and P/Q SNPs (Fst values around 0.2-0.25, compared to published 0.1-0.15, [18]), which indicates that they are good markers for population differentiation. As opposed to these high Fst values, the X/Y SNP presented values lower than 0.05 in this and in another study [18], compatible with global balancing selection.

We previously discussed the origin and distribution of the LYPA (*1A1-h), HYPA (*1B1-h), HYPD (*1B2-l), LXPA (*1C1-l), LYPB (*1F1-l), LYPD (*1A1.1B2-l), LYQA (*4A1-h) and LYQC (*4F1-l) haplotypes [22]. In general, the most frequent clade *1 haplotypes are globally distributed, whereas clade *4 haplotypes are more restricted to the African continent. Four of the five most ancient haplotypes also belong to clade *1: *1A1-h, *1B1-h, *1C1-l and *1H1-h. Among them, only *1C1-l (with the X variant) is associated with low (although complement-activating) MBL production. This and the *4A1-h haplotype do not naturally occur in native Aboriginal, Greenlandic and Amerindian populations [11, 22, 29, 30], having probably been lost through bottleneck effects along the migration routes. The other eight polymorphic haplotypes (with a frequency higher than 1%) have probably had a more recent origin, being geographically more restricted. Among them, only two are associated with high MBL levels: *1D1-h and *4E1-h. All others generate low MBL levels that, in addition, are greatly restricted in complement activation due to the B, C and D mutations, which occur in critical residues of the collagen-like region (*1B2-l, *1F1-l, *4F1-l, *4F2A-l, *4F3-l and *1F2) (Figure 5). Interestingly, the MBL1P1 pseudogene has been selectively turned off during evolution through the same molecular mechanisms causing the non-functional recent MBL2 haplotypes in man [31]. A more restricted distribution is obviously the case for all haplotypes containing singletons, as well as for *1J1-h, *4D1-h and *4F3-l in Africa and *1A1.1B2-l, *1B2-l and *4C1-h in Europe. They are therefore characteristic of different ethnic groups.

Figure 5

Distribution and origins of the MBL2 alleles in the world. Arrows denote the mutational steps between haplotypes (six between *1 and *4) and, when dotted, the ancient migratory routes with their approximate ages [51]. Haplotypes that could have been lost by natural selection and/or genetic drift are denoted by '?'. In bold: haplotypes generated before human out-of-Africa migration. Squared: more recent haplotypes, with geographically restricted distribution. KYA: thousand years ago.

The clades *1 and *4 are separated by six mutational steps (P, Q variants), which probably occurred before the human out-of-Africa migration (Figure 5). Of these six segregating sites, probably the most ancient is the g.487G > A variant and the most recent, the g.396A > C variant [2]. Q variants are less widely distributed than P variants, which explains their high Fst values. They are functionally associated with higher promoter activity [6, 32] and five of them presented positive, significant Tajima's D values in the Gabonese population. A significant positive value for Tajima's D test indicates an excess of intermediate-frequency variants, as compared with expected frequencies under neutrality, and constitutes evidence of balancing selection (mutations leading to higher MBL levels could have been selectively retained in the ancient human population) or population subdivision. Nevertheless, the emergence of several recent mutations as well as genetic drift erased the selective signature at the long haplotype scale, leading to non-significant, although positive, Tajima's D values for the whole haplotype in this and in other studies (e.g. Table 4), one of which included 1,166 chromosomes from 24 worldwide populations [18, 19, 22]. The patterns of MBL2 variation at the large temporal scale would thus have been shaped by stochastic evolutionary factors and therefore be compatible with neutral evolution.

Conclusions

In this work, we evaluated the MBL2 promoter-exon 1 region using haplotype-specific sequencing in more than 700 chromosomes and found three new European haplotypes. We propose a phylogenetic nomenclature to standardize MBL2 studies and found two major phylogenetic branches due to six strongly linked polymorphisms associated with high MBL production. They present high Fst values and are embedded in regions with high nucleotide diversity and significant Tajima's D values. Compared to others using small sample sizes and unphased genotypic data, we found differences in haplotyping, frequency estimation, Fu and Li's D* and Fst results. Using extensive testing for selective neutrality, we confirmed that stochastic evolutionary factors have had a major role in shaping this polymorphic gene worldwide.

Methods

Subjects and samples

We investigated 104 German Europeans, 131 Euro-Brazilians and 144 Gabonese adults. The German Europeans were healthy unrelated students and employees of the University of Tübingen, enrolled as controls in a genetic association study with type 2 diabetes, approved by the Ethics Committee of the University of Tübingen in Germany [33]. The Euro-Brazilians were healthy blood donors with mixed but predominantly European ancestry, resident in Paraná state, South Brazil, sampled for different association studies, all approved by the Ethics Committee of Research in Humans of the Clinical Hospital, Federal University of Paraná, Brazil [16, 34, 35]. The Gabonese individuals took part in a large epidemiologic survey to detect the prevalence of asymptomatic Plasmodium falciparum infection in the villages around Lambaréné, Gabon, a study approved by the ethics committee of the International Foundation Albert Schweitzer Hospital [36]. All individuals signed an informed consent form prior to their inclusion in these studies.

MBL2 typing

Blood was collected with the anticoagulant ethylenediaminetetraacetic acid (EDTA), and DNA was extracted from peripheral blood mononuclear cells through standard salting-out and phenol/chloroform/isoamyl alcohol methods. A fragment of 1059 nucleotides was amplified using the forward primer MBLfor (5'-ATGGGGCTAGGCTGCTGAG-3') and the reverse primer MBLrev (5'-CCAACACGTACCTGGTTCCC-3'). Sequence-specific primer (SSP) PCR products were generated using the same reverse primer, combined with forward primers specific for the variant H (Hf: 5'-GCTTACCCAGGCAAGCCTGTG-3') or for the variant L (Lf: 5'-GCTTACCCAGGCAAGCCTGTC-3'); for the variant X (Xf: 5'-CCATTTGTTCTCACTGCCACC-3') or for the variant Y (Yf: 5'-CCATTTGTTCTCACTGCCACG-3'). The PCR products with the primers Hf or Lf with MBLrev and Xf or Yf with MBLrev were 837 and 508 nucleotides in length, respectively. Hf and Lf were also combined with specific reverse primers for the variant P (Pr: 5'-CTCAGTTAATGAACACATATTTACCG-3') or for the variant Q (Qr: 5'-CTCAGTTAATGAACACATATTTACCA-3'), generating a product of 599 nucleotides. All fragments were sequenced with the amplification primers or with an internal exon 1 sequencing primer, MBLint (5'-GAGGCCAGGGATGGGTCATC-3'), using BigDye Terminator version 1.1 chemistry (Applied Biosystems, Foster City, CA). Amplification conditions are described in detail elsewhere [20]. The reactions were purified with the Performa DTR V3 system (Edge BioSystems, Gaithersburg, MD) and analyzed on an automated sequencer (ABI Prism 3100 Genetic Analyzer, Applied Biosystems, Foster City, CA). New variants (singletons) were verified by reamplification and resequencing.
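The allele-specific primers listed above come in pairs (Hf/Lf, Xf/Yf, Pr/Qr) that appear to differ only in their 3'-terminal base, which is the usual design for sequence-specific priming. As a sanity check on the transcribed sequences, the Python sketch below compares each pair position by position. The helper names are hypothetical and the check says nothing about the actual amplification conditions.

```python
# Allele-specific primer sequences as given in the text (5' to 3').
PRIMERS = {
    "Hf": "GCTTACCCAGGCAAGCCTGTG",
    "Lf": "GCTTACCCAGGCAAGCCTGTC",
    "Xf": "CCATTTGTTCTCACTGCCACC",
    "Yf": "CCATTTGTTCTCACTGCCACG",
    "Pr": "CTCAGTTAATGAACACATATTTACCG",
    "Qr": "CTCAGTTAATGAACACATATTTACCA",
}

# Primer pairs targeting the two alleles of the same SNP.
PAIRS = [("Hf", "Lf"), ("Xf", "Yf"), ("Pr", "Qr")]


def mismatch_positions(first: str, second: str) -> list:
    """Return the 0-based positions at which two equal-length primers differ."""
    if len(first) != len(second):
        raise ValueError("primers must have the same length")
    return [i for i, (a, b) in enumerate(zip(first, second)) if a != b]


def differs_only_at_3prime(first: str, second: str) -> bool:
    """True when the only mismatch is the last (3'-terminal) base."""
    return mismatch_positions(first, second) == [len(first) - 1]


if __name__ == "__main__":
    for left, right in PAIRS:
        ok = differs_only_at_3prime(PRIMERS[left], PRIMERS[right])
        print(f"{left}/{right}: 3'-terminal discrimination = {ok}")
```

A check of this kind is cheap to run whenever primer sequences are copied between documents, since a single transcription error elsewhere in the primer would show up as an additional mismatch position.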

Statistical analyses

Genotype and haplotype frequencies were obtained by direct counting. We tested for deviations from Hardy-Weinberg proportions with the exact test of Guo and Thompson [37]. The haplotype frequency distributions of the populations examined by our group and by others were compared by applying the exact test of population differentiation of Raymond and Rousset [38]. Genetic differentiation among populations was estimated from haplotype frequencies using the Fst statistic, based on the analysis of molecular variance [39]. To compare statistical haplotype inference with physical haplotyping of SNPs, we re-inferred haplotypes from our own genotype data using the (maximum-likelihood) EM algorithm or the (pseudo-Bayesian) ELB algorithm, with the settings recommended by the authors [40, 41]. These statistical analyses were done using the software package ARLEQUIN version 3.1 [42]. Fisher's exact tests were performed for differences between individual haplotype frequencies, using the SISA software package (http://home.clara.net/sisa).
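To illustrate why statistical phasing can assign haplotypes that are not physically present, the sketch below implements the expectation-maximization idea for the simplest possible case: two biallelic SNPs, where only double heterozygotes have ambiguous phase. It is a toy version written for this illustration, not the ARLEQUIN implementation, and the function name and genotype encoding (number of copies of the second allele at each SNP) are assumptions.

```python
def em_two_locus(genotypes, n_iter=200, tol=1e-12):
    """Estimate frequencies of the four two-SNP haplotypes by EM.

    genotypes: list of (g1, g2), each the count (0, 1 or 2) of the second
    allele at SNP 1 and SNP 2. Returns frequencies indexed as
    [00, 01, 10, 11], where the digits are the alleles at SNP 1 and SNP 2.
    """
    n = len(genotypes)
    fixed = [0.0] * 4        # haplotype counts from unambiguous genotypes
    n_double_het = 0
    for g1, g2 in genotypes:
        if g1 == 1 and g2 == 1:
            n_double_het += 1  # phase unknown: 00/11 (cis) or 01/10 (trans)
            continue
        alleles1 = (0, 0) if g1 == 0 else (1, 1) if g1 == 2 else (0, 1)
        alleles2 = (0, 0) if g2 == 0 else (1, 1) if g2 == 2 else (0, 1)
        for a, b in zip(alleles1, alleles2):
            fixed[2 * a + b] += 1

    freqs = [0.25] * 4
    for _ in range(n_iter):
        # E-step: probability that a double heterozygote is in cis phase.
        cis = freqs[0] * freqs[3]
        trans = freqs[1] * freqs[2]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: expected haplotype counts divided by 2n chromosomes.
        counts = [fixed[0] + n_double_het * w,
                  fixed[1] + n_double_het * (1 - w),
                  fixed[2] + n_double_het * (1 - w),
                  fixed[3] + n_double_het * w]
        new = [c / (2 * n) for c in counts]
        converged = max(abs(x - y) for x, y in zip(new, freqs)) < tol
        freqs = new
        if converged:
            break
    return freqs


if __name__ == "__main__":
    sample = [(0, 0), (2, 2), (1, 1)]
    print(em_two_locus(sample))
```

In this example the double heterozygote is resolved towards the phase supported by the homozygotes. With rare variants there are few or no such informative genotypes, so the assigned phase rests largely on the starting frequencies, which is consistent with the caution expressed in this paper about haplotypes containing singletons.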

We calculated the following summary statistics of genetic diversity: the number of polymorphic sites (S), the nucleotide diversity over loci (π) and Watterson's θ, defined as 4Neμ, where Ne is the effective population size and μ the estimated mutation rate. We examined deviation from neutrality-equilibrium conditions using Tajima's D statistic [43], Fu and Li's D and Fu and Li's F without an outgroup (also known as D* and F*) [44] and Fay and Wu's H [45] tests. Significance was assessed by comparing the observed values to 10⁴ coalescent simulations, conditional on the observed sample size and on the value of S or on the value of θ, assuming a standard neutral model with no recombination. Deletions were excluded from all analyses. To see whether deviation from selective neutrality can be found in specific regions of the gene, we also tested the 5' upstream regulatory region (which includes the non-coding P, Q SNP) and the exon 1 coding region separately. The heterogeneity in π values and Tajima's D statistic across the sequenced region was also examined by use of the sliding window feature of the DnaSP program. Statistics were calculated for overlapping windows of 60 bp, placed at 15 bp intervals along the sequence. Neutrality tests and sequence diversity parameters were calculated using the DnaSP version 4.10.1 software [46].
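For readers who want to see how these quantities relate, the following Python sketch computes S, the mean number of pairwise differences, Watterson's θ and Tajima's D from a set of aligned sequences, and repeats the calculation in sliding windows of 60 bp at 15 bp intervals. It uses the standard published formula for Tajima's D; the function names are hypothetical, and details such as gap handling and the significance testing by coalescent simulation performed in DnaSP are not reproduced.

```python
from itertools import combinations
from math import sqrt


def diversity_stats(seqs):
    """Return (S, k, theta_w, D) for equal-length aligned sequences.

    S: number of segregating sites; k: mean pairwise differences
    (divide by the sequence length for per-site pi); theta_w: Watterson's
    estimator per sequence; D: Tajima's D, or None when undefined.
    """
    n = len(seqs)
    length = len(seqs[0])
    S = sum(1 for i in range(length) if len({s[i] for s in seqs}) > 1)
    pairs = list(combinations(seqs, 2))
    k = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    a1 = sum(1 / i for i in range(1, n))
    theta_w = S / a1
    if S == 0:
        return S, k, theta_w, None
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * S + e2 * S * (S - 1)
    if variance <= 1e-12:
        return S, k, theta_w, None  # e.g. fewer than four sequences
    return S, k, theta_w, (k - theta_w) / sqrt(variance)


def sliding_tajima(seqs, size=60, step=15):
    """Tajima's D in overlapping windows; returns (start, D) pairs."""
    length = len(seqs[0])
    results = []
    for start in range(0, max(length - size, 0) + 1, step):
        window = [s[start:start + size] for s in seqs]
        results.append((start, diversity_stats(window)[3]))
    return results


if __name__ == "__main__":
    singleton = ["AAAA", "AAAA", "AAAA", "AAAT"]
    balanced = ["AAAA", "AAAA", "AAAT", "AAAT"]
    print(diversity_stats(singleton))  # rare variant: D below zero
    print(diversity_stats(balanced))   # intermediate frequency: D above zero
```

The two toy alignments show the behaviour relied on in the Results: a singleton pulls D below zero, whereas a variant at intermediate frequency pushes it above zero.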

The Network 4.1.1.2 package (http://www.fluxus-technology.com/sharenet.htm) was used to construct the minimum-mutation network, which reflects the mutational relationships among the MBL2 haplotypes by means of the Median Joining (MJ) algorithm [47]. The MEGA 3.1 program (http://www.megasoftware.net/) was used to construct the phylogenetic maximum parsimony tree with bootstrap test. The time to the most recent common ancestor (TMRCA) of MBL2 was estimated using a relaxed molecular clock approach [48]. Evolutionary rate was modeled by the uncorrelated lognormal distribution and a coalescent prior (Bayesian skyline) was assigned to the tree. The average rate of molecular evolution of the MBL2 gene (1 × 10⁻⁷) was obtained using a theta per site value of 0.0039 calculated for human sequences in DnaSP [46] and the estimate of human effective population size of 10,000 [49]. A normal prior with mean 1 × 10⁻⁷ and standard deviation of 1 × 10⁻⁷ was used for the rate of evolution. Divergence time inference was conducted in BEAST 1.4.8 [50]. In order to obtain the posterior distribution of divergence times, the Markov chain was sampled 50,000 times and 10% of the states were discarded as burn-in.
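The rate used as the prior mean can be reproduced from the figures given above, assuming it was derived from the relation θ = 4Neμ stated under Statistical analyses. The Python sketch below performs this arithmetic and also counts the states retained after burn-in; the function names are hypothetical and the units of the rate are not specified beyond what the text states.

```python
def rate_from_theta(theta_per_site: float, effective_population_size: int) -> float:
    """Solve theta = 4 * Ne * mu for mu."""
    return theta_per_site / (4 * effective_population_size)


def states_after_burn_in(n_samples: int, burn_in_fraction: float) -> int:
    """Number of sampled states kept once the burn-in fraction is discarded."""
    return n_samples - int(round(n_samples * burn_in_fraction))


if __name__ == "__main__":
    mu = rate_from_theta(0.0039, 10_000)
    print(f"mu = {mu:.3g}")  # close to the 1e-7 used as the prior mean
    print(states_after_burn_in(50_000, 0.10), "states retained")
```

The computed value is slightly below 1 × 10⁻⁷, so the prior mean quoted in the text corresponds to this figure rounded to one significant digit.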