Background

The Fulani are a large and widely dispersed group of both nomadic herders and sedentary farmers living in the African Sahel/Savannah belt. Currently, they reside mostly in the western part of Africa, but some groups are dispersed up to the Blue Nile area of Sudan in the east [1, 2]. Although some historians postulated an origin of the Fulani in ancient Egypt or the Upper Nile valley [3], written records suggest that the Fulani spread from West Africa (currently Senegal, Guinea, Mauritania) around 1000 years ago, reaching the Lake Chad Basin 500 years later [4, 5]. They founded several theocratic states such as Massina [6], Sokoto [7], or Takrur [8], and many Fulani abandoned the nomadic lifeway and settled down, including in large urban centers. This expansion was accompanied by a process of absorption of sedentary peoples, called Fulanisation, that led to shifts in the ethnic identity of some sedentary peoples, as has been described in North Cameroon [9]. However, several Fulani groups retained their very mobile lifestyle, relying on the transhumance of their livestock and on cattle milking. These fully nomadic or at least semi-nomadic groups are still present in several Sahelian locations, especially in Mali [10], Niger [11], Central African Republic [12] and Burkina Faso [13, 14]. All Fulani speak Fulfulde, a West Atlantic language of the Niger-Congo family (a language continuum of various dialects), consistent with their postulated Western African ancestry [15].

Similar to other pastoralists, the Fulani experienced specific selection pressures probably associated with a lifestyle characterized by transhumance and herding [16, 17]. Lactase Persistence (LP) is a widely studied genetic trait with evidence of recent selection in populations who adopted pastoralism and heavily rely on dairy products, especially drinking fresh milk [18,19,20,21,22]. LP is associated with the control element of the LCT gene on chromosome 2 [18, 23,24,25,26,27,28,29]. Specific polymorphisms in this region prevent the down-regulation of the LCT gene during adulthood and confer the ability to digest lactose after weaning [18, 20, 29]. The LP trait is particularly frequent in northern European populations, pastoralists from East Africa, farmers and pastoralists from the Arabian Peninsula, and Arab speaking pastoralists from northeastern Africa and the Sahel/Savannah belt [20, 30,31,32,33]. To date, five different variants conferring LP in populations across the globe have been identified [20]. The independent genetic backgrounds of these polymorphisms suggest convergent adaptation in populations with dairy-producing domesticated animals.

The T-13910 allele is reported to be the key variant regulating maintenance of LCT gene expression in European adults. This variant is generally not detected in most East African and Middle Eastern populations, where other LP variants are observed instead [29,30,31, 33, 34]. Fulani populations living mainly in the western Sahel/Savannah belt, however, carry the European LP mutation at frequencies ranging from 18 to 60% [29, 35,36,37]. The presence of this “European” LP variant at relatively high frequencies across different Fulani populations is puzzling and could result either from convergent evolution in both Africa and Europe or from gene flow between ancestors of the Fulani and Europeans. The latter hypothesis is supported by the fact that T-13910 has not been detected (or is only present at very low frequencies) in neighbouring populations of the Fulani [29, 37] and that European admixture in Fulani genomes has been reported in previous studies [17, 38].

Details surrounding the European admixture event and the post-admixture selection of the European LP mutation in Fulani genomes remain unclear. Studies based on uni-parental markers reported higher frequencies of western Eurasian and/or North African mitochondrial DNA (mtDNA) and Y chromosome haplogroups in the Fulani than in neighbouring populations [39,40,41,42]. However, studies on Alu insertions did not lead to similar results, connecting instead the Fulani with East African pastoralists [43].

In this study we analyse genome-wide single nucleotide polymorphism (SNP) data from 53 Fulani pastoralists from Ziniaré, Burkina Faso to investigate the history of the Fulani population and the patterns of Eurasian admixture in their genomes, and to uncover the origin of the LP variant they carry. We perform genome-wide selection scans to investigate the strength of selection on the LP region and to identify additional genomic regions that experienced selection during adaptation to herding lifestyles in the Fulani. Lastly, we attempt to identify additional genomic regions associated with the ability to digest milk during adult life by performing, for the first time in published research, a genome-wide association study (GWAS) on the lactose tolerance phenotype in adults.

Results

Fulani ancestry and admixture

We started by investigating the genetic affinities of the Fulani from Ziniaré in Burkina Faso using a set of comparative populations from Africa, Europe and the Near East (Fig. 1a, Additional file 1: Table S1). The principal component analysis (PCA; Fig. 1b, Additional file 2: Figure S1) clusters the Fulani groups with other West Africans while displaying some genetic affinity to Eurasians. This prevalent West African component was also visible in the population structure analysis (Fig. 1c, Additional file 2: Figure S2), where the Fulani from Ziniaré in Burkina Faso have ancestry fractions of 74.5% West African, 21.4% European and 4.1% East African origin at K = 3. We observe a similar genetic structure among all other Fulani groups in our dataset, except for the Fulani from Gambia. Some individuals in this group display a higher European ancestry component than others, suggesting some degree of sub-structure in this population (Additional file 2: Figure S2). This result might suggest recent additional admixture between certain Fulani groups from Gambia and neighbouring West African groups or, alternatively, a shift in ethnic identity.

Fig. 1

a Geographic locations of the samples used in this study (map generated using R package Maps [44]) (b) Principal component analysis and (c) Population averaged cluster analysis for K = 3, 5 and 7 of the merged dataset of 1355 individuals and 297,954 autosomal SNPs. Full results of the cluster analyses are available in Additional file 2: Figure S2

We inferred the time of admixture in Fulani genomes based on patterns of linkage disequilibrium decay [45], with a generation time of 29 years [46, 47], and found evidence for two admixture events between groups with West African and European ancestries (Additional file 1: Table S2). The first admixture event is dated to 1828 years ago (95% confidence interval (CI): 1517–2138) between a parental population/s related to the West African ancestry groups in our dataset (Jola, Gurmantche, Gurunsi and Igbo) and a parental population carrying European ancestry (related to North-Western Europeans (CEU), Iberians (IBS), British (GBR), Tuscans (TSI), and Czech&Slovaks (CS) in our dataset). The second admixture event is dated to more recent times – 302 years ago (95% CI: 237–368) – and occurred between a West African group, with broadly similar ancestries compared to the first admixture event, and a European group. However, this European group is more related to present-day southwestern Europeans (Iberians (IBS) and Tuscans (TSI)).
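The relationship between the reported dates and the underlying generation counts can be sketched as follows. This is only an illustrative sketch: the function names are hypothetical, and the single-exponential curve is the general form used by LD-decay dating (amplitude, offset and distances are placeholders), not output from the software used in this study.

```python
import math

GENERATION_TIME_YEARS = 29  # generation time stated in the text


def years_to_generations(years, generation_time=GENERATION_TIME_YEARS):
    """Convert an admixture date in years back to generations."""
    return years / generation_time


def generations_to_years(generations, generation_time=GENERATION_TIME_YEARS):
    """Convert an admixture date in generations to years."""
    return generations * generation_time


def expected_weighted_ld(distance_morgans, generations, amplitude=1.0, offset=0.0):
    """Single-pulse LD-decay curve: amplitude * exp(-generations * distance) + offset.

    Older admixture (more generations) makes the curve decay faster with
    genetic distance, which is what allows the date to be estimated.
    """
    return amplitude * math.exp(-generations * distance_morgans) + offset


if __name__ == "__main__":
    for label, years in [("older event", 1828), ("recent event", 302)]:
        print(label, round(years_to_generations(years), 1), "generations")
```

With a 29-year generation time, the two reported dates correspond to roughly 63 and 10 generations, respectively.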

In addition to the SNP typing, we sequenced the LP region in intron 13 of the MCM6 gene (upstream of the LCT gene) in the Fulani, Czech and Slovak individuals, using Sanger sequencing. Of the known LP mutations in intron 13 of the MCM6 gene, the Fulani from Ziniaré, Burkina Faso, only have the "European" LP T-13910 variant. We observed a T-13910 allele frequency of 48.0%, while the genome-wide European admixture fraction in the Fulani is 21.4% at K = 3. The notable European admixture fraction in the Fulani coupled with the high frequency of the LP T-13910 allele suggests the possibility of adaptive gene flow into the Fulani gene pool.

We reconstructed the local ancestry of the region surrounding the T-13910 allele and across chromosome 2 for three Fulani groups (Fulani from Ziniaré, combined with West-Central African Fulani, and Fulani from Gambia), assuming either two or three ancestral sources: West African and European from the high density dataset A; and West African, European and North African from the lower density dataset B (Fig. 2a, Additional file 2: Figures S3 and S4). The European genome proportions in the LP region were 0.519 and 0.491 for the two datasets, respectively, and in both cases all segments carrying the T-13910 allele were assigned to a European ancestry. The region extends for over 2 Mb and contains 8 genes, including LCT and MCM6 (Fig. 2b), and haplotype lengths are similar in other Fulani groups in the dataset (Additional file 2: Figure S3). For the dataset where North Africans were included as a parental population, the region near the LP variant departs 5.58 standard deviations (SD) from the genome-wide average of European ancestry (mean = 0.128, Additional file 2: Figure S4). Looking in closer detail at the haplotype structure of this region, we observe that the haplotype carrying the mutation occurs at high frequency and shows decreased diversity surrounding the T-13910 allele, compared to the alternative (ancestral) C-13910 allele (Additional file 2: Figures S5, S6), indicating a strong selective sweep. Furthermore, in haplotype networks of the region, the haplotypes carrying the T-13910 allele in the Fulani cluster with European haplotypes (Additional file 2: Figure S7). Our results therefore strongly support that the T-13910 LP allele occurs on a European haplotype background and was introduced into Fulani genomes by admixture rather than arising through an independent convergent adaptation event.

Fig. 2

a Ancestry specific inference of chromosome 2 of haplotypes carrying allele T–13910 and (b) regional zoom-in. c Genome-wide distribution of randomly sampled fragments being flanked by North-African-like segments over 10,000 bootstrap tests. The line in red represents the observed average proportion of European-like segments flanked by North-African-like segments in the Fulani from Burkina Faso

To examine which particular source population was a likely candidate for this postulated European contact, we extracted all European-like segments across the Fulani genomes. We performed f3 outgroup analyses on the regions showing a European background (on the dataset with a separate North African component in the Fulani genomes – Extended Dataset B, Additional file 2: Figure S8). The European-like segments showed the highest shared drift with Sardinians and French Basque populations, although based on the confidence intervals we could not specifically pinpoint any of the European groups included in the test. A previous study has reported a Mozabite-like (i.e. Berber-like) component in the Fulani from Burkina Faso and Niger [17], raising the possibility that the source population for the European admixture fraction (and LP mutation) could be of North African origin. This is difficult to observe in our clustering results since the Fulani form their own cluster (at K = 4) before a North African component becomes visible (Fig. 1c, Additional file 2: Figure S2). We therefore re-ran the clustering analysis with a supervised approach (Additional file 2: Figure S9) and observed that the ancestry components of the Mozabite group could explain the non-West African genetic variation in the Fulani.

To further investigate the origin of the European ancestry segments in the Fulani, we analysed the flanking regions of European segments in their genomes. We observed a significant enrichment of North African ancestry in regions flanking European fragments. On average, European fragments in Fulani genomes are flanked by North African segments with a frequency of 0.302. To test for enrichment, we performed a bootstrapping test by randomly drawing fragments in the genome and recording their flanking regions (Fig. 2c and Method section) and observed a highly significant association between European and North African segments in the Fulani genomes (p-value < 1 × 10⁻⁴). These results suggest that it is unlikely that both ancestries would have been introduced by separate gene-flow events. To further test this, we simulated admixture scenarios (using genome-wide ancestry proportions of North Africa, Europe and West Africa in the Fulani genomes) and inferred the expected proportion of European haplotypes surrounded by North African ancestry in case of independent admixture events. If the European and North African segments were introduced by independent contact with European and North African groups, respectively, we would expect admixed segments to follow, on average, a random distribution across the genomes. In none of the 100 simulated populations were European segments surrounded by North African segments as frequently as we observe in the Fulani from Ziniaré, Burkina Faso (Additional file 2: Figure S10, p-value < 0.01), indicating that the two ancestries, at least in this Fulani population from Ziniaré, were not introduced by two separate events.
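The logic of the flanking-segment bootstrap can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the ancestry labels ("EUR", "NAF", "WAF"), the representation of a haplotype as an ordered list of segment labels, and the function names are hypothetical, and segment lengths are ignored, so this is not the procedure used in the study.

```python
import random


def flanked_fraction(segments, indices, flank="NAF"):
    """Fraction of the segments at `indices` with at least one neighbour labelled `flank`."""
    hits = 0
    for i in indices:
        left = segments[i - 1] if i > 0 else None
        right = segments[i + 1] if i + 1 < len(segments) else None
        if flank in (left, right):
            hits += 1
    return hits / len(indices)


def flanking_bootstrap(segments, target="EUR", flank="NAF", n_boot=10_000, seed=0):
    """Compare the observed flanking fraction of `target` segments with random draws.

    Returns the observed fraction and an empirical one-sided p-value, i.e. the
    share of random draws (same number of segments) that are flanked by
    `flank` at least as often as the observed target segments.
    """
    rng = random.Random(seed)
    target_idx = [i for i, label in enumerate(segments) if label == target]
    observed = flanked_fraction(segments, target_idx, flank)
    exceed = 0
    for _ in range(n_boot):
        sample = rng.sample(range(len(segments)), len(target_idx))
        if flanked_fraction(segments, sample, flank) >= observed:
            exceed += 1
    # Add-one correction so the empirical p-value is never exactly zero.
    return observed, (exceed + 1) / (n_boot + 1)


if __name__ == "__main__":
    # Made-up haplotype in which every EUR segment sits between NAF segments.
    toy = ["WAF", "NAF", "EUR", "NAF", "WAF", "WAF", "WAF", "WAF"] * 5
    print(flanking_bootstrap(toy, n_boot=2000))
```

In the toy haplotype every European-like segment is embedded in North African-like ancestry, so random draws rarely match the observed fraction and the empirical p-value is small.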

This scenario was further confirmed by testing specific demographic models using admixture graphs [45]. A model describing the Fulani as an admixed group between Mozabite and a West African group has a slightly lower Z-score (0.066) compared to a model where the Fulani result from admixture between a West African group and a western European group (CEU, 0.091) (Additional file 2: Figure S11 A and B). However, when both Europeans and North Africans are included in the admixture graph models, a model that assumes that European ancestry is first admixed into North African ancestry and then introduced into the Fulani (Additional file 2: Figure S11 C) is significant (Z-score = 0.926), whereas the model where Europeans directly mixed with West Africans to produce the Fulani is not significant (Additional file 2: Figure S11 D).

Lactase persistence in the Fulani

We established here that Fulani genomes acquired European admixture and the lactase persistence T-13910 allele by admixing with a North African population. Results from a Lactose Tolerance Test and Sanger sequencing on a larger group of Fulani, Czech & Slovak individuals (see Method section) showed that carriers of the −13910*T allele (both TT–13910 and CT–13910 genotypes) have significantly higher glycemic levels than individuals homozygous for the −13910*C allele (Additional file 2: Figure S12, S13, Additional file 1: Table S3, S4). These results clearly associate the −13910*T allele with the LP phenotype and point to a dominant effect of the −13910*T allele in both Fulani and Czech & Slovak populations. To our knowledge, no attempt has previously been made to identify, in a genome-wide setup, other regions of the genome associated with the ability to digest milk in adult life, either in the Fulani or in any other group.

To investigate whether other parts of Fulani genomes are involved in the ability to digest lactose, we performed a genome-wide association scan (GWAS, Fig. 3a, Additional file 2: Figure S14, S15) for the glycemic measurement phenotype. This GWAS identified two regions, on chromosome 2 and chromosome 13 respectively, that clearly stand out. Even though none of the peaks reached the overly conservative Bonferroni multiple test correction threshold (due to small sample sizes and a large number of markers), the two prominent peaks on chromosomes 2 and 13 indicate an association with glucose levels in the bloodstream after ingestion of lactose (Fig. 3a). As expected, the chromosome 2 peak overlaps with the region that contains the T-13910 mutation near the LCT gene (p-value = 3.17 × 10⁻⁶, Fig. 3b). To test to what extent the −13910 SNP explains the phenotype, we calculated the effect size of the −13910 SNP based on a linear model. We observed that 35.1% of the residual variance can be explained by the T-13910 allele (p-value = 3.709 × 10⁻⁷). Surprisingly, however, the region on chromosome 13 showed a slightly higher association with the phenotype in our GWAS analysis, with the highest association for the rs6563275 SNP (p-value = 1.03 × 10⁻⁶, Fig. 3c). This region does not contain any gene, but it is located ~ 2.7 megabase pairs (Mb) upstream of the SPRY2 gene (the nearest gene). The rs6563275 SNP had an effect size of 38.7% (p-value = 6.62 × 10⁻⁸). For the rs6563275 and −13910 SNPs together, a combined effect size of 59.2% (p-value = 3.01 × 10⁻¹²) was estimated. The regions seem to act independently of each other: controlling for one SNP in the GWAS did not affect the other peak (Additional file 2: Figure S16). Moreover, controlling for the top SNP in each region seems to completely remove the association in that region, indicating that one SNP/haplotype per region is responsible for the associations (Additional file 2: Figure S16).
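As an illustration of what an effect size expressed as "variance explained" means, the sketch below computes R² for a single predictor by ordinary least squares. It is a simplified stand-in for the linear model described above: covariates such as study group are ignored, the dominant carrier coding (0 = CC, 1 = CT or TT) is an assumption, and the phenotype values are invented for the example.

```python
def r_squared(predictor, phenotype):
    """Proportion of phenotype variance explained by a single predictor.

    For simple linear regression this equals the squared Pearson correlation.
    """
    n = len(predictor)
    mean_x = sum(predictor) / n
    mean_y = sum(phenotype) / n
    sxx = sum((x - mean_x) ** 2 for x in predictor)
    syy = sum((y - mean_y) ** 2 for y in phenotype)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predictor, phenotype))
    return (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    # Hypothetical data: carrier status for the T allele and maximal glucose rise.
    carrier = [0, 0, 0, 1, 1, 1]
    glucose_rise = [0.4, 0.9, 0.6, 1.8, 2.4, 1.5]
    print(f"variance explained: {r_squared(carrier, glucose_rise):.3f}")
```

A joint effect size for two SNPs would require a multiple regression with both predictors, which this single-predictor sketch does not cover.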

Fig. 3

a P-values of the genome-wide association with the glycemic differentiation test after lactose ingestion (conditioned on study group). The triangular-shaped dot represents the Bonferroni p-value with alpha = 0.05. b, c Zoom-ins of the chromosome 2 and 13 regions, respectively. d P-values of integrated haplotype scores (iHS) across the genome and (e, f) chromosome 2 and 13 regional zoom-ins. g FZR (Fulani, Burkina Faso) and YRI (Yoruba, Nigeria) cross-population extended haplotype homozygosity (XP-EHH) across the genome and (h, i) chromosome 2 and 13 regional zoom-ins

To test the impact of selection in Fulani genomes over the LCT, SPRY2 and other regions across the whole genome, we calculated integrated haplotype scores (iHS) [48] and cross-population extended haplotype homozygosity (XP-EHH) [49] with the Yoruba as a comparative group (Fig. 3d-i). Both tests showed clear signals of positive selection at the −13910 LP region on chromosome 2 in the Fulani. The LP region contained the highest peak for both scans (with 18.9 and 10.0 SD from the genome-wide average, respectively). The XP-EHH results clearly showed the T-13910 allele as being selected in the Fulani compared to the Yoruba population (which does not carry any known LP variant) (Fig. 3h). The region surrounding the rs6563275 SNP on chromosome 13, however, did not display any signal of recent selection in our scans (Fig. 3f, i). We calculated a selection coefficient for the −13910 LP region on chromosome 2 in the Fulani using Mozabite and CEU as parental populations, respectively (Additional file 2: Figure S17). We found that a selection coefficient between 0.034 and 0.036 is necessary to explain the T-13910 allele frequency in the Fulani population, with the assumption of a constant allele frequency over time in the parental populations.
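How a selection coefficient relates a starting allele frequency to a present-day frequency can be sketched with a deterministic recursion. This is a generic textbook model under stated assumptions, not the estimation procedure used in the study: selection is assumed to be dominant (carriers have fitness 1 + s), drift and continued gene flow are ignored, and the starting frequency of 0.10 and the 63 generations are hypothetical inputs chosen only to make the example run.

```python
def next_frequency(p, s):
    """One generation of selection on a dominant advantageous allele.

    Fitnesses: TT = CT = 1 + s, CC = 1, with random mating.
    """
    q = 1.0 - p
    mean_fitness = 1.0 + s * (1.0 - q * q)
    return p * (1.0 + s) / mean_fitness


def frequency_after(p0, s, generations):
    """Allele frequency after a number of generations of constant selection."""
    p = p0
    for _ in range(generations):
        p = next_frequency(p, s)
    return p


def solve_selection_coefficient(p0, target, generations, lo=0.0, hi=1.0, iterations=60):
    """Bisection search for the s that moves p0 to `target` in `generations`."""
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if frequency_after(p0, mid, generations) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Hypothetical inputs: starting frequency 0.10, target 0.48, 63 generations.
    s_hat = solve_selection_coefficient(0.10, 0.48, 63)
    print(f"s = {s_hat:.4f}")
```

The bisection relies on the final frequency increasing monotonically with s, which holds for this recursion when the starting frequency lies strictly between 0 and 1.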

A number of other potential selection signals were observed across Fulani genomes (Additional file 1: Table S5). A particularly strong selection signal was observed on chromosome 18, where the XP-EHH test showed the second highest genome-wide region value (9.2 SD), comparable to that of the MCM6/LCT region. This signal seems to correspond to the PTPRM gene that encodes a tyrosine phosphatase enzyme highly expressed in adipose tissues and associated with HDL cholesterol levels, body weight and type 2 diabetes [50,51,52]. Furthermore, the iHS selection scan identified the region around the MAN2A1 gene to be under selective pressure (p-value departing 17.0 SD from average). This gene encodes a glycosyl hydrolase found in the gut that functions in liberating α-glucose and β-glucose. Both these selection signals could represent additional indicators of dietary adaptation in the Fulani population.

Discussion

The Fulani people are the most widespread pastoralist group in the Sahel/Savannah belt, living (today) in a very large area that extends from the Fouta Djallon in Guinea to the Blue Nile in Ethiopia and Sudan. Even though an origin in the central Sahara has been suggested on archaeological grounds [53], we found that the contemporary Fulani have a predominant West African genetic background combined with North African and European ancestry fractions (Fig. 1b, Additional file 2: Figure S4, S9). These estimated genomic ancestry components, based on an in-depth genome analysis of a Fulani group from Ziniaré, Burkina Faso, are comparable to those inferred in previously studied Fulani groups from other regions of Africa [17, 38, 54]. The sub-Saharan ancestry in the Fulani clusters close to West African Niger-Congo speakers, represented in our dataset by e.g. Wolof, Jola, Gurmantche, and Igbo (Fig. 1b, Additional file 2: Figure S1, Additional file 1: Table S2). The identification of the specific ancestry fragments flanking European-like segments, supervised admixture and model-based analyses support the view that the European ancestry in Fulani genomes is coupled to their North African component (Fig. 2c, Additional file 2: Figures S9–S11). These two genetic ancestries have been intertwined in the northwestern part of the African continent for at least the last 3000 years [55]. Fregel and colleagues (2018) linked the diffusion of people across Gibraltar to Neolithic migrations and the Neolithic development in North Africa [55]. This trans-Gibraltar mixed ancestry was previously observed in the Fulani mitochondrial gene pool, which links the Fulani to south-western Europe based on mtDNA haplogroups H1cb1 and U5b1b1b [41].

We inferred that the non-West African ancestry proportions in the Fulani were introduced through two admixture events (Additional file 1: Table S2), dated to 1828 years ago (95% CI: 1517–2138) and 302 years ago (95% CI: 237–368). The older date compares well with previous dating efforts of the admixture event in the Fulani from Gambia (~ 1800 years ago) [56, 57], indicating a similar genetic history between the Fulani groups of Gambia and Burkina Faso. We hypothesize that the postulated first admixture between West African ancestors of the Fulani and an ancestral North African group/s possibly favoured, or even catalysed, changes in their lifeways and consequently led to the Fulani expansion throughout the Sahel/Savannah belt. This view is consistent with traces of pastoralism in the West African Savannah (northern Burkina Faso, in particular), starting around 2000 years ago according to archaeozoological data [58]. The second admixture event dates to more recent times and involves a Southwestern European source (Additional file 1: Table S2). This event can possibly be explained either by subsequent gene-flow between the Fulani and North Africans (who carry considerable admixture proportions from Europeans due to trans-Gibraltar gene-flow) or by the European colonial expansion into Africa.

In the demographic model predictions where only one non-West African parental population is included (Additional file 2: Figure S11 A and B), both European and North African admixture can potentially explain the admixed part of the Fulani genetic composition. However, if both ancestries are present in the demographic model (Additional file 2: Figure S11 C and D), only a North African ancestry population (mixed with a European population) can be a potential ancestor to the Fulani from Burkina Faso, whereas the model where Europeans directly mixed with West Africans to produce the Fulani is not significant. These results stress the importance of demographic context when identifying potential sources of admixture, when the sources have a similar genetic background.

The ability to digest milk during adulthood is a well-known case of recent selection in genomes of pastoralist and farming groups across the globe. The five independent mutations in intron 13 of the MCM6 gene have been widely investigated and the association with expression of the LCT gene after the weaning period has been well established [18, 20, 59]. The LP trait is associated with one of the most well-known signals of genetic adaptation to food-producing Neolithic lifestyles. High frequencies of the European-specific LP variant T-13910 are observed in Fulani groups across the Sahel/Savannah belt (Additional file 1: Table S6). It is thought that the sustained expression of the LCT gene into adulthood confers a dietary advantage in human populations who practice pastoralism for milk production. In our study, the selection coefficient (s) estimates for the LP trait in the Fulani (Additional file 2: Figure S17), 0.034–0.036, are comparable to previously calculated selection coefficients for LP in African populations: within East African groups s ranges between 0.035 and 0.077 (under a dominant model, [18]), and it is 0.04–0.05 in Nama pastoralists of Southern Africa [21].

To date no publication has used a genome-wide approach to investigate whether other genomic regions are associated with the LP phenotype (Fig. 3a-c, Additional file 2: Figure S14-S16). Here we confirmed, at a genome-wide level, the association between the previously identified chromosome 2 LP region and the phenotype. Additionally, we identified another signal associated with the ability to digest lactose (and generate glucose in the blood), on chromosome 13. We report here a strong association between glycemic levels (after lactose ingestion) and a region 2.7 Mb upstream of the SPRY2 gene on chromosome 13. Previous GWAS studies have associated the SPRY2 gene with adiposity and metabolism impairment [60], and with type 2 diabetes in Asian cohorts [61,62,63]. The importance of the association is possibly highlighted by a study that found that mice displayed hyperglycemia when the SPRY2 gene was knocked down [64], indicating that it is possible that the rate/extent of glucose formation is influenced by the SPRY2 gene. This gene has not previously been linked to lactase persistence, and this region could possibly be linked to an additional genetic variant that confers increased ability to digest lactose as adults. However, a more likely scenario is that the association we observe does not reflect the LP trait itself, but rather the involvement of the genomic region in the subsequent steps of glucose production. The latter hypothesis is supported by the fact that this region did not seem to have undergone selection in the Fulani, unlike the LP region on chromosome 2.

Genome-wide selection scans showed the chromosome 2, T-13910 region, to be under strong selection, confirming that the European haplotype carrying the T-13910 mutation experienced adaptive gene-flow into the Fulani gene pool. Additional strong selection signals in the Fulani were found for genomic regions carrying the MAN2A1 gene that encodes a glycosyl hydrolase and the PTPRM gene that encodes a tyrosine phosphatase expressed in the adipose tissues. These genes might represent other selection events in Fulani genomes to adapt to diets related to pastoralist lifeways. Higher consumption of sugars and fat contained in milk from domesticated animals might have triggered selective pressures in variants located within various genes leading to several dietary adaptations in the Fulani.

Conclusion

The complete history of the Fulani pastoralists remains to be uncovered, but through the genetic analyses performed in this study (based on the Fulani population from Ziniaré, Burkina Faso) we show that present-day Fulani genomic diversity developed from admixture between a West African group and a group/s that carried European and North African ancestry. The European LP variant was likely introduced through this admixture event, and was strongly selected in successive generations, in a similar way to the TAS2R gene family [17]. Our results further showed that the LP region was not the only region that was under strong recent selective pressure in the Fulani ancestors, and several other selection signals point to dietary adaptations. It may well have been these and other similar selective advantages in the Fulani that contributed to their population expansion and long range spread across the Sahel/Savannah belt of Africa.

Methods

Sample collection

Sample collection in Burkina Faso was conducted in collaboration with the Burkinabe CNRST (Centre national de la recherche scientifique et technologique) institution in Ouagadougou. Measurements of LP phenotypes and collection of saliva (Oragene kit, DNA Genotek) of the Fulani (FZR, n = 56) were carried out in three nomadic camps located, at the time, northeast of Ziniaré (longitude − 1.241395; latitude 12.620579) with research permit No 0495 and help of local assistants. Informed consent was obtained from all the participants included in the study before samples were collected. Additional DNA and LP phenotype measures were collected for a comparative dataset of 63 unrelated volunteers based in Prague with Czech or Slovak (CS) nationality. All Czech and Slovak volunteers signed informed consent on anonymous use of their sample. This study was approved by the Ethical Committee of the Charles University in Prague (approval no. 2016/07) and by the Swedish Ethical Review Authority (approval no. 2019–00479).

Phenotype test

For estimation of lactase activity, we used the lactose tolerance test (LTT), which is based on the measurement of an increase of blood glucose (glycemia) after consumption of 50 g of lactose on an empty stomach [18, 65]. Blood glucose was measured by eBsensor (Visgeneer Inc.). Volunteers were asked to starve overnight (minimally 8 h) and their base-line blood glucose was measured afterwards. Then they were asked to drink 50 g of lactose dissolved in 200 ml of water (which is equivalent to the amount of lactose in 1 to 2 litres of cow’s milk) [18, 65]. Blood glucose was measured 20, 40 and 60 minutes after ingestion. The maximal difference from the base-line from these three measurements was used in genotype-phenotype comparisons.
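The phenotype definition above (the maximal difference from baseline across the three post-ingestion measurements) can be written as a small helper. The function name and the example readings are hypothetical, and the values are unit-agnostic.

```python
def max_glucose_rise(baseline, post_lactose):
    """Maximal difference from the fasting baseline across post-lactose readings.

    `post_lactose` holds the readings taken after lactose ingestion
    (20, 40 and 60 minutes in the protocol described in the text).
    """
    if not post_lactose:
        raise ValueError("at least one post-lactose measurement is required")
    return max(reading - baseline for reading in post_lactose)


if __name__ == "__main__":
    # Invented readings for one volunteer: baseline, then 20, 40 and 60 minutes.
    print(max_glucose_rise(5.0, [5.4, 6.8, 6.1]))
```

Because the maximum of the differences is taken, the value can be negative if every post-ingestion reading falls below the baseline.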

Sequencing of 359-bp fragment in intron 13 of the MCM6 gene

To detect which LP mutations were present in the 56 Fulani and 63 Czech/Slovak genomes, we Sanger sequenced intron 13 of the MCM6 gene with a previously reported set of primers [66]. The primers cover a 359-bp long fragment where 5 known LP associated variants are located. Polymerase Chain Reaction (PCR) products were Sanger sequenced by Macrogen (Korea).

Genome-wide SNP typing

A subset of 55 Fulani and 7 Czech/Slovak individuals was selected for genome-wide genotyping on the Illumina Omni2.5-Octo BeadChip (which contains the T–13910 SNP). The data were aligned to human genome build 37.

Data management and quality filtering was carried out using PLINK v.1.90 software [67]. A total of 2,608,742 SNPs were obtained from the 62 individuals. All individuals passed a 0.15 data missingness threshold. We subsequently filtered to keep only autosomal SNPs with a SNP missingness filter of 0.1. To account for possible genotyping errors, we applied a Hardy-Weinberg equilibrium filter (HWE) that excluded 90 SNPs (for p ≤ 1e-4). AT and CG SNPs were excluded to prevent strand flipping errors when merging with comparative datasets. Relatedness was measured by Identity-By-State analysis and two Fulani individuals were excluded due to potential genetic relatedness. A total of 2,359,821 SNPs and 60 individuals were kept for the study.
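The reason for excluding AT and CG SNPs is that their two alleles are complementary, so the strand cannot be inferred from the alleles alone when merging datasets. A minimal sketch of such a filter is shown below; it is an illustration in plain Python with hypothetical function names and made-up SNP identifiers, not the PLINK procedure used in the study.

```python
# Allele pairs that read the same on both strands (A<->T, C<->G are complements).
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


def is_strand_ambiguous(allele1, allele2):
    """True for A/T and C/G SNPs, whose strand cannot be resolved from alleles."""
    return frozenset((allele1.upper(), allele2.upper())) in AMBIGUOUS_PAIRS


def drop_ambiguous(snps):
    """Keep only SNPs that are not strand-ambiguous.

    `snps` is an iterable of (snp_id, allele1, allele2) tuples.
    """
    return [snp for snp in snps if not is_strand_ambiguous(snp[1], snp[2])]


if __name__ == "__main__":
    # Hypothetical SNP records for illustration only.
    records = [("snp1", "A", "T"), ("snp2", "A", "G"), ("snp3", "G", "C"), ("snp4", "C", "T")]
    print(drop_ambiguous(records))
```

In this example, only the A/G and C/T records remain after filtering.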

Merging with previously published data

We merged the new data with published comparative datasets following the same quality control criteria as described above. We added 1295 samples from 39 populations (full descriptive list in Additional file 1: Table S1).

For the first dataset (dataset A) we compiled selected groups from the 1000 genomes project [68] and a Sahelian dataset [17] to merge with the newly generated data. After merging and quality filtering, dataset A had 785 individuals and 1,968,522 autosomal SNPs. This dataset was used in initial analyses of the genetic affinities, selection scans, GWAS and the local ancestry analyses using RFMix involving two reference population sources in the Fulani. We added additional samples to dataset A in order to get a more representative picture of past demographic history by compiling dataset B, which covers 1355 individuals from 41 populations and 297,954 autosomal SNPs (Additional file 1: Table S1). Dataset B was used in studies of genetic affinities (such as PCA, admixture analyses and f-statistic methods) and local ancestry analyses using three reference population groups. Dataset B was extended with additional Eurasian populations for f3-statistics with the Fulani European-like segments (see European-specific analysis and F-statistics based methods section). Furthermore, a San population from Namibia [69] was also included as outgroup in the demographic models using qpGraph (see European-specific analysis and F-statistics based methods section). Datasets were phased using SHAPEIT software v.2 [70], using the HapMap II recombination map.

Population structure analyses

Population structure analyses were performed on dataset B. We carried out a Principal Component Analysis (PCA) comparing Fulani individuals to the comparative groups of dataset B. The PCA was performed with EIGENSOFT v.7.2.1 [71, 72] under default settings. We inferred population structure with ADMIXTURE v.1.3.0 [73]. The number of clusters (K) was set from 2 to 10, with 20 replicates for each K. Cluster inference and visual inspection were performed with Pong v.1.4.5 [74].

Estimating admixture dates

To estimate the time of possible admixture events we used a linkage disequilibrium (LD) decay based method. The date estimations were done for dataset B using Malder v.1.0, ADMIXTOOLS package v.5.0 [45]. The HapMap II genetic map was used as recombination map.
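
The principle behind LD-decay dating can be sketched as follows. This is a simplification and not the procedure implemented in Malder: it assumes a single admixture pulse, for which admixture LD decays approximately as A·exp(−n·d), where d is the genetic distance in Morgans and n the number of generations since admixture, and it recovers n by a log-linear least-squares fit. The function name and the synthetic curve are hypothetical.

```python
import math


def fit_admixture_generations(distances_morgans, weighted_ld):
    """Estimate generations since a single admixture pulse.

    Assumes weighted_ld ~ A * exp(-n * d), so ln(weighted_ld) is linear in d
    with slope -n. This single-exponential fit is an illustrative
    simplification, not the Malder algorithm.
    """
    xs = list(distances_morgans)
    ys = [math.log(value) for value in weighted_ld]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -(sxy / sxx)  # slope of the fitted line is -n


# Synthetic, noise-free decay curve generated with n = 50 generations.
toy_distances = [0.005 * i for i in range(1, 21)]  # 0.5 cM to 10 cM
toy_ld = [0.02 * math.exp(-50 * d) for d in toy_distances]
estimated_generations = fit_admixture_generations(toy_distances, toy_ld)
print(round(estimated_generations, 3))
```

Converting the estimate to calendar years requires an assumed generation time, which is not specified in this section.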

Local ancestry analyses

We used RFMix v.1.5.4 [75] to identify the local ancestry of genomic fragments in Fulani genomes. An initial RFMix run was performed with two ancestral populations, represented by 50 YRI (Yoruba in Ibadan, Nigeria) individuals for West African ancestry and 50 CEU individuals for European ancestry, on a total of 1,968,522 autosomal SNPs. We ran the RFMix analyses with two extra iterations to account for admixture in the source populations and to minimize assignment errors. We set the minimum number of reference haplotypes per tree node to 5 and the number of generations since admixture to 30. We used the HapMap II genetic map as the recombination map, and positions outside the map windows were excluded. We ran an additional RFMix analysis with similar settings, adding 30 Mozabite individuals as a third parental source to account for potential North African ancestry in Fulani genomes, on a total of 297,954 autosomal SNPs.

Haplotype plots and networks

We extracted haplotypes spanning approximately 1.1 Mb (positions 135,759,095 to 136,824,836) surrounding the T–13910 variant (position 136,608,646) on chromosome 2 from our phased dataset. The region was chosen based on the RFMix results, and the haplotypes were sorted by the allele at position T/A–13910. We used the same region to construct a network with the NETWORK software package v.5.0.0.1, employing the median joining (MJ) algorithm [76] and the maximum parsimony (MP) option [77]. The parameter “Frequency > 1” was activated, so that unique haplotypes are not shown in the network plots.

European-specific analysis and F-statistics based methods

To test the affinity of the European-specific fragments in the Fulani genomes, we first kept only the European-like regions of the Fulani genomes based on the RFMix output for dataset B. Outgroup f3-statistics and pairwise FST were obtained using ADMIXTOOLS v.5.0. To avoid an affinity bias towards CEU (the European parental population in RFMix), the individuals previously selected to represent the European source in RFMix were replaced by the remaining 50 individuals of the CEU population in the 1000 Genomes Project.

To assess how often a European-like (CEU) fragment is flanked by a North African-like (Mozabite) fragment, we recorded the number of European fragments flanked by a North African ancestry region relative to the total number of European fragments across all Fulani datasets. To test whether North African regions flank European regions more often than expected, we performed a bootstrap test: we randomly selected fragments across the genome (equal in number to the European-like fragments) and determined how often a random fragment was flanked by a North African ancestry region. This analysis was repeated 9999 times.
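
The logic of this resampling test can be sketched as below. The sketch makes several assumptions that are not stated in the text: ancestry tracts are represented as an ordered list of labels along one haplotype, a fragment counts as flanked if at least one neighbour is North African-like, random fragments are drawn with replacement, and the p-value is the proportion of replicates at least as extreme as the observation. The labels "EUR", "NAF" and "WAF" and all function names are hypothetical.

```python
import random


def flanked_fraction(segments, indices, flank="NAF"):
    """Fraction of segments at `indices` with at least one `flank` neighbour."""
    hits = 0
    for i in indices:
        left = segments[i - 1] if i > 0 else None
        right = segments[i + 1] if i < len(segments) - 1 else None
        if flank in (left, right):
            hits += 1
    return hits / len(indices)


def flank_resampling_test(segments, target="EUR", flank="NAF",
                          n_rep=9999, seed=0):
    """Compare the observed flanked fraction with randomly drawn fragments."""
    rng = random.Random(seed)
    target_idx = [i for i, label in enumerate(segments) if label == target]
    observed = flanked_fraction(segments, target_idx, flank)
    all_idx = list(range(len(segments)))
    at_least_as_extreme = 0
    for _ in range(n_rep):
        # Assumption: draw as many random fragments as there are target ones.
        draw = rng.choices(all_idx, k=len(target_idx))
        if flanked_fraction(segments, draw, flank) >= observed:
            at_least_as_extreme += 1
    p_value = (at_least_as_extreme + 1) / (n_rep + 1)
    return observed, p_value


# Toy ancestry tract order along one haplotype, not data from the study.
toy_segments = ["WAF", "NAF", "EUR", "NAF", "WAF", "WAF",
                "WAF", "WAF", "NAF", "EUR", "WAF", "WAF"]
observed, p_value = flank_resampling_test(toy_segments)
print(observed, p_value)
```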

We also tested if the different ancestry distributions across the genome could be explained by two independent admixture events. We simulated 53 phased individuals containing a conservative number of 10,000 haplotypes. These individuals hold similar admixture fractions to the Fulani (0.13 European, 0.19 North African and 0.68 West African). We tested if the three different ancestries follow a random distribution across the genomes, as expected under a neutrality model assumption. This simulation test was performed 100 times and the frequency of a European-like fragment being surrounded by North African-like fragments was recorded.

Lastly, we tested population models with specific admixture events using qpGraph, within the ADMIXTOOLS v.5.0 package [45]. The models were fitted using Fulani, CEU, Mozabite and Yoruba individuals with default parameters.

GWAS

The Fulani (FZR) and Czech & Slovakian (CS) individuals (samples extracted from dataset A) were used in a genome-wide association study (GWAS) of the association between their LTT phenotype and their genotypes. The analysis was performed using the GenABEL v.1.7–6 R package [78]. A SNP call rate cut-off of 0.95, a minimum minor allele frequency of 0.1, a HWE p-value cut-off of 1e-08 and a false discovery rate of 0.95 were applied. We classified the phenotypic trait into lactose intolerant (glucose level < 1.1 mmol/l), intermediate tolerance (1.1 mmol/l < glucose level < 1.7 mmol/l) and lactose tolerant (glucose level > 1.7 mmol/l), as in previous studies [18, 29, 65]. We applied the qtscore function in GenABEL and controlled for the population group to which the samples belonged. We also performed a GWAS run controlling for sex (the only additional phenotype information available to us). Two additional GWAS runs were performed in which the top SNPs (−13910 or rs6563275) were added as covariates. Finally, to calculate how much of the phenotype a given SNP explained, we used linear model estimation. We calculated the effect size of −13910 and rs6563275 in relation to the phenotypic residual variance and used the adjusted R² to infer the percentage of the phenotype explained by each SNP and by both SNPs together.
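
The phenotype classification and the adjusted R² can be written out as a minimal sketch. The function names are hypothetical, and the treatment of values exactly equal to 1.1 or 1.7 mmol/l is an assumption, because the thresholds in the text are given as strict inequalities. The adjusted R² uses the standard formula 1 − (1 − R²)(n − 1)/(n − p − 1) for n samples and p predictors.

```python
def classify_ltt(glucose_mmol_l: float) -> str:
    """Assign a lactose tolerance class from the LTT glucose level (mmol/l).

    Assumption: boundary values (exactly 1.1 or 1.7) go to the higher class;
    the text does not say how they were handled.
    """
    if glucose_mmol_l < 1.1:
        return "intolerant"
    if glucose_mmol_l < 1.7:
        return "intermediate"
    return "tolerant"


def adjusted_r2(r2: float, n_samples: int, n_predictors: int) -> float:
    """Standard adjusted R-squared for a linear model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Toy values, not measurements from the study.
classes = [classify_ltt(value) for value in (0.5, 1.3, 2.0)]
print(classes)
print(adjusted_r2(0.5, n_samples=11, n_predictors=2))
```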

Selection scans

Scans for signals of selection in the Fulani genomes were done using the integrated haplotype score (iHS) and Cross-population Extended Haplotype Homozygosity (XP-EHH) analyses within the R package REHH v.2.0.2 [79]. In both analyses, the length of haplotype homozygosity was calculated with a maximum gap between two SNPs of 200,000 bp. We used the chimpanzee, orangutan and gorilla genomes to identify the ancestral allele and performed the selection analyses on 1,531,283 SNPs. Peaks were identified by averaging the -log10(p-value) every 10 SNPs, and the top 5 regions were inspected in the Genome Browser to identify possible genes in the target region. Selection coefficient estimates were calculated using the formula by Ohta and Kimura [80].
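
The peak identification step can be sketched as follows, assuming that "every 10 SNPs" refers to consecutive, non-overlapping blocks of 10 SNPs (the text does not say whether the windows overlap). The function names and the toy p-values are hypothetical; the study itself computed iHS and XP-EHH p-values in REHH.

```python
import math


def window_means(p_values, size=10):
    """Mean -log10(p) in consecutive, non-overlapping blocks of `size` SNPs.

    Assumption: blocks do not overlap; a shorter final block is averaged
    over the SNPs it contains.
    """
    scores = [-math.log10(p) for p in p_values]
    means = []
    for start in range(0, len(scores), size):
        block = scores[start:start + size]
        means.append(sum(block) / len(block))
    return means


def top_windows(means, n_top=5):
    """Indices of the `n_top` windows with the highest mean score."""
    order = sorted(range(len(means)), key=lambda i: means[i], reverse=True)
    return order[:n_top]


# Toy p-values ordered by genomic position, not results from the study.
toy_p = [0.1] * 10 + [0.001] * 10 + [0.01] * 5
toy_means = window_means(toy_p)
print(toy_means, top_windows(toy_means, n_top=2))
```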