Background

Recombination enables the evolution of complex traits by shuffling novel genetic variants, brought about by mutation, into new combinations with existing alleles from varying genomic origins [1]. Due to the evolutionary significance of recombination, many implementations of software packages that infer the recombination rate have been developed. Some of these packages rely on inferring past recombination events by analysing pedigrees [2], by detecting changes in ancestry [3, 4] or by the boundaries of blocks of identity by descent (IBD) [5]. These methods require large sample numbers (> 2000) to accurately infer the recombination rate at fine-scales. Other packages use linkage disequilibrium (LD) [6] or derivatives thereof, e.g. summary statistics [7], to infer recombination into the very distant past and require fewer individuals to infer recombination at fine-scales. LD-based recombination maps, however, are strongly influenced by past demographic events, e.g. population bottlenecks [8]. Recombination inference software that is aware of changes in the effective population size (Ne) of a population, such as pyrho [9], can be used to mitigate this effect.

The rate of recombination varies between species [10], between populations within species [9, 11] and even among individuals [12]. The recombination rate across the genome is generally expressed as a ratio of genetic distance and physical distance, known as a recombination map. It has been shown that at low resolutions (> 1 Mb), population specific recombination maps are fairly similar [13] and at high resolutions they correlate according to continental levels of population differentiation [11]. For instance, the pedigree-based deCODE [14] map, based on the Icelandic population, correlates better at fine scales to the linkage-disequilibrium-based (LD-based) HapMap II [15] map of the CEU (Utah residents with Northern and Western European ancestry from the CEPH collection) than it does to the HapMap II map of the YRI (Yoruba in Ibadan, Nigeria) [7, 14]. Many population-specific recombination maps have been inferred to date, but none have been inferred for any southern African populations [16] and researchers studying these populations have had to use available maps that might not suit their analysis.

In this manuscript, we present a novel recombination map for the Nama—an indigenous population of southern Africa [17] that forms part of a larger group of geographically close and culturally related individuals known collectively as the “Khoe-San”. The Khoe-San are reported to have the most divergent lineages of any other living population [18,19,20,21,22], and it is believed that they have largely remained isolated until ~2000 years ago [17, 18, 23]. Therefore, a recombination map for this population may be very different at fine scales compared to recombination maps that have been inferred for other populations. The Khoe-San also contribute a significant ancestral component (15–75%) to admixed southern African groups, like the South African Coloured (SAC) population and southern Bantu-speaking populations [24, 25], and a recombination map for diverse Khoe-San populations could benefit studies involving these groups. The demographic history of the Nama is multi-layered, with 5–25% gene flow from Eastern African caprid and cattle pastoralists ~2000 years ago [26] and genetic exchange with the Damara—a hunter-gatherer population of West-Central African ancestry who became economic clients of the Nama. These events were finally followed by recent admixture with European colonists and to a lesser degree ~250 years ago.

We used whole genome sequencing (WGS) data of 54 unrelated Nama individuals [27] to infer a LD-based recombination map that is adjusted according to past changes in Ne. Demographic history was inferred using SMC++ [28] for distant changes in Ne and AS-IBDNe [29] for recent Ne changes. The Ne size changes were then combined, and the demography-aware LD-based method pyrho [9] was used for recombination rate inference. The resultant population-specific recombination map was then compared to other publicly available recombination maps using the Spearman rank correlation coefficient. Finally, we assessed the fine scale differences between the inferred Nama recombination map and the combined Phase II HapMap recombination map in a region of chromosome 1 and demonstrated how the use of different recombination maps can affect the results from a selection scan.

Results

Briefly, 84 Nama individuals were sequenced to 4x-8x depth via Illumina short read sequencing, variant-called and phased in combination with additional African low coverage genomes as well as 1000 Genomes Phase 3 as part of the African Genome Resource [30]. Genomes were variant-called with GATK3.4 following best practices and phased with SHAPEIT2. Further details regarding the production of this dataset are described in Ragsdale et al. [27]. Global ancestry estimates for the Nama, as compared to other Africans from the African Genome Resource along with representative Europeans (CEU), were inferred using ADMIXTURE. Ancestry estimates indicate that the bulk of the Nama’s ancestry is Khoe-San, which is rare elsewhere in the African continent with the exception of the southern Bantu-speaking Sotho and Zulu (Fig. S1). There is a sharp cline in European ancestry across individuals, ranging from ~0 to 50% as may occur with a recent pulse of admixture which has not yet reached equilibrium in a few generations. A subset of individuals carry ancestry frequent in Bantu-speaking and eastern African populations, likely reflecting recent Damara or Herero marriage as indicated in demographic interviews with participants. Ancestry proportion among the full set of related individuals was similar to the subset of unrelated individuals (Fig. S2).

The inferred demographic history of the Nama

We inferred the Ne for the Nama using SMC++ (Fig. 1 A right) and AS-IBDNe (Fig. 1 A left). The results from SMC++ represent the Ne change from 50,000 to 260 generations into the past. The AS-IBDNe results represent the Ne change from 50 to 4 generations into the past and the Ne was inferred using IBD segments of Khoe-San ancestry exclusively. An Ne of ~30,000 approximately 10,000 generations ago with a reduction in Ne to ~21,000 approximately 5000 generations ago is consistent with previously published inferred Ne for the Nama [31]. Inconsistent with previous results, there is a further reduction in Ne to ~10,000 approximately 1000 generations ago. The inferred Ne by SMC++ then stops at 260 generations, because SMC++ can infer Ne approximately 6—120 thousand years ago (kya) with low error [28] and by default SMC++ uses an heuristic to calculate these timepoints automatically given the data.

Fig. 1
figure 1

A The inferred effective population size history for the Nama plotted on a log10 scale with SMC++ results on the right and AS-IBDNe results for the Nama on the left. BD The AS-IBDNe results for the LWK (B), GBR (C) and Nama (D) ancestral components in the Nama

We therefore used AS-IBDNe to estimate population fluctuations over the past thousand years [32]. We deconvoluted 84 Nama genomes (SNP array) into local ancestry tracts with three possible ancestry states: Khoe-San ancestry, European ancestry and Western-Central African ancestry as represented by Nama, GBR (British in England and Scotland) and LWK (Luhya in Webuye, Kenya) population samples. We tested the accuracy of RFMix via simulation in the Nama as well as testing both SNP array and low coverage genome data in order to determine the best dataset for local ancestry inference. We simulated continuous gene flow from 3 ancestral groups: European admixture starting 8 generations ago with 1% contribution per generation, Bantu admixture starting 14 generations ago with 2% contribution per generation, and the remaining contribution for each generation coming from the Khoe-San. The population randomly mates to create each subsequent generation, also taking into account the recombination landscape to accurately copy haplotype blocks. Eleven individuals were used for each ancestral population: French individuals as the European reference, Bantu-speaking as the West African reference and Nama individuals with > 90% Khoe-San ancestry (that were not later used in RFmix runs) for the Khoe-San ancestry component. The average global LAI accuracy, allowing the reference individuals to themselves be admixed, was ~92% on average for the simulated individuals. European ancestry-specific accuracy was 97.6% with individuals being 12.8% European on average, Khoe-San ancestry-specific accuracy was 92.4% with individuals being 76.8% Khoe-San, and Bantu accuracy was 81.5%, with individuals having 10.4% Bantu ancestry overall.

Comparing RFmix runs for the SNP array and genome data to previously obtained ADMIXTURE ancestry percentage estimates, we found that the Khoe-San ancestry was systematically under-called in the genomes compared to the global ancestry estimates from ADMIXTURE, with European and Bantu ancestry consistently higher. This trend is however improved when using the SNP-array data for LAI. Therefore in the AS-IBDNe analysis, admixture deconvolution was performed on MEGA SNP array [33] data in order to facilitate larger numbers of haplotypes in the reference populations.

Beginning 50 generations ago, we infer an Ne of 4360 for the Khoe-San component (Fig. 1D). The Ne starts to decline 34 generations ago and continues to decline until an Ne of 190 inferred 4 generations ago; estimation stops 4 generation ago to avoid coalescent events based on genealogical relationships. The rapid population decline substantially predates the arrival of European settlers in the Richtersveld in 1760 (or ~7 generations ago), an arid region just south of the Orange River [34]. The Ne results inferred for each set of ancestry specific IBD segments (Fig. 1B–D) have very narrow 95% confidence intervals.

The correlation between the inferred Nama recombination map and other publicly available maps

Fine-scale recombination rate differences between pairs of populations are correlated according to continental levels of population differentiation [9, 11]. Considering the long period that the Nama were isolated and their complex demographic history, we hypothesise that there is no available recombination map that is representative of the Nama. Therefore, we compared the inferred recombination map for the Nama with 26 other publicly available recombination maps derived from [9] using the Spearman rank correlation at a 2-kilobase resolution (Fig. 2). These maps were inferred for populations from the 1000 Genomes [35] dataset, and the populations are classified into various super-populations representing major ancestry differences. We find that pairwise correlations between all 27 maps cluster according to continental levels of population differentiation (super-populations). Furthermore, we find that the Nama are more closely related to other African populations than to other continental groups (< 0.75); however, the pairwise correlations between the Nama and the other African populations are much weaker (~0.79) than the pairwise correlations between the African populations (> 0.90). These values represent correlations between inferred maps and, therefore, include any noise potentially introduced during inference. The true maps are likely to be more similar than these values suggest.

Fig. 2
figure 2

Heatmap indicating the Spearman rank correlation between the genetic maps of 27 populations, including the Nama, at a 2-kilobase resolution. The colour of the population labels represent distinct super-population groups, with the Nama highlighted in red. There is clear clustering according to super-population groups and the Nama recombination map correlates the best with other African populations

The Spearman rank correlation mitigates potential differences in map length that would influence the Pearson correlation coefficient. Therefore, we neglect the magnitude of the recombination rate in favour of qualitative aspects of the maps. Inspecting the qualitative aspects of recombination maps is especially relevant when LD-based recombination maps are compared, since LD-based methods produce population recombination rates that need to be scaled using Ne and therefore assume an accurate estimate for Ne.

Fine-scale recombination as applied in selection scans

The combined Phase II HapMap recombination map is derived from 270 individuals who represent four geographically diverse populations, including the Yoruba from Western Africa. It is sometimes used as a proxy [36] for southern African populations, since all other available recombination maps derived from African populations are of western African ancestry, a globally diverse map is thought to be the best substitute. Even though population-specific recombination maps are similar at low resolutions, certain analyses, such as selection scans, might benefit from a high-resolution population-specific recombination map that accurately captures fine-scale differences. Figure 3 illustrates the recombination rate (cM/Mb) plotted over part of chromosome 1 for the combined Phase II HapMap recombination map (orange) and the inferred recombination map for the Nama (blue). The positions of regions of high recombination (hotspots) are largely concordant between the two maps and mainly differ in magnitude. However, in the region at 24.2 Mb, there are hotspots present in the Nama recombination map that are absent from the combined Phase II HapMap recombination map. To further investigate the effects that these differences could have, we performed genome-wide selection scans on Nama SNP array data using the combined Phase II HapMap map and the Nama map. We focused on the integrated haplotype scores (iHS), a selection statistic which detects recent positive selection, by evaluating haplotype homozygosity for the ancestral and derived haplotypes extending from a locus of interest [37]. iHS is most effective at detecting alleles that have been swept to intermediate frequencies, and it is among the most common statistics cited in other comparable selection scans in the Khoe-San.

Fig. 3
figure 3

The recombination rate of the combined Phase II HapMap recombination map and the inferred recombination map for the Nama plotted over a segment of chromosome 1. There is a high degree of overlap between the maps across this region, but there are positions with recombination hotspots indicated by the Nama map that are not indicated by the combined Phase II HapMap map, e.g. at 24.2 Mb

After taking the absolute value of the integrated Haplotype Scores (iHS) and filtering for the highest 1.0% of the scores, we found an overlap of 1504 candidate genes (50%) between the two maps. However, the run using the combined Phase II HapMap map and the run using the Nama map identified 808 and 713 unique candidate genes respectively (Fig. 4). The difference in the number of top 1.0% of hits is due to the change in the relative length of the maps. The Pearson correlation (r) between the iHS scores found using the combined Phase II HapMap and Nama maps is 0.93. We compiled a list of 131 candidate genes [18, 20, 36], previously identified using iHS, that are under selection in the Khoe-San and compared this list to our results. We found an overlap of three genes (CTNNAL1, ALDH1A2 and SYT14) between the previously identified genes and the run using the combined Phase II HapMap map but only an overlap of one gene (TRIM39) between the previously identified genes and the run using the Nama map. TRIM39 encodes for a ring finger protein associated with diseases including Behcet’s syndrome; it regulates p21 and plays an important role in determining cell fate [38]. Previous research has demonstrated that selection statistics such as iHS are sensitive to phasing, sample size and ascertainment bias [39]. Our results indicate that a population-specific recombination map should also be considered in attempts to fine-map adaptive haplotypes.

Fig. 4
figure 4

Venn diagram of the candidate genes found using the 1.0% highest selection scan results (absolute value iHS) for the selection scan using the combined Phase II HapMap map (white) and the selection scan using the Nama map (grey)

Discussion and conclusions

Recombination maps are important resources for epidemiological and evolutionary analyses; however, there are currently no recombination maps that represent southern African populations [16]. The Nama, a southern African indigenous population, would likely produce a distinct recombination landscape from publicly available recombination maps, because of their complex demographic history. The recent rapid population decline (shown in Fig. 1) partially illustrates this complex history. Despite gene flow from Eastern African pastoralists ~2000 years ago and recent admixture with Europeans, the Nama do not cluster with any of the continental groups that we have representative recombination maps for (Fig. 2). Therefore, their recombination landscape is indeed unique and epidemiological studies that involve the Nama or any other related populations, like other Khoe-San populations or southern African Bantu-speaking groups, would benefit from our inferred map. This recombination map also represents a likely upper bound on the extent of divergence we expect to see for a recombination map in humans and would be of interest to any researcher that wants to test the sensitivity of population genetic or GWAS analysis to recombination map input. Fine-scale differences in recombination can meaningfully alter the results of a selection scan (demonstrated in Figs. 3 and 4). However, it should be noted that recent studies found that population-specific recombination maps have little effect on phasing [40], imputation [40] and local ancestry inference [41]. Therefore, the combined Phase II HapMap recombination map’s proxy status with regards to the Nama is dependent on the analysis that the map is used for.

There are many available techniques [42] to infer the recombination rate and some have contrasting limitations which means that not all techniques would allow accurate, fine-scale estimates for a given dataset. Assuming limitless resources, we would have preferred pedigree-based methods, because these allow sex-specific recombination rate inference and rely on inferring individual recombination events between successive generations based largely on observed meioses. However, pedigree-based methods require many thousands of individuals to produce fine-scale maps [2]. Other options are IBD-based and LAI-based methods, but they too require in the order of a couple thousand individuals for fine-scale estimates [5]. Our small sample size (54 unrelated individuals) made LD-based methods the obvious choice for fine-scale estimates. However, there are many assumptions that accompany LD-based methods that make them less than ideal, for instance the assumption of a constant Ne and the potential bias from gene flow when inferring recombination in admixed populations [43]. Therefore, the complex demographic history of the Nama made demography-aware methods, like pyrho, the ideal compromise between data availability and accuracy. Even so, the population-specific recombination map presented here is likely an accurate representation of the recombination landscape of the Nama and future epidemiological and evolutionary research will benefit from this resource.

Methods

Inferring demographic history

It has been shown that demographic history, especially recent bottlenecks, can greatly impact LD-based recombination inference. We, therefore, inferred the demographic history of the Nama to improve our recombination rate estimates. Two methods, SMC++ (v1.15.2) [28] and IBDNe (v23Apr20) [32], were used and the results combined. See Fig. 5 for an overview of the methods.

Fig. 5
figure 5

A brief overview of the methods used in effective population size inference and the subsequent recombination rate inference

SMC++ uses LD information to infer demographic histories and can infer divergence times between 6 and 120 kya with low error [28]. A whole genome sequencing (WGS) dataset (EGAD00001006198) of 84 Nama individuals (54 unrelated) was used. The input for SMC++ was created separately for each chromosome, from the unrelated individuals in the WGS dataset, by using the vcf2smc program with 10 randomly selected “distinguished” (see Terhorst et al. [28] for more information on this) individuals. The result is 10 separate datasets for each chromosome. This creates a composite likelihood which, according to the authors, may lead to improved estimates. A per-generation mutation rate of 1.25e−8 was assumed and all of the input files were then included in an estimate of the Ne through time using the estimate program. Since SMC++ regards uncalled regions as long runs of homozygosity, Stephen Schiffels’ mappability mask (created for human genome build GRCh37 using SNPable [44]) was used to mask regions of low mappability. All other default parameters were used.

IBDNe can infer the Ne size 4-50 generations into the past by using identity by descent (IBD) information. By separating IBD segments by ancestry before inferring the Ne, one can obtain an estimate of Ne localised to each population ancestry. We developed a Snakemake pipeline (Fig. 6), called AS-IBDNe (https://github.com/hennlab/AS-IBDNe), to estimate ancestry specific Ne from a given SNP array dataset. The pipeline was adapted from the procedure used in Browning et al. [29]. We ran it on 84 Nama individuals genotyped on the Multi-Ethnic Global Array (MEGA) [33]. The pipeline takes in SNP-array data in plink binary file format, uses plink v1.9 [45] to break the data by chromosome, and shapeIT v2 [46] to phase the chromosomes. The dataset is then converted to VCF format using SHAPEIT2 and split into one file containing the reference individuals and one file containing the admixed individuals using BCFtools [47]. Next, RFMix v2.0 [48] is run on these two vcf files to estimate the ancestry of arbitrarily sized segments across the genome. Simultaneously, RefinedIBD and merge-ibd-segments.17Jan20.102.jar [49] is run on the phased data to infer ibd segments and remove any gaps between them. The ancestries produced from RFMix v2.0 are then assigned to each IBD segment using a custom python script. This information is provided to the program IBDNe [32], which produces estimates of historical population size for each ancestry. The RFMix results were also used to create the ternary diagrams in Fig. S2 using the ggtern package in R. All the default parameters of RFMix, RefinedIBD and IBDNe were used except RFMix, which was run with 3 expectation maximisation iterations and the reanalyse-reference flag, and IBDNe, which was run with the mincM flag set to 3. The combined Phase II HapMap recombination map was used whenever a recombination map was required during the inference. The output of SMC++ can be converted to a csv where the timescale and the Ne estimates are linear. The output from AS-IBDNe can then be added to the linear output from SMC++, and this file can then be used during recombination rate inference in pyrho [9].

Fig. 6
figure 6

An overview of the AS-IBDNe pipeline. Input SNP array data in plink binary format is split by chromosomes using plink v1.9. Each chromosome is then phased and converted to vcf format by SHAPEIT2. IBD segments are next inferred using RefinedIBD, and merge-ibd-segments is used to remove gaps between them. Meanwhile, RFMix2.0 is run to estimate the ancestry of differently sized genomic segments. Finally, RFMix-produced ancestries are assigned to each IBD segment, and IBDNe is run to produce ancestry-specific effective population size estimates

Recombination rate inference

Previous published guidelines [42] aided the choice of recombination rate inference method and pyrho, a demography-aware LD-based method, was selected. We assumed the same per-generation mutation rate of 1.25e−8 for all the inference steps. The most computationally laborious task when using pyrho is the generation of a lookup table which enables subsequent processes to be computationally faster. The combined SMC++/AS-IBDNe demographic history was used to generate a lookup table for the unrelated subset of 54 Nama WGS individuals using pyrho make_table. A convenient feature of this lookup table is that it is compatible with other recombination rate inference software, e.g LDhat [6], which make use of exact two-locus sampling probabilities with the added benefit of already taking the specified demographic history into account. This lookup table and the combined demographic history were employed to find optimal hyperparameters to be used for recombination rate inference with pyrho hyperparam. The parameters that yielded the highest overall accuracy were a smoothness penalty of 15 and a window size of 30. These parameters and the lookup table were then used to infer the recombination rate with pyrho optimise. The output provides the per base pair per generation recombination rate for a given interval.

Selection scans

For the selection scans, we used data from 104 Nama individuals who were genotyped on the Illumina Omni2.5 array as part of the African Genome Diversity Project. Close relatives were identified from demographic interviews and verified via allele-based kinship coefficients in plink. Individuals with more than 50% European, and Damara or Herero admixture were excluded. Ancestry estimates were obtained using ADMIXTURE with k = 6 possible ancestral clusters: Nama, Northern San, Near Eastern, East African Nilotic, West African, and European (see also Fig. S1) [50]. After QC, kinship and ancestry exclusions, we analysed n = 55 individuals. We calculated iHS using selscan 1.3.0 [51] and default parameters. For the --map flag, we used recombination rates from the custom Nama map in one run and from the combined Phase II HapMap in a second run. We filtered for the most extreme iHS scores (absolute value) by taking the highest 1.0% of the scores.

We annotated these positions using the gene range list provided by Plink (https://www.cog-genomics.org/plink/1.9/resources). We compared the candidate genes found in each run of selscan to create a Venn Diagram. We also calculated the Pearson correlation between iHS scores for each SNP as calculated by each run of selscan.