Introduction

During the past ~100,000 years, humans have expanded from a lowland tropical environment in Africa to occupy an enormous range of environments throughout the world, including extremes of heat, cold and ultraviolet radiation, and have adapted to these by both behavioral and biological responses (Beall 2007). High-altitude environments (regions above 1,500 m according to Ward et al. 2001) are of particular interest in studies of adaptation for two reasons. First, all individuals in a population experience the same stress, which cannot readily be modified by technological or cultural means and, therefore, involves primarily a physiological response. Second, such environments are present in several different regions, including Africa, Asia and South America, and so provide an opportunity to investigate the extent to which similar selective pressures lead independently to similar adaptations (Beall 2007). At an elevation of 2,000 m, the standard atmospheric equation (West 1996) estimates barometric pressure to be 20% lower than at sea level. Hence, in a given inspired volume, the partial pressure of oxygen is reduced accordingly. In the absence of any physiological compensatory response, the alveolar gas equation (Fenn et al. 1946) predicts that arterial blood would have an oxygen partial pressure of around 9 kPa, compared with the 13.3 kPa considered normal in healthy young subjects at sea level (Lumb 2011). Even with normal ventilatory compensation, there is detectable arterial hypoxia at rest in subjects exposed to this altitude (Muhm 2007). This produces a commensurate decline in aerobic exercise capacity of 5–10% (Cerretelli 2008), with a consequent reduction in performance in outdoor activities such as running, hunting, or escaping from predators, which can lead to a reduction in fitness if not counterbalanced by acclimatization or genetic adaptation. An organism exposed to such conditions, therefore, has to find a metabolic way to compensate for the reduction of oxygen delivery.
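The pressure and oxygen figures quoted above can be reproduced approximately with a short calculation. The sketch below is illustrative only: it assumes the exponential model-atmosphere fit commonly attributed to West (1996), a simplified alveolar gas equation, and textbook resting values for inspired oxygen fraction, water vapour pressure, arterial CO2 and respiratory quotient, none of which are specified in the text. It also treats alveolar oxygen pressure as an approximation of the arterial value, ignoring the alveolar-arterial gradient.

```python
import math

MMHG_TO_KPA = 0.133322  # unit conversion factor


def barometric_pressure_mmhg(altitude_km: float) -> float:
    """Barometric pressure from an assumed model-atmosphere fit (altitude in km)."""
    h = altitude_km
    return math.exp(6.63268 - 0.1112 * h - 0.00149 * h ** 2)


def alveolar_po2_mmhg(pb_mmhg: float,
                      fio2: float = 0.2093,   # assumed inspired O2 fraction
                      p_h2o: float = 47.0,    # assumed water vapour pressure at 37 C
                      paco2: float = 40.0,    # assumed uncompensated arterial CO2
                      rq: float = 0.8) -> float:  # assumed respiratory quotient
    """Simplified alveolar gas equation, with no ventilatory compensation."""
    return fio2 * (pb_mmhg - p_h2o) - paco2 / rq


pb_sea = barometric_pressure_mmhg(0.0)
pb_2000 = barometric_pressure_mmhg(2.0)
pressure_drop = 1.0 - pb_2000 / pb_sea
pao2_kpa = alveolar_po2_mmhg(pb_2000) * MMHG_TO_KPA

print(f"Pressure at 2,000 m: {pb_2000:.0f} mmHg ({pressure_drop:.0%} below sea level)")
print(f"Predicted alveolar PO2: {pao2_kpa:.1f} kPa")
```

Under these assumptions the drop in barometric pressure comes out close to 20% and the predicted oxygen pressure close to 9 kPa, in line with the values cited in the text.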

Worldwide, three sets of populations living at high altitudes have received most attention: the Ethiopian highlanders, the Himalayan Sherpa and Tibetans, and the Andean Quechua and Aymaras (Moore 2001; Rupert and Hochachka 2001; Suzuki et al. 2003; Rajput et al. 2006; Zhang et al. 2006; Beall 2007; Erzurum et al. 2007). Physiological studies have shown differences in the mechanisms of adaptation in different regions: in the Andes, for example, increased oxygen delivery is achieved by an increase in haemoglobin concentration, but in Tibetans the haemoglobin concentration is not increased (Simonson et al. 2010), showing that populations from different parts of the world can adapt to similar environments by following more than one pathway. Recently, the genetic basis of some of these adaptations has been investigated and signs of positive selection found in EPAS1, EGLN1 and PPARA in Tibetan populations (Beall et al. 2010; Bigham et al. 2010; Simonson et al. 2010; Yi et al. 2010; Peng et al. 2011; Xu et al. 2011). Because of the diversity of the hypoxic response, the investigation of additional high-altitude populations would be of considerable interest.

The Caucasian mountains on the border between Asia and Europe reach 5,642 m at their highest point (Mt. Elbrus) and show archeological and genetic evidence of continuous human occupation since >10,000 Ybp (Caciagli et al. 2009), so provide an additional environment where humans have adapted to high altitude. The geo-morphological, anthropological, and linguistic landscapes are all very complex. For example, several distinct ethnic groups live at high altitudes in the Republic of Daghestan (Russia) speaking Caucasian, Indo-European or Altaic languages (Caciagli et al. 2009). Anthropological and genetic studies have shown that the majority of these groups have high endogamy, high inbreeding, and small population sizes that have remained stationary for many generations (Bulayeva et al. 2005, 2006). Strict patrilocality and endogamous marriages have led to a reduction of diversity within each population, and the small size has led to genetic drift and differentiation between them, as revealed using Y-chromosomal and mitochondrial DNA markers (Barbujani et al. 1994a, b; Bulayeva et al. 2003; Nasidze 2003; Nasidze et al. 2004; Tofanelli et al. 2009; Balanovsky et al. 2011).

Studies of genetic adaptations to high altitude in Daghestan have been limited, but have revealed some evidence for such adaptations:

  • A variant hemoglobin (Hb) alpha subunit sequence was reported in 1987 in some Daghestani individuals (Lacombe et al. 1987). Remarkably, the same position is also variant in some high-altitude-adapted deer mice (Storz et al. 2007, 2009, 2010) where, in association with other mutations, it contributes to the increase in oxygen affinity of the mouse Hb. However, no data concerning changes in Hb oxygen affinity are available in the literature for the human variant.

  • When populations of highlanders moved to the lowlands as a consequence of a Soviet government decision, the mortality rate increased dramatically. Although this increased mortality could be partly explained by novel pathogens encountered in the lowlands, it could not be entirely accounted for in this way (Bulaeva et al. 1995, 1996), so might also reflect a reduction of low-altitude fitness due to genetic adaptation to the high altitude or low genetic diversity in these populations.

Study approach

Three main factors influenced our study design. First, a relatively small number of samples were available (Caciagli et al. 2009) and the physiological parameters of the individual donors were not known. Second, there are likely to be substantial differences between the Daghestani study populations due to genetic drift during their different demographic histories as these have diverged over the last few centuries and millennia, in addition to any adaptive differences linked to high altitude. These two considerations led us to choose an empirical candidate gene approach, in which we would compare neutral regions with genes potentially implicated in high altitude adaptation. A third relevant factor was that our study was initiated in 2008, before the publication of the recent genome-wide surveys, and was, therefore, restricted to candidate genes known in 2008. In particular, at that time there was, in our assessment, not enough evidence to consider PPARA and EPAS1 as strong candidate genes, although EPAS1 has subsequently been associated with high altitude adaptation in multiple populations (Beall et al. 2010; Bigham et al. 2010; Simonson et al. 2010; Yi et al. 2010; Peng et al. 2011; Xu et al. 2011). Thus, we set out to compare 20 autosomal regions accepted as likely to be neutral (Wall et al. 2008) with 15 candidate genes for high altitude adaptation: HIF1A; EGLN1; EGLN2; EGLN3; VHL; EPO; EPOR; VEGFA; EDN1; NOS3; ACE; HBA; HBB; HBD, and HBG1 (Table 1). The three Daghestani highlander populations (Avars N = 16, 2,120 m; Laks N = 21, 2,100 m; Kubachians N = 12, 1,890 m) would be compared with a geographically matched lowlander control population (Adygei, N = 20, 17 m). We would search for an excess of differentiation in the candidate genes between the lowland and highland populations (F ST) when compared with the neutral regions as a potential sign of adaptation. To provide calibration standards and quality controls for the resequencing data, we also analyzed 20 CEU individuals from the HapMap3 panel (Frazer et al. 2007), 12 of whom were also included in the 1000 Genomes pilot project (Durbin et al. 2010).

Table 1 Candidate genes analyzed

Results

We successfully re-sequenced 15 candidate genes and 20 neutral control regions (480 kb in all) in 89 individuals from three Daghestani high-altitude populations (Avars, Kubachians and Laks) and two control populations, the Adygei and the CEU, with an estimated concordance rate with HapMap3 of 99.25%, false discovery rate of 6.6% and false negative rate of 6% (fourth data point in Fig. 1; false discovery and false negative estimates based on comparisons with 1000 Genomes data). These estimates are conservative, since they assume that both the 1000 Genomes and HapMap3 datasets are error-free. We detected 1,066 SNPs (514 in neutral regions and 552 in candidate genes), 316 (207 and 109) novel, which together form the basis for the subsequent analyses. The derived allele frequency spectrum (DAF) shows a large excess of rare alleles (Supporting Fig. 1), as expected in human populations (Durbin et al. 2010) although our stringent filtering process (see “Quality checks” in “Materials and methods”) removed 54% of the rare variants when compared with other high depth re-sequencing studies (1000 Genomes exome pilot project, data not shown).
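For readers unfamiliar with these quality metrics, the following minimal sketch shows how a false discovery rate, a false negative rate and a genotype concordance rate can be computed once a call set is compared against a reference treated as error-free. The function names, the site identifiers and the toy data are hypothetical and are not taken from the study's pipeline.

```python
def call_set_error_rates(called_sites: set, reference_sites: set) -> dict:
    """Compare called SNP sites against a reference set assumed to be correct."""
    true_pos = len(called_sites & reference_sites)
    false_pos = len(called_sites - reference_sites)   # called, absent from reference
    false_neg = len(reference_sites - called_sites)   # in reference, not called
    fdr = false_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    fnr = false_neg / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return {"fdr": fdr, "fnr": fnr}


def genotype_concordance(calls: dict, reference: dict) -> float:
    """Fraction of shared (sample, site) keys with identical genotypes."""
    shared = calls.keys() & reference.keys()
    if not shared:
        raise ValueError("no overlapping genotypes to compare")
    return sum(calls[key] == reference[key] for key in shared) / len(shared)


# Toy data (hypothetical positions and genotypes)
called = {100, 200, 300, 400}
truth = {100, 200, 300, 500}
rates = call_set_error_rates(called, truth)

toy_calls = {("s1", 100): "AG", ("s1", 200): "AA", ("s2", 100): "GG", ("s2", 200): "AG"}
toy_ref = {("s1", 100): "AG", ("s1", 200): "AA", ("s2", 100): "GG", ("s2", 200): "AA"}
concordance = genotype_concordance(toy_calls, toy_ref)
```

Because both rates are defined relative to the reference, any errors in the reference itself inflate them, which is why the estimates reported above are described as conservative.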

Fig. 1

False discovery and missingness rates using a range of filtering thresholds

We first compared the Daghestani and control populations with the HapMap samples (20 individuals each), performing a PCA using the HapMap SNPs from the neutral regions. The Daghestani populations cluster together, close to the CEU and distant from the CHB and YRI (Fig. 2). This result is expected from their geographical location, confirming the reliability of the SNP calls and suggesting that drift in the Daghestani populations has not been so high that geographical signatures have been erased.

Fig. 2

Principal components analysis (PCA). The Daghestani populations form a sub-cluster within the European/Caucasian cluster

To investigate the differentiation between highlanders and lowlanders, we calculated pairwise F ST values for each SNP between each Daghestani population and the Adygei (three pairwise comparisons per SNP). The 95% upper boundary calculated from the 20 neutral regions was taken as the empirical significance level for each highlander-control population pair. We considered as outliers those candidate gene SNPs exceeding this value, corrected for multiple tests (three per SNP) according to Bonferroni’s formula (significance level divided by the number of tests, here 5%/3 ≈ 1.7%), i.e. SNPs included in the top 1.7% of the F ST distribution (Supporting Table 2). Averaged over all genes, the proportion of outliers exceeded 1.7% only in Kubachians (Table 2). When looking at each gene individually, we saw an excess of outlier SNPs in the EGLN1, EGLN3 and HIF1A genes in the Kubachians, as well as among the β and δ globin genes in several populations. Due to the high sequence similarity among the globins, the filtering process removed many potential SNPs from these regions. Therefore, the high F ST values in the globin genes may be explained by the small numbers of SNPs detected in these genes, and were not considered further. The Daghestani variant of HBA was not detected in this study, but is rare and has only been reported from a few families (Lacombe et al. 1987). However, the EGLN1 region, where Kubachians display a cluster of 13 intronic SNPs with high F ST values (Fig. 3; Supporting Tables 2, 3), cannot be accounted for by any known artefact, and points to an unusually highly differentiated region. We validated all of the SNPs tested and 85% of the genotype calls through capillary sequencing with the remaining 15% of genotypes being incorrect heterozygous calls, in line with the validation rates (70–90%) obtained by the 1000 Genomes Project (Durbin et al. 2010). This intronic region appears to be quite conserved among mammals and is a site of histone H3K4 methylation (Fig. 3) indicative of transcriptional activation (Lupien and Brown 2009). In addition, a single highly differentiated non-synonymous SNP (rs11549465) in HIF1A exon 12 in the Laks also stood out because of its known functional importance, influencing the transactivation response of the HIF dimers and previously associated with oxygen metabolism (Prior et al. 2003; Tanimoto et al. 2003; Doring et al. 2010), and the complete absence of the derived allele (Fig. 4; Supporting Table 3).
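The empirical outlier procedure described above can be sketched as follows. This is an illustration rather than the script used in the study: it assumes the threshold is the (1 − 0.05/3) quantile of the neutral-region F ST values, estimated with simple linear interpolation, and the function names and toy values are hypothetical.

```python
import math


def empirical_quantile(values, q):
    """Quantile of a sample by linear interpolation between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values supplied")
    pos = q * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def flag_outliers(neutral_fst, candidate_fst, alpha=0.05, n_tests=3):
    """Flag candidate SNPs above the Bonferroni-corrected neutral threshold."""
    corrected_alpha = alpha / n_tests  # 0.05 / 3, roughly 1.7%
    threshold = empirical_quantile(neutral_fst, 1.0 - corrected_alpha)
    return threshold, [fst > threshold for fst in candidate_fst]


# Toy values standing in for one highlander-versus-Adygei comparison
neutral = [i / 100 for i in range(101)]
candidates = [0.50, 0.99]
threshold, flags = flag_outliers(neutral, candidates)
```

One threshold would be derived per highlander-control pair, so that each population is judged against its own neutral background and its own history of drift.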

Table 2 Distribution of F ST outlier SNPs in each population compared with Adygei by number (left) or proportion (right)
Fig. 3

F ST outlier SNPs in EGLN1 from the three high-altitude populations. The customized UCSC track shows the genomic location of all the SNPs called in this study from the EGLN1 genic region (black) and the outlier SNPs for the F ST statistic in Avars (blue), Kubachians (red) and Laks (brown). Below the SNPs is the gene structure, some relevant histone modifications, the conservation of the region and all the SNPs reported in the dbSNP(130) database. The Kubachian 13-SNP cluster is seen in the left half of the plot, spanning the 229580000 coordinate

Fig. 4

F ST outlier SNPs in HIF1A from the three high-altitude populations. The customized UCSC track shows the genomic location of all the SNPs called in this study in the HIF1A genic region (black) and the outlier SNPs for the F ST statistic in Avars (blue), Kubachians (red) and Laks (brown). Below the SNP position is the gene structure, some relevant histone modifications, the conservation of the region and all the SNPs reported in the dbSNP(130) database. The non-synonymous SNP rs11549465 is identifiable as the only outlier in the Laks track

To further investigate the EGLN1 and HIF1A regions identified by the F ST results, we calculated six statistics (Tajima’s D, Fu and Li’s F and D, Fu’s Fs, Fay and Wu’s H and π) to compare the evolutionary histories of these genes and the neutral regions. To obtain an empirical threshold for the significance of each test, we used the maximum and minimum scores obtained for the neutral regions in each population. EGLN1 stood out again, since most of the populations gave outlier results at each test although with different trends between the tests. The values, expressed as fold above or below the empirical threshold, are plotted in Fig. 5a. The results reveal an excess of intermediate-frequency variants in these regions, possibly due to the positive selection on an allele that has reached about 50% in the population, or balancing selection. This pattern is confirmed by the network (Bandelt et al. 1999) shown in Fig. 5b, which reveals two distinct clusters of high-frequency haplotypes separated by several steps. Although these neutrality test results only show nominal significance (not corrected for multiple testing), we interpret them as further support for non-neutral evolution of regions identified by the previous F ST results.
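As an illustration of how such statistics respond to the allele frequency spectrum, the sketch below computes Tajima's D from phased haplotypes using the standard formulas of Tajima (1989). It is not the in-house script used in this study; haplotypes are assumed to be coded as strings of '0' (ancestral) and '1' (derived), and the toy data are hypothetical.

```python
import math


def tajimas_d(haplotypes):
    """Tajima's D for equal-length haplotype strings of '0'/'1' alleles."""
    n = len(haplotypes)
    if n < 4:
        raise ValueError("at least 4 haplotypes are needed")
    length = len(haplotypes[0])
    if any(len(h) != length for h in haplotypes):
        raise ValueError("haplotypes must have equal length")

    # Derived-allele count at each segregating site
    counts = [sum(h[j] == "1" for h in haplotypes) for j in range(length)]
    segregating = [k for k in counts if 0 < k < n]
    s = len(segregating)
    if s == 0:
        raise ValueError("no segregating sites")

    # Mean number of pairwise differences (pi)
    pi = sum(2 * k * (n - k) for k in segregating) / (n * (n - 1))

    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


# Toy data: only singletons versus only intermediate-frequency variants
singleton_haps = ["10000", "01000", "00100", "00010", "00001"] + ["00000"] * 5
balanced_haps = ["11111"] * 5 + ["00000"] * 5
d_singletons = tajimas_d(singleton_haps)
d_balanced = tajimas_d(balanced_haps)
```

An excess of rare variants gives a negative D, whereas an excess of intermediate-frequency variants, as described here for EGLN1, gives a positive D.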

Fig. 5

a Neutrality tests for EGLN1. The y axis shows the fold above the neutral region maximum, chosen as the empirical threshold for deviation from neutrality. b Network of EGLN1 haplotypes. Circles represent haplotypes, with area proportional to frequency (smallest = 1), and are coloured according to the population of origin as in a. Lines represent mutational steps separating the haplotypes, with the number of steps indicated in red

Discussion

We have developed a robust approach to resequencing target regions in populations of interest using next-gen sequencing technology involving PCR enrichment and sample multiplexing, and applied it to an investigation of high-altitude adaptation in a little-studied area, Daghestan in the Eastern Great Caucasus. The next-gen sequencing approach proved to be well suited for our study design, since only 154 of the 1,066 SNPs we identified would have been available on standard genotyping arrays (e.g., Illumina 1 M Omni SNPchip). In particular, the highly-differentiated cluster of intronic SNPs in the EGLN1 gene is not represented at all on this array, so would have been overlooked in an array-based approach. In addition to this, the sequencing data allowed us to support the F ST results by calculating a set of statistics that make use of the full allele frequency spectrum, hence reducing the risk of ascertainment bias. The data from the 20 neutral regions show the expected patterns in the PCA plot (Fig. 2) and derived allele frequency spectrum (Supporting Fig. 1), testifying to the reliability of the results from next-gen sequencing technology. A key aspect of our work was the inclusion of these putatively neutral control regions, which allowed us to identify likely selected genes and SNPs using an empirical approach in populations with complex and poorly understood demographies. Our results suggested unusual evolutionary events at two genes, and we now discuss these findings and their biological consequences, along with their implications for a more general understanding of high altitude adaptation.

The derived allele frequency spectrum of the combined candidate gene regions (Supporting Fig. 1) and the neutrality tests applied to the phased haplotypes identified an excess of intermediate-frequency SNPs in several genes. EGLN1 exemplifies these characteristics (Fig. 5). Furthermore, a substantial intronic region of this gene is highly differentiated between the highlander Kubachians and the Adygei living at sea level (Fig. 3). This segment shows high conservation across mammals, comparable to an exonic region, and annotation for the presence of histone methylation (H3K4me1) which indicates transcription enhancer activity (Lupien and Brown 2009). We speculate that one or more of the SNPs in the cluster may alter the EGLN1 transcription level with a consequent change in the protein production. A decreased availability of EGLN1 (Bernhardt et al. 2010) as well as other modifications of the EGLN-VHL-HIF axis (Smith et al. 2006; Formenti et al. 2011) have been shown to induce upregulation of some genes involved in the HIF-dependent hypoxia response, such as EPO, EGLN1 and VEGF. Future functional studies will be needed to investigate the role of the SNP cluster in regulating gene expression. Interestingly, EGLN1 was picked out as a recently selected gene in Tibetans showing a pattern of high altitude adaptation in three studies (Simonson et al. 2010; Peng et al. 2011; Xu et al. 2011). However, the signature of selection in Tibetans was a long haplotype inferred from SNP array data, centered 20 kb downstream from the transcribed region, and thus distinct from the intronic signature in Kubachians. Another study (Aggarwal et al. 2010) reported two SNPs from the same intron in EGLN1 (rs480902 and rs479200) as markers of putative adaptation to high altitude; however, rs480902 was not detected as polymorphic and rs479200 did not stand out as an outlier in the present study. Although it remains unclear whether or not there might be a common underlying biological mechanism such as a modification of expression, it is indeed striking that the same gene stands out in three independent studies investigating geographically distinct populations, indicating a common path of adaptation to a hypoxic environment.

The other remarkable finding was an outlier SNP in Laks within exon 12 of HIF1A. This SNP, rs11549465, shows the derived allele at a frequency of ~10% in the Adygei, consistent with the Human Genome Diversity Project data (Rosenberg 2006), but is detected only in its ancestral state in Laks, Avars and Kubachians, although partial sequencing failures make conclusions about the latter two populations preliminary. In its derived state, the variant replaces a proline with a serine. As a result, the transactivation activity of the HIF1A protein increases dramatically during hypoxia (Tanimoto et al. 2003), triggering a downstream hypoxic response. Conversely, the ancestral proline allele was found at significantly higher frequencies in elite endurance athletes than in non-athletes (Doring et al. 2010) and has been associated with an increase in VO2max after 1 month of training in elderly people (Prior et al. 2003). Hence, the absence of the derived allele in the highlander populations can be seen as a twofold consequence of selection: against the damaging excessive activity of a hypoxia master regulator gene, and in favour of the ancestral variant capable of conferring better endurance and VO2max plasticity in a mildly hypoxic environment.

In conclusion, although a study based on known candidate genes for high altitude adaptation can only reveal the involvement or lack of involvement of these genes, it is striking that in the intronic EGLN1 cluster in Kubachians and the rs11549465 SNP in Laks, we find two distinct patterns of high altitude adaptation previously unknown in Daghestani highlanders. Furthermore, our findings show how even a mildly hypoxic environment (2,000 m) can induce genetic adaptation. These results will benefit from further functional follow-up, but already illustrate both shared aspects between high altitude adaptation in Daghestan and other areas, and features that differentiate Daghestan from other studied regions.

Materials and methods

Samples and ethics statement

The Daghestani DNA samples used in this study were extracted from blood samples collected from healthy adult male donors after obtaining individual written informed consent, and were transported under a research agreement between G. Paoli’s group at the Department of Biology of the University of Pisa and K. Bulayeva’s group at Vavilov Institute of General Genetics of Russian Academy of Sciences, and have been used in previous studies (Caciagli et al. 2009). The CEU samples were obtained from the Coriell Institute (www.coriell.org), while the Adygei DNAs (extracted from lymphoblastoid cell cultures) were from the Kidd Lab, Department of Genetics, Yale University School of Medicine. Approval for the study was provided by the Cambridgeshire 2 Research Ethics Committee (09/H0308/1).

Candidate genes and their functions

The genes chosen as candidates for high altitude adaptation, together with their main features are reported in Table 1.

Control regions

Twenty genomic regions with no known association with high altitude adaptation and considered to be neutral on the basis of lack of documented function and long distance from known coding regions (Wall et al. 2008) were used as controls. Three sections (1–3 kb each) spanning each of the 20 genomic regions were resequenced. The coordinates and sequences of each trio region (referred to as “neutral regions”) were kindly provided by Dr. Michael Hammer (http://hammerlab.biosci.arizona.edu) and the primers were designed independently (Supporting Table 1).

Long template PCRs, primer design and standardization

Primers were designed using Primer3 (http://primer3.sourceforge.net/) and BLAST (http://blast.ncbi.nlm.nih.gov/Blast.cgi). The design was automated using the software Pfetch and custom Perl scripts available on request from lp8@sanger.ac.uk. BLAST was used to check for primer specificity. Primer pairs (www.sigma-aldrich.com) were tested and long-PCR (1–6 kb) conditions standardized using a HapMap DNA (NA07029) as a test sample. The 11,904 PCR amplifications were performed in 96-well plates using Platinum Taq DNA polymerase high fidelity (Invitrogen), using one plate for each of the 124 primer pairs (protocols available on request). Amplicons were visualized by agarose gel electrophoresis in the presence of ethidium bromide, and those from the same individual were pooled using volumes inversely proportional to reaction yield, in order to establish approximate equimolarity. The pooled PCR products were subsequently purified through post-PCR columns (Qiagen QIAquick PCR Purification Kit).
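The pooling rule (volumes inversely proportional to reaction yield) can be expressed as a small calculation. The sketch below is hypothetical: the function name, the concentration units and the example yields are assumptions, and in the study itself yields were judged from agarose gels rather than measured numerically.

```python
def pooling_volumes(yields_ng_per_ul, total_volume_ul):
    """Volumes inversely proportional to yield, scaled to a chosen total volume."""
    if any(y <= 0 for y in yields_ng_per_ul):
        raise ValueError("yields must be positive")
    weights = [1.0 / y for y in yields_ng_per_ul]
    scale = total_volume_ul / sum(weights)
    return [w * scale for w in weights]


# Three amplicons from one individual with different (hypothetical) yields
yields = [10.0, 20.0, 40.0]
volumes = pooling_volumes(yields, total_volume_ul=7.0)
masses = [v * y for v, y in zip(volumes, yields)]  # DNA mass contributed by each amplicon
```

This equalizes the mass contributed by each amplicon; it approximates equimolarity only to the extent that amplicon lengths are similar, which is consistent with the approximate equimolarity described above.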

Illumina GAII resequencing

All samples were sequenced on an Illumina GAII Genome Analyzer. To exploit the high yield of an Illumina run, an indexed re-sequencing strategy was adopted (Kozarewa and Turner 2011). A unique eight-nucleotide sequence “tag” specific to each individual was added to one of the linkers ligated to each genomic DNA fragment. Tagged DNAs from eight individuals were then pooled, and the resulting mix sequenced using 76 bp paired-end reads. Data from each lane were split according to the individual-specific tag, mapped to the reference sequence (build 36) and SNPs called using Burrows–Wheeler Aligner (BWA) and SAMtools (Li and Durbin 2009; Li et al. 2009).
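The step of splitting each lane by individual-specific tag can be pictured with the following simplified sketch. It assumes, purely for illustration, that the eight-nucleotide tag sits at the start of each read and must match exactly; the tag sequences, sample labels and function name are hypothetical, and real indexed runs may instead deliver the tag as a separate index read.

```python
from collections import defaultdict

TAG_LENGTH = 8  # eight-nucleotide tag, as described in the text


def demultiplex(reads, tag_to_sample):
    """Assign reads to samples by exact match of the leading tag (assumed layout)."""
    bins = defaultdict(list)
    for read in reads:
        tag, insert = read[:TAG_LENGTH], read[TAG_LENGTH:]
        sample = tag_to_sample.get(tag, "unassigned")
        bins[sample].append(insert)
    return dict(bins)


# Hypothetical tags and reads
tags = {"ACGTACGT": "individual_1", "TTGCAAGC": "individual_2"}
lane_reads = ["ACGTACGTGGGATTC", "TTGCAAGCCCATGAA", "NNNNNNNNACGTACG"]
by_sample = demultiplex(lane_reads, tags)
```

Reads whose tag is not recognised are kept in a separate bin rather than discarded, so that the fraction of unassigned reads can be inspected as a quality indicator.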

Validation

Six of the 13 EGLN1 SNPs showing outlier F ST values in Kubachians were validated by a standard PCR/Sanger sequencing reaction on an ABI Genetic Analyzer 3730xl using genomic DNA. Two pairs of primers were used for sequencing: Pair1 FWD: GCTCTGGTGACAGGAATACTGAA; Pair1 REV: CTGTAGTCCTAGCACTTTGGGAG; Pair2 FWD: AAACAGGGATACAAAGCTTAGAGAA; Pair2 REV: AAGTTTCCAAGAACCTATCGAGG.

Quality checks

Raw SNP calls were calibrated by comparing genotype calls made using a range of coverage and allelic ratio thresholds with known genotypes at the 763 HapMap3 sites present in the sequenced regions in the 20 CEU individuals. The maximum concordance of the two datasets was 99.25%, achieved by limiting genotype calls to positions with at least 25× read depth and defining heterozygotes when the allelic ratio lay between 0.25 and 0.75. The reliability of these calibrated calls was then tested by comparing the power to call SNPs in the 12 CEU individuals in the 1000 Genomes Project. Assuming the 1000 Genomes calls to be correct, we applied four additional filters to provide a false discovery rate of 6.6% with a false negative rate of 6%. The filters (thresholds) were goodness of the mapping to the reference sequence (≥57), SAMtools SNP score (≥100), consensus score (≥100) and noise of the signal (≤0.01) (Fig. 1).
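The calibrated calling rule (at least 25× depth, heterozygotes at allelic ratios between 0.25 and 0.75) can be written as a short function. This is a sketch of the rule as stated, not the pipeline itself: the function name and genotype labels are hypothetical, the allelic ratio is assumed to be the non-reference fraction of reads, and treating the boundaries as inclusive is an assumption.

```python
def call_genotype(ref_reads, alt_reads, min_depth=25, het_low=0.25, het_high=0.75):
    """Return a genotype label, or None when depth is below the threshold."""
    depth = ref_reads + alt_reads
    if depth < min_depth:
        return None  # insufficient coverage: no call
    alt_ratio = alt_reads / depth
    if alt_ratio < het_low:
        return "hom_ref"
    if alt_ratio > het_high:
        return "hom_alt"
    return "het"


# Hypothetical read counts
low_depth = call_genotype(10, 10)
balanced = call_genotype(20, 20)
mostly_ref = call_genotype(38, 2)
mostly_alt = call_genotype(2, 38)
```

The four additional filters on mapping quality, SNP score, consensus score and signal noise would then be applied to the sites that pass this rule.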

Data analyses

Principal components analysis (PCA) was performed using the software Eigensoft (Price et al. 2006) on the 41 neutral region SNPs overlapping with the HapMap3 dataset. The following neutrality tests were applied to identify possible signs of positive selection using an in-house script on PHASEd (Stephens and Donnelly 2003) haplotypes: Fay and Wu’s H; Fu and Li’s D; Fu and Li’s F; Fu’s Fs; π; Tajima’s D (Tajima 1989; Fu and Li 1993; Fu 1997; Fay and Wu 2000). Some haplotypes were displayed in a median-joining network using Network software (Bandelt et al. 1999). The pairwise F ST between Adygei and each highland population was calculated using the R package Hierfstat (de Meeus and Goudet 2007). The ancestral/derived alleles and functional consequences of the variants were identified from the Ensembl database of genomic annotations (www.ensembl.org).
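To make the F ST statistic concrete, the sketch below computes a basic Wright-style F ST for a single biallelic SNP from the allele frequencies in two populations. It is deliberately simpler than the estimators implemented in Hierfstat: it applies no sample-size correction, weights the two populations equally, and uses a hypothetical function name.

```python
def wright_fst(p1, p2):
    """F ST = (H_T - H_S) / H_T for one biallelic SNP in two populations."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled
    if h_total == 0:
        return None  # monomorphic in both populations: F ST undefined
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population
    return (h_total - h_within) / h_total


fixed_difference = wright_fst(0.0, 1.0)    # alternative alleles fixed
no_difference = wright_fst(0.3, 0.3)       # identical frequencies
monomorphic = wright_fst(0.0, 0.0)
```

With small samples such as those analysed here, estimators that correct for sample size can give noticeably different values, which is one reason the empirical comparison against neutral regions is important.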