Background

Population structure is an important factor in medical genetic association studies and can compromise modern genomic methods if it is not properly accounted for. In Russia, population studies have mainly relied on Y-chromosome or mitochondrial markers, with microarray methods applied more recently [1,2,3]; these approaches did not allow the functional role of variants to be estimated. Some recent phylogeographic studies used whole-genome sequencing of samples from Russia to elucidate the history of migrations in Eurasia, but they relied on small samples from diverse populations [4, 5]. The Novosibirsk population (NVSB) is of particular interest because it is an example of a modern large-city population shaped by the political and economic events of the twentieth century, which changed the historical landscape of ethnic diversity across the former USSR through increasing urbanization, mass migration across the country and rapid demographic growth. In this study, we identified exome genetic variants for 39 individuals from Novosibirsk, Russia and compared them with previously published genome-wide data and with exomes of European populations from the 1000 Genomes Project to assess the level of exome-wide divergence and the extent of population stratification. Additionally, we tested allele frequency differences between our sample and a combined European dataset for medically and pharmacogenetically important variants to identify loci that may be important for national studies.

Methods

The study participants (n = 39) were from Novosibirsk and comprised people with monogenic diabetes (n = 10), healthy individuals (n = 7) and a cohort from a tick-borne encephalitis study (n = 22). The participants signed informed consent and identified themselves as ethnic Russians. The ethnicity of the participants was additionally checked prior to the analysis against data from the 1000 Genomes Project, and two samples identified as clear outliers (close to the Asian populations) were excluded from the analysis. Isolated DNA was enriched using the Agilent SureSelect V5 kit and sequenced on an Illumina HiSeq 4000 with 150 bp paired-end reads. After quality control with Trimmomatic [6], the reads were aligned with BWA-MEM [7] to the hg19 reference genome and processed with SAMtools [8]. Single nucleotide variants (SNVs) and indels were identified using GATK [9] according to the GATK Best Practices workflow for germline variation, with the sensitivity filter set to 99.9. The resulting VCF file was combined with 1000 Genomes Project genotypes [10] using bcftools [11] merge and filtered with VCFtools [11] to allow at most 10 missing genotypes per site (--max-missing-count 10), keeping only biallelic sites.
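The site-level filter described above (at most 10 missing genotypes, biallelic sites only) can be illustrated with a minimal Python sketch. This is not the VCFtools implementation; the function name `filter_sites`, the genotype encoding (alternate-allele counts with -1 for missing) and the input arrays are assumptions made purely for illustration.

```python
import numpy as np

def filter_sites(genotypes, n_alt_alleles, max_missing_count=10):
    """Return a boolean mask of sites that pass the filter.

    genotypes: (n_sites, n_samples) array of alternate-allele counts
               (0, 1, 2), with -1 marking a missing genotype
               (hypothetical encoding chosen for this sketch).
    n_alt_alleles: (n_sites,) number of alternate alleles listed per site.
    """
    genotypes = np.asarray(genotypes)
    n_alt_alleles = np.asarray(n_alt_alleles)
    # Count missing genotypes per site across all samples.
    missing_per_site = (genotypes < 0).sum(axis=1)
    # A biallelic site has exactly one alternate allele.
    is_biallelic = n_alt_alleles == 1
    # Keep sites that do not exceed the allowed number of missing calls.
    return (missing_per_site <= max_missing_count) & is_biallelic
```

Applying the mask to the merged genotype matrix (for example `genotypes[mask]`) would leave only the sites that enter the downstream analyses.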

We performed the analysis at two levels: with the Finnish (FIN) population included for the population genetic analyses (PCA, ADMIXTURE, Fst), and without the FIN population to test allele frequency differences for clinically and pharmacogenetically important variants. The FIN population was excluded from the second analysis because it is the most divergent European population and has a unique history [12]. To reduce the influence of tightly linked loci on the patterns of population structure, we applied linkage-disequilibrium pruning using PLINK v1.93 (Table 1). To estimate the patterns of population structure, we used Principal Component Analysis (PCA) as implemented in SNPRelate [13] with European (1000 Genomes Project) and previously published Russian Siberian populations [2]. The proportions of genetic ancestry were estimated using ADMIXTURE [14] for K = 2–8 (Table 1) and evaluated using the cross-validation error (CVE). To estimate the level of pairwise population differentiation (Fst, [15]) and test it statistically, we used the smartpca program of the EIGENSOFT package [16].
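As an illustration of how a genotype-based PCA produces sample coordinates and a proportion of variance per component, the following is a minimal NumPy sketch. It is not the SNPRelate implementation; it assumes a complete matrix of alternate-allele counts (no missing genotypes), standardizes each variant by its binomial standard deviation, and the function name `genotype_pca` is hypothetical.

```python
import numpy as np

def genotype_pca(genotypes, n_components=2):
    """PCA on a (n_samples, n_snps) matrix of 0/1/2 alternate-allele counts.

    Assumes no missing genotypes (an assumption of this sketch).
    Returns (sample coordinates, proportion of variance per component).
    """
    g = np.asarray(genotypes, dtype=float)
    p = g.mean(axis=0) / 2.0               # alternate-allele frequency per SNP
    keep = (p > 0) & (p < 1)               # drop monomorphic SNPs
    g, p = g[:, keep], p[keep]
    # Center by the mean genotype and scale by the binomial standard deviation.
    z = (g - 2 * p) / np.sqrt(2 * p * (1 - p))
    # Sample-by-sample genetic covariance matrix.
    cov = z @ z.T / z.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]      # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    return eigvecs[:, :n_components], explained[:n_components]
```

The second return value corresponds to the kind of "proportion of total variation" reported per principal component; the first gives the coordinates that would be plotted for each sample.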

Table 1 Number of variants and the filters applied to them for the various analyses

We annotated the variants using ANNOVAR [17] and the PharmGKB database [18] and then tested medically relevant (ClinVar [19]) and pharmacogenetically relevant variants for differences in allele frequencies between the NVSB population and the combined non-Finnish European (NFE) dataset with PLINK v1.93 [20] using 1 million permutations.
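The idea behind a permutation test for an allele frequency difference can be sketched as follows. This is an illustrative label-swapping test on the absolute frequency difference at a single variant, not PLINK's own test statistic or code; the function name `permutation_pvalue`, the default number of permutations and the add-one form of the empirical P-value are assumptions of this sketch.

```python
import numpy as np

def permutation_pvalue(alt_counts, in_group_a, n_perm=10_000, seed=0):
    """Empirical P-value for an allele frequency difference at one variant.

    alt_counts: per-individual alternate-allele counts (0, 1 or 2).
    in_group_a: boolean array, True for one group (e.g. NVSB) and
                False for the other (e.g. NFE).
    """
    rng = np.random.default_rng(seed)
    g = np.asarray(alt_counts, dtype=float)
    labels = np.asarray(in_group_a, dtype=bool)

    def freq_diff(lab):
        # Allele frequency is the mean alternate-allele count divided by 2.
        return abs(g[lab].mean() / 2 - g[~lab].mean() / 2)

    observed = freq_diff(labels)
    hits = 0
    for _ in range(n_perm):
        # Shuffle group labels while keeping the group sizes fixed.
        if freq_diff(rng.permutation(labels)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)
```

With N permutations the smallest attainable P-value is 1/(N + 1), which is why a large number of permutations is needed before any multiple testing correction can be meaningful.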

The average coverage of the studied exomes varied from 47.7X to 71.3X. In total, we identified 136,276 SNVs and 14,464 indels in the studied dataset. Merging with data from the 1000 Genomes Project produced a dataset of overlapping variants consisting of 117,010 SNVs and 5989 indels.

Results

In the population genetic analysis, the first principal component accounted for 0.77% of the total variation and separated all the populations (Fig. 1a) except the closely related Utah residents of Northern and Western European ancestry (CEU) and British (GBR). The second principal component accounted for 0.36% of the total variation and mostly separated the Tuscan (TSI) and Spanish (IBS) samples. The Novosibirsk population (NVSB) was placed between the Finnish (FIN) samples and the CEU and GBR samples and was clearly distinguished from them. The Russian Siberian populations from a previous microarray-based study [2], represented by a similar Caucasian Siberian population (Russian_NSK) and partially isolated Siberian Starovers (Old Believers, Russian_STV), could not be distinguished from each other or from the samples in our study (NVSB).

Fig. 1

a Principal Component Analysis (Russian_NSK and Russian_STV are Russians from Novosibirsk and Siberian Starovers, respectively, from [2]). b Observed and expected P-value distributions for allele frequency differences between NVSB and the combined NFE sample. c Results of the ADMIXTURE analysis for K = 2–5

In the ADMIXTURE analysis, the lowest cross-validation error was obtained at K = 2, which captured the divergence of FIN from the other European populations. NVSB demonstrated a higher proportion of the ancestral Finnish-related genetic component at K = 2 and K = 3 relative to the other populations. A new cluster (green) consisting of TSI and IBS appeared at K = 3, and an additional ancestral component emerged at K = 4, clearly separating NVSB (Fig. 1c, purple). Lastly, at K = 5, IBS was separated from the rest of the samples.

The pairwise Fst values between all populations except CEU and GBR (P-value = 0.048) were highly significant (P-value < 1.1656e-11), albeit low (Fst = 0.002–0.013). The NVSB population demonstrated the highest level of differentiation from TSI (Fst = 0.009) and the lowest from GBR and CEU (Fst = 0.005). The test for allele frequency differences between the NVSB and NFE populations demonstrated pervasive inflation of P-values across numerous loci (Fig. 1b).
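To make the Fst values above more concrete, the following sketch computes a pairwise Fst from allele frequencies using Hudson's estimator combined across variants as a ratio of sums. This estimator is chosen only for illustration and is not necessarily the one referenced in [15] or the exact computation performed by smartpca; the function name `hudson_fst` is hypothetical, and `n1` and `n2` are assumed to be the numbers of sampled chromosomes (twice the number of diploid individuals).

```python
import numpy as np

def hudson_fst(p1, n1, p2, n2):
    """Pairwise Fst from per-variant allele frequencies in two populations.

    p1, p2: arrays of sample allele frequencies per variant.
    n1, n2: numbers of sampled chromosomes in each population
            (assumed constant across variants in this sketch).
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    # Numerator: squared frequency difference with a sampling correction.
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    # Denominator: probability that two alleles, one from each population, differ.
    den = p1 * (1 - p2) + p2 * (1 - p1)
    # Combine variants as a ratio of sums rather than an average of ratios.
    return num.sum() / den.sum()
```

Because of the sampling correction, two samples drawn from the same population give a value close to zero that can be slightly negative, whereas fixed differences at every variant give a value of 1.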

Among the 452 pharmacogenetically and 210 medically important variants, we found 3 and 7 variants, respectively (Table 2), which showed significant allele frequency differences between the NVSB and NFE populations after multiple testing correction (BH adjusted P-value < 0.05). The most significant differences in allele frequencies were attributed to the FCGR3B, TYR, OCA2, FABP1 and SLC4A1 genes.
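The Benjamini-Hochberg (BH) adjustment mentioned above can be sketched in a few lines. This is a generic illustration of the procedure rather than the code used in the study; the function name `bh_adjust` is hypothetical.

```python
import numpy as np

def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                  # indices of P-values, ascending
    # Scale each sorted P-value by m / rank.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest P-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted
```

A variant would then be reported as significant when its adjusted value falls below the chosen threshold, for example `bh_adjust(pvals) < 0.05`.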

Table 2 Genetic variants from PharmGKB (P-value < 0.01) and ClinVar (BH adjusted P-value < 0.05) databases which demonstrated highly significant differences in allele frequency between NVSB and NFE

Discussion

In this study, we used an exome-wide dataset for the first time to study the population structure of the Caucasian Siberian population of Novosibirsk, a large Russian city. The exome-wide survey of the Novosibirsk population demonstrated its genetic congruence with the previously published Russian dataset, including the partially isolated Siberian Starovers, despite the dramatic migration and demographic changes of the previous century. The Caucasian Novosibirsk population is quite homogeneous (Fig. 1a) and significantly differentiated from the other European populations of the 1000 Genomes Project. It shows a relatively higher Finnish component which, according to the ADMIXTURE results (Fig. 1c), is presumably ancestral rather than a result of recent migrations. This genetic differentiation, although low in absolute Fst values, should be taken into account in association studies. We identified 10 medically relevant SNVs with statistically significant allele frequency differences between the NVSB and NFE populations. These include rs2241883 in the FABP1 gene, previously associated with polycystic syndrome [21] and toxicity of fenofibrate [22]; the rs1801274 variant in the FCGR2A gene, shown to be important for the efficiency of trastuzumab in breast neoplasms [23]; and the rare rs17879961 variant in the CHEK2 gene, which is reliably associated with predisposition to breast and colorectal cancer [24] and showed an elevated frequency in NVSB. These variants should be studied in the future in an expanded dataset with associated clinical data.

Conclusion

The study reports for the first time an exome-wide comparison of a population from Russia with European samples and emphasizes the importance of population studies with medical annotation of variants.