Background

Marine fish species are often distributed across a variety of habitats, which makes these organisms interesting models to study the interaction between gene flow and natural selection. Atlantic cod (Gadus morhua L.) exploits different ranges of salinity and temperature across an extensive geographical distribution. However, due to a large effective population size and to gene flow between habitats [1], a weak population genetic structure is predicted.

Two extensively studied populations of Atlantic cod are the migratory Northeast Arctic cod (NEAC) population and the stationary population known as coastal cod (CC) [2, 3]. Generally, migratory individuals use deeper and more offshore habitats, and the NEAC animals migrate long distances from Lofoten, Norway, to the feeding areas in the Barents Sea. In contrast, stationary individuals usually occupy the Norwegian coastal water habitats during their entire life cycle. At present, NEAC is the largest population of Atlantic cod, and is located in the Barents Sea [3]. However, NEAC individuals can migrate to areas which are also inhabited by CC individuals (Fig. 1). Accordingly, breeding between NEAC and CC individuals occurs, but, to date, the degree of gene flow and interbreeding between both populations is not well known.

Fig. 1
figure 1

Geographical distribution of the wild CC (black points and light grey shading) and NEAC (dark grey shading) populations and the hybrid zone (where the distributions overlap) of individuals used for the National cod breeding program of Nofima in Tromsø (Norway)

In addition to dissimilarities in migratory and feeding features, there are also clear differences in growth rate and age at maturity between these two populations [2]. Moreover, the CC population is less susceptible to viral nervous necrosis (VNN) than the NEAC population. Natural selection can explain these results, with the NEAC population being adapted to the Barents Sea environment, while the CC population is more resistant to pathogens that are present in a coastal environment [4]. Traditionally, differences in the structure of their otoliths have been used to differentiate individuals from either the NEAC or CC populations [5]. However, genetic differences between these two populations have also been identified (reviewed by [2]). On the one hand, differences have been found regarding blood type E [6], haemoglobin (Hb-I) alleles [7], 10 microsatellite loci [8], and alleles at the membrane protein gene pantophysin (Pan I) (now named synaptophysin like 1 or SYPL1) [9]. At the Pan I locus, the frequency of the Pan IB allele is above 90% in the NEAC population, whereas the Pan IA allele predominates in the CC population [10]. On the other hand, little or no genetic differences between these populations have been detected at most allozymes [11], microsatellites [8] and at the mitochondrial cytochrome b locus [12]. It has been suggested that, during the early stages of divergence, genetic differentiation may have been restricted to a few specific genomic locations, called genomic islands, while the majority of the genome remained homogenised because gene flow is still in progress [13]. This has been supported by theoretical and empirical investigations [14]. Involvement of other mechanisms has been suggested, such as chromosomal rearrangements including inversions, divergence hitchhiking, and processes that promote the genomic co-localisation of genes [15]. In cod, several authors [3, 15,16,17,18,19,20,21,22,23,24,25,26] using different types of genomic data showed that population differentiation occurred at four discrete islands of genomic divergence located on different chromosomes (Table 1).

Table 1 Genomic data used and chromosomes on which genomic divergence was detected in previous studies

Recent developments have enabled next-generation sequencing technology to compare individual complete genomes with high precision. The objectives of our study were to identify genomic islands of divergence and to propose an approach to measure the strength of the natural selection within each genomic island in the NEAC and CC populations and their crosses using whole-genome single nucleotide polymorphism (SNP) data.

Methods

Genomic data

Genomic information from 111 animals belonging to the National cod breeding program of Nofima in Tromsø (Norway) was used. More specifically, year-classes (YC) from 2003, 2004 and 2005 were formed as progeny of wild Atlantic cod. Since the generation interval of Atlantic cod is 3 years, the data from YC 2006, 2007 and 2008 represent the first generation (F1) of the progeny of the selected fish from YC 2003, 2004 and 2005, respectively. Accordingly, the data from YC 2009 represent the second generation (F2) of the progeny of the selected fish from YC 2006. Each sire was mated to two dams, while each dam was mated to one sire. Furthermore, some sires and dams were also mated across YC to create genetic links between YC. The base population of the breeding program consisted of fish that were sampled from different geographical areas along the coast of Norway. The dataset consisted of fish with pure (CC and NEAC) and crossed origins (see [27, 28] for more details). Genomic DNA from these individuals was extracted for resequencing using the Truseq Library prep kit from Illumina (Illumina, San Diego, USA). Paired-end sequencing (2 × 100 nucleotides) was carried out using an Illumina HiSeq 2000 instrument to generate ~ 22× genome coverage for each sample. Reads were identified and filtered as follows: short reads were aligned against the cod genome assembly v 1.94 [29] using Bowtie 2 [30]. SAMtools [31] was then used to identify and retain uniquely mapped reads. SNPs were called using the FreeBayes (v0.9.18) software [32]. Finally, SNPs that were informative in more than 20% of the individuals and with a MAF higher than 0.05 were kept for further analysis. Individuals were classified based on their allelic differences at the Pan I locus [23] and their geographical origins. The final dataset included 93 animals (9 CC stationary individuals, 50 NEAC migratory animals, and 34 CC × NEAC crossed individuals) and 3,123,434 autosomal SNPs. Figure 2 shows the number of SNPs analysed for each chromosome.

Fig. 2
figure 2

Number of SNPs analysed per chromosome

Genome-based estimates of coancestry

Following Malécot [33], it is possible to calculate the coancestry coefficient \(f_{ij}\) between individuals \(i\) and \(j\). Genomic estimates of coancestry for each chromosome were obtained using the SNPs genotyped on each chromosome. Accordingly, genomic estimates of coancestry within chromosomes were obtained as follows. Because neighbouring SNPs showed significant coancestry differences, such that it was difficult to identify specific regions with a higher or lower coancestry based on individual SNPs, coancestry was estimated over sliding 200-SNP windows. Following this approach, the noisiness of single-locus coancestry estimates can be reduced and the precision of the estimates can be increased by combining data from several adjacent SNPs. This was based on the method proposed by Weir et al. [34]. For each chromosome, the first sliding window was identified by taking the first 200 SNPs at the beginning of the chromosome. Subsequently, the window slides across the chromosome by moving one SNP to the right, until the end of the chromosome is reached, maintaining 200 SNPs in each window. For each window, coancestry was estimated by taking the average of all coancestry values of the SNPs lying in that window.

Genome-based estimates of coancestry were obtained for the CC, NEAC, CC × NEAC populations, and we also calculated the coancestry coefficient \( f_{{{\text{CC}}\;{\&}\;NEAC}} \) between the individuals of both populations.

Coefficient of genetic differentiation

The Genepop software version 4.3 [35] was used to calculate Wright’s FST [36] between the CC and NEAC populations. FST was calculated by taking all the SNPs from each chromosome into account and also by using the same sliding window approach as above with a window size of 200 SNPs on each chromosome.

Private allele frequency

Using the same sliding window approach, allele frequencies that were higher than 0 in the CC population and equal to 0 in the NEAC population, and vice versa, were averaged.

Analysis of the statistical significance

The coancestry, the  genetic differentiation coefficients and the private allele frequency for all chromosomes (except for chromosomes 1, 2, 7 and 12, which clearly showed selection signatures; see Figs. 3 and 5) were used to establish the distribution of these values under the null-hypothesis. Significance thresholds were estimated as the 0.01% highest and lowest values for coancestry and as the 0.01% highest values for genetic differentiation and private allele frequency. Since the sliding windows overlap, the statistical tests were not completely independent, which reduced the effective number of tests performed, but this was not expected to affect the significance threshold. For instance, the threshold for the top 10 out of 1 million tests is approximately the same as for the top 1000 out of 100 million tests (for a large number of tests as was the case here). Although a P-value of 0.01% is very stringent, no explicit multiple testing correction was performed, such that these should be interpreted as nominal P-values. Nominally significant values are indicated in grey on Figs. 4, 6 and 7.

Fig. 3
figure 3

Mean coancestry coefficients (\(f\)) across chromosomes. Bars indicate standard errors

Fig. 4
figure 4figure 4

Patterns of coancestry (\(f\)) on chromosomes 1 (a), 2 (b), 7 (c) and 12 (d) for the CC, NEAC, CC × NEAC and \( f_{{{\text{CC}}\;{\&}\;NEAC}} \) populations. Genomic islands with significantly higher or lower coancestry are shown in grey

Principal components analysis

The R-package Adegenet package version 2.0.1 [37] was used to calculate and plot the first two principal components based on the SNP data available for the genomic islands on chromosomes 1, 2, 7 and 12 and using all the SNP information for the other chromosomes.

An additional analysis was performed by classifying the individuals based on the PCA results. This analysis was performed for each genomic island on chromosomes 1, 2, 7 and 12. With this classification, it was possible to test the hypothesis that all the segregating SNPs in the group of homozygous animals for the rearranged segment were not segregating in the group of homozygous animals for the non-rearranged segment, and vice versa. In theory, SNPs that segregate in the group of animals with the rearranged segment should not segregate in the group with the non-rearranged segment because the rearrangement occurred on a single chromosome that later accumulated new mutations. However, these new SNPs do not segregate in the group of animals with the non-rearranged segment because the rearrangement prevents recombination between rearranged and non-rearranged segments.

Another analysis based on PCA results was performed to evaluate the possibility that the sampled fish resulted from recent crossbreeding between the CC and NEAC populations. In this situation, a rearranged segment on one chromosome would not be independent of the rearranged segment on the other chromosome. This is important because otherwise the use of the Pan I locus [23] to classify NEAC and CC individuals (as mentioned above) might bias the results on all the chromosomes and not just on chromosome 1. To test for this, we calculated the correlation between the rearranged alleles on different chromosomes, which was expected to be positive if the fish were recent crossbreds and vice versa.

Estimation of selection pressure

The inbreeding coefficient of the two populations relative to the population from which they diverged for each locus \(s\), \(F_{s}\), can be estimated as:

$$F_{s} = \frac{{\left( {q_{1} - q_{2} } \right)^{2} }}{{q_{1} + q_{2} - 2q_{1} q_{2} }},$$

where \(q_{1}\) and \(q_{2}\) are the allele frequencies in the CC and NEAC populations, respectively. The average inbreeding (\(F\)) for all the chromosomes (except chromosomes 1, 2, 7 and 12; see Figs. 3 and 5) is calculated and \(Nm\) (i.e. the product of effective population size \(N\) by migration rate \(m\)) is estimated as:

$$Nm = \frac{{(1/\overline{F)} - 1}}{4}.$$
Fig. 5
figure 5

Mean genetic differentiation coefficients (FST) across chromosomes. Bars indicate standard errors

Accordingly, the average inbreeding within the genomic regions on chromosomes 1, 2, 7 and 12 (see Figs. 4, 6 and 7), is calculated. Based on these averaged inbreeding values and the above \(Nm\) value, it is possible to estimate \(Ns\) (i.e. the product of effective population size \(N\) by selection pressure \(s\)) as:

$$Ns = \frac{{1 - \bar{F} - 4Nm\bar{F}}}{{2\left( {\bar{F} - 1} \right)}}.$$
Fig. 6
figure 6

Genetic differentiation coefficient (FST) between populations CC and NEAC for chromosomes 1 (a), 2 (b), 7 (c) and 12 (d). SNPs with a significantly higher differentiation coefficient are shown in grey

Fig. 7
figure 7

Private allele frequency in populations CC (left column) and NEAC (right column) on chromosomes 1 (a), 2 (b), 7 (c) and 12 (d). Markers with a significantly higher private allele frequency are shown in grey

\(Ns\) are calculated separately for each of the chromosomes, i.e. 1 (\(Ns_{1}\)), 2 (\(Ns_{2}\)), 7 (\(Ns_{7}\)) and 12 (\(Ns_{12}\)). Accordingly, \(Ns\) was calculated for each individual SNP (as mentioned above) and also for the entire rearranged segment based on the separation of individuals in rearranged and non-rearranged genotypes obtained from the PCA analysis. If \(Ns\) > 1, selection is strong, and if \(Ns\) < 1, selection is weak relative to genetic drift/inbreeding [38]. More details about the estimation of the selection pressure are provided in "Appendix".

Results

Estimates of coancestry

Mean coancestry across the genome did not differ significantly between CC (0.712 ± 0.004) and NEAC individuals (0.718 ± 0.006) (P = 0.425). The mean coancestry for the CC × NEAC individuals was smaller (0.702 ± 0.003), and the corresponding value of \(f_{{\text{CC}} \; \& \; NEAC}\) was 0.689 ± 0.011.

Figure 3 shows that the largest differences in coancestry for the groups of evaluated animals were on chromosomes 1, 2, 7 and 12.

When coancestry values were smoothed by using a sliding window approach, differences in coancestry within chromosomes became clearer at specific genomic islands on chromosomes 1, 2, 7 and 12 (Fig. 4). A block of high coancestry in both the CC and NEAC populations was observed between 9 and 27 Mb on chromosome 1. In the same genomic island, the values of \(f_{{\text{CC}} \; \& \; NEAC}\) indicate a genomic region that has a much lower coancestry coefficient. The patterns of coancestry on chromosomes 2, 7 and 12 in the NEAC population show islands of high coancestry between 18 and 25 Mb, 13 and 24 Mb, and 0 and 13 Mb, respectively. In these genomic regions on chromosomes 2 and 7, the values of \(f_{{\text{CC}} \; \& \; NEAC}\) indicate regions of decreased coancestry.

More specifically, the genomic islands were between 8,819,361 bp and 27,328,570 bp, 18,352,060 bp and 25,309,797 bp, 13,344,692 bp and 23,924,283 bp and between 446,989 bp and 12,636,074 bp on chromosomes 1, 2, 7 and 12, respectively.

No pronounced blocks of coancestry were observed on the remaining chromosomes (see Additional file 1: Figure S1).

Estimation of the genetic differentiation F ST

The mean genetic differentiation (FST) value across the genome between the CC and NEAC populations was 0.062 ± 0.020. Using the SNPs that were genotyped on each chromosome, more precise genetic differentiation coefficients were obtained across the chromosomes (Fig. 5). The highest values of genetic differentiation between the CC and NEAC populations were observed on chromosomes 1, 2, 7 and 12.

Genetic differentiation coefficients for chromosomes 1, 2, 7 and 12 are in Fig. 6. FST values within each chromosome corroborate the genomic islands that were detected with the coancestry estimates. The most clear-cut genomic island with high FST values was on chromosome 1 also between 9 and 27 Mb, which indicates that alleles at this genomic island differ significantly between the CC and NEAC populations. Similar results were observed for chromosomes 2, 7 and 12.

No marked differences in genetic differentiation were observed for the remaining chromosomes (see Additional file 1: Figure S2).

Private allele frequency

The analysis of private allele variants showed that the most differentiated chromosomes were also 1, 2, 7 and 12 (Fig. 7). However, patterns of private allele frequency contrasted with the coancestry patterns. For example, on chromosome 1, the most evident block of high coancestry was found for the NEAC population, whereas the most clear-cut block of private allele frequency was found for the CC population. This means that NEAC is the population with the original inversion. All initial alleles not carried by the rearranged block were lost in NEAC and most private alleles are therefore in CC. Similarly, the frequencies of private alleles for the regions on chromosomes 2, 7, and 12 were significantly increased in the CC population compared to the NEAC population. The reason is that if the private allele frequency is higher than 0 in the CC population, the other allele is fixed in the NEAC population.

No marked differences in patterns of private allele frequency were detected on the remaining chromosomes (see Additional file 1: Figure S3).

Principal components analysis

Principal components analysis showed differences between the two populations that formed completely separated clusters for chromosome 1, but these differences were less clear for chromosomes 2, 7 and 12 (Fig. 8).

Fig. 8
figure 8

Principal components analysis within the genomic islands of chromosomes 1 (a), 2 (b), 7 (c) and 12 (d) for populations CC and NEAC

No marked differences in the results of the principal components analysis were observed for the remaining chromosomes (see Additional file 1: Figure S4).

The PCA analysis clearly separated three groups of genotypes: (1) homozygous for the rearranged segment (i.e. the group of individuals that contains most of the NEAC individuals in the PCA analysis), (2) heterozygous for the rearranged segment (i.e. the group that includes both NEAC and CC individuals), and (3) homozygous for the non-rearranged segment (i.e. the group of individuals that contains most of the CC individuals in the PCA analysis) (Table 2).

Table 2 Classification of individuals after principal components analysis into homozygotes for the rearranged and non-rearranged segment and heterozygotes

Table 3 shows the total number of SNPs within each genomic island per chromosome and the segregating loci within each genomic island for the groups of animals with the rearranged and non-rearranged segment after the classification obtained from PCA. It is important to note that some SNPs segregate in both groups.

Table 3 Total number of SNPs in the genomic islands, number of SNPs segregating in the group of animals with the rearranged segment, the group of animals with the non-rearranged segment and in both groups for each chromosome

The coefficients of correlation between the rearranged and non-rearranged segments are in Table 4. Three of the four correlations were negative, which indicated that CC fish that carried one rearranged segment on one chromosome were unlikely to carry another rearranged segment on another chromosome. The values of the coefficient of correlation were not low, and the highest values were obtained for the negative correlation coefficients, which indicates that selection for a rearranged segment would not result in an increased fraction of rearranged segments on other chromosomes.

Table 4 Coefficients of correlation between the rearranged and non-rearranged segments

Estimates of migration rates and selection pressures

The mean value of the product of effective population size \(N\) by migration rate \(m\), i.e. \(Nm\) was equal to 4.37, which means that the contribution of migration is substantial compared to that of genetic drift. Based on the differences in allele frequencies for each SNP within each genomic island, the estimates of the product of effective population size \(N\) by selection pressure \(s\) were \(Ns_{1} = 12.01\), \(Ns_{2} = 2.26\), \(Ns_{7} = 3.84\), and \(Ns_{12} = 0.68\), for chromosomes 1, 2, 7 and 12, respectively, i.e., selection was strong relative to genetic drift in the genomic islands detected on chromosomes 1, 2 and 7, but not on chromosome 12 (Fig. 9). On chromosome 1, the same \(Ns\) values were found for many SNPs along the genomic island (Fig. 9a). In addition, the highest values are quite far from the highest values obtained for the other three chromosomes. From the 3200 highest \(Ns\) values found for this chromosome, 23 and 1616 correspond to private alleles in the CC and NEAC populations, respectively.

Fig. 9
figure 9

Estimates of the product of effective population size by selection pressure (\(Ns\)) for chromosomes 1 (a), 2 (b), 7 (c) and 12 (d)

The resulting \(Ns\) values for the entire rearranged regions were equal to 11.87, 15.42 and 2.86 for chromosomes 2, 7, and 12, respectively. Chromosome 1 was removed from this analysis because the genotypes at the Pan I locus (located on chromosome 1) were used to confirm the population identification, which would bias the allocation of non-rearranged regions to the CC and NEAC population. Thus, the selection pressures for the entire rearranged regions were higher than that for the individual SNPs within these regions.

Discussion

In this study, genomic islands of divergence were identified and estimates of selection pressures in these genomic islands were obtained in two populations of Atlantic cod. The results indicate that natural selection is more important than genetic drift on these detected genomic islands of divergence.

The genomic islands of divergence identified on chromosomes 1, 2, 7 and 12 showed differences in \(f\), which extended across 18, 7, 11 and 13 Mb, respectively for each chromosome, which suggest that these islands are considerably more related within populations and less related between populations than the rest of the genome (Fig. 4). Results based on the estimation of FST showed that the degree of divergence is higher for the chromosomes that carry these genomic islands (Fig. 5). However, it has been suggested that some factors can mask the detection of small genomic islands of divergence [39]. The mean whole-genome FST (0.062) estimated here agrees well with FST values from previous studies, which range from 0.024 to 0.065 [15, 24]. The increased frequencies of private alleles in these regions confirmed the general result that there is more population differentiation in these regions, which contain larger rearrangements that repress meiotic recombination in the NEAC × CC crosses [3, 21, 22].

The method used to estimate the migration rate between the two populations is well known and was previously described in the literature (e.g. [38]). A major assumption is that the allele frequencies have reached a balance between genetic drift (which is responsible for allele frequencies drifting apart) and migration (which reduces the differences in allele frequencies). In the case that these two forces are not (yet) in equilibrium and the populations are still drifting apart, migration will be overestimated, with the consequent implications in the estimation of the selection pressures. The estimation of the selection pressures in the genomic islands of divergence is a novel alternative approach, and is quite specific for the situation of a rearranged region. Assuming that there is no recombination between the non-rearranged and rearranged regions, selection will act on the rearrangement as a whole and thus cause the frequency of the rearrangement to differ between the populations.

The difference in frequency at individual SNPs is caused by the difference in the frequency of the rearrangement but the former is usually smaller than the latter. This is because the rearrangement will often carry the allele that is most common on the non-rearranged chromosome. Consequently, the frequencies of individual SNPs lead to underestimate the strength of selection. If some recombination occurs, the selection pressure on the causal loci will be stronger than that on the neutral loci, and the average selection pressure across the loci will further underestimate the selection pressure at the most important alleles that drive the divergence in these regions.

Assuming that a particular rearrangement only occurs once and does not recombine with the non-rearranged region, all the chromosomes that carry this rearrangement will descend from one original chromosome. Consequently, they will all carry the same alleles at each SNP within the rearrangement, and the original rearranged haplotype is thus monomorphic. Accordingly, the alternative region to the rearranged region will only occur in the wild-type haplotypes, and will be private to the wild-type haplotypes. This means that an increased private allele frequency points to wild-type haplotypes, which corresponds to the CC population, and the NEAC population corresponds to the rearranged haplotypes. It may be noted that the genome of NEAC individuals also has regions of increased private allele frequencies, which are due to mutations that occurred after the rearrangement. These later mutations may have drifted to considerable allele frequencies due to the originally low-frequency of the rearrangement, i.e. due to a hitchhiking effect.

The principal component analysis within the genomic islands revealed three genotypic groups (Fig. 8). For the genomic island on chromosome 1, these genotype groups match perfectly with the purebred NEAC and purebred CC population classification. For the other genomic islands, there is also a strong relationship between the population to which the individual belongs and the principal component classification. The classification of individuals into groups with and without the rearrangement from the principal components analyses highlighted that several SNPs segregate in both groups. There are three possible explanations for this. The first interpretation is that some individuals were erroneously allocated to the groups with and without the rearrangement. However, the principal components analysis showed a clear difference between the three groups. A second explanation is that the rearrangement was not a single event but happened several times. However, it seems unlikely that such a rearrangement occurred more than once at the same position in the genome. Finally, the most likely explanation is that a rearrangement will cause reduced recombination between the non-rearranged and rearranged alleles [22], but some recombinations will still occur, at a notably reduced rate, which is sufficient to introduce some segregating SNPs from the non-rearranged region into the rearranged region, and vice versa [40]. The analysis of the private allele frequency indicated that the ancestral alleles occur mainly in the CC population, and that the derived state occurs mainly in the NEAC population. After the rearrangement event, the recombination rate is reduced between the non-rearranged and rearranged alleles. Thus, the rearranged region could accumulate mutations that increase fitness in the habitat of the NEAC population, which is to the benefit of the migratory life-style. However, in other parts of the genome the allele frequencies of such mutants would be decreased due to the sustained introduction of migratory alleles at a rate of \(Nm\) ~ 4 per generation.

Currently, \(Ns\) values between 2.26 and 15.42 suggest that haplotypes are diverging, which results in two populations that coexist in an overlapping habitat but with different migratory behaviours. In fact, it has been suggested [16, 26] that the rearrangement on chromosome 1 is associated with migratory behaviour, whereas the rearrangements on chromosomes 2, 7 and 12 are most likely associated with temperature and/or salinity. Accordingly, the combination of these rearrangements could also have an effect on the migratory behaviour, temperature and/or salinity. The above situation may be an intermediate stage in the process of speciation of the two populations, where the fitness of the hybrid offspring is reduced, and consequently, the behavioural mating strategies that avoid matings with the wrong individuals will be favoured by natural selection. The latter results in a decreased migration rate and increases the divergence in the rest of the genome, which will promote the process of sympatric speciation throughout the genome. Hence the inverted chromosomal regions form the origin of a barrier to gene flow among populations that share a common habitat.

Conclusions

Whole-genome SNP data for Atlantic cod were used to investigate genomic islands of divergence in the CC and NEAC populations. Our results show that gene flow between the populations was sufficient to limit divergence between the two populations except at four genomic islands. The high resolution of the SNP data used in this study enabled us to precisely locate four genomic islands of divergence on chromosomes 1, 2, 7 and 12 in the NEAC and CC cod populations and to estimate the natural selection pressures that lead to their divergence. The estimates of the selection pressures showed that natural selection was substantially more important than genetic drift in shaping the diverged regions on chromosomes 1, 2, 7, and 12.