Introduction

The harbour porpoise (Phocoena phocoena) is a small cetacean species occurring in coastal habitats in the Northern Hemisphere (Fig. 1; Fontaine et al. 2014, 2017; Galatius et al. 2012; Lah et al. 2016). It comprises five described subspecies and potentially additional not yet formally described ones, i.e. P. p. vomerina (North Pacific), unnamed (Pacific), P. p. meridionalis (South Atlantic Ocean and Iberian Sea), P. p. relicta (Black Sea) and P. p. phocoena (North Atlantic), of which some display locally different traits in morphology and behaviour (ASCOBANS 2021; Carlén et al. 2018; Fontaine et al. 2017; Galatius et al. 2012; NAMMCO 2019). The North Atlantic harbour porpoise (P. p. phocoena) is distributed from Canada and Greenland to the North and the Baltic Sea. Within these regions, the sea basins exhibit a wide range of temperature, currents, salinity and other water conditions representing a variety of marine habitats, potentially posing different selective pressures to the species. The porpoises across the North Atlantic show variation in diet (Aarefjord et al. 1995; Hammond et al. 2013; Víkingsson et al. 2003), differences in habitat use and activity patterns (Nuuttila et al. 2017), fine scale morphological differences (Galatius et al. 2012; Härkönen et al. 2013; Viaud-Martínez et al. 2007) and significant isolation-by-distance (Lah et al. 2016). Based on these differences, local ecotypes have been postulated within the North Atlantic (Fontaine et al. 2010, 2017; Hammond et al. 2020; Olsen et al. 2022; Santos et al. 2004) and the Baltic Sea (Celemin et al. 2023; Galatius et al. 2012; Lah et al. 2016). The Baltic Sea was colonized from the North Sea during the end of the last glacial period (Sommer et al. 2008). Within the Baltic Sea, several small basins are separated by underwater ridges and are further fragmented by multiple small islands and headlands. 
Today, the Baltic Sea differs in temperature and salinity from the North Sea/North Atlantic, as it has transitioned from freshwater to brackish and marine water since its formation in the Pleistocene (~ 15 kyr ago) (Paasche et al. 2015; Varjopuro et al. 2014). This could have promoted the divergence of harbour porpoises in the Baltic Sea from populations of the North Sea/North Atlantic (Celemin et al. 2023; Lah et al. 2016; Wiemann et al. 2010).

Fig. 1

Distribution map of the harbour porpoise in the North Atlantic and adjacent waters. Occurrence of the species is indicated by orange colouring. Black circle sizes indicate the relative number of samples per location (Canada (CA), Iceland (ICE), North Sea (NOS), Skagerrak (SKA), Kattegat (KAT), Belt Sea (BES), Inner Baltic Sea (IBS), and as outgroup Western Black Sea (WBS)). Map taken from IUCN Red List (IUCN (International Union for Conservation of Nature) 2012. Phocoena phocoena. The IUCN Red List of Threatened Species. Version 2019-3)

Across its North Atlantic distribution, the abundance of the harbour porpoise varies greatly. The European Atlantic Shelf is estimated to be inhabited by ~424,000 individuals (Hammond et al. 2018), with 43,000 animals in Icelandic coastal waters (Gilles et al. 2020) and approximately 14,400 animals estimated in the Belt Sea including Kattegat and the SW Baltic Sea (Gilles et al. 2023). The Baltic Proper porpoise population has much lower estimates (~500 animals in summer) and is considered critically endangered (Benke et al. 2014; Carlén et al. 2018; Carlström et al. 2023; Hammond et al. 2020; SAMBAH 2016; Scheidat et al. 2008).

With their ocean-wide distribution, cetaceans encounter a number of different threats from humans utilizing the marine habitat (McCauley et al. 2015). Humans contribute to marine debris (Siebert et al. 2001; Unger et al. 2017), cause bycatch (Kesselring et al. 2017; Omeyer et al. 2020) and noise pollution, especially in the form of offshore work and shipping routes (Dyndo et al. 2015; Farmer et al. 2018; Guzman et al. 2013; Holt et al. 2017; Rolland et al. 2012; Wisniewska et al. 2018). These disturbances could impose barrier effects, leading to a decrease in or fragmentation of populations (Dungan et al. 2016; Fontaine et al. 2007).

Harbour porpoises have a relatively high nutrient demand because of their small size, their reproductive cycle and the fact that they lactate while reproducing and often have a calf every year (Kesselring et al. 2017; Wisniewska et al. 2016). They mainly inhabit coastal areas, which increases their risk of negative exposure to human threats (Fontaine et al. 2010).

To assess the status of harbour porpoise populations, different techniques have been applied: acoustic monitoring, tagging and aerial surveys to assess distribution, abundance and habitat use (Lonergan et al. 2011; Pike et al. 2020; SAMBAH 2016; Sveegaard et al. 2015), as well as regular autopsies on stranded and bycaught specimens to provide information on health, pathology, age structure, and reproductive status (IJsseldijk et al. 2020; Kesselring et al. 2017; Siebert et al. 2001, 2006, 2020). In the context of assessment, it is particularly important to understand the genetic structure and divergence among populations (McMahon et al. 2014; Waples et al. 2018). Cetaceans are highly mobile in a seemingly continuous habitat, yet they follow abundance of their food, and they may be constrained by further specific habitat characteristics (Baker et al. 2013; Jefferson 2014; Van Cise et al. 2019). Hence, genetic structure may be difficult to understand and subtle structuring may be overlooked easily (Leslie and Morin 2016).

Cetacean sampling is often opportunistic relying on bycaught or stranded specimens. Sample material may hence often be degraded and not all sampling regions are equally represented (ASCOBANS 2002; Durban et al. 2005; Luca et al. 2009; Pierce et al. 2007). When relying on opportunistic sampling, it is even more important to have markers that show high resolution even with little genetic material or a low number of samples per geographic region. High-quality genomes for whales and dolphins have only recently been generated and made accessible (e.g. Autenrieth et al. 2018; Gao et al. 2023; Neely et al. 2018; Yu et al. 2022; Yuan et al. 2018; Zhou et al. 2018). Due to the vastly improved resolution in comparison to studies on single loci, even subtle population differentiation can be detected (Cammen et al. 2016; Çilingir et al. 2022; de Greef et al. 2022; Gallego-García et al. 2021). In recent years genome wide single nucleotide polymorphisms (SNPs) have become valuable markers to identify population structure and boundaries (Çilingir et al. 2022; Liu et al. 2005; Morin et al. 2009; Santure et al. 2010). Restriction site associated DNA sequencing techniques (RADseq, ddRAD), which allow for SNP discovery and genotyping, have become widely used approaches for population genomic analyses in non-model organisms, such as cetaceans (Attard et al. 2018; Cammen et al. 2016; Carroll et al. 2016; Davey et al. 2011; Reeves et al. 2022; Viricel et al. 2014).

The aim of this study is to provide a high-resolution genetic marker system, which allows for detection of even subtle population structure over the North Atlantic distribution range of harbour porpoise. We therefore extend on a previous study (Lah et al. 2016) using ddRAD sequencing, by increasing the sample size and geographic coverage. Furthermore, we use the published genome of the harbour porpoise (Autenrieth et al. 2018) as a mapping reference to increase the resolution of our newly obtained genome-wide SNPs. With this improved data set, we aim to resolve the population structure of harbour porpoise across the entire North Atlantic, as well as within the Baltic Sea, and further provide an informative set of SNPs for the investigation of local population differentiation and adaptation with possible implications for conservation.

Materials and methods

Sampling locations

We focused this study on the North Atlantic distribution range of the species. We took 151 samples (Supplemental Table 1) from eight different ‘sampling areas’: Canada (CA), Iceland (ICE), North Sea (NOS), Skagerrak (SKA), Kattegat (KAT), Belt Sea (BES), Inner Baltic Sea (IBS) and, as an outgroup, Western Black Sea (WBS) (Fig. 1). The boundaries of these areas follow a previous study and are based on definitions by the International Council for the Exploration of the Sea, ICES (Wiemann et al. 2010). Additionally, we defined three ‘regions’, based on the main sea basins: West North Atlantic (WNA, including CA + ICE), East North Atlantic/North Sea (ENA, including NOS + SKA) and Baltic Sea (BALT, including BES + IBS). These regions exclude the Black Sea as an outgroup and Kattegat as a transition zone (Lah et al. 2016; Rosel et al. 1999; Wiemann et al. 2010). All sampling was performed on dead by-caught or stranded carcasses, and no live harbour porpoises were targeted for this study. The sampling was performed by persons authorised by the respective national authorities.
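The grouping of sampling areas into regions described above can be written down as a small lookup table. The following is a minimal sketch; the names `AREA_TO_REGION`, `region_of` and `group_by_region` are illustrative and not taken from the study's scripts.

```python
# Sampling areas and regions as defined in the text.
# KAT (transition zone) and WBS (outgroup) belong to no region.
AREA_TO_REGION = {
    "CA": "WNA", "ICE": "WNA",
    "NOS": "ENA", "SKA": "ENA",
    "BES": "BALT", "IBS": "BALT",
    "KAT": None,
    "WBS": None,
}


def region_of(area):
    """Return the region code for a sampling area, or None if the area is excluded."""
    if area not in AREA_TO_REGION:
        raise KeyError(f"unknown sampling area: {area}")
    return AREA_TO_REGION[area]


def group_by_region(sample_areas):
    """Map region -> sorted sample IDs, skipping samples from excluded areas.

    sample_areas: dict of sample ID -> sampling area code.
    """
    groups = {}
    for sample_id, area in sample_areas.items():
        region = region_of(area)
        if region is not None:
            groups.setdefault(region, []).append(sample_id)
    return {region: sorted(ids) for region, ids in groups.items()}
```

Keeping the excluded areas in the table with a value of `None` makes the exclusion explicit rather than silently dropping unknown codes.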

Wet lab procedure

Total genomic DNA was extracted from tissue samples (skin or muscle) stored at -20 °C (in ethanol or frozen) with the NucleoSpin Tissue Kit (Macherey-Nagel, Germany) following the manufacturer’s protocol. Canadian samples were extracted with the DNeasy Blood and Tissue kit (QIAGEN GmbH, Germany). To assess sample quality and quantity, DNA concentration was measured on a NanoDrop 1000 (Thermo Scientific, USA) and an Agilent 2200 TapeStation (Genomic ScreenTape System; Agilent Technologies, USA). Using traditional Sanger sequencing, we genotyped each sample’s mitochondrial control region as an independently inherited marker previously shown to be informative about population structure (Supplemental Table 1). Polymerase chain reactions (PCR) were performed on a Biometra T3000 thermocycler, using the primers ProL and DLH and following previously established protocols (Tiedemann et al. 1996; Wiemann et al. 2010). Using the Applied Biosystems BigDye chemistry, sequencing was performed on a 3130xl Genetic Analyzer (Applied Biosystems).

For genomic library preparation, we chose a modified ddRAD sequencing method, using the restriction enzymes PstI and MspI. All samples were sequenced on an Illumina HiSeq 2000. Paired-end reads of 100 bp were generated, with a sequencing output of 200 million read pairs. The digestion, library preparation and sequencing were carried out by a commercial sequencing company (LGC Genomics, Berlin), as detailed in Lah et al. (2016).

ddRAD-seq data analyses and SNP calling

LGC Genomics provided reads with clipped adapters and quality filtering. These reads did not contain any missing data, were trimmed at the 3’ end and had a minimum average Phred quality score of 20 over a window of 10 bases, as well as a minimum length of 64 bp. We mapped these reads against the genome of the harbour porpoise (Autenrieth et al. 2018) using bwa mem vs.0.7.17-r1188 (Li and Durbin 2010) with a Phred score of 30. Since the reference genome does not contain mtDNA sequences, we did not specifically filter for mtDNA reads. Since the X chromosome is not present as single scaffolds/not unambiguously identified, we did not map ddRAD reads specifically to the X chromosome. Duplicates were removed and all bam files were indexed with samtools vs.1.3.1 (Li et al. 2009).
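The read criteria listed above (no missing bases, 3’ trimming to a mean Phred score of at least 20 over a 10-base window, minimum length of 64 bp) can be illustrated with a short sketch. The sequencing provider's exact procedure is not described here, so the step-wise trimming rule and the function names `trim_3prime` and `filter_read` are assumptions made for illustration only.

```python
def trim_3prime(quals, window=10, min_mean=20):
    """Return the number of bases to keep after trimming from the 3' end.

    Bases are removed one at a time from the 3' end until the last `window`
    bases have a mean Phred score of at least `min_mean`. Returns 0 if no
    such window exists.
    """
    end = len(quals)
    while end >= window:
        if sum(quals[end - window:end]) / window >= min_mean:
            return end
        end -= 1
    return 0


def filter_read(seq, quals, min_len=64):
    """Return the trimmed (seq, quals) pair, or None if the read is rejected."""
    if "N" in seq:  # reads with missing data are discarded
        return None
    keep = trim_3prime(quals)
    if keep < min_len:
        return None
    return seq[:keep], quals[:keep]
```

For example, a read with 70 high-quality bases followed by 10 low-quality bases is shortened but kept, whereas a read with only 60 high-quality bases falls below the length threshold after trimming.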

Several bioinformatics tools and pipelines for RADseq analyses were developed in recent years, based on different algorithms and considering different models for variant calling. They result in different rates of false discovery and information delivered when calling variants and SNPs (Cornish and Guda 2015; Korneliussen et al. 2014; Maruki and Lynch 2017; Mielczarek and Szyda 2016; Rochette et al. 2019; Shafer et al. 2017). Therefore, to obtain the most confident SNPs and as recommended by some review papers (Kumaran et al. 2019; Shafer et al. 2017), we decided to combine different commonly applied pipelines and kept only SNPs identified by all of them for downstream analysis. Specifically, we used three different programs with four different models: bcftools vs.1.9, angsd vs.0.929 (models gatk and samtools; Korneliussen et al. 2014) and stacks (gstacks vs2.2 and populations vs2.2; Rochette et al. 2019). bcftools was used with minimum mapping and minimum base quality set to 20 and using the model for multiallelic and rare-variant calling with an expected substitution rate of 1e−6. angsd provides different models for estimating the genotype likelihood, of which we applied the more widely used models ‘gatk’ and ‘samtools’ (see Korneliussen et al. 2014 for more details on the calculations). In both runs, we let the program calculate the per-site allele frequency and posterior probability, major and minor alleles were inferred, and the per-site p-value threshold was set to 10e−6. gstacks was run with standard settings, including the restriction that only one SNP per stack was called to reduce the likelihood of linked SNPs, followed by populations (also with standard settings), which was used to extract called SNP positions and create a vcf file for downstream analysis.
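The idea of keeping only SNPs identified by every pipeline amounts to a set intersection over variant sites. The sketch below assumes that sites are matched by scaffold name and position only (alleles are not compared); the function name `shared_snps` and the input layout are hypothetical and do not reproduce the R script used in the study.

```python
def shared_snps(call_sets):
    """Return the sorted list of sites reported by every caller.

    call_sets: dict of caller name -> iterable of (scaffold, position) tuples.
    """
    site_sets = [set(sites) for sites in call_sets.values()]
    if not site_sets:
        return []
    return sorted(set.intersection(*site_sets))
```

With four call sets (one per model), a site is retained only if it appears in all four, so the result can never be larger than the smallest input set.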

Using a custom R script, we looked for overlap in the results of the chosen pipelines (bcftools, stacks, angsd) and extracted only those SNPs supported by all of them. The overlapping SNPs were filtered for a quality threshold of 20, minimum depth of 100 and mean genotype depth above 5 using bcftools vs.1.9 and vcftools vs0.1.15. Only one sample (PPC30-Pool2-102 from Canada) had a high amount of missing data (97%) and was excluded; for the remaining samples (n = 150), missing data were below 5%. We then created two datasets: one including the Black Sea (150 samples, referred to as ‘setA’ from here on), which was filtered for a minor allele frequency of 2% for all samples, and a second dataset excluding the Black Sea (145 samples, referred to as ‘setB’ from here on), which we filtered for an allele frequency of at least 5%. As a final filter step, we kept only SNPs occurring in 100% of samples for both sets (setA and setB). vcftools was used to calculate SNP-wise FST between sampling areas (8 for setA, 7 for setB) as well as among the defined regions WNA, ENA and BALT. These FST values were then visualized with the R package ggplot2 as a function of SNP position along each scaffold. Using the python script popgen, we calculated the FST for sliding windows along the scaffolds with a 500 kb window size, a minimum of 5 SNP sites per window and no overlap between windows.
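The windowed FST calculation can be sketched as follows for non-overlapping 500 kb windows with a minimum of 5 SNP sites. This is a simplified illustration, not the popgen script itself: it assumes that the window value is the plain mean of per-SNP FST values, whereas the actual script may use a different estimator. The function name `window_fst` and the input layout are hypothetical.

```python
from collections import defaultdict


def window_fst(snps, window=500_000, min_sites=5):
    """Mean per-SNP FST in non-overlapping windows.

    snps: iterable of (scaffold, position, fst) with 1-based positions.
    Returns a dict of (scaffold, window_start) -> mean FST, keeping only
    windows that contain at least `min_sites` SNPs.
    """
    bins = defaultdict(list)
    for scaffold, pos, fst in snps:
        # 1-based start coordinate of the window containing this SNP
        start = ((pos - 1) // window) * window + 1
        bins[(scaffold, start)].append(fst)
    return {
        key: sum(values) / len(values)
        for key, values in sorted(bins.items())
        if len(values) >= min_sites
    }
```

Windows with fewer than five SNPs are dropped rather than reported with an unreliable mean, which mirrors the minimum-site requirement stated above.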

SNP-based population statistics and differentiation analyses

Using the program arlequin vs.3.5.2.2 (Excoffier and Lischer 2010), we calculated diversity indices, including nucleotide diversity and expected heterozygosity. Using the tool vcftools, we calculated observed heterozygosity for both SNP sets, applying sampling areas as units. Genetic diversity differences among samples were analysed using a hierarchical AMOVA in arlequin. Sequential Bonferroni tests (Sokal and Rohlf 2012) were performed to correct for multiple comparisons of pairwise FST, which were also calculated in arlequin.
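The sequential Bonferroni correction mentioned above is the step-down procedure commonly attributed to Holm: p-values are sorted in ascending order and the smallest is compared with α/m, the next with α/(m − 1), and so on, stopping at the first non-significant test. The sketch below is a generic implementation of that rule; the function name `sequential_bonferroni` is illustrative and the code is not taken from the study.

```python
def sequential_bonferroni(pvalues, alpha=0.05):
    """Return a list of booleans (in input order): True where the test is
    significant after sequential Bonferroni (Holm) correction."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    significant = [False] * m
    for rank, i in enumerate(order):
        # rank 0 is compared with alpha/m, rank 1 with alpha/(m-1), ...
        if pvalues[i] <= alpha / (m - rank):
            significant[i] = True
        else:
            break  # all larger p-values are non-significant as well
    return significant
```

Compared with the plain Bonferroni correction, this procedure is never less powerful while still controlling the family-wise error rate.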

We used discriminant analysis of principal components (DAPC) and admixture to estimate the number of clusters across all sampling areas without prior region/population assignment. To maximize the discrimination between potential clusters, a DAPC separates the variance of each sample into between-group and within-group components. Clusters are inferred through a discriminant analysis based on the initial principal component analysis of the data. The DAPC, as well as a PCA, AMOVA and pairwise fixation indices FST among sampling regions were calculated and illustrated with R, using the packages adegenet (Jombart 2008), pegas (Paradis 2010), hierfstat (Goudet 2005) and StAMPP (Pembleton et al. 2013). The program admixture (Alexander et al. 2009) was used to infer clusters and genetic identity of each sample to the respective clusters. In admixture, which adopts the likelihood model embedded in the program structure (Pritchard et al. 2000), estimation and evaluation of the best K was performed using cross-validation. admixture computes maximum likelihood estimates in a parametric model for individual ancestry estimates. We used RStudio (RStudio Team 2015) for visualization and tableau vs.2019.3.0 to plot the genetic cluster identity of each sample geographically, according to the sample’s coordinates.

Detection of outlier loci

Outlier SNPs were detected using bayescan vs.2.1 (Foll and Gaggiotti 2008). bayescan identifies candidate loci based on differences in allele frequencies between populations by applying a multinomial-Dirichlet model (Foll 2012). To evaluate whether outliers may be indicative of local adaptation, selection was inferred using a logistic regression to divide the FST coefficient into a population-specific (β) and a locus-specific component (α). We set the posterior odds for the neutral model to 1,000, used 20 pilot runs with a thinning set to 5,000 and burn-in of 50,000, followed by a total of 100,000 iterations. A q-value threshold of 10% was applied; the q-value is the bayescan FDR (false discovery rate) analogue of the p-value. Using plink vs1.07 (Purcell et al. 2007), we calculated the SNP-wise allele frequency and extracted outlier SNPs (based on the α (> 10) and FST value (> 0.05) calculated by bayescan). This analysis was performed twice, i.e., comparing the allele frequencies between (1) the different sampling areas and (2) the different regions. Using custom scripts, we then compared the outlier SNPs to the draft annotation of the reference genome (Autenrieth et al. 2018) to see if SNPs are localized in annotated regions.
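To illustrate what a q-value threshold of 10% means, the sketch below computes q-values from per-locus posterior probabilities of the selection model, under the usual Bayesian definition: the q-value of a locus is the expected false discovery rate of the set containing that locus and all loci with higher posterior probability. This is an assumption about the definition for illustration; bayescan's internal computation may differ in detail, and the function name `q_values` is hypothetical.

```python
def q_values(post_probs):
    """Return q-values (in input order) from posterior probabilities.

    post_probs: posterior probability of the model with selection per locus.
    Loci are ranked by decreasing posterior probability (ties are broken by
    input order); the q-value at each rank is the running mean of
    (1 - posterior probability) over all loci up to that rank.
    """
    order = sorted(range(len(post_probs)), key=lambda i: post_probs[i], reverse=True)
    q = [0.0] * len(post_probs)
    running_error = 0.0
    for rank, i in enumerate(order, start=1):
        running_error += 1.0 - post_probs[i]
        q[i] = running_error / rank
    return q


def outliers(post_probs, threshold=0.10):
    """Indices of loci whose q-value is below the threshold."""
    return [i for i, q in enumerate(q_values(post_probs)) if q < threshold]
```

Under this definition, declaring all loci with q < 0.10 as outliers means that at most about 10% of them are expected to be false positives.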

Mitochondrial control region data analysis

Mitochondrial control region sequences were aligned in MEGA vs.10.1.7. (Kumar et al. 2018) and haplotypes were defined based on sequence comparisons with previously published haplotypes (Lah et al. 2016; Wiemann et al. 2010). Using the program arlequin, we calculated genetic diversity indices, pairwise FST and performed an AMOVA. A haplotype network based on the mitochondrial control region was created using the program popart vs.1.7. (Leigh and Bryant 2015). Additionally, we calculated the haplotype frequency per sampling area and region.
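The per-area and per-region haplotype frequencies are simple relative counts. A minimal sketch, with a hypothetical function name and input layout (pairs of group label and haplotype label), might look like this:

```python
from collections import Counter, defaultdict


def haplotype_frequencies(records):
    """Relative haplotype frequencies per group.

    records: iterable of (group, haplotype) pairs, one per sample, where
    group is a sampling area or region label.
    Returns a dict of group -> dict of haplotype -> frequency in [0, 1].
    """
    counts = defaultdict(Counter)
    for group, haplotype in records:
        counts[group][haplotype] += 1
    frequencies = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        frequencies[group] = {hap: n / total for hap, n in counter.items()}
    return frequencies
```

Running the same function once with sampling-area labels and once with region labels yields both tables from a single list of samples.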

Results

DNA quality and sequencing output

The number of raw reads per individual was on average 2,220,724 with the lowest number being 1,224,194 and the highest 3,137,076. After mapping the reads of each sample to the genome, we used the program samtools vs1.3.1. to calculate mapping statistics (Table 1). We observed an average mapping quality per sample of 36.84 and a mean number of mapped reads per sample of 1,681,717.33 reads (Table 1). The information for all samples is provided in the Supplemental Table 1.

Table 1 Mapping statistics of 150† samples (see Supplemental Table 1 for statistics of each sample)

SNP calling, filtering and distribution

The number of overall called SNPs varied depending on the pipeline used: bcftools (985,537), angsd (gatk: 2,692,730, samtools: 2,003,753) and stacks (330,307). The number of SNPs consistently called across all four algorithms and retained for all further analyses was 269,500 (Supplemental Fig. 1). After all filter steps (minimum quality of 20, minimum depth of 100 and mean genotype depth above 5) were performed for SNPs shared among all models, the dataset encompassed a total of 52,175 SNPs. For SNP setA (Black Sea included), we retained only those SNPs found in 100% of the 150 samples, resulting in a final dataset of 26,320 SNPs. For SNP setB (Black Sea excluded), 24,705 and 11,978 SNPs were present in at least 97% and 100% of the samples, respectively. Given the relatively high number of retained SNPs, and to facilitate direct comparison with SNP setA, we also decided to continue with the 100% filter for downstream analyses of setB. Finally, in mapping the ddRAD reads to the genome, we found that the SNPs were more or less evenly distributed across scaffolds (Supplemental Fig. 3), and that the number of SNPs per scaffold correlated with the length of the scaffolds (R²=0.49) (Supplemental Fig. 2).

Identification of populations

PCAs were calculated for setA and setB. When including the Black Sea (setA), these individuals form a separate cluster and are highly divergent from samples from all other regions along the first principal component axis (PC1), which explains 3.7% of the total genomic variation (Fig. 2A). PC2 mainly separates the BALT individuals from ENA and WNA and explains 2.2% of total genomic variation. When excluding the Black Sea samples (setB), the sampling areas CA, ICE, NOS and SKA are not separated along PC1 and the Baltic Sea (BES and IBS) areas overlap partly (Fig. 2B). KAT is placed in-between the Baltic and the North Sea/North Atlantic sampling areas. This is congruent with its status as a transition zone; however, the amount of variation explained is low, i.e., PC1 and PC2 only explain 2.4% and 1.2% of the total genomic variation, respectively (Fig. 2B). The IBS partly overlaps with the BES, but some individuals also overlap with the KAT samples (Fig. 2B).

Fig. 2

Principal Component Analyses: PC1 vs. PC2 of (A) SNP setA (incl. all sampling areas) and (B) setB (excluding the five Black Sea samples). Locality abbreviations refer to Fig. 1

Contrary to the PCA, the DAPC indicated stronger divergence among the samples. As a first step, we calculated the number of clusters found by the DAPC, which were three for setA and two for setB (Fig. 3A and B). For setA, we also see a clear separation of the Black Sea samples across DA1, which explains 89.7% of the variation between groups (Fig. 3C). DA2 separates BES from the others, but due to the strong signal from the Black Sea, the other sampling areas are not genetically distinguishable. When plotting DA3 against DA1, DA3 shows a separation of NOS from the other sampling areas (Fig. 3E). In the DAPC for setB, BES is distinct from the rest of the samples along DA1 (explaining 58% of between-group variation), CA, ICE, and SKA overlap, slightly indicative of an isolation-by-distance (IBD) pattern, while NOS and IBS are set apart by DA2 (explaining 17% of the variation; Fig. 3D) and DA3 (explaining 10% of the variation; Fig. 3F), respectively, and KAT is intermediate.

Fig. 3

Discriminant analyses of principal components (DAPC). A, C and E include all samples (setA); B, D and F exclude the WBS samples (setB). Panels A and B show the inferred number of clusters and the assignment of samples to them. C and D display DA1 versus DA2. E and F display DA1 versus DA3. Locality abbreviations refer to Fig. 1

When using admixture and assuming a two-population scenario (k = 2, Fig. 4) for setB, one cluster (yellow) is formed by the Atlantic Sea Basins (CA, ICE, NOS, SKA and KAT), while the other (blue) is formed by the BALT region (BES and IBS), although IBS also includes some individuals assigned to the yellow cluster. When increasing k, assuming a three-population structure, more than half of the North Sea individuals are assigned to a third cluster (orange), to which, outside of the North Sea, only a single specimen in BES (Pool2-145) is assigned. Increasing k to 4 does not support any further geographic separation of the samples, as the “skyblue” cluster does not correlate with any specific sample characteristic (year, season, age or sex; Fig. 4, Supplemental Table 1). Besides the one individual in BES, which displays high assignment to the orange cluster, a second individual sampled in NOS (Pool3-189) stands out with a high assignment to the blue cluster. When excluding these two samples from the PCA, ellipses of both NOS and BES become narrower and further apart (Supplemental Fig. 5). When including the Black Sea individuals (setA, Supplemental Fig. 4), they form a clearly distinguishable cluster on their own. For the North Atlantic and Baltic individuals, a similar cluster structure as with setB is revealed. The “orange” North Sea cluster only becomes apparent when increasing k to 5; however, individuals are assigned to this cluster with high likelihood.

Fig. 4

admixture plot based on SNP setB (145 samples, seven sampling areas). Each bar represents one individual, while the colour indicates genetic identity to the respective cluster. K2, k3 and k4 are shown. For k2 the clusters are coloured in blue and yellow. K3 adds orange and k4 adds skyblue. Locality abbreviations refer to Fig. 1

Using the software tableau, we projected the geographical occurrence of individual harbour porpoises assigned to the different clusters based on the admixture results (setB, k3) onto a map in accordance with the sampling coordinates of each specimen (Fig. 5). The yellow cluster, encompassing individuals from CA, ICE, NOS, SKA, KAT and IBS, occurs across all sea basins, except for the southern part of the Belt Sea, where the blue cluster dominates. The orange cluster is mainly restricted to the Eastern North Sea, except for one individual sampled in the Belt Sea. The Inner Baltic Sea exhibits individuals from both the yellow and the blue cluster, as well as admixed individuals. A similar pattern is shown in KAT, supporting its categorisation as a transition zone between North Atlantic and Belt Sea harbour porpoises.

Fig. 5

Cluster assignment based on a 50% threshold for each sample, identified by admixture (k3) based on setB. Samples are plotted on a map according to their geographic coordinates. No specimens were admixed between the orange and the blue cluster. North Sea = Orange; Atlantic = yellow; Belt Sea = blue; Atlantic-North Sea admixed = red; Atlantic-Belt Sea admixed = green
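The 50% threshold used for the map can be sketched as a simple rule applied to each sample's ancestry proportions. The exact rule is not spelled out in the caption, so the following is an assumed reading: a sample is assigned to a cluster if that cluster accounts for at least 50% of its ancestry, and is otherwise labelled as admixed between its two largest clusters. The function name `assign_cluster` and the label format are hypothetical.

```python
def assign_cluster(proportions, threshold=0.5):
    """Assign a sample to a cluster from its ancestry proportions.

    proportions: dict of cluster name -> ancestry proportion, with at least
    two clusters and values summing to about 1.
    Returns the cluster name if its proportion reaches the threshold,
    otherwise 'admixed:<a>+<b>' for the two largest clusters (names sorted).
    """
    ranked = sorted(proportions.items(), key=lambda item: item[1], reverse=True)
    top_name, top_value = ranked[0]
    if top_value >= threshold:
        return top_name
    second_name = ranked[1][0]
    return "admixed:" + "+".join(sorted([top_name, second_name]))
```

For instance, a sample with proportions 0.45, 0.40 and 0.15 has no cluster above the threshold and would be reported as admixed between its two largest clusters.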

In addition to the analysis of nuclear SNPs, we investigated the maternal population structure using the mitochondrial control region. A total of 387 bp was sequenced and 48 haplotypes could be identified among our 150 samples. Of these, 32 haplotypes are newly described here, while 16 matched previously published ones (Lah et al. 2016; Wiemann et al. 2010). In the calculated haplotype network (Fig. 6), most haplotypes are separated by only one mutation step. The Black Sea samples show closely related haplotypes separated from the North Atlantic haplotypes. Black Sea haplotypes are accumulated on one side of the haplotype network, while Belt Sea and Inner Baltic Sea haplotypes are mostly found on the opposite side of the haplotype network. In-between, the placement of haplotypes follows a geographic pattern. Some haplotypes are shared among regions; however, haplotype frequencies differ drastically between the sampling areas. The WNA (CA and ICE) shows the most private haplotypes, while in both the BALT and ENA region overall fewer haplotypes are present. Some of the more abundant haplotypes differ in frequency between ENA and BALT, e.g. PHO1 (56.7% in ENA and 27.3% in BALT), PHO4 (13% in ENA, absent in BALT) and PHO7 (6.7% in ENA and 48.5% in BALT; Supplemental Table 3).

Fig. 6

Haplotype network of the mitochondrial control region. Each circle represents one haplotype, while the number represents the haplotype identification. Circle size indicates the number of samples; colour indicates affiliation to sampling area. Bars indicate the number of mutational steps between haplotypes. Locality abbreviations refer to Fig. 1

Genetic diversity and differentiation

Measures of genetic variability were assessed across all sampling areas considering SNP setA (Table 2). The results indicate only small differences in genetic diversity between the different sampling areas, except WBS, which shows the lowest values in observed heterozygosity and nucleotide diversity for both SNPs and mtDNA (Table 2). When looking at the other sampling areas, the lowest observed heterozygosity is found in CA (HO=0.196 ± 0.004) and the highest in BES (HO=0.205 ± 0.007). Nominally, the observed heterozygosities were consistently slightly higher than the expected heterozygosities, but this deviation was not significant for any sampling area (Table 2). The nucleotide diversity was similar across sampling regions, with highest values detected in the BALT sampling areas, and lowest being observed in the WNA sampling areas. The ENA sampling areas showed an intermediate nucleotide diversity. For the mitochondrial control region (387 bp, 48 haplotypes), genetic diversity substantially differed across sampling areas. Here, CA and ICE had the highest nucleotide diversity (congruent with the highest number of haplotypes) while for the other sea basins the genetic diversity is considerably smaller. Both BALT sampling areas exhibited the lowest mtDNA nucleotide diversity.
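For readers less familiar with the two heterozygosity measures compared above, the following sketch computes both for a single biallelic SNP. It is a generic textbook calculation, assuming genotypes coded as alternative-allele counts without missing data and using the sample-size-corrected (unbiased) estimator for expected heterozygosity; it is not the code used by arlequin or vcftools, and the function name `heterozygosity` is hypothetical.

```python
def heterozygosity(genotypes):
    """Observed and expected heterozygosity for one biallelic SNP.

    genotypes: list of alternative-allele counts per individual (0, 1 or 2),
    with at least one individual and no missing data.
    Returns (observed, expected), where expected is 2p(1-p) multiplied by
    the small-sample correction 2n/(2n-1).
    """
    n = len(genotypes)
    observed = sum(1 for g in genotypes if g == 1) / n
    p = sum(genotypes) / (2 * n)  # alternative-allele frequency
    expected = 2 * p * (1 - p) * (2 * n) / (2 * n - 1)
    return observed, expected
```

Averaging these per-SNP values over all loci within a sampling area gives area-level estimates that can be compared in the way described above.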

Table 2 Genetic diversity indices for all eight sampling areas based on SNP setA and mitochondrial control region

Analyses of molecular variance (AMOVA) showed significant divergence among sampling areas for both mtDNA (FST = 0.310, p < 0.001) and SNPs (FST = 0.012, p < 0.001, Table 3). In the hierarchical AMOVA, the differentiation among regions was not significant for either marker set (mtDNA: FCT = 0.294, p = 0.066; SNPs: FCT = 0.011, p = 0.063). Furthermore, sampling areas within regions were significantly divergent for SNPs (FSC = 0.001, p < 0.001), but not for the mtDNA (FSC = 0.023, p = 0.229; Table 3). For both marker sets, most genetic variation occurred within sampling areas (mtDNA = 69.0%, SNPs = 98.8%). Of the remaining variation, most was due to divergence among regions (mtDNA: 29.4%; SNPs: 1.12%).
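The relationship between the reported variance percentages and the fixation indices can be made explicit with the standard hierarchical AMOVA definitions. The symbols below (variance components among regions, among sampling areas within regions, and within sampling areas) follow the usual AMOVA notation and are not defined in the text; the 1.6% share for sampling areas within regions is not reported directly but obtained here by subtraction (100 − 69.0 − 29.4) for mtDNA.

```latex
\begin{align}
\sigma_T^2 &= \sigma_a^2 + \sigma_b^2 + \sigma_c^2 \\
F_{CT} &= \frac{\sigma_a^2}{\sigma_T^2}, \qquad
F_{SC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_c^2}, \qquad
F_{ST} = \frac{\sigma_a^2 + \sigma_b^2}{\sigma_T^2} \\
% mtDNA, using percentages of total variance:
% among regions 29.4, within regions 1.6 (by subtraction), within areas 69.0
F_{CT} &= \frac{29.4}{100} = 0.294, \qquad
F_{SC} = \frac{1.6}{1.6 + 69.0} \approx 0.023, \qquad
F_{ST} = \frac{29.4 + 1.6}{100} = 0.310
\end{align}
```

The values recovered in this way agree with the mtDNA indices reported above, which suggests that the percentages and F-statistics are internally consistent.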

Table 3 AMOVA results from SNP setB and mtDNA control region (excluding Black Sea (outgroup) and Kattegat (transition zone))

Almost all pairwise FST estimates based on SNP setB were significant (α < 0.05* and α < 0.01**) after Bonferroni correction (except CA vs. ICE; Table 4). This result indicates genetic differentiation between the Atlantic sampling areas (WNA and ENA) and the Baltic sampling areas (BALT) with FST >0.01 for all comparisons. In contrast, FST estimates within the different regions (WNA, ENA and BALT) are considerably lower (e.g., CA vs. ICE: FST =0.000; NOS vs. SKA: FST =0.003, BES vs. IBS: FST = 0.006). The pairwise FST estimates based on only the mitochondrial control region were generally higher than the ones based on SNPs (Table 4). All pairwise comparisons with BES and almost all with CA (four out of five) were significant.

Table 4 Pairwise fixation indices FST for six sampling areas in the North Atlantic and Baltic Sea. The lower triangle of the table lists the FST values based on SNP setB, while the upper triangle lists the respective FST values based on the mitochondrial control region

Detection of outlier regions and loci

The sliding window approach for FST allows the identification of genomic regions where SNPs exhibit high FST values (Fig. 7). The average FST for all windows across all scaffolds is 0.28 (blue line in Fig. 7). For the 34 largest scaffolds, we plotted FST values for all 500 kb windows, along with scaffold-specific 95% confidence intervals. The rationale here is to identify those scaffolds with generally lower FST values (e.g. ScshVNz_10623, ScshVNz_7191, ScshVNz_7648, Fig. 7) and those with generally higher FST values (e.g. ScshVNz_11228, Fig. 7), as compared to the genome-wide mean FST (blue line). Within scaffolds, we can further identify regions where the calculated FST of one or more sliding windows is outside the 95% confidence interval of the respective scaffold, depicting genomic regions of low or high divergence (e.g. in ScshVNz_10729, ScshVNz_11943, ScshVNz_12150, Fig. 7).

Fig. 7

Single plots for each of the 34 largest scaffolds, showing the FST, calculated as sliding window of 500 kb with popgen, based on SNP setB (Black Sea excluded). X-values of red dots depict the position of the sliding windows on the respective scaffold, blue line indicates the average FST over all 34 scaffolds, the black dashed lines indicate the upper and lower bounds of the confidence limit for each separate scaffold individually. mbp = million base pairs

With population genomic data at hand, we can also investigate whether SNPs are indicative of the different clusters or sampling areas found and, if so, whether they occur in coding regions or are under selection. We identified six SNPs with statistically significant patterns of divergent genetic differentiation (having an α-value greater than ten or an FST value greater than 0.05, respectively; Supplemental Fig. 6). The allele frequencies of these SNPs are congruent with isolation-by-distance across the different sampling areas throughout the North Atlantic (Table 5, Supplemental Table 2). This becomes even more apparent when looking at the three main ocean basins/regions (Fig. 8). Although none of the six outliers is positioned within coding regions of genes in the nuclear genome or their flanking regions (as far as predicted by the draft annotation), the positive alpha indicates that they may all be influenced by diversifying selection with regard to sampling regions.

Table 5 Allele frequency of outlier-SNPs detected by bayescan based on SNP setB with a q-value < 0.1
Fig. 8

Allele frequencies for the six outlier loci detected by bayescan, calculated with plink. For each locus, the frequency is given for the allele that is most frequent in WNA, as opposed to the other regions. Colours: yellow = WNA, orange = ENA, blue = BALT

Discussion

Population differentiation in the North Atlantic, relative to the divergence from the Black Sea

The population structure of the harbour porpoise within the North Atlantic is under ongoing investigation (NAMMCO 2019). Long-distance movements are known to occur (Jefferson 2014; Nielsen et al. 2019), which can obscure subtle genetic differentiation between oceanic regions (Leslie and Morin 2016). Although the North Atlantic is predominantly an open water body with few to no physical barriers, potentially allowing harbour porpoises of different geographic origin to intermix freely and to mate successfully with each other, harbour porpoises are known to be shallow-water cetaceans that avoid water bodies deeper than 300 m. Thus, deep waters, e.g. the Icelandic basin or the Norwegian Sea, may act as natural barriers, forcing migration to follow the Greenland-Scotland ridge. These limitations could have supported differentiation in the harbour porpoise, whose range within the North Atlantic has been subdivided into multiple distinct management areas delineated on the basis of genetics, morphometry, pollutant profiles and telemetry data (Fontaine et al. 2007, 2014; Lah et al. 2016; NAMMCO 2019; Nielsen et al. 2018; Rosel et al. 1999a; Wiemann et al. 2010; Celemin et al. 2023). Within the North Atlantic, these inferred separations are suggested to be demographically independent, with finer subdivisions in the East North Atlantic, including North and Baltic Seas. The NAMMCO assessment units are currently supported by the identification of isolation-by-distance across the North Atlantic up into the Baltic Sea (Fontaine et al. 2007; Lah et al. 2016).

Here we used genome-wide distributed SNPs to unravel potential population structure within the entire North Atlantic distribution range of the species. We find very little population structure between Canada and Iceland, which - together with findings of Celemin et al. 2023; Fontaine et al. 2017; Olsen et al. 2022; and Quintela et al. 2020 - suggests gene flow across most of the North Atlantic from Canada to Norway. However, other studies have revealed differentiation in the Western North Atlantic involving areas not sampled in our study, i.e., between the US Atlantic coast and Canada, and Western Greenland harbours a porpoise population genetically distinct from those of Canada and Iceland (NAMMCO 2019; Nielsen et al. 2018; Olsen et al. 2022; Tolley et al. 2001).

Our mtDNA data and a previous mtDNA study (Rosel et al. 1999) hint at differentiation between East and West North Atlantic, as very few haplotypes are shared between the Western North Atlantic and the Eastern North Atlantic/Baltic Sea regions (PHO1, 4, 19, 20; cf. Fig. 6). While a higher number of haplotypes occurs in WNA, all at frequencies below 10%, we find high frequencies for a few haplotypes (PHO1 and PHO4) in ENA. This representation of more maternal lines in WNA could be indicative of a higher long-term effective population size (Ne), while ENA/BALT exhibits a star-like mtDNA phylogeny, pointing towards recent bottlenecks/colonization events with subsequent expansions (Fig. 6). Such differences in long-term Ne are also indicated by the roughly threefold higher mtDNA nucleotide diversity in WNA compared with the other regions.

Comparing nuclear (SNP) with mtDNA data we observed a contrasting pattern: SNP divergence is most pronounced between the Baltic region and the open ocean regions (North Sea, North Atlantic), with only subtle isolation-by-distance across the entire North Atlantic. It may hence reflect adaptive rather than mere geographic processes. Conversely, mtDNA is most differentiated between West and East North Atlantic, indicative of higher philopatry in females, as has been postulated repeatedly (e.g., Wiemann et al. 2010). Ocean-wide connectivity may therefore be mostly driven by occasional male dispersal which does not contribute to the observed mtDNA pattern, unless the migrating individual itself is sampled (Tiedemann et al. 2000).

Although there is apparently only little differentiation at SNP markers across the North Atlantic, marginal and atypical environments could lead to genetic differentiation, such as shown around the UK (Fontaine et al. 2017). The region encompasses strikingly different marine habitats from oceanic to the rather shallow North Sea, and a transition into the more brackish habitats of the Baltic Sea, so the detection of cryptic population structure would not be surprising. Here, we detect subtle differences between the Eastern North Sea (NOS) and other North Atlantic porpoises. The Skagerrak porpoises (SKA) are closer to Icelandic ones (ICE) than to the close-by North Sea (NOS), both in the DAPC (Fig. 3D and F) and regarding pairwise FSTs (Table 4). Moreover, some NOS specimens form a separate genetic cluster in the admixture plot (Fig. 4). This assignment becomes even more prominent when the outgroup Black Sea samples are included in the analysis (Supplemental Fig. 4). The North Sea samples responsible for this signal of differentiation are mainly occurring in the German Wadden Sea, around the Isle of Sylt (Fig. 5). This is in accordance with previous investigations, which identified a distinct breeding ground around the Isle of Sylt (Diederichs et al. 2010; Sonntag et al. 1999; Unger et al. 2022). A more complete sampling of the North Sea harbour porpoises is needed to assess the robustness as well as the geographic distribution of a separate North Sea cluster.

Additional local population structure is detected for the Baltic region (BALT; here defined as BES and IBS), which is separated from the North Sea/North Atlantic. ADMIXTURE distinguishes here between a Baltic and a North Atlantic cluster (“blue” and “yellow” in Fig. 4, respectively), which are also identified in the DAPC (Fig. 3B). This separation is further supported by high FST values (Table 4). The Kattegat (KAT) sampling area represents a transition zone between these two regions. Also, significant differences in the frequencies of particular mitochondrial haplotypes support this assignment, i.e. PHO4 (19% in NOS vs. 0% in BALT and 5% in WNA) and PHO7 (BES 50%, IBS 41.7% vs. NOS 9.5%, CA, ICE, SKA = 0%). The disproportional geographic distribution of these two haplotypes has been previously recognized and correlates with divergence at nuclear loci, both SNPs and microsatellites (Lah et al. 2016; Wiemann et al. 2010).

A further subdivision within the Baltic (BALT) region based on differences in genetics, morphology, and behavior has been repeatedly hypothesized, such that porpoises from BES and IBS would belong to two distinct populations (Celemin et al. 2023; Galatius et al. 2012; Lah et al. 2016; NAMMCO 2019; Wiemann et al. 2010). This could be caused by a separation of these two groups during the mating season and calving (Carlén et al. 2018; Huggenberger et al. 2002). Our data identify some genetic differences between these two Baltic sampling areas, with strong support for a Belt Sea population. However, the inner Baltic specimens were not unambiguously identified as a discrete population, as these differences may also be explained by a mixture of migrating specimens of both Belt Sea and North Atlantic/Skagerrak origin (cf. Fig. 5). Of note in this context, we defined IBS according to Wiemann et al. (2010), i.e., east of 13.5°E longitude, and our 12 IBS samples originate from the westernmost part of this area, south of Sweden. Recent studies suggest that the distinct IBS population occurs further to the east (Carlén et al. 2018), such that only our three easternmost individuals originate from the distribution range of that putative population (Fig. 5), too few to reach any conclusion about the status of this population. Genomic studies including samples from further east (Poland, east of Gotland) assigned some Inner Baltic specimens to a separate cluster (Celemin et al. 2023; Lah et al. 2016).

Haplotype frequencies in both the ENA (PHO4 + PHO1) and BALT (PHO7 + PHO1) regions could indicate a more recent expansion, as we see a star-like pattern in the haplotype network (Fig. 6) with common, widespread ancestral haplotypes and closely related, locally unique, rare haplotypes (Rosel et al. 1999; Wiemann et al. 2010). This would be consistent with the history of the Baltic Sea, which became accessible to harbour porpoises only recently (i.e., several thousand years ago). Previous studies have detected a similar pattern in further ENA regions, i.e., Norway (Tolley and Rosel 2006) and around the UK (Rosel et al. 1999; Walton 1997).

By identifying genetic clusters without a priori geographic assignment, our data allow the detection of migrating individuals. The harbour porpoise is known to migrate seasonally in some regions in response to the formation of sea ice during winter, such as in the Labrador Sea between Canada and Greenland (Olsen et al. 2022), but also in the northern part of the Baltic Sea proper (Andersen et al. 2001; Benjamins et al. 2007; Dähne et al. 2017; Nielsen et al. 2018; Rosel et al. 1999a). Seasonal migrations may also correlate with breeding behavior, as mating takes place mostly in the summer months (May to October), and animals migrate between breeding and winter feeding grounds in some areas, e.g. in the Baltic Sea (Carlén et al. 2018; Jefferson 2014; Sveegaard et al. 2015). Females tend to return to the regions where they were born for mating and calving, which in consequence limits gene flow between populations (Andersen et al. 2001; Huggenberger et al. 2002; Kesselring et al. 2017). Gene flow could be maintained by single dispersing individuals, which may be primarily males (Huggenberger et al. 2002; NAMMCO 2019; Wiemann et al. 2010). In our study, we find two individuals that show a strong genetic association with a cluster from a geographic region other than the one in which they were sampled, indicating potential migrants. Within NOS, one sample (Pool3-189, a male sampled in May) shows a strong cluster affiliation to the Belt Sea (blue, Figs. 4 and 5) and carries the mitochondrial haplotype PHO7, indicative of the BALT region. Therefore, we assume it to be a Belt Sea individual that migrated into the North Sea. We found one female individual sampled in the Belt Sea (Pool2-145) that seems to have migrated in the opposite direction. This sample has a high assignment probability to the orange cluster (NOS, Figs. 4 and 5) and carries PHO5, a rather rare haplotype known to occur in both the North Sea and the Belt Sea (Tiedemann et al. 1996; Wiemann et al. 2010). When these two putative migrant samples are excluded from the dataset, the cluster assignment in the PCA improves (Supplemental Fig. 5).

Population differentiation among our defined regions (WNA, ENA, BALT) is also supported by the six outlier SNP loci, which clearly distinguish them (Fig. 8). SNPs 1, 3, 4 and 6 further differentiate between WNA and ENA (Fig. 8). Additionally, these SNPs support the previously shown isolation-by-distance across the North Atlantic (Fontaine et al. 2007; Galatius et al. 2012; Lah et al. 2016; Tolley and Rosel 2006). Although the population structure pattern found in this study in SNP and mtDNA data suggests potential local adaptation, the outlier SNPs do not lie in coding regions of the genome, but they could be linked to loci under selection. Hence, further studies using full nuclear genome analyses should be performed to identify candidate genes for local adaptation.

Implication for conservation

With our mtDNA data we can detect a subtle differentiation between WNA and ENA, but this pattern was not observed in the SNP data. The substantially higher genetic variation in terms of mtDNA nucleotide diversity in WNA compared with the other regions (cf. Table 2) suggests a higher effective population size there. This is in line with high abundance estimates derived from surveys, as well as with the genetic cohesion over a wide geographical range, encompassing Canadian to Icelandic waters. Intriguingly, more subtle population structure exists on the eastern side of the Atlantic, as we detected differentiation between the respective sampling areas within ENA and BALT, with a transition zone in the Kattegat, in both mtDNA and SNPs. The different haplotype frequencies and high occurrence of specific indicative haplotypes, especially in NOS, BES and IBS, deviate significantly from a random distribution and therefore support demographic independence of the populations inhabiting these different sampling areas. Notably, we found some support for a separate genetic cluster in the German North Sea, coinciding with the observation of a distinct breeding ground around the Isle of Sylt (Gilles et al. 2009; Siebert et al. 2006; Sonntag et al. 1999). Although the full geographic range occupied by porpoises belonging to this inferred cluster could not be revealed here, specific protection measures for this area may be warranted, not least because this area is subject to increasing construction of offshore wind farms, which increase disturbance of the local marine fauna (Aarts et al. 2016; Booth 2020; Dähne et al. 2017; Peschko et al. 2016; Schaffeld et al. 2020).
Although abundances in the North Atlantic shelf distribution area are quite high, the genetic differentiation shown here, as well as the morphological and behavioral differences observed previously, may warrant consideration of more regional management units, for which specific abundance estimates would be desirable. This has been done for the BALT region, where abundance is estimated separately for BES (~ 14,400 animals) and IBS (~ 500 animals; Benke et al. 2014; Gilles et al. 2023; Hammond et al. 2018; SAMBAH 2016). It has been shown that the Inner Baltic (IBS) porpoises migrate between feeding and breeding grounds, and that the distributions of Inner Baltic and Belt Sea porpoises during the breeding season do not overlap (Carlén et al. 2018). The exact geographical split between these populations remains elusive, but is considered to lie further east than previously thought (Carlén et al. 2018). Our data generally support the assessment areas of NAMMCO (NAMMCO 2019) and warrant consideration of the southern NOS, BES and IBS as separate populations/management units, with a recommendation to include further samples from neighbouring areas in future studies.

Genome-wide population assessments like ours can be used to establish SNP assays specifically tailored to distinguish between individuals assigned to the different identified genetic clusters. When designing such a SNP assay, inferred genetic clusters need to be related to geographic populations (here, North Atlantic, yellow cluster; Belt Sea, blue cluster; and southern North Sea, orange cluster), and potential migrants (i.e., specimens with a genetic cluster assignment different from the inferred population of their sampling area) should be excluded. Sample size permitting, one may restrict this inference to specimens originating from the reproductive season, as these have a higher likelihood of belonging to the local population (e.g., Wiemann et al. 2010). Then, the focus should be on private/population-specific SNPs, such as our detected outlier SNPs. These outlier loci can already be considered a candidate panel for estimating differential assignment likelihoods to the respective populations. Additionally, we detected regions on different scaffolds where SNPs with high or low FST accumulated, i.e. genomic regions that are potentially more indicative of population separation than others (Supplemental Fig. 3). SNPs from these regions would also be candidates for the design of such a panel. As our data set was used as training data to inform that panel, the panel needs to be validated with an independent test data set of known geographic affiliation. Once validated, a SNP assay could also identify migrants between the different seas and potentially provide genetic profiles from stranded decomposed animals that may not be easily genotyped using other methods. This would make it possible to infer the impact of bycatch and other sources of mortality on specific, potentially threatened geographic populations, thereby becoming a valuable tool in the conservation and management of porpoises.
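How a small SNP panel could yield differential assignment likelihoods can be illustrated with a minimal sketch. It assumes Hardy-Weinberg proportions within each population and independence between loci, and it uses hypothetical allele frequencies for three loci rather than the values in Table 5. The function names `log_likelihood` and `assign`, and the clipping constant `eps`, are our own illustrative choices and not part of any validated assay.

```python
from math import comb, log


def log_likelihood(genotype, freqs, eps=0.01):
    """Log-likelihood of a multilocus genotype in one population.

    genotype: copies (0, 1 or 2) of the reference allele at each locus.
    freqs: reference-allele frequency at each locus in that population.
    Frequencies are clipped to [eps, 1 - eps] so that an allele unobserved
    in the training data does not give a likelihood of zero.
    """
    total = 0.0
    for copies, p in zip(genotype, freqs):
        p = min(max(p, eps), 1.0 - eps)
        # Binomial genotype probability under Hardy-Weinberg proportions
        total += log(comb(2, copies) * p**copies * (1.0 - p) ** (2 - copies))
    return total


def assign(genotype, pop_freqs):
    """Return the most likely population and the log-likelihood for each one."""
    scores = {pop: log_likelihood(genotype, f) for pop, f in pop_freqs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical reference-allele frequencies at three panel loci (illustrative only)
pop_freqs = {
    "WNA": [0.90, 0.80, 0.85],
    "ENA": [0.60, 0.50, 0.55],
    "BALT": [0.20, 0.10, 0.30],
}

best, scores = assign([0, 0, 1], pop_freqs)
print(best, {pop: round(score, 2) for pop, score in scores.items()})
```

In practice, the frequencies would be estimated from the training data after excluding putative migrants, and the differences between log-likelihoods, rather than only the best-scoring population, would indicate how confidently an individual can be assigned.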