Introduction

One of the most commonly utilized and powerful genotyping tools is the single nucleotide polymorphism (SNP) genotyping microarray, which allows one to establish genotypes of millions of SNPs in parallel. However, the microarray design process requires prior knowledge of SNPs surrounding a DNA sequence and needs to be preceded by high-throughput SNP detection techniques (Trevino et al. 2007). Furthermore, microarray design for minor experiments or initial studies, e.g., in wild species, is often economically unjustified. A solution for this may be the application of recently developed genotyping methods that are based on the enrichment of genome sequences associated with restriction enzyme (REs) digestion sites and subsequent utilization of a next-generation sequencing (NGS) technique. An RE-based enrichment approach allows for sequencing only a fraction of the genome and achieves higher multiplexing, significantly reducing experimental costs compared to whole genome sequencing (Elshire et al. 2011). One of the available RE-based enrichment and genotyping techniques is genotyping by sequencing (GBS). An important advantage of GBS is that the derived sequences can be used for polymorphism detection and genotyping without locating them in the reference sequence, and thus permits whole genome genotyping in practically any organism (Elshire et al. 2011). GBS has been shown to be highly useful in animal genomics (He et al. 2014), especially in species for which there are no commercially available microarray tools.

In this study, we attempt to test the performance of a standard, single-digest GBS protocol in selected bird taxa [Syrian Woodpecker Dendrocopos syriacus (SW), and Great Spotted Woodpecker Dendrocopos major (GW)], which are known to be closely related and able to hybridize (Michalczuk et al. 2014; Figarski and Kajtoch 2018), without relying on the reference genome sequence. We evaluated the detected SNPs in terms of their variability, quality and application in population genetics of these species, and tested the application of GBS to the detection of their hybrids.

Materials and methods

Sampled birds

Two sibling Dendrocopos woodpeckers were recently examined with the use of various genetic markers for species delimitation and identification of their hybrids (Michalczuk et al. 2014). Sampling in this study also followed previous research, with slight modifications, and comprised nine individuals of SW, two hybrids (possible backcrosses), eight individuals of GW collected from their sympatric range in Poland, and additionally three individuals of GW from an allopatric population in Norway (Table 1). The results of GBS for three of the SW and one of the GW individuals from Poland were found to be inadequate, so the final dataset used for analyses comprised a total of 18 specimens.

Table 1 General characterization of samples used for genotyping by sequencing (GBS) of selected taxa (Syrian Woodpecker Dendrocopos syriacus, Great Spotted Woodpecker Dendrocopos major and hybrid D. syriacus × D. major)

GBS library preparation and sequencing

DNA was purified from muscle or down feathers of woodpeckers with the use of a Nucleospin kit (Macherey–Nagel, Düren, Germany).

GBS libraries were prepared after a few optimization steps, following previously described methodology (Elshire et al. 2011). The protocol included PstI RE digestion and ligation of 48 indexed adapters designed with GBS Barcode Generator software (http://www.deenabio.com/services/gbs-adapters). In brief, during the library preparation process, commercially synthetized oligonucleotides (forward and reverse complement) containing unique barcodes (4–9 nucleotides long) and a common adapter were annealed to obtain ligation-ready double-stranded oligonucleotides. Genomic DNA (200 ng) was digested overnight with a tenfold PstI-HF enzyme excess (100,000 U/ml; New England Biolabs) at 37 °C. Digestion products were then ligated with 4.8 ng of each adapter using T4 DNA ligase (400 U/µl; New England Biolabs) at 22 °C for 60 min. After clean-up, fragments with ligated adapters were amplified in 18 cycles of PCR, using universal primers (12.5 pmol/µl each) and 2 × Taq Master Mix (New England Biolabs). Purified and diluted libraries were controlled for quality using Agilent’s TapeStaion2200 system (Agilent Technologies) and quantified using DNA binding dye (Qubit DS DNA assay; Thermo Fisher Scientific). Oligonucleotide sequences used here are the same as those used in our previous study (see Gurgul et al. 2018).

Library pools were ultimately sequenced in a single-end 50-bp run on a HiScanSQ system (Illumina, San Diego, CA) using a single lane of TruSeq v3 HiSeq flowcell and v3 reagents (Illumina). Raw, demultiplexed, unfiltered reads were deposited in the NCBI Sequence Read Archive (SRA) database under BioProject number PRJNA397002.

SNP detection, filtering and sample differentiation analysis

The obtained raw reads were assessed for quality using FastQC software and analyzed with the use of recently developed GBS SNP Calling Reference Optional Pipeline (GBS-SNP-CROP) (Melo et al. 2016) dedicated to species with no reference genome sequence available. GBS-SNP-CROP handles the lack of reference genome sequences by building a mock reference sequence based on collapsed reads and clustering strategy. The software was set to a minimum read length of 40 bp and other settings recommended for diploid genomes (Melo et al. 2016). Moreover, to reduce the number of false positives, the SNP matrix was further reduced by applying filters for read depth (n = 4), a minimum number of reads supporting each allele for heterozygote calls (n = 2) and other parameters allowing us to achieve SNP confidence at the 95% level (Melo et al. 2016).

Detected SNPs were additionally filtered to remove those with minor allele frequency (MAF) lower than 0.01, missing genotypes in more than 20% of individuals and deviating from Hardy–Weinberg equilibrium with P < 0.001. Moreover, individuals with more than 20% missing genotypes (four) were also removed. Supporting SNPs analyses were performed using PLINK (Purcell et al. 2007) or TASSEL (Bradbury et al. 2007) software.

To evaluate the performance SNP in a population differentiation analysis, the inter-individual distance matrix was calculated as: 1—identity by state (IBS) similarity, with IBS defined as the probability that alleles drawn at random from two individuals at the same locus are the same. The obtained pairwise IBS distances were then used to create cladograms based on a neighbor-joining (NJ) method (Saitou and Nei 1987) and visualized with the use of a modified version of Archaeopteryx Tree software (Han and Zmasek 2009). After SNP genotypes conversion to sequences, following Lee et al. (2014), a phylogenetic tree was constructed using a maximum likelihood (ML) method along with a bootstrap test implemented in MEGAX software (Kumar et al. 2018). The tree was inferred based on the Jukes-Cantor model (Jukes and Cantor 1969). The bootstrap consensus tree inferred from 1000 replicates (Felsenstein 1985) is presented. Initial tree(s) for the heuristic search were obtained by applying the NJ method to a matrix of pairwise distances estimated using the maximum composite likelihood approach. The additional visualization of population differentiation was made by a multidimensional scaling (MDS) method applied to the IBS distance matrix. To further validate the population structure and the patterns of admixture among populations, Structure software (Falush et al. 2003) was run ten times per each K, initially from K2 and up to K4. An admixture model was used and the parameters of the runs were as follows: 200,000 iterations and 100,000 burn-in period. Then, the Clumpak software (Kopelman et al. 2015) was used to solve the problems of label switching and genuine multimodality (Jakobsson and Rosenberg 2007) and to visualize the results.

Results

PstI-generated library fragment size distribution

The presence of DNA fragments that originated from enzyme digestion within the repetitive genomic DNA sequences was evaluated based on Agilent TapeStation2200 system traces. PstI endonuclease generated a fairly uniform fragment distribution for woodpeckers (Supplementary File 1).

Sequencing reads and SNP statistics

The applied sequencing depth allowed the generation of, on average, 862,900 good barcoded reads per sample (Table 2).

Table 2 Sequencing and single nucleotide polymorphism (SNP) statistics

With the use of the GBS-SNP-CROP pipeline, 3329 of biallelic SNPs were detected with average read coverage across a population of 19.7 (Supplementary File 2). The final set of SNPs was obtained by applying additional population-level filtering criteria in terms of polymorphisms and missing genotypes. This resulted in a set of 12,479 filtered SNPs. The mean MAF across all variants was 0.242 (± 0.129). The MAF distribution analysis showed that 46.9% of SNPs showed MAF up to 0.2 and 81.3% SNPs had MAF up to 0.4. The observed heterozygosity averaged for all SNPs was high (0.475). The average call rate per SNP was 90.1%. When both species were analyzed separately, the average MAF for the GW population (0.251) was slightly higher than that for the SW one (0.231) and the averaged observed heterozygosity was slightly higher in GW samples (0.498 vs. 0.461). The percentage of missing genotypes was lower for SW than for GW (0.09 vs. 0.08).

Raw, demultiplexed, unfiltered sequencing reads were deposited in the NCBI SRA database under BioProject no. PRJNA397002.

SNP performance in population differentiation analysis

The assignment of woodpeckers to clusters with the use of GBS resulted in different outputs from NJ and MDS methods. What is common to both approaches is that they showed that GW had much greater genetic diversity than SW. According to the NJ cladogram there are two clusters that generally correspond to GW and SW; however, in the first one, some individuals seem to have intermediate genotypes (Fig. 1a). GW individuals originating in Norway and Poland formed clearly separated clusters, with one of the analyzed hybrids clustering near the Norwegian samples. In the MDS plot only some analyzed individuals of SW form a clearly separated cluster, whereas two others could be assigned as genetically closer to GW and had genotypes highly similar to those found in the analyzed hybrids. Moreover, GW is divided into two weakly separated groups, of which one encompasses some individuals from Poland and the second birds from Poland and Norway (Fig. 1b).

Fig. 1
figure 1

a Cladogram for the studied woodpecker individuals/species based on identity by state (IBS) and the neighbor-joining method; b multidimensional scaling plot (MDS) based on the IBS distance matrix for the studied woodpeckers (GWDendrocopos major, SWDendrocopos syriacus, SW/GW hybrids). The geographic origin of each individual is given in parentheses. Branches lengths are provided

The results of population structure and degrees of admixture of the populations identified with Structure software are presented in Fig. 2a. The highest likelihood was obtained for K = 2. However, the likelihood for K = 3 and K = 4 was very similar and presented exactly the same patterns as K = 2 suggesting it as the most optimal result of population structure of the studied individuals. The results obtained with Structure software assigned the analyzed hybrids as well as two other SW individuals (which were clustered together with hybrids using MDS) to the GW population (Fig. 2a). Very similar results were obtained when the maximum likelihood method was applied to the construction of the cladogram (Fig. 2b). In this analysis, both hybrids and the two SW individuals with an uncertain nuclear DNA (Table 1) genotype were randomly clustered with the GW population, while the remaining SW individuals formed a clearly separated cluster. No visible subdivision amongst individuals according to their geographical origin could be seen in the ML results.

Fig. 2
figure 2

a Graphical representation of individual contributions obtained with Structure software for K = 2; b Cladogram generated using maximum likelihood (ML) method. Geographic origin of each individual is given in parentheses. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (1000 replicates) are shown next to the branches. For abbreviations, see Fig. 1

Discussion

Previous studies have confirmed the usefulness of GBS for domestic species of birds (Pértille et al. 2016; Zhu et al. 2016). However, this technique has been scarcely applied to wild birds that lack a reference genome sequence (see review in Toews et al. 2016). Thus, in this study, we attempt to detect and genotype a panel of SNPs in selected pairs of sibling woodpeckers and their hybrids sufficient for the purposes of population genetics. With the applied sequencing depth and filtering criteria we were able to create a panel of 2479 informative SNPs for these species.

Genotyping of SW and GW woodpeckers with the use of GBS resulted in patterns only partially concordant with previous species delimitation and hybrid identification studies (Michalczuk et al. 2014). In Michalczuk et al. (2014) the usage of a set of sequence-based markers resulted in reliable delimitation of these species. The most variable marker was found to be the control region of mtDNA, which includes several discriminating SNPs. Discriminating SNPs were also found in each of three autosomal and one sex-lined intron, but the majority of the others were common to both species. Finally, these sequence-based markers were insufficient for the definitive identification of hybrids as the number of unlinked SNPs was too low. Moreover, the use of microsatellites was found to be very problematic, as only a few loci could be amplified and were simultaneously polymorphic in both species. However, these microsatellites allowed for the assignment of some specimens as possible hybrids (Michalczuk et al. 2014). GBS resulted in a quite unexpected pattern as it showed that some phenotypic SW and both phenotypic hybrids possessed genotypes that are clustered within GW genotypes.

GBS is expected to be much more powerful than mtDNA sequences or microsatellites for the identification of species and their hybrids. Assignment of some phenotypic SW and all hybrids to a genetic pool characteristic of GW suggests that hybridization between GW and SW is a much more frequent phenomenon than expected based on previous genetic studies (Michalczuk et al. 2014) and field observations (Figarski and Kajtoch 2018). Apparently, ongoing hybridization, including backcrossing with GW, has resulted in a high reduction of characteristic SW alleles in the genomes of birds from sympatric populations. It is important to highlight that the described pattern is characteristic of urban populations, where contacts between SW and GW are quite common as GW occupies parks and larger wooded areas in the vicinity of orchards and avenues of trees occupied by SW (Kajtoch and Figarski 2017). It would be interesting to verify the genomics of woodpeckers from rural populations where SW only rarely encounters and mates with GW [SW occupies mostly orchards, whereas GW mainly occupies forests (Kajtoch and Figarski 2017)]. If this pattern is confirmed from larger samples from different populations, it would mean that a large number of phenotypic SW have GW-like genome characteristics. This could have serious consequences for any ecological, ethological or biological studies on these woodpeckers, as plumage may not be a reliable indicator of species affinity of birds observed in the field.

Our findings indicate that GBS could have very wide applications in ornithology, probably covering not only species delimitation and population genetics, but also hybridization and parental studies.