Genotyping-by-sequencing based SNP discovery in a non-model rodent, the endangered hazel dormouse

The hazel dormouse Muscardinus avellanarius presents an exemplary non-model species that is both locally threatened and whose genetic status is not fully understood owing to insufficient resolution of the currently available molecular tools. We performed normalized Genotyping-by-Sequencing (nGBS) on 48 hazel dormouse samples collected across the species European distribution, aiming at discovering useful single nucleotide polymorphism (SNP) markers for the assessment of population structure and genomic diversity. The analyses of > 24,000 SNPs showed a high divergence between the Eastern and Western lineage of the species with high rates of SNP allele fixation, consistent with previous studies suggesting the divergence of lineages occurred over 2 mya. These results indicate that investigating inter-lineage as well as within-lineage genetic composition will be a conclusive approach for identifying conservation strategies in the future. Results presented here indicate the highest genetic divergence in the Italian and Lithuanian populations. We document how nGBS allows the discovery of SNPs that can characterize patterns of genetic variation at multiple spatial scales in a non-model organism. We document how nGBS allows the discovery of SNPs that can characterize patterns of genetic variation at multiple spatial scales in a non-model organism, potentially informing monitoring and conservation strategies.


Introduction
The hazel dormouse (Muscardinus avellanarius, linnaeus, 1758) is currently undergoing population decline in many areas of its European distribution. Possible reasons are habitat fragmentation, decreasing habitat quality and climate change (Bright et al. 2006;Goodwin et al. 2018, Mortelliti et al. 2010. Despite being internationally protected and the subject of several ongoing conservation projects, hazel dormouse populations are still decreasing annually in parts of Europe (Goodwin et al. 2017;Schulz and Büchner 2018;Hutterer et al. 2021). For instance, populations in the north-western range have been reduced by up to 72% in the last decades (Goodwin et al. 2017). In addition to this sharp decline, knowledge regarding some genetic aspects of the biology of this arboreal species is still incomplete (Mouton et al. 2017).
In particular, hazel dormouse research has not yet benefitted from the wide range of genomic techniques now available to study non-model organisms (Allendorf et al. 2010;Funk et al. 2012). Advances in molecular genetic techniques have made next generation sequencing (NGS) a staple in conservation and population genetics studies (DeSalle and Amato 2004; Allendorf et al. 2010). NGS allows for a higher resolution of population genetic parameters, such as effective population size, inbreeding, population structure,which may not be possible with traditional marker sets (Hohenlohe et al. 2021). The hazel dormouse is a candidate species to benefit from NGS as this species is largely arboreal, nocturnal and generally remains at low population densities, making it challenging to detect and logistically prohibitive to monitor on a large scale (Goodwin et al. 2017). Given these constraints, genetic solutions can inform applied conservation strategies for this elusive rodent.
Studies based on mitochondrial DNA (mtDNA) analysis and autosomal microsatellites have suggested that the hazel dormouse is split into two genetically distinct lineages or even cryptic species (Western lineage and Eastern lineage) in Europe (Mouton et al. 2012a(Mouton et al. , b, 2017. Further, Bani et al. (2017) used eight microsatellites to assess the influence of habitat fragmentation on the genetic diversity in the hazel dormouse and found inbreeding and genetic drift to occur in populations living in fragmented landscapes. Such genetic factors can play an important role in the health and viability of populations occurring in small, isolated patches (Frankham 2005). These initial genetic results indicate that the hazel dormouse is highly divergent, both at a large scale, given the presence of two distinct lineages, and at a finescale, due to habitat fragmentation and limited dispersal capability. Consequently, a rigorous assessment of genetic substructuring may reveal evolutionary significant units that are extremely valuable for future conservation strategies.
In this study, we aimed to leverage nGBS technologies on hazel dormouse populations to (i) examine the present genetic structuring among populations and (ii) discover informative SNP markers at both broad and fine-scale. We provide a foundation for further genomic analyses on hazel dormouse populations, which will be able to be used for future developments in scientific assessment, monitoring and conservation, especially in regions where populations are currently declining.

Sample preparation
We obtained 98 tissue samples from across the European distribution of the species, which were collected between 2008 and 2018 (Supplementary Table S1). Of the collected samples, 74 originated from the known range of the Eastern lineage (EL) and 24 were individuals from the Western lineage (WL) (Mouton et al. 2017), representing both main lineages of the hazel dormouse in continental Europe. We divided the samples into 16 populations classified solely based on geographic sampling locations (Supplementary Table S1).
DNA was extracted using the Qiagen DNeasy Blood & Tissue Kit (Qiagen GmbH, Hilden, Germany) according to the manufacturer's protocol with the following slight modifications: (i) to receive RNA-free genomic DNA, 4 µL of RNAse A were added during the lysis step at 56 °C, (ii) after the washing steps the spin columns were centrifuged for 3 min at 13,300xg and dried at 56 °C for 5-10 min, (iii) to increase the yield, DNA was eluted in two steps with 40 µL AE buffer each. After incubation for 5 min at room temperature, the columns were centrifuged at 13,000xg for 1 min resulting in a total DNA extract volume of 80 µL per sample. Samples were only sent for nGBS if ≥ 300 ng of high molecular weight DNA could be provided, as specified by the protocol from Biosearch Technologies Genomic Analysis by LGC (www. biose archt ech. com) (Arvidsson et al. 2016).

Sequencing protocol
Library preparation and sequencing was performed according to the LGC protocol for nGBS (Arvidsson et al. 2016). Briefly, genomic DNA was digested using the restriction enzyme MslI. Next, the restriction digests were mixed with one of 96 inline-barcoded forward blunt adaptors, followed by ligation and thus converted into GBS libraries. The DNA libraries were then purified using magnetic beads. After purification, DNA was PCR-amplified using standard Illumina TrueSeq amplification primers. The normalization step was performed using Trimmer Kit (Evrogen), which reduces the number of abundant fragments, and subsequently reamplified again using 14 PCR cycles. The nGBS libraries were then size selected on Blue Pippin, discarding fragments smaller than 300 bp and larger than 500 bp long. Sequencing was carried out on an Illumina NextSeq 500/550 using V2 Chemistry and 300 cycles. Sequences were demultiplexed with Illumina's CASAVA data analysis software and proprietary software developed by LGC Genomics.
Upon receiving the sequencing data, the demultiplexed reads were first trimmed to remove adapter sequences as well as stretches of low quality and ambiguous bases using Trimmomatic v.0.38 (Bolger et al. 2014). Then, the reads were mapped to the hazel dormouse reference genome (PRJNA399340, Broad Institute) using BWA-MEM (v 0.7.12-r1039) with default parameters (Li und Durbin 2009). SNP calling was carried out using samtools (v1.9) Li 2011) mpileup function. The called raw SNPs were filtered out if loci showed > 15% missing data across samples, minor allele frequency (MAF) was lower than 0.05, sequencing depth was lower than 3.0 x or higher than twice the mean depth, or if loci showed a genotype quality < 20. In addition to these criteria, SNPs were also pruned to account for linkage disequilibrium using r 2 > 0.5 in 100 kb windows using bcftools (v1.9) ).

Population genomic analyses
To understand structuring and to have first trends in allele frequencies within and among the sampled populations, we first performed a principal component analysis (PCA) using PLINK (v1.9) (Chang et al. 2015), which utilizes a variance-standardized relationship matrix which requires no a priori information on populations. We then calculated genotype likelihoods using ANGSD (v0.930) (Korneliussen et al. 2014). Here, we utilized the samtools model to estimate likelihoods with a SNP p-value of 1e−6. The software ADMIXTURE (v 1.3.0) (Alexander and Lange 2011) was subsequently used to infer population structure between individuals using K = 1 to K = 20 with 50 repetitions of each K allowing for investigation of convergence patterns. We merged the resulting runs using CLUMPAK (Kopelman et al. 2015) and determined the optimal K based on the Evanno method (Evanno et al. 2005). To investigate population differentiation, we calculated population pairwise F ST values using ANGSD. Using a two-dimensional site frequency spectrum (2dSFS) for each pair of populations and the reference genome as the ancestral genotype, we calculated the weighted F ST estimates. Lastly, we performed calculations of genomic diversity for each of the 16 geographically identified populations using the population function in STACKS (v.2.41) (Catchen et al. 2013). The genomic diversity values for the EL and WL were calculated separately to prevent underestimation of multiple values due to fixation of a high number of alleles in each of the major lineages.

DNA extraction and sequencing
DNA extraction resulted in 48 samples (49%) meeting the quality specifications for nGBS, i.e. the required quantity and quality of DNA. After sequencing and demultiplexing at LGC, we received on average 3 million of reads per individual with an average coverage of 15.08 x. Mapping to the hazel dormouse reference genome resulted in an average alignment rate of 99.5%. Three samples, from German (n = 2) and Czech (n = 1) populations, mapped below 60% to the reference genome and were subsequently removed from downstream analysis. Further, one German and one Belgian sample could not be assigned to exact spatial coordinates and were excluded from further analysis. After SNP filtering, we identified 24,903 polymorphic SNP loci across the genome in 43 individuals representing seven populations ( Fig. 1; Italy, Belgium, Netherlands, Baden-Wuerttemberg, Hesse, Saxony, and Lithuania; see Supplementary Table S1 for more details). The average depth across individuals was 11.9 x after filtering, with a range spanning from 3.9 x to 24.3 x.

Population genomic analyses
The resulting 24,903 SNPs were used to perform basic statistical analyses on population differentiation and genetic diversity. The PCA analysis identified two main clusters, which coincided with the expected split between the EL and WL. In total, 91.5% of the variance in identified SNPs is explained through this axis ( Fig. 2A). An additional separation was found between the Lithuanian and the German populations (Hesse, Saxony, Baden-Wuerttemberg) ( Fig. 2A). The ADMIXTURE analysis revealed K = 3 clusters as most likely, which equally confirmed the split between the lineages, and the Lithuanian population as a separate third cluster. Higher values of K showed the separation of the Italian populations and some substructuring within the German populations.
PCA analysis carried out on each lineage separately confirmed the split between the Lithuanian and all German populations on the first axis (42.6% variance explained). In addition, German populations showed some differentiation, with a gradient from Hesse to Saxony (Figs. 2B and 8.36% variance explained). The population in Saxony showed a high degree of within population variance. In the WL, the Italian population is separated from the Belgian and Dutch populations (39.8% variance explained) (Fig. 2C). Pairwise

3
F ST values confirmed these results, which identified the Italian population as divergent within the WL (Supplementary Figure S1).
Observed heterozygosity (H o ) values for each population ranged from 0.17 to 0.29 (Supplementary Table S2). WL individuals showed on average higher heterozygosity values compared to EL (H o = 0.26 for WL, and 0.18 for EL, respectively). Private alleles calculated across all lineages were found in all populations; however, the number varied greatly ranging from five in Baden-Wuerttemberg to 586 in Lithuania. Specifically, the Lithuanian, Hesse, and Italian populations showed the greatest number of private alleles (Supplementary Table S2).

Discussion
In this study, 48 dormouse individuals from seven populations were sequenced and 24,903 SNP loci were identified, which were successfully used to study lineage differentiation, population structure and genomic diversity. These are promising for future use in conservation management of this species. The first aim of this study was to investigate the genetic structure at both a broad and fine-scale. In this context, we found a strong divergence between the EL and WL, in line with previous findings, which evidenced that the divergence between the two lineages likely occurred over 2.76 mya (Mouton et al. 2017). Such a long history of divergence may explain our finding of a substantial degree of SNPs being fixed among the lineages. While these loci are useful for lineage discrimination, they can cover weaker signals of population substructuring occurring within lineages and populations.
At a more regional scale, all populations within the EL showed moderate levels of observed heterozygosity. Interestingly, there is a higher degree of within population diversity in the Saxony population. These results indicate the presence of genetic differentiation on a regional scale, as shown by earlier work based on microsatellite markers (Bani et al. 2017). This trend stems from the species' limited dispersal capability, which is further diminished through habitat fragmentation (Bani et al. 2017). The higher number of private alleles identified in some populations (Supplementary  Table S2) may be an indication of genetic differentiation. However, a more extensive, evenly distributed sampling strategy would be needed to verify this. Further, the genetic  Mouton et al. (2017) differentiation in Italy confirms the knowledge regarding the isolation of populations during the Last Glacial Maximum, where Italy served as a refuge leaving unique genomic signatures across a variety of taxa (Hewitt 1999(Hewitt , 2000.
The second aim was to identify informative SNP markers for further use in conservation efforts for this endangered rodent. We detected a large pool of SNPs that can be utilized for future studies on the hazel dormouse. Specifically, studies looking at fine-scale patterns of genetic diversity and differentiation, as well as range-wide phylogeography of the species will add valuable knowledge about the demographic history of this species and will contribute to the identification of evolutionary significant units important for conservation. Previous studies have documented the applicability of reduced SNP panels to detect regional population structures and cryptic species (Antao et al. 2011;Muñoz et al. 2015) and to allow for a high-resolution genetic monitoring of threatened species (Kraus et al. 2015;von Thaden et al. 2017). Adapting such a panel for a platform enabling accurate genotyping of noninvasively collected samples (von Thaden et al. 2017(von Thaden et al. , 2020, such as hairs or scats, will greatly improve its applicability in monitoring and conservation projects for this small, elusive and protected mammal.

Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.