Background

Organismal diversity is underpinned by diversity in life history and ecological characteristics among taxa, which in turn produce different underlying genetic patterns at the population and species levels [1,2,3,4,5]. Biological characteristics can determine the process and type of speciation. For example, nonecological speciation (speciation without divergent natural selection) produces ecologically similar/identical species that are allo- or parapatric replacements of each other [6, 7] and is more likely in low dispersal taxa that also show niche conservatism [8]. These biological and ecological characteristics often lead to cryptic speciation across many plant and animal taxa, where they complicate the application of many commonly used species criteria, and species delimitation relies largely, if not entirely, on genetic data.

The underlying diversity in speciation processes challenges the idea that any single genetic species delimitation model can be universally applicable. For example, multiple empirical studies have shown that commonly used multispecies coalescent (MSC) models can over-split species level diversity in low dispersal taxa because such systems violate the underlying assumption of panmixia [9,10,11,12,13,14], a sentiment echoed in theoretical literature [15]. Regardless of ongoing debates (e.g., [16]), we argue simply that MSC implementations taken at face value, and with well-resolved genomic or sub-genomic data, have strong potential to inflate species numbers in low dispersal systems (e.g., [17]). A solution to over-reliance on genetics is integrative species delimitation [18, 19]. However, in many poorly known or “cryptic biology” taxa (e.g., [20]), like minute animals that live under rocks and logs, integrating behavioral, ecological, and/or phenotypic data is challenging to impossible. Morphological conservatism resulting from niche conservatism [21] means that distinct species are often not morphologically diagnosable. In an integrative framework, some lines of evidence cannot be feasibly studied, while others are clearly conservative. This leads to a fundamental conundrum—how can we rigorously delimit species when genetic analyses are biased to inflate, and other evidence is inaccessible or lumps evolutionarily significant diversity?

Many taxa in the arachnid order Opiliones present challenges for species delimitation. Their microhabitat specificity and low dispersal ability leads to nonecological speciation and high population genetic structure, where related congeners rarely co-occur in sympatry ruling out direct tests for reproductive isolation (e.g., [10, 22,23,24,25,26,27,28,29]). The biological characteristics associated with nonecological speciation in low dispersal Opiliones make several commonly used species criteria inapplicable or inappropriate in these systems. For example, morphological conservatism diminishes the utility of the morphological species criteria, and niche conservatism precludes ecological species criteria. In these cases, genetic data become the primary data type for species delimitation. However, these complexes represent classic cases of “too little gene flow”, where distinguishing population genetic structure from species level divergence is not easy, and genetic species delimitation analyses overestimate species diversity (e.g., [10, 30, 31]).

Dispersal-limited microhabitat specialists are found in a diverse array of other taxa, including vertebrates and plants, with equal difficulty in resolving species-population boundaries (e.g., [12, 14]). This under-appreciated issue remains one of the most difficult challenges for empirical species delimitation and its implications extend to a diverse array of taxa regardless of biological characteristics (e.g., [32]). A possible solution to this issue is to use information from known systems to infer the unknown. In practice, this means using information derived from taxa with robust well-established species limits to infer species limits in a difficult cryptic species complex that shares similar biological and ecological characteristics and mode of speciation. Supervised machine learning is ideally suited for this approach, as known labeled data sets can be used to train a model that is then applied to an unknown unlabeled data set.

Here we use a combination of somatic and reproductive morphology, mitochondrial DNA, double-digest RADSeq (ddRAD; [33]), and ultraconserved elements (UCEs; [34, 35]) to illustrate the species conundrum in Theromaster brunneus [36], a widely distributed species from the southern Appalachian Mountains with high potential for cryptic speciation (Fig. 1). Our goal is not to exhaustively test species limits using every data and analysis type, but instead to demonstrate the difficulty of delimiting species in such taxa using common genetic species delimitation approaches. We highlight and emphasize a novel supervised machine learning approach, using training data from known taxa with similar biological characteristics as T. brunneus, to effectively (and conservatively) delimit cryptic species using phylogenomic data.

Fig. 1
figure 1

Distribution of Theromaster brunneus sampled in this study. Colors correspond to lineages identified in phylogenomic analyses (see Fig. 3). Numbers correspond to locations included in ddRAD analyses (Additional file 1: Table S1); sites with an enclosed circle were also sequenced for UCEs. Left inset: general distribution of Theromaster. Right inset: live photo of T. brunneus. Clade names: Great Smoky Mountain National Park (GSMNP), Tennessee Valley (TNV), northern Trans Asheville (nTA), southern Trans Asheville (sTA)

Results

Taxon sample

Our taxon sample included specimens from 76 different localities (Additional file 1: Table S1), 18 of which were only available for morphological study (i.e. preserved in 70–80% ethanol). All specimens are deposited in the San Diego State University Terrestrial Arthropod Collection. Locality data for all specimens housed in this collection (with SDSU_TAC or SDSU_OP catalog numbers) have been deposited at the Symbiota Collections of Arthropods Network (https://scan-bugs.org/portal).

Historically, the type specimens of Theromaster brunneus and T. archeri have never been directly compared by those who described or made any taxonomic acts affecting these species [37,38,39]. Further complicating the issue, these two species are reported from the same cave (McFarland Cave, AL) [37, 38]. Sympatry among congeneric Laniatores is extremely rare in northern temperate taxa. Several observations of Theromaster exist on iNaturalist from caves in northeast Alabama, however all of the individuals photographed are juveniles, which do not possess diagnostic species-specific characters.

Our examination of the male holotype of T. archeri did not alleviate any uncertainty. The genitalia and both pedipalps were previously dissected from the specimen and are no longer associated with the vial. As such, we cannot conduct comparative genitalic analyses or even confirm the sex of the specimen; all female Theromaster lack the cheliceral projections that are diagnostic for T. brunneus. It is possible that T. archeri is a local cave-adapted form of T. brunneus at the southern limits of the distribution, however neither the description nor our examinations suggest any cave adaptation. We unfortunately cannot confirm or deny species limits and therefore must retain T. archeri as a distinct species based on historical work. The final determination will likely only be possible with DNA sequencing of the holotype specimen (i.e., [40]), although we do not expect either outcome (a valid species or local population of T. brunneus) to affect species delimitation results in this study. It could be possible that our TAG lineage (see below) corresponds to T. archeri.

Morphological analyses

Voucher specimen data and accession numbers are provided in Additional file 1: Table S1. Representative morphological images (Figs. 2, 3, Additional file 2: Figs. S1–S5) confirm the highly conservative morphology across Theromaster, including male genitalia and male cheliceral modifications, a sexually dimorphic feature. However, there is clear differentiation in the habitus morphology of Roan Mountain (OP322, location #33), which has a more pointed eye mound, dorsal tergites clearly separated by grooves, more obvious dorsal spines, and more defined pigmentation patterns (Fig. 3). This specimen is clearly distinguished based on the morphological species criterion, suggesting at least two putative species.

Fig. 2
figure 2

Representative morphological images for select lineages of Theromaster brunneus. Each row of images corresponds to a single male individual, with the voucher information indicated in the left-most image. Columns of morphological characters correspond to chelicerae, pedipalp, and penis (L–R). Colors correspond to lineages identified in phylogenomic analyses (see Fig. 3)

Fig. 3
figure 3

Phylogenomic results based on RAxML analysis of ddRAD and UCE data, with representative habitus images of select lineages. All nodes have 100% bootstrap support unless indicated. A ddRAD phylogeny of the 61_48 dataset (see Additional file 1 for matrix details). TNV, GSMNP, nTA, and sTA lineage samples were included in DAPC and STRUCTURE analyses. B UCE RAxML phylogeny based on the 70% taxon occupancy matrix

Phylogenetic and clustering analyses

Accession numbers for all samples sequenced in this study are provided in Additional file 1: Table S1. RAxML analysis of cytochrome c oxidase subunit I (COI) was mostly poorly supported but revealed highly divergent Roan Mountain and “Southwest” lineages. RAxML analyses of ddRAD data (~ 75% taxon occupancy matrix with 1001 loci and 89,663 nucleotides) show at least seven main lineages (Fig. 3, Additional file 2: Fig. S7, Additional file 1: Tables S2–3). Relationships among these lineages are all highly supported (bootstrap of 100), but a handful of nodes within some clades show lower support, although all but one of these nodes are still above 70. SVDQuartets analyses result in similar lineage composition and interrelationships, but with generally weaker support (Additional file 2: Fig. S8). Excluding a single instance of sympatry at Roan Mountain, all main lineages are allopatric (Fig. 1).

Both partitioned RAxML and SVDQuartets analyses of the concatenated UCE matrix (70% taxon occupancy with 324 loci and 108,875 nucleotides) generally supported the ddRAD topology, with Roan Mountain, Southwest, and TAG as divergent and early-diverging lineages (Fig. 3, Additional file 2: Fig. S9). However, in both the RAxML and SVDQuartets trees, the “southern Trans Asheville” (sTA) clade was not recovered as a monophyletic group. All nodes in the UCE RAxML tree have a posterior probability greater than 0.95, with the exception of a single node with 0.94. DAPC clustering of the ddRAD SNPs reveal optimal K values (Additional file 2: Fig. S10) that correspond to the main lineages, with further subdivisions for the sTA and “Tennessee Valley” (TNV) groups recovered in phylogenetic analyses. STRUCTURE analyses of ddRAD SNPs likewise recover the main lineages, but as K values increase, further subclusters are found that correspond to geographic and/or phylogenetic groupings (Additional file 2: Fig. S11). The best-fit K value is K = 3 using the ∆K method [41], but K = 10 using the Pritchard et al. [42] method. VAE analyses cluster samples as in other analyses, and UCEs show more overlap among groups (Additional file 2: Fig. S12).

Phylogenomic species delimitation

The COI bPTP analyses supported 11 species (Additional file 2: Fig. S6), splitting most main genetic lineages into at least two species. BFD* hypothesis testing results are presented in Fig. 4 and Additional file 1: Tables S4–S5. Matrices consisted of 1828 and 415 SNPs for ddRAD and UCE datasets, respectively. The ddRAD dataset favored K = 3 corresponding to Roan Mountain, Southwest, and all other samples, however, the likelihood continuously increased with increasing number of species (Fig. 4A, B). BFD* analyses for the UCE dataset favored all samples as species (K = 15), with a second less likely peak at K = 3 (Fig. 4C, D).

Fig. 4
figure 4

Results of SNAPP/BFD* species delimitation analyses based on SNP datasets. A, B ddRAD phylogeny based on RAxML analysis of the 18_12 taxon occupancy matrix, with vertical lines indicating species hypotheses tested, and likelihood estimates for each hypothesis (averaged from two runs). C, D UCE phylogeny based on RAxML analysis of the 70% taxon occupancy matrix, with vertical lines indicating species hypotheses tested, and likelihood estimates for each hypothesis (averaged from two runs). Note: branch lengths for both cladograms were adjusted to fit vertical lines for easier interpretability and have no relative meaning; the phylogeny is purely to depict branching pattern and hypotheses tested. Blue vertical lines in A and C and blue data points in B and D indicate species hypothesis decisively favored by Bayes Factor analyses (see Additional file 1: Table S5 for values). Colors correspond to lineages identified in phylogenomic analyses (see Fig. 3)

CLADES requires that every predefined population, defined here as the main ddRAD genetic lineages, have at least one representative sequence present at every given locus in the input file. As such, the final T. brunneus dataset included 52 UCE loci. CLADES analyses of the UCE SNP dataset based on the “general” training dataset favored all populations as species, while analyses based on the “custom” training dataset favored two cryptic species (Fig. 5), corresponding to Roan Mountain and Southwest clade. While a “ground truth” for what the actual cryptic species are is lacking, the effectiveness of a “custom” dataset relative to a “general” dataset can be assessed when comparing assignment probabilities of specimens that are from the same geographic location. In these cases, the probability that specimens from the same location belong to the same species should be relatively high due to the allopatric nature of species in low dispersal taxa undergoing nonecological speciation. When using the “general” training dataset, specimens from the same geographic location were classified on average as the same species with 0.116 probability, while with the “custom” dataset, specimens from the same population were classified on average as the same species with 0.753 probability.

Fig. 5
figure 5

Species delimitation results supported across data and analysis types. For any given analysis, multiple bars of the same color indicate lineages that were split in that analysis, and any lineages that were lumped in any given analyses are colored gray. Colors correspond to lineages identified in phylogenomic analyses (see Fig. 3)

Although we have high confidence in our species delimitations, we refrained from formally describing the two new species until more specimens of the undescribed species from Roan Mountain can be collected.

Discussion

Major portions of the tree of life include branches (species) that are unknown or poorly known, and integrative studies in such taxa are challenging for many reasons. At the same time, collecting sub-genomic data for these taxa has become increasingly easy, leading to species delimitations founded largely, if not entirely, upon genetic data. Many studies have documented overestimation of species numbers by commonly used genetic delimitation methods both in empirical (e.g., [9,10,11, 13, 14, 31, 32, 43, 44]), and theoretical/simulation studies [12, 15, 17, 45, 46]). Increasing the number of loci may also increase support for an incorrect hypothesis in Bayesian analyses [47], and most relevant here, supporting species level divergences for intraspecific populations [44]. Researchers studying cryptic species with inherently high population structure and allo- or parapatric distributions face a dilemma when delimiting species. Current phylogenomic species delimitation analyses overestimate species numbers, and without morphology, behavior, or distribution to assist, the degree of over-splitting is unknown. Systematists researching these taxa must be conservative in final species hypotheses (e.g., [17]), while simultaneously acknowledging that the actual number of species may still be underestimated.

In our study, most genetic analyses recover at least six lineages as species (Fig. 4), but the consensus favors three species (Roan, SW, all others), the latter two of which are morphologically cryptic. The supervised machine learning approach we employed with a “custom” training data set provided a reasonable and biologically informed hypothesis of three total species. Phylogenies derived from ddRAD and UCEs were essentially congruent, but the inferred number of species in the BFD* analyses differed. Both had a peak at K = 3, but UCE data had a higher likelihood at K = 15, and we argue that leading to this peak at K = 15 in the UCE data, the MSC over-splits because these taxa clearly violate the assumption of panmixia within species. BFD* analyses using ddRAD data did not overestimate species. However, like UCE analyses, there is an obvious increasing trend in likelihood with increasing species numbers leading us to ask if K = 3 would still be the most likely hypothesis if more sequenced samples from different collecting localities (i.e., populations) had been included. As a validation approach, only a limited set of species-level hypotheses are tested in empirical studies using BFD*. It might be beneficial for researchers using this approach, especially for low dispersal taxa, to fully explore hypotheses to see if “false positives” are more prevalent.

Training data justification and considerations

In this study we present a first attempt at demonstrating the potential of a supervised machine learning approach to genetic species delimitation using biologically relevant customized training data sets. Here we provide some justification for our training data set choice and considerations for future work. In the case of Opiliones, which are largely understudied from a modern genetic perspective, there is scant genome-scale species level data available to serve as training data. Many Opiliones studies focusing on species delimitation using genetic data to identify or conclude the presence of cryptic species, resulting in uncertainty across species boundaries (e.g., [30, 31, 43]). The level of congruence and support for species delimitations seen in Metanonychus is exceptional [10], making this the most suitable training dataset for low-dispersal Opiliones to date, with potential application to many other unresolved putative cryptic species complexes.

Genetic statistics can be useful in identifying the overlap of actual and potential species boundaries between data sets. For UCE loci, the mean K2P-corrected genetic distance across all T. brunneus samples is 2.98% (range 0.37–8.92) and falls within the range of genetic distances seen across species of Metanonychus (mean = 14.746, range = 1.59–26.91). Speciation events within Metanonychus span recent and older divergences, where the mean K2P-corrected divergence of COI across the shallowest species-level split is 13.15% and the deepest split has a mean divergence of 27.15%. Mean COI divergence across Theromaster samples is 6.51%, with a maximum of 20.2%. Taken together, these genetic measures indicate that any potential species-level divergences in T. brunneus fall within the range of actual species divergences seen in Metanonychus. Sensitivity to the underlying model is an obvious and related consideration. In this regard, in taxa with better representation of genome-scale species level data, it would be worthwhile to explore varying combinations of suitable taxa in the training data set. For example, including shallower or deeper species divergences, or data from a phylogenetically more diverse range of taxa (i.e., from other genera or families with similar biological characteristics).

One concern relates to the differences in dispersal dynamics between the regions each taxon is found in (Pacific Northwest for Metanonychus and southern Appalachians for Theromaster). Dispersal dynamics and bio-/phylogeographic histories are different across these regions, where geologic and other abiotic factors (e.g., river formation) can drive the speciation process, especially for the ancient and more topographically complex southern Appalachian Mountains. Despite these differences, given the biological and ecological similarity of these taxa, we hypothesize that while the dynamics differ across regions, their responses and associated genetic signatures to any abiotic factors influencing speciation will be similar. It follows that using taxa that inhabit the same region should increase the effectiveness of our supervised approach, much in the same way as comparative phylogeography is a powerful approach for elucidating common underlying regional biogeographic patterns (e.g., [48, 49]).

There are many questions in relation to the choice of training and testing datasets that deserve further attention. How closely related must the taxa be to be considered similar enough for this approach? What is an appropriate divergence date threshold? How ecologically similar (e.g., degree of climatic variable overlap) should they be? Questions relating to similarity and divergence dates can be explored with further data sets, as well as the relative importance of genetic versus ecological similarity between training and testing taxa. This choice will in fact be dependent on the organismal type and may ultimately need to be somewhat subjective, where the experience and organismal knowledge of the taxonomist will be critical in determining suitability of any training data set. Niche overlap can be quantified, however, in the case of our study the differences in local climates and the geographic distance between their respective distributions makes assessing niche overlap difficult. More importantly, the bioclimatic variables used in species distribution modelling do not necessarily capture the similarity in microhabitat “climate” for taxa found underneath the forest surface, living in deep leaf litter and underneath woody debris. Future studies should advance towards quantifiable metrics that determine if a species group(s) is an appropriate training dataset, as well as attempting this approach on taxa that are more directly linked to the bioclimatic variables used in species distribution modelling (e.g., plants).

Incorporating natural history into genetic species delimitation

Genetic species delimitation is driven largely by computational tools, model testing, and a desire for objectivity in analyses. However, this is increasingly at the expense of considering the biological characteristics of organisms. Cryptic species, and those taxa that undergo non-ecological speciation, are one of the biggest challenges in species delimitation, as many species criteria cannot be used in practice. Moving forward, an additional approach to delimiting cryptic species with genetic data can be using information already available, in this case inferring species in difficult unknown systems by using data from robustly known taxa with similar biological characteristics and modes of speciation. Recent integration of machine learning in species delimitation [10, 50, 51] provides algorithmic options which are versatile and customizable. Here we used a system-specific “custom” training dataset in a supervised machine learning framework to delimit cryptic species, where the training data were derived from Metanonychus, a previously studied system with robust species supported through multiple data types and with similar biological characteristics to Theromaster. These taxa share similar biological and ecological characteristics, most importantly dispersal ability and microhabitat preference, and both undergo the same type of speciation, and as such are expected to have comparable underlying genetic patterns associated with populations and species. The effectiveness of using customized and biologically relevant training data is evidenced by the probability of assigning specimens from the same geographic location to the same species, which increased dramatically with the “custom” dataset relative to the “general”.

The power of applying a supervised machine learning approach derives from the ability to create custom training data sets that are specific to each study system, and to various classes of genetic data (e.g., UCE, RADseq, Sanger), capturing the inherently different characteristics of genetic data types. In this way, our approach combines a computational tool ideally suited for species delimitation, in this case, a supervised machine learning algorithm as a classification tool, with knowledge of the biology and natural history of the focal organisms derived from organismal expertise, leading to more informed, relevant, and reasonable species delimitation decisions when relying on genetic data only. The recently developed program DELINEATE [46] takes a similar approach, using the information from known species to calculate speciation parameters which are then applied to delimit unknown samples. Similarly, a reference-based taxonomic approach was used to delimit putative new species based on genetic distances of known species in a group of closely related and ecologically similar lizards [52].

Conclusions

There are extremely well-studied systems for which genetic species delimitation is largely successful, for example the model organism Drosophila [53]. However, for poorly studied groups (perhaps the majority of life) where basic biological and ecological knowledge can be difficult or impossible to acquire, inferring any biological details in an unknown or new taxon commonly relies on generalization from similar or closely related taxa where that information happens to be known. Our approach using known species limits in a supervised machine learning framework to infer unknown limits is a logical analytical extension of this inference process and should be universally applicable to species delimitation in any taxon, particularly when cryptic species are anticipated or prevalent.

Materials and methods

Study system and taxon sampling

The genus Theromaster [37] currently includes two described species: the widespread T. brunneus [36] (Fig. 1), and the poorly described T. archeri [38] from several caves in extreme northeastern Alabama. Theromaster are small (body length usually < 3 mm), short-legged, and most often found in sheltered microhabitats under rocks and logs. Because of these natural history characteristics, we anticipate high levels of population genetic structuring and potential cryptic diversification, as seen in many other northern temperate Opiliones with similar biology (e.g., [24,25,26, 43]). Cryptic diversification is further expected because T. brunneus occurs in one of the most topographically complex regions of North America, the southern Appalachian Mountains. The Southern Appalachians are a well-known biodiversity hotspot for animals including vertebrates (e.g., [54,55,56,57,58]) and arthropods (e.g., [26, 59,60,61]). For arachnids, almost all available molecular datasets for “wide-ranging” taxa in this region indicate in situ phylogeographic diversification, with multiple lineages that likely represent cryptic species (e.g., summarized in [43, 61, 62].

Theromaster brunneus is known from less than 10 literature records [37,38,39, 63] from western North Carolina, eastern Tennessee, northern Georgia, and northern Alabama. However, our own collections and museum specimens indicate a broader distribution (Fig. 1) that is atypically large for a single species of northern temperate laniatorean Opiliones. We first constructed a distribution map for T. brunneus, based on original collections from the Hedin lab and collections of Dr. William Shear (specimens now housed at San Diego State University). Our specimen sample spans the known geographic distribution of the species, including specimens from near the type locality (“valley of Black Mountains”, North Carolina). All samples used in this study are identified as or considered T. brunneus; specimens morphologically identifiable as the questionable T. archeri could not be collected. To attempt to resolve the taxonomic issues associated with T. archeri we also examined the holotype specimen held in the American Museum of Natural History. Previous UCE-based phylogenomic analyses of travunioid harvestmen strongly supported a Theromaster + Erebomaster clade [64]; as such we used Erebomaster samples as outgroups for mitochondrial and UCE datasets, and rooted ddRAD phylogenies (without Erebomaster) based on the UCE topology.

Morphology

Adult male T. brunneus have distinctive cheliceral modifications, making this taxon easily recognizable. Whole specimen and cheliceral segment digital images were captured using a Visionary Digital BK plus system (http://www.visionarydigital.com). Multiple individual images were merged into a composite image using Helicon Focus 6.2.2 software (http://www.heliconsoft.com/heliconfocus.html). For imaging, left chelicerae and pedipalps were dissected from male specimens. Male penises that were not already protruding from the genital operculum were physically extracted using a blunt insect micro pin. Chelicerae, penis, and pedipalps were examined using scanning electron microscopy (SEM). Specimens destined for SEM imaging were mounted onto stubs, critical point dried, coated with 6 nm platinum, and imaged on the FEI Quanta 450 FEG environmental SEM at the San Diego State University Electron Microscope Facility.

Mitochondrial data collection and analysis

A partial fragment of the mitochondrial COI gene was amplified using PCR primers and conditions as in Hedin and Thomas [26] and Derkarabetian et al. [65] for a total of 39 T. brunneus samples from throughout its distribution. PCR products were purified using Millipore plates, Sanger sequenced in both directions at Macrogen USA (Rockville, MD), then edited and aligned manually using Geneious 10.1 (Biomatters Ltd.). Gene trees were reconstructed using RAxML v8 [66], with a GTR GAMMA model applied to separate codon partitions. RAxML was called as follows: -# 500, -n MultipleOriginal, -# 1000, -n MultipleBootstrap. The resulting COI RAxML gene tree was used as input for bPTP species delimitation analyses through the bPTP server (https://species.h-its.org/, [67]). Two replicate analyses were run for 100,000 generations, with thinning at 100, and a burnin of 0.1.

ddRADSeq data collection and analysis

Sixty-one Theromaster specimens from 50 distinct geographic locations were included in a “complete” (n = 61 samples) matrix for ddRAD analyses (Fig. 1). Preparation of the ddRAD libraries followed Burns et al. [68] and Derkarabetian et al. [24], adapted from Peterson et al. [33]. DNA was extracted from whole specimens using the Qiagen DNeasy Blood and Tissue Kit (Qiagen, Valencia, CA) following the manufacturer’s protocol. We used restriction digest enzymes EcoRI-HF and Msp1 (New England Biolabs, Ipswich, MA), and the corresponding adapters from Peterson et al. [33]. Briefly, ~ 500 ng of genomic DNA was digested for 3 h in a 50 μl reaction with 100 units each of the restriction enzymes EcoRI-HF and Msp1 (New England Biolabs, Ipswich, MA), and 1X CutSmart Buffer (New England Biolabs). Samples were purified using Agencourt AMPure XP bead cleanup (Beckman Coulter, Inc., Brea, CA). Adapters were ligated to digested DNA in a 40 μl reaction that consisted of 33 μl digested DNA, 1.05 μM MspI P2 adapter, 0.54 μM EcoRI P1, 400 units of T4 DNA-ligase, and 1X T4 DNA ligase reaction buffer (New England Biolabs). Ligation reactions were incubated at room temperature for 40 min, heat killed at 65 °C for 10 min, then cooled to room temperature at a rate of 2 °C per 90 s. Samples with different adapters were pooled by column and then purified using AMPure XP bead cleanup. Pooled samples were then size selected to a size range of 400–600 bp with a Pippin Prep automated size-selection instrument (Sage Science, Beverly, MA). Primers with Illumina indices were added to the pooled samples using PCR; 50 μl reactions consisted of 23 μl DNA template, 2 uM PCR Primer P1, 2 uM PCR primer P2 (eight types for second-tier multiplexing, one per pooled sample), and 1X Phusion High Fidelity PCR Mastermix (New England Biolabs). Cycle conditions were 98 °C for 30 s, 12 iterations of 98 °C for 10 s and 72 °C for 20 s (with a 16% ramp to slow cooling), and 72 °C for 10 min. PCR products were purified via AMPure XP bead cleanup and quantified using a Bioanalyzer (Agilent Technologies, Santa Clara, CA). A pool consisting of an equimolar quantity of each library was sequenced as 100 bp SE reads on the Illumina HiSeq2500 platform at the University of California, Riverside’s Institute for Integrative Genomics Biology—Genomics Core Facility.

ddRAD data were processed using the denovo assembly method of ipyrad v.0.5.15 [69], with the following settings adjusted from default: mindepth_majrule = 6, clust_threshold = 0.9, filter_adapters = 2, filter_min_trim_len = 35, max_Indels_locus = 4, max_shared_Hs_locus = 0.1. For the full 61-sample dataset, we ran min_samples_locus at 31 (50% complete matrix, called 61_31) and 48 (~ 75% complete matrix, called 61_48). Maximum likelihood analyses of 61_31 and 61_48 matrices were run with RAxML v8 [66] using 1000 rapid bootstrap replicates and the GTRGAMMA model. These analyses of 61-sample matrices indicated three divergent and early-diverging lineages (see “Results” section). Given the congruent recovery of three early-diverging lineages, we excluded these early-diverging samples and re-ran min_samples_locus at 45 and 51 for a reduced 51-sample dataset (called 51_45 and 51_51 respectively). These datasets only included “Tennessee Valley” (TNV), Great Smokey Mountain National Park (GSMNP), “northern Trans Asheville” (nTA), and “southern Trans Asheville” (sTA) lineages. This strategy effectively increases the number of loci retained for these derived lineages (see [70]).

Using unlinked SNPs (a single randomly sampled SNP per locus) from the 61_48 matrix, we reconstructed both a “lineage tree” (individuals as OTUs) and “species tree” using SVDquartets [71] in PAUP ∗ v4.0a152 [72] with n = 500 bootstraps. For the species tree, specimens were partitioned into groups following major clades recovered in RAxML and SVDquartets lineage tree analyses. Using the 51_45 unlinked SNPs matrix (n = 1122 unlinked SNPs), we performed k-means clustering of PCA-transformed data using the find.clusters R function (“adegenet” package) [73, 74]. Missing data were replaced by the mean frequency of the haplotype in the sample (scaleGen(data, NA.method = "mean"). The Bayesian information criterion (BIC) was used to compare clustering models with a maximum of K = 20, retaining all principal components, and replicating the analysis 10 times. We then conducted a discriminant analysis of principal components (DAPC), retaining approximately one-quarter of the principal components and all discriminant functions. Using the same unlinked SNPs matrix, STRUCTURE 2.3.4 [42] runs were conducted using an admixture model with uncorrelated allele frequencies. All other priors were left as default. For individual K values ranging from 2–12, analyses were replicated four times, each run including 200,000 generations with the first 20,000 generations removed as burnin. Data were summarized using CLUMPAK [75], with a best-fit K chosen utilizing the ∆K method of Evanno et al. [41], and the prob (K) method of Pritchard et al. [42].

Based on phylogenetic analysis of the UCE and 61-sample RADSeq datasets, plus STRUCTURE [42] and DAPC [76] analyses of the 51-sample RADSeq dataset (see “Results” section), we chose 18 samples to represent all primary Theromaster lineages. For these 18 samples we re-ran ipyrad using settings as above, at 18_12 occupancy. From this, an unlinked SNPs file was created using the Phrynomics package ([77], https://github.com/bbanbury/phrynomics), where nonbinary characters were removed and bases were translated. To further visualize genetic structure and clustering within Theromaster, we analyzed this dataset using a Variational Autoencoder (VAE) implemented with a modified version of the sp_deli script (https://github.com/sokrypton/sp_deli) derived from Derkarabetian et al. [40]. The matrix was run through the VAE five times and the analysis with the lowest average loss after removing 50% burnin was considered the optimal output.

UCE data collection and analysis

Studies have shown the utility of UCEs at shallow levels (e.g., [78, 79]), particularly in arthropod taxa (e.g., [40, 80, 81]). The UCEs targeted in the arachnid probe set [82] are exonic in origin with the “core” UCE corresponding to coding region while the “flanking” region are non-coding introns [83]. As such, in taxa with high population structure, flanking non-coding regions of UCEs are informative for population level structure and can be used in phylogenomic species delimitation analyses [40]. Representative samples for UCE sequencing were chosen based on preliminary RAD analyses, subsampling main lineages (see Fig. 2).

Sequence capture of UCEs followed the protocol of Derkarabetian et al. [64]. A subset of 15 samples representing all major ddRAD lineages were used in UCE experiments and were prepared in multiple library preparation and sequencing experiments. Protocols across these experiments were largely identical, differing mainly in sequencing platform. Genomic DNA was extracted from either multiple legs or whole bodies using the Qiagen DNeasy Blood and Tissue Kit (Qiagen, Valencia, CA). Extractions were quantified using a Qubit Fluorometer (Life Technologies, Inc.) and quality was assessed via gel electrophoresis on a 1% agarose gel. Up to 500 ng were used in DNA fragmentation procedures, either using a Bioruptor or a Covaris M220 Focused-ultrasonicator, as in Derkarabetian et al. [64]. UCE libraries were prepared using the KAPA Hyper Prep Kit (Kapa Biosystems), using up to 250 ng DNA (i.e., half reaction of manufacturer’s protocol) as starting material. Ampure XP beads were used for all cleanup steps. For samples containing < 250 ng total, all DNA was used in library preparation. Target enrichment was performed on pooled libraries using the MYbaits Arachnida 1.1K version 1 kit (Arbor Biosciences, Ann Arbor, MI) following the Target Enrichment of Illumina Libraries v. 1.5 protocol (http://ultraconserved.org/#protocols). Hybridization was conducted at either 60 or 65 °C for 24 h, with a post-hybridization amplification of 18 cycles. Following an additional cleanup, libraries were quantified using a Qubit fluorometer and equimolar mixes were prepared for sequencing either with an Illumina NextSeq (University of California, Riverside Institute for Integrative Genome Biology) with 150 bp PE reads, or an Illumina HiSeq 2500 (Brigham Young University DNA Sequencing Center) with 125 bp PE reads (see Suppl. Material 1).

Raw demultiplexed reads were processed with the Phyluce pipeline [84]. Quality control and adapter removal were conducted with the Illumiprocessor wrapper [85, 86]. Assemblies were created with Velvet [87] at default settings. Contigs were matched to probes using minimum coverage and minimum identity values of 65. UCE loci were aligned with MAFFT [88] and trimmed with Gblocks [89, 90] implemented in the Phyluce pipeline. All individual UCE loci were imported into Geneious 10.1 (Biomatters Ltd.) and manually inspected to check for obvious alignment errors and remove putatively non-homologous sequences (e.g., any sequences more divergent than outgroup taxa).

Concatenated and partitioned phylogenetic analyses were run on two datasets differing in the taxon coverage needed to include a locus in the final dataset: 50% and 70%. Maximum likelihood analyses were run with RAxML v8 [66] using 200 rapid bootstrap replicates and the GTRGAMMA model. Using the 70% concatenated UCE matrix we also reconstructed a lineage tree using SVDquartets [71] with n = 500 bootstraps. Finally, we made a 50% taxon coverage unlinked SNP dataset from alignments with a custom wrapper script using snp-sites [91] to convert alignments to vcf format, randSNPs_from_vcf.pl (https://www.biostars.org/p/313701/) to select a single random SNP from each alignment’s vcf file, vcf2phylip.py (https://github.com/edgardomortiz/vcf2phylip) to convert vcf files back to phylip, and AMAS [92] to concatenate all randomly selected SNPs into a single phylip file. The Phrynomics R package ([77], https://github.com/bbanbury/phrynomics) was used to select only biallelic SNPs and translate SNPs to integers. The VAE was run on this dataset as done with ddRAD data.

Bayes factor delimitation* analyses

We conducted BFD* [93, 94] species delimitation analyses using SNPs derived from both the ddRAD and UCE data using SNAPP [95] implemented in the BEAST 2.4.5 package [96]. Analyses were run on the 18_12 ddRAD and the 50% taxon coverage UCE datasets. For each SNP dataset we tested multiple alternative species hypotheses. Hypotheses tested were derived from other data types and analyses used in this study including morphological, COI, and phylogenetic and STRUCTURE/DAPC analyses of nuclear data. To test the BFD* approach to its fullest extent in this study system (and hence its potential to delimit populations), we also included a nested set of hypotheses up to the maximum potential number of species, where every specimen was considered a different species. All BFD* analyses were run for 100,000 generations, with 10,000 generations as pre-burnin, 48 steps, and an alpha value of 0.3. Two replicates of each analysis were run to check for convergence. A comparison of marginal likelihoods was conducted using Bayes factors [97], with values above 10 considered to be decisive support.

Supervised machine learning analyses

We analyzed UCE loci with the supervised machine learning species delimitation program CLADES [50]. CLADES is a classification model derived from a type of machine learning algorithm called a support vector machine to classify samples as either “same species” or “different species” using multi-locus data in a two-species model. CLADES computes five summary statistics from the data (both training and testing) and uses these statistics as features to create the model and classify samples: private positions, folded-SFS with k bins, pairwise difference ratio, FST, and longest shared tract (defined in [50]). A training data set, where pairwise comparisons of all samples are defined a priori as either the same or different, is used to build the model and classify a test dataset with unknown species status.

We analyzed Theromaster UCE data in CLADES using two training data sets. First, Pei et al. [50] provided a training data set called “All” (which we refer to as “general” here) based on simulated data with varying values of theta (Θ), migration rate, and divergence time under a two species model. This “general” data set is meant to be broadly applicable across taxa as simulated data encompass the broad diversity of genetic patterns across plants and animals [50]. Second, we developed a “custom” training data set derived from the well-known, robust species of Metanonychus recently revised in an integrative taxonomic context [25]. All Metanonychus species are easily diagnosed based on both somatic and genitalic morphology, with both mitochondrial and nuclear data supporting species status across a broad array of analysis types. Most importantly, Metanonychus and Theromaster share similar biological and ecological characteristics, including low dispersal ability and microhabitat preferences. Microhabitat preference and ecological similarity are largely based on our experience collecting these taxa over many years. Quantifying biological and ecological similarity, as well as dispersal ability, is difficult in poorly-known taxa with cryptic biology, like those that occupy “hidden” microhabitats (see “Discussion” section for further justification).

For both Theromaster and Metanonychus, datasets included all UCE loci shared across all samples in each data set (n = 52 for Theromaster, n = 12 for Metanonychus) (the “spp” dataset of Metanonychus in [10]). To create the “custom” Metanonychus training dataset, we ran the Metanonychus UCE loci through CLADES against the “general” training dataset. As expected, and found in Derkarabetian et al. [40], analyses favored all populations as species, and all output files reflected this. The output files contain pairwise comparisons of all specified populations, those files with pairwise comparisons between populations that belonged to the same species as delimited in Derkarabetian et al. [10] were manually modified to reflect that the samples belonged to the same species (switching + 1 to − 1). All relevant files required for the model (see CLADES documentation) were manually created from these output files. LibSVM [98] was used to create the “*.model” file from the “*.sumstat.scale” file using default parameters, with the addition of training for probability estimates (− b 1).

Following creation of this “custom” training dataset, we ran the T. brunneus dataset against it using CLADES. We excluded the Roan sample from these analyses because this specimen (i.e., species) is morphologically distinguishable from the rest of T. brunneus and is represented by only a single specimen. In this way our goal was to assess the success of this approach on a dataset consisting only of morphologically similar lineages that are putative cryptic species.