Introduction

A taxonomy-based phylogenetic tree serves as the foundation for biodiversity conservation and offers guidelines for mining and utilizing germplasm resources in horticulture and agriculture. Previous plant phylogenetic studies primarily relied on Sanger sequencing, utilizing specific region markers, such as ITS, ETS, rbcL, atpB, matK, and trnL-trnF (Shi et al. 2013; Zhao et al. 2016; Liao et al. 2023; Phang et al. 2023). However, these markers have inherent limitations, including short lengths (usually <1000 base pairs (bp)), high homogeneity (e.g., plastid loci), or gene paralogs (e.g., ITS). Individual molecular markers may introduce biases, resulting in unresolved phylogenetic relationships with limited evolutionary history (Coissac et al. 2016; Wilkinson et al. 2017).

Because of these limitations, there has been a noticeable decline in the generation of Sanger sequencing data (for individual genes/loci or molecular markers) in recent years (Supplementary Fig. S1A). However, high-throughput sequencing (HTS) technology, including whole-genome sequencing (WGS) (Ng and Kirkness 2010), restriction site-associated DNA sequencing (RAD-seq) (Shafer et al. 2016), transcriptome sequencing (RNA-seq) (Kumar et al. 2012), genome skimming (Dodsworth 2015) and hybridization target enrichment sequencing (Hyb-seq) (Weitemier et al. 2014), has played a pivotal role in advancing biological discoveries (summarized in Table 1). HTS not only enhances the convenience of genome sequencing for various plant groups (Dodsworth et al. 2019), but also overcomes challenges related to DNA degradation (Bakker et al. 2016; Robillard et al. 2020; Folk et al. 2021; Zhao et al. 2023). Additionally, HTS yields additional DNA data with genetic information, significantly increasing genome-level sequencing data (Supplementary Fig. S1B). Initially, researchers preferred using plastid genomes due to their relatively small size, conserved structure, low recombination rates, ease of assembly, high sequence homogeneity, and high stability in specific regions. Plastid genomes have been widely used for reconstructing phylogenetic relationships in various plants (Gitzendanner et al. 2018; Forrest et al. 2019; Li et al. 2019; Guo et al. 2022). Nevertheless, plastid data may lack informative sites for addressing complex evolutionary radiations (Turner et al. 2016; Huang et al. 2023). Furthermore, since only maternally inherited information is provided, plastid genomes are represented as a single nonrecombinant linkage group. Consequently, all genes within this group are considered to represent the same evolutionary history. 
Moreover, complex evolutionary processes such as hybridization, introgression, horizontal gene transfer, polyploidy, and incomplete lineage sorting can hinder the recovery of the true species history (Wolf et al. 2018; Pillon et al. 2021; Thureborn et al. 2022).

Table 1 Comparison of different high-throughput sequencing strategies

In recent years, nuclear genomes have become increasingly accessible, and various international genome-related collaborations have promoted the generation and analysis of global biological genomes. Notably, the Earth BioGenome Project aims to sequence and annotate the genomes of approximately 1.5 million eukaryotic species within a decade (Lewin et al. 2018). Similarly, projects such as OneKP (Leebens-Mack et al. 2019), GoFlag (Breinholt et al. 2021), and PAFTOL (Baker et al. 2022) have emerged. These endeavors will establish the genomic foundation to address crucial biological issues and enhance our understanding of global biodiversity and resource management.

Despite remarkable advances in sequencing technology and the generation of extensive genome-level sequencing data, these invaluable genomic resources have yet to be fully explored. To make use of them, target capture sequencing has rapidly evolved; it enriches and sequences specific genomic regions of interest using capture reagent kits (Mamanova et al. 2010). The process begins with identifying target regions (e.g., genes, exons, or ultraconserved elements) from various genomic data resources (WGS, RNA-seq, and genome skimming), with a focus on low- or single-copy genes (McKain et al. 2018). Target capture sequencing enables cost-effective and reliable generation of sequences for hundreds or even thousands of target gene loci (Cronn et al. 2012; Bragg et al. 2015; Leebens-Mack et al. 2019).

Hyb-seq, a method that integrates targeted sequencing with genome skimming, has revolutionized phylogenetic reconstruction by providing abundant molecular data and numerous low- and multicopy nuclear genes for various phylogenetic studies (Weitemier et al. 2014; Dodsworth et al. 2019). The process begins by screening conserved regions for enrichment through alignment against reference genome sequences. Subsequently, biotinylated single-stranded DNA or RNA probes (baits) with a length of 80–120 bp are synthesized based on sequence conservation. These probes are then hybridized with the genomic library of the samples. The on-target sequences are enriched and immobilized by streptavidin-coated magnetic beads that bind the biotinylated baits (and bait-bound DNA). Finally, the library fragments that have not hybridized are washed away, and the target library fragments are retrieved for HTS (Dodsworth et al. 2019) (reviewed in Fig. 1). This technique has become a standard approach in phylogenetics, outperforming Sanger sequencing, particularly in resolving plant phylogenetic relationships (Wang et al. 2021; Sun et al. 2022; Sundararaman et al. 2023).
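The probe design step described above can be illustrated with a minimal Python sketch that tiles fixed-length baits across a conserved target sequence. This is an illustration only: the function name `tile_baits`, the default bait length of 120 bp, and the 60-bp step (roughly 2x tiling) are assumptions chosen for the example and are not taken from any specific probe design pipeline.

```python
def tile_baits(target: str, bait_len: int = 120, step: int = 60) -> list[str]:
    """Tile overlapping baits of length `bait_len` across `target`.

    Hypothetical helper: `bait_len` and `step` are illustrative defaults,
    not parameters of a published probe design tool.
    """
    if bait_len <= 0 or step <= 0:
        raise ValueError("bait_len and step must be positive")
    if len(target) <= bait_len:
        # Target is no longer than a single bait: use it whole.
        return [target]
    last_start = len(target) - bait_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        # Add a final bait flush with the end so the tail is covered.
        starts.append(last_start)
    return [target[s:s + bait_len] for s in starts]


if __name__ == "__main__":
    exon = "ACGT" * 80  # toy 320-bp target, not real sequence data
    baits = tile_baits(exon)
    print(len(baits), {len(b) for b in baits})
```

A smaller step increases tiling density (and bait count), which is one of the trade-offs a probe designer would weigh against synthesis cost.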

Fig. 1

The technical workflow of Hyb-seq, modified from Johnson et al. (2019)

Compared with other HTS methods, Hyb-seq reduces workload and computational complexity, and offers flexibility in designing probes to effectively address phylogenetic relationships at various taxonomic levels (McKain et al. 2018; Ogutcen et al. 2021). Among widely used probe sets, Angiosperms353, derived from 600 angiosperm species, is highly favored. This probe set, consisting of 120-bp probes, was designed using the k-medoid clustering method and targets 353 single-copy nuclear genes for enrichment sequencing (Johnson et al. 2019). Angiosperms353 has been extensively applied to elucidate phylogenetic relationships at various levels, providing a more extensive molecular dataset for reconstructing evolutionary relationships (Baker et al. 2021; Joyce et al. 2023; Larson et al. 2023; Masters et al. 2023). In addition, the development of tools for retrieving Angiosperms353 data has further facilitated its broader application.

As Angiosperms353 gains popularity among researchers, publications using this approach have steadily increased since its release (Supplementary Fig. S1A). Nevertheless, few reviews have compared the capture efficiency of Angiosperms353 among software tools using empirical data or offered practical guidance for handling massive high-throughput datasets, which makes cost-effective use difficult. Hence, retrieving Angiosperms353 genes from diverse HTS datasets obtained from various platforms and methods remains a challenge.

In this article, we comprehensively evaluated the application potential of Angiosperms353 based on a literature review and tests with extensive empirical data. We began by presenting an overview of the research progress on the application of Angiosperms353. Subsequently, we assembled an 18-taxa genomic dataset to evaluate the performance of two commonly used tools for capturing Angiosperms353 genes. Furthermore, we included a 343-taxa dataset to comprehensively assess various potential factors impacting Angiosperms353 yields (Fig. 2). The primary goal of this review is to provide an optimal and cost-effective solution for applying Angiosperms353 in phylogenomics across multiple biological disciplines, ultimately facilitating the integration of extensive HTS resources and maximizing genomic data sharing and utilization.

Fig. 2

Workflow of the evaluation on Angiosperms353 using empirical data from this study. ANA grade, Amborellales, Nymphaeales, and Austrobaileyales orders recognized by APG IV (2016); WGS, whole-genome sequencing; RAD-seq, restriction site-associated DNA sequencing; RNA-seq, transcriptome sequencing

Utilizing Angiosperms353 and comparing it with lineage-specific probe sets

Angiosperms353, widely adopted for its high universality, offers the advantage of off-the-shelf use without requiring costly start-up investments. Moreover, it diminishes the requirement for specialized bioinformatics expertise in probe design (Johnson et al. 2019). Therefore, it is applied across all levels of angiosperms in phylogenomics, spanning the order level (Antonelli et al. 2021; Zuntini et al. 2021), family level (Hendriks et al. 2021; Yardeni et al. 2021; Haigh et al. 2023; Joyce et al. 2023; Larson et al. 2023), genus level (Frost et al. 2021; Howard et al. 2022; Simões et al. 2022), and even species and population levels (Crowl et al. 2022; Masters et al. 2023).

Utilizing Angiosperms353 at various taxonomic levels

There is substantial evidence supporting the efficacy of Angiosperms353 in elucidating relationships among families across various orders. Notable examples include its application in Cornales (Thomas et al. 2021), Myrtales (Maurin et al. 2021), Gentianales (Antonelli et al. 2021), Commelinales (Zuntini et al. 2021), Oxalidales (Pillon et al. 2021), Orchids (Eserman et al. 2021), and Sapindales (Joyce et al. 2023). Angiosperms353 demonstrated remarkable effectiveness in Gentianales, offering valuable insights into its evolution. Antonelli et al. (2021) used Angiosperms353 to study approximately 150 Gentianales species, revealing well-resolved relationships within this order. Remarkably, over 80% of the nodes in the phylogenetic tree received robust support, and the research strongly supported the monophyly of each of the five families within Gentianales. Furthermore, Angiosperms353 offered reliable evidence for resolving phylogenetic relationships among families within Commelinales. Zuntini et al. (2021) conducted comprehensive phylogenomics on 290 species in Commelinales, capturing 352 genes with Angiosperms353. This effort supported the monophyly of Commelinales and its five families, effectively clarifying relationships among and within these families.

Similarly, the data yielded by Angiosperms353 are sufficient for inferring high-resolution phylogenetic relationships at the genus level. Multiple instances highlight its effectiveness in this context, such as in Apiaceae (Clarkson et al. 2021), Araceae (Haigh et al. 2023), Cyperaceae (Larridon et al. 2020), Cactaceae (Acha and Majure 2022), Convolvulaceae (Simões et al. 2022), Rubioideae (Thureborn et al. 2022), and Primulaceae (Larson et al. 2023). Angiosperms353 played a crucial role in establishing a new classification system for Cyperaceae. In that study, 311 samples were sequenced using Angiosperms353, providing robust support for the monophyly of Cyperaceae. The phylogenomic framework of Cyperaceae received significant support for two subfamilies, 24 tribes, 10 subtribes, and most previously recognized genera (Larridon et al. 2021).

Increasing evidence suggests that the sequence data obtained through Angiosperms353 exhibit sufficient variability for reconstructing relationships at the species level. Several studies have explored the applicability of Angiosperms353 in species-level phylogenetic studies, including those involving Otoba (Frost et al. 2021), Solanum (Gagnon et al. 2022), Palaquium (Phang et al. 2023), Corydalis (Chen et al. 2023), Vaccinium section Cyanococcus (Crowl et al. 2022), and Urochloa sensu lato (Masters et al. 2023). For instance, Frost et al. (2021) analyzed the phylogenetic relationships of 20 Otoba samples, strongly supporting its monophyly. The authors revealed three clades within the genus and resolved the first phylogeny of Otoba using targeted enrichment sequencing. Moreover, Crowl et al. (2022) utilized Angiosperms353 for a phylogenetic analysis of Vaccinium section Cyanococcus. This research successfully captured 323 genes, revealing that the northern lineages of V. boreale and V. myrtilloides were sisters, V. boreale was nonmonophyletic, and V. caesariense was nested in the V. fuscatum clade.

In summary, Angiosperms353 is widely used across different taxonomic levels and plays a vital role in advancing the establishment of a complete and unified tree of life.

Comparison between Angiosperms353 and lineage-specific probe sets

While Angiosperms353 is an attractive choice for phylogenetic studies, it may not recover enough informative sites to resolve phylogenetic relationships for all taxa, particularly at lower taxonomic levels, such as species or populations. This limitation arises from its universal design intended for all angiosperms. Factors such as rapid species radiation, low sequence divergence, and gene and genome duplications can diminish the efficiency of Angiosperms353 (Eserman et al. 2021). Consequently, there is a growing inclination toward designing lineage-specific probe sets for specific taxonomic groups. These probes are designed from single-copy genes of the focal lineage, ensuring higher fidelity between the probe and the target. This approach recovers a larger proportion of orthologs, providing more phylogenetic information and sufficient variable sites to improve the phylogenetic resolution for target groups. It also maximizes the phylogenetic signal acquired from each sequenced region (Folk et al. 2015). Hence, lineage-specific probe sets prove more suitable for addressing challenges with recalcitrant nodes in a phylogeny, particularly for taxa at shallow levels (Gomez et al. 2019; Eserman et al. 2021; Hendriks et al. 2021; Mandel 2021; McDonnell et al. 2021; Yardeni et al. 2021; Acha and Majure 2022). Compared to Angiosperms353, lineage-specific probe sets may be more effective in elucidating phylogenetic relationships within rapidly radiating clades (Lamesch et al. 2012; Romeiro-Brito et al. 2022).

However, the prevalence of genome duplications in angiosperms poses challenges in distinguishing paralogs from orthologs when designing lineage-specific probe sets, leading to potential false-positive relationships in phylogenetic inference (Cheng et al. 2017; Romeiro-Brito et al. 2022). Additionally, lineage-specific probe sets typically exhibit minimal gene overlap and are frequently incompatible with data generated by other probe sets or sequencing strategies. This limitation significantly restricts opportunities to share and reuse data across diverse studies and projects. In contrast, Angiosperms353 offers the advantage of eliminating the need to develop specific probes, saving both time and cost. The data generated through Angiosperms353 can be integrated with genomic sequence data from other taxa produced by unrelated projects, enabling a larger scale of phylogenomics (Hendriks et al. 2021; Yardeni et al. 2021; Simões et al. 2022). For example, Yardeni et al. (2021) compared the relative efficacy of the lineage-specific probe Bromelia1776 with Angiosperms353 in terms of gene capture success, considering bait design, data processing, and other factors. Although lineage-specific probe sets exhibited a higher target capture rate and phylogenetic resolution, their development proved time-consuming and required extensive bioinformatics expertise.

Compared with lineage-specific probe sets, Angiosperms353 provides a cost-effective and time-efficient alternative that is particularly useful for groups lacking genomic resources, albeit potentially providing fewer informative sites. Several studies have reported that Angiosperms353 is as effective as lineage-specific probe sets and yields highly consistent phylogenetic relationships (Chau et al. 2018; Larridon et al. 2021; Siniscalchi et al. 2021; Ufimov et al. 2021; Yardeni et al. 2021; Simões et al. 2022; Thureborn et al. 2022). Thureborn et al. (2022) argued that, while lineage-specific probe sets typically enhance capture efficiency in target regions due to their specificity, the proportion of phylogenetically informative sites from Angiosperms353 slightly exceeded that of lineage-specific probe sets. Angiosperms353 and lineage-specific probe sets produced similar results at the genus and subgenus levels. Eserman et al. (2021) compared the phylogenetic outcomes of Angiosperms353 and the Orchidaceae-specific probe set Orchidaceae963 in three major orchid subfamilies, including Epidendroideae. Although Orchidaceae963 provided strong support for the reconstructed relationships within Orchidaceae, the topology of the trees it generated was generally consistent with that recovered by Angiosperms353. Therefore, Angiosperms353 holds substantial potential for conducting phylogenomic research at all taxonomic levels.

Comprehensive evaluation of factors impacting Angiosperms353 application

Given the substantial potential of Angiosperms353, we comprehensively evaluated the factors that may affect the capture of Angiosperms353 genes, namely software tools, sequencing strategies, sequencing depths, and representative angiosperm groups, following the workflow illustrated in Fig. 2. This assessment was conducted in two steps. In the first step, an 18-taxa genomic dataset was assembled to assess the performance of two commonly used tools, HybPiper and Easy353. This step aimed to investigate how different software programs affect the number of Angiosperms353 genes captured. In the second step, a more extensive 343-taxa genomic dataset was assembled for in-depth testing using Easy353. This step aimed to assess how different sequencing strategies, sequencing depths, and angiosperm groups might influence the number of Angiosperms353 genes captured.

Assessing the performance of tools designed for capturing Angiosperms353 genes is crucial to ensure the effectiveness and practical application of Angiosperms353 data. Notable tools for Angiosperms353 capture include MarkerMiner (Chamala et al. 2015), HybPiper (Johnson et al. 2016), Phyluce (Faircloth 2016), HybPhaser (Nauheimer et al. 2021), Easy353 (Zhang et al. 2022), PhyloHerb (Cai et al. 2022), and GeneMiner (Xie et al. 2023). Phyluce is a software package originally developed for analyzing data from ultraconserved elements in organismal genomes. PhyloHerb extracts low-copy nuclear genes (e.g., Angiosperms353) from genome skimming data using reference sequences and raw reads. HybPiper was designed for targeted sequence capture data, in which DNA sequencing libraries are enriched for gene regions of interest; it recovers exons and flanking introns from Hyb-seq data. The assembled sequence data can be utilized for phylogenetic analysis at different levels. A companion pipeline, putative paralog detection (PPD), is an extension to the HybPiper pipeline that identifies putative paralogs based on sequence similarity and the presence of heterozygous sites at each locus (Zhou et al. 2022). Furthermore, Easy353 efficiently retrieves Angiosperms353 genes, and an enhanced version called GeneMiner has recently been developed (Xie et al. 2023). GeneMiner is flexible with input formats, accepting options such as the GenBank and FASTA file formats. Although these tools possess distinct features, researchers commonly aim to use Angiosperms353 genes retrieved by these tools to reconstruct phylogenies. This can be achieved by concatenating all gene alignments (McVay and Carstens 2013) or by inferring a species tree from all individual gene trees under a coalescent model (Liu et al. 2015; Zhang et al. 2018).
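The concatenation approach mentioned above can be sketched in a few lines of Python. The sketch below assumes each gene alignment is already available as a mapping from taxon name to aligned sequence; the function name `concatenate_alignments`, the partition labels (`gene1`, `gene2`, ...), and the use of `-` for taxa missing from a gene are illustrative choices, not the behavior of any particular tool.

```python
def concatenate_alignments(gene_alignments):
    """Build a supermatrix from per-gene alignments.

    gene_alignments: list of dicts mapping taxon name -> aligned sequence.
    Returns (supermatrix, partitions), where partitions holds 1-based,
    inclusive column ranges for each gene. Hypothetical helper for
    illustration only.
    """
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for i, aln in enumerate(gene_alignments, start=1):
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"gene{i}: sequences are not aligned to one length")
        length = lengths.pop()
        for taxon in taxa:
            # Taxa lacking this gene are padded with gaps.
            pieces[taxon].append(aln.get(taxon, "-" * length))
        partitions.append((f"gene{i}", start, start + length - 1))
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


if __name__ == "__main__":
    # Toy alignments with made-up taxa and sequences.
    gene_a = {"taxonA": "ACGT", "taxonB": "AC-T"}
    gene_b = {"taxonA": "GG", "taxonC": "GA"}
    matrix, parts = concatenate_alignments([gene_a, gene_b])
    for taxon, seq in matrix.items():
        print(taxon, seq)
    print(parts)
```

Keeping the partition ranges alongside the supermatrix allows downstream software to apply a separate substitution model to each gene, whereas coalescent-based summary methods would instead take the individual gene trees as input.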

HybPiper and Easy353 are two widely used software programs in the community, each with distinctive features. HybPiper, a set of Python scripts, packages bioinformatic tools for extracting target sequences from high-throughput DNA sequencing reads. It specializes in retrieving on-target sequences of nuclear genes and flanking off-target regions. The method involves mapping reads to the target sequences using BWA (Li and Durbin 2009) or BLASTX, assembling reads into contigs, and extracting the target sequence. The primary output comprises nucleotide and translated amino acid sequences of each gene, assembled from sequencing reads. HybPiper further provides postprocessing scripts for retrieving sequences from multiple samples, visually summarizing statistics such as capture efficiency and coverage depth, and extracting flanking intron sequences. In contrast, Easy353 is a dedicated Angiosperms353 mining tool employing a reference-guided strategy. It implements an optimized filtering approach based on k-mers and an assembly algorithm based on a weighted de Bruijn graph (DBG) (Compeau et al. 2011). Conserved regions from reference sequences guide the assembly, ensuring high accuracy and sensitivity. The output from Easy353 can be refined using the PPD script to identify putative paralogs (Zhou et al. 2022). The tool involves three main steps: “reference database building”, “read filtering”, and “read assembly” (see “Data retrieval, processing, and analysis” below).
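The idea behind k-mer-based read filtering can be illustrated with a minimal Python sketch. It is not the Easy353 implementation: the function names, the default k of 21, and the `min_shared` threshold are assumptions made for this example, and the subsequent weighted de Bruijn graph assembly is not shown.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of k-mers in a sequence (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA sequence (A, C, G, T, N only)."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def build_reference_index(references, k: int = 21) -> set[str]:
    """Collect all k-mers from the reference gene sequences."""
    index = set()
    for ref in references:
        index |= kmers(ref, k)
    return index


def filter_reads(reads, index, k: int = 21, min_shared: int = 1):
    """Keep reads sharing at least `min_shared` k-mers with the references.

    Both orientations of each read are checked, since sequencing reads
    can originate from either strand.
    """
    kept = []
    for read in reads:
        forward = len(kmers(read, k) & index)
        reverse = len(kmers(reverse_complement(read), k) & index)
        if max(forward, reverse) >= min_shared:
            kept.append(read)
    return kept


if __name__ == "__main__":
    # Toy reference and reads (made-up sequences), with a short k for display.
    index = build_reference_index(["ATGGCGTACGTTAGC"], k=5)
    reads = ["GCGTACGTT", "AACGTACGC", "TTTTTTTTTT"]
    print(filter_reads(reads, index, k=5))
```

In this sketch, raising `min_shared` makes the filter stricter (fewer off-target reads, but more risk of discarding divergent on-target reads), which is the kind of sensitivity and specificity trade-off a reference-guided tool has to balance.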

To evaluate the capture efficiency of different software programs for Angiosperms353, we initially compiled a genomic dataset comprising 18 taxa from three sequencing strategies: genome skimming, WGS, and RNA-seq (Supplementary Table S2). Using the same set of reference sequences, we assessed the performance and efficiency of the two most popular tools, HybPiper V2.1.6 (Johnson et al. 2016) and Easy353 V1.5.0 (Zhang et al. 2022). Our evaluation considered the number of captured Angiosperms353 genes, the sequence length of the retrieved genes, and the runtime. Overall, no statistically significant difference was detected between HybPiper and Easy353 in the number of genes captured (average gene count: 218 vs. 202). However, HybPiper required a shorter runtime (13 min vs. 53 min), whereas Easy353 retrieved longer sequences on average (HybPiper: 644 bp vs. Easy353: 755 bp; Supplementary Table S2). Furthermore, the average count of captured Angiosperms353 genes varied across the three sequencing strategies. In genome skimming and WGS data, HybPiper captured more Angiosperms353 genes than Easy353 (genome skimming: 97 vs. 88; WGS: 266 vs. 220; Fig. 3a, b). However, in RNA-seq data, Easy353 captured more genes than HybPiper (299 vs. 292; Fig. 3c). Easy353 also yielded longer sequences on average in genome skimming and RNA-seq data, but not in WGS data. In genome skimming, the average sequence length retrieved by Easy353 was notably longer than that retrieved by HybPiper (450 bp vs. 385 bp), and in RNA-seq it was significantly longer (Fig. 3d, f). In contrast, the average sequence length retrieved by Easy353 in WGS data was slightly shorter than that retrieved by HybPiper (570 bp vs. 614 bp; Fig. 3e). Additionally, across all three sequencing strategies, HybPiper had a shorter average runtime than Easy353 (genome skimming: 8 vs. 36 min; WGS: 20 vs. 108 min; RNA-seq: 9 vs. 26 min; Fig. 3g, h, i), and the WGS datasets took the longest to process with both tools.

Fig. 3

Comparison of the performances of Easy353 and HybPiper across the three sequencing strategies. (a-c) Comparison of the numbers of genes identified in the Angiosperms353 gene sets. (d–f) Comparison of the average sequence length for the same captured Angiosperms353 gene. (g–i) Comparison of runtimes. (a, d, g) panels for Genome skimming; (b, e, h) for WGS, and (c, f, i) for RNA-seq. Abbreviations for species used in this test are as follows: Gd, Guaduella densiflora; Ph, Prunus hypoxantha; Pj, Prunus jenkinsii; Pl, Prunus laurocerasus; Pi, Prunus incisoserrata; Oe, Osmanthus enervius; Dh, Dendrobium heterocarpum; Sb, Sorghum bicolor; Nn, Nelumbo nucifera; Vr, Vitis romanetii; Vh, Vitis hancockii; Bm, Begonia microsperma; Pe, Phalaenopsis equestris; Ca, Catabrosa aquatica; Cy, Camellia yunnanensis; Ce, Camellia euphlebia; Cd, Camellia danzaiensis

Overall, HybPiper and Easy353 exhibited distinct advantages across the three sequencing strategies. HybPiper captured more Angiosperms353 genes than Easy353 in the genome skimming and WGS datasets while requiring a shorter runtime in all three. However, Easy353 yielded longer sequences than HybPiper in the genome skimming and RNA-seq datasets. Regarding accessibility, HybPiper is exclusively accessible through a command-line interface. Moreover, identifying the final set of Angiosperms353 genes in the later stages of the HybPiper pipeline can be challenging because they are buried among numerous intermediate sequence files with limited visualization. In contrast, Easy353 provides well-structured and user-friendly result indexing and display, accessible through both graphical and command-line interfaces. Additionally, Easy353 is compatible with multiple operating systems and supports building reference databases.

In addition to assessing the impact of different tools on Angiosperms353 yields, we further assembled a 343-taxa genomic dataset. Through comprehensive testing with Easy353, we examined the capture efficiency of Angiosperms353 across various sequencing strategies, depths, and representative angiosperm groups (Figs. 4, 5 and 6). Based on these tests, this review also provides practical guidelines for the optimal and cost-effective utilization of Angiosperms353 in relevant research fields. To investigate the influence of various sequencing strategies on Angiosperms353 gene yields, we expanded the initial three sequencing strategies to four: RAD-seq, genome skimming, WGS, and RNA-seq. This expanded dataset comprised 89 RAD-seq, 78 genome skimming, 89 WGS, and 87 RNA-seq datasets (Supplementary Table S3). The results indicate that RNA-seq is the most effective strategy for identifying Angiosperms353 genes (Figs. 4a and 5a), exhibiting the highest capture rate (>300 genes in most cases) compared to WGS (average of 144 genes) and genome skimming (average of 57 genes) (Figs. 4a and 5a). The lowest yield was observed in the RAD-seq datasets, with only nine genes on average per dataset (Figs. 4a and 5a). The number of Angiosperms353 genes captured from the RNA-seq datasets was approximately 35 times greater than that from the RAD-seq datasets.

Fig. 4

Capture efficiency of Angiosperms353 genes under different sequencing strategies and sequencing depths. (a) Four sequencing strategies (RAD-Seq, Genome skimming, WGS, and RNA-Seq). (b) Three sequencing depths (0–10×, 10–30×, and > 30×). The black dot in (b) indicates poor Angiosperms353 capture yield in the RAD-seq data (with the majority of data concentrated around this point). Note: Sequencing depth, defined as the ratio of the total number of sequenced base pairs (bp) to the genome size, serves as a metric to evaluate sequencing quantity. WGS, whole-genome sequencing; RAD-seq, restriction site-associated DNA sequencing; RNA-seq, transcriptome sequencing

Fig. 5

Average counts of Angiosperms353 genes identified by Easy353 across various datasets. (a) Counts of Angiosperms353 genes from datasets employing four sequencing strategies (RAD-seq, genome skimming, WGS, and RNA-seq). (b) Counts of Angiosperms353 genes from datasets with three sequencing depths (0–10×, 10–30×, and >30×). (c) Counts of Angiosperms353 genes in representative angiosperm groups: ANA grade (represented by Magnoliaceae), monocot (Orchidaceae and Poaceae), eudicot (Nelumbonaceae), superrosid (Vitaceae and Rosaceae), and superasterid (Oleaceae and Theaceae) families based on APG IV (2016). Average counts are presented as mean ± SE; the letters “a” to “d” indicate significant differences (p < 0.05) based on the Least Significant Difference (LSD) method after one-way analysis of variance (ANOVA); the same letters indicate no significant difference at the 0.05 level. ANA grade refers to the Amborellales, Nymphaeales, and Austrobaileyales orders recognized by APG IV (2016); WGS, whole-genome sequencing; RAD-seq, restriction site-associated DNA sequencing; RNA-seq, transcriptome sequencing

Fig. 6

The overall yield of Angiosperms353 showed no bias across the angiosperm groups. Each group is represented by an iconic image of a well-known species in the family. Angiosperms353 average counts are presented as the mean ± SE; letters denote groupings based on the Least Significant Difference (LSD) method after one-way analysis of variance (ANOVA), and the same letter (“a”) indicates no significant difference at the 0.05 level. ANA grade refers to the Amborellales, Nymphaeales, and Austrobaileyales orders recognized by APG IV (2016)

To investigate the impact of sequencing depth on Angiosperms353 gene capture, the datasets were categorized into three sequencing depth ranges: 0–10× (285-taxa dataset), 10–30× (38-taxa dataset), and >30× (20-taxa dataset) (Supplementary Table S3). Among these ranges, >30× was the most effective, yielding approximately 300 genes in most cases (Figs. 4b and 5b). The number of captured Angiosperms353 genes was significantly greater (p < 0.05) at sequencing depths >30× than at 0–10× or 10–30× (Fig. 5b). At 0–10×, an average of 109 Angiosperms353 genes were captured per dataset, increasing to 222 at 10–30× and reaching 312 at >30× (Fig. 5b). Therefore, the capture rate increased with sequencing depth.
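As a small worked example of the depth metric used here (total sequenced bases divided by genome size), the following Python sketch computes a depth and assigns it to one of the three ranges. The function names are hypothetical, and the handling of the boundary values (exactly 10× and 30× are placed in the lower range) is an assumption, since the text does not specify it.

```python
def sequencing_depth(total_bases: float, genome_size: float) -> float:
    """Depth = total sequenced base pairs / genome size (both in bp)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def depth_bin(depth: float) -> str:
    """Assign a depth to one of the three ranges used in the text.

    Assumption: boundary values (10x, 30x) fall into the lower range.
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    if depth <= 10:
        return "0-10x"
    if depth <= 30:
        return "10-30x"
    return ">30x"


if __name__ == "__main__":
    # Toy numbers: 6 Gb of reads for a 400-Mb genome.
    depth = sequencing_depth(6_000_000_000, 400_000_000)
    print(depth, depth_bin(depth))
```

Because the metric depends on genome size, the same amount of sequence data can fall into different depth ranges for different taxa, which is worth keeping in mind when planning sequencing effort.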

In the RNA-seq datasets, the number of genes captured by Easy353 remained relatively constant despite increased sequencing depth (Fig. 4b). This could be due to the stringent tissue sampling requirements (i.e., young, fresh tissues preserved in liquid nitrogen or dry ice, followed by storage at −80 °C), which enhance library construction and sequencing efficiency. Although RNA-seq readily yields numerous protein-coding genes, gene expression variation among tissues and the inclusion of non-phylogenetically informative transcripts may limit its effectiveness (Johnson et al. 2019). With increasing sequencing depth, WGS and genome skimming may yield additional genetically informative DNA, particularly for herbarium specimens, resulting in a gradual increase in the number of captured Angiosperms353 genes. WGS requires reference genome data from closely related species. By mapping the obtained reads to the reference genome for sequence assembly, it can facilitate population evolution analysis and functional gene discovery. Although its sequencing cost is relatively low, the absence of a closely related reference genome can pose obstacles for research on non-model species (Hollingsworth et al. 2016; Supple and Shapiro 2018). Genome skimming is a cost-effective method that efficiently reveals repetitive elements, such as satellite DNA and transposable elements. However, these repeats possess minimal variation, and the utility of this technology in capturing orthologous regions of nuclear genes for sequence alignment is limited due to a lack of coding region information (Dodsworth 2015). RAD-seq serves as an alternative to WGS that is independent of a reference genome. This method applies HTS to DNA flanking restriction endonuclease recognition sites to identify high-density single nucleotide polymorphism (SNP) sites, reducing genome complexity as well as library construction and sequencing costs (Shafer et al. 2016). However, RAD-seq markers are lineage-specific, introducing bias when screening homologous sequences across distantly related lineages. This may result in datasets falling short of comprehensive representation (Andrews et al. 2016; Heckenhauer et al. 2018). Moreover, using short and inconsistently represented loci in phylogenetic sampling may lead to reduced phylogenetic signals and challenges in assessing phylogenetic relationships (Jones and Good 2015; McKain et al. 2018).

Furthermore, to explore potential bias in capturing Angiosperms353 genes across different angiosperm groups, datasets from diverse angiosperm clades were selected, including ANA grade (represented by Magnoliaceae, 41-taxa dataset), monocots (Orchidaceae and Poaceae, 48-taxa and 47-taxa datasets), eudicots (Nelumbonaceae, 31-taxa dataset), superrosids (Vitaceae and Rosaceae, 40-taxa and 53-taxa datasets), and superasterids (Oleaceae and Theaceae, 42-taxa and 41-taxa datasets) (Supplementary Table S3). Notably, Rosaceae exhibited the highest average capture of Angiosperms353 genes (211 genes), differing significantly from the remaining groups (p < 0.05; Fig. 5c). Orchidaceae yielded 140 Angiosperms353 genes and Magnoliaceae the fewest (91 genes), but these two groups did not differ significantly (p > 0.05; Fig. 5c).

To mitigate the potential influence of sequencing depth and strategy, a 155-taxa dataset was generated from the original 343-taxa dataset, maintaining similar sequencing depths (except for Nelumbonaceae, which used a 15-taxa dataset due to the unavailability of genome skimming data; Supplementary Table S4). The results obtained from the 155-taxa dataset (Fig. 6) differed from those obtained from the original 343-taxa dataset (Fig. 5c). Orchidaceae exhibited the highest yield (166 genes captured), whereas Magnoliaceae had the lowest yield (92 genes). The remaining groups (Theaceae, Oleaceae, Rosaceae, Vitaceae, Nelumbonaceae, and Poaceae) each yielded over 100 genes but fewer than Orchidaceae, whose higher yield may be attributable to the RAD-seq data alone.

Therefore, to further investigate potential bias among representative groups in the RAD-seq data, another 30-taxa dataset was added, encompassing five angiosperm groups (Lauraceae, Cyperaceae, Betulaceae, Rutaceae, and Asteraceae; Supplementary Table S5). Despite non-significant differences, the gene yields were lower than Orchidaceae yields in all cases, indicating no lineage-based capture bias in RAD-seq data, except for Orchidaceae (Supplementary Tables S4 and S5). Notably, the greater number of genes captured for Orchidaceae in the RAD-seq dataset did not translate to a similar pattern in Poaceae (another monocot representative) or any other sampled group (Supplementary Table S5).

Besides Orchidaceae, the most genes were captured from the superrosid clade, whereas the lowest yields were found in the ANA grade. Overall, significant differences in the number of genes captured among different angiosperm groups were not observed (Fig. 6). Therefore, there was no discernible bias among the different angiosperm groups across all the sequencing strategies.

Discussion and conclusions

This review summarizes the extensive literature on plant phylogenomics, leveraging high-throughput sequencing technologies and diverse empirical genomic datasets. It emphasizes the utility of Angiosperms353 by integrating various genomic resources. Additionally, the RNA-seq strategy demonstrates the highest Angiosperms353 yields among the four major sequencing strategies. Higher sequencing depth correlates with increased gene capture, and no obvious yield bias was observed across different angiosperm groups.

In the age of phylogenomics, the universal probe set Angiosperms353 has standardized the use of genomic data for inferring angiosperm phylogenetic relationships, facilitating phylogenomic analyses at any scale. It enables the integration of different genomic datasets and taxonomic groups (Johnson et al. 2019; Baker et al. 2021). Moreover, its efficacy with degraded DNA, often found in ‘genomic treasure troves’, extends its applicability to sequencing ancient specimens from herbariums and museums, revitalizing this field (Brewer et al. 2019; Slimp et al. 2021). The enriched Angiosperms353-related phylogenomics (Forrest et al. 2019; Clarkson et al. 2021; Frost et al. 2021; Pillon et al. 2021) opens avenues for a comprehensive exploration of the evolutionary history of plant species, ranging from historical and archaeological records to freshly collected specimens in the field. This approach is capable of providing evolutionary insights into how species have evolved and adapted over time.

Although Angiosperms353 was designed to target single-copy genes, challenges persist due to whole-genome duplication events and angiosperm polyploidy, which cause inconsistencies in gene trees and introduce phylogenetic noise (McKain et al. 2018; Gomez et al. 2019). Some taxa exhibit low capture efficiency, yielding insufficient resolution for specific lineages. Observations from the test datasets suggest that RNA-seq is the most suitable sequencing strategy, capturing the highest number of Angiosperms353 genes with less impact from sequencing depth or lineage constraints. When sampling conditions are constrained and RNA-seq is not possible, WGS and genome skimming offer viable alternatives, with sequencing depth significantly influencing gene capture efficiency in genome skimming. Across these sequencing strategies, increased sequencing depth generally leads to higher Angiosperms353 yields.

This review also underscores the potential of integrating single-copy gene capture tools. For example, HybPiper, when run after the Easy353 reference data have been established, yields a number of Angiosperms353 genes comparable to that of Easy353 but with a shorter runtime. This combination of HybPiper and Easy353 enhances gene capture, contributing to the development of accessible tools for data storage and distribution and fostering further advancements. The combination of HybPiper and PPD is also valuable because it removes the paralogous genes identified by both pipelines, allowing the construction of a more robust orthologous gene dataset for phylogenomic and divergence time analyses. Zhou et al. (2022) captured Angiosperms353 data from Castanea (Fagaceae) and Hamamelis (Hamamelidaceae), showing that PPD identified more putative paralogs than HybPiper alone (e.g., 31 genes via PPD vs. four genes with HybPiper), which resulted in a more robust phylogeny.

Researchers employing target enrichment methods often face a dilemma between adopting universal or lineage-specific probe sets. To address this dilemma, it is crucial to explore simultaneous enrichment with multiple probe sets, such as a universal set combined with a lineage-specific set, or several lineage-specific sets. This approach leverages the increasing availability of probes and decreasing sequencing costs. In evolutionary biology, Angiosperms353 maximizes the utility of genomics across different scales. It introduces new possibilities for resolving phylogenetic relationships at various taxonomic scales (Hendriks et al. 2021, 2023; Pillon et al. 2021; Yardeni et al. 2021; Thureborn et al. 2022; Phang et al. 2023). For example, Hendriks et al. (2023) effectively combined Angiosperms353 with the Brassicaceae-specific probe set (Nikolov1827) in a single hybridization reaction, achieving comprehensive outcomes with minimal additional cost and effort. Similar studies have demonstrated the effectiveness of combining Angiosperms353 with lineage-specific probe sets to enhance phylogenetic resolution, shedding light on species delimitation (Phang et al. 2023). Lineage-specific probe sets have a higher capture success rate in their target groups, which offers advantages for detailed phylogenetic and population genetic analyses, such as exploring gene tree consistency, nucleotide diversity, or population structure; however, they are susceptible to indistinguishable paralogs (Dornburg et al. 2019; Yardeni et al. 2021). Conversely, Angiosperms353 excels in integrating genomic resources from various HTS platforms, laying the foundation for large-scale phylogenomic research, particularly when constructing a tree of life from complete genomes is impractical or impossible.

In summary, the future of phylogenomics using Angiosperms353 appears promising, fueled by ongoing improvements in probe design and broader taxonomic coverage. Going forward, more efficient bioinformatic tools, collaboration, data sharing, and technological advancements will play pivotal roles in advancing our understanding of evolutionary relationships across the entirety of life using the available data.

Data retrieval, processing, and analysis

The NCBI website (https://www.ncbi.nlm.nih.gov/) was used to query and download nucleotide sequence data. Raw reads from 375 accessions, encompassing RAD-seq, genome skimming, WGS, and RNA-seq projects, were downloaded from the NCBI SRA (https://www.ncbi.nlm.nih.gov/sra/) and are described in Supplementary Table S1.

Raw sequencing data from the SRA database were downloaded using the NCBI SRA Toolkit (Edwards 2022), with the “fastq-dump” command converting SRA format to fastq format. Trimmomatic V0.38 (Bolger et al. 2014) was then applied to trim adaptor sequences and low-quality bases from the ends of reads, using Phred33 quality encoding and a minimum read length of 36 bp. Bases falling below the quality threshold were removed. Trimmed reads were used for locus assembly via HybPiper V2.1.6 (Johnson et al. 2016) and Easy353 V1.5.0 (Zhang et al. 2022).
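The logic of end trimming followed by a minimum-length filter can be sketched in a few lines of Python. This is a simplified illustration and not Trimmomatic's implementation: adaptor removal is omitted, the function names are hypothetical, and the quality threshold of 20 is an assumed value, since the threshold used here is not stated above. Only the Phred33 encoding and the 36 bp minimum length come from the text.

```python
def phred33_scores(quality):
    """Decode a Phred+33 quality string into integer quality scores."""
    return [ord(ch) - 33 for ch in quality]

def trim_ends(seq, quality, min_quality=20, min_length=36):
    """Remove low-quality bases from both read ends, then apply a length filter.

    min_quality is an assumed threshold; min_length follows the 36 bp in the text.
    Returns (sequence, quality) for a retained read, or None if it is discarded.
    """
    scores = phred33_scores(quality)
    start, end = 0, len(scores)
    # Trim from the 5' end while bases are below the threshold.
    while start < end and scores[start] < min_quality:
        start += 1
    # Trim from the 3' end while bases are below the threshold.
    while end > start and scores[end - 1] < min_quality:
        end -= 1
    if end - start < min_length:
        return None
    return seq[start:end], quality[start:end]

# Hypothetical 40 bp read: two low-quality bases ('#', Q2) then 38 good bases ('I', Q40).
kept = trim_ends("A" * 40, "#" * 2 + "I" * 38)
# Hypothetical read that falls below 36 bp after trimming.
dropped = trim_ends("A" * 40, "#" * 5 + "I" * 35)
```

In this sketch the first read is retained at 38 bp, whereas the second is discarded because only 35 bp remain after trimming.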

To enable efficient locus recovery, a variable reference sequence file was constructed from sequences of multiple species in the Easy353 reference database. Trimmed reads were mapped to the HybPiper reference sequences using BWA (Li and Durbin 2009) and assembled into contigs using SPAdes (Bankevich et al. 2012). Easy353, with its three primary modules for recovering Angiosperms353 genes from sequencing data (“reference database building”, “read filtering”, and “read assembly”), used k-mers and a hash table (Schbath et al. 2012) for read filtering and a de Bruijn graph (DBG) (Compeau et al. 2011) for de novo assembly.
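The general idea of filtering reads with k-mers and a hash table can be sketched as follows. This is a conceptual illustration only, not the Easy353 code: the function names, the k-mer size, and the minimum number of shared k-mers are hypothetical, the example sequences are invented, and reverse-complement matching is omitted for brevity.

```python
def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def build_kmer_index(references, k):
    """Build a hash table mapping each reference k-mer to the genes containing it.

    references: dict of gene name -> list of reference sequences.
    """
    index = {}
    for gene, sequences in references.items():
        for sequence in sequences:
            for kmer in kmers(sequence.upper(), k):
                index.setdefault(kmer, set()).add(gene)
    return index

def assign_read(read, index, k, min_hits=2):
    """Return the gene sharing the most k-mers with the read, or None.

    min_hits is an assumed cutoff; reads below it are filtered out.
    """
    hits = {}
    for kmer in kmers(read.upper(), k):
        for gene in index.get(kmer, ()):
            hits[gene] = hits.get(gene, 0) + 1
    if not hits:
        return None
    best = max(hits, key=hits.get)
    return best if hits[best] >= min_hits else None

# Invented toy references and reads, for illustration only.
toy_refs = {"geneA": ["ACGTACGTGGCC"], "geneB": ["TTTTGGGGCCCCAAAA"]}
toy_index = build_kmer_index(toy_refs, k=5)
matched = assign_read("GTACGTGG", toy_index, k=5)
unmatched = assign_read("AAAAAAAA", toy_index, k=5)
```

Reads assigned to a gene in this way would then be passed to the assembly step, while unassigned reads are set aside.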

All statistical analyses and plots were conducted in R (R Core Team 2023) using the packages “agricolae” (Mendiburu 2023), “ggplot2” (Wickham 2016), “tidyverse” (Wickham et al. 2019), and “rentrez” (Winter 2017).

All necessary data to evaluate the conclusions in this manuscript are presented in the main text and supplementary materials (Fig. S1, Tables S1–S5). Raw reads were obtained from NCBI databases (https://www.ncbi.nlm.nih.gov/), and the complete list of NCBI accession numbers is available in Tables S1–S5.