Background

The whitefly Aleyrodes tabaci was first described by Gennadius in 1889 and due to the lack of taxonomically robust morphological differences [1,2,3] with other closely related species, its systematics has undergone numerous revisions, such as the synonymization of different biological species under the binomial Bemisia tabaci. Recent research, however, has provided unequivocal evidence that “B. tabaci” has been used to refer to more than 40 biological species [4,5,6,7,8,9,10]. ​​ In recognition of the existence of this cryptic species group, we refer to them collectively here as Bemisia tabaci sensu lato (Table 1). More than half of these species are damaging agricultural pests and plant-virus vectors, thus conferring upon B. tabaci s.l. the status of one of the world's top-100 most invasive species [11]. B. tabaci s.l. cause direct damage to crops through plant phloem-sap feeding, inducing phytotoxic disorders, excreting honeydew that develops sooty-molds, and transmitting almost 500 different plant virus species [12, 13].

Table 1 Bemisia tabaci s.l. current and historical names

In East and Central Africa since the mid-1990s, the abundances of several B. tabaci s.l. cassava-colonizing populations have increased dramatically [14,15,16,17], driving epidemics of African cassava mosaic disease (CMD) and cassava brown streak disease (CBSD) [15, 18,19,20,21]. These diseases affect over 200 million sub-Saharan Africans who rely on cassava as their primary food source [22], causing production losses of up to 47% in nine East and Central African countries. The affected regions are expanding, resulting in hunger, recurrent famines, social instability, and annual losses of over US$1 billion [20, 23,24,25].

Several cassava-colonizing, phylogenetic species within B. tabaci s.l. have been proposed, based on their geographical separation and differences in their partial mitochondrial cytochrome oxidase 1 (mtCO1) sequences [7, 9, 10, 26] and were named using the mtCO1-marker naming framework. These include Sub-Saharan Africa 1 (SSA1 with five sub-groups, SSA1-SG1, SG2, SG3, SG4, SG5), SSA4, and SSA10 [7, 27]. SNP analyses identified six major genetic groups among African cassava whitefly populations, some of which differed from the partial-mtCO1 delineated putative species [28, 29]. In recent phylogenetic analyses combining a well-curated mtCO1 database with genome-wide SNPs, SSA4 nested within SSA2, while SSA1 comprised well-defined putative subspecies. Allopatric incipient speciation and a "hybrid zone" separating the two SSA1 groups were evident [30]. Biological differences were also found between the subgroups of B. tabaci SSA1, with mating compatibility between SSA1-SG1 and SSA1-SG2, but clear incompatibilities between SSA1-SG1 and SSA1-SG3, and between SSA1-SG2 and SSA1-SG3 [31]. In addition, mating and phylogenomics results based on SNPs were in agreement, but inconsistent with the full mitogenome analysis [28, 30]. As such, although the partial mtCO1 gene remains a highly informative marker, it needs to be considered in addition to other evidence when assigning biological species status to B. tabaci s.l. populations [6, 8].

In 2016, the first draft genome of Bemisia argentifolii, a B. tabaci species also known as MEAM1 (Table 1), was published [32]. This genome consisted of 615 Mb sequence length, with a final genome N50 length of 3.2 Mb, and was annotated with 15,664 protein-coding genes (PCGs). Two years later, the genome of another B. tabaci species (MED/Q) was published [33], with a similar genome length of 658 Mb housed within 4,975 scaffolds, but with a scaffold N50 of 437 kb. When compared, B. tabaci MED/Q had a higher count of annotated genes (n = 20,786) than B. argentifolii. The annotation protocols used for both species were similar, utilizing genomic alignment evidence such as RNA-seq data and ab-initio prediction methods such as AUGUSTUS and GENSCAN. Such prediction methods can incorporate gene models without the requirement of aligned transcriptomic evidence. Furthermore, B. tabaci s.s. and B. argentifolii genomes had c. 80% of all gene models assigned with a functional annotation [32, 33], but the large discrepancy in gene count suggested a high proportion of duplicated and/or fragmented gene models.

Since the mid-1990s in East and Central Africa, the abundances of several B. tabaci s.l. cassava-colonizing species have increased dramatically [14,15,16,17]. The phenomenon of “super-abundant” (outbreaking) cassava B. tabaci SSA1 populations remains a key factor driving epidemics of African cassava mosaic disease (CMD) and cassava brown streak disease (CBSD) [15, 18,19,20,21]. Although the B. tabaci s.s. and B. argentifolii genomes are available, they have limited use for studying the genetics of the African cassava B. tabaci SSA1 species due to the high genetic divergence between them. More recently, the draft genome of the cassava whitefly B. tabaci Sub-Saharan Africa—East and Central Africa (‘SSA-ECA’) was assembled from short read Illumina data derived from > 10,000 field-collected cassava whiteflies [29]. The ‘SSA-ECA’ genome, however, has a high degree of fragmentation and was annotated using ab-initio prediction, RNA-Seq aligned evidence and input homology evidence derived from the original B. argentifolii proteome, which may have led to propagating spurious or low-supported gene models. A comparison of published B. tabaci s.l. genomes to ‘SSA-ECA’, shows it has the shortest overall genome length at 513.7 Mb and only slightly improved N50 of 498 kb to that of B. tabaci s.s.

The challenge of defining a biological species within a cryptic species group such as B. tabaci s.l. arises from the difficulty of categorizing it within the continuous process of evolution [34, 35], compounded by the absence of reliable morphological diagnostic characters. Molecular-marker methods have partially addressed this issue, with phylogenetic-species delimitation providing a useful framework for identifying putative biological species and their boundaries [4]. Several molecular-based species delimitation approaches have been proposed previously [9, 20,21,22,23], but the most accurate method of identifying biological species has involved multiple lines of evidence such as gene flow, ecological data and molecular markers [6, 8, 10, 36]. To improve our understanding of biological species within B. tabaci s.l., we present genomic and associated experimental data sets and propose appropriate biological species names, with the intent of reducing the confusion associated with the mtCO1-phylogenetic naming system. For example, we use the biological species names of Bemisia argentifolii and Bemisia tabaci sensu stricto for B. tabaci MEAM1 and B. tabaci MED, respectively (Table 1).

Due to the lack of diverse, high-quality B. tabaci s.l. genomic resources and the substantial differences between currently available genomes, there was a clear need for additional high quality data, particularly for cassava-utilizing species within B. tabaci s.l. The genomes described herein, provide the basis for further testing of new hypotheses, aided by a universally applied annotation methodology employed for this study. Here, we report on the evolutionary adaptations underlying this group of cryptic species, as well as highlighting some genetic differences with an added emphasis on resolving hitherto unclear species delineations between genetically similar and morphologically cryptic African and non-African B. tabaci s.l. species. These new genomic resources shall enable additional novel biological insights and progress the development of whitefly and virus-control strategies.

Results and discussion

Genome sequences, annotation and orthology

Genome assemblies

Due to their tiny size (~ 1 mm) and considerable genome heterozygosity, we generated full-sib, inbred populations (F5-F7), for each of the B. tabaci s.l. populations sequenced in this study, which reduced genome heterozygosity and improved assembly graph traversal. Whitefly assemblies presented as part of this study are shown in Fig. 1; with comparisons to representative insect orders and B. tabaci s.l. assemblies shown in Table 2. Draft genome lengths varied (± 41 Mb), with the largest 657.7 Mb (Fig. 1a) and smallest 616.1 Mb (Fig. 1e) genomes for SSA1-SG1-Ug and Asia II-5, respectively. Although sequencing performance and genomic coverage varied between the sequenced populations, the average genome coverage of ~ 94X was used for genome assembly (Canu v1.8) [37]. Overall, assembly contiguity was improved compared to currently available whitefly genomes [29, 32, 33].

Fig. 1
figure 1

Comparison of six de novo Bemisia tabaci s.l. genome assemblies. Genome assemblies depicted as circular plots, where the complete plot represents the full genome length. Plots highlight the longest scaffold; scaffold N50/N90; assembly GC and gap coverage (%). Genome assembly completeness (BUSCO v3.0) shown with Insecta (OrthoDB v9; n = 1,658) orthology set. Historical and current B. tabaci s.l. population names summarized in Table 1. Assembly plots generated with assembly-stats [38]

Table 2 Genome assembly statistics and comparison

Assembly performance for the six B. tabaci s.l. genomes was non-uniform, with longer contig N50 values recovered for five of the six genomes (excluding Uganda-1). The mean scaffold count across all six genomes was 2,409. Asia II-5 exhibited the highest scaffold N50 (10.84 Mb) and the lowest L50 value (14) obtained in any B. tabaci s.l. genome published at the time of writing. Genome level contiguity improvements were achieved, with unbroken scaffold lengths considerably longer (22.8 Mb to 53.3 Mb) than that of B. argentifolii (11.1 Mb) and B. tabaci s.s. (2.85 Mb) (Table 2). Genomic GC content ranged from 38.3% (Uganda-1) to 39.6% (Asia II-5), which was broadly similar to B. argentifolii (38.7%) and B. tabaci s.s. (38.2%). Notably, the genomic GC content for all cassava-utilizing populations was > 39.4%. Genomic regions with low GC content are associated with heterochromatic DNA, which are much harder to transcribe [39]. For the Insecta orthologs (n = 1,367), complete BUSCO [40] recovery (OrthoDB v9 [41]; Insecta: n = 1,367) was between 78% (Uganda-1) and 95.1% (B. tabaci SSA2-Ng); with single-copy BUSCO % ranging from 72.1% to 91.7%.

Karyotype studies have shown that whitefly genomes are likely to have 10 individual chromosomes [42]. Data generated for this study, however, were insufficient to obtain telomere to telomere length assemblies, or to anchor assemblies to chromosomes. Genome assemblies generated in this work were shorter than that reported for the closely related non-Bemisia whitefly, Trialeurodes vaporariorum (787 Mb) [43], but were largely consistent with previously published B. tabaci s.s. and B. argentifolii genomes and varied between 657.7 Mb—616.1 Mb. Despite the overall improvements gained in genomic assembly of whitefly made in this study, the use of highly accurate longer reads such as PacBio HiFi, Hi-C scaffolding and optical mapping would facilitate generation of more complete and less fragmented chromosome-scale B. tabaci s.l. genomes. All assemblies generated in this study were deposited in the European Nucleotide Archive (ENA) at the European Molecular Biology Laboratory (EMBL-EBI) under project accession numbers PRJEB28507, PRJEB35304, PRJEB39408 (https://www.ebi.ac.uk/ena); genomic PacBio data is described in Additional file 1: Table S1. See Table 2, Additional file 1: Table S3 and Additional file 2: Figs. S2-S7 for additional assembly processing and taxonomic contamination evaluation.

Annotation and reannotation of B. tabaci s.l. genomes

Annotation and assembly quality are linked [44, 45] and, as such, assembly methodology remains the main source of variability in the recovery and comparison of gene models for all B. tabaci s.l. genomes. Early efforts in the genomic annotation of the new whitefly genomes in this work revealed significantly fewer gene models, compared to that of previously published B. tabaci s.l. (~ 12 k vs 15 k). To address this discrepancy, we applied a uniform structural annotation methodology across all Bemisia genomes presented herein. Our approach aimed to maximize the likelihood that the results of comparative analyses and the conclusions drawn would be reflected and underpinned by variance in whitefly biology, as opposed to the methodologies underpinning transcript model identification.

Genomic annotations of the six new B. tabaci s.l. genomes were performed using the ‘Ensembl gene annotation system’ [46], which produced transcript models based on empirical evidence; to the exclusion of any ab-initio prediction methods. High quality standards were achieved by ensuring model accuracy that derives solely via integration of alignment of expressed protein, cDNA and other types of biological sequences, such as high-throughput RNA-seq. The Ensembl gene annotation pipeline also facilitates the identification of alternative splice patterns, allowing for multiple alternate splice variants to be captured per transcript model.

Finalized gene sets obtained for all eight B. tabaci s.l. genomes were recovered with only small variances in total PCG counts. The number of PCGs captured across the majority of genomes were considerably fewer than previous estimates, decreasing from ~ 15 k (B. argentifolii) by ~ 2.5 k genes. Our analyses of PCGs recovered an average count across B. tabaci s.l. genomes of ~ 13 k PCGs (n = 13,010; stdev:510); see Table 3. For the reannotated gene sets recovered from B. argentifolii and B. tabaci s.s., the average PCG count increased to 14,316, which was likely due to the uniquely higher gene count recovered in B. tabaci s.s. An average, alternate-spliced transcript count of 2.06 was observed across the six new genomes, which ranged from ~ 24 k (Uganda-1) to ~ 28 K (SSA1-SG1-Ug). Reannotation of B. argentifolii and B. tabaci s.s. genomes recovered a gene-to-transcript ratio shifting away from ~ 1:1 to ~ 2:1; a feature not seen in the previously published annotations. Transcript models recovered had a mean CDS length of 1.7 kb (stdev:234), while coding exon counts averaged 212 K (stdev:38,921); see Additional file 1: Table S4 for details of input RNA-seq data generated in this study. The average exon count recovered across all PCGs was 6.9 exons per coding transcript, with an average length of 188.5 bp. Average intron length (3,157 bp) was also higher in these new genomes compared to an average of 3,061 bp seen in B. tabaci s.s. and B. argentifolii. The use of a uniform structural annotation methodology resulted in high concordance amongst all eight B. tabaci s.l. gene annotations.

Table 3 Summary of Bemisia tabaci s.l. structural annotation with four comparative insect genomes

RNA-Seq evidence specifically for Uganda-1 was lacking in this study, due to the unavailability of biological sample material for sequencing. Examination of aligned evidence in support of protein-coding transcript models, however, showed gene models had RNA-Seq read coverage ranging from 62.6% (Uganda-1) to 83.2% (SSA1-SG1-Ug). Structural annotation was achieved using all combined RNA-Seq data from other B. tabaci s.l. generated as part of this study. Given the close evolutionary relationships of these B. tabaci populations, the transcriptomic evidence readily aligned and was therefore useful for capturing Uganda-1 gene features. The caveat, however, is that the Uganda-1 annotation likely has reduced gene recovery, maximization of complete ORFs, exon–intron boundaries and species-specific features.

Structural annotation of the B. tabaci s.l. genomes revealed a marked reduction in total gene models recovered, compared to previously published gene sets of B. tabaci s.s. and B. argentifolii. Despite this, we posit that use of a uniform annotation methodology employed across contiguous genome assemblies, and without use of ab-initio methods, provides a well-supported and conservative estimate of B. tabaci s.l. gene space. Furthermore, downstream analyses should benefit from this increased robustness, ensuring conclusions drawn are underpinned by inherent biology and not the result of disparate annotation methodologies.

Visualization and comparison of gene functional annotation between all whitefly genomes (Fig. 2) revealed that ~ 66.44% of input genes recovered some form of GO term annotation (Table 4). Some genes showed a significant difference between the expected frequencies of genes with GO terms and observed GO term frequencies; with most GO terms recovered related to molecular function (86.30%), followed by biological process (60.75%) and cellular location (30.49%). The highest proportion of GO-enriched genes across B. tabaci s.l. genomes were related to catalytic activity (GO:0003824) (> 41% of all PCGs) and binding (Parent GO:0005488) (> 50% of all PCGs); see Fig. 2a, b. Approximately 30% of B. tabaci s.l. PCGs were characterized as related to functional binding of: (i) organic cyclic compounds (GO:0097159) or (ii) heterocyclic compounds (GO:1,901,363) (Fig. 2a). Aside from the above-mentioned GO-enriched gene ontologies, other significantly different GO-terms between whitefly genomes related to forms of binding; including small molecule binding (GO:0036094), ion binding (GO:0043167), drug binding (GO:0008144) and cofactor binding (GO:0048037).

Fig. 2
figure 2

Bemisia tabaci s.l. functional annotation and enriched GO terms. a Histogram comparison of gene percentages of significant GO terms, identified from eight B. tabaci s.l. genomes. Population names are color coded in the inset and B. argentifolii has been abbreviated to “B. argen”. b Significant differences in GO terms across eight B. tabaci s.l. populations. Y-axis shows log.10 transformed P-values for each GO term in figure part (a). Results generated using web service WEGO v2.0 (https://wego.genomics.cn/)

Table 4 Gene ontology of eight Bemisia tabaci s. l. populations

Transposable elements

Transposable elements (TE) are an important source of novel genomic variation and contributor not only to genome structural variation, but they can also influence changes to gene regulation [47,48,49]. TE content analysis of the six new B. tabaci s.l. genomes showed both shared species-specific differences, including repeat copy number counts and total genomic coverage (i.e., total genomic percent coverage) variability across TE families. The average genome repeat coverage, summed across all repeat classes of ~ 42% (268.3 Mb), ranged between 38.27% and 46.74% (see Fig. 3a, Table 5). Genomic TE coverage estimates for both SSA1-SG1 B. tabaci s.l. genomes were similar to that reported for the B. tabaci ‘SSA-ECA’ partial genome at 36.80–39.17% [29, 50]. Levels of TE coverage in the genomes of B. argentifolii (45%; 276.9 Mb) and B. tabaci s.s. (40.30%; 265.0 Mb) were intermediate; greater than both SSA1-SG1 genomes and Asia II-5, but less than SSA2-Ng (46.74%), SSA3-Ng (46.25%) and Uganda-1 (43.90%). Apart from SSA1-SG1-Ug, all other new B. tabaci s.l. genomes exhibited higher overall TE coverage than the closely related pea aphid (Acyrthosiphon pisum; 38%), in agreement with recent reports [50]. In contrast, we observed considerably lower genomics TE coverage across B. tabaci s.l. genomes, in comparison to the greenhouse whitefly T. vaporariorum (56.60%) [43].

Fig. 3
figure 3

Genomic transposable element content in Bemisia tabaci s.l. genome assemblies. a Summary of major TE classes highlighting copy count (#) and genome repeat coverage (%). b Stacked bar-chart of Kimura sequence divergences of TE classes, expressed as a function of percentage of each genome; Y-axis: Genome percent coverage (%); X-axis: Kimura divergence score. c 3D-Bar graph showing TE copy count of repeat classes: DNA, LINE, LTR, SINE

Table 5 Bemisia tabaci s.l. transposable element summary statistics

DNA TEs are part of class II type TEs, which mediate their movements via a DNA ‘cut and paste’ mechanism [47]. The majority of repeats identified in this study are DNA type transposons. The most widespread was DNA-hAT5, with c. 18,164 copies across all six genomes. One of the most striking examples of a species-specific DNA type TE expansion was ‘Tc-Mar-Tc2’ in Uganda-1 (n = 33,335). It is unknown why this is so prevalent in Uganda-1, but it may be related to Uganda-1’s phylogenetic position as one of the earliest branching B. tabaci s.l. species to originate in sub-Saharan Africa; a region considered to be the geographical origin of the B. tabaci group of species [7]. Estimates of DNA type TEs in the new genomes were noticeably higher than previous estimates of genomic DNA TE content in B. argentifolii (1.53%), yet more similar to T. vaporariorum (3.40%) [43]. Notably, the TE results presented here, when compared to a recent reanalysis of TEs in published B. tabaci s.l. genomes, showed that DNA superfamily diversity in B. tabaci s.l. is likely richer when methods designed at identifying hitherto uncharacterised TE repeats are applied, see [50].

LINE type retrotransposons (class I type transposons) transpose via an RNA intermediate, in which their transcription and transposition are facilitated via reverse transcriptase (RT) [47]. LINE TEs were the second most abundant repeat type identified in these new B. tabaci s.l. genomes, having an average genomic coverage of 1.30% and between 10 to 12 superfamilies each. Despite LINE family copy count differences between SSA2-Ng and SSA3-Ng, both populations had uniquely higher total LINE TEs with 36.7 k and 40.8 k copies, respectively, compared to a maximum of 23.6 k copies in SSA1-SG1-Ug. Although not a conclusive apomorphic trait; total LINE copy number of between 14.4 K and 18.5 K is a distinguishing feature of SSA2-Ng and SSA3-Ng populations when compared to all other B. tabaci s.l. and lends support for their designation as the same, single, biological species B. tabaci SSA2 ∪ SSA3 (see Table 1, Additional file: 1 Table S5). Further detailed analysis is warranted to fully explore the TE diversity, abundance, and divergence between these genomes. Total LINE genomic coverage across newly reported B. tabaci s.l. genomes were similar to those reported for T. vaporariorum (1.07%) and B. argentifolii (0.96%). Our results are supported by Sicat et al. [50], which showed a higher LINE coverage of 0.94% for B. tabaci ‘SSA-ECA’ genome than was initially reported (0.44%) [32].

Long terminal repeats (LTRs) are a group of autonomous retrotransposons similar to LINEs in that they are also mediated via a RT mechanism to “copy and paste” to new genomic loci [47]. Class I type LTR retroelements were the third most abundant TE type and displayed a slightly more uniform distribution but represented only ~ 0.6% of total genome coverage. The most abundant LTR family was ‘Gypsy’, with c. 11,546 copies. Instances of population specific LTR TE families were also observed, including ‘DIRS’ and ‘ERV4’ located in SSA1-SG1-Ug and Asia II-5, respectively (see Fig. 3c). Differentiated from LINEs and LTR TEs, which encode their own RT enzymes, SINE TEs rely on hitch-hiking copies of itself via the transcriptional machinery of autonomous class I TEs. SINE TEs had the smallest overall contribution to each of the B. tabaci s.l. genomes, the most abundant family was ‘SINE’ with between 3 to 5 k copies per genome and the SINE family ‘tRNA-V’ was uniquely identified in SSA3-Ng. Overall, we found broad similarities to previous estimates of LTRs and SINE TEs in T. vaporariorum, B. argentifolii, B. tabaci s.s. and B. tabaci ‘SSA-ECA’ [50]. See Table 5 and Additional file 1: Table S5 for a full breakdown of B. tabaci s.l. TE family copy diversity.

Kimura-distance-based divergence analysis [48] was used to estimate the relative age of TE types for the six new B. tabaci s.l. genomes (Fig. 3b). ‘Kimura scores’ are based on relative distance from the reconstructed repeat consensus sequence, so lower values represent a “younger” or more recently acquired sequence. Peaks along the X-axis shown in Fig. 3b indicate increased acquisition of younger TE. The pattern outlined above is largely conserved, whereby the age distribution and overall genomic TE coverage followed a similar distribution across all six genomes, culminating in a Kimura divergence score of ~ 40. Of the characterized TEs, pronounced peaks in DNA TEs across all six genomes were observed and represents the largest overall total contributor and continued temporal TE activity identified in this study. Bemisia tabaci SSA2 ∪ SSA3 (both SSA2-Ng and SSA3-Ng genomes) showed subtle, yet clearly observable, patterns of more recent acquisition of DNA and LINE type TEs. Peak LTR TE activity for SSA1-SG1-Ug was evident between Kimura score 1 to 6, and for SSA2-Ng between Kimura score 8 to 14. The pattern of LTR activity observed in SSA1-SG1-Ug and SSA2-Ng is not seen in the other B. tabaci s.l. genomes, though Uganda-1 does also show signs of increased LTR TE activity between Kimura score 1 to 3.

To gain a greater evolutionary context regarding Bemisia TE evolution, we examined B. tabaci s.l. TEs with reference to the Arthropoda TE complement first presented in Petersen et. al. [49]. Notably, the new B. tabaci s.l. genomes exhibit some of the highest proportion of hitherto uncharacterized TE diversity (~ 85%), second only to species of mayfly, such as Ephemera danica or Eurytemora affinis (at + 93% uncharacterized) and comparable to the more closely related species, F. occidentalis (~ 87%) [49]. The apparent level of uncharacterized TE in our B. tabaci s.l. genomes have recently been improved whereby DNA and LINE type TE superfamilies were shown to be present at a higher total genomic coverage compared to our analysis [50]. Here, between 45 and 55 TE superfamilies were identified across all six B. tabaci s.l. genomes; but it is likely that it is closer to ~ 80 superfamilies when uncharacterized TEs are identified through clustering and in-depth characterization methods. Importantly, TE diversity described here was close to that reported for pea aphid, A. pisum, another phloem-feeding hemipteran [49]. Despite additional analysis being required to fully characterize and assign family level affinity across all TE examined; our results nonetheless emphasize B. tabaci s.l. genomes have some of the highest proportion of total genomic TE coverage for all major Arthropoda clades examined to date.

Whole genome comparative genomics

Comparative genomics of B. tabaci s.l. was performed via OrthoFinder (v 2.4.0) analysis [51, 52] and included twenty-three Hexapoda genomes, with a primary focus on sampling more closely related Hemiptera taxa (Fig. 4, Table 6). Non-whitefly Hemiptera included: A. pisum (pea aphid), Diaphorina citri (Asian citrus psyllid), Myzus persicae (green peach aphid) and Rhodnius prolixus (kissing bug). To improve overall resolution of the ingroup B. tabaci s.l., the reannotated genomes for B. argentifolii and B. tabaci s.s. produced for this study were included. The non-Bemisia Aleyrodidae greenhouse whitefly T. vaporariorum [43] was selected as an outgroup, being the only other whitefly species with a completely sequenced genome at the time of analysis. Complete results, listed ingroup taxa and data obtained from the comparative analysis with OrthoFinder, including OGCs, sequences and preliminary gene trees can be obtained from the data repository ‘Figshare’ [53]; see ‘Materials and methods’.

Fig. 4
figure 4

A genome wide species level phylogeny with clade specific orthologs. Whole genome comparative analysis computed with Orthofinder. Publicly available whitefly species were B. argentifolii, B. tabaci s.s. and T. vaporariorum (“Greenhouse whitefly”). Phylogenetic relationships estimated with RAxML (maximum likelihood) and MrBayes (Bayesian posterior probability) on a concatenated matrix of protein sequences of 23 species covering 655 OGCs (131,953 amino acids). a Species-level phylogeny with associated node support values (*/*) ⟹ Bayesian PP / BS (bootstrap replicates n = 100); under the best-fitting substitution model LG + G + F + I. b Ortholog set delineation depicted with respect to major Arthropoda clades (Pancrustacea, Hexapoda, Hemiptera and Aleyrodidae); ‘Multi-copy N:N:N’ ortholog sets contain ≥ 1 gene across all species; ‘Patchy’: missing a single species representative. The six new B. tabaci s.l. populations are highlighted in green dashed boxes. c OGC clade sets with relatively low gene counts expanded for clarity

Table 6 Comparative genomics of Bemisia tabaci s.l. and related insects with OrthoFinder

Of the 391,154 input genes that received orthology assignment, 360,756 (92.2%) genes were successfully clustered into 22,225 orthologous gene clusters (OGCs). A total of 2,297 OGCs (10.3%) contained all twenty-three species while 7,205 (32.4%) OGCs were species-specific gene clusters representing 9.1% of all genes analyzed. 2,404 OGCs (10.81% of all OGCs) included all eight B. tabaci s.l. species considered, while on average 7,620 OGCs contained one or more representative genes. We recovered 142 (0.64%) OGCs as B. tabaci s.l. species specific. Excluding all B. tabaci s.l., the average species specific OGC count rose to 470 (2.1%). An explanation for the discrepancy in species specific OGC counts and evidence of concordance of annotation could stem from our application of a uniform methodology across B. tabaci s.l., in contrast to the remaining taxa obtained from a range of community annotations that potentially could have overestimated genes or included poor-quality models.

The recovery of taxonomically clustered OGCs within: (i) the Hemiptera (excluding whitefly), (ii) whitefly and iii) B. tabaci s.l., showed largely similar numbers of OGCs (Fig. 4b, c, Additional file 1: Table S6). Hemiptera had an average count of seventeen OGCs per species, with a maximum of twenty-eight observed in B. tabaci SSA1-SG1-Ng and a minimum of two in R. prolixus. Overall, within the Aleyrodidae, B. tabaci s.l. species recovered an average of 123 OGCs per species, while T. vaporariorum had a total of 71 OGCs. The genome of B. tabaci s.s. uniquely exhibited many more species-specific OGCs (n = 151), compared to an average of 41.5 amongst the remaining B. tabaci s.l. Forty-seven OGCs contained unique, single-sequence, one-to-one, single-copy orthologs. This increased to 655 OGCs, when the minimum percentage of ingroup taxa represented by a single sequence ortholog was reduced to 78.3%.

Integrative systematics: a consilience of evidence

Phylogenomic analysis of B. tabaci s.l. and other insects

Phylogenomic analyses on an alignment of 655 OGCs or 131,953 amino acids (average % missing data = 10.34%) were conducted using the best fitting evolutionary model (LG + Γ + I + F) under both maximum likelihood (ML) and Bayesian inference (BI) methods using RAxML and Mr. Bayes, respectively. Only amino acid sequences of the longest canonical protein-coding transcript were considered. Results from both ML and BI phylogenetic analyses recovered near universal support of the exact same topology, providing high support for monophyletic whitefly (Aleyrodidae) and monophyletic B. tabaci s.l. (Fig. 4a). Overall topological support recovered from ML and BI analyses was similar with linear log-likelihood values of lnl-2.359003 and lnl-2.360958, respectively. All African B. tabaci s.l. species clustered into a single clade (“Africa-only”: Uganda-1, SSA1-SG1-Ug, SSA1-SG1-Ng, SSA2-Ng, SSA3-Ng) sister to a clade of all non-African whitefly (B. argentifolii, B. tabaci s.s. and Asia II-5). Of the three non-African whiteflies, B. argentifolii and B. tabaci s.s. clustered as each other's closest relative, to the exclusion of Asia II-5.

Full posterior probability (PP) and 100% of all bootstrap (BS) replicates (n = 100) supported the placement of the non-cassava feeding species Uganda-1 as the earliest branching member of “Africa-only” whitefly clade. The sister group to Uganda-1 was B. tabaci SSA1-SG1-Ug, which was placed in a paraphyletic relationship with SSA1-SG1-Ng. Topological support within this clade only differed regarding the placement of SSA1-SG1-Ng; whose position in the phylogeny was the least stable, highlighted by a fall in BS support (63%).

Full support was recovered across both ML and BI analyses for monophyletic Hemiptera, in which whitefly were sister grouped to a clade of aphids (A. pisum + M. persicae) and D. citri, to the exclusion of R. prolixus which was positioned as the earliest branching member of all Hemiptera taxa. Hexapoda was recovered as monophyletic, with full support (ML, BI) of Pancrustacea including Hexapoda sister to D. pulex. Remaining taxa were positioned at the base of Arthropoda, falling in line with the current understanding of major orders and subphyla within Arthropoda [54, 55].

Biological species inferred by reciprocal cross-mating

We determined the reproductive compatibility between selected populations of B. tabaci s.l. using single-pair and group-pairing reciprocal cross-mating tests. Bemisia tabaci s.l. reproduces by arrhenotoky, whereby unmated females produce only male progeny. In contrast, mated females produce both male and female progeny, whereby the female progeny develop sexually from fertilized eggs [56]. We used this phenomenon, therefore, to assess mating success and reproductive compatibility, by the presence or absence of female offspring. Cross-mating tests involved pairing 3:1 and 15:5 males to females from different populations, including SSA1-SG1-Ng, SSA1-SG1-Ug, SSA2-Ng, SSA2-Ug, and SSA3-Ng (Additional file 1: Tables S7, S8).

Reproductive incompatibility was observed between sympatric populations that differed by more than 7% in their partial mtCO1 sequences, i.e. SSA1-SG1-Ng x SSA3-Ng, SSA1-SG1-Ng x SSA2-Ng. Cross-mating of most allopatric populations did not produce any female progeny, indicating complete reproductive incompatibility. However, SSA1-SG1-Ug x SSA1-SG1-Ng and SSA2-Ng x SSA3-Ng crosses demonstrated complete reproductive compatibility, producing female progeny in both single-pair and group-mating crosses. Importantly, the partial mtCO1 of SSA2-Ng and SSA3-Ng differs by over 6.1%, exceeding the proposed 3.5% species threshold by 2.6% [9, 26]. In addition, control backcrosses showed that the F1 progeny were fertile (Fig. 5, Additional file 1: Tables S7, S8). Similar observations were reported for two putative species Asia II-9 and Asia II-3, with mtCO1 divergence of > 4.5%, where reproductive compatibility was shown to be complete in one direction, and partial in the other [57].

Fig. 5
figure 5

Reproductive compatibility of eight B. tabaci s.l. populations collected in Uganda and Nigeria. Male parents (top row) and female parents (left column). Symbols represent the degree of reproductive compatibility. The black circle (⚫) represents complete reproductive compatibility between members of the SSA1-SG1 ∪ SG2, the fisheye circle (◉) represents complete reproductive compatibility between members of the SSA2 ∪ SSA3 species, the hexagon (⬢) represents complete reproductive compatibility observed in the SSA1-SG3 population, while the circled-cross ( ⊗) represents complete reproductive incompatibility with no female progeny production in F1 generation. The mating-crosses denoted by double asterisks (**) were carried out by Mugerwa et al. [8]

Overall, these results highlight the need for additional genetic markers to enable accurate differentiation of the biological species within B. tabaci s.l., e.g. the concatenated nuclear-gene sequences in building the species tree [58]. Although SSA2 and SSA3 are evidently the same biological species, which we now call, B. tabaci SSA2 ∪ SSA3 (Table 1), the SSA3 population has mostly been recorded from the rainforest ecological zones of West Africa, while SSA2-Ug populations are mostly present in the drier and higher-temperature latitudes of the Sahel [59]. It is probable, therefore, that the observed genetic differences reflect ecological adaptations to the different ecological zones they occupy.

Whitefly systematics using mitochondrial genomes

Phylogenetic analyses were performed using concatenated 11 mitochondrial PCGs on 28 hemipteran species and one Coleoptera as outgroup (Additional file 1: Table S9). The higher rate of mutation of mitochondrial sequences compared to nuclear sequences meant that we had to restrict the phylogenetic analyses to only Hemipteran, rather than including Hexapoda and Pancrustacea as done for the genome-wide phylogenetic analysis [60]. The nucleotide sequences of the 11 PCGs were better suited for inferring phylogenetic relationships within the Hemiptera compared to the isolated mtCO1 gene (data not shown). Within the Aleyrodidae, support values were also better for the concatenated mitochondrial PCGs (Fig. S8). Phylogenetic reconstruction shows that B. tabaci s.l. populations from Sub-Saharan Africa occupied a monophyletic clade with Uganda-1 positioned as the earliest branching member.

Although the partial mtCO1 marker provides a good initial framework for identifying putative biological species within B. tabaci s.l., analysis of mtCO1 alone has limitations. There is a need, therefore, to continue the integrative approach [6] for identifying biological species within the group which involves combining evidence from biological and molecular datasets. An integrative approach, as applied herein, supports that the seven African cassava populations of B. tabaci s.l. represented in Fig. S8, can be reduced to three biological species. Hence, the SSA1-SG1 Nigeria, SSA1-SG1 Uganda and SSA1-SG2 Kayingo populations were grouped as “B. tabaci SSA1-SG1 ∪ SG2” and SSA2 Kiboga, SSA2 Nigeria and SSA3 Nigeria were grouped as “B. tabaci SSA2 ∪ SSA3” while SSA1-SG3 remains distinct from “B. tabaci SSA1-SG1 ∪ SG2” (Table 1, Fig. S8).

Endosymbiont Portiera co-cladogenesis and metabolic potential

B. tabaci s.l. contain endosymbiotic bacteria, including the primary obligate endosymbiont, ‘Candidatus Portiera aleyrodidarum’ (hereafter Portiera) and up to seven secondary, facultative endosymbionts; Cardinium, Arsenophonus, Hamiltonella, Rickettsia, Wolbachia, Fritschea and Hemipteriphilus asiaticus [61, 62]. Of these, Portiera, Arsenophonus, Cardinium, Hamiltonella, Rickettsia, and Wolbachia have been reported previously from B. tabaci SSA1 [62]. The population of B. argentifolii used to generate the draft genome, for example, contained Portiera and two secondary endosymbionts, Hamiltonella and Rickettsia, which had assembled genome sizes of 352 kb, 1.74 Mb and 1.38 Mb, respectively. For P. aleyrodidarum, 273 genes were predicted whose functions were essential for basic cellular processes and whitefly nutrition [32].

Strict vertical transmission over long evolutionary periods result in primary endosymbionts reflecting their host phylogeny (co-cladogenesis) [63]. Indeed, genomic and molecular dating analysis show that Portiera has been associated with whiteflies since their origin, more than 125 Mya [64,65,66]. We assembled the Portiera genomes of the six new B. tabaci s.l. populations to examine whitefly phylogenetic relationships. The genomic characteristics of the newly obtained Portiera were similar to those from Portiera associated with B. tabaci s.s., Asia II-3 and B. argentifolii (Additional file 1: Table S10). The number of frame-shifted genes was extremely high in some Portiera, however, especially those from B. tabaci Uganda-1 and SSA1-Ng. Frameshifts were generally caused by low-complexity DNA regions, mostly in repetitive ‘A’ rich polymer regions. Although the correct frame could be recovered due to the polymerase slippage [67], the negative correlation between assembly coverage and the number of frameshifted genes suggests this problem is more related to sequencing artifacts from the technology used (PacBio Sequel).

Despite known frameshift issues, Portiera genomes are a complementary source of information to study whitefly evolution. First, we compared the major functions encoded by the different Portiera genomes (Additional file 2: Fig. S9). If frame-shifted genes are considered as coding sequences, we found no differences in the information transfer and translation machinery, nor in the metabolic potential (energy, essential amino acids and co-factors), encoded by Portiera from B. tabaci s.s., B. tabaci s.l. and B. argentifolii (Additional file 2: Fig. S9). Differences in gene content (e.g., different pseudogenization events) are expected when comparing Portiera from different whitefly hosts [68]. Therefore, the maintenance of the same gene content suggests a close relationship among Bemisia harboring Portiera with the same encoding capabilities.

Second, we computed the average nucleotide identity (ANI) among Portiera genomes from B. tabaci s.s., B. tabaci s.l. and B. argentifolii species. Their ANI values were above 96% (Fig. 6), suggesting that all of the new bacterial genomes are strains of the same P. aleyrodidarum [69]. A cluster dendrogram analysis (Fig. 6), however, grouped the genomes of Portiera from B. tabaci SSA2 and B. tabaci SSA3 in one cluster while those from B. tabaci SSA1-Ng and B. tabaci SSA1-Ug were recovered in a different one, showing within-species evolutionary changes. These data, therefore, add additional evidence for concluding that B. tabaci SSA2 and B. tabaci SSA3 are the same biological species (called here, B. tabaci SSA2 ∪ SSA3). Indeed, clustering analysis of ANI values recovered a similar topology as the ones obtained by nuclear and mitochondrial genes, (Fig. 4 and Fig. S8 respectively).

Fig. 6
figure 6

Average Nucleotide Identity and genomic synteny among Candidatus Portiera aleyrodidarum from different Bemisia hosts The cladogram on the left summarizes Portiera relationships based on their pairwise Average Nucleotide Identity values (heatmap, middle). On the right, genomic synteny conservation among Portiera strains based on 202 complete Coding Sequences (CDS) (blue) and the CDS presence in the variable region (green). Portiera genomes are represented linearly, the presence of a subcircular conformation of the variable region is represented at the end of the plot (separated by double backslashes). Blue boxes representing syntenic CDS in the direct strand (upwards) or in the complementary strand (downwards), genes from the variable region are denoted in green. Gray lines connect orthologous CDS

Portiera from whitefly species belonging to the Aleurolobini tribe, which includes the Bemisia genera, present different genome architectures [68]. Indeed, the genome architecture of Portiera is less conserved among Aleurolobini species that diverged longer ago, such as Singhiella simplex, which separated from the branch leading to B. tabaci s.l. c. 71.34 Mya [68]. Therefore, the maintenance of the genomic architecture (macro and micro-synteny) among Portiera from B. tabaci s.s., B. tabaci s.l., and B. argentifolii suggests a recent divergence, in evolutionary terms, of their hosts (Fig. 6). The exception in genome order was mainly the region encoding for three genes (yidC, mnmE, and mnmG). This region is known to be present as an episome (sub-circular particles) or integrated into the chromosome. Also, the number of gene copies varies from zero (absent) to at least three copies [70]. This region was detected as an episome in B. tabaci SSA1-Ug (single copy) and Asia II-5 (two copies), integrated into the chromosome of B. tabaci SSA2 (two copies) and absent in B. tabaci SSA3 and Uganda 1 (Fig. 6). However, variations in this region are unrelated to the host species, since variation occurs even at the intrapopulation level [70].

For the six new B. tabaci s.l. genomes, only Hamiltonella reads were found for B. tabaci Uganda-1. We obtained 12 scaffolds, with a total size of 1,609,740 bp for the Hamiltonella genome. Hamiltonella and Arsenophonus S-endosymbionts supply their host with B vitamins, therefore, they are required for whitefly development [71, 72]. A possible explanation for the absence of S-endosymbionts sequences was the use of adult males for sequencing. In adult males, bacteriocytes degenerate with age, thus reducing the number of endosymbionts present, especially S-endosymbionts [73].

Gene families associated with detoxification, sugar metabolism and cassava adaptation

Host-plant association studies suggest that B. tabaci s.l. is a group of more than 40 oligophagous species, with only a few possessing a truly broad host-plant range. RNA-Seq analyses have shown that B. tabaci s.l. have an ancestral, or converged, expression pattern of the detoxification “machinery” that is shared amongst species and that enables them to perform well on multiple common and novel hosts [74]. In a study of the genetic diversity of whitefly (Bemisia spp.) on crop and uncultivated plants in Uganda, the most prevalent whiteflies were B. tabaci MED-ASL (30.5% of samples), B. tabaci SSA1 (22.7%) and B. tabaci Uganda-1 (12.1%), which were also the most polyphagous occurring on 33, 40 and 25 different plant species, respectively. Although all three species exhibited a high level of polyphagy, only B. tabaci SSA1-SG1 ∪ SG2 and B. tabaci SSA2 ∪ SSA3 were present on cassava [31], suggesting clear differences in the abilities of these species to process cassava’s phytotoxins.

Known detoxification gene families, including cytochrome P450s, (UDP)-glucuronosyltransferases, glutathione transferases, ABC transporters and carboxylesterases were reported from the draft B. argentifolii genome [32]. Of these, the carboxylesterases and UDP-glucuronosyltransferases were expanded significantly, relative to most other insect genomes. Expansion of these six detoxification gene families in B. tabaci s.l. is probably central to their success [31] and has enabled insecticide resistances to evolve rapidly.

Here, we characterized detoxification gene families in B. tabaci s.l. to understand their potential role in adaptations to cassava, as well as to insecticides. Utilizing queries derived from published proteins of B. argentifolii [32], we conducted phylogenetic analysis of these targeted detoxification enzyme families in B. tabaci s.l. that operate in a three-phase process. Phase I includes mainly the activity of cytochrome P450 monooxygenase (P450s) and carboxylesterase enzymes (COEs), which reduce, hydrolyze, or oxidize a variety of endogenous toxic compounds and exogenous substances. Phase II includes enzymes mainly from the glutathione S-transferases (GSTs), uridine diphosphate (UDP)-glucuronosyltransferases (UGTs) and cytosolic sulfotransferases (SULTs) families. These enzymes catalyze the conjugation of glutathione, glucose or sulfonate group to the reactive site of the phase I products, thereby increasing their polarity and facilitating their excretion. Phase III includes mainly ATP-binding cassette transporters (ABCs) that export the conjugated products out of the cells [75, 76]. Most studies on the functionality of detoxification gene families in B. tabaci s.l. have focused so far on their involvement in host-plant adaptation [74, 77] and insecticide resistance [78].

Overall, we detected variation in the number of detoxification genes in the five African species, ranging from 207 in SSA3-Ng to 164 in B. tabaci Uganda-1 (the genome of the reference B. argentifolii species harbors 268 detoxification genes). The identity level (of the proteins coded by the detoxification genes) between the African species and B. argentifolii was surprisingly high, ranging from 93.1% in the COE family of SSA2-Ng, to 97.2% in the ABC family of SSA1-SG1-Ug (Fig. 7, Additional file 1: Table S11). Even so, ~ 10% of the genes in each Species X Gene family combination showed an identity level of 90% or lower. For full details on the input sets of B. argentifolii detoxification gene family queries and a summary of PAV (Presence Absence Variation) across the seven B. tabaci spp. examined see Additional file 1: Table S12 and Table S13 respectively.

Fig. 7
figure 7

Protein identity across detoxification gene families. A box-plot representation of a curated set of detoxification proteins obtained from B. argentifolii and their putative orthologous protein (each represented by a single dot) in seven analyzed species of Bemisia tabaci s.l: Asia II-5, B. tabaci s.s., SSA1-SG1-Ng, SSA1-SG1-Ug, SSA2-Ng, SSA3-Ng and Uganda-1. A BLAST-combined with manual inspection approach used to check the identity of each protein. All alignments shown include putative orthologous proteins with at least 100 amino acids of the entire sequence aligned (cutoff >  = 85% PID). The number of proteins analyzed between species vary, as it was not always possible to recover a B. argentifolii orthologue in each of the seven analyzed Bemisia species

Carboxylesterase enzymes (COEs)

COEs catalyze the hydrolysis of an ester bond into the corresponding alcohol and carboxylic acid. Our analysis focused on enzymes previously proposed to display digestive or detoxification functions [79, 80]. In B. tabaci s.l., these enzymes (α/β esterase) play a role in the detoxification of several important groups of insecticides, such as pyrethroids, organophosphates, and carbamates [81]. Testing the presence/absence of B. argentifolii orthologs in the three African grouped biological species (Additional file 1: Table S12, Table S13, Additional File 2: Fig. S10, indicated the absence of seven orthologs in Uganda-1, four in SSA1-SG1 ∪ SG2, and three in SSA2 ∪ SSA3. Among all absent orthologs, two were absent in all three African biological species (XP_018910005.1 and XP_018899251.1 in B. argentifolii). Orthologs of carboxylesterase 1E in B. tabaci s.l. (XP_018899849.1 in B. argentifolii), a highly conserved enzyme involved in xenobiotic resistance [82], were the only enzymes that clustered together with both outgroup species, T. vaporariorum and Drosophila melanogaster. This gene could not be detected in the genomes of B. tabaci SSA1-SG1 ∪ SG2.

Cytochrome P450 monooxygenases

P450s are a superfamily of enzymes that have essential roles in metabolic processes such as hormone synthesis and the catabolism of toxins and other chemicals in insects [83]. In B. tabaci s.l., the activity of specific P450 enzymes was shown to confer resistance to chemical insecticides [84, 85]. Similar to other insects [86, 87], the cytochrome P450 (CYP) gene family can be divided into four major phylogenetic clans: CYP2, CYP3 (including families CYP6, CYP9, and CYP325), CYP4 (including families CYP4 and CYP325), and mitochondrial P450s (including families CYP12 and CYP314) (Additional File 1: Table S12, Table S13, Additional File 2: Fig. S11). Testing the presence/absence of B. argentifolii orthologs in the three African grouped biological species (Additional File 2: Fig. S11), indicated the absence of 33 orthologs in Uganda-1, mainly from the CYP3 (19 genes) and CYP4 (9 genes) clans. Also, 6 orthologs were found to be absent in SSA1-SG1 ∪ SG2 (3 in CYP3 and 3 in CYP4), and 10 in SSA2 ∪ SSA3 (7 in CYP3 and 2 in CYP4). From all orthologous missing in the African grouped biological species, two were absent both in SSA1-SG1 ∪ SG2 and SSA2 ∪ SSA3 (XP_018917660.1 and XP_018898101.1) and an additional two (XP_018905683.1 and XP_018917273.1) in all three African biological species.

Cytosolic sulfotransferases

SULTs constitute a group of enzymes that catalyze the transfer of a sulfonate group from the active sulfate, 3′-phosphoadenosine 5′-phosphosulfate, to a substrate compound containing a hydroxyl or amino group [88]. These enzymes are considered to be involved in the inactivation and excretion of xenobiotics and endogenous compounds [88]. In our analysis, we focused on two groups of sulfotransferases, SULT1E1 (six genes) and SULT1C4 (5 genes) (Additional File 1: Table S12, Table S13). Enzymes from the SULT1E1 group transfer a sulfonate group both to endogenous substrates such as estrogens or iodothyronines and to various flavonoids [89]. Enzymes from the SULT1C4 group modify steroids, neurotransmitters, and xenobiotics, and are involved in drug detoxification [90]. Testing the presence/absence of B. argentifolii orthologs in the three African biological species (Additional File 2: Fig. S12), indicated the absence of three orthologs in the Uganda-1 genome (two from SULT1C4 and one from SULT1E1). Also, orthologs from T. vaporariorum and D. melanogaster were only detected in the SULT1E1 group (Additional File 2: Fig. S12).

Glutathione S-transferases

Glutathione S-transferases of insects are essential to convert xenobiotics, such as toxic phytochemicals and synthetic insecticides, into nontoxic products [91]. In insects, this family is divided into two groups, microsomal enzymes and cytosolic enzymes, which differ both in their origin and structure [92]. The cytosolic GSTs are further divided into six classes: Theta, Zeta, Omega, Sigma, Delta, and Epsilon [93], the latter two being unique to insects [94]. Previous studies have shown that both the high tolerance of B. tabaci to insecticides and the species adaptability to plants with high levels of secondary metabolites, are associated with enhanced expression of GSTs [95,96,97]. Testing the presence/absence of B. argentifolii orthologs in the three African biological species (Additional File 1: Table S12, Table S13, Additional File 2: Fig. S13), indicated the absence of nine orthologs in Uganda-1 (eight from the Sigma and Delta-Epsilon classes), three Delta orthologs in SSA1-SG1 ∪ SG2 and one Delta class ortholog in SSA2 ∪ SSA3 (Additional File 2: Fig. S13). For all orthologous missing in the African biological species, only QHU79966.1 (Delta GST class) was absent in all three species. Interestingly, the Delta GST class was reported to play a role in the ability of insects to detoxify xenobiotics [98] and to be significantly expanded in B. tabaci s.s. [91].

Uridine diphosphate-glucuronosyltransferases

UGTs catalyze the addition of UDP-sugars to small hydrophobic molecules, turning them into more water-soluble metabolites [99]. In insects, UGTs play an essential role in the detoxification of xenobiotics and a variety of plant phytotoxins [99]. The phylogenetic analysis we conducted indicated that the UGT gene family of B. tabaci s.l. can be further divided into 15 subfamilies: (UGT352, UGT353, UGT354, UGT355, UGT356, UGT357, UGT358, UGT359, UGT360, UGT361, UGT362, UGT363, UGT365, UGT366, UGT50) [100, 101] (Additional File 1: Table S12,, Table S13 Additional File 2: Fig. S14). Testing the presence/absence of B. argentifolii orthologs in the three African biological species (Additional File 2: Fig. S14), indicated that 27 UGT orthologs are missing in Uganda-1, six in SSA1-SG1 ∪ SG2, seven in SSA2 ∪ SSA3, and three in both B. tabaci s.s. and Asia II-5. From the total of 33 absent orthologs, three could not be found in all three African biological species (XP_018914531.1, XP_018903292.1 and XP_018896850.1). Most absent genes belonged to only two subfamilies, UGT353 (5/12) and UGT352 (18/33). On the other hand, T. vaporariorum orthologs were found in all subfamilies except UGT355. Expansion of specific UGT subfamilies, UGT352 and UGT353, was detected in B. tabaci s.l. (when compared to T. vaporariorum). Only one gene (XP_018897454.1 in B. argentifolii, subfamily UGT50) showed sufficient conservation to allow its clustering with orthologs from D. melanogaster and T. vaporariorum [102]. This UGT gene could not be found in the genomes of the B. tabaci SSA2 ∪ SSA3 and Uganda-1.

ATP-binding cassette transporters

ABC transporter genes encode membrane-bound proteins that carry a wide range of molecules such as amino acids, peptides, sugars, and a large number of hydrophobic compounds across membranes [103] and in B. tabaci s.l. are also implicated in insecticide resistance [104, 105]. The ABC transporters gene family of B. tabaci s.l. can be further sub-divided into eight subfamilies (A-H), which include an expanded ABC-G subfamily [105] (Additional File 1: Table S12, Table S13, Additional File 2: Fig. S15). Testing the presence/absence of B. argentifolii orthologs in the three African biological species (Additional File 2: Fig. S15), indicated that five orthologs (three ABC-G and two ABC-H) are missing in Uganda-1, and one ortholog from the ABC-G subfamily is missing in B. tabaci SSA1-SG1 ∪ SG2 and B. tabaci SSA2 ∪ SSA3, although not the same gene. T. vaporariorum orthologs were identified in all subfamilies, but D. melanogaster orthologs could be identified only for the ABC-G subfamily.

Evolution of α-glucosidase (GH13) within cassava B. tabaci SSA1-SG1-Ug and SSA1-SG1-Ng: a case of sucrose hydrolase

B. tabaci s.l. has evolved to exploit the sugars-rich diet of plant phloem-sap. Adaptations to this specialized diet include α-glucosidase genes that encode sugar-transforming enzymes belonging to the α-glucosidase glycoside hydrolase (GH) family 13 by hydrolyzing sugar to its constituent monosaccharides to facilitate both digestion and osmoregulation within the whitefly gut [106,107,108]. Alpha-glucosidases are categorized into three types (“I”, “II”, “III”), based on substrate recognition [109]. The α-glucosidase glycoside hydrolase (GH) family 13 type I is found in bacteria and insects [110]. These molecules recognize the α-glycosyl moiety and hydrolyze heterogeneous substrates such as sucrose. The phloem-sap of different plants contains varying concentrations of heterogeneous substrates/sugars, but sucrose is the dominant sugar present in many plants [111]. Recent studies have shown that the starch and sucrose metabolism pathways were overexpressed in B. argentifolii [74, 112]. This gene family and mechanism for sugar hydrolysis within different whitefly species, therefore, is likely to be important for host-plant adaptation.

In East and Central Africa, B. tabaci SSA1-SG1 ∪ SG2 reaches “super-abundant” numbers on cassava, however this phenomenon has not been reported for populations of the same species in West Africa [59]. To investigate whether, or not, this phenomenon is associated with evolutionary changes in the α-glucosidase GH13 genes, we analyzed these families in the sister populations B. tabaci SSA1-SG1-Ug and B. tabaci SSA1-SG1-Ng genomes and compared them to the non-cassava colonizing species B. argentifolii.

Phylogenetic analysis of the 199 α-glucosidase (GH13) protein sequences identified in OGC ‘OG0000016’ resulted in clustering of 26 different clades (Fig. 8). In comparison to non-whitefly species, whitefly genomes had the highest number of α-glucosidase genes (Table 7), underlining their importance to this species group. The presence of signal peptides varied amongst all 199 sequences examined. Twenty-nine whitefly α-glucosidase genes clustered with the experimentally validated SUC1 gene ‘Q0H3F1_ACYPI’ (A. pisum), forming two clusters (cluster C1, highlighted in green & cluster C2, highlighted in red) (Fig. 8). The gene product ENSSSA1UGT001243 clustered with ENSSSA1NGT010254 and ENSMEAMV2T026084, SUC1 genes in B. tabaci SSA1-SG1-Ng and B. argentifolii, respectively. It also clustered with ENSSSA1UGT025021, a paralog in B. tabaci SSA1-SG1-Ug, although it lacked a signal peptide (Additional file 3). The implication is the gene ENSSSA1UGT025021 may encode for an enzymatically inactive protein or alternatively has become pseudogenized, as a signal peptide is required for the protein to function as an extracellular enzyme in the gut lumen [113]. These findings show that the vast majority of whitefly α-glucosidase genes clustered and were unique to whitefly. A small proportion had an orthologous relationship with genes belonging to A. pisum, indicating that these are also present in other phloem-sap feeders.

Fig. 8
figure 8

Phylogenetic relationships of α-glucosidase (GH-13) genes of thirteen arthropod species. Phylogenetic analysis focused on three Bemisia tabaci s. l. populations: SSA1-SG1-Ug, SSA1-SG1-Ng and B. argentifolii. Non-whitefly taxa T. castaneum, A. pisum, A. gambiae, B. terrestris, B. mori, D. plexippus, D. pulex, D. melanogaster, R. prolixus and T. urticae are uniquely colored. Phylogenetic analysis performed using a Bayesian approach and implemented in Bayesian Evolutionary Analysis Sampling Trees (BEAST version 1.10.2). Clusters (C 1 <—> C 7) are defined based on selection analysis; see Additional file 1: Table S14. The α-glucosidase genes related to sucrose hydrolysis are located in cluster 2, highlighted in purple (**)

Table 7 The presence/absence of signal peptides in 13 arthropod populations/taxa

A total of four genes in B. tabaci SSA1-SG1-Ug and two genes in B. tabaci SSA1-SG1-Ng had orthologs in insect species that do not feed on phloem-sap. For example, three genes from B. tabaci SSA1-SG1-Ug, B. tabaci SSA1-SG1-Ng and B. argentifolii (ENSSSA1UGT011292, ENSSSA1NGT018572 & ENSMEAMV2T025707) clustered with RPRC013046-RA (R. prolixus), ACYP1002020-RA (A. pisum), XM020863006.1 (B. terrestris) and RPRCO12963-RA (R. prolixus). This gene encodes for a sucrose hydrolyase with two magnesium ion ligands binding to five amino acid residues; D.46, D.48, D.50, I.52 & D.54 and three protein–ligand interactions forming metal complexes with amino acid residues; D.48, D.50 and I.52. The second set of orthologous genes in B. tabaci, Anopheles gambiae, D. melanogaster, Tribolium castaneum, Bombyx mori and Danaus plexippus encode for heavy chain protein (neural and basic amino acid transport rBAT). These are heteromeric amino acids transporter b0, + AT-rBAT complex bound with arginine, and in SSA1-SG1-Ug and SSA1-SG1-Ng are coded by gene ENSSSA1UGT025164 and ENSSSA1NGT004694 respectively. Two SSA1-SG1-Ug genes (ENSSSA1UGT010306 & ENSSSA1UGT014032) were annotated as alpha-amylase, an enzyme that hydrolyze alpha bonds of large polysaccharide such as starch and glycogen. Alpha-amylases have also been previously reported in B. tabaci [114]. Among the studied insects, these genes were found only in T. urticae, R. prolixus and D. melanogaster.

To classify the α-glucosidases in B. tabaci SSA1-SG1-Ug and B. tabaci SSA1-SG1-Ng based on the approximate specificity, selected protein sequences were submitted to the protein structure homology-modeling server (Expasy webserver-SWISSMODEL) [115] to identify the best protein database (PDB) template for their three-dimensional structure modeling. We conclude that the α-glucosidase GH13 are: (i) sucrose hydrolyzing enzymes (with PDB: 6Igg.1.A, 6Iga.1.A, 6Igf.1.A as the best homologous PDB templates), (ii) toxin receptor proteins (PDB: 6K5p.1.A), (iii) neutral and basic amino acid transporter protein rBAT (PDB: 6li9.1.A) and (iv) alpha amylase (PDB: 1dhk.1.A, 1kxt.1.A). Of these, the largest numbers of proteins were toxin receptor proteins and sucrose hydrolyzing enzymes.

Selection pressure in alpha-glucosidases of East and West African B. tabaci SSA1-SG1

From examination of gene-tree clustering, we identified nine α-glucosidase genes in ‘cluster 2’ (Fig. 8) and these were chosen to investigate the nature of selection acting on these different populations of the same biological species of cassava whitefly. Each gene was analyzed for site selective pressure under both pervasive diversifying and purifying selection. Significant pervasive diversifying and episodic selection were detected in two genes: ENSSSA1NGT018454 and ENSSSA1UGT001243 (Table 8). Of these, ENSSSA1UGT001243 encodes for enzymes that hydrolyze sucrose in SSA1-SG1 cassava whitefly and when the sucrose hydrolases in ‘cluster 2’ (ENSSSA1UGT001243, ENSMEAMV2T026084 and ENSSSA1UGT025021) were compared, only ENSMEAMV2T026084 for B. argentifolii contained a protein–ligand interaction of two magnesium ions, interacting with ASP 48, ASP 50 and ILE 52 (Additional file 1: Table S15).

Table 8 Selection pressure analysis on the Bemisia tabaci SSA1-SG1 α-glucosidase (GH13) genes

The B. tabaci SSA1-SG1-Ug gene ENSSSA1UGT001243 was identified using sequence-based methods. The gene encodes for a sucrase hydrolysis enzyme. Oligomeric modeling in SWISSMODEL identified PDB: 6k5p.1.A (a binary toxin receptor protein) and 6lga.1.A (sucrose hydrolase) as the two best quaternary structure annotations. The best model built for both ENSSSA1UGT001243 and its ortholog in B. argentifolii ENSMEAMV2T026084 (“Bta03818” [32]) sequence was that of a toxin receptor protein, while the second-best model predicted a sucrose hydrolase. Another study has reported some of the α-glucosidase GH13 acquire a secondary function, citing a toxin receptor as an example, primarily in mosquito species [118]. The results imply that ENSSSA1UGT001243 may possess two functions; sucrose hydrolysis (osmoregulation) and also a toxin receptor protein, which allows the B. tabaci SSA1-Ug to survive on many plant hosts [74, 112]. The toxin receptor proteins analyzed here possess a cadmium ion as a ligand, interacting with different amino acids and forming different ligand–protein interactions (Additional file 1: Table S15), highlighted by gene mutations in the number and nature of indels. We propose that the different attributes of these toxin receptor proteins enable the whitefly to deal with different phytotoxins, although experimental validation is still yet required.

To investigate if selection pressures have been relaxed or intensified for the East and West African populations of B. tabaci SSA1-SG1 ∪ SG2, the specific relaxation parameter (K) for each gene was determined. Examination of selection pressure In ‘cluster 1’, using the relaxation parameter (K) of both test branches (genes) shifted away from neutrality, with a branch-specific (K) of 1.15 and 1.53 for SSA1-SG1-Ug and SSA1-SG1-Ng, respectively (Additional file 2: Fig. S16a, S17a). In ‘cluster 2’, test branches in SSA1-SG1-Ug shifted towards neutrality, conversely SSA1-SG1-Ng shifted away from neutrality (K = 0.89 and 1.85, respectively) see Additional file 2: Fig. S16b, S17b. Genes evolving under relaxed selection, as seen in these SSA1-SG1-Ug α-glucosidases, facilitate organisms to respond adaptively to the changes in the environment [119]. This occurs either through the reduced intensity of both purifying and diversifying selection, which fosters evolutionary innovation or neofunctionalization, whereby one paralogous copy derives a new function after gene duplication [120]. Our results also show that the genes in ‘cluster 2’ (sugar homeostasis—osmoregulation genes) and ‘cluster 7’ (toxin receptor proteins – detoxification genes) in SSA1-SG1-Ug are under relaxed selection constraints, when compared with similar genes in SSA1-SG1-Ng (Additional file 1: Table S14, S15).

Since the epidemic of severe CMD began in Uganda in the 1990s, new varieties of cassava have been bred and released with CMD resistance. These varieties, however, proved highly susceptible to cassava whitefly [16, 17], allowing the B. tabaci SSA1-SG1-Ug population to adapt to this change in the ecological landscape. In this study, one of the genes in ‘cluster 2’, ENSSSA1UGT001243 (SUC1) has experienced episodic diversifying selection with four sites under pervasive diversifying selection (Table 8) and belongs to a cluster of genes that experienced relaxed selection. The combination of episodic diversifying and relaxed selection indicates that a selection force increased amino acid diversity of this B. tabaci SSA1-SG1-Ug SUC1 gene, likely facilitating further evolutionary adaptations.

Gene families associated with virus-vector competency

Bemisia tabaci s.l. evolution is linked closely to one of the groups of plant viruses that it transmits, the begomoviruses (genus Begomovirus, family Geminiviridae) [12]. These single-stranded DNA plant viruses are almost exclusively transmitted by B. tabaci s.l. and comprise one of the largest genera in the virosphere [121, 122]. Our goal was to elucidate the extent to which gain vs loss of important viral interaction gene families has occurred within B. tabaci s.l. genomes in order to further elucidate potential whitefly gene targets for interference studies. Our analysis of virus-vector competency gene families centered on sequence-based matching via conditional reciprocal best BLAST (CRBB), as well as manual curation of protein-coding functional annotation information. We expand on results by providing a comparative genomics framework (i.e., OGC gene tree generated via OrthoFinder) including specific reference to broader whitefly, and importantly T. vaporariorum. This approach was non-exhaustive, with further examination is warranted to verify these findings and fully elucidate these candidate gene family targets with potential for whitefly control.

Evasion of the host immune system is crucial for successful host infection and propagation of virus particles. Here we have examined the presence and diversity of peptidoglycan recognition proteins (PGRP) (IPR015510, IPR002502, IPR006619, IPR036505). PGRPs are a widely distributed, diverse family of proteins conserved between invertebrates and mammals which facilitate maintenance of vector-pathogen homeostasis via innate immune response against invading pathogens [123,124,125,126]. PGRPs exhibit gene-copy variability across Arthropoda. D. melanogaster (thirteen genes) contains a rich repertoire while mosquitos (seven genes) and tsetse flies (six genes) genomes contain fewer PGRPs [127,128,129,130].

To date, only a single PGRP gene ‘BtPGRP’ (AJQ31845.1) has been identified in B. argentifolii [131]. In this study, comparative analysis revealed two independent PGRP-containing OGCs (‘OG0000284’ and ‘OG0003013’). OGC ‘OG0000284’, was composed of single copy PCGs across all genomes and contained proteins with high sequence identity to ‘BtPGRP’ [131]. Wang et al. [131] reported that BtPGRP contained an Arg106 which is associated with recognizing meso-diaminopimelic acid (DAP)-type peptidoglycans, as previously shown for D. melanogaster PGRP-LE and PGRP-LC [132]. Our results confirm the presence of BtPGRP in whitefly, while further demonstrating its complete sequence conservation of Arg106 between all Aleyrodidae species examined and D. melanogaster PGRP-SB1 (FBtr0075348).

The second PGRP-like OGC ‘OG0003013’, contained proteins which exhibited functional domains linked to the ‘N-acetylmuramoyl-L-alanine amidase/PGRP domain’ superfamily (IPR002502). ‘OG0003013’ contained B. tabaci s.l. members harboring two gene copies each (SSA3-Ng, Uganda-1, B. tabaci s.s). However, not all Hemiptera examined (A. pisum, M. persicae) had detectable homologs within either of these two PGRP-like OGCs and highlights a possible PGRP gene loss over time within whitefly and more broadly Hemiptera.

The hsp70 family is the largest clade of B. tabaci s.l. heat shock proteins (HSPs) [133,134,135,136] and has been shown to play important roles in both insect development and begomovirus transmission for B. argentifolii [137, 138]. HSPs are highly conserved chaperone proteins, present in all organisms and cell types. They have diverse roles related, but not exclusively, to protein transport across membranes, cell cycle control, signaling, cellular stress responses and apoptosis [139]. Several studies have demonstrated that B. tabaci s.l. utilize HSPs to cope with thermal stress and that thermotolerance differs between cryptic species, possibly conferring ecological advantages [133, 140, 141]. An expanded analysis of these diverse sets of hsp70-related gene families is warranted but falls outside the scope of this investigation. We instead present a targeted examination of hsp70 protein domains (IPR018181, IPR013126, IPR043129) in B. tabaci s.l genomes.

A total of five OGCs (‘OG0000087’, ‘OG0002826’, ‘OG0004004’, ‘OG0004227’, ‘OG0004931’) with homology to HSP70 functional protein-domains were identified. Between 58–75% of the total gene members, across all hsp70-like OGCs identified across all B. tabaci s.l. genomes, were located within a single OGC ‘OG0000087’. Published hsp70 ‘BtHSP70’ (XP_018908958.1) previously identified in B. argentifolii recovered significant CRBB hits to ‘OG0000087’ members. Our results showed hemipteran species exhibit the largest diversity of hsp70-like genes, with OGC ‘OG0000087’ containing up to fourteen HSP70 homologs per species; while B. tabaci s.l. genomes contained between 9–14 compared to 10 in T. vaporariorum.

Recent studies have demonstrated that for B. tabaci s.l., the vesicle trafficking system of midgut cells plays an important role in the transport of begomoviruses across the midgut basal membrane to the haemolymph [142, 143]. Further work by Zhao et al. [144] identified the vesicle-associated membrane protein–associated protein B (VAPB) in B. argentifolii, which showed an inhibitory role in the transmission of the begomovirus tomato yellow leaf curl virus (TYLCV). Our investigation of VAPB-like proteins in B. tabaci s.l. genomes recovered six OGCs with homology to VAPB protein domains related to “Vesicle-associated membrane-protein-associated protein” (IPR016763), “PapD-like superfamily” (IPR008962) and “Major sperm protein” (MSP) domain (IPR000535). One of the six OGCs ‘OG0002935’ harbored members with high sequence identity to that of B. argentifolii VAPB (XP_018905345.1). All species were represented in ‘OG0002935’ by a single VAPB gene member, to the exclusion of B. tabaci SSA1-SG1-Ng and B. tabaci Asia II-5 which both had two copies. VAPB was not identified in B. tabaci s.s. Notably then, VAPB is potentially not vital for successful TYLCV transmission, given B. tabaci s.s. Is known to be an efficient vector of this virus [145].

Another crucial protein related to begomovirus transmission is found in B. tabaci midgut proteins (MGP). MGP is crucial for successful begomovirus transmission, where the viral coat protein interacts with the MGP to facilitate transfer into hemolymph and salivary glands [146]. MGP is characterized by the presence of a secretory signal-peptide and transcription activator MBF2 (IPR031734) domain. Two OGCs (‘OG0009507’, ‘OG0000655’) were identified with likely homology to this particular MBF2 domain, however only a subset of B. tabaci s.l. examined contained evidence of possessing a MBF2 signal peptide as previously reported by Rana et al. [146].

OGC ‘OG0009507’, had the most significant BLASTp hits to the previously reported uncharacterized protein in B. argentifolii (XP_018898813.1) and a secreted salivary protein (XP_026481543.1) identified in the cat flea Ctenocephalides felis. ‘OG0009507’ was composed of exclusively whitefly-only gene members, of which only Non-African B. tabaci s.l. members possessed a signal-peptide. A. pisum and M. persicae lacked any identifiable homologs to the MGP-like protein suggesting a this to be a uniquely whitefly-specific gene expansion with a potentially high degree of importance of this protein family during the co-evolution of begomoviruses and B. tabaci s.l. The second OGC ‘OG0000655’, had the highest overall homology to the MGP reported previously for B. argentifolii (AIK97534.1) [146]. Both OGCs represent excellent gene family targets for possible expansion of whitefly control mechanisms given their high degree of importance for begomovirus transmission.

Lastly, we examined the diversity of B. tabaci s.l. genes with homology to cyclophilins (CyPs). CyPs are a diverse superfamily of proteins found in both prokaryotes and eukaryotes. Cyps are involved in protein folding [147], and function in many cellular processes, such as protein–protein interactions and chaperone functionality (reviewed in [148]). CyPs have been implicated in luteovirus transport and transmission by aphids [149]. Whole genome comparative analysis based on query proteins with functional annotations linked to the cyclophilin-like domain superfamily (IPR029000), among others (IPR002130, IPR024936, IPR020892) resulted in recovery of fourteen OGCs containing CyP-like proteins. Three previously reported B. tabaci s.l. CyP proteins [150] were identified within our set of fourteen OGCs, specifically ‘CyPB’ (KX268377: ‘OG0003660’), ‘CyPD’ (KX268378: ‘OG0000333’), and ‘CyPG’ (KX268379: ‘OG0000792’). An average of 22.5 CyP-like proteins were discovered across all B. tabaci s.l. genomes (ranging from 16 to 28). In contrast only 17 PCGs were identified with CyP-like functionality in T. vaporariorum, which is unable to transmit TYLCV [151].

Amongst CyP gene families, CyPB was shown to have relatively increased gene expression after TYLCV acquisition by whitefly individuals [150], while also playing an important role in both insect development and the transmission of TYLCV by B. argentifolii [138, 150]. Generation of MSA of CyPB ‘OG0003660’ proteins highlighted extremely high sequence conservation within B. tabaci s.l. at ~ 96–99%, which decreased to 75–77% when considering all Aleyrodidae species. Crucial residues involved in CyPB cyclosporin A (CsA) binding and the reported CyPB signature peptidyl-prolyl cis–trans isomerase domain (‘YKGSKFHRVIKDFMIQGG’) had near complete sequence conservation across all whitefly. However, T. vaporariorum uniquely possessed a single substitution to the peptidyl-prolyl cis–trans isomerase domain (Tyr—> Phe) of the CsA binding domain. Of the fourteen CyP-like OGCs identified, only two OGCs ‘OG0009174’ and ‘OG0009922' were identified that appeared to be whitefly-specific, with no discernible sequence homology with any other taxon examined. Proteins in both OGCs had high sequence identity to B. tabaci s.l. PPIase B-like (LOC109043131) and PPIase B1-like (LOC109043200) proteins.

Horizontally transferred genes

Genes originating from both bacteria and fungi are present in B. tabaci s.l. genomes, with 142 identified in the B. argentifolii draft genome [32]. Their functions involved hopanoid/sterol synthesis and xenobiotic detoxification enzymes that may play a role in polyphagy and insecticide resistance. Two adjacent bacterial pantothenate biosynthesis genes, panB and panC, were also reported to have been fused into a single gene that had acquired introns [32]. Protein-coding transcript analysis Additional file 1: Table S16 revealed that the B. tabaci SSA1-SG1 (i.e. SSA1-SG1-Ug, SSA1-SG1-Ng), B. argentifolii and B. tabaci s.l. genomes contained between 117 and 164 horizontally transferred genes (HTGs) of bacterial and fungal origins, respectively (Fig. 9a, b). These HTGs were predicted to encode proteins involved in conserved and different biological functions (Fig. 9c, d).

Fig. 9
figure 9

Bemisia tabaci s.l. horizontally transferred genes (HTGs) of bacterial and fungal origin. Total number of HTGs of: (a) bacterial and (b) fungal origin, derived from protein-coding transcripts. Relative proportion of functional predictions (COG categories) for: (c) bacterial and (d) fungal HTGs. Venn diagram of the distribution of HTG orthologous gene clusters (OGCs) among whitefly genomes for: (e) bacteria and (f) fungi

The functional profiles of HTGs of bacterial origin were similar in these genomes with most assigned to amino-acid transport and metabolism, carbohydrate transport and metabolism, lipid metabolism, transcription, and unknown functions (Fig. 9c). These bacterial HTGs grouped into 84 OGCs, 35 of which were shared between genomes. Twenty-one OGCs were specific to SSA1-SG1 (Fig. 9e). In addition, each genome contained two to five unique OGCs (Fig. 9e).

Similar to the bacterial HTGs, fungal HTGs were also very similar across the three genomes and enriched for functions related to carbohydrate, amino acid, and lipid metabolism, but also for other functions such as replication and recombination (Fig. 9d). Of the 39 OGCs, 18 were shared between genomes, 12 were present only in the SSA1-SG1 and none were specific to B. argentifolii (Fig. 9f). Although there was conservation in broad functional categories for HTGs, these data show that certain HTGs may have been either acquired or retained in a lineage-specific manner. These B. tabaci s.l. genomes were particularly enriched in HTGs, compared to other sap feeding Sternorrhyncha insects [32, 152]. Here, we compared the total number of HTGs and their predicted functions in B. argentifolii to those in B. tabaci SSA1-SG1 (Fig. 9) and found that HTG functional profiles were similar with most HTGs assigned to metabolic functions (Fig. 9b, e).

Despite nearly identical broad functional profiles, the acquisition and/or retention of some HTGs was lineage-specific (Fig. 9e, f). Differences in selective pressure due to abiotic stresses such as temperature, or biotic interactions such as host plant range, natural enemies and symbionts may underlie these differences. For example, the SSA1-SG1 lineage is particularly adapted to feed on cassava plants [31, 112], which may possess different toxic metabolites and/or phloem-sap composition compared to species within the host-plant range of B. argentifolii.

We also analyzed microbial HTGs beyond the comparative functional annotations of the SSA1-SG1 and B. argentifolii genomes. In total 78 (fungal) and 63 (bacterial) HTGs originally published in the genome of B. argentifolii were used as sequence queries for CRBB analysis to gene transcripts generated as part of this study (see Additional file 1: Table S16). Of the total fungal HTGs, surprisingly less than 40% were reliably recovered (pairwise sequence identity [PID] >  = 75%) in the annotations of seven of the eight genomes examined. Reannotation of B. argentifolii resulted in the recovery of only 30 (38.40%) of the original fungal HTGs previously reported for this genome. SSA1-SG1-Ng contained the most fungal HTGs (n = 33), but only fractionally more than the average of 26 observed across all eight genomes. Examination of bacterial HTGs showed similar levels of recovery across all B. tabaci s.l., albeit at a slightly higher recovery rate than for fungal HTGs. We found that on average 43 bacterial HTGs were annotated across all Bemisia populations. B. argentifolii had 47 of the 63 bacterial HTGs originally reported, while again SSA1-SG1-Ng showed the highest overall recovery (n = 48).

Our re-examination of microbial horizontally transferred genes is consistent with the total number of ‘functionally distinct’ proteins previously published for B. argentifolii. We conclude that previous estimates of HTG diversity in B. argentifolii and B. tabaci s.s. genomes are likely to be overestimated and additional experimental investigation is warranted to characterize the precise diversity of Bemisia endosymbiont related HTG genes.

Conclusions

Structural annotation of B. tabaci s.l. genomes applied via a uniform methodology, which excluded any ab-initio prediction component, has resulted in recovery of many fewer total PCG compared to previously published B. tabaci s.l. genomes. Though PCG are observed with only slight variances, to the exclusion of B. tabaci s.s, in the total number of recovered PCG across all eight B. tabaci s.l investigated. The finalized PCG sets of all genomes presented in this study constitute a conservative and well supported evaluation of the B. tabaci s.l. gene space.

Nuclear phylogenomics, supported by ML and BI recovered near universal support of both monophyletic whitefly (Aleyrodidae) and B. tabaci s.l. We show that the non-cassava utilizing species B. tabaci Uganda-1 resides as the earliest branching member of all African Bemisia examined in this study. Major relationships and topological branching order of sampled Bemisia species was again supported by independent analysis of 11 protein-coding mitochondrial genes. Reciprocal crossing experiments further corroborate these topological patterning, supporting the hypothesis that three biological species of B. tabaci s.l. colonize cassava in Africa, also adding to the weight of evidence that B. tabaci s.l. is a group of morphologically, highly cryptic biological species. Examination of average nucleotide identity and topological patterns among the primary obligate endosymbiont Portiera from B. argentifolii, B. tabaci s.s. and newly presented B. tabaci s.l. specifically grouped B. tabaci SSA2 and B. tabaci SSA3, as well as B. tabaci SSA1-Ng and B. tabaci SSA1-Ug together, further supporting their biological species’ statuses. Using an integrative systematics approach, we concluded that the seven, sub-Saharan, African cassava populations of B. tabaci s.l. can be reduced to just three biological species, namely B. tabaci SSA1-SG1 ∪ SG2, B. tabaci SSA2 ∪ SSA3 and B. tabaci SSA1-SG3. The former two biological species occur widely in both East and West Africa, so are most usefully identified currently by names that include (molecular) diagnostic information.

These new Bemisia genomes will make a significant contribution to resources for studies of B. tabaci s.l. comparative evolution and for resolving the number of biological species within this morphologically cryptic group. They shall also assist efforts to unravel the genetics behind species’ differences in their interactions with host plants, invasiveness, and propensity to cause outbreaks. Lastly, this study contributes towards the expansion of valuable genetic information across multiple species of Bemisia further enabling the selection of candidate gene targets for novel whitefly and plant-virus control methods.

Materials and methods

See Additional file 4 for full details of methods employed in this study.

De novo genome assembly

Prior to assembly, input PacBio read data was screened taxonomically to remove contaminants (Additional file 1: Table S2) using a k-mer approach [153, 154]. Initial draft assemblies were generated with Canu v1.8 [37]. A priori Canu consensus correction genome size estimate was set to 650 Mb (genomeSize = 0.650 g). Canu ‘correctedErrorRate’ (0.045) value for consensus read overlap correction was adjusted whereby correctedErrorRate was decreased when genomic coverage was estimated to be > 60X. A genome refinement protocol was applied to each draft assembly and proceeded by initial generation of a ‘polished’ genome consensus via alignment of input read corrected reads using minimap2 [155] and wtdbg2-cns [156]. Redundans [157] was used to account for uncollapsed haplotypic variation. Collapsed haplotype draft references were then further scaffolded with PBSuite [158] and then screened for contamination using blobtools [159, 160]. See Additional file 4 and Additional file 1: Table S3 for full detail on de novo assembly and genome draft refinement.

Structural and functional annotation of B. tabaci s. l. genomes

Genomic annotations of B. tabaci s.l. were generated via the EMBL-EBI Ensembl Gene Annotation pipeline [46]. Ensembl transcript models generated are supported by experimental evidence in the form of aligned short-read RNA-seq data, then verified and filtered using secondary protein-to-genome alignments of proteins from whitefly or closely related species. The Ensembl Gene Annotation pipeline gave precedence to gene models supported by well-aligned transcriptomic evidence [46, 161]. The genomes of B. argentifolii and B. tabaci s.s. were also reannotated (utilizing their original published RNA-Seq datasets). Bacteriocyte-specific RNA-seq libraries were applied to ensure bacteriocyte-specific gene expression contributed to structural annotation. Finalized gene builds contain multi-transcript PCG structures, where each transcript can contain overlapping exon boundaries; excluding any redundant transcripts where the splicing pattern is completely redundant when comparing to a longer model [46]. See Additional file 4 for complete detail of genomic annotation methods. Core databases of the six newly generated B. tabaci s.l. genomes presented herein were first released in April-2021, via Ensembl Metazoa (Ensembl release version e103) (https://metazoa.ensembl.org/). Re-annotations of both B. argentifolii and B. tabaci s.s. are available as a standalone ‘Figshare’ dataset (FastA, GFF3) [162] as part of this study.

Interproscan [163] version 5.40–77.0 (https://github.com/ebi-pf-team/interproscan) was used to generate functional annotations of canonical PCG gene models. Data was then exported in TSV format, then parsed to obtain only gene models that were annotated with Gene Ontology GO terms. This data was then imported into the web service WEGO 2.0 [164, 165]. Tabular data was exported from WEGO 2.0 in CSV format. Graphical data was output in PNG format, selecting only GO:term relationships deemed to be significant. Significance was assessed by the statistical Chi-square (P-value < 0.05) test of independence of expected frequencies of genes with GO terms and their observed frequencies.

Comparative genomics, phylogenomics and target gene family identification

Orthologous sequences of twenty-three arthropod species (n = 23) were identified and clustered via OrthoFinder (v2.4.0) [51, 52] using only canonical PCGs for input (Table 6). Representatives of all six B. tabaci s. l. presented in this study were included, but also the published whitefly B. argentifolii and B. tabaci s.s [32, 33]. utilizing PCGs annotated via Ensembl gene annotation pipeline specifically for their inclusion in this study [162]. Protein-coding sequences for the closely related species T. vaporariorum were downloaded from WhiteflyDB (http://www.whiteflygenomics.org). T. vaporariorum [43] served as the closest outgroup taxon available to the Bemisia genus and was the only other Aleyrodidae family member with a complete genome assembly and annotation available at the time of analysis. See Table 6 for details on species analyzed, their data sources and results of homology clustering in the comparative whole genome analysis.

Protein sequences of OGCs (only OGCs >  = 78.30% of all species considered) were aligned using MAFFT v7.427 [166]. The concatenated alignment was cleaned using Gblocks v0.91b [167]. Evolutionary model testing was performed via ProtTest3 [168], allowing for mixed rate alpha ‘α’ and also invariant sites selection ‘Ι’. Phylogenetic tree reconstruction was performed under ML framework (RAxML v8.2.12) [169] and BI (MrBayes v3.2.7a) [170, 171] methods. The best fitting model of protein evolution (LG + Γ + I + F) was selected in RAxML using the default parameters including ‘-m PROTGAMMAILGF -p 12,345’. A MrBayes run was implemented with ‘lset rates = invgamma ngammacat = 4; prset aamodelpr = fixed(lg)’. BI was performed with four chains for 1,000,000 generations, posterior probability (PP) and consensus topology were summarized using MrBayes ‘sump’ and ‘sumt’; burnin was set to 25% of input sample trees.

Prior to all gene family analyses, a genome-wide screening method was adopted to identify orthologs of gene families of interest. These candidate OGCs were identified via a combination of screening of functional protein/domain annotation obtained from InterProScan v -5.40–77.0 and sequence-based matching using CRBB [163, 172] for full details see Additional file 4.

TE annotation

Transposable element (TE) libraries generated RepeatModeler (https://www.repeatmasker.org/RepeatModeler/) v1.73. To avoid ‘over-masking’ of protein encoding genomic loci, de novo TE repeat libraries were screened with BLASTp and BLASTn MegaBLAST [173, 174]. The protocol for filtering potential protein containing repeat sequences is described here (https://blaxter-lab-documentation.readthedocs.io/en/latest/filter-repeatmodeler-library.html) and was implemented via bash script. Genome-wide copy number variation and overall sequence divergence on TE repeat types was examined using the Perl utility ‘OneCodetofindthemall’ [175] v.1.0 (https://doua.prabi.fr/software/one-code-to-find-them-all), and statistics reported on a per genome scaffold basis. To summarize TE repeat content ‘OneCodetofindthemall’ output was parsed using an in-house Perl script taking ‘.copynumber.csv’ files as input. Kimura divergence scores were extracted from RM generated ‘.out’ output file using the protocol as described [48], using custom Kimura divergence scripts (bash, R, Python) freely available via [48].

Reciprocal cross-mating

B. tabaci s.l populations from Nigeria and Uganda were subjected to reciprocal crossing experiments in sympatric and allopatric crosses. Sympatric populations cohabitated in the same field or had overlapping geographic ranges, while allopatric populations had separate, non-overlapping ranges or were separated by more than 500 miles [31, 59]. The following population combinations were classified as sympatric (i) SSA1-SG1-Ng X SSA3-Ng, (ii) SSA2-Ug X SSA1-SG1-Ug, (iii) SSA2-Ng X SSA3-Ng and (iv) SSA1-SG1-Ng X SSA2-Ng. Allopatric population combinations were (i) SSA1-SG1-Ng X SSA1-SG1-Ug, (ii) SSA1-SG1-Ng X SSA2-Ug and (iii) SSA2-Ug X SSA3-Ng. See [8] and Additional file 4 for details of the crossing methods used.

Portiera genome assembly and comparative genomics

Portiera genomes were obtained using an iterative-assembly approach (flye_iterative.sh available as part of data file deposited in ‘Figshare’ [176]). Briefly, Kraken2 v2.1.2 [154] with the database PlusPF-16 (standard Kraken2 database plus the protozoa, fungi, and plant RefSeq databases, available from (https://genome-idx.s3.amazonaws.com/kraken/k2_pluspf_16gb_20210517.tar.gz) was used to perform an initial selection of Portiera sequences from the pre-processed PacBio sequences (non-taxonomically filtered). Kraken-tools v1.2 was used to extract any classified sequence below the Oceanospirillales rank. Primary assemblies of Portiera sequences were produced with flye [177] v2.8.3-b1695 (–pacbio-raw –genome-size 0.35 m –meta). Primary assemblies were used to recover possible missing Portiera reads with minimap2 v2.17-r941 (-x map-pb) [155] and the pre-processed PacBio sequences. Secondary assemblies were obtained using all sequences recovered with minimap2 and re-assembled with flye (–pacbio-raw –genome-size 0.35 m –meta). Finally, secondary assemblies were polished with fly (-pacbio-raw –genome-size 0.35 m –meta –polish-target –iterations 10).

Polished Portiera assemblies were annotated with prokka v1.14.6 [178] (–gram neg –rfam), using all available Portiera genomes as primary annotation source (–proteins). Obtained annotations were manually inspected in Artemis v1.5 [179]. Annotations and genome sequences provided in GenBank format (.gbk), available as part of a ‘Figshare’ dataset [176] deposited as part of this study.

Orthologous clusters of proteins (OCPs) were computed with OrthoFinder v2.4.0 (-M msa -T iqtree -I 1.5) [51]. Encoding potential of Portiera strains was manually assessed using Portiera MED-Q1 (CP003835) as reference. Genomic synteny among Portiera genomes was plotted with genoPlotR [180] based on 202 core OCPs and the three proteins encoded in the variable region (yidC, mnmE, and mnmG).

Pairwise Average Nucleotide Identity values were calculated with the enveomics toolbox [181] using USEARCH v11.0.667 (ublast -id 0.1 -maxhits 1,000—acceptall -evalue 1e-5 -accel 1) as alignment algorithm [182]. The heatmap and hierarchical clustering (Euclidean distances and complete clustering) were obtained with the gplots package from R [183].