Introduction

The genus Begomovirus (family Geminiviridae) is the largest genus of plant viruses with respect to the number of species that it includes. In fact, with 288 species currently recognized by the International Committee on Taxonomy of Viruses (ICTV) (http://www.ictvonline.org/virusTaxonomy.asp), it is the largest genus of all viral taxonomy. Begomoviruses infect a wide range of dicotyledonous plants, mostly in tropical and subtropical regions of the world. Their circular, single-stranded DNA genomes can be either monopartite or bipartite (with genomic components known as DNA-A and DNA-B), with the two components of bipartite genomes sharing a common region of approximately 200 nucleotides that includes the origin of replication [1]. In the Old World (OW; Africa, Asia, Australasia and Europe), most begomoviruses are monopartite, with a few having a bipartite genome. Begomoviruses native to the New World (NW; the Americas) are almost exclusively bipartite, with only a single monopartite virus having been identified so far [2, 3]. However, a number of monopartite begomoviruses occur in the NW as a result of their recent introduction from the OW [4, 5].

Geminiviruses have characteristically twinned or “geminate” particle morphology. The capsid consists of two joined, incomplete T = 1 icosahedral heads, with 110 molecules of the capsid protein organized as 22 pentameric capsomers [6]. Geminate particles contain a single molecule of circular ssDNA that ranges from 2.5 to 3.0 kilobases (kb) [1]. Therefore, for viruses having a bipartite genome, two particles, each containing a different genomic component (DNA-A and DNA-B), are required to establish infection.

Due to their economic importance as plant pathogens and their small genomes, begomoviruses were among the first plant viruses whose complete genomes were cloned and sequenced [7, 8]. By January 2014, more than 3500 full-length begomovirus sequences had been deposited in public databases. Even during the early days of full-genome sequencing, the increasingly large numbers of begomovirus sequences being determined worldwide made it clear that these viruses are abundant and widespread, and that they display a significant degree of genetic diversity [9]. Also, it created the opportunity for the development of a sequence-based taxonomy that relied primarily on pairwise sequence comparisons [10]. Such a system has been in place for the Geminiviridae since the mid-1990s, and it has been remarkably stable. It was also widely embraced by the begomovirus community, mostly due to its simplicity and ease of use. Similar classification systems have been adopted by a number of ICTV study groups, including those concerned with the Anelloviridae and the Circoviridae.

As useful as it has been to establish and streamline taxonomic communications, begomovirus taxonomy is not without controversy. Several criticisms have been voiced in the literature (one recent example being ref. [11]) and by the ICTV Executive Committee (EC), which rejected the Geminiviridae Study Group’s taxonomic proposals for creating new begomovirus species in 2010 and 2011. The main points of contention can be summarized as follows: (i) the creation of “too many” species in the genus; (ii) the recognition of new species based solely on sequence comparisons of members, without taking into consideration the biological properties of the viruses; and (iii) the establishment of species demarcation criteria that were “too relaxed” compared to those for other genera in the family, thus leading to point i. Moreover, and as pointed out by the Geminiviridae Study Group (SG) itself in the recent Mastrevirus and Curtovirus taxonomy revisions [12, 13], pairwise sequence identities for any particular pair of sequences may be calculated in different ways and therefore can result in differences in identity scores depending on the algorithm employed. Such discrepancies have made it highly desirable to establish a standard procedure to perform pairwise alignments and to calculate identity scores in order to eliminate (or at least minimize) taxonomic uncertainties and/or misplacements.

The concerns raised by the ICTV EC regarding begomovirus taxonomy encouraged the Geminiviridae SG to perform a comprehensive re-evaluation of the species demarcation criteria for the genus Begomovirus. The results of this re-evaluation have demonstrated that the current pairwise-identity-based taxonomy is sound, that it accurately reflects the biology of begomoviruses, that it will be stable, and that it will be easy to understand and to be adopted by geminivirologists worldwide. Here, we present the specific guidelines for the classification of begomoviruses, following those recently published for the genera Mastrevirus [12] and Curtovirus [13] of the family Geminiviridae.

A comprehensive analysis of the species demarcation criteria for the begomoviruses

Since a significant proportion of begomoviruses do not have a cognate DNA-B component, this component is not considered for species demarcation. A total of 3,123 full-length begomovirus genomic sequences (or DNA-A sequences) were downloaded from the NCBI-GenBank database on 31 Dec 2012. They corresponded to viruses belonging to 283 species according to the currently accepted 89 % species demarcation criterion (for comparison, see the 9th Report of the ICTV, which lists 192 species in the genus [1]). To reduce computing time, only the oldest sequences (full-length genomes or DNA-A components) from groups of sequences that shared >99.5 % genome-wide nucleotide (nt) sequence identity were included in the analysis. To the best of our knowledge, the analysis included sequences of members of all ICTV-recognized species and unclassified begomoviruses for which at least one full-length sequence was available in GenBank at that time (for many, there were multiple sequences per virus/strain). Using this data set (1,826 sequences), a preliminary phylogeny using the neighbor-joining (NJ) method was constructed (data not shown). The purpose of the NJ phylogenetic analysis was not to construct a definitive phylogeny but rather to identify groups of most closely related sequences that could be combined for pairwise sequence comparisons and maximum-likelihood (ML) phylogenetic analyses.
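The redundancy-reduction step described above can be sketched as a greedy filter. This is only an illustration: the record layout, the `identity` callback, and the use of a single date per sequence to decide which one is "oldest" are assumptions, not details given in the text.

```python
from datetime import date
from typing import Callable, List, Tuple

# (sequence name, date used to decide which sequence is oldest); the choice
# of date (deposition vs. sampling) is an assumption of this sketch.
Record = Tuple[str, date]


def keep_oldest_representatives(
    records: List[Record],
    identity: Callable[[str, str], float],
    threshold: float = 99.5,
) -> List[str]:
    """Greedy filter: walk sequences from oldest to newest and keep one only
    if it is not more than `threshold` percent identical to any kept sequence."""
    kept: List[str] = []
    for name, _ in sorted(records, key=lambda record: record[1]):
        if all(identity(name, other) <= threshold for other in kept):
            kept.append(name)
    return kept


# Hypothetical toy data: s1 and s2 are near-identical, s3 is distinct.
TOY_IDENTITIES = {
    frozenset(("s1", "s2")): 99.8,
    frozenset(("s1", "s3")): 97.0,
    frozenset(("s2", "s3")): 97.1,
}


def toy_identity(a: str, b: str) -> float:
    return TOY_IDENTITIES[frozenset((a, b))]


TOY_RECORDS = [
    ("s1", date(2001, 1, 1)),
    ("s2", date(2005, 1, 1)),
    ("s3", date(2003, 1, 1)),
]
```

With the toy data, `keep_oldest_representatives(TOY_RECORDS, toy_identity)` keeps `s1` and `s3` and drops `s2`, the younger member of the near-identical pair.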

Based on the NJ phylogenetic tree, 38 groups were identified, each of which contained sequences that did not obviously correspond to the same viral species but also did not obviously correspond to distinct species. This approach was employed to more easily delineate distinct groups. Some groups consisted of as few as 2-3 sequences, whereas others were represented by >30 sequences (Supplementary File S1). Pairwise sequence comparisons were carried out separately for each of the 38 groups, using Sequence Demarcation Tool (SDT) v. 1.0 [14] with the MUSCLE [15] alignment option. Also, ML phylogenetic trees were inferred for each group using the PHYML3.0 method implemented in MEGA 5.2 [16] with the GTR+I+G nucleotide (nt) substitution model, and branch support was tested with 3,000 bootstrap iterations.

Simulations were performed based on the results of pairwise sequence comparisons, using different cutoff values (rounded to the nearest full percentile) to delineate potential species so as to determine which sequences corresponded to virus isolates belonging to the same species, and which were isolates of distinct species. For this, we looked for the optimum cutoff value that placed each sequence into a given species without “outliers” (sequences that displayed identity levels above the cutoff value with two or more species).
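The outlier count at each candidate cutoff can be illustrated with a small sketch. The text does not specify how provisional species were formed at each cutoff, so the greedy, founder-based grouping used here is purely an assumption (and depends on input order). All names and identity values are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List

# Pairwise identities, already rounded to whole percents, keyed by name pair.
IdentityTable = Dict[FrozenSet[str], int]


def identity(table: IdentityTable, a: str, b: str) -> int:
    return table[frozenset((a, b))]


def provisional_species(names: List[str], table: IdentityTable, cutoff: int) -> Dict[str, int]:
    """Assumed grouping rule: a sequence joins the first group whose founder
    it matches at >= cutoff; otherwise it founds a new group."""
    founders: List[str] = []
    label: Dict[str, int] = {}
    for name in names:
        for index, founder in enumerate(founders):
            if identity(table, name, founder) >= cutoff:
                label[name] = index
                break
        else:  # no founder matched
            founders.append(name)
            label[name] = len(founders) - 1
    return label


def count_outliers(names: List[str], table: IdentityTable, cutoff: int) -> int:
    """Count sequences whose >= cutoff matches span two or more groups
    (the sequence's own group is included in the count)."""
    label = provisional_species(names, table, cutoff)
    outliers = 0
    for name in names:
        groups_hit = {label[name]}
        for other in names:
            if other != name and identity(table, name, other) >= cutoff:
                groups_hit.add(label[other])
        if len(groups_hit) >= 2:
            outliers += 1
    return outliers


def scan_cutoffs(names: List[str], table: IdentityTable, cutoffs: Iterable[int]) -> Dict[int, int]:
    return {cutoff: count_outliers(names, table, cutoff) for cutoff in cutoffs}


# Hypothetical identities for four sequences.
NAMES = ["A", "B", "C", "D"]
TABLE: IdentityTable = {
    frozenset(("A", "B")): 96, frozenset(("A", "C")): 92, frozenset(("B", "C")): 93,
    frozenset(("A", "D")): 85, frozenset(("B", "D")): 88, frozenset(("C", "D")): 87,
}
```

For this toy table, `scan_cutoffs(NAMES, TABLE, range(86, 95))` gives no outliers for cutoffs of 89 to 92 and for 94, and two or three outliers at 86, 87, 88 and 93. A real analysis would then pick the cutoff with the fewest outliers across all groups.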

Analysis of all 38 groups indicated that the best nt sequence identity cutoff value to separate isolates from different species was 91 %. This value is proposed here as the new species demarcation criterion for viruses of the genus Begomovirus using the outlined methodology. Implementing this value yielded the lowest number of outlier sequences compared to any other value within the range of 86 % to 94 % nt sequence identity. The cutoff for strain demarcation is 94 %. Parameters used for comparison are crucial. It is important to note that percent nt sequence identities must be calculated from true pairwise sequence alignments, with the exclusion of sites with gap characters. Ideally, the SDT software that is freely available [14] (http://web.cbio.uct.ac.za/SDT) should be used, as it was developed specifically for this purpose.
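To make the identity calculation concrete, the following minimal sketch computes percent identity for one pair of sequences, excluding every site that contains a gap character. It assumes the pair has already been aligned (for example with MUSCLE) and is not the SDT implementation. The function name and example sequences are hypothetical.

```python
def identity_excluding_gaps(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned sites where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # sites with a gap in either sequence are excluded
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped sites to compare")
    return 100.0 * matches / compared


# Two gapped sites are skipped; 7 of the remaining 8 sites match.
EXAMPLE = identity_excluding_gaps("ACGT-ACGTA", "ACCTTACG-A")  # 87.5
```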

Phylogenetic support was found to be robust for all new species analyzed across the 38 groups. The 91 % cutoff value is actually quite conservative, as is indicated by the trees for groups 3, 5, 7, 11, 16, 30 and 33 (Supplementary File S1). However, several groups (1, 2, 6, 27, 34 and 36; Supplementary File S1) required additional consideration because the pairwise sequence comparisons and phylogenetic results are conflicting, possibly due to recombination.

Dealing with outliers

The revised taxonomic framework yielded a small number of outliers. Nevertheless, as the number of sequenced begomoviral genomes continues to increase, additional “conflicting” sequences will become evident. To address this problem, we propose the adoption of the approach described for viruses of the genera Mastrevirus [12] and Curtovirus [13]. In this light, the four possible conflicts are as follows:

  1. An isolate having ≥91 % identity (full-length genome or DNA-A component) to isolates assigned to two (or more) species.

  2. An isolate having ≥91 % identity to one or a few isolates from a particular species, even though it shares <91 % identity with the majority of isolates in that species.

  3. An isolate having ≥94 % identity to isolates of two (or more) strains of a given species.

  4. An isolate having ≥94 % identity to one or a few isolates from a particular strain, even though it shares <94 % identity with the majority of isolates from that strain.

The corresponding conflict-resolution criteria are as follows:

  1. The new isolate should be considered to belong to the species that includes the isolate with which it shares the highest percentage of pairwise identity (full-length genome or DNA-A component).

  2. The new isolate should be classified as belonging to the species with which it shares ≥91 % nt sequence identity with any one isolate from that species, even if it is <91 % identical to all other isolates from that species.

  3. The new isolate should be considered to belong to the strain that includes the isolate with which it shares the highest percent identity.

  4. The new isolate should be classified as belonging to the strain with which it shares ≥94 % nt sequence identity with any one isolate from that strain, even if it has <94 % identity to all other isolates from that strain.

Naturally, any working cutoff value established for viruses, particularly when rapid divergence is occurring (as appears to be the case for begomoviruses), will yield a number of outliers. By adopting these four conflict-resolution criteria, all outliers identified so far could be readily placed into an extant species group.

Exceptions to these rules can include recombinant viruses such as tomato yellow leaf curl Malaga virus (TYLCMaV) and tomato yellow leaf curl Axarquia virus (TYLCAxV), which have ≥91 % identity to both parental viruses (tomato yellow leaf curl virus, TYLCV, and tomato yellow leaf curl Sardinia virus, TYLCSV), thus leading to conflict #1 and causing the two parental species to merge into a single species, even though all isolates of the parental viruses have <91 % identity. Such recombinant viruses will have to be examined on a case-by-case basis for species assignment.
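The four conflict-resolution criteria reduce to one rule applied at two levels: assign the isolate to the taxon containing its single best-matching isolate, provided that match reaches the cutoff. A minimal sketch of that rule follows. The function name, data layout, and taxon labels are hypothetical. The sketch makes no attempt to handle the recombinant exceptions, which require case-by-case examination.

```python
from typing import List, Optional, Tuple

SPECIES_CUTOFF = 91  # percent, from the text
STRAIN_CUTOFF = 94   # percent, from the text


def resolve(hits: List[Tuple[str, int]], cutoff: int) -> Optional[str]:
    """Return the taxon of the best-matching isolate, or None if no isolate
    reaches the cutoff.

    hits: (taxon label, rounded percent identity to one isolate of that taxon).
    Ties are not addressed in the text; here the first best hit listed wins.
    """
    qualifying = [hit for hit in hits if hit[1] >= cutoff]
    if not qualifying:
        return None
    return max(qualifying, key=lambda hit: hit[1])[0]


# Conflict 1: >=91 % to isolates of two species; the highest identity decides.
CONFLICT_1 = [("Species X", 92), ("Species X", 88), ("Species Y", 93), ("Species Y", 85)]
# Conflict 2: >=91 % to a single isolate of a species is sufficient.
CONFLICT_2 = [("Species X", 91), ("Species X", 89), ("Species X", 88)]
```

Here `resolve(CONFLICT_1, SPECIES_CUTOFF)` returns `"Species Y"` and `resolve(CONFLICT_2, SPECIES_CUTOFF)` returns `"Species X"`. The same function called with `STRAIN_CUTOFF` on per-strain hits covers conflicts 3 and 4.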

The new species demarcation criterion of <91 % nt sequence identity (for full-length genomes or DNA-A components) is more stringent than the previously used 89 %

At first, the higher value, at 91 %, compared to the previously implemented working cutoff of 89 %, may give the impression of a more relaxed species demarcation scenario that might delineate an even greater number of begomovirus species. However, this is not the case. Rather, the pairwise cutoff value at 91 % is a consequence of the implementation of a more robust approach (now standardized for the entire family Geminiviridae) for calculating pairwise identities: true pairwise alignments (compared to global alignment-based pairwise identities) without gaps. This proved to be more stringent than previous approaches based on multiple sequence alignments with gaps treated as a fifth character, which yielded a working cutoff of 89 %.
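A toy calculation shows why cutoffs obtained under different gap treatments are not directly comparable. The sketch below scores the same aligned pair twice: once excluding gapped sites, and once treating the gap as a fifth character that mismatches any nucleotide. It illustrates only the gap treatment, not the difference between pairwise and multiple alignment. The sequences and the function name are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gaps_as_fifth_state: bool) -> float:
    """Percent identity of an aligned pair under one of two gap treatments."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        has_gap = x == "-" or y == "-"
        if has_gap and not gaps_as_fifth_state:
            continue  # gapped sites are excluded from the comparison
        compared += 1  # otherwise '-' is compared like any other character
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no sites to compare")
    return 100.0 * matches / compared


SEQ_A = "ACGTACGTAC--GTACGTAC"
SEQ_B = "ACGTACGTACTTGTACGTAA"

WITHOUT_GAPS = percent_identity(SEQ_A, SEQ_B, gaps_as_fifth_state=False)  # 17 of 18 sites
FIFTH_STATE = percent_identity(SEQ_A, SEQ_B, gaps_as_fifth_state=True)    # 17 of 20 sites
```

The pair scores about 94.4 % when gapped sites are excluded but 85.0 % when gaps count as mismatches, so a numerically higher cutoff under the first scheme does not by itself imply a more permissive criterion.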

One group of begomoviruses that has been affected the most by applying the revised analysis is the “sweepovirus” group, a divergent clade of whitefly-transmitted geminiviruses that infect sweet potato and wild species in the Convolvulaceae. Previously, the group was proposed to include 17 species [17]. The new system reduces the number of species by more than half, delineating 8 species (Table 1; Fig. 1).

Fig. 1

(A) Pairwise sequence comparisons and (B) maximum-likelihood phylogenetic tree of sequences comprising the “sweepovirus” group. Sequences corresponding to the same species based on a 91 % cutoff (using the parameters described in the main text) are highlighted in the same color

Table 1 List of begomovirus species, as of January 2015. Species names are shown in bold italics, and isolate names are given in regular font. For species that do not have any known strains, only one isolate is listed, and that isolate is recognized as the “type” isolate. For species that have known strains, one isolate from each strain is shown, and the type isolate is the first one listed. Sequence accession numbers and assigned abbreviations are also listed. An expanded table including all begomovirus isolates in GenBank is available for download at the ICTV website (talk.ictvonline.org/ictv_wikis/m/files_begomo/default.aspx)

Results of pairwise sequence comparisons accurately reflect the biology of begomoviruses

It has been claimed that begomoviral species are artificial because they are arbitrarily defined based on sequence alone, and therefore their biological characteristics have been ignored [11]. This is a misconception. Sequence-based taxonomy is possible only because it relies on the knowledge of the biological properties of these well-studied viruses. Therefore, sequence comparisons among related begomovirus isolates can accurately reflect differences in their biology. Several examples can be drawn upon to argue this point. One well-known example involves bean golden mosaic disease, an important disease of bean crops in Latin America. The disease is caused by at least two distinct, well-characterized begomoviruses, bean golden mosaic virus (BGMV), which occurs in Brazil and Argentina, and bean golden yellow mosaic virus (BGYMV), which occurs in Central/North America and the Caribbean Basin [18]. The symptoms of the disease are nearly indistinguishable, the whitefly vector species is the same for both pathogens, and the economic importance with respect to crop loss is comparable as well. In fact, initially, the same begomoviral etiology was suspected for the disease occurring in the two regions. However, when the causal agents from plants collected in Puerto Rico (USA) and Brazil were sequenced, the results indicated that they had substantially different genome sequences [19, 20]. Later, it was demonstrated that the two agents differed in at least one relevant biological property: tissue tropism. BGMV is phloem-restricted in beans, while BGYMV is not [20, 21]. Thus, the species cutoff based on sequence alone was accurate and reflected the biological differences between the viruses belonging to these two species.

The most obvious benefit of the SDT-based pairwise identity analysis is that fewer species and strains fall at the interface between the cutoff and the next lower or higher percent nt sequence identity. Applying the proposed 91 % cutoff therefore makes species assignments more reliable.

Why so many begomoviruses?

As noted above, the genus Begomovirus includes the largest number of species of all currently established genera, with 288 species currently recognized by the ICTV. Why so many begomoviruses? Are these species “artificial”, the result of flawed taxonomic demarcation criteria? The existence of this large number of species can be explained by natural order relationships based on the characteristics of this genus that set it apart from many other viral genera.

Begomoviruses are transmitted by members of a cryptic species complex, Bemisia tabaci (Genn.) (Hemiptera: Aleyrodidae), which is distributed worldwide and colonizes a wide array of plants belonging to species in many families [22–25]. B. tabaci has emerged as a major threat to agricultural systems in many regions of the world since the 1970s and 1980s [26, 27]. Reports of unprecedented B. tabaci infestations have characteristically resulted in outbreaks of previously undescribed begomoviruses and the apparent disappearance of others from cultivated plants [28]. Because B. tabaci colonizes so many plant species [25], it potentiates the transfer of begomoviruses between non-cultivated and cultivated hosts (which are most studied by plant virologists). While it is beyond the scope of this proposal to fully explore the hypothesis that most begomoviruses isolated from cultivated hosts likely evolved from viruses originally adapted to infecting non-cultivated hosts, this hypothesis could explain, at least in part, why there are so many more begomovirus species than are found for other virus genera where virus-host-vector interactions are more evolutionarily ancient.

Another consideration is that virologists working with ssDNA viruses have gained a powerful new tool in the form of rolling-circle amplification (RCA), a method that allows for rapid, sequence-independent sampling of virus populations. The impact of RCA in the field of geminivirology cannot be overstated (for example, see ref. [29]). Using RCA, it is possible to amplify and recover the complete genome of almost any begomovirus from minute amounts of total plant DNA extracted under suboptimal conditions [30]. Presently, tissue samples can be collected, dried, and stored for months or years at room temperature, and thousands of complete begomovirus genomes can be readily amplified using RCA following a quick DNA extraction [31, 32]. In the 1990s it would take months to clone one full-length begomovirus genome, whereas hundreds of genomes can now be cloned in a matter of weeks. Furthermore (and equally relevant), because RCA uses random primers, it reduces sequence amplification biases and enables the detection of most or all unique genome molecules present in a sample. As a result, new begomoviruses and other, often highly divergent, geminiviruses have been discovered that will probably lead to the recognition of additional genera in the family (and perhaps new families as well) [33–36]. Also, this new technology has prompted a significant increase in the numbers of novel begomoviruses that are being sought, and found, in non-cultivated plants.

Finally, it should be pointed out that the extent of diversity currently recognized within this genus (and possibly for all viruses) represents only the tip of the iceberg. Metagenomic approaches are rapidly becoming affordable and will probably lead to the discovery of viruses belonging to hundreds of new genera and families, not to mention species [37]. Their impact on geminivirus discovery has already been felt [38–42].

Different cutoff values must be used for the different genera in the family Geminiviridae

The approach implemented herein to demarcate species in the genus Begomovirus is identical to that used and approved by the ICTV for the other genera in the family [12, 13, 36]. However, for each genus, the working cutoff for species demarcation differs, even though the method applied to determine these cutoffs has been the same. For example, mastrevirus species are demarcated using a 78% cutoff. The 78% species cutoff value for the mastreviruses is demonstrated by the pairwise distance distribution plot (Fig. 2A), in which a clear valley is apparent at 78%. Such a valley is not readily evident in the equivalent plot for the begomoviruses (Fig. 2B and C), leading us to analyze this genus using groups of sequences. The analysis reported herein supported a 91% cutoff value for begomoviruses as that which best separates the species in the genus, and this is well supported by the SDT analysis.

Fig. 2

Distribution of the full-genome pairwise sequence identity scores for members of the genera (A) Mastrevirus and (B, C) Begomovirus (C corresponds to a higher resolution of the shaded region in B). Note the valley (or gap) corresponding to the 72–78 % frequencies in the Mastrevirus plot and the absence of significant valleys in the Begomovirus plot

Several other families and genera have species demarcation thresholds similar to that of the begomoviruses, including Parvovirus (95 %), Microviridae (80–85 %) and Sobemovirus (60–85 %). It is perhaps troubling that a uniform approach for computing species thresholds does not exist across all of viral taxonomy at this time. Currently, various study groups use different algorithms for specific genes, sets of genes, or complete genomes. Further, complete genome sequences are lacking for viruses of many species, particularly those with large genomes. In those instances, the trees represent a gene tree instead of a virus tree, which can create misconceptions about viral genome structure and lead to incorrect evolutionary inferences. It will be interesting to see if our approach may be useful for other viral families.

A step-by-step guide for classifying new begomovirus isolates as members of species or strains

To facilitate the taxonomic placement of newly discovered begomoviruses and to assist in the standardization of this procedure, the following guidelines are proposed for classification into species and strains:

  1. A BLASTn analysis of the “non-redundant nucleotide” database should be performed to identify the species whose members have sequences most similar to the new sequence. The nucleotide sequence database at the NCBI website (http://www.ncbi.nlm.nih.gov/nuccore/) can also be searched using the search term “txid10814 [Organism: exp] AND 2500:3500[SLEN]”, which will return all begomovirus nucleotide sequences that are between 2500 and 3500 nucleotides long.

  2. The new sequence should be added to a dataset of full-length genomes or DNA-A components created based on the BLAST results, and saved in FASTA format. All sequences must start at the same genomic coordinate (the first nucleotide after the nicking site within the conserved nonanucleotide at the origin of replication is the recommended standard).

  3. The MUSCLE option in SDT v1.2 (freely available at http://www.cbio.uct.ac.za/SDT) or any other program that uses the MUSCLE alignment algorithm with pairwise deletion of gaps should be used to calculate identities between every pair of sequences in the dataset. If using SDT, these pairwise identities may be saved in either a column or matrix csv format that can then be viewed in a spreadsheet program such as Microsoft Excel. Percent identities must be rounded to the nearest full percentile.

  4. If the new sequence shares <91 % genome-wide pairwise identity with all other known begomovirus sequences, appropriate species and virus names should then be proposed (see below for guidelines on doing so).

  5. If the new sequence shares ≥91 % genome-wide pairwise identity with at least one isolate of an existing species but <94 % with all isolates described for that species, a strain name should then be proposed.
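Steps 3 to 5 above can be sketched as a small decision function. The `Hit` record, the returned strings, and the half-up rounding rule are assumptions of this sketch, since the guidelines do not say how an exact .5 value should be rounded. The identities are expected to come from SDT or an equivalent pairwise comparison.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hit:
    species: str     # species of a previously classified isolate
    strain: str      # strain of that isolate
    identity: float  # unrounded percent identity to the new sequence


def round_percent(value: float) -> int:
    """Round to the nearest whole percent (half-up; the tie rule is assumed)."""
    return math.floor(value + 0.5)


def classify(hits: List[Hit], species_cutoff: int = 91, strain_cutoff: int = 94) -> str:
    if not hits:
        raise ValueError("at least one comparison is required")
    best = max(hits, key=lambda hit: round_percent(hit.identity))
    best_score = round_percent(best.identity)
    if best_score < species_cutoff:
        return "new species"  # step 4: below 91 % to every known sequence
    if best_score < strain_cutoff:
        # step 5: the best hit lies in the assigned species, so every
        # isolate of that species is below 94 %
        return f"new strain of {best.species}"
    return f"isolate of {best.species}-{best.strain}"


# Hypothetical comparisons for three new sequences.
CASE_NEW_SPECIES = [Hit("Species A", "S1", 90.4), Hit("Species B", "S1", 85.0)]
CASE_NEW_STRAIN = [Hit("Species A", "S1", 90.5), Hit("Species B", "S1", 85.0)]
CASE_KNOWN = [Hit("Species A", "S1", 96.2), Hit("Species A", "S2", 92.0)]
```

The first two cases differ by only 0.1 percentage points yet fall on opposite sides of the species cutoff after rounding, which shows why the rounding rule should be applied consistently.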

Guidelines for naming new species that include newly discovered begomoviruses

Virus species name

This is the ICTV-accepted name of a group of begomoviruses sharing ≥91 % pairwise sequence identity for the full-length genome or DNA-A component. If the sequence has <91 % sequence identity to all begomoviruses previously classified as members of distinct species, the virus should be considered a member of a new species, and a unique name that is not currently in use for any ICTV-recognized species should be assigned. This name should follow the template “Host symptom virus” (e.g., Bean golden mosaic virus). Although it was common practice for begomoviruses, the Geminiviridae SG recommends that country, city, town, village or province names not be used in naming new viruses and new viral species (e.g., Tomato yellow leaf curl Thailand virus), as this may cause misunderstandings when a virus named after a country or city is subsequently found in other locations within that country or in other countries. (Previously accepted names using this practice will not be changed to avoid conflicts in the literature.)

Strain name

Based on ICTV guidelines, there is no practical or standardized approach for differentiating or naming strains (or any other category below the species level). In fact, item 3.3 of the ICTV Statutes states that “The ICTV is not responsible for classification and nomenclature of virus taxa below the rank of species.” Nevertheless, the Geminiviridae Study Group has adopted its own guidelines for strain differentiation and nomenclature [43], although there is no formal requirement to do so. Our new analysis indicated a sequence identity threshold of 94 % for strain demarcation.

Ideally (when knowledge is available), strains should follow a nomenclature that reflects biological differences between the members of the same species. For example, if it is established that a number of BGMV isolates comprising a distinct strain are capable of infecting a host (e.g., lima bean) that other BGMV isolates do not normally infect, it would then be appropriate to name the strain BGMV-Lima bean. Likewise, symptom severity descriptors (e.g., Tomato golden mosaic virus-Yellow vein) could also be used. In either case, such strain names should be used only when the phenotype is observed in multiple isolates of the same strain. As recommended for species names, country, city, town, village or province names should not be used in naming new strains. Strain name follows species name separated by a hyphen (“-”).

Isolate descriptor

Following the species/strain name, and within square brackets (“[ ]”), the isolate descriptor may contain any number of sub-fields separated by hyphens. Although the 9th ICTV Report’s recommendations for geminivirus nomenclature [1] suggested the use of colons (“:”) to separate sub-fields in the isolate descriptor, this can cause problems in various phylogenetic-tree-drawing programs, which, when reading phylogenetic trees in Newick format, will misinterpret numbers after the colon as representing branch length information.

The first sub-field should always be the two-letter international code of the country/territory in which the isolate was sampled (Supplementary Table S2), whereas the last sub-field should always be the year in which the isolate was last present within living tissue. If the year in which the isolate was sampled differs from the date on which it was last present within living tissue (e.g., when isolates are propagated in the laboratory), the date when the isolate was sampled should then be included as an internal sub-field. Between the first and last sub-fields, any additional descriptors can be used (for example, the laboratory code of the sample from which the isolate was obtained, the city nearest to the place where the sample was obtained, the host species from which the virus was isolated).
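The naming layout described above can be sketched as a small formatter that places the country code first and the year last, and rejects colons because of the Newick problem noted earlier. The function name, the absence of a space before the bracket, and the example laboratory code and year are assumptions made for illustration.

```python
from typing import Optional, Sequence


def isolate_name(
    virus: str,
    country_code: str,
    year: int,
    strain: Optional[str] = None,
    extra: Sequence[str] = (),
) -> str:
    """Build 'Virus-Strain[CC-extra-...-year]' with hyphen-separated sub-fields."""
    if not (len(country_code) == 2 and country_code.isalpha() and country_code.isupper()):
        raise ValueError("country code must be two upper-case letters")
    fields = [country_code, *extra, str(year)]  # country first, year last
    for field in fields:
        if ":" in field:
            # a colon would be read as a branch-length delimiter in Newick trees
            raise ValueError(f"sub-field {field!r} must not contain a colon")
    name = virus if strain is None else f"{virus}-{strain}"
    return f"{name}[{'-'.join(fields)}]"


# Hypothetical isolate: laboratory code 'XYZ1', year 2014.
EXAMPLE_NAME = isolate_name(
    "Bean golden mosaic virus", "BR", 2014, strain="Lima bean", extra=["XYZ1"]
)
```

For the hypothetical inputs shown, the result is `Bean golden mosaic virus-Lima bean[BR-XYZ1-2014]`.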

Conclusions

Since the 1990s, begomovirus taxonomy has been based primarily on sequence comparison methods. In this regard, it was pioneering, helped by the large number of full-length sequences available, and allowed for a robust statistical treatment of the data. Although this approach has been criticized for not taking biology into account, a closer look into the recognized species will show that biology is accurately reflected in the taxonomy. This revision demonstrated the robustness and the reliability of a sequence-based taxonomy, and this was acknowledged by the ICTV with the positive outcome of the latest taxonomy proposal and the establishment of new begomovirus species (http://talk.ictvonline.org/files/ictv_official_taxonomy_updates_since_the_8th_report/m/plant-official/4838.aspx). It should be noted, however, that the ICTV Geminiviridae SG has no ulterior motivation for continuing to propose new species. Rather, the number of new species proposed is a genuine reflection of our increasingly effective methods to conceptualize the natural genetic variability of this remarkable group of viruses. By establishing clear guidelines for the analysis of full-length genomic sequences, and following standardized nomenclature for the naming of newly established species and strains, the intention of the ICTV Geminiviridae SG is that these changes will improve taxonomic communications among users while also leaving open options for further improvements in the future that will serve the geminivirus community at large.