Introduction

The family Geminiviridae includes insect-transmitted, plant-infecting viruses with circular, single-stranded DNA (ssDNA) genomes that are encapsidated within geminate particles. The genus Mastrevirus of this family consists of viruses that are transmitted by leafhoppers and have a single genome component with a conserved arrangement of three genes (encoding a movement protein, a coat protein and two versions of a replication-associated protein) and two non-coding regions (the large and small intergenic regions).

A variety of International Committee on Taxonomy of Viruses (ICTV)–endorsed guidelines currently exist for the classification and naming of new mastreviruses [7, 13–15]. Primary among these guidelines is the application of carefully selected genome-wide pairwise sequence identity thresholds, either to assign newly determined mastreviruses to existing species or strains, or as the basis for proposing that newly determined sequences correspond to new species or strains.

There are, however, quite a few different ways in which pairwise nucleotide sequence identities could be calculated and, importantly, the identity score that one gets for any given pair of sequences can vary quite substantially depending on exactly how it was calculated. One obvious example of how pairwise identity scores could be either inflated or deflated involves the treatment of gap characters that are inserted during the sequence alignment process. If gap characters are treated as a fifth possible nucleotide state, positions in the sequence alignment where one sequence has a nucleotide but the other does not will be (and quite reasonably so) counted as a mismatch. There exists a problem, however, in determining, firstly, how much such mismatches should count relative to standard nucleotide mismatches and, secondly, how much each gap character in runs of several gaps should count relative to isolated gaps. Side-stepping this problem altogether by simply ignoring all alignment sites at which one or the other sequence has a gap character is the approach of choice when, for example, calculating genetic and evolutionary distances in applications such as phylogenetic tree construction [16, 35]. Whereas ignoring positions where one sequence has a gap and the other does not will inflate pairwise identity scores, evenly scoring every one of these sites as a “normal” nucleotide mismatch will deflate identity scores over methods that include runs of gaps as a single mismatch.
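The effect of gap handling on the identity score can be illustrated with a minimal Python sketch. It is not taken from any existing tool: the function name `gap_aware_identities`, the three scheme labels and the toy aligned sequences are all hypothetical, and "identity" is taken to be matches divided by the number of scored positions.

```python
def gap_aware_identities(a: str, b: str) -> dict:
    """Score one aligned pair under three gap-handling schemes."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gap_columns = gap_runs = 0
    run_owner = None  # which sequence holds the gap run currently being read
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # shared gap (from a multiple alignment): uninformative
        if x == "-" or y == "-":
            owner = "a" if x == "-" else "b"
            gap_columns += 1
            if owner != run_owner:
                gap_runs += 1  # a new run of gaps starts here
            run_owner = owner
            continue
        run_owner = None
        if x == y:
            matches += 1
        else:
            mismatches += 1
    return {
        # columns with a gap are dropped entirely
        "gaps_ignored": matches / (matches + mismatches),
        # each run of consecutive gaps counts as a single mismatch
        "gap_run_as_one_mismatch": matches / (matches + mismatches + gap_runs),
        # every gap character counts as a full mismatch
        "every_gap_a_mismatch": matches / (matches + mismatches + gap_columns),
    }


# Toy alignment: 8 matches, 1 mismatch, 3 gap columns in 2 runs.
scores = gap_aware_identities("ACGTACGT--AC", "ACGAAC-TGGAC")
```

For this toy pair the three schemes give 8/9, 8/11 and 8/12, respectively, so the same alignment yields identity scores ranging from about 89 % down to about 67 %.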

An additional important factor that can cause fluctuations in pairwise identity scores of a given pair of sequences is the method used for sequence alignment. An optimal pairwise alignment of the sequences will generally yield a higher pairwise identity score than if the sequences were aligned within the context of a multiple sequence alignment that includes one or more additional sequences. Also, as the number and diversity of sequences in a multiple sequence alignment increases, so it is expected that the pairwise identity scores of any individual pair of sequences in the alignment will decrease [25]. What this means is that pairwise identity scores will tend downwards with increasing alignment size. Finally, different multiple sequence alignments generated either by different multiple sequence alignment programs (such as ClustalV, ClustalW, MAFFT or MUSCLE), or by a single program with different alignment settings (such as gap open and extension penalties) will not all be equally accurate [11, 12, 25, 44, 55].

Despite these various issues, pairwise-identity-based virus classification criteria are extremely popular amongst virus taxonomists and are likely to grow in importance due to how easy they are to use and the fact that, once properly validated, they accurately reflect the biology of these organisms [2, 32, 33]. We have therefore devised a pairwise-identity-based approach for mastrevirus classification that almost completely removes all the alignment and gap-handling problems of the current ICTV-endorsed mastrevirus classification protocol. We apply an approach that is almost identical to that described by Bao et al. [2] for their pairwise sequence comparison (PASC) method. Rather than relying on multiple sequence alignments and the counting of gap characters as a fifth nucleotide state (as is done in the currently recommended approach), our method and that of Bao et al. [2] rely on accurate and highly repeatable/reproducible pairwise sequence alignments and the complete exclusion of sites with gap characters from the pairwise identity calculations.

We apply this approach to determine the distribution of mastrevirus genome-wide pairwise identity scores and identify logical mastrevirus strain and species demarcation thresholds. We then apply these new mastrevirus species and strain demarcation criteria to all full mastrevirus genome sequences that were publicly available in May 2012 and propose updates to all mastrevirus isolate names to make these consistent with the proposed criteria. Although the modified protocol yields a classification that is very similar to the current mastrevirus classification (sequences belonging to three proposed and one accepted species are “demoted” to the level of strains of pre-existing species), it is extremely objective (i.e., the pairwise identity scores it yields are almost completely un-manipulable) and is applied within a freely available and easy-to-use computer program that should tremendously simplify the classification of any new mastrevirus full genome sequence.

A new approach to calculating pairwise identity scores

Given a set of mastrevirus full genome sequences that have all been linearised at the same position, the approach that we have chosen for pairwise identity score calculations is very simple and essentially involves two steps. In the first step, every unique pair of sequences is individually aligned, essentially using the Needleman-Wunsch algorithm [43] as applied in multiple sequence alignment programs such as ClustalW [9, 31], MUSCLE [11] and MAFFT [25]. For a set of S sequences, this will yield [S × (S−1)]/2 pairwise alignments. For each of these alignments, the identity of each pair of sequences is calculated as 1−M/N, where M is the number of mismatched nucleotides and N is the total number of columns along the alignment where neither aligned sequence has a gap character. Our identity score is therefore simply one minus the ratio of the Hamming distance over the length of the pairwise-aligned sequences.
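The two steps can be sketched in Python as follows. This is an illustrative sketch only, not the SDT code: it uses a textbook Needleman-Wunsch alignment with a linear gap penalty, and the scoring values (`match=1`, `mismatch=-1`, `gap=-2`) and all function names are assumptions chosen for the example rather than the settings of ClustalW, MUSCLE or MAFFT.

```python
from itertools import combinations


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of two sequences (linear gap penalty, assumed scores)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def pairwise_identity(aln_a, aln_b):
    """Identity = 1 - M/N over columns where neither sequence has a gap."""
    cols = [(x, y) for x, y in zip(aln_a, aln_b) if x != "-" and y != "-"]
    if not cols:
        raise ValueError("no gap-free columns to compare")
    mismatches = sum(x != y for x, y in cols)  # M
    return 1 - mismatches / len(cols)          # N = len(cols)


def identity_table(seqs):
    """Align every unique pair; S sequences give S*(S-1)/2 scores."""
    return {(p, q): pairwise_identity(*needleman_wunsch(seqs[p], seqs[q]))
            for p, q in combinations(seqs, 2)}


toy = {"s1": "ACGTACGT", "s2": "ACGACGT", "s3": "AAAA", "s4": "AAAT"}
table = identity_table(toy)
```

In this toy set the single gap separating `s1` from `s2` is excluded from N, so that pair scores an identity of 1.0, whereas `s3` and `s4` differ by one substitution in four gap-free columns and score 0.75. The full dynamic-programming matrix is held in memory here, which is acceptable for genomes of a few thousand nucleotides but is not optimised in any way.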

Since we knew of no computer programs that would perform pairwise alignments and output a table containing identity scores, we produced a computer program, called SDT (species demarcation tool), to largely automate this process (available from http://web.cbio.uct.ac.za/SDT). SDT will take as input a FASTA file with up to 1,000 sequences (either aligned or unaligned) and, in a single step, “calculate”, sort and display a colour-coded matrix of pairwise identity scores (Fig. 1a). It will additionally produce both plots of these pairwise identity scores and text files containing the plotted data to facilitate the identification of rational pairwise-identity-based taxonomic demarcation criteria (Fig. 1b).

Fig. 1

The SDT interface. a Colour-coded matrix of pairwise identity scores. b Distribution plot of pairwise identity scores. (1) Command menus used to load FASTA files, save analysis results in various graphical and text formats, terminate the program and rerun analyses with different settings. (2) Command buttons used to switch between the matrix display and the pairwise identity distribution display, to zoom in and out of the two displays and to switch between the full-colour mode and the three-colour mode. (3) Spin controls used to adjust defined pairwise similarity demarcation cutoffs that can be used to, for example, colour pairwise similarity scores of viruses within a species differently from scores between viruses that are in different species. (4) The horizontal order of sequence names from left to right is the same as the vertical order from top to bottom and reflects the vertical ordering of sequences that would occur within a neighbour-joining phylogenetic tree constructed from the pairwise identity matrix. (5) Each coloured cell at the intersection of two sequence names represents the percent identity between those two sequences such that, for example, the three red triangles represent three clusters of closely related sequences (having >95 % identity). (6) A key indicating the correspondence between pairwise identities and the colours displayed in the matrix. (7) The horizontal axis indicates the percentage pairwise identity, and the vertical axis indicates the proportion of pairwise identities. (8) The valleys between peaks in the plot indicate percentage pairwise identities that would make relatively conflict-free pairwise-identity-score-based taxonomic demarcation thresholds

Rational mastrevirus species and strain demarcation criteria

Using SDT, we performed pairwise alignments of 939 full genome sequences of mastreviruses and calculated a total of 440,391 pairwise identity scores (Fig. 2). The distribution of these scores has notable peaks at ~48–71 %, 78–92 %, and 94–100 % pairwise identity and clear valleys at 72–77 % and 93 % pairwise identity. Whereas peaks indicate demarcation thresholds that would likely yield classifications with high degrees of conflict (i.e., where large numbers of sequences could justifiably be classified as belonging to two or more different species), valleys indicate thresholds that would likely yield classifications with minimal conflict.
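The pair count and the search for valleys can be illustrated with a short Python sketch. The helper names (`n_pairs`, `identity_histogram`, `valleys`), the whole-percent binning and the toy scores are assumptions made for the example; SDT's own plotting may bin the scores differently.

```python
import math
from collections import Counter


def n_pairs(s: int) -> int:
    """Number of unique pairs among s sequences: s*(s-1)/2."""
    return s * (s - 1) // 2


def identity_histogram(percent_scores):
    """Proportion of scores falling in each whole-percent bin (0..100)."""
    counts = Counter(min(math.floor(p), 100) for p in percent_scores)
    total = len(percent_scores)
    return {p: counts.get(p, 0) / total for p in range(101)}


def valleys(hist, low, high, max_proportion=0.0):
    """Bins between low and high (inclusive) holding at most max_proportion of scores."""
    return [p for p in range(low, high + 1) if hist[p] <= max_proportion]


# Toy scores (percent identity); real input would be the full score table.
toy_scores = [55.2, 60.1, 65.3, 81.0, 85.5, 90.2, 96.1, 98.7, 99.9, 100.0]
hist = identity_histogram(toy_scores)
candidate_cutoffs = valleys(hist, 70, 95)
```

Here `n_pairs(939)` reproduces the 440,391 scores reported above, and `candidate_cutoffs` lists the sparsely populated bins that would make low-conflict thresholds for the toy data.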

Fig. 2

The new mastrevirus strain and species demarcation criteria. a Distribution of pairwise identity scores of full genome sequences of mastreviruses as determined using three different multiple sequence alignment programs: MUSCLE in blue, ClustalW in red and MAFFT in green (all with default settings). The vertical grey lines indicate the position of the 78 % species demarcation cutoff and the 94 % strain demarcation cutoff. b The new strain and species demarcation criteria yield, with only one exception, a series of species and strains where the degree of identity shared by the two most genetically different isolates within these species and strains is, respectively, within the 78 % (grey bars) and the 94 % (white bars) demarcation cutoffs. The only exceptions are the degrees of identity between 22 pairs of MSV-B isolates (out of a total of 1327 pairs), which are between 93.14 % and 94.0 %

Irrespective of the alignment program used (blue, red and green plots in Fig. 2a denoting MUSCLE, ClustalW and MAFFT, respectively), the method applied in SDT yields reasonably consistent distributions of pairwise identity scores for sequences that share >71 % pairwise identity. Overall, the MUSCLE method yields the highest pairwise identity scores (notice the rightward shift of the blue plot relative to the red and green plots) implying that, of the three alignment methods applied in SDT, its use in the classification of novel mastreviruses will yield the most conservative test of whether these sequences should represent species or strains. For our standardised mastrevirus classification protocol, we have therefore opted to recommend MUSCLE as the preferred alignment method.

Given that a species demarcation threshold of 78 % identity yields a species list that has a very low degree of conflict and requires only minor reclassifications of currently accepted mastrevirus species (i.e., it is mostly consistent with the currently prescribed classification system), we propose that mastrevirus genomes that are calculated to be >78 % similar with our new approach should be considered members of the same species.

Similarly, our analysis indicates that 94 % would be a relatively robust mastrevirus-wide strain demarcation threshold that would additionally be consistent with the informal strain demarcation systems currently in place for approved and tentative mastrevirus species such as “Chickpea chlorosis Australia virus” (CpCAV), “Chickpea chlorosis virus” (CpCV), “Chickpea chlorotic dwarf virus” (CpCDV), Chloris striate mosaic virus (CSMV), Digitaria didactyla striate mosaic virus (DDSMV), Maize streak virus (MSV), Panicum streak virus (PanSV), “Paspalum dilatatum striate mosaic virus” (PDSMV), “Paspalum striate mosaic virus” (PSMV), Sugarcane streak Reunion virus (SSRV), Sugarcane streak virus (SSV), Tobacco yellow dwarf virus (TYDV), and Wheat dwarf virus (WDV). We therefore propose that mastrevirus genomes that are calculated to be >94 % similar with our new approach should be considered members (or variants) of the same strain.

Importantly, there is strong phylogenetic support for almost all of the species and strains identified with the proposed classification system (Fig. 3 for all mastrevirus species other than MSV and Fig. 4 for MSV). The only cases where there is not strong phylogenetic support for the clustering of sequences within the identified strain and species groupings are those where only a single isolate has been identified as a representative of a given species or strain (Table 1).

Fig. 3

Phylogenetic support for the proposed mastrevirus species and strain demarcation criteria. Maximum-likelihood phylogenetic tree (constructed using full genome sequences and with the nucleotide substitution model GTR+I+G4 [19, 48]) depicting the likely evolutionary relationships of mastrevirus species and proposed strain groupings. Note that due to the large number of available maize streak virus (MSV) genome sequences, these sequences are represented on a separate tree presented in Fig. 4. The African, European, Asian and Australasian origins of the various isolates are indicated. BCSMV, bromus catharticus striate mosaic virus; CpCAV, chickpea chlorosis Australia virus; CpCV, chickpea chlorosis virus; CpCDV, chickpea chlorotic dwarf virus; CpRLV, chickpea redleaf virus; CpYV, chickpea yellows virus; CSMV, chloris striate mosaic virus; DCSMV, digitaria ciliaris striate mosaic virus; DDSMV, digitaria didactyla striate mosaic virus; DSV, digitaria streak virus; ESV, eragrostis streak virus; MiSV, miscanthus streak virus; MSRV, maize streak Reunion virus; ODV, oat dwarf virus; PanSV, panicum streak virus; PDSMV, paspalum dilatatum striate mosaic virus; PSMV, paspalum striate mosaic virus; SacSV, saccharum streak virus; SSMV-1, sporobolus striate mosaic virus-1; SSMV-2, sporobolus striate mosaic virus-2; SSEV, sugarcane streak Egypt virus; SSRV, sugarcane streak Reunion virus; SSV, sugarcane streak virus; TYDV, tobacco yellow dwarf virus; USV, urochloa streak virus; WDIV, wheat dwarf India virus; WDV, wheat dwarf virus

Fig. 4

Phylogenetic support for the proposed maize streak virus (MSV) species and strain demarcation criteria. Maximum-likelihood phylogenetic tree (constructed using full genome sequences and with the nucleotide substitution model GTR+I+G4 [19, 48]) depicting the likely evolutionary relationships of the proposed MSV strain groupings

Table 1 Details of mastrevirus type species and strains, including the hosts from which they were isolated and their country/territory of sampling

It is also worth pointing out that in the case of MSV and WDV, there are notable biological differences between viral isolates that, according to this proposal, would be classified into different strains. For example, the “A” strain of MSV is clearly the only group of MSV variants that causes severe disease in maize [37, 58]. Similarly, the “A” strain of WDV preferentially infects barley, whereas the “C” strain preferentially infects wheat [50].

Updating the names of known mastrevirus isolates to reflect the new classification criteria

We applied these classification criteria to the 939 full genome sequences of mastreviruses available in public databases in May 2012 (Supplementary Figure 1) and have updated the various sequence names accordingly in Supplementary Table 1. Briefly, the names of these viruses now have the following form:

<virus name>-<strain name>[<country/territory code>-<lab codes/old names/host species of origin/sample number/location of origin>-<year of sampling>]

Virus name

For a newly obtained isolate, the <virus name> is simply the ICTV-accepted name (or the acronym thereof) of the group of viruses to which the genome sequence has >78 % identity. If the sequence has >78 % identity to sequences classified as belonging to more than one established species, it is our recommendation that it simply be given the virus name of whichever sequences it is most similar to. If the sequence has <78 % identity to all known mastrevirus genome sequences, it belongs to a new species, and it should be given a unique name (i.e., a name not shared by any other currently named virus) containing the name of the host from which the sequence was isolated and a succinct symptom descriptor. For example, “maize fine streak virus” and “maize stippled streak virus” could both be suitable names for viruses isolated from maize that produce symptoms resembling those of maize streak virus.

A number of mastrevirus species names that are currently accepted contain the name of the country/territory from which the first representative of that species was isolated. For example, isolates of sugarcane streak Reunion virus are only distantly related to those of sugarcane streak virus but produce similar symptoms in sugarcane. Since such names can be very misleading when such viruses are subsequently isolated in different countries/territories (for example, maize streak Reunion virus is also found in Southern Africa), we would suggest that this practice be discontinued and additional descriptors relating to symptoms be used, as outlined in the previous paragraph.

Strain name

Although we have chosen here to simply name strains alphabetically, this should not preclude anyone from naming strains based on consistently observable biological differences between the members of different strains. For example, suitable alternative names for MSV-A and MSV-B that reflect the different host preferences of viruses belonging to these strains would be MSV-Maize and MSV-Digitaria, respectively [58]. Although a descriptive strain name could potentially be useful, it should be borne in mind that unless it genuinely reflects the characteristics of all members of a strain, it also could be quite misleading. If symptom descriptors such as “mild” or “severe” are used as strain names, they should be based on symptom phenotypes observed in multiple independently isolated members of a strain.

It should also be noted that in many cases ad hoc “subtype” classification systems have been used to further categorise the members of certain strains. For example, MSV-A isolates have been categorised into subtypes MSV-A1, -A2, -A3, -A4, -A5 and -A6. Although such sub-strain classifications are beyond the scope of this paper, it is appreciated that they can serve a practical purpose and, should such classifications be used, it is recommended that, as has been done with MSV, the sub-strain classifications be denoted with a subscript after the strain name.

Isolate descriptor field

Bounded by square brackets (i.e., “[ ]”), the isolate descriptor field may contain any number of sub-fields, each separated by a hyphen (i.e., “-”), but should, wherever possible, have as the first sub-field the two-letter international code of the country/territory of origin (Supplementary Table 2) and as the last sub-field the year of isolation. Between these first and last sub-fields can be placed any additional useful descriptors, such as the district or city from which an isolate was obtained or the host species in which it was found. These “in-between” sub-fields can also contain additional information such as sample code numbers or former names. Crucially, the recommended format is “machine readable” in that various sequence analysis programs will be able to extract country/territory and sampling date information from such sequence names. Also note that we have broken with the Ninth ICTV Report’s recommendations for geminivirus nomenclature [7] and have avoided use of the “:” symbol to separate the isolate descriptor fields. We have done this because this symbol is specifically used to indicate branch-length information in the Newick phylogenetic tree file format (see http://en.wikipedia.org/wiki/Newick_format), and its use within sequence names can therefore cause problems for many computer programs that infer and/or render phylogenetic trees.
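To illustrate what “machine readable” means in practice, the following Python sketch builds and parses names of the recommended form. The functions `build_name` and `parse_name`, the regular expression and the example isolate (“Sample1”, sampled in 2010) are hypothetical and serve only to demonstrate the format.

```python
import re

# <virus name>-<strain name>[<country code>-<free sub-fields>-<year>]
NAME_RE = re.compile(
    r"^(?P<virus>[^\[\]]+?)-(?P<strain>[^-\[\]]+)\[(?P<descriptor>[^\[\]]*)\]$"
)


def build_name(virus, strain, country, year, extras=()):
    """Assemble an isolate name; reject ':' because Newick reserves it."""
    fields = [country, *extras, str(year)]
    name = f"{virus}-{strain}[{'-'.join(fields)}]"
    if ":" in name:
        raise ValueError("':' is not allowed in isolate names")
    return name


def parse_name(name):
    """Split a name into virus, strain, country, year and in-between sub-fields."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"not in the recommended format: {name!r}")
    sub = m["descriptor"].split("-")
    last = sub[-1]
    return {
        "virus": m["virus"],
        "strain": m["strain"],
        "country": sub[0],                            # first sub-field
        "year": int(last) if last.isdigit() else None,  # last sub-field
        "extras": sub[1:-1],                          # everything in between
    }


name = build_name("MSV", "A", "ZA", 2010, extras=["Sample1"])
parsed = parse_name(name)
```

Because hyphens separate the sub-fields, a descriptor that itself contains a hyphen cannot be distinguished from two separate sub-fields by a parser of this kind; only the first (country/territory) and last (year) positions are unambiguous.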

Resolving conflicts within the new mastrevirus classification system

Although the species and strain demarcation thresholds that we have chosen minimise the number of ambiguous classifications that might be made with the currently available full genome sequences, it is important to point out that situations are likely to arise where there is some uncertainty over the proper species or strain assignments of some isolates. The four possible reasons why a newly sequenced genome might be difficult to classify are:

  1. Although >78 % identical to some isolates from a particular species, the new genome is <78 % identical to other isolates of that same species.

  2. The new genome is >78 % identical to isolates from two or more different species.

  3. Although >94 % identical to some isolates from a particular strain, the new genome is <94 % identical to other isolates of that same strain.

  4. The new genome is >94 % identical to some isolates from two or more different strains.

Among the mastrevirus genomes analysed here, we encountered only one instance of conflict type (3) – i.e., in some cases, MSV-B isolates were between 93.2 % and 94 % identical to other MSV-B isolates (Fig. 2b). We therefore recommend that each of the four above-mentioned conflict situations be resolved as follows:

  1. The new isolate should be classified as belonging to any species with which it shares >78 % identity to any one isolate formerly classified as belonging to that species, even if it is <78 % identical to other isolates classified as belonging to that species.

  2. The new isolate should be considered as belonging to the species containing the isolate with which it shares the highest degree of identity.

  3. The new isolate should be classified as belonging to any strain with which it shares >94 % identity to any one isolate formerly classified as belonging to that strain, even if it is <94 % identical to other isolates classified as belonging to that strain.

  4. The new isolate should be considered as belonging to the strain containing the isolate with which it shares the highest degree of identity.
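Taken together, the four resolutions amount to letting the single most similar reference isolate decide the assignment. A minimal Python sketch of this reading is given below; the `Hit` record, the `classify` function and the example identities are hypothetical, and the treatment of scores that fall exactly on a cutoff (counted here as falling below it) is an assumption, since the text specifies only “>” and “<”.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Hit(NamedTuple):
    species: str
    strain: str
    identity: float  # percent identity between the query and this reference


def classify(
    hits: Sequence[Hit],
    species_cutoff: float = 78.0,
    strain_cutoff: float = 94.0,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (species, strain); None means a new species or strain is needed."""
    if not hits:
        return (None, None)
    # Resolutions 2 and 4: the most similar reference isolate decides.
    best = max(hits, key=lambda h: h.identity)
    # Resolution 1: one isolate above the cutoff suffices for species membership.
    if best.identity <= species_cutoff:
        return (None, None)          # new species
    # Resolution 3: one isolate above the cutoff suffices for strain membership.
    if best.identity <= strain_cutoff:
        return (best.species, None)  # existing species, new strain
    return (best.species, best.strain)


# Conflict type (4): >94 % identity to isolates of two different strains.
example = classify([Hit("MSV", "B", 93.5), Hit("MSV", "B", 95.0), Hit("MSV", "A", 94.5)])
```

In the example the query is assigned to MSV-B because its closest relative (95.0 %) belongs to that strain, even though it is <94 % identical to another MSV-B isolate and >94 % identical to an MSV-A isolate.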

A step-by-step guide to classifying new full genome sequences of mastreviruses

Following the determination of the full genome sequence of a new mastrevirus, it is recommended that:

  1. The new sequence should be used to perform a “nucleotide BLAST” search (accessible via http://blast.ncbi.nlm.nih.gov/Blast.cgi) of the NCBI “Nucleotide collection” database to, firstly, obtain the set of currently deposited sequences that most closely resemble the new sequence and, secondly, to identify the species to which the new sequence is most closely related.

  2. The set of sequences returned by the NCBI BLAST search should be saved in FASTA format and added to any other set of mastrevirus reference sequences (which, if also in FASTA file format, can simply be cut and pasted into the same FASTA file using a standard text editor). Such a mastrevirus reference sequence dataset is included with the SDT installation package in the file “mastrevirus references.sdt”, and updated versions of this file will be kept on the SDT web page (http://web.cbio.uct.ac.za/SDT). Alternatively, searching the nucleotide database at the NCBI website (http://www.ncbi.nlm.nih.gov/nuccore/) using the search term “<virus species/genus/family name> AND 2500:4000[SLEN]” will return all genomes indicated in the “<virus species/genus/family name>” field that are between 2500 and 4000 nucleotides long. These can then also be saved to a FASTA file. Regardless of how datasets are compiled, sub-genome-length sequences should ultimately be removed from FASTA files that are intended for use in pairwise sequence identity analysis. Also, care should be taken to ensure that the sequences being analysed all start at the same genomic coordinate. In the case of mastreviruses, there remain a small number of sequences in the database that do not begin at the virion strand origin of replication, and these should either be removed or edited so that they begin at this site prior to analysis.

  3. The FASTA file should be opened with the computer program SDT and, following selection of the MUSCLE method and calculation of the pairwise identity score matrix (Fig. 1a), it should be decided whether the sequence falls within a previously ICTV-accepted or proposed species (i.e., shares >78 % identity with isolates of that species) and, if so, whether it falls within a previously identified strain (i.e., shares >94 % identity with isolates previously classified as belonging to a named strain). If it falls within a previously named species and strain, the name and strain of the new sequence should be reflected in the species and strain name fields of its name. Similarly, if use is to be made of the reference mastrevirus dataset, the “mastrevirus references.sdt” file should be loaded first, and the user-generated FASTA file containing the researcher’s own newly determined sequence(s) (and perhaps also the nearest relatives of these sequences that were revealed by a BLAST search) should be appended to the mastrevirus references set using the “Append” command button (the SDT program will prompt this whole process once the “.sdt” file is selected rather than a FASTA file). From this point on, the analysis would be carried out as above for the analysis of a FASTA file.

  4. If the new sequence belongs to a new species (i.e., it is <78 % identical to any other known mastrevirus sequence), an appropriate species name should be proposed (see above for details) and the sequence should be given the strain name “A”.

  5. If the new sequence belongs to an existing species but is a new strain (i.e., it is <94 % identical to all isolates described for that species), the strain name proposed should follow our alphabetical naming convention, with pertinent details of the new sequence being added to the isolate descriptor field of the name, and not the strain field. If, in the future, multiple variants of the new strain have some unique, well-defined biological property, the strain could then be given a more descriptive name.
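The dataset clean-up described in step 2 (removing sub-genome-length sequences and making every genome start at the same coordinate) can be sketched in Python as follows. This is an illustration rather than part of SDT. The 2500–4000 nucleotide window is taken from the search term above, but the use of the conserved geminivirus nonanucleotide TAATATTAC to locate the virion strand origin of replication, and the choice of the A following the nick site (TAATATT/AC) as the first nucleotide, are assumptions of this sketch; all function names are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into {name: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {k: "".join(v) for k, v in records.items()}


def rotate_to_origin(seq, motif="TAATATTAC", offset=7):
    """Rotate a circular genome so that it starts `offset` bases into the motif.

    Returns None if the motif is absent (such sequences need manual checking).
    """
    # Append the first few bases so a motif spanning the break point is found.
    circular = seq + seq[: len(motif) - 1]
    pos = circular.find(motif)
    if pos < 0:
        return None
    start = (pos + offset) % len(seq)
    return seq[start:] + seq[:start]


def prepare(records, min_len=2500, max_len=4000):
    """Drop sequences outside the length window and re-linearise the rest."""
    cleaned = {}
    for name, seq in records.items():
        if not (min_len <= len(seq) <= max_len):
            continue  # sub-genome-length (or oversized) record
        rotated = rotate_to_origin(seq)
        if rotated is not None:
            cleaned[name] = rotated
    return cleaned


# Toy records; the tiny length window is for demonstration only.
toy_fasta = ">g1\nGGGTAATATTACCCC\n>g2\nTTACCCCGGGTAATA\n>frag\nTAATATTAC\n"
cleaned = prepare(read_fasta(toy_fasta), min_len=10, max_len=20)
```

In the toy input, `g1` and `g2` are the same circular molecule linearised at different positions, and both are rotated to the same string, whereas the short fragment is discarded. Only the given strand is searched, so a record deposited as the complementary strand would not be matched and would need to be checked by hand.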

Conclusions

A pairwise-identity-based mastrevirus species demarcation criterion has been proposed that, while almost entirely consistent with the current mastrevirus classification, includes a very strict pairwise identity calculation protocol that should, if widely applied, significantly reduce the numbers of inappropriate new species proposals that are submitted for consideration by the ICTV. Also proposed is a new mastrevirus-wide pairwise-identity-based strain demarcation threshold. The standardised strain-level classification scheme should provide a consistent framework within which mastrevirus strains belonging to different species can be meaningfully compared with respect to, for example, their relative host and geographical ranges. The main strength of the proposal is that the prescribed pairwise identity calculation protocol is very difficult to manipulate (either intentionally or unintentionally) and will yield identical pairwise identity scores for a given pair of sequences, irrespective of how many other sequences are being compared within a dataset. This means that as the number of deposited full genome sequences of mastreviruses increases, there will be no need in the future to continuously revise the classification of already established species and strains. Perhaps most important from the perspective of the broader virology community with a general interest in virus diversity, however, is the fact that the computer program implementing our pairwise identity calculation protocol, SDT, can also very easily be adopted in standardised protocols for the classification of other virus groups.

The genome-wide pairwise-identity-based proposal for the classification of mastreviruses has been approved by the executive committee of the ICTV, and the document is available at http://talk.ictvonline.org/files/proposals/taxonomy_proposals_plant1/m/plant04/4399.aspx (2012.019abP.A.v3.Mastrevirus-17sp,rem-2sp.pdf).