A genome-wide pairwise-identity-based proposal for the classification of viruses in the genus Mastrevirus (family Geminiviridae)
- Cite this article as:
- Muhire, B., Martin, D.P., Brown, J.K. et al. Arch Virol (2013) 158: 1411. doi:10.1007/s00705-012-1601-7
Recent advances in the ease with which the genomes of small circular single-stranded DNA viruses can be amplified, cloned, and sequenced have greatly accelerated the rate at which full genome sequences of mastreviruses (family Geminiviridae, genus Mastrevirus) are being deposited in public sequence databases. Although guidelines currently exist for species-level classification of newly determined, complete mastrevirus genome sequences, these are difficult to apply to large sequence datasets and are permissive enough that, effectively, a high degree of leeway exists for the proposal of new species and strains. The lack of a standardised and rigorous method for testing whether a new genome sequence deserves such a classification is resulting in increasing numbers of questionable mastrevirus species proposals. Importantly, the recommended sequence alignment and pairwise identity calculation protocols of the current guidelines could easily be modified to make the classification of newly determined mastrevirus genome sequences significantly more objective. Here, we propose modified versions of these protocols that should substantially reduce the degree of classification inconsistency that is permissible under the current system. To facilitate the objective application of these guidelines for mastrevirus species demarcation, we additionally present a user-friendly computer program, SDT (species demarcation tool), for calculating and graphically displaying pairwise genome identity scores. We apply SDT to the 939 full genome sequences of mastreviruses that were publicly available in May 2012, and based on the distribution of pairwise identity scores yielded by our protocol, we propose mastrevirus species and strain demarcation thresholds of >78 % and >94 % identity, respectively.
The family Geminiviridae includes insect-transmitted, plant-infecting viruses with circular, single-stranded DNA (ssDNA) genomes that are encapsidated within geminate particles. The genus Mastrevirus of this family consists of viruses that are transmitted by leafhoppers and have a single genome component with a conserved arrangement of three genes (encoding a movement protein, a coat protein and two versions of a replication-associated protein) and two non-coding regions (the large and small intergenic regions).
A variety of International Committee on Taxonomy of Viruses (ICTV)–endorsed guidelines currently exist for the classification and naming of new mastreviruses [7, 13, 14, 15]. Primary among these guidelines is the application of carefully selected genome-wide pairwise sequence identity thresholds, either to assign newly determined mastreviruses to existing species or strains, or as the basis for proposing that newly determined sequences correspond to new species or strains.
There are, however, quite a few different ways in which pairwise nucleotide sequence identities could be calculated and, importantly, the identity score that one gets for any given pair of sequences can vary quite substantially depending on exactly how it was calculated. One obvious example of how pairwise identity scores could be either inflated or deflated involves the treatment of gap characters that are inserted during the sequence alignment process. If gap characters are treated as a fifth possible nucleotide state, positions in the sequence alignment where one sequence has a nucleotide but the other does not will be (and quite reasonably so) counted as a mismatch. There exists a problem, however, in determining, firstly, how much such mismatches should count relative to standard nucleotide mismatches and, secondly, how much each gap character in runs of several gaps should count relative to isolated gaps. Side-stepping this problem altogether by simply ignoring all alignment sites at which one or the other sequence has a gap character is the approach of choice when, for example, calculating genetic and evolutionary distances in applications such as phylogenetic tree construction [16, 35]. Whereas ignoring positions where one sequence has a gap and the other does not will inflate pairwise identity scores, evenly scoring every one of these sites as a “normal” nucleotide mismatch will deflate identity scores over methods that include runs of gaps as a single mismatch.
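The effect of gap handling on the score can be illustrated with a minimal Python sketch. It is not taken from any published tool: the function name, the dictionary keys and the choice to weight a run of gaps as exactly one mismatch are assumptions made here for illustration. It scores a single pre-aligned pair of sequences under the three schemes discussed above.

```python
from typing import Dict


def identity_under_gap_schemes(aln_a: str, aln_b: str) -> Dict[str, float]:
    """Score one pairwise alignment under three gap-handling schemes.

    Hypothetical helper: both inputs are rows of the same alignment,
    with '-' as the gap character.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")

    matches = mismatches = gap_columns = gap_runs = 0
    run_owner = None  # which sequence holds the current run of gaps
    for x, y in zip(aln_a.upper(), aln_b.upper()):
        if x == "-" and y == "-":
            continue  # gap-only column carries no information
        if x == "-" or y == "-":
            owner = "a" if x == "-" else "b"
            gap_columns += 1
            if owner != run_owner:
                gap_runs += 1  # a new run of consecutive gaps starts here
            run_owner = owner
            continue
        run_owner = None
        if x == y:
            matches += 1
        else:
            mismatches += 1

    compared = matches + mismatches
    if compared == 0:
        raise ValueError("no ungapped columns to compare")

    return {
        # gapped columns are dropped entirely (inflates identity)
        "gaps_ignored": matches / compared,
        # every gapped column counts as one mismatch (deflates identity)
        "gap_as_fifth_state": matches / (compared + gap_columns),
        # each run of gaps counts as a single mismatch (intermediate)
        "gap_run_as_one_mismatch": matches / (compared + gap_runs),
    }


scores = identity_under_gap_schemes("ACGTACGT--AC", "ACGAAC-TGGAC")
```

For this toy alignment (eight matches, one mismatch, three gapped columns in two runs), the three schemes give different scores for the same pair, with the gaps-ignored score highest and the fifth-state score lowest.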
An additional important factor that can cause fluctuations in pairwise identity scores of a given pair of sequences is the method used for sequence alignment. An optimal pairwise alignment of the sequences will generally yield a higher pairwise identity score than if the sequences were aligned within the context of a multiple sequence alignment that includes one or more additional sequences. Also, as the number and diversity of sequences in a multiple sequence alignment increase, the pairwise identity scores of any individual pair of sequences in the alignment are expected to decrease. In other words, pairwise identity scores will tend downwards with increasing alignment size. Finally, different multiple sequence alignments generated either by different multiple sequence alignment programs (such as ClustalV, ClustalW, MAFFT or MUSCLE), or by a single program with different alignment settings (such as gap open and extension penalties) will not all be equally accurate [11, 12, 25, 44, 55].
Despite these various issues, pairwise-identity-based virus classification criteria are extremely popular amongst virus taxonomists and are likely to grow in importance due to how easy they are to use and the fact that, once properly validated, they accurately reflect the biology of these organisms [2, 32, 33]. We have therefore devised a pairwise-identity-based approach for mastrevirus classification that almost completely removes all the alignment and gap-handling problems of the current ICTV-endorsed mastrevirus classification protocol. We apply an approach that is almost identical to that described by Bao et al. for their pairwise sequence comparison (PASC) method. Rather than relying on multiple sequence alignments and the counting of gap characters as a fifth nucleotide state (as is done in the currently recommended approach), our method and that of Bao et al. rely on accurate and highly repeatable/reproducible pairwise sequence alignments and the complete exclusion of sites with gap characters from the pairwise identity calculations.
We apply this approach to determine the distribution of mastrevirus genome-wide pairwise identity scores and identify logical mastrevirus strain and species demarcation thresholds. We then apply these new mastrevirus species and strain demarcation criteria to all full mastrevirus genome sequences that were publicly available in May 2012 and propose updates to all mastrevirus isolate names to make these consistent with the proposed criteria. Although the modified protocol yields a classification that is very similar to the current mastrevirus classification (sequences belonging to three proposed and one accepted species are “demoted” to the level of strains of pre-existing species), it is extremely objective (i.e., the pairwise identity scores it yields are almost completely un-manipulable) and is applied within a freely available and easy-to-use computer program that should tremendously simplify the classification of any new mastrevirus full genome sequence.
A new approach to calculating pairwise identity scores
Given a set of mastrevirus full genome sequences that have all been linearised at the same position, the approach that we have chosen for pairwise identity score calculations is very simple and essentially involves two steps. In the first step, every unique pair of sequences is individually aligned, essentially using the Needleman-Wunsch algorithm as applied in multiple sequence alignment programs such as ClustalW [9, 31], MUSCLE and MAFFT. For a set of S sequences, this will yield [S × (S−1)]/2 pairwise alignments. In the second step, the identity of each pair of sequences is calculated from its alignment as 1−M/N, where M is the number of mismatched nucleotides and N is the total number of columns along the alignment where neither aligned sequence has a gap character. Our identity score is therefore simply one minus the ratio of the Hamming distance to the number of ungapped columns in the pairwise alignment.
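The two steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than the SDT implementation: the `align` argument is a hypothetical stand-in for whichever pairwise aligner is used (MUSCLE, ClustalW or MAFFT in SDT), and the function and variable names are invented here.

```python
from itertools import combinations
from typing import Callable, Dict, Tuple

# Hypothetical aligner interface: takes two raw sequences and returns
# the two gapped rows of their pairwise alignment.
Aligner = Callable[[str, str], Tuple[str, str]]


def pairwise_identity(aln_a: str, aln_b: str) -> float:
    """Return 1 - M/N, skipping every column in which either row has a gap."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    n = m = 0
    for x, y in zip(aln_a.upper(), aln_b.upper()):
        if x == "-" or y == "-":
            continue  # gapped columns are excluded from N
        n += 1
        if x != y:
            m += 1
    if n == 0:
        raise ValueError("no ungapped columns to compare")
    return 1 - m / n


def identity_matrix(seqs: Dict[str, str], align: Aligner) -> Dict[Tuple[str, str], float]:
    """Step 1: align each unique pair. Step 2: score each alignment."""
    scores = {}
    for (name_a, seq_a), (name_b, seq_b) in combinations(seqs.items(), 2):
        scores[(name_a, name_b)] = pairwise_identity(*align(seq_a, seq_b))
    return scores


def number_of_pairs(s: int) -> int:
    """Number of unique pairs among s sequences: s(s-1)/2."""
    return s * (s - 1) // 2


# Toy run with sequences that are already the same length, so a
# pass-through "aligner" is enough for demonstration purposes.
toy = {"s1": "ACGTACGTAC", "s2": "ACGTACGTAA", "s3": "ACGAACGTAA"}
matrix = identity_matrix(toy, align=lambda a, b: (a, b))
```

Because each score depends only on the two sequences in the pair, adding further sequences to `toy` leaves the existing entries of `matrix` unchanged. For the 939 genomes analysed here, `number_of_pairs(939)` gives 440,391 pairwise alignments.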
Rational mastrevirus species and strain demarcation criteria
Irrespective of the alignment program used (blue, red and green plots in Fig. 2a denoting MUSCLE, ClustalW and MAFFT, respectively), the method applied in SDT yields reasonably consistent distributions of pairwise identity scores for sequences that share >71 % pairwise identity. Overall, the MUSCLE method yields the highest pairwise identity scores (notice the rightward shift of the blue plot relative to the red and green plots) implying that, of the three alignment methods applied in SDT, its use in the classification of novel mastreviruses will yield the most conservative test of whether these sequences should represent species or strains. For our standardised mastrevirus classification protocol, we have therefore opted to recommend MUSCLE as the preferred alignment method.
Given that a species demarcation threshold of 78 % identity yields a species list that has a very low degree of conflict and requires only minor reclassifications of currently accepted mastrevirus species (i.e., it is mostly consistent with the currently prescribed classification system), we propose that mastrevirus genomes that are calculated to be >78 % similar with our new approach should be considered members of the same species.
Similarly, our analysis indicates that 94 % would be a relatively robust mastrevirus-wide strain demarcation threshold that would additionally be consistent with the informal strain demarcation systems currently in place for approved and tentative mastrevirus species such as “Chickpea chlorosis Australia virus” (CpCAV), “Chickpea chlorosis virus” (CpCV), “Chickpea chlorotic dwarf virus” (CpCDV), Chloris striate mosaic virus (CSMV), Digitaria didactyla striate mosaic virus (DDSMV), Maize streak virus (MSV), Panicum streak virus (PanSV), “Paspalum dilatatum striate mosaic virus” (PDSMV), “Paspalum striate mosaic virus” (PSMV), Sugarcane streak Reunion virus (SSRV), Sugarcane streak virus (SSV), Tobacco yellow dwarf virus (TYDV), and Wheat dwarf virus (WDV). We therefore propose that mastrevirus genomes that are calculated to be >94 % similar with our new approach should be considered members (or variants) of the same strain.
[Table: Details of mastrevirus type species and strains, including the hosts from which they were isolated and their country/territory of sampling]
It is also worth pointing out that in the case of MSV and WDV, there are notable biological differences between viral isolates that, according to this proposal, would be classified into different strains. For example, the “A” strain of MSV is clearly the only group of MSV variants that causes severe disease in maize [37, 58], and the “A” strain of WDV preferentially infects barley, whereas the “C” strain preferentially infects wheat.
Updating the names of known mastrevirus isolates to reflect the new classification criteria
We applied these classification criteria to the 939 full genome sequences of mastreviruses available in public databases in May 2012 (Supplementary Figure 1) and have updated the various sequence names accordingly in Supplementary Table 1. Briefly, the names of these viruses now have the following form:
<virus name>-<strain name>[<country/territory code>-<lab codes/old names/host species of origin/sample number/location of origin>-<year of sampling>]
For a newly obtained isolate the <virus name> is simply the ICTV-accepted name (or the acronym thereof) of the group of viruses to which the genome sequence has >78 % identity. If the sequence has >78 % identity to sequences classified as belonging to more than one established species, it is our recommendation that it simply be given the virus name of whichever sequences it is most similar to. Obviously, if the sequence has <78 % identity to any known mastrevirus genome sequences, it belongs to a new species, and it should be given a unique name (i.e., a name not shared by any other currently named virus) containing the name of the host from which the sequence was isolated and a succinct symptom descriptor. For example, “maize fine streak virus” and “maize stippled streak virus” could both be suitable names for viruses isolated from maize that produce symptoms resembling those of maize streak virus.
A number of mastrevirus species names that are currently accepted contain the name of the country/territory from which the first representative of that species was isolated. For example, isolates of sugarcane streak Reunion virus are only distantly related to those of sugarcane streak virus but produce similar symptoms in sugarcane. Since such names can be very misleading when such viruses are subsequently isolated in different countries/territories (for example, maize streak Reunion virus is also found in Southern Africa), we would suggest that this practice be discontinued and additional descriptors relating to symptoms be used, as outlined in the previous paragraph.
Although we have chosen here to simply name strains alphabetically, this should not preclude anyone from naming strains based on consistently observable biological differences between the members of different strains. For example, suitable alternative names for MSV-A and MSV-B that reflect the different host preferences of viruses belonging to these strains would be MSV-Maize and MSV-Digitaria, respectively . Although a descriptive strain name could potentially be useful, it should be borne in mind that unless it genuinely reflects the characteristics of all members of a strain, it also could be quite misleading. If symptom descriptors such as “mild” or “severe” are used as strain names, they should be based on symptom phenotypes observed in multiple independently isolated members of a strain.
It should also be noted that in many cases ad hoc “subtype” classification systems have been used to further categorise the members of certain strains. For example, MSV-A isolates have been categorised into subtypes MSV-A1, -A2, -A3, -A4, -A5 and -A6. Although such sub-strain classifications are beyond the scope of this paper, it is appreciated that they can serve a practical purpose and, should such classifications be used, it is recommended that, as has been done with MSV, the sub-strain classifications be denoted with a subscript after the strain name.
Isolate descriptor field
Bounded by square brackets (i.e., “[ ]”), the isolate descriptor field may contain any number of sub-fields, each separated by a hyphen (i.e., “-”), but should, wherever possible, have as the first sub-field the two-letter international code of the country/territory of origin (Supplementary Table 2) and as the last sub-field the year of isolation. Between these first and last sub-fields can be placed any additional useful descriptors, such as the district or city from which an isolate was obtained or the host species in which it was found. These “in-between” sub-fields can also contain additional information such as sample code numbers or former names. Crucially, the recommended format is “machine readable” in that various sequence analysis programs will be able to extract country/territory and sampling date information from such sequence names. Also note that we have broken with the Ninth ICTV Report’s recommendations for geminivirus nomenclature and have avoided use of the “:” symbol to separate the isolate descriptor fields. We have done this because this symbol is specifically used to indicate branch-length information in the Newick phylogenetic tree file format (see http://en.wikipedia.org/wiki/Newick_format), and its use within sequence names can therefore cause problems for many computer programs that infer and/or render phylogenetic trees.
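As an illustration of what “machine readable” means in practice, the following Python sketch extracts the country/territory code and sampling year from a name in the recommended form. The parser, its field names and the example name “MSV-A[ZA-Sample1-2010]” are hypothetical and are not taken from SDT or from Supplementary Table 1. The sketch assumes that the strain name itself contains no hyphen and that a four-digit final sub-field is a year.

```python
import re
from typing import NamedTuple, Optional, Tuple


class IsolateName(NamedTuple):
    virus: str
    strain: str
    country: str                  # first sub-field of the descriptor
    descriptors: Tuple[str, ...]  # optional "in-between" sub-fields
    year: Optional[int]           # last sub-field, if it looks like a year


# <virus name>-<strain name>[<sub-field>-<sub-field>-...]
_NAME_PATTERN = re.compile(
    r"^(?P<virus>[^\[\]]+)-(?P<strain>[^-\[\]]+)\[(?P<fields>[^\[\]]*)\]$"
)


def parse_isolate_name(name: str) -> IsolateName:
    """Split an isolate name into virus, strain and descriptor sub-fields."""
    if ":" in name:
        # ':' marks branch lengths in Newick files, so it is rejected here
        raise ValueError("':' is not allowed in isolate names")
    match = _NAME_PATTERN.match(name.strip())
    if match is None:
        raise ValueError("name does not follow the expected form: %r" % name)

    fields = match.group("fields").split("-")
    country = fields[0]
    last = fields[-1]
    has_year = len(fields) > 1 and len(last) == 4 and last.isdigit()
    year = int(last) if has_year else None
    middle = fields[1:-1] if has_year else fields[1:]
    return IsolateName(
        match.group("virus"), match.group("strain"), country, tuple(middle), year
    )


parsed = parse_isolate_name("MSV-A[ZA-Sample1-2010]")
```

Here `parsed.country` is `"ZA"` and `parsed.year` is `2010`; any number of additional hyphen-separated descriptors between the two are returned unchanged in `parsed.descriptors`.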
Resolving conflicts within the new mastrevirus classification system
Conflict: Although >78 % identical to some isolates from a particular species, the new genome is <78 % identical to other isolates of that same species. Resolution: The new isolate should be classified as belonging to any species with which it shares >78 % identity to any one isolate formerly classified as belonging to that species, even if it is <78 % identical to other isolates classified as belonging to that species.
Conflict: The new genome is >78 % identical to isolates from two or more different species. Resolution: The new isolate should be considered as belonging to the species containing the isolate with which it shares the highest degree of identity.
Conflict: Although >94 % identical to some isolates from a particular strain, the new genome is <94 % identical to other isolates of that same strain. Resolution: The new isolate should be classified as belonging to any strain with which it shares >94 % identity to any one isolate formerly classified as belonging to that strain, even if it is <94 % identical to other isolates classified as belonging to that strain.
Conflict: The new genome is >94 % identical to some isolates from two or more different strains. Resolution: The new isolate should be considered as belonging to the strain containing the isolate with which it shares the highest degree of identity.
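Taken together, these resolutions reduce to a simple rule: the single most similar reference isolate decides both the species and the strain, provided its identity exceeds the relevant threshold. A minimal Python sketch of that rule follows. The `Hit` record, the function name and the convention of returning `None` for “new species” or “new strain” are assumptions made here, as is the handling of a score exactly equal to a threshold, which the proposal does not specify.

```python
from typing import Iterable, NamedTuple, Optional, Tuple

SPECIES_THRESHOLD = 0.78  # >78 % identity: same species
STRAIN_THRESHOLD = 0.94   # >94 % identity: same strain


class Hit(NamedTuple):
    """Identity of the new genome to one classified reference isolate."""
    species: str
    strain: str
    identity: float  # fraction between 0 and 1


def classify(hits: Iterable[Hit]) -> Tuple[Optional[str], Optional[str]]:
    """Return (species, strain); None marks a new species or a new strain."""
    hits = list(hits)
    if not hits:
        return (None, None)

    # The most similar isolate settles conflicts between species and strains.
    best = max(hits, key=lambda h: h.identity)

    # Assumption: a score exactly on a threshold does not exceed it.
    if best.identity <= SPECIES_THRESHOLD:
        return (None, None)            # new species
    if best.identity <= STRAIN_THRESHOLD:
        return (best.species, None)    # known species, new strain
    return (best.species, best.strain)


# Hypothetical scores for one new genome against three reference isolates.
example = classify([
    Hit("MSV", "A", 0.96),
    Hit("MSV", "B", 0.95),
    Hit("PanSV", "A", 0.80),
])
```

In this made-up example the new genome exceeds 94 % identity to isolates of two MSV strains and 78 % identity to isolates of two species, and is assigned to MSV strain A because that is where its closest relative sits.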
A step-by-step guide to classifying new full genome sequences of mastreviruses
1. The new sequence should be used to perform a “nucleotide BLAST” search (accessible via http://blast.ncbi.nlm.nih.gov/Blast.cgi) of the NCBI “Nucleotide collection” database, firstly, to obtain the set of currently deposited sequences that most closely resemble the new sequence and, secondly, to identify the species to which the new sequence is most closely related.
2. The set of sequences returned by the NCBI BLAST search should be saved in FASTA format and added to any other set of mastrevirus reference sequences (which, if also in FASTA file format, can simply be cut and pasted into the same FASTA file using a standard text editor). Such a mastrevirus reference sequence dataset is included with the SDT installation package in the file “mastrevirus references.sdt”, and updated versions of this file will be kept on the SDT web page (http://web.cbio.uct.ac.za/SDT). Alternatively, searching the nucleotide database at the NCBI website (http://www.ncbi.nlm.nih.gov/nuccore/) using the search term “<virus species/genus/family name> AND 2500:4000[SLEN]” will return all genomes indicated in the <virus species/genus/family name> field that are between 2500 and 4000 nucleotides long. These can then also be saved to a FASTA file. Regardless of how datasets are compiled, sub-genome-length sequences should ultimately be removed from FASTA files that are intended for use in pairwise sequence identity analysis. Also, care should be taken to ensure that all the sequences being analysed start at the same genomic coordinate. In the case of mastreviruses, there remain a small number of sequences in the database that do not begin at the virion strand origin of replication, and these should either be removed or edited so that they begin at this site prior to analysis.
3. The FASTA file should be opened with the computer program SDT and, following selection of the MUSCLE method and calculation of the pairwise identity score matrix (Fig. 1a), it should be decided whether the sequence falls within a previously ICTV-accepted or proposed species (i.e., shares >78 % identity with isolates of that species) and, if so, whether it falls within a previously identified strain (i.e., shares >94 % identity with isolates previously classified as belonging to a named strain). If it falls within a previously named species and strain, the name and strain of the new sequence should be reflected in the species and strain name fields of its name. Similarly, if use is to be made of the reference mastrevirus dataset, the “mastrevirus references.sdt” file should be loaded first, and the user-generated FASTA file containing the researcher’s own newly determined sequence(s) (and perhaps also the nearest relatives of these sequences that were revealed by a BLAST search) should be appended to the mastrevirus references set using the “Append” command button (the SDT program will prompt this whole process once the “.sdt” file is selected rather than a FASTA file). From this point on, the analysis would be carried out as above for the analysis of a FASTA file.
4. If the new sequence belongs to a new species (i.e., it is <78 % identical to any other known mastrevirus sequence), an appropriate species name should be proposed (see above for details) and the sequence should be given the strain name “A”.
5. If the new sequence belongs to an existing species but is a new strain (i.e., it is <94 % identical to all isolates described for that species), the strain name proposed should follow our alphabetical naming convention, with pertinent details of the new sequence being added to the isolate descriptor field of the name, and not the strain field. If, in the future, multiple variants of the new strain have some unique, well-defined biological property, the strain could then be given a more descriptive name.
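The dataset clean-up described in the guide (removing sub-genome-length sequences and making every sequence start at the same genomic coordinate) can also be scripted. The following Python sketch is one possible way of doing so and is not part of SDT. The function names are hypothetical, the 2500 to 4000 nucleotide window is simply borrowed from the NCBI search term given above, and the position of the virion strand origin of replication in each sequence is assumed to have been identified separately by the user.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(text: str) -> List[Record]:
    """Parse FASTA-formatted text into (header, sequence) records."""
    records: List[Record] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def keep_full_genomes(records: List[Record],
                      min_len: int = 2500,
                      max_len: int = 4000) -> List[Record]:
    """Drop records outside the assumed full-genome length window."""
    return [(h, s) for h, s in records if min_len <= len(s) <= max_len]


def rotate_to(seq: str, start: int) -> str:
    """Re-linearise a circular genome so that it begins at 0-based index start."""
    if not seq:
        return seq
    start %= len(seq)
    return seq[start:] + seq[:start]


def write_fasta(records: List[Record], width: int = 70) -> str:
    """Format records as FASTA text with wrapped sequence lines."""
    lines: List[str] = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

A typical use would be to read the combined FASTA file, pass the records through `keep_full_genomes`, apply `rotate_to` to any sequence that does not begin at the origin of replication, and write the result back out with `write_fasta` before loading it into SDT.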
A pairwise-identity-based mastrevirus species demarcation criterion has been proposed that, while almost entirely consistent with the current mastrevirus classification, includes a very strict pairwise identity calculation protocol that should, if widely applied, significantly reduce the numbers of inappropriate new species proposals that are submitted for consideration by the ICTV. Also proposed is a new mastrevirus-wide pairwise-identity-based strain demarcation threshold. The standardised strain-level classification scheme should provide a consistent framework within which mastrevirus strains belonging to different species can be meaningfully compared with respect to, for example, their relative host and geographical ranges. The main strength of the proposal is that the prescribed pairwise identity calculation protocol is very difficult to manipulate (either intentionally or unintentionally) and will yield identical pairwise identity scores for a given pair of sequences, irrespective of how many other sequences are being compared within a dataset. This means that as the number of deposited full genome sequences of mastreviruses increases, there will be no need in the future to continuously revise the classification of already established species and strains. Perhaps most important from the perspective of the broader virology community with a general interest in virus diversity, however, is the fact that the computer program implementing our pairwise identity calculation protocol, SDT, can also very easily be adopted in standardised protocols for the classification of other virus groups.
The genome-wide pairwise-identity-based proposal for the classification of mastreviruses has been approved by the executive committee of the ICTV, and the document is available at http://talk.ictvonline.org/files/proposals/taxonomy_proposals_plant1/m/plant04/4399.aspx (2012.019abP.A.v3.Mastrevirus-17sp,rem-2sp.pdf).
BM is funded by the University of Cape Town, South Africa. JNC and EM are members of the Research Group AGR-214, partially funded by Consejería de Economía, Innovación y Ciencia, Junta de Andalucía, Spain, cofinanced by FEDER-FSE. The authors would like to thank the Center for High Performance Computing in Cape Town and the Information Communication Technology Services Department at the University of Cape Town for use of their high-performance computing clusters. The authors would additionally like to thank Claude Fauquet for reading through and commenting on the manuscript.