Metagenomic Analyses of Nepoviruses: Diversity, Evolution and Identification of a Hitherto Undescribed Putative Region for Host Range in Subgroup A Species

Datamining and metagenomic analyses of 277 open reading frame sequences of bipartite RNA viruses and variants in the genus Nepovirus documented how delicate it can be to unequivocally identify species, in particular subgroup A and C species, based on some of the currently adopted taxonomic demarcation criteria. These results suggest a possible need to amend the criteria to accommodate pangenome information. In addition, we revealed a host-dependent structure of arabis mosaic virus (ArMV) populations at a cladistic level and confirmed a phylogeographic structure of grapevine fanleaf virus (GFLV) populations. We also identified new putative recombination events for species of subgroups A, B and C. The evolutionary specificity of some capsid regions of ArMV and GFLV that were previously described and biologically validated as vector determinants was circumscribed in silico. Furthermore, a C-terminal segment of the RNA-dependent RNA polymerase of subgroup A species was predicted as a putative host range determinant based on statistically supported higher π values for GFLV and ArMV isolates infecting Vitis spp. compared to non-Vitis-infecting ArMV isolates. This study illustrates how sequence information obtained via high throughput sequencing can increase our understanding of mechanisms that modulate virus diversity and evolution and create new opportunities for advancing studies on the biology of economically important plant viruses.


INTRODUCTION
New viral sequences are uncovered at an unprecedented rate with the advent of high throughput sequencing (HTS). The recovery of numerous complete or almost complete viral genome sequences from different ecosystems (i.e., environmental, human, veterinary, plant) leads to metagenomic analyses that assist the description of unforeseen virus genomes and their diversity. This wealth of information is creating the possibility to use the pangenome for virus taxonomy [17] and to increase our understanding of the mechanisms that modulate virus diversity, evolution, vector and host specificity and epidemiology [36]. However, new challenges arise, for instance, for virus classification. Taxonomy traditionally relies not only on the genetic relationship among sequences of a few virus coding regions, primarily the replicase and/or coat protein coding domains, but also on biological properties such as vector species and host range, among other features [46]. This type of biological information is critical for current plant virus taxonomic classification but is generally lacking when only metagenomic data are available.
Nepoviruses are plant picorna-like viruses of the subfamily Comovirinae in the family Secoviridae [44]. Their transmission occurs in a non-persistent and non-circulative manner by ectoparasitic nematodes from the genera Xiphinema, Longidorus and Paralongidorus [42]. Long distance dissemination of nepoviruses happens with the exchange of uncontrolled propagation material and the use of infected cuttings and budwood for grafting. Seed and pollen transmission have been documented for some but not all nepoviruses and transmission by mites has been observed in rare cases. The genus Nepovirus includes 40 species that are widely distributed in temperate regions (https://talk.ictvonline.org/ictv-reports/ictv_online_report/positive-senserna-viruses/picornavirales/w/secoviridae/591/genus-nepovirus) [22]. Most nepoviruses have a broad natural host range, including annual herbaceous species and perennial woody species, and cause significant crop losses worldwide [11].
The genome of nepoviruses is composed of two single-stranded, positive sense RNAs (RNA1 and RNA2). Both genomic RNAs are necessary for infection in planta. Each encodes a large polyprotein, P1 for RNA1 and P2 for RNA2, which is cleaved by the viral proteinase into functional proteins [11]. P1 encodes proteins that are necessary for replication, including a helicase with a nucleotide triphosphate-binding domain, a proteinase (Pro) and an RNA-dependent RNA polymerase (Pol). Depending on the viral species, one (1A) or two (X1 and X2) proteins are located upstream of the helicase domain. The function of these proteins is not fully elucidated yet. P2 contains the coat protein (CP), multiple units of which assemble into icosahedral virions of 26-30 nm in diameter. The cell-to-cell movement protein (MP) domain is located immediately upstream of the CP domain. Depending on the nepovirus species, either one (2A HP, which is required for the replication of RNA2) or two (X3 and X4 of unknown functions) proteins are located upstream of the MP [13]. Three subgroups of nepoviruses are recognized based on RNA2 properties, including its organization and size, phylogenetic relationships of the CP coding region, and cleavage sites recognized by the viral proteinase [11]. The three nepovirus subgroups are named A, B and C.
One of the most important viral diseases of grapevines is infectious degeneration. This disease is caused by 15 Nepovirus species [6,41]. Most grapevine-infecting nepoviruses are generally restricted to some regions of the world. For example, arabis mosaic virus (ArMV) is limited to European vineyards while tobacco ringspot virus (TRSV), tomato ringspot virus (ToRSV), peach rosette mosaic virus (PRMV) and blueberry leaf mottle virus are present in American vineyards. In contrast, grapevine fanleaf virus (GFLV) is present in most vineyards worldwide.
Genetic diversity of nepoviruses has been extensively characterized with information primarily collected from RT-PCR-based studies combined with Sanger sequencing, generally in the part of the genome that contains the coat protein coding region [14,50]. Similarly, diversity studies and phylogenetic analyses have been reported for members of the family Secoviridae, including nepoviruses [43,49]. However, several new nepovirus species have been recently characterized and the number of complete genome sequences of many nepovirus isolates has exponentially increased in the past five years [1, 2, 4, 12, 15, 18, 23-25, 39, 48, 53, 54, 56]. In this study, we built on these latest advancements in nepovirus research and carried out metagenomic analyses. We focused on RNA1 and RNA2 coding sequences to gain new insights into viral diversity and evolution, and to predict a hitherto undescribed conserved region putatively determining the host range of two subgroup A nepoviruses.

Phylogenetic relationships among nepoviruses
Only complete or near complete (i.e., covering ORF1 or ORF2) sequences of nepovirus RNA1 and RNA2 were considered in this study. All sequences were retrieved from NCBI as of January 2020, our own curated nepovirus sequence repository and a selection of Sequence Read Archive (SRA) datasets from GenBank. Datamining was performed to increase the number of sequences for ArMV, GFLV and mulberry mosaic leafroll-associated virus (MMLRaV), a novel Nepovirus species [31], as previously described [19]. These datamining efforts and Sanger or Illumina sequencing resulted in 46 new sequences (24 for RNA1 and 22 for RNA2) of ArMV, GFLV and MMLRaV. These new sequences were deposited in GenBank (Table S1; accession numbers pending). In total, both genomic RNA sequences were recovered from 29 Nepovirus species, except for olive latent ringspot virus, for which only a single RNA2 sequence but no RNA1 sequence is available (Table 1). Two species (GFLV and ArMV) made up the majority of sequences analyzed in this study, while most species were represented by a single or a few sequences of either genomic RNA (Table 1). Novel nepoviruses used in this study included MMLRaV [31], caraway yellows virus [12], potato virus B [4] and red clover nepovirus A [25]. A few new virus species and new isolates belonging to the genus Nepovirus have been identified since we last consulted NCBI (April 2020). The corresponding sequences were not included in this study (Table S10). In addition, a few viral species described in the literature as potential new nepoviral species, such as Hobart nepovirus 3 [40] or Zhuye pepper nepovirus 1 [3], were not taken into account in this study because sequences were partial or discrepancies were observed between datasets available in NCBI and publications.
Furthermore, only a single sequence was chosen from a group of sequences displaying nucleic acid identity higher than 99%, unless isolates were obtained from different hosts and/or different countries. The complete list of 110 ORF1 (Table S2) and 167 ORF2 (Table S3) nepovirus sequences selected for this study is provided.
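The redundancy rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: sequences are assumed to be already aligned, the record fields (`id`, `seq`, `host`, `country`) and the function names are hypothetical, gap-containing columns are skipped, and the first sequence encountered in a near-identical group is the one retained.

```python
def pairwise_identity(seq_a, seq_b):
    """Fraction of identical positions between two aligned sequences.

    Columns with a gap ('-') in either sequence are skipped (an assumption
    of this sketch, not a setting reported in the text).
    """
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not compared:
        return 0.0
    return sum(a == b for a, b in compared) / len(compared)


def filter_redundant(records, threshold=0.99):
    """Keep one representative per group of near-identical sequences.

    A record is dropped only if an already kept record shares its host and
    country AND has nucleotide identity above the threshold.
    """
    kept = []
    for rec in records:
        is_redundant = any(
            rec["host"] == k["host"]
            and rec["country"] == k["country"]
            and pairwise_identity(rec["seq"], k["seq"]) > threshold
            for k in kept
        )
        if not is_redundant:
            kept.append(rec)
    return kept


# Hypothetical toy records (synthetic sequences, not real isolates).
base = "ACGT" * 50                      # 200 nt
near_copy = "T" + base[1:]              # 1 mismatch: 99.5% identity
divergent = base[:10].translate(str.maketrans("ACGT", "CATG")) + base[10:]  # 95%
records = [
    {"id": "a", "seq": base, "host": "grapevine", "country": "France"},
    {"id": "b", "seq": near_copy, "host": "grapevine", "country": "France"},
    {"id": "c", "seq": near_copy, "host": "hop", "country": "France"},
    {"id": "d", "seq": divergent, "host": "grapevine", "country": "France"},
]
kept_ids = [r["id"] for r in filter_redundant(records)]
```

In this toy example, record `b` is discarded as a near-duplicate of `a`, whereas `c` is retained despite the same identity because it comes from a different host.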
Genetic differentiation analyses of nucleotide sequences confirmed the classification of nepoviruses into three subgroups with higher inter-subgroup than intra-subgroup mean distance values (Table 2). Subgroup B sequences displayed the lowest maximum pairwise distance values, well below inter-subgroup mean distance values, suggesting a well-defined group of virus isolates (Table 2). The inter-subgroup mean distance value was lower than the maximum intra-subgroup mean distance value for subgroup A and C sequences, revealing a greater variability and less well-defined groups of virus isolates (Table 2). After performing an alignment of ORF1 and ORF2 sequences, phylogenetic trees were constructed with the maximum-likelihood method using the best-fit model (GTR+G+I) (Figure 1). Interestingly, species belonging to each subgroup separated better in a tree considering ORF1 rather than ORF2 sequences (Figure 1 and Figure S1). Indeed, ORF1 nucleotide sequences of virus isolates of subgroups A, B and C clustered in separate clades in a tanglegram (Figure 1) or unrooted cladogram (Figure S1). Nucleotide sequences of nepovirus isolates of subgroup B were also well-defined when using ORF2 sequences; however, subgroup A and C ORF2 sequences were scattered in different clades in a tanglegram (Figure 1) or unrooted cladogram (Figure S1). These results suggested that criteria used for the classification of nepoviruses into distinct subgroups are more robust based on ORF1 than ORF2 nucleotide sequence identity. This finding could be considered by the International Committee on Taxonomy of Viruses (ICTV) working group on Secoviridae to eventually define new demarcation criteria for nepoviruses when using pangenome information.

New challenges for species identification within the genus Nepovirus
Species demarcation criteria for nepoviruses are defined by the ICTV (https://talk.ictvonline.org/ictv-reports/ictv_online_report/positive-sense-rnaviruses/picornavirales/w/secoviridae). These include coat protein (CP) amino acid (aa) sequence identity less than 75% and conserved Protease-Polymerase (Pro-Pol) region aa sequence identity less than 80%, among other criteria. We assessed whether these two major demarcation criteria are applicable to the corresponding complete ORF1 and ORF2 aa sequences. Some discrepancies with regard to the intra-species aa identity falling outside the species demarcation were obtained for PRMV ORF1 sequences (78.99%) and ORF2 sequences of cherry leaf roll virus (CLRV), ToRSV, cycas necrotic stunt virus and ArMV (below 74.05%) (Table S4). These results revealed that analyzing complete ORF sequences may be problematic for the identification of new virus species and new genetic variants of existing virus species if the current demarcation criteria pertaining to partial genome sequence information were to be applied. They further suggested that demarcation criteria for species in the genus Nepovirus could be amended to accommodate pangenome information. In addition, the ORF2 sequence of ArMV isolate Butterbur was a clear outlier among ArMV isolates with distinct identities at both the nucleic acid (Figure 2) and amino acid (70.25%, Table S4) levels. According to the original report [21], the pathological and serological features of ArMV-Butterbur are unique and its coat protein is 504 aa long (as for all GFLV CPs), while all other ArMV CPs are 505 aa long. These features underscore the need for additional work to ascertain the taxonomic position of ArMV-Butterbur and its recognition as an isolate of the ArMV species.
Similarly, the ORF1 amino acid sequence identity between isolates of distinct species was higher than 80%, for example for beet ringspot virus (BRSV) and tomato black ring virus (TBRV), as well as for BRSV and artichoke Italian latent virus (AILV) (Table S4). The high identity percentage could also explain the high number of inter-species recombination events identified between these particular species (see the dedicated section below). However, inter-species identity was below the species demarcation level (< 75%) for ORF2 sequences, unquestionably defining BRSV, TBRV and AILV as different species (Table S4). One particular case of interest is grapevine deformation virus (GDefV) [20], a subgroup A nepovirus. GDefV ORF2 aa sequences display 73 and 71% identity with GFLV and ArMV, respectively [16], and GDefV ORF1 aa sequences have higher identity with GFLV (86-89%) than ArMV (73-74%) sequences [8]. According to the species demarcation criteria for nepoviruses, GDefV could be classified as a highly divergent variant of GFLV when focusing on ORF1 sequences but as a new species based on ORF2 sequences.
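The conflict between the two criteria can be made explicit with a short sketch. It assumes, as tested in this section, that the 80% Pro-Pol threshold is applied to complete ORF1 and the 75% CP threshold to complete ORF2 aa identities; the function name and returned labels are hypothetical, and the other ICTV criteria are ignored.

```python
PROPOL_THRESHOLD = 80.0  # % aa identity, Pro-Pol criterion applied here to ORF1
CP_THRESHOLD = 75.0      # % aa identity, CP criterion applied here to ORF2


def demarcation_call(orf1_identity, orf2_identity):
    """Compare a pair of isolates against the two identity thresholds."""
    distinct_by_orf1 = orf1_identity < PROPOL_THRESHOLD
    distinct_by_orf2 = orf2_identity < CP_THRESHOLD
    if distinct_by_orf1 and distinct_by_orf2:
        return "distinct species by both criteria"
    if not distinct_by_orf1 and not distinct_by_orf2:
        return "same species by both criteria"
    return "conflicting"


# Identity values quoted in the text (lower bounds of the reported ranges).
gdefv_vs_gflv = demarcation_call(orf1_identity=86.0, orf2_identity=73.0)
gdefv_vs_armv = demarcation_call(orf1_identity=73.0, orf2_identity=71.0)
```

With the values quoted above, GDefV versus GFLV yields a conflicting call, whereas GDefV versus ArMV is distinct under both thresholds.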

Identification of recombination events within and between nepovirus species
Putative intra-species recombination events have been extensively reported for nepoviruses, mostly in the GFLV RNA2-encoded movement protein (MP) and CP domains [33,34,37,47,50-52]. Recombination events have also been described for ToRSV [54] and grapevine chrome mosaic virus (GCMV) [5]. In addition, many inter-species recombination events have been detected, mostly between ArMV and GFLV [9,33,34,52] but also between GCMV and TBRV [5,28]. With the use of HTS and the recovery of complete virus genomes, recombination events can be detected all along the two genomic RNAs [18]. Here, we used the same corpus of nepovirus sequences and searched for evidence of recombination events using the RDP4 program. Recombination events were only considered when detected with five or more methods with P < 10⁻³ (Table 3, Tables S5 and S6).
Intra-species recombination events were identified in ORF1 and ORF2 sequences, mostly of subgroup A species (Table 3, Table S5 and Figure S2). Almost twice as many putative recombination events were detected in ORF2 compared to ORF1 sequences (Table 3). Both observations most likely reflect the total number of sequences recovered and used in this study. Many recombination events were detected in GFLV and ArMV sequences with some hotspots, i.e., more than one recombinant per site (Table S5 and Figure S2). In addition, recombination events were identified for the first time for AILV, CLRV and raspberry ringspot virus.
All inter-species recombination events detected in this study strictly involved species from the same subgroup (Table S6). Surprisingly, the number of inter-species recombination events was higher than intra-species recombination events within ORF2 sequences (Table 3). For example, 31 inter-species and two intra-species recombination events were detected for subgroup B ORF2 sequences (Table 3 and Figure S2). It should also be noted that all ORF1 recombination events detected for subgroup B involved different species (Table S6). In contrast, all subgroup A recombination events detected involved only ArMV, GFLV and GDefV (Figure 1 and Figure S1), emphasizing again their kinship. Recombination may have been facilitated for these three species because they have the potential to exist or co-exist in grapevine, a common host, for a long time, thus increasing their potential encounter in a single host cell.

Genetic diversity and population differentiation of ArMV from mono- and dicotyledonous plants
ArMV is a 'generalist' with a very broad natural host range, including winter barley, narcissus, Ligustrum vulgare, weeds, hops, berries and grapevines, among other species. Our datamining efforts resulted in the retrieval of 17 complete ORF1 sequences from seven monocotyledonous plants and 10 dicotyledonous plants, as well as 21 complete ORF2 sequences from eight monocotyledonous plants and 13 dicotyledonous plants (Tables S1, S2 and S8).
The overall genetic diversity (π) of ArMV ORF1 and ORF2 was 0.166 ± 0.003 and 0.133 ± 0.003, respectively (Table 4). As previously observed [55], the coding region 2A HP is the most divergent genomic region, showing the highest diversity at the extreme 5' end of ORF2 sequences (Figure 2), mostly due to size differences among isolates. A comparative analysis of ArMV sequences obtained from mono- and dicotyledonous plants revealed a significantly higher diversity in sequences from isolates infecting dicotyledonous plants compared to isolates infecting monocotyledonous plants (0.170 ± 0.003 vs 0.109 ± 0.003 and 0.145 ± 0.003 vs 0.093 ± 0.003, for ORF1 and ORF2 sequences respectively, Table 4). When looking at the evolution pattern (Tajima's D) of ORF1 sequences (Figure 2), values were negative but close to 0 (DT = -0.344; P > 0.1), suggesting that the population of ArMV is evolving as per mutation-drift equilibrium with no specific region under selection. On the other hand, two distinct regions of ORF2 sequences were under selection, with an overall DT of -0.994 (P > 0.1) (Figure 2). One of these two regions covers an aa stretch between two proline-rich segments of the central coding region of the 2A HP domain. The other region is a specific segment of the 2C CP coding region that overlaps the previously defined R4 region, which is involved in specific ArMV transmission by the nematode vector, Xiphinema diversicaudatum [45].
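For readers unfamiliar with the statistic, the sign of DT can be illustrated with a minimal sketch of Tajima's (1989) estimator. This is not the DnaSP implementation used in this study: the function name is hypothetical, the input is assumed to be a list of aligned sequences of equal length, and columns containing gaps are discarded (an assumption of this sketch).

```python
from collections import Counter
from math import comb, sqrt


def tajimas_d(seqs):
    """Tajima's D for aligned sequences; returns None without segregating sites."""
    n = len(seqs)
    if n < 4:
        raise ValueError("this sketch expects at least four sequences")
    n_pairs = comb(n, 2)
    seg_sites = 0      # S: number of polymorphic columns
    pair_diffs = 0     # summed pairwise differences over all columns
    for col in zip(*seqs):
        if "-" in col:
            continue   # complete deletion of gapped columns (assumption)
        counts = Counter(col)
        if len(counts) > 1:
            seg_sites += 1
            # pairs that differ = all pairs minus pairs sharing the same base
            pair_diffs += n_pairs - sum(comb(c, 2) for c in counts.values())
    if seg_sites == 0:
        return None
    k_hat = pair_diffs / n_pairs  # mean number of pairwise differences
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (k_hat - seg_sites / a1) / sqrt(variance)


# Synthetic toy alignments (not real isolates).
# Excess of rare variants (singletons), as expected after a selective sweep.
d_rare = tajimas_d(["AAAAA", "TAAAA", "ATAAA", "AATAA", "AAATA"])
# Variants at intermediate frequency, as expected under balancing selection.
d_balanced = tajimas_d(["AAAA", "AAAA", "TTTT", "TTTT"])
```

An excess of rare variants gives a negative value and variants at intermediate frequency give a positive value, which is the reading applied to the DT profiles discussed here.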
In a previous study [55], ArMV isolates were separated by the size and aa sequence identity of protein 2A HP into four groups (I to IV). Here, we recovered 43 2A HP nucleotide sequences from GenBank (Table S8) and confirmed the existence of three major clades corresponding to groups II, III and IV (Figure S3). Group I was composed of a single sequence located within the group II clade. Sequences belonging to each group are genetically different with a high fixation index (FST ≥ 0.530) and strong statistical support (P ≤ 10⁻⁵) (Table 5). However, the size of the 2A HP domain was not linked to the plant host, with ArMV isolates from grapevine belonging to all four groups. A comparative analysis of 2A coding sequences from mono- and dicotyledonous plants documented a statistically supported genetic differentiation (FST) (Table 5). Genetic differentiation according to mono- and dicotyledonous plants was also observed when looking at any other RNA1 or RNA2 coding region sequences or the complete ORF1 and ORF2 sequences (Table 6, Tables S7-S8 and Figure S3). Distinct FST values between mono- and dicotyledonous plants were also found at lower cladistic levels, strongly suggesting a likely genetic bias based on the plant host (Table S7). Interestingly, similar results have been reported for CLRV, another generalist virus within the genus Nepovirus, for which a host species-dependent population structure was documented using only a short 375 bp sequence corresponding to the extreme 3' part of the 3' non-coding region [38].

Genetic diversity and population differentiation of GFLV from different geographic origins
GFLV infects primarily Vitis spp., making the virus very specialized to this woody plant. The overall nucleotide diversity for GFLV ORF1 (π = 0.127 ± 0.002) and ORF2 (π = 0.130 ± 0.005) sequences was very similar (Table 4). Plotting π along ORF1 sequences (Figure 3) showed a highly divergent region at the 3' end of the 1E Pol domain. This result is consistent with other analyses of this particular aa stretch of P1 that was predicted to be in a helix at the aa level [18,35]. Another highly polymorphic region was detected at the extreme 5' end of ORF2 sequences (Figure 3), which corresponds to a region where intra- and inter-species recombination events have been predicted (see the recombination section above and [52]). On the other hand, similar to ArMV, a significant drop in nucleotide diversity is observed within a segment of 2A HP sequences located between two highly conserved proline-rich regions. The evolution of this particular domain of ORF2 sequences is not neutral, with DT values well below 0 (Figure 3), underlining a conservative selection with regard to the remainder of ORF2 sequences. Similarly to ArMV, the same trend was observed for the R4 region of the 2C CP domain [45]. Interestingly, these two regions, which display the lowest DT values suggesting a recent selective sweep, are mostly located in sequences recovered from grapevines from the New World (Figure S4). Regarding the evolution pattern of GFLV ORF1 sequences, values were negative but very close to 0 (DT = -0.758) with two sites under selection (P > 0.1): the first site is located at the extreme 5' end and the second site is positioned within the 1B Hel domain (Figure 3). When looking at the evolutionary pattern of P1 and P2 (dN-dS), most codons were under negative or neutral pressure (data not shown), as previously described [49].
No major differences were observed when separating GFLV sequences by geographic region (France vs rest of the world or Old vs New World), with very similar π and DT values (Table 4 and Figure S4). However, a genetic structuration between geographic regions was observed, although FST values were extremely low (Table S9), with differences in the evolution pattern of GFLV isolates from different parts of the world. This was even more noticeable when focusing on ORF2 sequences and separating sequences by continent [i.e., Europe, Americas (combining North and South America) and Asia (far-East, Turkey and Russia)]. All FST values were statistically supported (P < 0.001). However, the disparity in FST values indicated that European and American GFLV variants were closely related to each other compared to the Asian variants. This observation was confirmed when grouping sequences into seven countries or restricted regions of the world (France, Slovenia, Italy, USA, Chile, far-East and Switzerland). Some FST values were very high, underscoring a strong genetic structuration among regions around the world, as confirmed when comparing sequences from the far-East (Iran and China) with other regions of the world (Table S9). This genetic differentiation of far-East GFLV populations was previously described using GFLV 2B MP sequences [47]. Altogether, these observations suggest a specific geographic evolution and genetic structuration of the virus.

Common characteristics and major differences between grapevine-infecting ArMV and GFLV isolates
ArMV and GFLV are very closely related but different species (Figure 1). They share many characteristics such as hosts (i.e., grapevine), closely related vectors (Xiphinema spp.), similar symptomatology and many natural inter-species recombinants. While genetically different (Table S4, Figures 4 and 5), similar patterns in their respective genetic diversity were observed along ORF1 sequences, especially when separating grapevine-infecting ArMV isolates from non-Vitis-infecting ArMV isolates. As detailed above, one of the hallmarks of GFLV is a higher π value at the C-terminus of 1E Pol. A higher π at the C-terminal end of 1E Pol was also clearly identified in Vitis-infecting but not in non-Vitis-infecting ArMV isolates (Figure 4). Such specific increased genetic diversity in Vitis-infecting ArMV and GFLV isolates was not due to the overlap of a hidden ORF (Figure S6), as described for sobemoviruses [30]. This diversity was also observed at the amino acid level with a percent identity higher than 80.41% in the case of non-Vitis-infecting ArMV isolates, but as low as 65.54% and 56.08% for Vitis-infecting ArMV and GFLV isolates, respectively (Figure 4). Such high divergence was not found when specifically looking at the first 148 aa of the 1E Pol domain, with an identity above 82%. While highly divergent between ArMV and GFLV (Figure S7), the 1E Pol C-terminus displays only two aa conserved between Vitis-infecting ArMV and GFLV isolates, at positions 683 and 746 (Figure S8). Could these two residues be implicated in host adaptation mechanisms? More work is needed to address this hypothesis.
Many characteristics are common between ORF2 sequences of ArMV and GFLV (Figure 5). For example, a higher genetic diversity is detected at the 5' end of the 2A HP domain, partly as a result of indels. However, major differences between the two species were found when focusing specifically on the 2C CP domain, with a clearly different genetic diversity in the R4-R5 region between ArMV and GFLV (Figure 5 and Figure S5). This region is important for vector transmission [45]. When looking at the evolution pattern, most of the ORF2 sequences seem to be evolving randomly while two regions display a non-random evolution pattern. One of these two regions corresponds to the 2A HP domain and the other to the R4-R5 region within the 2C CP domain, both showing strong constraints.

CONCLUSION
Datamining and metagenomic analyses of complete ORF sequences unveiled new insights into the diversity and evolution of virus species from the genus Nepovirus, with a special emphasis on GFLV and ArMV, the two most important viruses involved in degeneration disease of grapevine in France. Our results confirmed a probable phylogeographic structure of GFLV populations and revealed a host-dependent structure of ArMV populations at a cladistic level. The C-terminus of the RNA-dependent RNA polymerase of GFLV and ArMV is predicted as a potential host range determinant. More work is needed to biologically test this hypothesis. Furthermore, some of the current species demarcation criteria that are applied to limited genomic regions may not hold for all nepoviruses at the ORF sequence level. This suggests the need to adapt some of the taxonomic criteria to pangenome information. Nonetheless, with an ever-increasing amount of sequence data obtained through HTS, there are new opportunities for studying Nepovirus biology, characterizing nepoviral communities in plants, improving Nepovirus taxonomy and using this new pangenomic and population-level information to develop resistance strategies (i.e., cross-protection, resistance allele characterization, new plant biotechnologies) and to test them for a better durability of protection.

Sequence analyses, genetic diversity and recombination detection
The complete to near complete nepovirus ORF1 and ORF2 sequences were retrieved from NCBI as of April 2020, our own curated nepovirus sequence repository obtained by analyses of high throughput or Sanger sequencing datasets, and a selection of Sequence Read Archive datasets from GenBank [19]. In total, 110 ORF1 sequences and 167 ORF2 sequences were used in this study (Tables S2 and S3). In addition, sequences of specific domains were retrieved from NCBI (see Table S8).
Multiple sequence alignments by codon and maximum likelihood (ML)-based phylogenetic trees were prepared using MUSCLE [7] implemented in MEGA7 and MEGAX software [26,27], excluding the viral untranslated regions (UTRs). The best ML-fitted model for each sequence alignment was used, and nodes in phylogenetic trees were validated by bootstrap analyses (100 replicates). For visualization, FigTree v. 1.3.1 was used (http://tree.bio.ed.ac.uk/). The diversity index (π), which is the average number of nucleotide substitutions per site between any two sequences in a multi-sequence alignment, and the variation of π along genome sequences were evaluated by sliding window analyses (length: 80, step size: 20) using DnaSP v. 6.12.03 [29] and MEGAX.
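The definition of π and the sliding window parameters given above can be illustrated with a minimal sketch. It is not the DnaSP or MEGAX implementation: distances are uncorrected proportions of differing sites, gap-containing columns are dropped within each window, windows are counted in alignment columns, and the function names are hypothetical.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Average proportion of differing sites over all pairs of aligned sequences."""
    columns = [col for col in zip(*seqs) if "-" not in col]
    n = len(seqs)
    if n < 2 or not columns:
        return 0.0
    n_pairs = n * (n - 1) // 2
    diffs = sum(x != y for col in columns for x, y in combinations(col, 2))
    return diffs / (n_pairs * len(columns))


def sliding_pi(seqs, window=80, step=20):
    """Return (window midpoint, pi) tuples along the alignment."""
    length = len(seqs[0])
    profile = []
    for start in range(0, length - window + 1, step):
        chunk = [s[start:start + window] for s in seqs]
        profile.append((start + window // 2, nucleotide_diversity(chunk)))
    return profile


# Synthetic toy alignment: all variation sits in the first 10 columns.
toy = ["A" * 100, "A" * 100, "T" * 10 + "A" * 90]
overall_pi = nucleotide_diversity(toy)   # 20 differences / (3 pairs * 100 sites)
profile = sliding_pi(toy)                # windows starting at columns 0 and 20
```

In the toy alignment the first window captures all of the variation and the second window has π = 0, which is the kind of local contrast the π profiles in Figures 2 to 5 are meant to reveal.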
A search for potential recombination signals was performed using all seven algorithms implemented in the RDP program v4.97 [32]. The default settings were used for each algorithm and only recombination events detected by five or more methods were considered.
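The acceptance rule for recombination events can be sketched as a simple filter. The sketch does not parse RDP4 output: the dictionary layout and the p-values are hypothetical, the method names follow the abbreviations listed in Tables S5 and S6, and the P < 10⁻³ cut-off is the one stated in the recombination results.

```python
METHODS = ("RDP", "Bootscan", "GeneConv", "MaxChi", "Chimaera", "Siscan", "3Seq")


def supporting_methods(p_values, alpha=1e-3):
    """Methods that report the event with a p-value below alpha."""
    return [m for m in METHODS if p_values.get(m) is not None and p_values[m] < alpha]


def keep_event(p_values, min_methods=5, alpha=1e-3):
    """True when at least min_methods of the seven methods support the event."""
    return len(supporting_methods(p_values, alpha)) >= min_methods


# Hypothetical events; None means the method did not detect the event.
event_ok = {"RDP": 1e-8, "Bootscan": 2e-6, "GeneConv": 5e-5, "MaxChi": 1e-4,
            "Chimaera": 3e-4, "Siscan": None, "3Seq": 0.02}
event_weak = {"RDP": 1e-8, "Bootscan": 2e-6, "GeneConv": 0.01, "MaxChi": 1e-4,
              "Chimaera": None, "Siscan": None, "3Seq": 0.02}
```

Here `event_ok` is supported by five methods and would be retained, whereas `event_weak` is supported by only three and would be discarded.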
Differences in nucleotide diversity of viral populations defined using different modalities were tested by analysis of molecular variance (AMOVA), as implemented in Arlequin v. 3.5.1.2 [10]. AMOVA calculates the FST index, which expresses the between-groups fraction of total genetic diversity. The significance of these differences was evaluated by performing 1000 sequence permutations.
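The logic of an FST estimate with a permutation test can be illustrated as follows. This sketch deliberately swaps in a simpler estimator than Arlequin's AMOVA: it uses the form FST = 1 - Hw/Hb of Hudson, Slatkin and Maddison (1992), where Hw and Hb are the mean numbers of pairwise differences within and between two groups. Function names, the random seed and the toy sequences are hypothetical, and each group is assumed to contain at least two aligned sequences.

```python
import random
from itertools import combinations, product


def mean_differences(pairs):
    """Mean number of differing sites over a collection of sequence pairs."""
    pairs = list(pairs)
    return sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)


def hudson_fst(pop1, pop2):
    """FST = 1 - Hw/Hb (not the AMOVA estimator used by Arlequin)."""
    h_within = (mean_differences(combinations(pop1, 2))
                + mean_differences(combinations(pop2, 2))) / 2
    h_between = mean_differences(product(pop1, pop2))
    return 1.0 - h_within / h_between if h_between > 0 else 0.0


def permutation_test(pop1, pop2, n_perm=1000, seed=0):
    """Shuffle sequences between groups; return observed FST and its p-value."""
    rng = random.Random(seed)
    observed = hudson_fst(pop1, pop2)
    pooled = list(pop1) + list(pop2)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if hudson_fst(pooled[:len(pop1)], pooled[len(pop1):]) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Synthetic toy groups that differ by five fixed substitutions.
group_a = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATA", "AAAAAAATAA"]
group_b = ["CCCCCAAAAA", "CCCCCAAAAT", "CCCCCAAATA", "CCCCCAATAA"]
fst, p_value = permutation_test(group_a, group_b)
```

A high FST with a small permutation p-value indicates that group membership (for example host type or geographic origin) explains a substantial share of the sequence variation.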
Tajima's D (DT) and sliding window analyses were conducted using DnaSP v. 6.12.03 [29] in order to distinguish the viral populations evolving randomly (per mutation-drift equilibrium; DT = 0) from those evolving under a nonrandom process (DT > 0: balancing selection, sudden population contraction; DT < 0: recent selective sweep, population expansion after a recent bottleneck).

Table 1. List of nepoviral species used in this study. Species from each subgroup (A in red, B in blue and C in green) are indicated by a bold capital letter when officially ratified by ICTV or in lowercase when ICTV ratification is pending. The number of ORF1 and ORF2 sequences used for each species is provided. The length of both ORFs is indicated for each species. * indicates a shorter sequence for an isolate of cycas necrotic stunt virus (reference isolate) and red clover nepovirus A. na: not applicable.
Table 2. Genetic distance within and between subgroup species of the genus Nepovirus. Mean nucleotide distance is indicated in bold and standard error in italic. The maximum value of pairwise distance within subgroups is provided. N corresponds to the number of sequences used for calculation.
Table 3. Number of putative intra- and inter-species recombination events detected by RDP4 for species of the three subgroups in the genus Nepovirus. Detailed information on the genomic location of recombination events, major and minor parents and P-values is provided in Tables S5 and S6.
Tables S2 and S3.
Table 5. Genetic differentiation of arabis mosaic virus (ArMV) populations for the complete ORFs and the different coding regions. Fixation index (FST) with associated P-value (P-values are significant when < 0.05) and the number of sequences (N) are indicated. Sequences were grouped by either the plant type (monocotyledonous versus dicotyledonous) or the size of the 2A HP coding region (group II, III and IV).
replicates and the scale bar shows genetic distance. Graphics represent π (substitution/sites) and Tajima's D (DT) for evolution along the ORF1 sequence. Percentages correspond to minimum aa identity for both framed regions within the polymerase domain. The colored bar with # corresponds to a statistically validated region at 0.05.
Figure 5. Phylogenetic and genetic diversity analyses of grapevine fanleaf virus (GFLV), arabis mosaic virus (ArMV) and grapevine deformation virus (GDefV) using a corpus of 102 ORF2 nucleotide sequences. GFLV isolates are in blue, GDefV in peach, ArMV isolates from grapevines (Vitis-ArMV) in green and ArMV isolates from other plants (non-Vitis ArMV) in red. Maximum likelihood phylogenetic trees are shown. Numbers at each node indicate bootstrap values based on 100 replicates and the scale bar shows genetic distance. Graphics represent π (substitution/sites) and Tajima's D (DT) for evolution along the ORF2 sequence. The boxed area corresponds to the R4-R5 region of the 2C CP domain. Colored bars with # and * correspond to statistically validated regions (P-values at 0.05 and 0.001, respectively).

Supp Table 1. List of the new nepoviral sequences obtained by datamining or from our own curated sequences repository that were released in this study. Virus name, acronym, specific ID, country of origin, sampling year, sequence release date and plant host are indicated for RNA1 or RNA2 sequences.
Supp Table 2. Complete list of the 110 nepoviral ORF1 sequences used in this study. Accession number, virus name, acronym, specific ID (NCBI), country of origin, sampling year (or paper publication date), NCBI release date, plant host species are provided.
Supp Table 3. Complete list of the 167 nepoviral ORF2 sequences used in this study. Accession number, virus name, acronym, specific ID (NCBI), country of origin, sampling year (or paper release date), NCBI release date, plant host species are provided.
Supp Table 4. Amino acid identity within and between species for which the demarcation is ambiguous under the criteria of less than 80% for ORF1 and 75% for ORF2. Percentages between species that are higher or lower than the criteria are highlighted in grey and bold, respectively.
Supp Table 5. Intra-species recombination events detected on nepoviral ORF1 and ORF2 sequences using Recombination Detection Program package (RDP v.4.97). The species name is provided with the number of recombinant events, recombinant ID (the first one listed by RDP4 if multiple events were detected), major and minor parents with respective similarity percentages, recombination event location without gaps, number of programs identifying the event, names of the programs (R: RDP, B: Bootscan, G: GeneConv, M: MaxChi, C: Chimaera, S: Siscan and 3: 3Seq) and associated P-value.
Supp Table 6. Inter-species recombination events detected on nepoviral ORF1 and ORF2 sequences using Recombination Detection Program package (RDP v.4.97). The species name is provided with the number of recombination events in parentheses for that particular location, type, recombinant ID (the first one listed by RDP4 if multiple events were detected), major and minor parents with respective similarity percentages, recombination event location without gaps, number of programs identifying the event, names of the programs (R: RDP, B: Bootscan, G: GeneConv, M: MaxChi, C: Chimaera, S: Siscan and 3: 3Seq) and associated P-value.
Supp Table 7. Measurements of genetic differentiation (fixation index, FST) and associated statistics (P-value) of arabis mosaic virus (ArMV) populations for ORF1 and ORF2 sequences and different coding regions. Isolates were categorized according to the type of plants infected (monocotyledonous, dicotyledonous and lower cladistic levels). The number of sequences (N) for each category is indicated. Only statistically validated FST values are provided (P-value < 0.05).
Supp Table 8. List of sequences used for the ArMV population differentiation analyses presented in Supp Table 7.
Supp Table 9. Measurements of genetic differentiation of grapevine fanleaf virus (GFLV) populations grouped by the geographical origin of the isolates. Sequences from the old world (Turkey, Iran, France, Hungary, Germany, Slovenia, Russia, Italy, Switzerland); new world (Canada, USA, Chile, China, South Africa); FR: France; RoTW: rest of the world other than FR; EU: Europe; Am: Americas; As: Asia; IT: Italy; Sl: Slovenia; US-CA: North America; CH: Switzerland; CL: Chile and FE: Far East, were analyzed. The geographic origin of isolates is specified for each sequence in Tables S2 and S3. Fixation index (FST) with associated P-value is indicated; non-significant P-values (> 0.05) are underlined and in italic. Grey shading highlights FST values for Far East and Asia sequences.
Supp Table 10. List of complete nepovirus sequences submitted to GenBank after April 2020 that were not taken into account in this study.
Supp  Table S8. Numbers at each node indicate bootstrap values based on 100 replicates. The scale bar shows genetic distance. ArMV isolates infecting monocotyledonous plants are shown in red and those infecting dicotyledonous plants in green.
Supp Figure 4. Genetic diversity analyses of grapevine fanleaf virus (GFLV) isolates from different countries using a corpus of 40 ORF1 (left panel) and 80 ORF2 (right panel) nucleotide sequences. Colors correspond to different regions of the world, with New World isolates in yellow, Old World isolates in green, French isolates in blue and isolates from the rest of the world in red. Graphics represent π (substitution/sites) and Tajima's D (DT) for evolution along the ORF1 and ORF2 sequences. Colored bars with # and * correspond to statistically validated regions (P-values at 0.05 and 0.001, respectively).
Supp Figure 5. Phylogenetic and diversity analyses of the 2C CP coding region using a corpus of 102 nucleotide sequences of grapevine fanleaf virus (GFLV), arabis mosaic virus (ArMV) and grapevine deformation virus (GDefV) isolates with π (substitution/sites) shown along the sequence (a), phylogenetic relationships using a maximum likelihood tree (b), and consensus amino acids obtained by aligning regions previously predicted and/or validated in nematode-mediated transmission (R1-R5) (c). GFLV isolates are in blue, all ArMV isolates in yellow, grapevine-infecting ArMV isolates (Vitis-ArMV) in yellow, and non-grapevine-infecting ArMV isolates (Non-vitis-ArMV) in red. For panel a, the three CP domains (A, B, C) are delineated. For the phylogenetic tree, numbers at each node indicate bootstrap values based on 100 replicates and the scale bar shows genetic distance.
Supp Figure 6. Comparative analysis of the six possible reading frames of ORF1 from 40 grapevine fanleaf virus (GFLV) and 17 arabis mosaic virus (ArMV) sequences. Graphic representation of π (substitution/sites) along the sequence (top) and of the +1, +2, +3, -1, -2, -3 reading frames with stop codons in blue and gaps in gray (bottom). Within each reading frame that consists of 57 lines or sequences, the first 40 lines correspond to GFLV sequences, followed by nine Vitis-infecting ArMV sequences and finally by eight non-Vitis-infecting ArMV sequences. GFLV sequences are in blue, Vitis-infecting ArMV sequences in green and non-Vitis-infecting ArMV in red.
Supp Figure 7. Phylogenetic tree reconstructed from the alignment of the distal 148 and 149 amino acids of the polymerase (Pol) from grapevine fanleaf virus (GFLV) and arabis mosaic virus (ArMV) isolates. The scale bar below the tree shows genetic distance. GFLV sequences are delineated in a blue area, Vitis-infecting ArMV sequences in a green area and non-Vitis-infecting ArMV in a red area.
Supp Figure 8. Alignment of consensus amino acid sequences of the C-terminal part of the polymerase (Pol) from grapevine fanleaf virus (GFLV) isolates, Vitis-infecting isolates of arabis mosaic virus (ArMV) and non-Vitis-infecting ArMV isolates. Arrows indicate identical amino acids between grapevine viruses but different from ArMV isolates infecting plants other than grapevines. The blue rectangle shows the distal 148/149 aa in the Pol domain.

Figures

Figure 1
Tanglegram of maximum likelihood phylogenetic trees inferred from 110 ORF1 and 167 ORF2 nucleotide sequences of nepoviruses. Colors correspond to species affiliation to one of the three nepovirus subgroups: red for subgroup A sequences; blue for subgroup B sequences; and green for subgroup C sequences. Clades with several sequences from the same species are collapsed. Numbers at nodes indicate bootstrap values based on 100 replicates. The scale bar corresponds to the number of substitutions per site.

Phylogenetic and diversity analyses of grapevine fanleaf virus (GFLV), arabis mosaic virus (ArMV) and grapevine deformation virus (GDefV) using a corpus of 58 ORF1 nucleotide sequences. GFLV isolates are in blue, GDefV isolates in peach, ArMV isolates from grapevines (Vitis-ArMV) in green and ArMV isolates from other plants (non-vitis ArMV) in red. Maximum likelihood phylogenetic trees are shown. Numbers at each node indicate bootstrap values based on 100 replicates and the scale bar shows genetic distance. Graphics represent π (substitution/sites) and Tajima's D (DT) for evolution along the ORF1 sequence. Percentages correspond to minimum aa identity for both framed regions within the polymerase domain. The colored bar with # corresponds to a statistically validated region at 0.05.

Figure 5
Phylogenetic and genetic diversity analyses of grapevine fanleaf virus (GFLV), arabis mosaic virus (ArMV) and grapevine deformation virus (GDefV) using a corpus of 102 ORF2 nucleotide sequences. GFLV isolates are in blue, GDefV in peach, ArMV isolates from grapevines (Vitis-ArMV) in green and ArMV isolates from other plants (non-vitis ArMV) in red. Maximum likelihood phylogenetic trees are shown. Numbers at each node indicate bootstrap values based on 100 replicates and the scale bar shows genetic distance. Graphics represent π (substitution/sites) and Tajima's D (DT) for evolution along the ORF2 sequence. The boxed area corresponds to the R4-R5 region of the 2C CP domain. Colored bars with # and * correspond to statistically validated regions (P-values at 0.05 and 0.001, respectively).

Supplementary Files
This is a list of supplementary files associated with this preprint.