Evidence of differential spreading events of grapevine Pinot gris virus in Italy using datamining as a tool

Since its identification in 2003, grapevine Pinot gris virus (GPGV, Trichovirus) has been detected in most grape-growing countries. So far, little is known about the epidemiology of this newly emerging virus. In this work, we used datamining as a tool to monitor in silico the sanitary status of three vineyards in Italy. All data used in the study were recovered from a previously published work whose data are publicly available as SRA (Sequence Read Archive, NCBI) files. Although incomplete, the knowledge gathered from this work is informative, with evidence of differential accumulation of the virus in grapevine according to year, location, and variety-rootstock association. Additional data regarding GPGV genetic diversity were collected. Some advantages and pitfalls of datamining are discussed.

Since its characterization in Italy (Giampetruzzi et al., 2012), grapevine Pinot gris virus (GPGV, Trichovirus, Betaflexiviridae) has been detected in most grapevine-growing regions around the world. Generally, the virus is detected using serological and/or molecular tools. In this work, we describe datamining as a potential additional method to identify grapevines infected with this virus and thereby better estimate its worldwide distribution. While this work cannot be considered an epidemiological study per se, it offers valuable information on the virus (i.e., its geographic distribution and genetic composition): it provides a snapshot of the situation in three different vineyards in Italy at a specific time and gives new insight into GPGV accumulation, introduction and transmission.
This work is based on data from a previously published study on the contribution of genotype, environment and their interaction to the berry transcriptome (Dal Santo et al., 2018). Two cultivars, Cabernet Sauvignon and Sangiovese, were planted in three different locations: Montalcino, Bolgheri and Riccione. The first two are located in the Tuscany hills and on the Tuscany coast, respectively, while the third is on the Adriatic coast (Fig. 1). To minimize genetic variation, the researchers used the same clonal material for each cultivar: clone R5 for Cabernet Sauvignon and clone VCR23 for Sangiovese. In addition, three different rootstocks were tested in the study: Kober 5BB, 420A and 161-49 C. After uploading the 72 SRA files generated by that study, all samples were analyzed for the presence of GPGV using CLC Genomics Workbench 12.0 (Aarhus, Denmark) as previously described. Presence was first assessed by mapping reads to a collection of curated GPGV reference sequences. For samples with reads corresponding to GPGV, de novo assembly was performed and the assemblies were further extended by multiple rounds of residual read mapping, as previously described (Nourinejhad Zarghani et al., 2018). The resulting genome sequences were verified using very stringent mapping parameters (length of 0.95/similarity of 0.95).
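To make the two mapping thresholds concrete, the following minimal Python sketch shows what a 0.95/0.95 filter could mean for a single read alignment. It assumes that "length" is the fraction of the read that is aligned and that "similarity" is the identity within the aligned part. The function name and the example numbers are hypothetical, and the sketch is not the mapping software's actual implementation.

```python
def passes_stringent_filter(read_length, aligned_length, matches,
                            min_length_fraction=0.95,
                            min_similarity_fraction=0.95):
    """Return True if a read alignment meets both thresholds.

    Assumptions (not taken from the source):
      length fraction     = aligned_length / read_length
      similarity fraction = matches / aligned_length
    """
    if read_length <= 0 or aligned_length <= 0:
        return False
    length_fraction = aligned_length / read_length
    similarity_fraction = matches / aligned_length
    return (length_fraction >= min_length_fraction
            and similarity_fraction >= min_similarity_fraction)


# Hypothetical 100-nt reads
print(passes_stringent_filter(100, 96, 92))    # 96% aligned, about 95.8% identical
print(passes_stringent_filter(100, 90, 90))    # only 90% of the read is aligned
print(passes_stringent_filter(100, 100, 94))   # fully aligned, but 94% identical
```

Under these assumptions only the first read is retained, which illustrates why such settings limit the mapping of reads from divergent variants or from other viruses.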
Our datamining study revealed that only samples from Bolgheri and Riccione were positive for GPGV. The virus was barely detected in a few samples from Montalcino (Table 1), and no complete sequence could be recovered from them. These 'Low Read Count' samples were probably the result of 'intra-lane contamination', as previously described in other studies. When using RPKM (reads per kilobase per million mapped reads) as a proxy for virus accumulation in the samples, our analyses revealed differential accumulation of GPGV according to several variables (Fig. 2). GPGV appears to accumulate more in berries in 2011 than in 2012 (P < 10⁻⁵) and more at the later stage of fruit development, mid-ripening, than at pre-veraison (P < 10⁻⁴). The cultivar-rootstock combination also appears to influence virus accumulation: GPGV seems to accumulate more in Cabernet Sauvignon grafted onto either 161-49 C or Kober 5BB than in Sangiovese grafted onto 420A, at either location (P ≤ 10⁻⁴). In addition, accumulation differed according to the location where the grapevines were grown (P < 10⁻⁵), with GPGV accumulating more in Riccione than in Bolgheri.
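As a reminder of how RPKM normalizes a raw read count, here is a minimal Python sketch of the standard formula. The read counts and the 7,000-nt reference length are hypothetical values chosen for illustration, and the choice of denominator (total reads in the library) is an assumption. The values reported here were produced by the analysis software, not by this code.

```python
def rpkm(mapped_reads, reference_length_nt, total_reads):
    """Reads per kilobase of reference per million reads in the library.

    RPKM = mapped_reads * 1e9 / (reference_length_nt * total_reads)
    """
    if reference_length_nt <= 0 or total_reads <= 0:
        raise ValueError("reference length and total reads must be positive")
    return mapped_reads * 1e9 / (reference_length_nt * total_reads)


# Hypothetical sample: 1,000 reads mapped to a 7,000-nt viral genome
# in a library of 20 million reads.
print(round(rpkm(1_000, 7_000, 20_000_000), 2))
```

Because the count is scaled by both reference length and library size, samples sequenced to different depths can be compared on the same scale.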
Table 1 footnotes: The number in the GPGV column corresponds to the number of complete genomes assembled de novo in each sample. ✓ indicates that reads mapped onto the GPGV genome, as shown in the Mapped read counts columns, but no complete genome could be obtained from contiguous sequences; RPKM (reads per kilobase per million) values were always below 3 when no genome was assembled. This work was performed with CLC Genomics Workbench using very stringent mapping parameters* (0.95/0.95).
When delving into the genetics of the virus, further information emerged. Overall, 47 complete (or near-complete, covering at least all open reading frames) GPGV genome sequences were assembled (Table 1). Sequences were submitted to GenBank (BK011089-BK011101); the other sequences are available upon request. Phylogenetic analysis (Fig. 1) showed that three major clades of GPGV infect these grapevines, each displaying a high intra-clade nucleotide identity (≥ 98.10%). Interestingly, GPGV sequences clustered very well by location (Fig. 1, colors) but independently of cultivar. Fixation index (FST) analyses (Fig. 1) confirmed the genetic differentiation of the viral population according to location, with a high and statistically significant FST value (FST = 0.608, P ≤ 10⁻⁵). Such segregation by location was also observed between grapevines grafted onto rootstock 161-49 C, used exclusively in Bolgheri, and grapevines grafted onto Kober 5BB, used exclusively in Riccione (FST = 0.564, P ≤ 10⁻⁵). Comparison of sequences obtained from the 420A rootstock also yielded statistically significant FST values; however, these values were lower than those mentioned above, most likely because 420A was used as a rootstock in both locations. Furthermore, the genetic background of the grapevine cultivars, which were also present in both locations, had no statistically significant impact on viral populations (FST = 0.018, P = 0.185). In addition to the presence/absence of GPGV in the samples, this work highlights two distinct situations at the viral genomic level. One vineyard (Bolgheri) is infected by a single variant, defined for GPGV by an identity percentage ≥ 99% (Hily et al., 2020), while the other vineyard (Riccione) is infected by at least two variants. These results indirectly, but strongly, suggest independent introduction/transmission events of GPGV in two of the three locations examined in this transcriptomic study. These situations are probably the result of transmission through grafting (Saldarelli et al., 2014) and movement of infected material, as previously suggested (Al Rwahnih et al., 2016; Fajardo et al., 2017; Wu & Habili, 2017). Transmission may also have occurred horizontally via vectors, either in the nursery or in the vineyard, with distinct variants of the virus being detected at each location regardless of the clonal background of the grapevine.
In addition, the detection of these different variants according to location, each probably differing in fitness, may result in the differential virus accumulation observed above. Overall, this in silico work adds to the so-far limited knowledge on the natural transmission of GPGV in vineyards (Bertazzon et al., 2020; Hily et al., in press).
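For readers unfamiliar with the fixation index, the idea behind an FST comparison between two locations can be sketched in Python as follows. The text does not state which estimator or software was used, so this sketch assumes the sequence-based form FST = 1 - Hw/Hb (mean pairwise differences within groups over mean pairwise differences between groups) together with a label-permutation test. The short aligned sequences are hypothetical toy data, not GPGV genomes.

```python
import random
from itertools import combinations, product


def n_differences(seq_a, seq_b):
    """Number of differing positions between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def mean_pairwise(pairs):
    """Mean number of differences over an iterable of sequence pairs."""
    pairs = list(pairs)
    return sum(n_differences(a, b) for a, b in pairs) / len(pairs)


def fst(pop_1, pop_2):
    """FST = 1 - Hw / Hb for two groups of aligned sequences (assumed estimator)."""
    h_within = (mean_pairwise(combinations(pop_1, 2))
                + mean_pairwise(combinations(pop_2, 2))) / 2
    h_between = mean_pairwise(product(pop_1, pop_2))
    if h_between == 0:
        return 0.0
    return 1 - h_within / h_between


def permutation_p_value(pop_1, pop_2, n_permutations=999, seed=0):
    """Share of random relabelings with an FST at least as high as observed."""
    rng = random.Random(seed)
    observed = fst(pop_1, pop_2)
    pooled = list(pop_1) + list(pop_2)
    n_1 = len(pop_1)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if fst(pooled[:n_1], pooled[n_1:]) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Hypothetical toy alignments standing in for two locations
location_a = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATA"]
location_b = ["CCCCCAAAAA", "CCCCCAAAAT", "CCCCCAAATA"]
print(round(fst(location_a, location_b), 3))
print(permutation_p_value(location_a, location_b))
```

Values close to 1 indicate that most of the diversity lies between the groups, and values near 0 indicate little differentiation. With only three toy sequences per group the permutation P value cannot become small, which is one reason a larger set of assembled genomes is needed for such a test.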
Datamining is becoming a very important and powerful tool to identify new pathogens, as well as new variants of known viruses, for example from the now well-known Coronaviridae family (https://virological.org/t/serratus-the-ultra-deep-search-todiscover-novel-coronaviruses/516) (last visited 04/2021). Datamining can also be used to increase the number of complete genome sequences available for downstream studies, for example on the evolutionary history of specific viruses (Hily et al., 2020). In this work, datamining can be considered an in silico tool to monitor post facto the sanitary status of any vineyard around the world from which data have already been collected, published and made publicly available. Datamining has a few pitfalls. We do not always have all the details regarding the samples (i.e., metadata such as the exact origin and location of collection), and we cannot choose the technology with which the data were obtained or the quality of the samples. However, the information being generated is still very valuable: it has already been paid for and is therefore almost free (other than the time of analysis), it is available to anyone and, most of all, it is ever growing.
Fig. 2 Box plots of RPKM as a function of different variables. From left to right: year, developmental stage (MR: mid-ripening, PV: pre-veraison), rootstock, overall location, Sangiovese grafted onto 420A, grapevine cultivated in Bolgheri and in Riccione (CS: Cabernet Sauvignon). On each box, the central line is the median, the edges of the box are the 25th and 75th percentiles, the whiskers extend to the most extreme data points, and the dots are outliers. Since RPKM values did not follow a normal distribution, a generalized linear model (GLM) with Poisson link function was used. The significance of each effect was tested using a Wald chi-square test, and P values below the 0.05 threshold were considered statistically significant. All analyses and graphic representations were made with R software version 4.0.2 (R Core Team 2012).
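The kind of test named in the Fig. 2 caption can be illustrated with a minimal Python sketch for the simplest case: a Poisson GLM with a log link and a single two-level factor (for example, two years). In that special case the fit has a closed form, so no iterative fitting is needed. The analyses reported here were run in R and may have involved different model structures. The function name and the values below are hypothetical, and overdispersion is ignored.

```python
import math


def poisson_two_group_wald(group_a, group_b):
    """Wald chi-square test for one two-level factor in a Poisson GLM (log link).

    For this special case the fitted mean of each group equals its sample
    mean, so the coefficient and its variance are available in closed form.
    """
    sum_a, sum_b = sum(group_a), sum(group_b)
    if sum_a <= 0 or sum_b <= 0:
        raise ValueError("each group needs a positive total")
    mean_a = sum_a / len(group_a)
    mean_b = sum_b / len(group_b)
    beta = math.log(mean_b / mean_a)       # effect of group b vs a, log scale
    variance = 1 / sum_a + 1 / sum_b       # from the inverse Fisher information
    wald_chi2 = beta ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(wald_chi2 / 2))
    return beta, wald_chi2, p_value


# Hypothetical values for two groups of samples
beta, chi2, p = poisson_two_group_wald([10, 12, 14], [20, 24, 28])
print(round(beta, 3), round(chi2, 2), p < 0.05)
```

Here the coefficient is the log of the ratio of group means, so a positive value indicates higher accumulation in the second group.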

Declarations
Ethical approval
No human and/or animal participants were involved in the study.

Consent to participate, submission and for publication
All authors have been personally and actively involved in the work leading to this manuscript and consent to its submission and publication.

Conflict of interest
The authors declare that they have no conflict of interest.
Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.