Since its characterization in Italy (Giampetruzzi et al., 2012), grapevine Pinot gris virus (GPGV, Trichovirus, Betaflexiviridae) has been detected in most grapevine-growing regions around the world. The virus is generally detected using serological and/or molecular tools. In this work, we describe datamining as a potential additional method to identify grapevines infected with this virus and thereby better estimate its worldwide distribution. While this work cannot be considered an epidemiological study per se, it offers valuable information on the virus (i.e., its geographic distribution and genetic composition) by providing a snapshot of the situation in three vineyards in Italy at a specific time, and it gives new insight into GPGV accumulation, introduction and transmission.

This work is based on data from a previously published study on the contribution of genotype, environment and their interaction to the berry transcriptome (Dal Santo et al., 2018). Two cultivars, Cabernet Sauvignon and Sangiovese, were planted in three different locations: Montalcino, Bolgheri and Riccione. The first two are located in the Tuscan hills and on the Tuscan coast, respectively, while the last is on the Adriatic coast (Fig. 1). To minimize genetic variation, the researchers used the same clonal material for each cultivar: clone R5 for Cabernet Sauvignon and clone VCR23 for Sangiovese. In addition, three different rootstocks were tested in the study: Kober-5BB, 420A and 161.49 C. After uploading the 72 SRA files generated from that study, all samples were analyzed for the presence of GPGV using Workbench 12.0 software (CLC Genomics Workbench, Aarhus, Denmark) as previously described (Hily et al., 2018). Presence was first assessed by mapping reads to a collection of curated GPGV reference sequences. For samples displaying reads corresponding to GPGV, de novo assembly steps were performed and the assemblies were further extended by multiple rounds of residual read mapping as previously described (Nourinejhad Zarghani et al., 2018). The resulting genome sequences were verified using very stringent mapping parameters (length of 0.95/similarity of 0.95).
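To make the stringent mapping criterion concrete, the following minimal Python sketch shows one way such a filter could be expressed. It is not the CLC Genomics Workbench implementation. It assumes that "length" means the fraction of the read included in the alignment and that "similarity" means the fraction of aligned bases identical to the reference; the `ReadAlignment` class and its fields are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReadAlignment:
    """Hypothetical summary of one read aligned to a GPGV reference."""
    read_length: int     # total length of the read (nt)
    aligned_length: int  # number of read bases included in the alignment
    matches: int         # aligned bases identical to the reference


def passes_stringent_mapping(aln, min_length_fraction=0.95, min_similarity=0.95):
    """Return True if the alignment meets both thresholds (assumed definitions)."""
    if aln.read_length <= 0 or aln.aligned_length <= 0:
        return False
    length_fraction = aln.aligned_length / aln.read_length
    similarity = aln.matches / aln.aligned_length
    return length_fraction >= min_length_fraction and similarity >= min_similarity


# Illustrative values only, not data from the study.
print(passes_stringent_mapping(ReadAlignment(100, 98, 96)))  # long, near-identical alignment
print(passes_stringent_mapping(ReadAlignment(100, 90, 90)))  # too little of the read aligned
```

Under these assumed definitions, a read is retained only if both fractions reach 0.95, which excludes partially aligned reads as well as reads that are too divergent from the reference.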

Fig. 1
Maximum-likelihood tree inferred from sequences (7206 nt) of grapevine Pinot gris virus genomes isolated from two cultivars, Cabernet Sauvignon clone R5 (star) and Sangiovese clone VCR23 (circle). Rootstocks are also indicated: 161.49 C (square), Kober 5BB (triangle) and 420A (diamond). Only bootstrap values above 0.5 are shown. Colors correspond to the location in Italy where samples were recovered, Bolgheri (blue) and Riccione (red); see the map in the upper right corner. Identity percentages between sequences are indicated to the right of the ML tree. Measurements of population differentiation (fixation index, FST) and associated statistics (P value) are in the upper left corner.

Our datamining study revealed that only samples from Bolgheri and Riccione were positive for GPGV. The virus was barely detected in a few samples from Montalcino (Table 1), and no complete sequence could be recovered from them. These ‘Low Read Count’ samples were probably the result of ‘intra-lane contamination’, as previously described in other studies (Vigne et al., 2018). When using RPKM (reads per kilobase per million mapped reads) data as a proxy for virus accumulation in the samples, our analyses revealed differential accumulation of GPGV according to several variables (Fig. 2). GPGV seems to accumulate more in berries in 2011 than in 2012 (P < 10⁻⁵) and at a later stage of fruit development, at mid-ripening rather than pre-veraison (P < 10⁻⁴). The cultivar-rootstock combination also seems to influence virus accumulation: GPGV seems to accumulate more in Cabernet Sauvignon grafted onto either 161.49 C or Kober-5BB rootstocks than in Sangiovese grafted onto 420A at either location (P ≤ 10⁻⁴). In addition, GPGV accumulation differed according to the location where the grapevines were grown (P < 10⁻⁵), with GPGV accumulating more in Riccione than in Bolgheri.
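As a reminder of what the RPKM proxy measures, the short Python sketch below applies the standard definition (reads on the target, normalized by target length in kilobases and by library size in millions of reads). The function name and the read counts are illustrative assumptions, not values from the study; only the 7206 nt genome length is taken from the Fig. 1 legend.

```python
def rpkm(reads_on_target, target_length_nt, total_mapped_reads):
    """Reads per kilobase of target per million mapped reads."""
    if target_length_nt <= 0 or total_mapped_reads <= 0:
        raise ValueError("target length and library size must be positive")
    # 1e9 = 1e3 (nt per kilobase) * 1e6 (reads per million)
    return reads_on_target * 1e9 / (target_length_nt * total_mapped_reads)


# Hypothetical library: 7206 reads on a 7206 nt genome out of 1,000,000 mapped reads.
print(rpkm(7206, 7206, 1_000_000))
```

Because the value is normalized by library size, samples sequenced at different depths can be compared, which is what allows RPKM to serve as a rough proxy for virus accumulation.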

Table 1 Summary of the datamining analyses performed on the data from the study by Dal Santo et al. (2018)
Fig. 2
Box plot diagrams of RPKM as a function of different variables. From left to right: year, developmental stage (MR: mid-ripening, PV: pre-veraison), rootstock, overall location, Sangiovese grafted onto 420A, grapevines cultivated in Bolgheri and in Riccione (CS: Cabernet Sauvignon). On each box, the central line is the median, the edges of the box are the 25th and 75th percentiles, the whiskers extend to the most extreme data points and the dots refer to the outliers. Since RPKM values did not follow a normal distribution, a generalized linear model (GLM) with a Poisson error distribution was used. The significance of each considered effect was tested using a Wald χ² test, and P values below the 0.05 threshold were considered statistically significant. All analyses and graphic representations were made with the R software version 4.0.2 (R Core Team 2012).
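The analyses in this work were run in R. As a language-neutral illustration of the test described in the legend, the Python sketch below computes the Wald χ² statistic for the simplest case, a Poisson GLM with a single two-level factor (for example, year). It assumes the canonical log link and count-like responses, neither of which is stated in the text. For this special case the maximum-likelihood estimates have a closed form, so no fitting library is needed. The function name and the input values are hypothetical.

```python
import math


def poisson_two_group_wald(group_a, group_b):
    """Wald chi-square test for a two-level factor in a log-link Poisson GLM.

    Returns (beta, chi2, p_value), where beta is the log ratio of group means.
    """
    sum_a, sum_b = sum(group_a), sum(group_b)
    if sum_a <= 0 or sum_b <= 0:
        raise ValueError("each group needs a positive total")
    mean_a = sum_a / len(group_a)
    mean_b = sum_b / len(group_b)
    beta = math.log(mean_b / mean_a)        # MLE of the group effect
    variance = 1.0 / sum_a + 1.0 / sum_b    # inverse Fisher information for beta
    chi2 = beta ** 2 / variance             # Wald statistic, 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square (1 df) upper tail
    return beta, chi2, p_value


# Hypothetical count-like values for two groups, for illustration only.
beta, chi2, p = poisson_two_group_wald([10, 12, 11], [30, 33, 27])
print(f"beta={beta:.3f}, chi2={chi2:.2f}, p={p:.2e}")
```

Models with several factors, as in Fig. 2, require iterative fitting (for instance with `glm` in R), but the Wald statistic is built the same way: the squared estimate divided by its estimated variance.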

Analysis of the genetics of the virus revealed further information. Overall, 47 complete (or near-complete, covering at least all open reading frames) GPGV genome sequences were assembled (Table 1). Sequences were submitted to GenBank (BK011089-BK011101); the other sequences are available upon request. Phylogenetic analysis (Fig. 1) showed that three major clades of GPGV infect these grapevines, each displaying a high intra-clade nucleotide identity (≥ 98.10%). Interestingly, GPGV sequences clustered very well by location (Fig. 1, colors) but independently of cultivar. Fixation index (FST) analyses (Fig. 1) confirmed the genetic differentiation of the viral population according to location, with a high and statistically significant FST value (FST = 0.608, P ≤ 10⁻⁵). Such segregation by location was also found when comparing grapevines grafted onto rootstock 161.49 C, used exclusively in Bolgheri, with grapevines grafted onto Kober 5BB, used exclusively in Riccione (FST = 0.564, P ≤ 10⁻⁵). Comparisons involving sequences obtained from the 420A rootstock also gave statistically significant FST values, although lower than those mentioned above, most likely because 420A was used as a rootstock in both locations. Furthermore, the genetic background of the grapevine cultivar, which was also present in both locations, had no statistically significant impact on viral populations (FST = 0.018, P = 0.185).
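The text does not specify which FST estimator or significance test was used. As one possible illustration, the Python sketch below computes a sequence-based FST in the form 1 − Hw/Hb (mean within-population over mean between-population pairwise differences, following Hudson et al., 1992) and attaches a permutation P value obtained by shuffling sequences between the two groups. The function names and the toy sequences are assumptions introduced for illustration; real input would be aligned genome sequences of equal length.

```python
import random
from itertools import combinations, product


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def mean_distance(pairs):
    pairs = list(pairs)
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


def hudson_fst(pop_a, pop_b):
    """FST = 1 - Hw/Hb; each population needs at least two sequences."""
    h_within = (mean_distance(combinations(pop_a, 2))
                + mean_distance(combinations(pop_b, 2))) / 2
    h_between = mean_distance(product(pop_a, pop_b))
    if h_between == 0:
        return 0.0  # all sequences identical: no differentiation
    return 1.0 - h_within / h_between


def permutation_test(pop_a, pop_b, n_permutations=999, seed=0):
    """Observed FST and the share of label shufflings with FST >= observed."""
    rng = random.Random(seed)
    observed = hudson_fst(pop_a, pop_b)
    pooled = list(pop_a) + list(pop_b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if hudson_fst(pooled[:len(pop_a)], pooled[len(pop_a):]) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_permutations + 1)


# Toy aligned sequences standing in for two locations (illustration only).
site_1 = ["AAAA", "AAAA", "AAAT"]
site_2 = ["GGGG", "GGGG", "GGGC"]
print(permutation_test(site_1, site_2))
```

Values close to 1 indicate that most of the diversity lies between groups, as reported here for location, whereas values close to 0 indicate little differentiation, as reported for cultivar.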

In addition to the presence/absence of GPGV in the samples, this work highlights two distinct situations at the viral genomic level. One vineyard, represented here by samples from the Bolgheri region, is infected by a single variant (identity percentage ≥ 99%, as previously defined for GPGV; Hily et al., 2020), while the other vineyard (Riccione) is infected by at least two variants. These results indirectly but strongly suggest independent introduction/transmission events of GPGV in two of the three locations examined in this transcriptomic study. These situations are probably the result of transmission through grafting (Saldarelli et al., 2014) and movement of infected material, as previously suggested (Al Rwahnih et al., 2016; Fajardo et al., 2017; Wu & Habili, 2017). Transmission may also have occurred horizontally via vectors, either in the nursery or in the vineyard, with distinct variants of the virus being detected at each location regardless of the clonal background of the grapevine. In addition, these location-specific variants, each probably differing in fitness, may result in the differential virus accumulation observed above. Overall, this in silico work adds to the so-far limited knowledge on the natural transmission of GPGV in vineyards (Bertazzon et al., 2020; Hily et al., in press).
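The ≥ 99% identity criterion can be illustrated with a small Python sketch that groups aligned sequences into putative variants. The text gives only the threshold, so the single-linkage grouping used here (two sequences join the same group whenever a chain of pairs at or above the threshold connects them) is an assumption, and the function names and toy sequences are hypothetical.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of identical sites between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def group_into_variants(sequences, threshold=99.0):
    """Single-linkage grouping of named sequences at the given identity threshold."""
    names = list(sequences)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the group representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if percent_identity(sequences[first], sequences[second]) >= threshold:
                parent[find(first)] = find(second)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


# Toy 200 nt sequences (illustration only): 'near' is 99.5% identical to 'ref',
# while 'far' is 95% identical to 'ref' and 95.5% identical to 'near'.
toy = {
    "ref": "A" * 200,
    "near": "A" * 199 + "C",
    "far": "A" * 190 + "C" * 10,
}
print(group_into_variants(toy))
```

With such a grouping, a vineyard yielding a single group would correspond to the single-variant situation described for Bolgheri, and a vineyard yielding two or more groups to the situation described for Riccione.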

Datamining is becoming a very important and powerful tool to identify new pathogens, as well as new variants of known viruses, such as those from the now well-known Coronaviridae family (https://virological.org/t/serratus-the-ultra-deep-search-to-discover-novel-coronaviruses/516; last visited 04/2021). Datamining can also be used to increase the number of complete genome sequences available for downstream studies, for example on the evolutionary history of specific viruses (Hily et al., 2020). In this work, datamining can be considered an in silico tool to monitor, post facto, the sanitary status of any vineyard around the world from which data have already been collected, published and made publicly available. Datamining as a tool has a few pitfalls. We do not always have all the details regarding the samples (i.e., metadata such as the exact origin and location of collection), and we can choose neither the technology with which the data were obtained nor the quality of the samples. However, the information generated is still very valuable: it has already been paid for and is therefore almost free (other than the time of analysis), it is available to anyone and, most of all, it is ever growing.