Introduction

One of the challenges of horticultural science is to characterize plant genetic resources for food and agriculture and to reconstruct the phylogeny of domesticated species, which are currently cultivated alongside their wild relatives. The common grapevine (Vitis vinifera L.) has been studied in this regard for many years by numerous authors using molecular, ampelographic and morphometric techniques.

The name Vitis vinifera was lectotypified by Siddiqi (1980) using Herb. Linn. No. 281.1 (LINN 2023a). This name is applied in the strict sense (var. vinifera) to cultivated grapes whose berries contain (0)1 to 4 seeds. Iamonico et al. (2015) neotypified the name V. vinifera var. apyrena L., which applies to the Corinth and Sultanina group of ancient seedless cultivars, using Herb. Linn. No. 281.3 (LINN 2023b). The Eurasian wild grapevine has received very diverse taxonomic treatments, from the rank of variety to that of species. This implies the use of the following alternative names, depending on the accepted rank: V. vinifera var. sylvestris Willd., or V. vinifera subsp. sylvestris (C.C. Gmel.) Hegi, or even V. sylvestris C.C.Gmel. (Ferrer-Gallego et al. 2019). However, the conservation and typification proposal of Ferrer-Gallego et al. (2019) for V. sylvestris was rejected (Applequist 2023), so the valid name at species rank requires further study. In the meantime, we will continue to use V. sylvestris C.C.Gmel.

Based on the Bayesian molecular dating, Vitis is inferred to have originated in the New World during the late Eocene (32.6–48.6 Ma), then migrated to Eurasia in the late Eocene (30.9–45.1 Ma) (Liu et al. 2016). The North Atlantic land bridges are hypothesized by Liu et al. (2016) to be the most plausible route for the Vitis migration from the New World to Eurasia, while intercontinental long-distance dispersal cannot be eliminated as a likely mechanism (Nie et al. 2012).

Seeds of Vitaceae are easily identifiable due to their distinctive features, which include a pair of ventral furrows and a dorsal chalaza shield. The shape and positioning of these ventral furrows and of the chalaza, along with the testa anatomy, are key characteristics that differentiate seed types within the family. Typically, genera such as Leea, Cissus, Cyphostemma, Tetrastigma, Rhoicissus, and Cayratia have a thin linear chalaza, while others, including Vitis, generally exhibit an oval chalaza (Manchester et al. 2012).

However, some seed forms are shared across multiple genera, and certain genera may possess more than one seed form. This overlapping seed morphology suggests a closer relationship among some genera, while in other cases, it may indicate convergent or parallel evolution (Chen and Manchester 2011).

Vitis seeds present a characteristic morphology. However, Chen and Manchester (2011) have shown that two Ampelocissus species (A. erdvendbergiana Planch. and A. robinsonii Planch.), both endemic to Central America, have smaller, heart-shaped seeds that resemble those of Vitis and cannot be separated from them. Ampelocissus erdvendbergiana is grouped by Nie et al. (2012) with Vitis section Muscadinia grapevines, such as Vitis rotundifolia and V. popenoei. The similarity between the seeds of certain Ampelocissus, Pterisanthes, Nothocissus and Vitis can be explained by the close phylogenetic relationships of these genera (Chen and Manchester 2011).

Vitis seeds usually have short ventral inward folds and an oval chalaza. Their apical and/or basal grooves are sometimes prominent in dorsal view. The seeds are usually smooth; however, there are also species with rough seeds. V. rotundifolia has larger seeds with a slightly rough surface; its ventral folds are longer than those of other Vitis species (Chen and Manchester 2011).

In early morphological research with Vitis seeds, two morphotypes were defined consisting of pear-shaped and oblong to ovoid seeds, which corresponded respectively to the taxonomic groups Euvitis and Muscadinia. The seeds of cultivated species (V. vinifera, V. labrusca L., V. aestivalis Michx.) are larger and their dimensions are more variable than those of wild species (Martín-Gómez et al. 2020).

Seeds of Eurasian wild (V. sylvestris) and domesticated (V. vinifera) grapevine display dissimilarities, which allow the discrimination between both taxa: wild grapevine bears small and roundish seeds with short stalks, while pips from cultivars are more elongated, with longer stalks.

These morphological traits have been used to discriminate archaeological pips since Stummer’s work of 1911. The co-variation of berry size, number of seeds per berry, pip size and pip shape was explored by Bonhomme et al. (2020) in Euro-Mediterranean traditional cultivars and wild grapevines. They showed that, for both wild and domesticated grapevines, longer pips have a more “domesticated” shape.

However, the use of morphometric criteria has been criticized because of the overlap between wild and domesticated seed shapes and because of seed deformation due to carbonization. Moreover, the morphometric criteria and measurement values defined by Stummer, and later by Mangafa and Kotsakis in 1996, are based on a reference model that includes grape cultivars and wild individuals from the Mediterranean and from Western and Central Europe. For Pagnoux et al. (2015), this area is too restricted to be considered representative of grapevine diversity.

Jacquat and Martinoli (1999) classified grapevine seeds as products of domesticated or wild plants by applying separately the indices of Stummer (1911), of Facsar (1970) and Facsar and Jerem (1985) in the version by Perret (1997), and of Mangafa and Kotsakis (1996). They carried out different biometric analyses on Nabataean and Roman seeds found at Petra, Jordan, dated between 150 B.C. and 400 A.D. Depending on the identification method selected, the seeds were attributed either to wild grapevines (based on the ratio of breadth over length, and on discriminant analyses of size variables such as pip length, stalk length, and chalaza position) or to an archaic variety of vine with seeds morphologically close to those of wild grapevines (ratio of stalk length over total pip length).

Traits associated with domesticated V. vinifera are unquestionable, but what is less clear is to what extent the lack of these traits in a seed sample indicates that this comes from wild, uncultivated plants. Whether these seeds, especially in archaeological contexts, correspond to wild or cultivated plants is quite uncertain, since we do not know at what point in the process of grapevine cultivation the seeds acquired traits diagnostic of domestication (Valamoti 2009).

Furthermore, the studies of Logothetis (1970, 1974) revealed that, after carbonization, seeds of both V. vinifera and V. sylvestris became smaller and more rounded, since length decreases more than breadth, which in turn decreases more than thickness. These changes were even more pronounced in wild seeds. However, what is most relevant is that carbonization brought the domesticated seed shape closer to the wild type. Consequently, the B/L ratio (Stummer index) increases as a result of carbonization. For that reason, Mangafa and Kotsakis (1996) suggest that their second and third formulae (see below) are more appropriate for identifying charred archaeological grape pips, because they use fewer simple size variables, and these contribute less than the allometric variables.

The Eurasian wild grapevine, V. sylvestris C.C. Gmel., has been regarded as the hypothetical dioecious parent of domesticated V. vinifera L., which is usually hermaphrodite (Rivera and Walker 1989; This et al. 2006; Zohary and Hopf 2000). V. sylvestris was found to be sister to all Asian species and to be the oldest living Eurasian species among those included in the DNA study by Zecca et al. (2012).

Fossils attributed to V. sylvestris appear within sediments dated from the end of the Pliocene (Sémah and Renault-Miskovsky 2004). Currently, wild populations of V. sylvestris and related taxa are scattered in predominantly riparian natural ecosystems from the Iberian Peninsula through Central Europe, the Balkans and the Caucasus, to the Hindu Kush (Arnold 2002). Some populations of this liana can also be found in the African Maghreb (Ocete et al. 2007). Their main habitats are river-bank forests, river mouths, flood plains, colluvial positions on the slopes of hills and mountains, and coasts between the parallels 49° 27′ N (Rhine River, Germany) and 31° N (Ourika River, Morocco) (Iriarte et al. 2013). In such places, soils are often renewed by flooding (Arnold 2002; Maghradze et al. 2010).

Our main objective is to analyze and validate the use of domestication indices to distinguish V. vinifera from V. sylvestris and other wild Vitis species. For this purpose, we study the values of the indices in living populations of European, American and Asian wild grapevines. In parallel, we use logistic regression and random forest tools to estimate the probability that each seed is domesticated, and we compare these results with the Domestication Index values.

We also aim to detect the most influential variables and indexes, using machine learning tools.

In addition, we also intend to assess the information that the indices provide on the evolution of Vitis, from the first fossil seed remains to the present day, paying particular attention to materials from archaeological sites. Finally, we intend to evaluate the human contribution to the development of the domestication syndrome in Vitis seeds and to consider the influence of possible non-anthropic factors.

Materials and methods

Seed samples and characters analyzed

The living reference material used here aims to give a comprehensive picture of the diversity within Vitis species and cultivars. It starts with samples that cover a large number of cultivars, followed by wild and subspontaneous vines, in order to compare the morphological differences between them, and is complemented by a comprehensive collection of fossil Vitaceae and modern samples of related living genera such as Ampelopsis, Nekemias or Parthenocissus. Of the total number of seeds analyzed, 3484 are modern seeds from living species (481 samples), 398 are archaeological (195 samples), most of them preserved by carbonization, and 147 are fossilized (107 samples). We have approached the study of seeds of the genus Vitis following the most widely accepted concept of Vitis and the list of included species adopted in references such as POWO (2023), GRIN (2023) or GBIF (2023).

As far as possible, the most relevant ampelographic data included in the Grapevine Descriptors (IPGRI-UPOV-OIV 1997) were collected, especially those related to the hairs covering the leaves and the characteristics of the grape berry. These data were either recorded directly in the field, since more than two hundred of these grapevines were, and still are, grown on farm in Molina de Segura (Spain), or taken from databases (FNGR, VIVC, European Vitis Database, Armenian Vitis Database) (FNGR 2023; Maul 2023; European Vitis Database 2023; Armenian Vitis Database 2007). These ampelographic data were used to assign the cultivated grapevines from which seeds were collected to the various proles, following the criteria established by Negrul (1946a). Herbarium specimens were prepared and are kept at the MUB herbarium of the Universidad de Murcia (Spain).

For each sample, on average 10 seeds were randomly selected and measured to build the data matrix. When the number of seeds per set was less than 10, the analysis was carried out on all available seeds. However, in this article we will deal with single seeds. Each seed was individually described according to 14 characters. Of these, 11 are quantitative direct measurements and three are qualitative.

The quantitative analysis involves several key measurements of seed morphology, as depicted in Fig. 1. These include the total length, maximum breadth, and thickness of the seed (1, 2, and 3 in Fig. 1, respectively). Additionally, measurements are taken of the breadth of the stalk at its junction and at the seed base (4 and 5 in Fig. 1), of the length of the beak in dorsal and ventral views (6 and 7 in Fig. 1, respectively), and of the thickness of the beak at the seed base (8 in Fig. 1). Furthermore, the total length and breadth of the chalaza shield are measured (9 and 10 in Fig. 1). The position of the chalaza, denoted PCH following Mangafa and Kotsakis, is also considered (9 + 11 in Fig. 1). Together, these measurements describe the seed’s structural characteristics (Rivera et al. 2007).

Fig. 1

Seed morphological characters. Note: label color codes: blue, used in Stummer’s index; green, used in the indices of Facsar and Perret and of Mangafa and Kotsakis; yellow, used in the indices of Mangafa and Kotsakis; brown, used in all six indices. Image: D. Rivera

Qualitative analysis involves categorizing seed characteristics into distinct states (Fig. 1). This includes assessing the contour type, which can be ovoid, quadrangular, triangular, rounded, or pentagonal. The arrangement of fossettes is also considered, with options including parallel, furcate, convergent, or divergent. Additionally, the presence or absence of radial furrows is noted as part of the evaluation process.

The primary raw data matrix used for this study consists of 4029 rows, one per analyzed seed, belonging to 783 samples, and 33 columns of variables, of which fourteen are the characters listed above, sixteen are allometric indices, and three are derived.

  • Allometric: Stummer’s (Breadth / length), Breadth / depth, Facsar and Perret’s Stalk length / seed length, Seed body length (seed length minus stalk length), Stalk breadth / stalk length, Stalk breadth at junction / stalk depth at junction, Seed depth / stalk depth at junction, Chalaza shield breadth / chalaza shield length, Seed length / chalaza shield length, Seed breadth / chalaza shield breadth, Distance chalaza apex to seed apex / chalaza length, Accumulated length / seed length. And finally, the four by Mangafa and Kotsakis in 1996.

  • Derived: Length × breadth × depth, Sphericity and Triangularity.

The quantitative and qualitative characters were measured or determined using scaled digital images. The seeds of each sample were placed individually on a plasticine support with a built-in scale, photographed in dorsal, ventral and lateral view, and measured using the open-source Fiji software (Schindelin et al. 2012). All photographs were taken under the same zoom conditions. Scaled images of fossilized and archaeological seeds from the specialized literature were also used for measurements. The characters were recorded in an Excel spreadsheet in which the allometric relationships were calculated automatically.

Data analysis: morphometric indices

Stummer’s index

Stummer (1911) proposed an index based on the seed breadth/length ratio (Fig. 1) adapted to Central European populations (1).

$$\mathrm{STI}(X_i) = \frac{100 \times \text{Breadth of the seed}_{X_i}}{\text{Length of the seed}_{X_i}}$$
(1)

This index allows a fairly effective differentiation between extreme forms, but intermediate values are found in both wild and cultivated populations (Table 1). Expressed as breadth/length ratios (that is, without the factor of 100 in formula 1), values between 0.44 and 0.53 would be exclusive to cultivars, whereas values between 0.76 and 0.83 would be exclusive to wild populations. Values between 0.54 and 0.75 are intermediate and could point to the presence of “hybrids”, cultivars or wild vines. Levadoux (1956) suggested that this index would have limited validity in distinguishing wild grapevines from cultivars.
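The calculation in formula (1) and the bands quoted above can be sketched in a few lines of Python. This is only an illustration: the function names and the example measurements are hypothetical, and the bands are taken from the text as breadth/length ratios rounded to two decimals.

```python
def stummer_index(breadth_mm: float, length_mm: float) -> float:
    """Stummer's index as in formula (1): 100 x breadth / length."""
    if length_mm <= 0:
        raise ValueError("seed length must be positive")
    return 100.0 * breadth_mm / length_mm


def stummer_band(breadth_mm: float, length_mm: float) -> str:
    """Place a seed in the bands quoted in the text (given there as ratios)."""
    if length_mm <= 0:
        raise ValueError("seed length must be positive")
    ratio = round(breadth_mm / length_mm, 2)  # the text reports two decimals
    if 0.44 <= ratio <= 0.53:
        return "cultivar range"
    if 0.54 <= ratio <= 0.75:
        return "intermediate"
    if 0.76 <= ratio <= 0.83:
        return "wild range"
    return "outside the quoted ranges"


# Hypothetical seed, 3.3 mm broad and 6.0 mm long (illustrative values only)
print(stummer_index(3.3, 6.0), stummer_band(3.3, 6.0))
```

A seed falling in the intermediate band is exactly the case in which the index alone cannot decide between wild and cultivated origin.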

Table 1 Thresholds of Stummer’s index for wild and domesticated grapevine seeds

Facsar and Perret’s index

Perret (1997) proposed a new index based on the allometric relationship between the length of the stalk or column and the total length of the seed (2) (1, 6 and 7 in Fig. 1).

$$\mathrm{FPI}(X_i) = \frac{100 \times \left(\text{Mean of dorsal and ventral stalk lengths}_{X_i}\right)}{\text{Total length of the seed}_{X_i}}$$
(2)

Apparently, this index differentiates quite effectively between wild and cultivated populations, with the threshold situated between 18 and 19 (Table 2). However, the same index was previously proposed by Facsar (1970), Terpó (1976), and Facsar and Jerem (1985). According to the latter, the index of relative stalk length, (stalk length/seed length) × 100, of the 47 grapevine cultivars (V. vinifera) cultivated in Hungary ranges between 22 and 35, with a mean of 29. The same index for seeds of the native populations of V. sylvestris varies between 13 and 23 (mean = 17). Here the overlap between cultivated and wild grapevines is minimal.
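As a minimal sketch, formula (2) and the Hungarian reference ranges reported by Facsar and Jerem (1985) can be written as follows. The function names and the example measurements are hypothetical; only the ranges 22 to 35 (cultivars) and 13 to 23 (V. sylvestris) come from the text.

```python
def facsar_perret_index(dorsal_stalk_mm: float,
                        ventral_stalk_mm: float,
                        seed_length_mm: float) -> float:
    """Formula (2): 100 x mean stalk length / total seed length."""
    if seed_length_mm <= 0:
        raise ValueError("seed length must be positive")
    mean_stalk = (dorsal_stalk_mm + ventral_stalk_mm) / 2.0
    return 100.0 * mean_stalk / seed_length_mm


def matching_reference_ranges(fpi: float) -> list[str]:
    """Return the reference groups whose reported range contains the value."""
    ranges = {
        "cultivated (V. vinifera)": (22.0, 35.0),
        "wild (V. sylvestris)": (13.0, 23.0),
    }
    return [name for name, (low, high) in ranges.items() if low <= fpi <= high]


# Hypothetical seed: stalk 1.5 mm (dorsal) and 1.75 mm (ventral), length 6.5 mm
fpi = facsar_perret_index(1.5, 1.75, 6.5)
print(fpi, matching_reference_ranges(fpi))
```

A value between 22 and 23 matches both groups, which is the small overlap mentioned above.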

Table 2 Thresholds of Facsar and Perret’s index for wild and domesticated grapevine seeds

Mangafa and Kotsakis’s indices

It is important to remember that these seed-based indices were largely developed to work with archaeological seeds. The formulae proposed by Mangafa and Kotsakis in 1996 were successfully applied to local Greek samples of both modern seeds and archaeological remains. The four formulae (Table 3) are based on the combined use of ratios involving variables such as total seed length (L), seed stalk length (LS) and chalaza position (PCH) (distance from the chalaza base to the base of the seed, involving 1, 9 and 11 in Fig. 1).

Table 3 Ranges of Mangafa and Kotsakis’s indices for wild and domesticated grapevine seeds

Sensitivity and specificity of indices

For biomedical tests, sensitivity and specificity are commonly reported as parameters that help to assess their usefulness in diagnosing disease. Both are measures of the accuracy of a diagnostic test, and both depend on the cut-off level chosen for a positive diagnosis (Chu 1999). While a highly sensitive and highly specific test is desirable, there is usually a trade-off between the two: as one increases, the other decreases (Table 4). In medicine, predictive values depend on the prevalence of disease (the prior) as well as on the sensitivity and specificity of the test, within a Bayesian reasoning framework (Chu 1999).

Table 4 Sensitivity and specificity of the different morphometric indices used

It is precisely these differences in sensitivity and specificity between the tests (Table 4) that suggest that their combined use may provide information better adapted to the different situations encountered in Vitis.

It is important to underline that the average of the sensitivity and specificity values varies significantly (Table 4) and provides a tool to distinguish the most informative indices such as the 1st Mangafa and Kotsakis index (= 0.87) or the Facsar and Perret index (= 0.84).
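To make the two measures and their average concrete, the following sketch computes them for one index from paired lists of predicted and known classes. It assumes, for illustration only, that "domesticated" is treated as the positive class; the function name and the toy data are hypothetical and are not the values of Table 4.

```python
def sensitivity_specificity(predicted, actual):
    """Return (sensitivity, specificity); True stands for 'domesticated'."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # domesticated, detected
    fn = sum(1 for p, a in pairs if not p and a)      # domesticated, missed
    tn = sum(1 for p, a in pairs if not p and not a)  # wild, recognized
    fp = sum(1 for p, a in pairs if p and not a)      # wild, misclassified
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in the reference data")
    return tp / (tp + fn), tn / (tn + fp)


# Toy example: four domesticated and four wild seeds (made-up outcomes)
actual = [True, True, True, True, False, False, False, False]
predicted = [True, True, True, False, False, False, False, True]
sens, spec = sensitivity_specificity(predicted, actual)
print(sens, spec, (sens + spec) / 2)
```

The last printed value is the average of sensitivity and specificity, the summary used above to rank the indices.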

Domestication index

Since the above indices serve the same purpose, to distinguish wild from domesticated forms, but their results differ from case to case, the combined use of the six indices may provide a better ability to discriminate seeds from wild or cultivated grapevines.

Here we propose the domestication index (DI) and the wildness index (WI). DI is calculated individually for each seed using formula (3), where NIT is the number of indices whose value crosses its threshold (from above or from below, depending on the index) and therefore points to a domesticated seed, and NRI is the number of indices considered (six in the present case). Because WI is the complement of DI, their sum is always equal to 1, and WI is calculated as in formula (4).

$$DI= \frac{\sum_{i=1}^{n}{NIT}_{i}}{\sum_{i=1}^{n}{NRI}_{i}}$$
(3)
$$WI=1-DI$$
(4)

Since it is based on the positive result of each of the six indices, the DI values for a single seed can have the following values: 0 (0/6), 0.17 (1/6), 0.33 (2/6), 0.5 (3/6), 0.67 (4/6), 0.83 (5/6) and 1 (6/6), and likewise for WI. In heterogeneous samples the mean of the DI of the seeds that constitute the sample can present any value between 0 and 1.

Similarly, when calculating mean values for a period or a locality using DI and WI indices, the possible values lie on the continuum between 0 and 1. The wildness index is complementary to the domestication index, so the sum of their values is always equal to one.
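Formulas (3) and (4), and the averaging over the seeds of a sample, can be sketched as follows. Each seed is represented by a list of six boolean "votes", one per index, where True means that the index classifies the seed as domesticated. How each vote is obtained depends on the thresholds in Tables 1 to 3 and is not reproduced here; the function names are hypothetical.

```python
def domestication_index(votes) -> float:
    """Formula (3): share of indices that point to a domesticated seed."""
    votes = list(votes)
    if not votes:
        raise ValueError("at least one index is required")
    return sum(1 for v in votes if v) / len(votes)


def wildness_index(votes) -> float:
    """Formula (4): WI = 1 - DI."""
    return 1.0 - domestication_index(votes)


def sample_mean_di(seeds) -> float:
    """Mean DI over all seeds of a sample (any value between 0 and 1)."""
    values = [domestication_index(v) for v in seeds]
    if not values:
        raise ValueError("the sample contains no seeds")
    return sum(values) / len(values)


# One seed for which four of the six indices indicate domestication
seed = [True, True, True, True, False, False]
print(round(domestication_index(seed), 2), round(wildness_index(seed), 2))
```

For a single seed the result is always a multiple of 1/6, while the sample mean can take intermediate values when the seeds of a sample disagree.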

We usually work with samples consisting of several seeds, which, in the case of living populations, wild or cultivated, usually come from the same bunch or, failing that, at least from the same population. Several parameters are relevant when transferring the results from single seeds to the entire sample; we mainly used those described by Valera et al. (2023).

However, as we will see in the next section, other approaches are available to estimate the probability of individual seeds coming from a domesticate or wild grapevine.

Domestication probability estimates using logit and random forest

We used logistic regression models and the random forest technique to assess, based on morphometric data and comparison collections, the likelihood that individual seeds originate from domesticated or wild grapevines, or exhibit intermediate characteristics that hinder clear categorization. Our analyses were conducted using RStudio within the R framework, a freely available software environment for statistical computing and graphics (R Project 2023). RStudio (2023) provides an integrated development environment (IDE) designed to enhance productivity in the R and Python programming languages. These tools are accessible via the Cran R Project (2023).

The logistic regression model links the conditional probability of the binary response variable to the explanatory variables considered (Couronné et al. 2018); in our case, it links the probability of being ‘domesticated’ to the six domestication indices. Utilizing the Logit function in R, we estimate logistic regression models using the generalized linear model function. Logit models are specific instances of generalized linear models with a binomial distribution and a logit link function, and they derive their name from the logarithm of the odds (Paladino 2017). In our study, Logit in R assigns to each seed a probability of being ‘domesticated’, ranging from 0 to 1 and abbreviated as PDILO.
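The idea of a logit model can be illustrated with a small, self-contained Python sketch. It is not the R workflow described above: it fits the weights by plain gradient descent instead of the generalized linear model routine, it uses two predictors rather than the six indices, and the eight seeds are made-up, centred values chosen only for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logit(X, y, lr=0.5, epochs=2000):
    """Fit weights and intercept by gradient descent on the log-loss."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for row, label in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - label
            grad_b += err
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, row) -> float:
    """Estimated probability that a seed is 'domesticated'."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, row)))


# Made-up predictors (first: stalk-related, second: breadth-related)
X = [[-0.5, 0.15], [-0.4, 0.13], [-0.3, 0.11], [-0.6, 0.17],
     [0.8, -0.15], [1.0, -0.13], [0.6, -0.17], [1.2, -0.10]]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = wild, 1 = domesticated
w, b = fit_logit(X, y)
print(predict_proba(w, b, [1.2, -0.10]), predict_proba(w, b, [-0.6, 0.17]))
```

The output of `predict_proba` plays the role of PDILO: a value between 0 and 1 for each seed rather than a hard wild or domesticated label.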

Traditional machine learning algorithms often suffer from low classifier accuracy and are prone to overfitting. In contrast, Random Forest is an ensemble method that combines a series of tree classifiers. Each tree casts a vote for the most popular class, and these votes are combined to yield the final classification. Random Forest exhibits high accuracy in classification tasks and robustness to outliers and noise, and it mitigates the risk of overfitting. It has emerged as a widely used research method in both data mining and biological fields (Liu et al. 2012).

In our study, we utilized the ‘randomForest’ package version 4.7-1.1 in R (Breiman et al. 2022; randomForest 2023; Rossiter 2023). Random Forest in R generates classifications and regressions based on a forest of trees using random inputs, as described by Breiman (2001). Random forests employ a combination of tree predictors, where each tree depends on values from a randomly sampled vector with the same distribution across all trees. The generalization error of random forests approaches a limit as the number of trees in the forest increases (Breiman 2001). One of its major advantages is the ability to prevent overfitting and handle large numbers of variables, aiding in the identification of important attributes. Notable parameters of the random forest include ‘ntree’, the number of trees, which defaults to 500, and ‘mtry’, the number of variables randomly sampled as candidates at each split, which helps to avoid overfitting (Breiman et al. 2022; R-Bloggers 2023). In our regression-based analysis, weight is set to null and replace is set to true. In our study, ‘randomForest’ in R assigns to each seed a probability of being ‘domesticated’, ranging from 0 to 1 and abbreviated as PDIrF.
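The voting idea behind a random forest can be shown in miniature with the Python sketch below. It is a deliberately simplified stand-in and not the ‘randomForest’ package: each "tree" is a single split (a stump), grown on a bootstrap sample drawn with replacement and restricted to `mtry` randomly chosen variables, and the predicted probability is the share of trees voting ‘domesticated’. The function names and the toy measurements are hypothetical.

```python
import random


def fit_stump(X, y, feature_ids):
    """Pick the single split (feature, threshold) with the fewest errors."""
    if len(set(y)) == 1:  # bootstrap sample with one class only
        return (None, 0.0, y[0])
    best = None
    for j in feature_ids:
        values = sorted({row[j] for row in X})
        for low, high in zip(values, values[1:]):
            thr = (low + high) / 2.0
            for label_above in (0, 1):
                errors = sum(
                    (label_above if row[j] > thr else 1 - label_above) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, thr, label_above)
    if best is None:  # no usable split: fall back to the majority class
        return (None, 0.0, int(2 * sum(y) >= len(y)))
    return best[1:]


def stump_predict(stump, row) -> int:
    j, thr, label_above = stump
    if j is None:
        return label_above
    return label_above if row[j] > thr else 1 - label_above


def fit_forest(X, y, ntree=200, mtry=1, seed=0):
    """Grow ntree stumps on bootstrap samples with random feature subsets."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(ntree):
        idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        feats = rng.sample(range(p), mtry)
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict_proba(forest, row) -> float:
    """Share of trees voting 'domesticated' (label 1)."""
    return sum(stump_predict(t, row) for t in forest) / len(forest)


# Made-up seeds: [beak length in mm, relative stalk length]
X = [[0.5, 14], [0.6, 15], [0.7, 16], [0.8, 17],
     [1.5, 26], [1.8, 28], [2.0, 30], [2.4, 32]]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = wild, 1 = domesticated
forest = fit_forest(X, y)
print(predict_proba(forest, [2.4, 32]), predict_proba(forest, [0.5, 14]))
```

Real random forests grow deep trees and resample the candidate variables at every split, so this sketch only conveys how bootstrap sampling, random variable selection and vote averaging combine into a probability such as PDIrF.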

We first ran ‘randomForest’ on the six formulas mentioned above, which were used to calculate the domestication index. In a second stage, we added all the qualitative and quantitative variables and a series of allometric indices used in the description of the seed (Fig. 2). In both cases, we used the random forest model to detect the relative influence of the variables in the distinction between wild and domesticated grapevine seeds.

Fig. 2

Random Forest Analysis of Most Influential Variables. Two models are represented: (A) with the six domestication indices and (B) with all descriptive variables and allometric indices. Color codes: Red (indices) and violet (descriptive) most influential variables. Blue, relatively highly influential variables. This figure displays the estimated importance of variables or features obtained through randomForest technique. The bars represent the relative importance of each variable, highlighting those deemed most influential in the analysis. The higher a variable is on the y-axis, the more important it is in predicting the target variable. The order of variables from top to bottom indicates their importance. If a variable is consistently high across a large number of trees, it suggests that it plays a significant role in the predictive power of the random forest. Image: D.J. Rivera-Obón and D. Rivera

Couronné et al. (2018) have shown that the results of logistic and random forest classifications were noticeably dependent on the inclusion criteria used to select the example datasets, thus emphasizing the importance of clear statements regarding this dataset selection process. For the present analysis we classified all 4029 seeds according to the origin of the 783 samples:

Selection criteria for example data sets

  • Domesticated: This category includes the samples attributed on ampelographic grounds to the V. vinifera Proles and Subproles described by Negrul (1946a) (Occidentalis, Pontica, Orientalis Caspica and Orientalis Antasiatica).

  • Wild: This category includes well-characterized samples from wild Eurasian grapevine (V. sylvestris), wild Caucasian grapevine (V. caucasica), wild American Vitis species, and wild Asian Vitis species.

Categories not included in the example datasets are: archaeological samples, feral, fossils and others.

Results

Detection of influential variables and characterization of the domestication syndrome

The ‘randomForest’ methodology is instrumental in identifying the most influential variables (Fig. 2). Notably, the formulas 1 and 3 by Mangafa and Kotsakis, along with the Facsar and Perret index, emerge as prominent indices, whereas the contributions of Stummer and Mangafa and Kotsakis’ formula 2 are comparatively less significant. Conversely, formula 4 by Mangafa and Kotsakis appears to have minimal impact on distinguishing wild from domesticated seeds (Fig. 2A). When incorporating all descriptive variables and calculated allometric indices into the model, it becomes evident that, except for seed stalk (beak) length, the remaining characteristics contribute less than the aforementioned influential indices (Fig. 2B).

Overall, the ‘randomForest’ methodology demonstrates the utility of these indices in discerning signal from noise in the dataset, as noted by Silver (2012). Initially, DI values of 0 and 0.17 are indicative of clearly wild seeds, while values of 0.83 and 1 signify domestic grapevine seeds, leaving 18% of seeds categorized as indeterminate. Regarding the probabilities estimated by ‘randomForest’ (PDIrF), values ranging from 0 to 0.2 denote wild seeds, whereas those from 0.8 to 1 are typical of domestic grapevine seeds, with 12% of seeds falling into an indeterminate or intermediate category.
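The cut-offs just described can be expressed as two small helper functions. This is only a sketch: the function names are hypothetical, and values that fall between the quoted ranges are treated as indeterminate.

```python
def category_from_di(di: float) -> str:
    """DI of 0 or 0.17 is wild; 0.83 or 1 is domesticated; else indeterminate."""
    di = round(di, 2)  # DI values are multiples of 1/6, reported to two decimals
    if di <= 0.17:
        return "wild"
    if di >= 0.83:
        return "domesticated"
    return "indeterminate"


def category_from_pdirf(pdirf: float) -> str:
    """PDIrF of 0 to 0.2 is wild; 0.8 to 1 is domesticated; else indeterminate."""
    if not 0.0 <= pdirf <= 1.0:
        raise ValueError("a probability must lie between 0 and 1")
    if pdirf <= 0.2:
        return "wild"
    if pdirf >= 0.8:
        return "domesticated"
    return "indeterminate"


# Examples: one of six indices positive, and a random forest estimate of 0.45
print(category_from_di(1 / 6), category_from_pdirf(0.45))
```

Applying both functions to the same seed makes it easy to list the cases in which the index count and the estimated probability disagree.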

Beak lengths less than 1 mm are characteristic of wild grapevines, whereas in domesticated grapevines, beak lengths typically range between 1 and 2.5 mm, rarely reaching 3 mm. Only 4.8% of seeds exhibit intermediate PDIrF values (0.3–0.6), with stalk dimensions around 1 mm in these cases.

Living species, domestication index and domestication probability values

Among the 46 Vitis species, hybrids, and subspecific taxa examined (Table 5), 26 exhibit predominantly wild traits, characterized by DI values less than 0.6, as well as PDILO and PDIrF values below 0.23. Conversely, 6 species demonstrate traits typical of domestication, with DI values equal to or greater than 0.6, and PDILO and PDIrF values equal to or greater than 0.8 and 0.75, respectively. Selected species are illustrated in Fig. 3. Additionally, 14 species display intermediate values, possibly resulting from various stages of the evolutionary process, hybridization, or introgression.

Table 5 Relevant measurements and indices for modern seeds of living Vitis species, proles and hybrids
Fig. 3

Modern Vitis seeds that present the “Wildness syndrome”, as opposed to the “Domestication syndrome”. (A, B) Vitis labrusca (USA). (C) V. amurensis (China). (D) V. riparia (USA). (E, F) V. rotundifolia (USA). Images: A, B, E, F (Negrul 1946b); C, D, Diego Rivera

In terms of the distribution of domestication traits, it is noteworthy that such traits prevail in an American species commonly considered wild: V. aestivalis var. linsecomii (Buckley) L.H.Bailey, which exhibits a DI of 1 and PDILO and PDIrF values of 0.93. These values are comparable only to those of V. vinifera L. and the hybrid V. vinifera L. × V. amurensis Rupr. (Table 5). Additionally, other taxa such as Vitis peninsularis M. E. Jones, V. acerifolia Raf., V. nesbittiana Comeaux, V. popenoei J. H. Fennel, V. shuttleworthii House, V. californica Benth., V. × doaniana Munson ex Viala, and V. aestivalis Michx. exhibit intermediate values. The American grapevine species studied generally show higher domestication index values than the Asian species (Table 5). However, Asian taxa such as V. amurensis Rupr. and, notably, V. piasezkii Maxim. var. pagnuccii (Rom. Caill. ex Planch.) Rehder also display intermediate values close to the “domestication syndrome” (Table 5).

V. rotundifolia, the type species of the Muscadinia section of the Vitis genus, shows low domestication index values. Moreover, the domestication index values and probabilities shed light on differences between European and South Caucasian grapevine types and species (Table 5).

In the genus Ampelopsis, which is closely related to Vitis, seed domestication values are notably low, resembling those observed in typical wild Vitis species.

For Ampelopsis aconitifolia Bunge, the Domestication Index (DI) stands at 0.02 ± 0.05, with PDILO and PDIrF both registering at 0.0 ± 0.0. Ampelopsis orientalis (Lam.) Planch. exhibits a DI of 0.0 ± 0.0, with PDILO at 0.0 ± 0.01 and PDIrF at 0.00 ± 0.0. Ampelopsis vitifolia Planch. and Ampelopsis glandulosa var. brevipedunculata (Maxim.) Momiy. both display DI, PDILO, and PDIrF values of 0.0 ± 0.0.

Fossil taxa, domestication index and domestication probability values

Seemingly the oldest Vitis seeds were found in the Black Buttes, Wyoming (USA), in the same clayey sandbank where Vitis tricuspidata leaves are frequent. Attributed to an age of 75 Ma and a Maastrichtian chronology, they received the name Vitis sparsa. According to the analyzed image, which is a drawing rather than a photograph and may not be completely faithful, the seed presents the “domestication syndrome” (Table 6; Fig. 4). However, the attribution of this seed to Vitis is clearly in conflict with the Bayesian molecular dating of Vitis ancestry (Liu et al. 2016), in which the inferred origin is about thirty Ma later, during the late Eocene (32.6–48.6 Ma).

Table 6 Fossil Vitis seeds (selected examples) and their Stummer’s and domestication indices, and PDILO and PDIrF values
Fig. 4

Fossil seeds that present relatively high and intermediate “Domestication syndrome” levels. (A) Vitis sparsa, Upper Cretaceous, Wyoming (USA). (B) V. pliocaenica, Pliocene, Frankfurt (Germany). (C) V. lusatica, Miocene layers of the salt mine at Wieliczka (S. Poland). (D) V. vinifera type (sub. V. sylvestris), Atlantic postglacial, Stuttgart (Germany). (E) cf. V. labrusca, Holocene, Connecticut (USA). Images: A, (Lesquereux 1878). B, (Engelhardt and Kinkelin 1908). C, (Zastawniak 1996). D, (Kirchheimer 1938). E, (Tiffney and Barghoorn 1976)

A more recent discovery comes from the Deccan cherts (sedimentary microcrystalline quartz, SiO2) of India, dating back to the Late Cretaceous (70 Ma), where seeds of Indovitis chittaleyae Manchester and Kapgate have been found to exhibit distinct wild characteristics (Table 6). However, these seeds do not belong to the genus Vitis. Considering the hypothesis of an American origin of Vitis followed by its spread to Eurasia during the late Eocene (30.9–45.1 Ma) (Liu et al. 2016), Indovitis would not be an ancestor of the Asian Vitis species either.

It’s crucial to highlight that only a single seed of V. lusatica, discovered in Miocene layers of the salt mine at Wieliczka (Southern Poland) and showing a high probability of being ‘domesticated’, has been studied thus far (Fig. 4; Table 6). Therefore, the interpretation of this finding remains highly uncertain.

It should be noted that most of the fossil seeds show clearly wild characteristics with short beaks ranging from 0.1 to 0.8 mm, low combined domestication index values ranging from 0 to 0.5 and very low PDIrF values ranging from 0 to 0.07 (Table 6). Some examples of this type of seeds are shown in Fig. 5.

Fig. 5

Fossil seeds that present the “Wildness syndrome”. (A) Vitis sylvestris, Upper Pliocene, Limburg (Netherlands). (B) V. sylvestris, Upper Pliocene, Kroscienko (Poland). (C) V. sylvestris, Upper Pliocene, Limburg (Netherlands). (D, E) V. subglobosa (p.p.), Ypresian, Eocene, London Clay (England). (F) V. labruscoidea, Pliocene, Pinus trifolia bed of central Hondo (Japan). Images: A, B, C, (Kirchheimer 1938). D, E, (Reid and Chandler 1933). F, (Miki 1956)

Discussion

Domestication syndrome and Vitis vinifera groups

The domestication probability distributions based on PDIrF values (Fig. 6) are highly informative in this respect: they characterize the ensembles of Asian, American and European wild species and differentiate them from the different geographical groups of cultivated grapevines.

Fig. 6

Distribution of the domestication probability, estimated with the randomForest method, within the main geographical groups of domesticated and wild grapevines. Note: the x-axis represents probabilities ranging from 0 to 1 overall, but in each individual graph the scale changes according to the minimum and maximum probability values. Similarly, the y-axis frequency is the number of seeds analyzed, so its scale varies in each graph depending on the quantity of seeds analyzed. Total modern seeds analyzed per category: Proles Orientalis Subproles Antasiatica 239, Proles Orientalis Subproles Caspica 831, Proles Occidentalis 180, Proles Pontica 808, American wild species 356, Asian wild species 80, V. sylvestris 627, V. caucasica 20

The case of Vitis cf. labrusca described by Pierce and Tiffney (1986) is remarkable, since the subfossil material, c. 700 BC, presents domestication characteristics (Table 6, Fig. 4), while the modern seeds of living V. labrusca L. are clearly of the wild type (Table 5). This could suggest that domestication processes, in which Native American populations were involved, affected some American populations of Vitis prior to the arrival of Europeans, and that these populations possibly later became extinct or reverted to wild forms. An example of these links would be the use by the Cherokee of the fruits of V. labrusca as food and its juice as a beverage, as well as its leaves and the bark of its stems as medicine (Moerman 2023). However, we only consistently detected the domestication syndrome in the different geographic groups that include all cultivars of Vitis vinifera: proles Occidentalis, Pontica and Orientalis.

Moreover, we detect an east–west gradient in domestication values, so that the grapevines of the proles Orientalis, which comprises the Antasiatica and Caspica sub-proles, present slightly higher DI, PDILO and PDIrF values than those of the proles Pontica and Occidentalis (Table 5, Fig. 6). This may be due to recent differential selective factors that favor large grape berries with elongated seeds typical of table grapes and raisins, which are predominant in the Antasiatica sub-proles. More interestingly, it may be related to a chronological gradient of domestication within V. vinifera, which would be oldest in Central Asia and gradually more recent towards the West. The genome-wide association and genomic selection study of Myles et al. (2011) shows a clear west–east gradient in Vitis sylvestris that is recapitulated in the V. vinifera cultivars, and finds that all analyzed V. vinifera populations are genetically closer to eastern V. sylvestris than to western V. sylvestris. However, Arroyo-García et al. (2006) report that over 70% of the Iberian Peninsula cultivars display chlorotypes that are only compatible with their having derived from western V. sylvestris populations, although this could reflect a more recent domestication event (Dong et al. 2023).

Considering the case detected in the wild grapevines of the South Caucasus, with higher DI and PDIrF values (Kikvadze et al. 2024), it also cannot be ruled out that the ancestral grapevines of Central Asia had “more domesticated” starting traits. However, the seeds of six of the nine Asian species analyzed showed wild traits (PDIrF 0–0.13); the exceptions were Vitis amurensis Rupr. and V. piasezkii Maxim. var. pagnuccii (Rom. Caill. ex Planch.) Rehder, with clear traces of the “domestication syndrome” (PDIrF 0.5–0.7), and V. romanetii Rom. Caill., which is somewhat intermediate (PDIrF 0.22 ± 0.31) (Table 5).

The seeds of the interspecific hybrids (V. vinifera × V. sylvestris) and the samples from feral individuals present intermediate values (Table 5).

Did domestication traits precede domestication in Vitis species?

In the case of fossils, we dealt with extinct Vitis species. Most are based on leaf imprints recovered from different geological levels, but others are based on fossil seeds. The boundaries between Vitis and other Vitaceae fossil genera are difficult to delimit, and the same difficulty arises when deciding whether living species are present in the fossil record. Using a method based on conditional probability, Parins-Fukuchi (2018) placed a series of Vitaceae fossils within the cladistic framework of the Vitaceae on the basis of morphometric evidence, with satisfactory results. These results nevertheless show that the assignment by palaeontologists of fossil remains, in particular seeds, to a specific genus is uncertain, especially at the boundaries between Ampelocissus, Cissus and Vitis. We have therefore followed the consensus nomenclature and the inclusion criteria of the IFPNI (2023) database.

Wan et al. (2013), based on the study of nuclear DNA, and Liu et al. (2016), based on plastid and nuclear DNA, deduced that Vitis originated in the New World, but their dates disagree. For Liu et al. (2016), the origin occurred during the late Eocene (39.4 Ma), and Vitis then migrated to Eurasia, also in the late Eocene (37.3 Ma), when the segregation of the subgenera Muscadinia and Vitis occurred. For Wan et al. (2013), these events are 10 Ma more recent and occurred during the Oligocene. The separation between American and Eurasian grapevine species occurred later: at the earliest between 30 and 25 Ma in the model of Liu et al. (2016), and in the Miocene (11 Ma) in the model of Wan et al. (2013). What seems evident in both models is that Vitis vinifera is closer to Asian than to American species, and in particular close to V. heyneana (= V. jacquemontii). However, Vitis vinifera can hybridize successfully with all species of the subgenus Vitis.

The existence of fossil Vitis taxa showing the “domestication syndrome” in their seeds (Table 5, Fig. 4) would suggest that the potential to express the syndrome was present in the evolutionary characteristics of the genus, and that its expression is modulated by the selective effect of environmental factors (such as rainfall and temperature) and by the preferences of the dispersal vectors (humans, but also many other animal species) that transport the seeds. This expression has varied throughout the different geological periods (Fig. 7), being particularly high in present-day living populations and showing a very marked minimum during the glaciations.

Fig. 7

Chronology of the “Domestication syndrome” in terms of PDIrF. Note: Years BP, years before present. Box diagram: the boxes indicate the upper quartile, the median and the lower quartile, and the whiskers join the boxes to the upper and lower ends. Isolated points correspond to outliers. Red stars mark data that are doubtful because of the morphological evidence available or the dating. This is the case of several seeds studied from the South Caucasus

A decrease in index values is seemingly detected during the Eocene, especially 50–40 Ma, followed by a stabilized increase throughout the Oligocene (33.7–23.8 Ma), Miocene (23.8–5.3 Ma) and Pliocene (5.3–1.8 Ma) (Fig. 7). However, these apparent oscillations are due to doubtful samples (marked with red stars in Fig. 7); once these are set aside, the evidence indicates that the domestication syndrome postdates the last glaciation (Fig. 7).

Vitis fruits containing seeds are known to have been consumed by mammals, specifically the palaeothere Eurohippus (47.4–37.7 Ma), a sister taxon of the equids (true horses). It is quite possible that these palaeotheres dispersed the seeds, as they are found intact in the gut and not fragmented (Collinson et al. 2010).

The typical seed morphology of Vitis appears to predate the putative date of origin of the genus by nearly 30 Ma. Vitis is inferred by Liu et al. (2016) to have originated in the New World during the late Eocene (32.6–48.6 Ma) and then migrated to Eurasia in the late Eocene (30.9–45.1 Ma), which contrasts with the earlier age attributed to V. sparsa (Table 6).

The lowest presence of “domesticated” traits (in terms of maximum and mean values) in Vitis seeds is reached at the beginning of the last Quaternary glaciation, in the Pleistocene, 100 kyr, but the decrease in values begins two million years earlier, associated with the previous glacial period, 2.4–2.1 Ma, at the beginning of the Quaternary. It is important to underline that seeds with fully domesticated traits (a domestication index value of 1), with the exception of Vitis sparsa in the Late Cretaceous, do not appear until the beginning of the Holocene, after the last glaciation, 10 kyr (Fig. 4).

In short, Vitis fossil seeds present a clearly wild syndrome, characterized by rounded, relatively wide seeds and with a very short beak or stalk: mean length 0.5 mm, standard deviation 0.27 mm (Table 6, Fig. 4).

The oldest Vitaceae seeds linked to human activity were found in the Near East as part of the Acheulian diet at Gesher Benot Ya’aqov, Israel, 700 kyr (Melamed et al. 2016). The seed whose image was analyzed is small (3.4 × 2.6 mm, stalk length 0.6 mm), and its Stummer’s index value, 74.5, is intermediate between wild and domesticated (Stummer 1911; Valera et al. 2023). The PDILO and PDIrF values are extremely low, 0.07 and 0.00 respectively, clearly pointing to a wild taxon, in agreement with the DI value of 0.00. Furthermore, Valera et al. (2023) suggested the close relationship of these seeds with those of endemic Ampelopsis species of West Asia.
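
As a rough illustration of how a breadth-to-length index of this kind is read, the sketch below assumes that Stummer's index is the seed breadth divided by the seed length, multiplied by 100, and that the commonly quoted ranges of Stummer (1911) apply: 44–53 for cultivated seeds and 76–83 for wild seeds, with values in between treated as intermediate. The function names and the two example seeds are hypothetical and are not measurements from this study; only the value 74.5 is taken from the text.

```python
def stummer_index(length_mm: float, breadth_mm: float) -> float:
    """Breadth-to-length ratio of a seed, expressed as a percentage."""
    if length_mm <= 0 or breadth_mm <= 0:
        raise ValueError("seed dimensions must be positive")
    return 100.0 * breadth_mm / length_mm


def stummer_class(index: float) -> str:
    """Assign a class using the ranges ASSUMED in the lead-in (not fitted here)."""
    if index <= 53:
        return "cultivated"
    if index >= 76:
        return "wild"
    return "intermediate"


# Hypothetical seeds, chosen only to illustrate the two extremes.
elongated = stummer_index(length_mm=6.0, breadth_mm=3.0)
rounded = stummer_index(length_mm=5.0, breadth_mm=4.0)

# 74.5 is the index value reported in the text for the Gesher Benot Ya'aqov seed.
print(stummer_class(elongated), stummer_class(rounded), stummer_class(74.5))
```

Under these assumed ranges the reported value of 74.5 falls in the intermediate band, which is why the probability-based values carry the interpretation for this seed.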

The mean values of the recorded domestication syndrome follow a geographical rather than a chronological pattern (Fig. 6). The highest values occur in Italy and the Caucasus, while the lowest values were recorded in India, Denmark and Central Asia, despite the more recent age of the deposits. It should be noted here that Class 1 and Class 2 seeds from Late Bronze Sapalli levels (Chen et al. 2022) show clearly domesticated traits, although we have not been able to include them in the present study. Other seeds show clearly wild traits, and it is even possible that those in Class 4 belong to endemic Asiatic species of Vitis.

Zhou et al. (2017), based on genomic data from wild and cultivated grapevine samples, concluded that domestication can occur rapidly, instead of over millennia. They estimated that V. sylvestris and V. vinifera diverged c. 22,000 years ago and that the V. vinifera lineage experienced a steady decline in population size thereafter. The long decline may reflect low-intensity management by humans before domestication. According to Coito et al. (2019), V. vinifera and V. sylvestris diverged from a hermaphrodite and hexaploid common ancestor with basic chromosomal number x = 7, around 22,000 to 30,000 years BP, long before the domestication process started. This could explain the establishment of the “domestication syndrome”, in terms of high DI, PDILO and PDIrF values associated with V. vinifera, long before cultivation started (Fig. 7). However, if the doubtful samples are excluded, the emergence of the domestication syndrome appears to be associated with the development of the exploitation of the species by humans (Fig. 7).

Later on, V. vinifera “cultivated” grapevines became increasingly present in anthropogenic contexts (Fig. 4).

Concerning the identification of genes that contribute to domesticated phenotypes, Zhou et al. (2017) identified candidate genes in cultivated grapes that function in sugar metabolism, flower development, and stress responses. In contrast, candidate genes in the wild sample were limited to abiotic and biotic stress responses. The cost of domestication is reflected by cultivated V. vinifera grape accessions containing 5.2% more deleterious variants than wild individuals, and these were more often in a heterozygous state. However, it seems that genes determining the “domestication syndrome” in seeds remained overlooked (Fig. 8).

Fig. 8

Archaeological seeds with different degrees of the “Domestication syndrome”. (A) V. sylvestris, Middle Bronze Age: 185AR Qara Quzaq 1 137–92 (Syria). (B) V. sylvestris, Medieval, Alcazaba de Badajoz 2022/4 (Spain). (C) V. vinifera, Assyrian 262 Tell Khâmis as 85 5 (Syria) (the beak was broken during handling for SEM). (D) V. vinifera, Assyrian: 266 Tell Khâmis as 108 1 (Syria). (E) V. vinifera, Medieval 576 La Graja 2021/2_5 (Spain). (F) V. vinifera, Medieval, 646 Fortaleza Isso 2021/7a (Spain). SEM Images: J. Valera and D. Rivera

In general, archaeological V. vinifera seeds with the complete domestication syndrome (DI, PDILO and PDIrF values = 1) have been found in every investigated region (Fig. 9), together with seeds, not necessarily from the same site and period, that are clearly wild (values below 0.4 for DI, PDILO and PDIrF) or even completely wild (DI, PDILO and PDIrF = 0) (Fig. 9). It should be noted that in the Caucasus all the archaeological seeds analyzed showed traits of domestication, as did the archaeological Italian seeds. Despite the existence of numerous wild populations of grapevine in the South Caucasus nowadays (Ocete et al. 2018; Maghradze et al. 2020), most of those recently analyzed from Armenia and Georgia using seed morphometry show a high degree of domestication, likely involving feral and hybrid individuals (Kikvadze et al. 2024). This coincides with the very scarce presence of grapevines with typically wild traits in the archaeological record of the area (Fig. 9).

Fig. 9

Archaeology of the “Domestication syndrome”: distribution of the probability estimated using random forest (PDIrF). Note: the x-axis represents probabilities ranging from 0 to 1 overall, but in each individual graph the scale changes according to the minimum and maximum probability values. Similarly, the y-axis frequency is the number of seeds analyzed, so its scale varies in each graph depending on the quantity of seeds analyzed. Total archaeological seeds analyzed per category: Balkans 70, Central Asia 23, Caucasus 8, Denmark 2, France 6, India 4, Italy 43, Near East 144 and Spain 98

In addition, populations or sparse individuals of wild grapevine still living in Georgian natural environments present smaller genetic distances to local cultivars than in other European regions. Principal component analysis has also identified special overlapping of the wild set with some cultivars (Imazio et al. 2013).

Conclusion

The domestication syndrome exists in grapevine seeds and is notably characterized by the length of the stalk.

Published domestication indices, based on allometric relationships involving seed length and width, chalaza shield position and stalk length, effectively summarize the information provided by seeds, but unevenly. The combination of the results provided by the six indices into a single index, with values between 0 and 1, represents an advance in the tools available to represent the degree of domestication in numerical terms. The combined domestication index makes it possible to clearly differentiate the wild vines of the V. sylvestris–V. caucasica complex from the cultivars of V. vinifera, regardless of the geographical group of cultivars to which they belong. Its effectiveness is much less significant when it comes to differentiating V. sylvestris from American or Asian grapevines.
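
How six separate indices might be folded into one value between 0 and 1 can be sketched as follows. This is only an illustration under a strong assumption: that each index yields a verdict scaled from 0 (wild) to 1 (domesticated) and that the verdicts are averaged with equal weight. The combination rule actually used for the combined index is the one defined in the methods of this study, which may differ; the function name and the example scores are hypothetical.

```python
from typing import Sequence


def combined_index(scores: Sequence[float]) -> float:
    """Equal-weight average of per-index verdicts (ASSUMED rule, see lead-in)."""
    if len(scores) == 0:
        raise ValueError("at least one score is required")
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("each score must lie between 0 and 1")
    return sum(scores) / len(scores)


# Hypothetical verdicts from six indices for three imaginary seeds.
all_wild = [0, 0, 0, 0, 0, 0]
all_domesticated = [1, 1, 1, 1, 1, 1]
mixed = [1, 1, 0, 1, 0, 0]

for label, scores in (("wild", all_wild),
                      ("domesticated", all_domesticated),
                      ("mixed", mixed)):
    print(label, combined_index(scores))
```

Under this assumed rule, a seed on which the indices disagree lands between the extremes, which is the kind of intermediate value reported for several taxa.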

There are several approaches in terms of conditional probability, based on the availability of a set of well-characterized reference samples of seeds from clearly domesticated and wild grapevines, which allow estimating the probability that a particular seed comes from a domesticated grapevine. Logistic regression and the randomForest machine learning method provide very similar results that place the threshold for the domestication syndrome at 0.75. At the same time, they show a similar pattern to the combined index.
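
The thresholding step described above can be sketched as follows. The logistic form is standard, but the single predictor (stalk length) and the coefficients are hypothetical and chosen only for illustration; the models in this study are fitted to reference samples, draw on several measurements and indices, and their fitted values are not reproduced here. Treating a probability equal to 0.75 as reaching the threshold is also an assumption.

```python
import math

DOMESTICATION_THRESHOLD = 0.75  # threshold reported in the text


def logistic_probability(stalk_mm: float,
                         intercept: float = -4.0,
                         slope: float = 4.0) -> float:
    """Logistic model with HYPOTHETICAL coefficients (not fitted values)."""
    z = intercept + slope * stalk_mm
    return 1.0 / (1.0 + math.exp(-z))


def shows_syndrome(probability: float,
                   threshold: float = DOMESTICATION_THRESHOLD) -> bool:
    """True when the probability reaches the threshold (>= is an assumption)."""
    return probability >= threshold


# Hypothetical stalk lengths in millimetres.
for stalk in (0.5, 1.0, 1.5):
    p = logistic_probability(stalk)
    print(f"stalk {stalk} mm -> p = {p:.2f}, syndrome: {shows_syndrome(p)}")
```

With these illustrative coefficients, only the longest stalk exceeds the 0.75 threshold. A random forest classifier would supply the probability in a different way, but the same threshold comparison would apply to its output.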

The randomForest method has made it possible to assess the relevance of the different indices and characters measured, showing the importance of stalk length, the first formula of Mangafa and Kotsakis and the Facsar-Perret index, well above the classic Stummer index.

The domestication syndrome in grapevine seeds, characterized by a stalk longer than 1 mm, an elongated shape and a relatively large size, is not exclusive to domesticated Vitis vinifera, as it is also found in a few American and Asian species of Vitis otherwise considered wild. There are even several fossil species of Vitis, now extinct, that show the domestication syndrome in their seeds. In general, however, the domestication syndrome was only clearly established from 2000 BC onward. The age of earlier seeds with domesticated characteristics should be confirmed by specific radiocarbon dating.