Introduction

Pine nuts are essential for pine reproduction and afforestation (Huang et al. 2022). Stone pine (Pinus pinea L.) is a forest tree of considerable ecological value, and its non-wood products are highly valued in Mediterranean forests (Fig. 1a and b) (Calama et al. 2016). Stone pine nuts are among the key components of Mediterranean forest ecosystems, and the species has been cultivated by Mediterranean peoples for millennia. The distinctive umbrella-shaped appearance of the tree and its frequent cultivation as an ornamental contribute to its prominence in the Mediterranean region. The cones of stone pines have been harvested for their edible nuts since the Paleolithic era (Mutke et al. 2005). Pinus pinea nuts command a high market value, which makes the species a profitable crop. The species is resilient under adverse conditions and thrives in impoverished or eroded soils. Its natural resistance to pests and diseases minimizes the need for intensive cultivation practices, while its remarkable drought resistance (Çalışkan and Boydak 2017) qualifies it as a promising candidate for horticultural cultivation. However, there is current evidence of declining growth and pine nut yield due to biotic factors such as Diplodia (Caballol et al. 2022; Hlaiem et al. 2023) and Leptoglossus occidentalis (Bracalini et al. 2013; Lesieur et al. 2014; Farinha et al. 2018), as well as climatic factors (Parlak 2017; Balekoglu et al. 2020).

The Mediterranean region has approximately 860,000 hectares of forests dominated by stone pines. These forests span from the Atlantic coast of Portugal to the Black Sea and Mount Lebanon (Mutke et al. 2012). They are located primarily in Spain (approximately 450,000 hectares), Portugal (195,000 hectares; ICNF 2013), Türkiye (176,000 hectares; OGM 2021), and Italy (40,000 hectares; Pereira et al. 2015).

Stone pines exhibit notable phenotypic plasticity and adaptability coupled with relatively limited genetic variability (Fallour et al. 1997; Vendramin et al. 2008). Genetic diversity is commonly acknowledged as a crucial factor for adjusting to various environmental conditions. Species that are both genetically poor and extensively distributed are uncommon, and the stone pine stands out in this regard because of its significantly low genetic diversity. Nevertheless, the species displays a significant degree of variation in its adaptive characteristics, as highlighted by Vendramin et al. (2008).

Two basic conditions must be met to obtain the highest yield, in both quantity and quality, from afforestation in a given area and to keep the resulting stands healthy (Boydak and Çalışkan 2014, 2015). The first is to select the seed sources best suited to the objectives in terms of quantity and quality, to breed these sources, and to afforest with seedlings raised from seeds collected from them. The second is to use seedlings from a given seed source only within certain elevational and horizontal distances of that source, taking soil and climatic conditions into account. Otherwise, seedling survival, development, form, and resistance to biotic and abiotic stressors will decrease significantly.

On the other hand, geographic variation is a concept that describes the differences in morphological traits between populations of a species across their natural distribution. These variations can be observed in both the generative organs and vegetative parts of the trees. Studying geographic variation is essential for understanding the diversity within a species and can provide valuable insights for selecting suitable populations for provenance trials. Provenance trials are experiments that evaluate the performance of seedlings from different origins or from different seed stands when grown in different locations. These trials help determine the optimal horizontal and vertical distances, or “seed transfer zones,” for transporting seeds. Provenance trials are considered the most reliable method for establishing seed transfer zones. By assessing the growth and resistance of seedlings from different origins in various locations, provenance trials can provide valuable information for selecting the most suitable seed sources for afforestation. This information helps ensure that the planted seedlings are well adapted to the local environment and have the best chance of survival and growth (Boydak and Çalışkan 2021). Knowing the seed and seedling characteristics of populations is important for determining differences in applications in seed transfer zones and geographic variation studies (Boydak and Çalışkan 2021).

Several studies have investigated the use of machine learning (ML) algorithms in afforestation and breeding programs. Precision morpho-physiological trait measurements and genotyping data are regarded as crucial inputs in contemporary breeding programs (Duc et al. 2023; Niknejad et al. 2023). In the last decade, ML algorithms have been extensively used in agriculture and forestry (Chen et al. 2023; Montagnoli et al. 2016). ML algorithms are also used for variety classification based on the shape characteristics of the seeds of some species (Yang et al. 2019; Osako et al. 2020).

ML algorithms are rapidly gaining traction as powerful tools for predicting phenotypes. As the volume and complexity of biological data continue to increase, ML algorithms have demonstrated a remarkable ability to extract patterns and make predictions, offering insights into the relationships between genes, environments, and observable traits. In addition, several mathematical models and deep learning methods have been applied to analyze the dynamics of pine wilt disease and its effects on forest ecosystems (Rao et al. 2023; Shah et al. 2024). ML algorithms have also been used to predict stand volume growth in several national-scale forests (Tian et al. 2022). In recent years, significant breakthroughs have been made in ML and deep learning for drought modeling, and knowledge-based systems built on big data platforms that incorporate machine learning and deep learning are considered highly important for future research (Prodhan et al. 2022).

Throughout the life of a plant, seed characteristics play a crucial role in its survival strategies (Caliskan and Makineci 2014; Balekoglu et al. 2021). The development of high nut yielding stone pine populations requires genetic diversity in seed and seedling traits across different stone pine populations to identify superior genotypes. Due to data collection challenges, our understanding of seed and seedling morphological traits is limited. In recent years, the use of ML algorithms to classify and predict the viability of seeds using data obtained from images has increased (Kusumaningrum et al. 2018; Zhang et al. 2018; Fan et al. 2020; Ma et al. 2020; Wang et al. 2021; Qi et al. 2024). On the other hand, research on ML-based classification of morphological seed traits from different origins or populations is limited (Nie et al. 2019).

The present study aims to classify and cluster seeds and seedlings obtained from different stone pine populations based on their morphological characteristics, using ML algorithms, whose use has grown rapidly in recent years. The first hypothesis is that the seeds and seedlings can be classified by population from their morphological characters using ML algorithms. The second hypothesis concerns clustering: the ideal number of clusters is determined with the k-means algorithm, and the resulting clusters are compared with the actual populations.

Materials and methods

Seed material

The collection of seed material from six natural stands of stone pine (Pinus pinea L.) across Türkiye was carried out by Çalışkan et al. (2018) (Fig. 1b). The locations and geo-climatic traits of the six populations are provided in Table 1. To ensure representative sampling, mature cones were harvested from 15 to 25 mature trees within each population, with a minimum distance of 50 to 100 m between each tree. For each tree, ten mature cones were placed in separate, marked cloth bags and taken to the laboratory. The samples were stored at room temperature until further analysis. Subsequently, seeds were manually extracted from 50 randomly selected cones within each of the six populations (Fig. 1c, d and f).

Fig. 1

Distribution map of stone pines (Pinus pinea) in the Mediterranean Basin (a) (modified from Caudullo et al. 2021). Geographic locations of six natural stands of stone pines (Pinus pinea) across Türkiye (b). KO Kahramanmaraş-Onsen, MK Muğla-Katrancı, AK Aydın-Koçarlı, IK İzmir-Kozak, TK Trabzon-Kalenema, CK Çanakkale-Kirazlı. The cones of stone pines generally take three years from pollination to ripening (c). Extracted seeds from the cone of stone pines (d). One-year-old stone pine seedlings belonging to six different populations, grown in the nursery (e). The view of natural stands of stone pines in Kahramanmaraş (KO) (f)

Table 1 Geographical and climatic characteristics of the six stone pine (Pinus pinea) populations across their natural distribution in Türkiye. P is the mean annual rainfall in mm; M is the mean of the maxima of the hottest month; m is the mean of the minima of the coldest month; PE is the sum of rainfall in June, July, and August; ME is the mean of the maxima temperature of June, July, and August for the period 1987–2016; S is the summer drought index value; and Q is the humidity category value. A Bioclimate zones, according to Emberger (Daget et al. 1988)

Seed and seedling measurements

In each population, fifty cones were chosen randomly and dissected entirely to assess the seed yields of the six populations. The viable seeds were differentiated by submerging each cone’s seeds in water. An objective assessment of the seed measurements was made by sampling ten seeds per cone (Ganatsas et al. 2008). Seed measurements were taken on a total of 286 seeds. The measurements of seedlings were carried out on 216 seedlings (9 seedlings × 4 replications × 6 populations) in the first and second years. Abbreviations and units for the measured cone, seed, and seedling characteristics are given in Table 2. A summary of seed and seedling characteristics across six stone pine populations is given in Table 3.

Plastic containers (18 cm tall and 190 cm³) were used for sowing the seeds. After one year, the seedlings (Fig. 1e) were transferred to plastic containers measuring 20 cm in height and 12.5 cm in diameter with a capacity of 2.5 L. A 2:1:1 mixture of soil, peat, and river sand was placed in these containers. For the trial, a randomized block design with five replications was used. The plants were grown outdoors at the Istanbul Forest Nursery (the nursery is located at an altitude of 126 m above sea level at 41°10′56′′N, 28°59′14′′E), where they received daily irrigation and no additional fertilization. The physical and chemical compositions of the container media were as follows: 68.2% sand, 14.5% silt, 17.3% clay, pH 6.9, electrical conductivity (EC) 160.1 µmhos/cm, 2.4% carbon content, 0.2% nitrogen content, and no calcium carbonate. According to Thornthwaite’s classification, the climate around the nursery is humid, mesothermal, and maritime, with a moderate water deficit during the summer months. The average annual rainfall was recorded at 1096 mm, with an average temperature of 13.2 °C (Baylan and Ustaoğlu 2020).

Table 2 List of morphometric traits of P. pinea seeds and seedlings with measurements and units. a and b indicate ≈ 20% and 10% moisture content, respectively. * No terminal buds were observed in the first year
Table 3 Statistical comparison of the mean values of the analyzed traits among the six stone pine populations. For the trait abbreviations, please see Table 2

Analyses

Descriptive statistics were computed for the seed and seedling datasets: the minimum, maximum, mean, median, and 1st and 3rd quartile values. The correlations between the attributes were then analyzed with the Pearson correlation coefficient, and the significance levels (p values) of these coefficients were calculated.
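As a minimal illustration of the correlation step, the Pearson coefficient and the t statistic used to test its significance can be sketched as follows. The study's analyses were run in R; this Python sketch is not the authors' code, the function names are illustrative, and the measurements are hypothetical. The p value would be read from a t distribution with n - 2 degrees of freedom, which is left out here because the Python standard library has no t distribution.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def t_statistic(r, n):
    """t statistic for H0: rho = 0 (n - 2 degrees of freedom); requires |r| < 1."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Hypothetical seed lengths (mm) and seed weights (g); not data from the study.
seed_length = [17.1, 18.4, 16.9, 19.2, 18.0, 17.6]
seed_weight = [0.61, 0.72, 0.58, 0.80, 0.69, 0.66]

r = pearson_r(seed_length, seed_weight)
t = t_statistic(r, len(seed_length))
print(f"r = {r:.3f}, t = {t:.2f}")
```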

ML algorithms were used to solve the classification and clustering problems. In the present study, the supervised learning algorithms k-Nearest Neighbors (k-NN), Naive Bayes (NB), Support Vector Machines (SVM), C5.0, Classification and Regression Trees (CART), and Random Forest (RF) were used to predict the populations of seeds and seedlings. The algorithms are summarized briefly below; more details can be found in Breiman et al. (1984), Vapnik (1995), Han and Kamber (2006), Harrington (2012), Murphy (2012), Balaban and Kartal (2018), Jafarzadegan et al. (2020), Kartal et al. (2020), and Quinlan (2022). The k-NN algorithm classifies instances in a dataset based on a given distance measure: the class of an unlabeled instance is determined from the classes of its k nearest neighbors. In this study, Euclidean distance was used, and the smallest k that gave the highest accuracy was chosen. NB is a statistical classification method based on probabilities calculated with Bayes’ theorem. Its most important assumption is that the attributes are conditionally independent given the target value. SVMs are supervised learning techniques often used for classification and regression tasks. The goal is to separate the data points of two classes with a hyperplane, and the principle of margin maximization is used to obtain the best separation between the two classes. C5.0 is a decision tree algorithm used for classification; its trees are built using concepts such as entropy and information gain. The CART algorithm builds classification and regression trees, most often with the help of the Gini index, a measure of node impurity, and performs only binary branching. RF is an ensemble learning algorithm in which a model is built by combining many decision trees; it is used to solve both classification and regression problems. The attributes used in the trees and the observations used to train them are randomly selected, and the final decision is based on the consensus of the decision trees.
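To make the k-NN description concrete, the following is a minimal Python sketch of majority-vote classification with Euclidean distance. It is an illustration under stated assumptions rather than the study's implementation (the study used R packages): the function name `knn_predict` and the two-trait measurements are hypothetical, and only the population codes KO and CK are taken from the text.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Predict the class of `query` by majority vote among its k nearest neighbours."""
    # Euclidean distance from the query to every training instance.
    distances = [
        (math.dist(features, query), label)
        for features, label in zip(train_X, train_y)
    ]
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (seed length, seed width) values in mm for two populations.
train_X = [(17.0, 8.0), (17.4, 8.2), (16.8, 7.9),
           (20.1, 9.6), (19.8, 9.4), (20.4, 9.8)]
train_y = ["KO", "KO", "KO", "CK", "CK", "CK"]

print(knn_predict(train_X, train_y, (17.2, 8.1), k=3))
print(knn_predict(train_X, train_y, (20.0, 9.5), k=3))
```

With an even k and several classes, ties in the vote are possible; this sketch simply returns whichever tied label `Counter` lists first, which a real analysis would want to handle explicitly.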

In the present study, the stratified hold-out method was employed to evaluate the classification performance of the ML algorithms. The dataset was randomly divided into training and test sets subject to three conditions: (1) the data used in training were not included in the test set, so the data used in the testing phase were completely new to the model; (2) each observation was selected only once for the training or test set (sampling without replacement); and (3) the proportion of class labels for the target attribute was approximately maintained in both the training and test sets. The ML algorithms were trained using 80% of each dataset, while the remaining 20% was reserved for testing the performance of the models. The reported evaluation results refer to the test set. Furthermore, the importance of the attributes in the datasets was evaluated with the CART, C5.0, and RF algorithms. With C5.0, predictor importance is assessed as the percentage of training set samples that fall into all terminal nodes following the split (Kuhn and Quinlan 2023). With CART, the overall importance of a variable is the sum of the goodness-of-split measures for each split in which it was the primary variable, plus goodness × (adjusted agreement) for all splits in which it was a surrogate, since a variable may occur in the tree multiple times, either as a primary or a surrogate variable (Therneau and Atkinson 2022). Finally, in RF, importance is measured by the mean decrease in the Gini index, with Gini serving as the impurity measure. The mean decrease can be defined as the average decrease in node impurity, weighted by the proportion of samples reaching that node in each decision tree in the RF (Gómez-Ramírez et al. 2020; Liaw and Wiener 2002); equivalently, it is the total decrease in node impurity caused by splitting on a variable, averaged across all trees (Han et al. 2016). The greater the mean decrease in Gini, the more important the variable.
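The stratified hold-out procedure described at the start of this paragraph can be sketched as follows. This is a minimal Python illustration, not the R code used in the study; the function name, the fixed random seed, and the balanced label vector are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, train_fraction=0.8, seed=42):
    """Return (train_idx, test_idx) with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # each index is used once: sampling without replacement
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 10 observations for each of the six populations.
labels = [pop for pop in ("KO", "MK", "AK", "IK", "TK", "CK") for _ in range(10)]
train_idx, test_idx = stratified_holdout(labels)
print(len(train_idx), len(test_idx))
```

Splitting within each class, rather than over the pooled data, is what keeps the class proportions approximately equal in the two sets (condition 3), while the disjoint index lists satisfy conditions 1 and 2.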

In classification tasks, the performance of the ML algorithm is evaluated using the confusion matrix given in Table 4 (Murphy 2012). In binary classification, where there are two class labels - one negative and one positive - four possible outcomes can arise, as indicated in Table 4.

Table 4 The confusion matrix. True positives (TP): positive cases correctly classified as positive; false positives (FP): negative cases incorrectly classified as positive; false negatives (FN): positive cases incorrectly classified as negative; true negatives (TN): negative cases correctly classified as negative

Using the given confusion matrix, the most popular performance evaluation metrics, namely accuracy, error rate, True Positive Rate (TPR)/recall/hit rate/sensitivity, Positive Predictive Value (PPV)/precision, and F1-Score/F-Measure/F-Score, can be calculated with the following formulas (Eqs. 1 to 5):

$${\rm Accuracy }= \frac{TP+TN}{TP+FP+FN+TN}$$
(1)
$${\rm Error } = {\rm 1-Accuracy}$$
(2)
$${\rm True\, Positive\, Rate\, (TPR)} = \frac{TP}{TP+FN}$$
(3)
$${\rm Positive\, Predictive\, Value\, (PPV)} = \frac{TP}{TP+FP}$$
(4)
$${\rm F1\,-Score} = \frac{2 \times TPR \times PPV}{TPR+PPV}$$
(5)
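Equations 1 to 5 translate directly into code. The sketch below is a Python illustration with hypothetical counts (the function name and the numbers are not from the study); it assumes that TP + FN and TP + FP are both non-zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, error rate, TPR, PPV and F1-Score from a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total              # Eq. 1
    error = 1 - accuracy                      # Eq. 2
    tpr = tp / (tp + fn)                      # Eq. 3 (recall / sensitivity)
    ppv = tp / (tp + fp)                      # Eq. 4 (precision)
    f1 = 2 * tpr * ppv / (tpr + ppv)          # Eq. 5
    return {"accuracy": accuracy, "error": error, "tpr": tpr, "ppv": ppv, "f1": f1}

# Hypothetical counts for illustration only.
m = binary_metrics(tp=40, fp=10, fn=20, tn=30)
print(m)
```

For these counts the accuracy is 70/100 = 0.7, the TPR is 40/60, the PPV is 40/50 = 0.8, and the F1-Score is 80/110.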

In this study, in addition to the criteria formulated above, the False Positive Rate (FPR)/fall-out, True Negative Rate (TNR)/specificity/selectivity, False Negative Rate (FNR)/miss rate, Negative Predictive Value (NPV), False Discovery Rate (FDR), and False Omission Rate (FOR) metrics were also used in the performance evaluation stage for classification. For more information on these metrics, which, like the previous ones, are derived from Table 4, please see Dua and Chowriappa (2013) and Snodgress (2023). Since there were more than two class labels in both datasets (six populations), macro-averaging was applied: the model performance evaluation criteria were calculated for each class label and then averaged.
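Macro-averaging can be illustrated with a small multi-class confusion matrix: each class is treated in turn as the positive class, the metric is computed for it, and the per-class values are averaged with equal weight. The Python sketch below uses a hypothetical three-class matrix; it is not the implementation in the R packages cited in the study, and other conventions exist (for example, deriving the F1-Score from the macro-averaged precision and recall).

```python
def macro_average(confusion):
    """Macro-averaged recall, precision and F1-Score.

    confusion[i][j] = number of class-i observations predicted as class j.
    """
    n = len(confusion)
    recalls, precisions, f1s = [], [], []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # rest of row c
        fp = sum(confusion[r][c] for r in range(n)) - tp  # rest of column c
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = recall + precision
        f1 = 2 * recall * precision / denom if denom else 0.0
        recalls.append(recall)
        precisions.append(precision)
        f1s.append(f1)
    return {
        "recall": sum(recalls) / n,
        "precision": sum(precisions) / n,
        "f1": sum(f1s) / n,
    }

# Hypothetical 3-class confusion matrix (rows: actual, columns: predicted).
confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 3, 7],
]
macro = macro_average(confusion)
print(macro)
```

Here the per-class recalls are 0.8, 0.6, and 0.7, so the macro-averaged recall is 0.7; the same holds for precision and F1-Score in this symmetric example.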

Then, to analyze the variation in seed and seedling traits across the populations, the k-means algorithm was used as an unsupervised learning algorithm. The actual population labels were removed from the datasets before the clustering analyses were performed. The k-means algorithm assigns similar instances in the dataset to the same cluster. Given a user-defined number of clusters k, the algorithm initially selects k cluster centers. Each instance is then assigned to the closest center, and the cluster centers are updated. Euclidean distance was used for the k-means algorithm in this study. This process is repeated until no instance moves between clusters or the maximum number of iterations is reached (Shmueli et al. 2018). Although the datasets included six populations, the parameter k, which indicates the number of clusters, was tested for values between 2 and 20 using the average Silhouette width (Rousseeuw 1987) to determine the ideal number of clusters. We additionally examined how closely the k-means clusters matched the actual populations for k from 2 to 6 (since there are six populations in the datasets).
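The k-means loop and the average Silhouette width used to choose k can be sketched in a few lines. The Python below is a minimal illustration, not the R workflow used in the study: the random initialization from data points, the fixed seed, the handling of empty clusters, and the two-blob toy data are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's k-means with Euclidean distance; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # initial centers: k distinct data points
    assignment = [None] * len(points)
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:     # no instance moved between clusters
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                      # keep the old center if a cluster empties
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment

def average_silhouette(points, assignment):
    """Average Silhouette width; points in singleton clusters score 0."""
    clusters = set(assignment)
    scores = []
    for i, p in enumerate(points):
        own = [q for j, q in enumerate(points)
               if assignment[j] == assignment[i] and j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(math.dist(p, q) for q in own) / len(own)
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, assignment) if lab == c)
            / assignment.count(c)
            for c in clusters if c != assignment[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical two-trait data forming two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1), (8.1, 8.2)]

scores = {k: average_silhouette(points, kmeans(points, k)) for k in range(2, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On this toy data the largest average Silhouette width is obtained for k = 2, mirroring the way the ideal number of clusters is read from a plot of average Silhouette width against k.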

In the data preprocessing stage, min-max normalization was applied to both the seed and seedling trait datasets. For clustering, based on the algorithm’s performance, the seed trait dataset was used without normalization and the seedling trait dataset with normalization (Figs. 2 and 3). Analyses were performed in RStudio with the R programming language (Posit 2023; R Core Team 2023). The following R packages were used: C50 (Kuhn and Quinlan 2023), caret (Kuhn 2008), class (Venables and Ripley 2002), cluster (Maechler et al. 2022), clusterSim (Walesiak and Dudek 2020), cpfa (Snodgress 2023), e1071 (Meyer et al. 2023), factoextra (Kassambara and Mundt 2020), ggplot2 (Wickham 2016), randomForest (Liaw and Wiener 2002), and rpart (Therneau and Atkinson 2022). Figures 2 and 3 show flowcharts of classification and clustering.
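Min-max normalization rescales each attribute to the interval [0, 1] via (x - min) / (max - min). A minimal Python sketch follows; the function name and the cone diameters are hypothetical, and mapping a constant column to zeros is an assumption of this example rather than something stated in the study.

```python
def min_max_normalize(column):
    """Rescale a list of numbers to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical cone diameters in mm; not data from the study.
cone_diameter_mm = [78.0, 85.5, 92.0, 81.0]
scaled = min_max_normalize(cone_diameter_mm)
print(scaled)
```

Because distance-based methods such as k-NN and k-means sum squared differences across attributes, rescaling prevents attributes measured in large units from dominating the distance.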

Principal Component Analysis (PCA) is a statistical method that reduces the dimensionality of a dataset by generating linear combinations of the original variables known as principal components (Greenacre et al. 2022). These components capture the largest variation in the data, resulting in a simplified approximation of the original dataset (Greenacre et al. 2022). It converts data to a new coordinate system and performs an orthogonal linear transformation (Öngen Bilir and Kardeş 2023). This study applied PCA to both datasets following clustering analysis for data visualization. A summary of the ML algorithms used in the study is given in Table 5.
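For two variables, the share of variance captured by each principal component can be computed in closed form from the eigenvalues of the 2 × 2 covariance matrix, which gives a compact picture of what "percentage of variance explained" means. The Python sketch below uses hypothetical data and the unscaled covariance matrix; whether the study's PCA was run on scaled or unscaled variables is not stated here, so this is an illustration only.

```python
import math

def explained_variance_2d(x, y):
    """Share of total variance captured by each principal component of (x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    # Eigenvalues of the symmetric covariance matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    radius = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam1, lam2 = half_trace + radius, half_trace - radius
    total = lam1 + lam2
    return lam1 / total, lam2 / total

# Hypothetical, strongly correlated measurements.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
pc1, pc2 = explained_variance_2d(x, y)
print(f"PC1: {pc1:.1%}, PC2: {pc2:.1%}")
```

When the variables are strongly correlated, as in this example, almost all of the variance falls on the first component, which is the situation in which a two-dimensional PCA plot is a faithful summary of the data.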

Table 5 The fundamental characteristics of the ML algorithms used in the study
Fig. 2

Flowchart of the classification process. Six ML algorithms were applied to classify the stone pine (Pinus pinea) populations. For the abbreviations used, please see Table 2

Fig. 3

Flowchart of the clustering of stone pine (Pinus pinea) populations. For the abbreviations used, please see Table 2

Results

The means and ranges of variation for the seed and seedling traits are shown in Table 3. Table S1 and Table S2 show the correlation analysis of these traits. The results indicate that nearly all the correlations among the seed traits were statistically significant. Moreover, correlation analysis of the seed and seedling traits revealed that the seed traits had greater r values than the seedling traits. The correlations between the seedling traits were not greater than 0.6, while the correlation between the seed traits reached 0.9.

Classification of stone pines seeds and seedlings

The k parameter of the k-NN algorithm was tested from 2 to 20 (Fig. S1). The best performance was achieved with k = 2 for seed traits (accuracy = 0.52) and k = 18 for seedling traits (accuracy = 0.57), comparable to the other algorithms’ performances.

Tables 6 and 7 show the model performance evaluation for stone pine seeds and seedlings, ranked by accuracy and then F1-Score from highest to lowest. For stone pine seeds, the RF algorithm achieved the best performance, with an accuracy of 0.648 and an F1-Score of 0.658. For stone pine seedlings, the best classification performance was observed for the k-NN algorithm (k = 18), with an accuracy of 0.571 and an F1-Score of 0.582.

Table 6 Evaluation results for the six-model performance of stone pine seeds
Table 7 Results of the six models’ classification performance assessment for seedlings of stone pine

Tables 8 and 9 report the importance of the attributes in the stone pine seed and seedling datasets, respectively, as estimated by the CART, C5.0, and RF algorithms. The larger the importance value, computed as described in the Analyses section, the more important the attribute is for the model. The classification performance of these algorithms was higher for the seed dataset, for which cone diameter (CD), seeds per cone (SN), and empty seeds (EN) ranked as the three most important attributes. These attributes can be taken into consideration in future studies.

Table 8 The order of attributes in the stone pine seeds dataset according to the importance values obtained with the classification and regression trees (CART), C5.0, and Random Forest (RF) algorithms
Table 9 The order of attributes in the stone pine seedlings dataset according to the importance values obtained with the classification and regression trees (CART), C5.0, and Random Forest (RF) algorithms

Clustering of stone pine seeds and seedlings

The clustering quality of the k-means algorithm was evaluated with the average Silhouette Index for different k values. For good clustering quality, the average Silhouette Index should be as close to 1 as possible. A Silhouette Index value close to -1 indicates that a sample is not in the appropriate cluster (Rousseeuw 1987). The k parameter of the k-means algorithm was tested from 2 to 20 (Fig. S2). The best performance was achieved with k = 2 for both the seed (average Silhouette Index = 0.48) and seedling (average Silhouette Index = 0.51) traits.

For the stone pine seeds (Fig. S3), dividing the data into two clusters appears most appropriate (average Silhouette Index = 0.48 for k = 2 versus 0.35 for k = 6); however, since the dataset is known to contain six populations, the results for k = 6 were interpreted. Likewise, for the seedlings (Fig. S4), two clusters appear most appropriate (average Silhouette Index = 0.51), but the results were interpreted for k = 6 for the same reason. According to the graph, for k = 6, there are 19, 42, 15, 34, 63, and 43 observations in the 1st, 2nd, 3rd, 4th, 5th, and 6th clusters, respectively. The average Silhouette Index values of these clusters were 0.28, 0.14, 0.23, 0.25, 0.15, and 0.11. The percentage of variance explained by each principal component for the seed and seedling datasets is shown in Fig. S5.

A PCA plot was generated to determine how the variables are related to each other. The PCA results for stone pine seeds and seedlings are presented in Fig. 4. Positively correlated variables are plotted on the same side of the graph, while negatively correlated variables are plotted on the opposite sides of the graph.

Using the first two principal components obtained with PCA for both datasets (Fig. 4), the clusters obtained from the k-means algorithm and the actual populations were visualized in Fig. 5 for the seed (Fig. 5-Left) and seedling (Fig. 5-Right) datasets. Population clusters are plotted in 2D space using the PCA coordinates of the two principal components. The colors in the graph represent the clusters revealed by the k-means algorithm, while the labels indicate the actual populations to which the observations belong.

Fig. 4

Principal component analysis (PCA) of stone pine seeds (top) and seedlings (bottom). For the abbreviations used, please see Table 2

Fig. 5

Left-Population clusters plotted in 2D space using principal component analysis (PCA) coordinates of the first two principal components of stone pine seeds (k-parameters of the k-means algorithm are 2 and 6). Right- Population clusters plotted in 2D space using PCA coordinates of two principal components of stone pines seedlings (k-parameter of the k-means algorithm is 6-upper; k-parameter of the k-means algorithm is 2-below). The clusters obtained from the k-means algorithm with PCA and the populations in the seedling dataset are visualized in 2 dimensions. The colors in the graph represent the clusters revealed by the k-means algorithm, while the labels indicate the actual populations to which the observations belong

Table 10 shows the percentage of observations in each cluster (k = 2 to k = 6) belonging to a particular class (KO, MK, CK, AK, IK, TK) for seeds/seedlings. For example, in Table 10, the cluster labeled 1 for seed traits by the k-means algorithm contains 61.70% KO, 38.46% MK, 96.23% CK, 20.83% AK, 34.04% IK, and 64.10% TK for k = 2.
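The example percentages above do not sum to 100 within a cluster, which suggests that each value is the share of a population's observations that fell into that cluster. Under that reading (an assumption of this example), such a table can be built from the cluster assignments and the actual labels as in the Python sketch below; the function name and the toy data are hypothetical.

```python
from collections import Counter

def population_share_by_cluster(populations, clusters):
    """For each cluster, the percentage of each population's observations it holds."""
    pop_totals = Counter(populations)
    pair_counts = Counter(zip(clusters, populations))
    table = {}
    for (cluster, pop), count in pair_counts.items():
        table.setdefault(cluster, {})[pop] = 100.0 * count / pop_totals[pop]
    return table

# Hypothetical labels and k-means assignments (k = 2) for nine observations.
populations = ["KO", "KO", "KO", "KO", "CK", "CK", "CK", "CK", "CK"]
clusters = [1, 1, 1, 2, 1, 2, 2, 2, 2]

table = population_share_by_cluster(populations, clusters)
print(table)
```

In this toy case, cluster 1 holds 75% of the KO observations and 20% of the CK observations, so for each population the shares across clusters sum to 100%.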

Table 10 Actual distribution of the populations in each cluster for k = 2, 3, 4, 5, 6 regarding stone pine seed/seedling traits. Bold values represent the most dominant population in each cluster

Discussion

Geographic variation describes the differences in morphological traits between populations of a species across their natural distribution. These variations can be observed in both the generative organs and vegetative parts of the trees and are essential for understanding the diversity within a species. They can also provide valuable insights for selecting suitable populations for breeding practices.

Chirici et al. (2016) and McRoberts et al. (2016) highlighted the effectiveness of k-NN for predicting forest attributes. Begum et al. (2015) also underscored the efficiency of the k-NN algorithm in data classification. These studies collectively suggest the value of k-NN in both forestry and plant phenotyping applications. Skowronski et al. (2021) found that ML algorithms outperformed traditional discriminant functions in classifying populations with different degrees of similarity. The k-NN, RF, SVM, and NB algorithms exhibited the highest classification accuracy, surpassing traditional statistical techniques and other ML algorithms. Researchers confirmed that ML algorithms are more accurate at discriminating and classifying populations than statistical techniques without limitations. Rao et al. (2023) reported that SVM and k-NN outperformed the other two ML algorithms. In a study by Sotomayor et al. (2023), five supervised ML algorithms were compared for prediction using multiple abiotic factors, such as topographic, edaphic, and climatic factors. The results showed that the RF model was the most accurate prediction model. Regarding the first hypothesis of our study, that seeds and seedlings can be classified in terms of morphological characters with ML algorithms, the best classification performance for seeds was obtained with RF, an ensemble learning algorithm that uses multiple models in parallel and outputs the majority class label for an unlabeled observation. The best classification performance for stone pine seedlings was observed for the k-NN algorithm.

Huang et al. (2022) utilized seven distinct pine nut varieties, namely P. bungeana, P. yunnanensis, P. thunbergii, P. armandii, P. massoniana, P. elliottii, and P. taiwanensis. Five algorithms, decision tree, RF, multilayer perceptron, SVM, and NB, were employed to classify the pine nut samples. That study demonstrated the effectiveness of ML algorithms for classifying pine nuts (Huang et al. 2022). When evaluating the classification performance of ML algorithms, the ultimate goal is to achieve accuracy as close to 1 as possible. Performance metrics such as sensitivity, precision, and F1-Score provide additional support in assessing the effectiveness of the models. According to the classification results in our study, the accuracy of the selected ML algorithms ranges from 0.286 to 0.571 for stone pine seedlings and from 0.463 to 0.648 for stone pine seeds.

Clustering analysis and PCA were used to evaluate the variability in seed quality among Pinus patula clonal seed orchards based on three physical cone characteristics: length, diameter, and weight. Cluster analysis identified five natural groupings out of 14 possible clusters. Cone length was found to be most important for group formation, with width and weight having equal effects (Owino et al. 2020). In our study, the first principal component (Dim1) of the PCA for seed traits explained 88.6% of the variance and the second principal component (Dim2) explained 7.9%; together, the two dimensions account for 97% of the variance. On the left side of the plot, the SW, EW, PW, SL, SD, and CD variables are all close to one another. All of the other variables are far from EN. For seedling traits, on the other hand, Dim1 explained 43.1% of the variance and Dim2 explained 20.1%; together, the two dimensions account for 63% of the variance. The variance explained increased to approximately 80% when one additional dimension was added (Fig. S5). On the left side of the plot, the variables BS, L2, D2, and B2 are all close to one another. All the other variables are far from the variables D1, L1, and B1. BS was the most significant parameter influencing the seedling traits (Fig. 4, Fig. S5). The most important features for seed and seedling traits were cone weight (CW) and bud set (BS), respectively. Using the attribute importance ratings given in our study, researchers can prioritize the attributes in Tables 8 and 9 in their future classification studies on seed and seedling traits. In the clustering analysis, the performance of the k-means algorithm was evaluated in light of our study’s second hypothesis. Clustering analyses were performed on the data sets for k values from 2 to 20 (Fig. S2); however, since there are six different seed and seedling traits, the algorithm’s performance was considered in particular for k = 2, 3, 4, 5, and 6. Tables 7 and 8 show the actual distribution of the populations in each cluster obtained by the k-means algorithm for stone pine seed traits and seedling traits, respectively. The most surprising result is that the best clustering performance was achieved with k = 2 for both seed and seedling traits. This can be seen in Fig. 5, which is plotted over the first two principal components of the data sets obtained by PCA. For stone pine seed traits, KO, CK, and TK are mostly in Cluster 1, while MK, AK, and IK are mostly in Cluster 2 for k = 2 (Table 10). For seedling traits, Table 10 shows that KO is quite distinct from the others (Cluster 1), while MK, CK, AK, IK, and TK are mostly in Cluster 2 for k = 2. In the stone pine seed traits dataset, for k values of 2, 3, 4, 5, and 6, more than 70% of the CK samples remain isolated in a single cluster. However, a similar situation was not observed for the stone pine seedling traits.
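The way a population's share in each cluster can be tabulated after k-means is sketched below. This is a minimal illustration under stated assumptions: a plain Lloyd's iteration with the first k points as initial centroids (a simplification, not necessarily the initialization used in this study), hypothetical two-dimensional scores, and made-up population labels `pop_A` and `pop_B`.

```python
import math
from collections import Counter, defaultdict


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    assign = None
    for _ in range(n_iter):
        # Assign every point to its nearest centroid (Euclidean distance).
        new_assign = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assign == assign:  # assignments stable: converged
            break
        assign = new_assign
        # Move each centroid to the mean of its current members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids


def cluster_composition(populations, assign):
    """Percentage of each population's samples that fall in each cluster."""
    counts = defaultdict(Counter)
    for pop, a in zip(populations, assign):
        counts[pop][a] += 1
    return {
        pop: {c: 100.0 * n / sum(cnt.values()) for c, n in sorted(cnt.items())}
        for pop, cnt in counts.items()
    }


# Hypothetical 2-D scores (e.g. two principal components); not study data.
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.3), (4.8, 5.0)]
populations = ["pop_A", "pop_B", "pop_A", "pop_B", "pop_A", "pop_B"]

assign, centroids = kmeans(points, k=2)
composition = cluster_composition(populations, assign)
print(composition)
```

A population that keeps a high percentage in a single cluster across several values of k, as reported for CK in the seed traits dataset, would show up in such a table as one dominant entry per row.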

Regarding climatic factors, the main drivers of significant effects on the morphometric traits of cones, seeds, and pine nuts were found to be Q (humidity category) and the mean of the maximum temperature in June, July, and August (ME) (Balekoglu et al. 2020). Within populations, the cone and seed characteristics and the germination behavior of P. pinea vary and are correlated with environmental or parent plant variables, such as stand age, canopy cover, and site conditions (Ganatsas et al. 2008). The weight of the pine nuts was significantly affected by the amount of rainfall in the third year before harvest time (r = 0.889), despite the negative effect of summer rainfall. Additionally, the percentage of sound (filled) seeds per cone was negatively affected by rainfall in June of the third year before harvest (Balekoglu et al. 2020). The distinction of the Çanakkale-Kirazlı (CK) population from other populations in the morphological, physiological, and biochemical studies conducted by Balekoglu et al. (2020, 2023a, b) on different populations of stone pines is also supported by the current study, in which CK was clustered in one group (96.23) (Table 10). According to the k-means algorithm, the CK population differed from the other populations in its seed and cone characteristics.

Considering that the shortening of the vegetation period from southern to northern latitudes and from low to high elevations will affect the bud set, the geographical distribution of the populations may have caused this distinction (Balekoglu et al. 2020). This study revealed that the morphological characteristics of seedlings in the first year did not correlate with the second-year data. One possible reason for this difference is maternal effects in the first year. In such clustering and classification studies, performing at least two years of seedling morphological measurements is therefore recommended. The results of the present study indicate that BS can be an important parameter for population discrimination. Finally, one limitation of the present study is that we did not have detailed site characteristics, such as soil and climate properties. Soil and climate characteristics can be included in the models in future studies. In addition, the analysis in this study is limited by the parameters used. Therefore, researchers who want to conduct similar studies are advised to optimize the model parameters they use in order to avoid overfitting, whether the analysis is descriptive or predictive.
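One common way to tune a model parameter while guarding against overfitting is cross-validation. The sketch below selects the number of neighbours for a k-NN classifier by leave-one-out cross-validation. It is offered only as an illustration under stated assumptions: the helper names, the candidate values, and the toy measurements are hypothetical, and this is not presented as the tuning procedure used in this study.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training samples (Euclidean distance).

    Ties go to the label met first when scanning from the nearest neighbour.
    """
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


def loo_accuracy(X, y, k):
    """Leave-one-out accuracy: each sample is predicted from all the others."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += knn_predict(train_X, train_y, X[i], k) == y[i]
    return hits / len(X)


def select_k(X, y, candidates):
    """Score every candidate k and return the first one with the best score."""
    scores = {k: loo_accuracy(X, y, k) for k in candidates}
    best = max(candidates, key=scores.get)
    return best, scores


# Hypothetical two-trait measurements for two made-up groups (not study data).
X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (3.0, 3.0), (3.2, 2.9), (2.9, 3.3)]
y = ["A", "A", "A", "B", "B", "B"]

best_k, scores = select_k(X, y, candidates=[1, 3, 5])
print(best_k, scores)
```

Because every sample is scored by a model that never saw it, a parameter value that only memorizes the training data gains no advantage. In this toy example k = 5 fails on every held-out sample, since the held-out sample's own group is always in the minority among the five remaining neighbours.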

Conclusions

In breeding programs, population selection represents the initial stage. It is important to ascertain the seed and seedling characteristics of populations, as this enables the differentiation of applications in seed transfer zones and geographic variation studies. Using ML algorithms to classify and cluster seed and seedling populations in terms of morphological characteristics is beneficial for advancing forestry practices. We employed various ML techniques to analyze seed and seedling trait variations across the populations: one unsupervised ML algorithm for seed and seedling clustering and six supervised ML algorithms for seed and seedling classification. The performances of the six supervised ML algorithms were compared on the seed and seedling datasets. The RF algorithm achieved the best classification performance for seed traits, and the k-NN algorithm achieved the best classification performance for stone pine seedlings. The best clustering performance was achieved with k = 2 for both the seed and seedling traits. According to the PCA, two dimensions accounted for 97% and 63% of the variance in the traits of seeds and seedlings, respectively. The most important features for seed and seedling traits were cone weight (CW) and bud set (BS), respectively. The phenotype of a woody plant represents its unique morphological properties. In order to facilitate genetic improvement and conserve genetic diversity, it is essential to employ population discrimination and individual classification. This study provides an infrastructure and inspiration for future tasks such as image recognition, image segmentation, breeding of populations, and conserving genetic diversity in forest management practices, especially regarding reforestation, yield optimization, and breeding programs.