Introduction

Lithic artifacts regularly constitute the most important and abundant remains found on Paleolithic sites. When analyzing lithic assemblages, in addition to taking metric measurements and recording attributes, it is common to classify unmodified flakes according to their morphological and technological features. This is a crucial part of lithic analysis because it sorts flakes into technological categories whose retained features and morphology are indicative of the production method that generated them. These technological products usually reflect different knapping strategies, stages of reduction, and variations in the organization of removals and surface exploitation. Well-known examples of technological classifications of flakes include bipolar/on-anvil flakes (Callahan, 1996; Hayden, 1980), overshot flakes (Cotterell & Kamminga, 1987), bifacial-thinning flakes (Raab et al., 1979), byproducts of blade production such as core tablets or crested blades (Pelegrin, 1995; Shea, 2013a), Kombewa flakes (Tixier & Turq, 1999; Tixier et al., 1980), and lateral tranchet blows (Bourguignon, 1992). While the use of technological categories is common and helps increase the resolution of lithic analysis, it is important to bear in mind that lithic artifacts are characterized by a high degree of morphological variability, which in many cases results in overlapping features. Consequently, several categories remain underused because of this variability, their overlapping features, and their similar roles in the volumetric management of the core throughout the reduction process.

The Middle Paleolithic in western Europe is characterized by a diversification and increase in knapping methods, resulting in generally flake-dominated assemblages (Delagnes & Meignen, 2006; Kuhn, 2013). For the analysis of Middle Paleolithic lithic assemblages, lists of technological products are common and generally reflect individual knapping methods, the organization of flake removals, and their morphology (Duran & Abelanet, 2004; Duran & Soler, 2006; Geneste, 1988; Shea, 2013b). These lists are usually dominated by categories of technological products related to the Levallois and discoidal knapping methods (Boëda, 1993, 1995a; Boëda et al., 1990), which constitute an important part of Middle Paleolithic lithic variability. Various discoidal and Levallois products have been identified, first appearing at approximately 400 ka across a vast area stretching from eastern Asia to the Atlantic coast of western Europe through Siberia and Central Asia, the Levant, eastern and central Europe (see bibliography in Romagnoli et al., 2022), and Africa (Adler et al., 2014; Blinkhorn et al., 2021). The identification of discoidal and Levallois products is therefore widespread in lithic analysis across various research schools and serves to create comparable datasets, explore specific technological adaptations in different ecological contexts, and discuss long-term techno-cultural traditions and technological change. One special category of such products is backed flakes, which exhibit remnants of the core on one of their lateral edges. Backed flakes are usually classified into two technological categories: “core–edge flakes” (éclats débordants) and “pseudo-Levallois points.” A third category, “core–edge flakes with a limited back” (éclats débordants à dos limité), has also been defined (Meignen, 1993, 1996; Pasty et al., 2004), although its use is not widespread (Duran & Abelanet, 2004; Duran & Soler, 2006; Geneste, 1988; Shea, 2013b). One reason for this may be their overlapping features, including morphology, and their role in core reduction, which is similar to that of classic core–edge flakes. As a result, they are usually absorbed into the group of core–edge flakes when lists of technological products are employed.

The present study aims to evaluate whether “core–edge flakes with a limited back” represent a discrete technological category that can be readily separated from classic core–edge flakes and pseudo-Levallois points based on their morphological features. While this may seem a purely technical exercise in classification, refining stone tool taxonomy allows researchers to better describe lithic technology, which is the basis for documenting patterns of tool production, transport, maintenance, use, discard, and reuse. It is equally important for exploring differences in how past human groups adapted their technical knowledge and skills to resource constraints, economic strategies, and social dynamics. Furthermore, an improved and comprehensive use of technological types and sub-types within core-trimming elements is fundamental to generating reliable comparisons between archaeological assemblages. Finally, core–edge flakes are present in multiple Paleolithic techno-complexes, particularly those based on centripetal and recurrent reduction strategies; better classification of types and sub-types within this technological category will therefore be of use in lithic studies across multiple periods and regions. In addition to evaluating the discreteness of this specific artifact type, the present study also explores a workflow for testing lithic categories and compares the effectiveness of data derived from 2D and 3D geometric morphometrics.

To test the discreteness of these backed flake categories, an experimental sample of backed flakes produced by discoidal and recurrent centripetal Levallois reduction is classified following their technological definitions. Geometric morphometrics on 2D outlines and 3D meshes are employed to quantify the morphological variability of the experimental assemblage, and machine learning algorithms are then employed to classify the flakes according to their technological category. Our hypothesis is that, although some degree of overlap is expected due to the high morphological variability of lithic artifacts, machine learning models should easily differentiate the abovementioned categories. Confirming this hypothesis would support the use of these backed flake categories in the classification of lithic assemblages. The creation of datasets based on the same analytical units and criteria will enable more accurate and reliable comparisons between stone tool assemblages, improving the definition of working hypotheses and interpretative models for technological, adaptive, and behavioral changes in prehistory (Romagnoli et al., 2022).

Methods

Experimental Assemblage

The present study uses an experimental assemblage comprising eight knapping sequences: six cores were knapped on Bergerac flint (Fernandes et al., 2012), and two cores were knapped on Miocene chert from south of Madrid (M. Á. Bustillo et al., 2012; M. A. Bustillo & Pérez-Jiménez, 2005). Three cores were knapped following the discoid sensu stricto concept, which closely corresponds to Boëda’s original technological definition of the knapping system (Boëda, 1993, 1995b), and five experimental cores were knapped following the Levallois recurrent centripetal system (Boëda, 1993, 1994, 1995b; Lenoir & Turq, 1995).

Six technological characteristics define the Levallois concept (Boëda, 1993, 1994): (1) the volume of the core is conceived as two convex asymmetric surfaces; (2) these two surfaces are hierarchical and not interchangeable, maintaining their respective roles as striking platform and debitage (or exploitation) surface throughout the entire reduction process; (3) the distal and lateral convexities of the debitage surface are maintained to obtain predetermined flakes; (4) the fracture plane of the predetermined products is parallel to the intersection between the two surfaces; (5) the striking platform is perpendicular to the overhang (the core edge, at the intersection of the two core surfaces); and (6) the technique employed during the knapping process is direct hard-hammer percussion. Depending on the organization of the debitage surface, Levallois cores are usually classified into the preferential method (where a single predetermined Levallois flake is obtained from the debitage surface) or recurrent methods (where several predetermined flakes are produced from the debitage surface), with removals being unidirectional, bidirectional, or centripetal (Boëda, 1995b; Boëda et al., 1990; Delagnes, 1995; Delagnes & Meignen, 2006).

According to Boëda (1993, 1995a, 1995b), six technological criteria define the discoid sensu stricto method: (1) the volume of the core is conceived as two oblique asymmetric convex surfaces delimited by an intersecting plane; (2) these two surfaces are not hierarchical, and their roles as striking platforms and exploitation surfaces can alternate; (3) the peripheral convexity of the debitage surface is managed to control lateral and distal extractions, thus allowing a degree of predetermination; (4) the surfaces of the striking platforms are oriented so that the core edge is perpendicular to the predetermined products; (5) the fracture planes of the products are secant; and (6) the technique employed is direct hard-hammer percussion.

A total of 139 unretouched backed flakes (regardless of termination type) were obtained from the experimental reduction sequences: 70 from discoidal and 69 from Levallois recurrent centripetal reductions. The following criteria were used for the classification of the backed flakes (Fig. 1).

Fig. 1 Backed artifact types from the experimental assemblage and their classification

Core–edge flakes/éclats débordants (Beyries & Boëda, 1983; Boëda, 1993; Boëda et al., 1990) have a cutting edge opposite, and usually (although not always) parallel to, an abrupt margin. This abrupt margin, or backed edge (dos), commonly results from the removal of a portion of the periphery of the core and can be plain, bear the scars of previous removals, be cortical, or present a mix of the three. Classic “core–edge flakes” (Boëda, 1993; Boëda et al., 1990), sometimes referred to as “core–edge flakes with a non-limited back”/“éclats débordants à dos non limité” (Duran, 2005; Duran & Soler, 2006), have a morphological axis that follows the axis of percussion, although it may deviate slightly (Beyries & Boëda, 1983).

“Core–edge flakes with a limited back”/“éclats débordants à dos limité” share with core–edge flakes the morphological feature of a cutting edge opposite a back. The main difference resides in a morphological axis clearly offset with respect to the axis of percussion (Meignen, 1993, 1996; Pasty et al., 2004). Because of this deviation, the backed edge is usually not completely parallel to the sharp edge, nor does it span its entire length.

Pseudo-Levallois points (Boëda, 1993; Boëda et al., 1990; Bordes, 1953, 1961; Slimak, 2003) are backed products in which the edge opposite the back has a convergent morphology, usually resulting from the convergence of two or more previous removals. As with core–edge flakes, the back usually results from the removal of one of the lateral edges of the core and can be plain, retain the scars of previous removals, or, more rarely, be cortical. Pseudo-Levallois points share with core–edge flakes with a limited back the deviation of symmetry from the axis of percussion, but they are clearly differentiable due to their triangular off-axis morphology.

Table 1 presents the distribution of backed flake types, following the previously established definitions. Due to the centripetal character of the knapping methods employed to generate the experimental assemblage, most of the backed flakes fall within the definition of core–edge flakes with a limited back (66.91%). Cortex distribution according to backed flake category (Fig. 2) shows that slightly cortical (~25%) or non-cortical products dominate all three categories, accounting for more than 65% of each group (90% of core–edge flakes, 68.82% of core–edge flakes with a limited back, and 87.5% of pseudo-Levallois points).

Table 1 Classification of backed flakes from the experimental assemblage
Fig. 2 Distribution of cortex according to backed flake category

Geometric Morphometrics

All flakes were scanned with an Academia 20 structured-light surface scanner (Creaform 3D) at 0.2-mm resolution. Flakes were scanned in two parts, aligned automatically (or manually in cases of automatic alignment failure), and exported in STL format. The free software CloudCompare 2.11.3 (https://www.danielgm.net/cc/) was employed for additional cleaning, mesh sampling, surface reconstruction, and conversion to PLY files. Finally, all meshes were decimated to 50,000 faces using the Rvcg R package (Schlager, 2017a). The present work compares the use of 2D and 3D geometric morphometrics to test the limits of their application.
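The decimation step can be scripted in R. Below is a minimal sketch using Rvcg, assuming the cleaned PLY meshes exported from CloudCompare sit in a local meshes/ folder (folder and output names are hypothetical; only the 50,000-face target comes from the workflow above).

```r
# Batch decimation sketch; paths and file names are hypothetical.
library(Rvcg)

ply_files <- list.files("meshes", pattern = "\\.ply$", full.names = TRUE)

for (f in ply_files) {
  mesh <- vcgPlyRead(f, updateNormals = TRUE)
  # quadric edge-collapse decimation to a target of 50,000 faces
  mesh_dec <- vcgQEdecim(mesh, tarface = 50000)
  vcgPlyWrite(mesh_dec, filename = sub("\\.ply$", "_50k.ply", f))
}
```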

2D geometric morphometrics were performed on screenshots (Cignoni et al., 2008) of the upper view of each flake, oriented along its technological axis. A thin-plate spline (tps) file was generated using tpsUtil v.1.82, and the outline tool of tpsDig v.2.32 (Rohlf, 2015) was employed to automatically trace the perimeter of each flake. Each outline was resampled to 100 equidistant landmarks (Fig. 3) using Morpho v.2.11 (Schlager, 2017a).
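The resampling step can be reproduced in R with Morpho. The sketch below is illustrative only: it assumes the tpsDig outlines are stored in a hypothetical flakes.TPS file and that readallTPS returns an $outlines element, as documented.

```r
# Outline resampling sketch ("flakes.TPS" is a hypothetical file name).
library(Morpho)

tps <- readallTPS("flakes.TPS")   # imports landmarks and outlines from tpsDig

# resample each traced outline to 100 equidistant 2D landmarks
outlines100 <- lapply(tps$outlines, function(o) equidistantCurve(o, n = 100))

# stack into a 100 x 2 x n landmark array for Procrustes superimposition
outline_array <- array(unlist(outlines100),
                       dim = c(100, 2, length(outlines100)))
```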

Fig. 3 Example of the positioning of 100 equidistant landmarks along the perimeter of a core–edge flake and the mean shape of the sample

The protocol for digitizing 3D landmarks on flakes is based on previous studies (Archer et al., 2018, 2021). It included the positioning of three fixed landmarks, 85 curve semi-landmarks, and 420 surface semi-landmarks (Bookstein, 1997a, 1997b; Gunz & Mitteroecker, 2013; Gunz et al., 2005; Mitteroecker & Gunz, 2009), for a total of 508 landmarks and semi-landmarks. The three fixed landmarks correspond to the two lateral ends of the platform and the point of percussion. The 85 curve semi-landmarks correspond to the internal and external outlines of the platform (15 semi-landmarks each) and the edge of the flake (55 semi-landmarks). Sixty surface semi-landmarks correspond to the platform surface, and the dorsal and ventral surfaces of the flake are each defined by 180 semi-landmarks. The workflow began with the creation of a template/atlas on an arbitrarily selected flake (Fig. 4: top). Landmarks and semi-landmarks were then positioned on each specimen and relaxed to minimize bending energy (Fig. 4: bottom; Bookstein, 1997a, 1997b). The complete workflow of landmark and semi-landmark digitization and relaxation to minimize bending energy was carried out in Viewbox Version 4.1.0.12 (http://www.dhal.com/viewbox.htm), and the resulting point coordinates were exported as .xlsx files.
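Although sliding was performed in Viewbox, the equivalent relaxation can be expressed with Morpho's relaxLM. The sketch below assumes 508 x 3 landmark matrices for a template and a specimen; the index vectors follow the configuration described above but are illustrative only.

```r
# Semi-landmark sliding sketch (the study used Viewbox; this is the
# analogous Morpho call). "template", "specimen", and "specimen_mesh"
# are assumed objects.
library(Morpho)

fixed    <- 1:3      # platform laterals and point of percussion
                     # (kept fixed by exclusion from SMvector)
curves   <- 4:88     # 85 curve semi-landmarks
surfaces <- 89:508   # 420 surface semi-landmarks

relaxed <- relaxLM(specimen, reference = template,
                   SMvector = c(curves, surfaces),  # points allowed to slide
                   outlines = list(curves),         # curves slide along tangents
                   surp     = surfaces,             # surface points slide freely
                   mesh     = specimen_mesh,        # reproject slid points onto mesh
                   bending  = TRUE)                 # minimize bending energy
```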

Fig. 4 Top: template/atlas for a randomly selected flake with the defined landmarks, curves, and surfaces. Bottom: landmark positioning after sliding to minimize bending energy on a pseudo-Levallois point. Fixed landmarks are indicated in red

Procrustes superimposition (Kendall, 1984; Mitteroecker & Gunz, 2009; O’Higgins, 2000) was performed using the Morpho v.2.11 package (Schlager, 2017a) in the RStudio IDE (R Core Team, 2019; RStudio Team, 2019). Morpho also provides results from principal component analysis (PCA), allowing the dimensionality of the data to be reduced (James et al., 2013; Pearson, 1901). There are multiple reasons to use dimensionality reduction when classifying high-dimensional data: to avoid having more predictors than observations (p > n), to avoid collinearity of predictors, to reduce the dimensions of the feature space, and to avoid overfitting due to an excessive number of degrees of freedom (a simpler structure with fewer variables). PCA achieves dimensionality reduction by identifying, in an unsupervised manner, the linear combinations that best represent the predictors. The principal components (PCs) of a PCA aim to capture as much variance as possible across the complete data (James et al., 2013), and the PCs that capture the most variance are not necessarily the best for classification.
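In R this step reduces to a single call. A minimal sketch, assuming the landmark array built earlier (the 3D case is analogous, with a k x 3 x n array):

```r
# Procrustes superimposition plus PCA with Morpho::procSym.
library(Morpho)

gpa <- procSym(outline_array)   # GPA and PCA in one call

pc_scores <- gpa$PCscores   # specimen scores on each PC (used as predictors)
variance  <- gpa$Variance   # eigenvalues, variance explained, cumulative variance
head(variance)
```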

Debate exists over how many PCs from geometric morphometrics should be selected for classificatory analysis (Schlager, 2017b). Including all PCs up to an arbitrary percentage of variance can result in non-meaningful (noise) PCs driving the classificatory analysis. This can be considered a type of overfitting, since the classification is not being driven by meaningful morphological trends. An alternative is to select PCs capturing a minimum percentage of variance. However, stone tools are notorious for their wide morphological variability, and increasing sample size diminishes the variance captured by each PC. Usually the first two to three PCs reflect ratios of elongation and width to thickness, while other PCs meaningful for classification (such as the angle between the internal and external surfaces of a flake) may be concealed in lower-ranking PCs.

The problem of selecting a minimum variance is approached here in two steps. A first round of models is trained using all PCs that together represent up to 95% of variance. The 95% threshold is arbitrarily selected because it balances retaining most of the dataset’s variance against a reduced number of variables. This first round identifies the most meaningful PCs for classification according to the best model. The effect of these meaningful PCs on morphology was then visually evaluated, and PCs explaining little variance and having little effect on shape change were excluded. Based on this evaluation, a second and final round of models was trained using only PCs capturing more than 3% of variance.
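The two selection rules can be sketched as follows, reusing the procSym output from above. The column layout of the variance table (and whether values are proportions or percentages) should be checked against the Morpho version used; flake_class is a hypothetical factor holding the three backed-flake categories.

```r
# PC-selection sketch (variance columns assumed: eigenvalue, % variance,
# cumulative % variance; verify for your Morpho version).
ex_var  <- variance[, 2]
cum_var <- variance[, 3]

pcs_95 <- seq_len(which(cum_var >= 95)[1])  # round 1: PCs summing to 95% of variance
pcs_3  <- which(ex_var > 3)                 # round 2: PCs capturing > 3% each

train_95 <- data.frame(class = flake_class, pc_scores[, pcs_95])
train_3  <- data.frame(class = flake_class, pc_scores[, pcs_3])
```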

The identification of the best PCs for classification is performed automatically by the machine learning models using the caret v.6.0.92 package (M. Kuhn, 2008). Morpho v.2.11 additionally provides visualization of shape change along each PC. A previous work on the same dataset (Bustos-Pérez et al., 2022) performed PCA using the stats v.4.2.2 package (Venables & Ripley, 2002). The present work uses the PCA integrated in Morpho v.2.11 (Schlager, 2017a); as a result, the variance captured by the PCs and their interpretation differ from the previous analysis.

Machine Learning and Resampling Techniques

Different machine learning models treat the provided data differently and, as a result, have different strengths and weaknesses; no universal model exists for all problems. Testing several models is thus an important step in machine learning: it allows the performance of different models to be compared and the best model for a given task to be identified, provides a general overview of the difficulty of the problem, and can serve as an indication of possible underlying problems with the data (e.g., overfitting or unbalanced datasets). Machine learning is a quickly developing field with a high number of available models. The present work tests ten machine learning models for the classification of flake categories (a shared training sketch is given after the list below). These models cover some of the most commonly employed algorithms (Jamal et al., 2018) and provide an extensive analysis for the classification of backed flake categories.

  • Linear discriminant analysis (LDA): reduces dimensionality in an attempt to maximize the separation between classes, while decision boundaries divide the predictor range into regions (Fisher, 1936; James et al., 2013).

  • K-nearest neighbor (KNN): classifies cases by assigning the class of similar known cases. The “k” in KNN refers to the number of cases (neighbors) considered when assigning a class, and it must be found by testing different values. Given that KNN uses distance metrics to compute nearest neighbors and that variables are on different scales, the data must be scaled and centered prior to fitting the model (Cover & Hart, 1967; Lantz, 2019).

  • Logistic regression: adapts multiple linear regression by raising Euler’s constant to the regression output in the numerator (plus one in the denominator). This yields probability values ranging from 0 to 1, which allows predictions for categorical outcomes (Cramer, 2004; Walker & Duncan, 1967).

  • Decision tree with the C5.0 algorithm: uses recursive partitioning to divide a dataset into homogeneous groups. The C5.0 algorithm uses entropy (a measure of the degree of mixture between classes) to determine the feature values on which to partition, and is an improvement on earlier decision trees for classification (Quinlan, 1996, 2014).

  • Random forest: uses an ensemble of decision trees. Each tree is grown from a random sample of the variables, allowing each tree to grow differently, better reflecting the complexity of the data and providing additional diversity. Finally, the ensemble of trees casts votes to generate predictions (Breiman, 2001).

  • Gradient boosting machine (GBM): uses a sequential ensemble of decision trees. After training an initial decision tree, GBM trains subsequent trees on a resampled dataset where the weight of observations difficult to classify is increased based on a gradient (Greenwell et al., 2019; Ridgeway, 2007). The subsequently trained trees complement earlier decisions, allowing the detection of learning deficiencies and increasing model accuracy (Friedman, 2001, 2002).

  • Support vector machines (SVM): fit hyperplanes into a multidimensional space with the objective of creating homogeneous partitions with the maximum margin of separation between classes. The maximum margin is reached by minimizing the cost (a value applied for each incorrect classification). SVMs can use different kernels to transform the data into linearly separable cases; the kernel selected for the transformation must be specified and plays a key role in training (Cortes & Vapnik, 1995; Frey & Slate, 1991). The present study tests SVMs with linear, radial, and polynomial kernels.

  • Naïve Bayes: computes class probabilities using Bayes’ rule (Weihs et al., 2005).
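All ten models can be trained through caret's common interface. The following minimal sketch shows the shared setup for a few of them, assuming the train_95 data frame assembled above; method names are caret's (e.g., "svmPoly" for the polynomial-kernel SVM).

```r
# Shared caret training sketch for a subset of the ten models.
library(caret)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 50)

fits <- list(
  lda = train(class ~ ., data = train_95, method = "lda", trControl = ctrl),
  knn = train(class ~ ., data = train_95, method = "knn", trControl = ctrl,
              preProcess = c("center", "scale")),  # KNN needs centered/scaled data
  rf  = train(class ~ ., data = train_95, method = "rf",  trControl = ctrl),
  svm = train(class ~ ., data = train_95, method = "svmPoly", trControl = ctrl)
)

sapply(fits, function(f) max(f$results$Accuracy))  # compare cross-validated accuracy
```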

As mentioned above, 66.91% of the flakes fall under the definition of core–edge flakes with a limited back, resulting in an unbalanced dataset. To counter this, up-sampling was undertaken for the two minority classes and down-sampling for the majority class. Up-sampling can be considered inappropriate for training machine learning models (Calder et al., 2022; McPherron et al., 2022) because it increases overfitting (samples used in the test set are likely to have already been used in the training set). Here, however, up-sampling the two minority groups increases the apparent discreteness of those groups but does not affect their potential overlap with the majority class (core–edge flakes with a limited back). Down-sampling, on the other hand, results in missing information, because some of the data are removed.

Random up- and down-sampling is conducted to obtain a balanced dataset and train the models. After each random sampling, each model is evaluated using k-fold cross-validation with 10 folds and 50 cycles. Each fold consisted of 7 flakes, with the exception of the last fold, which had 6. Because model performance metrics depend on the random up- and down-sampling, this process is repeated 30 times, extracting the performance metrics each time and averaging the values. The model with the best performance metrics is then trained again with thirty additional cycles of up- and down-sampling, and the reported variable importance and the confusion matrix from which performance metrics are extracted are obtained from these additional cycles.
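One way to implement these balancing cycles is sketched below, assuming a flakes data frame holding a class factor plus the selected PC scores, the ctrl object defined earlier, and a hypothetical per-class target of 46 flakes (roughly 139/3).

```r
# Balancing-cycle sketch: classes below the target are up-sampled with
# replacement, classes above it are down-sampled without replacement.
set.seed(123)
target_n <- 46   # hypothetical per-class size

accuracies <- replicate(30, {
  balanced <- do.call(rbind, lapply(split(flakes, flakes$class), function(d) {
    d[sample(nrow(d), target_n, replace = nrow(d) < target_n), ]
  }))
  fit <- train(class ~ ., data = balanced, method = "rf", trControl = ctrl)
  max(fit$results$Accuracy)
})

mean(accuracies)   # performance averaged over the 30 cycles
```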

Results

PCA and Machine Learning Model Performance

Results from the PCA on the 2D data show that the first nine principal components account for 95% of the variance in the dataset, with PC1 accounting for 39.33% of variance and PC9 for 0.98% (Fig. 5). For the 3D data, the first 22 principal components account for 95% of variance, with PC1 accounting for 31.32% and PC22 for 0.3% (Fig. 5). This represents a substantial reduction in dimensionality from the original number of variables in the 2D data (200) and the 3D data (1524) and is lower than the sample size (139).

Fig. 5 Proportion of variance for the first PCs, which add up to 95% of variance in the 2D and 3D data

Figure 6 presents the accuracy values for each model after their respective 30 cycles of random up- and down-sampling. The results are grouped according to whether all variables up to 95% of variance, or only variables capturing more than 3% of variance, were employed.

Fig. 6 Box and violin plots of model accuracy after 30 cycles of random up- and down-sampling, according to type of data (2D and 3D) and number of variables employed (up to 95% of variance, or capturing more than 3% of variance)

Regarding the number of variables employed, important differences can be observed (Fig. 6). Models trained on 3D data presented a more pronounced decrease in accuracy than models trained on 2D data when the number of variables was reduced. For both types of data, tree-based models (C5.0 tree, random forest, and GBM) were the least affected by a limited number of variables. In contrast, the accuracy of the SVM models was strongly affected: SVMs with linear and polynomial kernels presented lower accuracy values when the number of variables was reduced. This is especially noteworthy for the 3D data, since when variables summing to 95% of variance are employed, the SVM with polynomial kernel presents the highest accuracy values, whereas when the number of variables is limited, the random forest presents the highest values. In general, models trained on 3D data presented higher overall accuracy, independently of the number of variables employed for training.

In the case of the 2D data with a limited number of variables, the random forest model had the highest average value for general accuracy (0.785), closely followed by the decision tree (0.771) and the GBM (0.771). The LDA had the lowest average value for accuracy (0.496), followed by the SVM with linear kernel (0.497).

For the 3D data with a limited number of variables, the random forest again presented the highest average value for general accuracy (0.828), followed by the GBM model (0.783). The LDA model had the lowest average value for accuracy (0.658), followed by the logistic regression model (0.681).

Tables 2 and 3 present the performance metrics of the random forest model on the 2D and 3D data for the classification of the three products. The prevalence/no-information rate was kept constant at 0.33 for all three categories as a result of the random up- and down-sampling used to obtain balanced datasets. General performance metrics (F1 and balanced accuracy) for pseudo-Levallois points were similar for models trained on the 2D and 3D data, while those for the identification of core–edge flakes and core–edge flakes with a limited back increased when 3D data were employed instead of 2D data.

Table 2 Performance metrics of random forest for backed artifacts using 2D data and limited number of variables
Table 3 Performance metrics of random forest for backed artifacts using 3D data and limited number of variables

Feature Importance

Figure 7 presents the average variable importance for the three products after 30 cycles of up- and down-sampling and k-fold cross-validation when PCs summing to 95% of variance are employed. The random forest trained on 2D data considers four principal components important for classification; the SVM with polynomial kernel trained on 3D data considers five.

Fig. 7 Average feature importance after 30 cycles of up- and down-sampling, using PCs summing to 95% of variance
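Importance values of this kind can be extracted from a caret fit. A minimal sketch, reusing the objects defined above (for class-specific values, the underlying random forest must be grown with importance = TRUE):

```r
# Variable-importance sketch for the random forest.
fit_rf <- train(class ~ ., data = train_95, method = "rf",
                trControl = ctrl, importance = TRUE)
varImp(fit_rf)$importance   # scaled importance of each PC, per class
```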

PC2 (29.38% of variance) is considered the most important variable for discrimination using the 2D data, followed by PC1 (39.33% of variance), PC5 (3.38%), and PC3 (9.12%). Figure 8 presents shape change along these PCs. PC1 and PC2 both capture the elongation of flakes along an asymmetric axis, but with different orientations; positive values of either result in wider flakes. PC3 captures variance of flakes whose proximal part is much wider than the distal portion. PC5 appears to capture pointed extremes resulting from concave delineations of the middle portion of the laterals, which may represent the presence of convergent extremes at either end of the flake.

Fig. 8 Visualization of shape change along each PC for the 2D data. Differences from the mean shape have been magnified by a factor of three

In the case of the 3D data (Fig. 9), PC5 (5.5% of variance) is considered the most important variable for discrimination, followed by PC1 (31.32%), PC6 (5.06%), PC11 (1.45%), and PC3 (8.82%). PC11 presented an average importance value of 59.68; however, the low variance captured by this PC (1.45%) and visual evaluation of shape change indicate that its effect is minimal and that it should be excluded from the analysis.

Fig. 9 Visualization of shape change along each PC for the 3D data. Differences from the mean shape have been magnified by a factor of three

PC5 is driven by the interaction of platform depth and flake thickness. Increasing values of PC5 result in thinner flakes with platforms much wider than deep, the width of the platform finding its continuation in one of the abrupt laterals. The negative space of PC5 captures thicker flakes with platforms much deeper than wide. PC1 largely captures elongation along with platform size and thickness. Positive PC1 values result in very wide flakes with reduced length and bigger platforms, the increase in width being slightly accompanied by an increase in thickness; negative PC1 values result in thin, elongated flakes with a distal convergent edge and small platforms. Positive values of PC6 represent the convergence of one of the distal laterals into a pointed end, while the negative space of PC6 captures flakes with a wide proximal portion that becomes narrower towards the distal part. PC3 represents transversal flake morphology and the relationship between thickness, width, and asymmetry: increasing values of PC3 result in thicker and narrower flakes with a marked asymmetry resulting from a thick back located on the left lateral.

The combination of the variance captured by each PC (Fig. 5), the importance of each PC for classification (Fig. 7), and the visualization of shape change along each PC (Figs. 8 and 9) indicates that a threshold of 3% of variance captured is adequate for the present dataset. PC5 of the 2D data presents an example of a feature to be included in the analysis: despite capturing little more than 3% of variance (Fig. 5), it is considered important by the random forest as a classification feature (Fig. 7) and has a clear effect on sample shape (Fig. 8). PC11 of the 3D data presents the opposite example, where a variable should be excluded from the training of the final models: while it is considered an important variable for classification by the SVM with polynomial kernel, the low variance captured (1.45%) and its small effect on shape change prevent its inclusion in the final set of machine learning models.

Group Discreteness Through Confusion Matrix and PCA Biplots

The confusion matrices of the final models trained with the limited number of variables (Fig. 10) illustrate the directionality of confusion between the predicted and true classes for the 2D and 3D data. In both types of data, pseudo-Levallois points are the best identified, in accordance with their reported sensitivity and specificity; it is very difficult to mistake them for either of the other two technological products. Wrongly classifying a pseudo-Levallois point as a core–edge flake is very unlikely for the 2D data (2.4) and minimal for the model on 3D data (1.59). Although mistaking a pseudo-Levallois point for a core–edge flake with a limited back is slightly more common for both types of data, the confusion values remain very low.

Fig. 10 Normalized confusion matrix of the random forest for the 2D and 3D data when only PCs capturing more than 3% of variance are employed (ED = éclat débordant; EDlb = éclat débordant with limited back; p_Lp = pseudo-Levallois point)
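A normalized matrix of this kind can be derived from predictions on held-out cases. The sketch below is a simplification assuming a hypothetical test_data split; the study instead derived its matrices from the repeated resampling cycles.

```r
# Column-normalized (percentage) confusion matrix sketch.
pred <- predict(fit_rf, newdata = test_data)
cm   <- confusionMatrix(pred, test_data$class)

round(prop.table(cm$table, margin = 2) * 100, 2)   # % within each true class
```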

Core–edge flakes and core–edge flakes with a limited back show slightly higher frequencies of misidentification, although both maintain high sensitivity and specificity values for both types of data. For the random forest model based on the 2D data, it is more common to mislabel a core–edge flake with a limited back as a core–edge flake (28.18) than the reverse (17.81).

In the case of the random forest based on 3D data, the confusion between the two categories diminishes slightly (20.78 and 13.3). In general, the model based on 3D data identifies core–edge flakes and core–edge flakes with a limited back better than the model based on 2D data, a gain that results from a lower frequency of incorrect identification of core–edge flakes with a limited back as core–edge flakes. The incorrect identification of backed flakes as pseudo-Levallois points is slightly more common when 2D data are employed and minimal in the case of the 3D data, although it has a higher frequency among core–edge flakes with a limited back (4.82) than among core–edge flakes (1.59) (Figs. 11, 12, and 13).

Fig. 11 Biplots of PC2, PC1, PC3, and PC5 according to flake category on the 2D data. Ellipses indicate the confidence level at 80%

Fig. 12 Biplots of PC7, PC1, PC3, and PC4 according to flake category on the 3D data. Ellipses indicate the confidence level at 80%

Fig. 13 Biplots of group PCA using the most important features for classification

The above interpretation of the PCs, the biplot visualizations (Figs. 8 and 9), and the descriptive statistics of PC values (Fig. 14) allow us to evaluate the morphological features captured by the 2D and 3D geometric morphometrics and the characterization of each type of technological backed flake. In general, biplots from the 2D data show much more overlap than biplots from the 3D data (Figs. 11 and 12). The higher overlap observed in the PC biplots from the 2D data is also observed when a group PCA is performed using the most important variables for classification (Fig. 13). Visual analysis of the group PCA biplot from the 3D data shows much less overlap between the three categories: while core–edge flakes and pseudo-Levallois points show little overlap with each other, core–edge flakes with a limited back are situated as an intermediate product, pointing to their wide morphological variability.

Fig. 14 Box and violin plots of PC values according to each backed flake category

For the 2D data, pseudo-Levallois points are characterized by positive values of PC1 (mean = 0.53; SD = 0.116) and PC2 (mean = 0.168; SD = 0.128). These generally positive values are indicative of the low elongation of pseudo-Levallois points. The triangular off-axis morphology characteristic of pseudo-Levallois points is weakly expressed in the 2D data, being only slightly captured by PC5 (mean = 0.002; SD = 0.028). In contrast, 2D geometric morphometrics is useful for detecting features of core–edge flakes, although their overlap with core–edge flakes with a limited back is very high. In the 2D data, core–edge flakes are characterized by low values of PC1 (mean = −0.052; SD = 0.111) and PC2 (mean = −0.043; SD = 0.092), indicative of elongated products. Core–edge flakes (mean = −0.014, SD = 0.055) and pseudo-Levallois points (mean = −0.014, SD = 0.079) tend to have negative values of PC3, indicating that, in general, neither product has a distal portion narrower than the proximal one. Core–edge flakes with a limited back presented slightly positive values of PC1 (mean = 0.008; SD = 0.131) and PC2 (mean = 0.011; SD = 0.111), indicating that their elongation and asymmetry fall between those of core–edge flakes and pseudo-Levallois points. Additionally, core–edge flakes with a limited back presented slightly positive values of PC3 (mean = 0.007; SD = 0.06), indicating that it is more common for these products to present a narrower distal portion.

For the 3D data, pseudo-Levallois points are characterized by high values of PC1 (mean = 0.031; SD = 0.055), low values of PC3 (mean = 0.024; SD = 0.09), and negative values of PC5 (mean = −0.046; SD = 0.059). They also exhibit intermediate positive values of PC6 (mean = 0.018; SD = 0.05). The combination of these PCs and their values indicates that geometric morphometrics is capturing the low elongation (PC1), the asymmetry (PC3), and, to some extent, the triangular morphology resulting from the convergence of two edges (PC5 and PC6). Core–edge flakes exhibit negative values of PC1 (mean = −0.078; SD = 0.136) and positive values of PC5 (mean = 0.024; SD = 0.048) and PC6 (mean = 0.031; SD = 0.055). These PCs capture the elongated nature of core–edge flakes in comparison with the other two categories (PC1), their lower ratio of length to thickness (PC5), and the presence of a distal transverse edge which can result in distal pointed portions (PC6). Core–edge flakes with a limited back are characterized by intermediate values of PC1 (mean = 0.012, SD = 0.142), PC3 (mean = −0.006, SD = 0.078), and PC5 (mean = 0.00; SD = 0.061), and slightly negative values of PC6 (mean = −0.013; SD = 0.058). The combination of values from these PCs reflects the wide morphological variability of core–edge flakes with a limited back, whose main features are low elongation (PC1), varying ratios of thickness and platform morphology (PC5), and strong variability in the upper view (PC6).

Discussion

Our results show that geometric morphometrics, combined with machine learning models, can easily differentiate between core–edge flakes, core–edge flakes with a limited back, and pseudo-Levallois points from discoidal and recurrent centripetal Levallois reduction sequences. The best model trained on the 2D data (random forest) obtained a good average accuracy (0.785), while the best model trained on the 3D data (also a random forest) obtained a high average accuracy (0.828). As expected, this indicates that when dealing with objects whose morphological classification requires several views, more precise results are obtained using 3D data. Despite the differences in average accuracy and in the performance measures for each category, 2D geometric morphometrics were able to capture some of the morphological features that characterize the three categories. Nevertheless, for methods intended to enable proper comparison between stone tool assemblages based on corresponding classifications, and to reduce the margin of error in analyzing similarities between samples, 2D geometric morphometrics are not suitable for generating lithic taxonomies, at least for the three technological categories analyzed in this study. Core–edge flakes, core–edge flakes with a limited back, and pseudo-Levallois points are relevant categories when looking at patterns of core reduction and variability within centripetal knapping methods, whether due to raw material constraints or the search for specific tool types.

Several studies have employed machine learning models based on attribute analysis (Bustos-Pérez et al., 2023; González-Molina et al., 2020; Presnyakova et al., 2015) or geometric morphometrics (Archer et al., 2021; Bustos-Pérez et al., 2022). SVM with polynomial kernel and random forest stand out as the most common or accurate models for classificatory tasks among lithic artifacts. Previous studies have outlined the advantages of SVMs when dealing with the classification of lithic products (Bustos-Pérez et al., 2022, 2023): because of their features (hyperplane fitting, use of margins to find the best separation, and use of a cost value for each misclassification), SVMs seem well suited to the classification of lithic artifacts. In the present study, the SVM with polynomial kernel outperformed the rest of the machine learning models on the 3D data. However, when the number of variables employed as predictors was restricted, its general accuracy fell to 0.78, making the random forest the model with the best average accuracy. Random forest was also the best option for the 2D data, independent of the number of variables employed. Random forests are common in lithic analysis for classification tasks (Archer et al., 2021; González-Molina et al., 2020). The results from model performance on the 2D and 3D data indicate that random forests, along with SVMs with polynomial kernels, are good options for classification when dealing with the variability observed in lithic artifacts, although the number of variables introduced as predictors should be taken into account.

When considering each technological product individually, pseudo-Levallois points stood out as the most clearly differentiable of the three categories, with performance metrics above 0.9 for PCs derived from both the 2D and 3D data. Core–edge flakes were the next most clearly identifiable products, with notable sensitivity values that were higher for the 3D data (0.849) than for the 2D data (0.76). The directionality of the confusion matrices shows that the main factor reducing the identification of core–edge flakes is their classification as core–edge flakes with a limited back in both types of data. Two underlying causes of confusion between core–edge flakes and core–edge flakes with a limited back can be considered: an increased deviation between the technological and morphological axes, and an increased angle between the platform and the backed edge, which changes the morphological axis. These two factors can occur together or individually, blurring the division between products in cases of similarity. This overlap is inherent in the morphological variability and defining features of both technological products. Nevertheless, despite these overlapping features, the machine learning models achieve a high degree of separation between the two products.

This difference in sensitivity values depending on the type of data is also observed for core–edge flakes with a limited back, where the sensitivity value for the 3D data (0.662) was higher than that obtained from the 2D data (0.607). In the present study, core–edge flakes with a limited back were the category of interest and the one to which down-sampling was applied, thus preventing overfitting. This indicates that geometric morphometrics captures the technological features defining each category and that these are discrete categories with little overlap. Their use (and, more specifically, the use of core–edge flakes with a limited back as a category) in lists of technological products is therefore justifiable.

The present study included all backed flakes from a series of experimental recurrent centripetal cores. As a result, backed artifacts falling within the definition of “core–edge flakes with a limited back” were the overwhelming majority (n = 93, 66.91%), and it was necessary to use up- and down-sampling techniques to obtain balanced datasets (Ganganwar, 2012; Kumar & Sheshadri, 2012). The most up-sampled product (pseudo-Levallois points) is also the one with the highest identification metrics irrespective of the type of data (2D or 3D), probably because of overfitting (the model is classifying repeated samples from the training set). This overfitting is noteworthy in the case of the 2D data, where the high sensitivity contrasts with the high overlap visible in the biplots obtained from the PCA (Fig. 11) and the group PCA (Fig. 13). Unbalanced data are common in archaeological analysis, and sampling approaches to overcome this drawback can result in overfitting when machine learning models are applied (Calder et al., 2022; Domínguez-Rodrigo & Baquedano, 2018; McPherron et al., 2022). Visual evaluation through biplots can help determine the extent of overfitting and the reliability of performance metrics from machine learning models. Further research applying sampling techniques to archaeological problems might consider alternative solutions to unbalanced datasets. One alternative is down-sampling combined with leave-one-out cross-validation (where each case of the down-sampled dataset serves as a single test set), although this approach comes with the drawback of losing information, which can result in machine learning models not capturing the features that characterize each class.
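Such a down-sampling plus leave-one-out scheme is straightforward to set up in caret; a minimal sketch, again assuming the hypothetical flakes data frame with the class factor in its first column:

```r
# Down-sampling + leave-one-out cross-validation sketch.
library(caret)

down     <- downSample(x = flakes[, -1], y = flakes$class, yname = "class")
ctrl_loo <- trainControl(method = "LOOCV")   # each case is its own test set

fit_loo <- train(class ~ ., data = down, method = "rf", trControl = ctrl_loo)
fit_loo$results
```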

Given the strict definitions adopted to classify a backed artifact as either a core–edge flake or a pseudo-Levallois point, their morphological variability is limited and the likelihood of their overlapping is small. This is logical given their definitions and can be observed in the biplots from the 3D data of the most important PCs for the classification of backed products (Figs. 12 and 13): in the two biplots and the group PCA biplot, there is little overlap between the confidence ellipses of core–edge flakes and pseudo-Levallois points. Moreover, the confidence ellipse of core–edge flakes with a limited back does appear intermediate between the other two categories. Thus, although up-sampling imposes some limitations, it does not seem to affect the overall results of backed flake classification. Core–edge flakes with a limited back were not up-sampled, avoiding the risk of overfitting their classification (having an observation from the training set repeated in the test set). The results show very limited misidentification of core–edge flakes with a limited back as pseudo-Levallois points and only moderate confusion with core–edge flakes. This indicates that core–edge flakes with a limited back are being correctly identified despite the probable overfitting in the identification of core–edge flakes and pseudo-Levallois points.

The present study has compared the use of 2D and 3D geometric morphometrics. 2D geometric morphometrics has the advantages of being less time-consuming, requiring less equipment, and being little affected by camera positioning (Cardini & Chiapelli, 2020; Macdonald et al., 2020). Our research showed, however, that the 2D dataset was not sufficiently informative to discriminate between the backed lithic categories, at least for the three technological categories analyzed. Some authors have considered 2D geometric morphometrics a useful tool for lithic taxonomic or comparative studies. Previous studies using 2D geometric morphometrics have analyzed formal retouched tools (large tanged points, and Clovis, Folsom, and Plainview projectile points; Buchanan & Collard, 2010; Serwatka & Riede, 2016), tested the relationship between shape and function in informal flakes (Borel et al., 2017), and compared the similarity of bone and stone Acheulean bifaces (Costa, 2010). These 2D studies did not compare 2D and 3D datasets derived from the same sample, whereas the present study aimed to test methodological aspects in order to improve lithic taxonomic analysis and refine the technological description and interpretation of stone tools. Furthermore, these studies mainly analyzed morphological aspects quantitatively, and it seems reasonable that 2D geometric morphometrics should be valuable when dealing with essentially bidimensional surfaces. The present research addressed a volumetric problem; thus, the better performance of 3D geometric morphometrics falls within expectations. Along the same lines, previous works have shown that, when working with volumetric structures, the use of 2D geometric morphometrics can lower the resolution of the analysis (Buser et al., 2018; Cardini & Chiapelli, 2020). The present results indicate that 3D techniques are preferable for the identification of technological backed categories related to core reduction strategies, and suggest that more studies are needed to establish the best approach for different research questions and technological categories, including cores, retouched tool types, unretouched items, and shaped elements.

Sullivan and Rozen (1985) previously called attention to the use of technological categories of flakes. Their critique focused on the lack of consistency in defining and using such categories, along with inconsistency in the attributes employed to define them. Although their critique concerns flakes from bifacial knapping, it remains an important word of caution. The present study has shown that the three analyzed categories are well defined in terms of morphological and technological features (see also Faivre et al., 2017). These features can be captured by geometric morphometrics, especially 3D techniques, and employed for accurate classification. Thus, little ambiguity exists in the categories employed to classify backed flakes, which are connected to two of the main Middle Paleolithic flaking strategies (discoidal and recurrent centripetal Levallois).

Meignen (1993, 1996) first proposed the category “core–edge flake with a limited back” because it better individualized and characterized the predominant centripetal debitage at the site of Les Canalettes (Aveyron, France). The term has seen limited use since, mainly when characterizing lithic assemblages produced via recurrent centripetal methods (Bernard-Guelle, 2004; Bourguignon & Meignen, 2010; Duran, 2005; Duran & Abelanet, 2004; Duran & Soler, 2006). The category is usually overlooked (or merged into the core–edge flake category) when lists of technological products are employed for the analysis of Middle Paleolithic lithic assemblages (Debénath & Dibble, 1994; Geneste, 1988; Shea, 2013b). Although merging it into the core–edge flake category is valid for analyzing lithic assemblages, differentiating between classic core–edge flakes and core–edge flakes with a limited back in lists of technological products can increase the resolution and enrich the analysis of lithic assemblages. This results in a better understanding of the technical choices and constraints faced by past groups, of the relationships between technology and raw material, and of volumetric core management, and it can improve comparisons between assemblages by avoiding overly generic technological definitions and the merging of clearly different reduction concepts. Such generic definitions are often used to discuss human behaviors, cultural traditions, and even human migrations (Blinkhorn et al., 2021), but they remain difficult to interpret (Romagnoli et al., 2022) because they do not always reflect the diversity and specificity of the technological choices and constraints past human groups faced. A more in-depth analysis of variability in the techno-morphological characteristics of stone artifacts, in association with other approaches such as refitting and taphonomic analysis (see bibliography in Romagnoli & Vaquero, 2019; Romagnoli et al., 2018), could also allow archaeologists to better understand the processes of assemblage formation and to interpret technological change throughout human evolution.

Previous researchers have pointed out the morphological and technological differences between classic core–edge flakes and core–edge flakes with a limited back. Beyries and Boëda (1983) originally defined classic core–edge flakes as having very similar morphological and percussion axes. Meignen (1993, 1996) used the “core–edge flake with a limited back” category to classify core–edge flakes in which the morphological and percussion axes are not aligned; additionally, most of the examples presented (Meignen, 1996) had the flake back offset from the percussion axis. Slimak (2003) also points out that, as a morphological feature, the back of core–edge flakes with a limited back will be offset with respect to the axis of percussion. These features of alignment between the flake back and the percussion axis are captured by PC1 and PC3, with pseudo-Levallois points and core–edge flakes with a limited back having similar values, indicative of an offset backed edge and lower elongation.

Slimak (2003) also indicates that classic core–edge flakes will be elongated because percussion runs parallel to the core edge, whereas core–edge flakes with a limited back will have length/width ratios close to 1 as a result of percussion encountering ridges perpendicular to its direction. Although PC1 is only the second most important variable for discrimination between backed products in the 3D data, it clearly captures the elongation that differentiates classic core–edge flakes. Classic core–edge flakes have, on average, negative PC1 values (in both the 2D and 3D data), indicative of elongation. On the other hand, core–edge flakes with a limited back and pseudo-Levallois points are characterized by, on average, positive PC1 values, indicative of widths similar to, or greater than, their lengths.

Ambiguity and overlap between some technological categories of flakes are common. Combining quantitative methods and techniques has proven a useful approach to distinguishing between backed products detached during discoidal and recurrent centripetal Levallois knapping strategies. The present research shows that geometric morphometrics combined with machine learning models is also an effective way to test the discreteness of categories, to assess the possible directionality of confusion between them, and to quantify the features that best characterize and define each category.

Conclusions

The present work aimed to evaluate whether “core–edge flakes with a limited back” are a discrete category that can be separated from classic core–edge flakes and pseudo-Levallois points. These products are defined by a series of morphological and technological features (overall shape, morphological symmetry, axis of percussion, and position and angle of the backed edge). These features and their variability can be quantitatively captured by geometric morphometrics and employed in machine learning models to test the discreteness of the categories. The results indicate that, while some overlap exists between classic core–edge flakes and core–edge flakes with a limited back (pseudo-Levallois points are clearly differentiated), in general they are easily distinguishable. Additionally, geometric morphometrics and machine learning succeed in capturing the PCs directly associated with the morphological and technological features employed to define each technological category. However, the precision of machine learning models can be affected by the number of PCs used as predictors; for the present sample and study, tree-based methods seemed little affected by the reduction of variables. As expected, 3D geometric morphometrics better capture the features that characterize volumetric implements and classify them better. Core–edge flakes with a limited back are therefore clearly distinguishable from classic core–edge flakes, and geometric morphometrics can be employed to test the validity of defined technological categories, the directionality of confusion between them, and the features characterizing them.